Re: [HACKERS] Crash on promotion when recovery.conf is renamed

Поиск
Список
Период
Сортировка
От Magnus Hagander
Тема Re: [HACKERS] Crash on promotion when recovery.conf is renamed
Дата
Msg-id CABUevEzEoXQAc0ifyVF7XYpPrXOCt3P2_B465bLO4W_uFQirbw@mail.gmail.com
обсуждение исходный текст
Ответ на [HACKERS] Crash on promotion when recovery.conf is renamed  (Magnus Hagander <magnus@hagander.net>)
Ответы Re: [HACKERS] Crash on promotion when recovery.conf is renamed  (Michael Paquier <michael.paquier@gmail.com>)
Список pgsql-hackers


On Fri, Dec 16, 2016 at 7:08 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Dec 15, 2016 at 11:25:10AM +0200, Heikki Linnakangas wrote:
> On 12/15/2016 10:44 AM, Magnus Hagander wrote:
> > I wonder if there might be more corner cases like this, but in this
> > particular one it seems easy enough to just say that failing to rename
> > recovery.conf because it didn't exist is safe.
>
> Yeah. It's unexpected though, so I think erroring out is quite reasonable.
> If the recovery.conf file went missing, who knows what else is wrong.

i'd rather not mess up with the exiting behavior and just complain to the
pilot to not do that. This enters in the category of not removing the
backup_label file after taking a backup...

I'm happy to say that people shouldn't do that. However, we have a system that is unnecessarily fragile, by making something like that so easy to break. This could equally happen in other ways. For example, someone could accidentally provision a recovery.conf with the wrong permissions.

Reducing the fragility of such an important part of the system is a big improvement, IMO. Of course, we have to be extra careful when touching those parts.

Bottom line, I'm perfectly fine with failing in such a scenario. I'm not fine with leaving the user with a corrupt cluster if we can avoid it.


As for your comparison, we fixed the backup_label part by adding support for taking backups without using the fragile system. So it's not like we didn't recognize the problem.

 

> > But in the case of failing to rename recovery.conf for example because of
> > permissions errors, we don't want to ignore it. But we also really don't
> > want to end up with this kind of inconsistent data directory IMO. I don't
> > know that code well enough to suggest how to fix it though -- hoping for
> > input for someone who knows it closer?
>
> Hmm. Perhaps we should write the timeline history file only after renaming
> recovery.conf. In general, we should keep the window between writing the
> timeline history file and writing the end-of-recovery record as small as
> possible. I don't think we can eliminate it completely, but it makes sense
> to minimize it. Something like the attached (completely untested).

I haven't looked at the code either, but the reason is definitely solid - let's keep the time as short as possible. In particular, keeping it recoverable for things that are caused by humans, because they will keep making mistakes. And editing recovery.conf is a normal thing for a DBA to do (sure, not removing it -- but it's one of the few files in the data directory that the DBA is actually supposed to handle manually, so that increases the risk).

And also by doing things in an order that will at least make it less likely to end up corrupt if people manage to do it anyway.  


I did some tests. And after a lookup it looks like a good thing to book
the timeline number at an earlier step and let other nodes know about
it. So +1 on it.
 

Looking at PrescanPreparedTransactions(), I am thinking as well that it would
be better to get a hard failure when bumping on a corrupted 2PC file. Future
files are one thing, but corrupted files should be treated more carefully.

Again without looking at it, I agree (so much easier that way :P). Ignoring corruption is generally a bad idea. Failing hard makes the user notice the error, and makes it possible to initiate recovery from a standby or from backups or something, or to *intentionally* remove/clear/ignore it. 

--

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Ashutosh Sharma
Дата:
Сообщение: Re: [HACKERS] Hang in pldebugger after git commit : 98a64d0
Следующее
От: Erik Rijkers
Дата:
Сообщение: Re: [HACKERS] Logical Replication WIP