Обсуждение: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

Поиск
Список
Период
Сортировка

SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
Tom Lane
Дата:
I have just finished trudging through a bunch of code and trying to make
it secure against being interrupted by die() at arbitrary instants.
However, I am under no illusion that I have succeeded in making the
world safe for SIGTERM, and you shouldn't be either.  There is just way
too much code that is potentially invoked during proc_exit; even if we
fixed every line of our code, there's C library code that's not under
our control.  For example, malloc/free are not interrupt-safe on many
platforms, last I heard.  Do you want to put START/END_CRIT_SECTION
around every memory allocation operation?  I don't.

I think we'd be lots better off to abandon the notion that we can exit
directly from the SIGTERM interrupt handler, and instead treat SIGTERM
the same way we treat QueryCancel: set a flag that is inspected at
specific places where we know we are in a good state.

Comments?
        regards, tom lane


RE: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
"Hiroshi Inoue"
Дата:
> -----Original Message-----
> From: Tom Lane
> 
> I have just finished trudging through a bunch of code and trying to make
> it secure against being interrupted by die() at arbitrary instants.

It seems that START/END_CRIT_SECTION() is called in both
existent and newly added places.
Isn't it appropriate to call a diffrent macro using a separate 
CriticalSectionCount variable in newly added places ?

Regards.
Hiroshi Inoue


Re: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
Tom Lane
Дата:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> Isn't it appropriate to call a diffrent macro using a separate 
> CriticalSectionCount variable in newly added places ?

Why?  What difference do you see in the nature of the critical sections?
They all look the same to me: hold off cancel/die response.
        regards, tom lane


RE: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
"Hiroshi Inoue"
Дата:
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > Isn't it appropriate to call a diffrent macro using a separate 
> > CriticalSectionCount variable in newly added places ?
> 
> Why?  What difference do you see in the nature of the critical sections?
> They all look the same to me: hold off cancel/die response.
>

I've thought that the main purpose of CRIT_SECTION is to
force redo recovery for any errors during the CRIT_SECTION
to complete the critical operation e.g. bt_split(). Note that
elog(ERROR/FATAL) is changed to elog(STOP) if Critical
SectionCount > 0.  Postgres 7.1 stll lacks an undo functionality
and AbortTransaction() does little about rolling back the
transaction. PostgreSQL seems to have to retry the critical
operation by running a redo recovery after killing all backends.

Regards.
Hiroshi Inoue


Re: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
Tom Lane
Дата:
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
>> Why?  What difference do you see in the nature of the critical sections?
>> They all look the same to me: hold off cancel/die response.

> I've thought that the main purpose of CRIT_SECTION is to
> force redo recovery for any errors during the CRIT_SECTION
> to complete the critical operation e.g. bt_split().

How could it force redo?  Rollback, maybe, but that should happen
anyway.

> Note that elog(ERROR/FATAL) is changed to elog(STOP) if Critical
> SectionCount > 0.

Not in current sources ;-).

Perhaps Vadim will say that I broke his error scheme, but if so it's
his own fault for not documenting such delicate code at all.  I believe
he's out of town this weekend, so let's wait till he gets back and then
discuss it some more.  Perhaps there is a need to distinguish xlog-
related critical sections from other ones, or perhaps not.
        regards, tom lane


Re: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
Hiroshi Inoue
Дата:
Hmm, I've seen neither my posting nor your reply
on hackers ML.

Tom Lane wrote:
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> >> Why?  What difference do you see in the nature of the critical sections?
> >> They all look the same to me: hold off cancel/die response.
> 
> > I've thought that the main purpose of CRIT_SECTION is to
> > force redo recovery for any errors during the CRIT_SECTION
> > to complete the critical operation e.g. bt_split().
> 
> How could it force redo?

Doesn't proc_exit(non-zero) force shuttdown recovery ?
AFAIK, Postgres doesn't have a rollback recovery 
functionality yet.

> Rollback, maybe, but that should happen
> anyway.
> 
> > Note that elog(ERROR/FATAL) is changed to elog(STOP) if Critical
> > SectionCount > 0.
> 
> Not in current sources ;-).
>

Oh you removed the code 20 hours ago. AFAIK, the (equivalent)
code has lived there from the first appearance of CRIT_SECTION.
Is there any reason to remove the code ?
Regards.
Hiroshi Inoue


Re: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
Tom Lane
Дата:
Hiroshi Inoue <Inoue@tpf.co.jp> writes:
>>>> I've thought that the main purpose of CRIT_SECTION is to
>>>> force redo recovery for any errors during the CRIT_SECTION
>>>> to complete the critical operation e.g. bt_split().
>> 
>> How could it force redo?

> Doesn't proc_exit(non-zero) force shuttdown recovery ?

It forces a shutdown and restart, but that does not do anything good
that I can see.  The WAL log entry hasn't been made, typically, so there
is nothing to redo.  If there *were* a log entry, and the redo failed
again (pretty likely), then we'd have an infinite crash/try to
restart/crash cycle, which is just about the worst possible behavior.
So I'm not seeing what the point is.

> Oh you removed the code 20 hours ago. AFAIK, the (equivalent)
> code has lived there from the first appearance of CRIT_SECTION.
> Is there any reason to remove the code ?

Because I think turning an elog(ERROR) into a system-wide crash is
not a good idea ;-).  If you are correct that this behavior is necessary
for WAL-related critical sections, then indeed we need two kinds of
critical sections, one that just holds off cancel/die response and one
that turns elog(ERROR) into a dangerous weapon.  I'm going to wait and
see Vadim's response before I do anything ...
        regards, tom lane


Re: SIGTERM -> elog(FATAL) -> proc_exit() is probably a bad idea

От
Hiroshi Inoue
Дата:
Tom Lane wrote:
> 
> Hiroshi Inoue <Inoue@tpf.co.jp> writes:
> >>>> I've thought that the main purpose of CRIT_SECTION is to
> >>>> force redo recovery for any errors during the CRIT_SECTION
> >>>> to complete the critical operation e.g. bt_split().
> >>
> >> How could it force redo?
> 
> > Doesn't proc_exit(non-zero) force shuttdown recovery ?
> 
> It forces a shutdown and restart, but that does not do anything good
> that I can see.  The WAL log entry hasn't been made, typically, so there
> is nothing to redo.  If there *were* a log entry, and the redo failed
> again (pretty likely), then we'd have an infinite crash/try to
> restart/crash cycle, which is just about the worst possible behavior.
> So I'm not seeing what the point is.
> 

It seems a nature of 7.1 recovery scheme.
Once a WAL log entry is made, recovery should 
complete the log in regardless of the cause of
recovery(elog, system error like SEGV etc).

I've wondered why no one has asked how we could
recover from a recovery failure. Unfortunately,
I don't know the answer. Recovery failure seems
veeeeery serious because postmaster couldn't
start if the startup recovery fails.
In addtion I have another anxiety. I don't know
how robust WAL is against general bugs not
directly related to WAL.

Regards.
Hiroshi Inoue