Обсуждение: a small proposal for avoiding foot-shooting

Поиск
Список
Период
Сортировка

a small proposal for avoiding foot-shooting

От
Andrew Gierth
Дата:
Currently setting "log_error_verbosity = terse" causes all HINT output
to the log file to be suppressed (along with some other stuff).

I propose that this behaviour be changed such that 'terse' is ignored
for all log messages of FATAL or PANIC severity.

The main reason for this is messages like (from xlog.c):
           ereport(PANIC,                   (errmsg("could not locate required checkpoint record"),
errhint("Ifyou are not restoring from a backup, try removing the file \"%s/backup_label\".", DataDir)));
 

The HINT: there is at least as important as the error message, and the
lack of a clear direction to the admin in this case is a gold-plated
invitation to do something stupid that will seriously mess up the data,
such as manually copying back the xlog file that recovery seems to be
asking for at that point.

[This is all based on today's lesson in disaster recovery from the IRC
chan, so it's not theoretical.]

-- 
Andrew (irc:RhodiumToad)


Re: a small proposal for avoiding foot-shooting

От
Tom Lane
Дата:
Andrew Gierth <andrew@tao11.riddles.org.uk> writes:
> I propose that this behaviour be changed such that 'terse' is ignored
> for all log messages of FATAL or PANIC severity.
> [ on the strength of a single example ]

This seems like using a blunderbuss where a rifle is called for.

There may indeed be some places where we have HINTS that are conveying
pretty important information, but I see no argument whatsoever that
the importance of a hint is determined by the severity level of the
message it's attached to.

I could see inventing some kind of additional ereport decoration that
says "force the hint to be printed", but realize that this is only
likely to have any effect in the postmaster log --- we can't guarantee
to control what clients do with subsidiary message fields.  So the
value seems a bit limited anyway.

It seems like it might be better to rephrase error messages to ensure
that anything really critical is mentioned in the primary message.
In this case, perhaps instead oferrmsg("could not locate required checkpoint record")
we could have it printerrmsg("could not locate checkpoint record specified in file \"%s/backup_label\".", DataDir)
assuming we did actually get the location from there.

Anyway, you've omitted a lot of details that would be necessary
to judge exactly what was misleading about what the DBA saw.
        regards, tom lane


Re: a small proposal for avoiding foot-shooting

От
Alvaro Herrera
Дата:
Tom Lane wrote:

> It seems like it might be better to rephrase error messages to ensure
> that anything really critical is mentioned in the primary message.
> In this case, perhaps instead of
>     errmsg("could not locate required checkpoint record")
> we could have it print
>     errmsg("could not locate checkpoint record specified in file \"%s/backup_label\".", DataDir)
> assuming we did actually get the location from there.

Is errdetail not printed when verbosity = terse?  I wonder if
backup_label should be mentioned in DETAIL rather than HINT.  Perhaps:

errmsg("could not locate required checkpoint record"),
errdetail("\"%s/backup_label\" specifies a checkpoint record not found.", DataDir),
errhint("If you are not restoring from a backup, try removing that file.")

or something like that.

I agree that this seems like critical information that should not be
printed only in a hint message.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: a small proposal for avoiding foot-shooting

От
Andrew Gierth
Дата:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:
>> I propose that this behaviour be changed such that 'terse' is>> ignored for all log messages of FATAL or PANIC
severity.>>[ on the strength of a single example ]
 
Tom> This seems like using a blunderbuss where a rifle is called for.

Maybe so.
Tom> There may indeed be some places where we have HINTS that areTom> conveying pretty important information, but I see
noargumentTom> whatsoever that the importance of a hint is determined by theTom> severity level of the message it's
attachedto.
 

With a small number of exceptions, FATAL and PANIC messages are of
the form "the database won't start (due to X)" or "the database just
died (due to X)". A relatively small proportion of them have errhint
or errdetail records, but those that do have detail records also tend
to have extremely unhelpful errmsg text.

In fact (at least in 8.3) there is only one PANIC message with an
errhint, and only one with an errdetail, and both of those ought to be
in the "must print to avoid confusing the DBA" category. The FATAL
messages are more of a mixed bag.
Tom> I could see inventing some kind of additional ereport decorationTom> that says "force the hint to be printed", but
realizethat thisTom> is only likely to have any effect in the postmaster log --- weTom> can't guarantee to control what
clientsdo with subsidiaryTom> message fields.  So the value seems a bit limited anyway.
 

For PANIC messages especially, the postmaster log is really what
counts.
Tom> It seems like it might be better to rephrase error messages toTom> ensure that anything really critical is
mentionedin the primaryTom> message.
 
Tom> In this case, perhaps instead ofTom>     errmsg("could not locate required checkpoint record")Tom> we could have
itprintTom>     errmsg("could not locate checkpoint record specified in fileTom>     \"%s/backup_label\".",
DataDir)Tom>assuming we did actually get the location from there.
 

That's still not capturing the important part of the HINT message in
this specific case, which is "you must remove the backup_label file
now if you're not trying to restore from a backup".

(The current behaviour where recovery CANNOT succeed without manual
intervention if the database went down while pg_start_backup is in
effect is of course entirely suboptimal. Lack of clear direction in
the error message as to what to do in that circumstance is pretty
much unforgiveable.)
Tom> Anyway, you've omitted a lot of details that would be necessaryTom> to judge exactly what was misleading about
whatthe DBA saw.
 

This is exactly what the DBA saw (following a pg_ctl restart -mimmediate):

----
2008-12-20 10:26:57 EST FATAL:  the database system is starting up
2008-12-20 10:26:57 EST LOG:  database system was interrupted; last known up at 2008-12-20 10:24:00 EST
2008-12-20 10:26:57 EST FATAL:  the database system is starting up
2008-12-20 10:26:57 EST FATAL:  the database system is starting up
2008-12-20 10:26:57 EST LOG:  could not open file "pg_xlog/00000001000001E100000087" (log file 481, segment 135): No
suchfile or directory
 
2008-12-20 10:26:57 EST LOG:  invalid checkpoint record
2008-12-20 10:26:57 EST PANIC:  could not locate required checkpoint record
2008-12-20 10:26:57 EST LOG:  startup process (PID 1634) was terminated by signal 6: Aborted
2008-12-20 10:26:57 EST LOG:  aborting startup due to startup process failure
----

(Earliest xlog file actually present at that time was
00000001000001E20000004A.)

-- 
Andrew.


Re: a small proposal for avoiding foot-shooting

От
Andrew Gierth
Дата:
>>>>> "Alvaro" == Alvaro Herrera <alvherre@commandprompt.com> writes:
Alvaro> Is errdetail not printed when verbosity = terse?

No, it's not.

-- 
Andrew.


Re: a small proposal for avoiding foot-shooting

От
"Albe Laurenz"
Дата:
Tom Lane wrote:
>> I propose that this behaviour be changed such that 'terse' is ignored
>> for all log messages of FATAL or PANIC severity.
>> [ on the strength of a single example ]
>
[...]
>
> It seems like it might be better to rephrase error messages to ensure
> that anything really critical is mentioned in the primary message.
> In this case, perhaps instead of
>     errmsg("could not locate required checkpoint record")
> we could have it print
>     errmsg("could not locate checkpoint record specified in file \"%s/backup_label\".", DataDir)
> assuming we did actually get the location from there.
>
> Anyway, you've omitted a lot of details that would be necessary
> to judge exactly what was misleading about what the DBA saw.

I *guess* it is the problem addressed by

http://archives.postgresql.org/pgsql-committers/2008-04/msg00275.php
and
http://archives.postgresql.org/pgsql-committers/2008-04/msg00358.php

Yours,
Laurenz Albe


Re: a small proposal for avoiding foot-shooting

От
Andrew Gierth
Дата:
>>>>> "Albe" == Albe Laurenz <laurenz.albe@wien.gv.at> writes:
Albe> I *guess* it is the problem addressed by
Albe> http://archives.postgresql.org/pgsql-committers/2008-04/msg00275.phpAlbe> andAlbe>
http://archives.postgresql.org/pgsql-committers/2008-04/msg00358.php

No; the problem is with stop -mimmediate (not -mfast), or any other
PANIC shutdown of the postmaster during backup.

-- 
Andrew.