Обсуждение: RFC: pg_stat_logmsg
Greetings, Attached please find a tarball (rather than a patch) for a proposed new contrib extension, pg_stat_logmsg. The basic idea is to mirror how pg_stat_statements works, except the logged messages keyed by filename, lineno, and elevel are saved with a aggregate count. The format string is displayed (similar to a query jumble) for context, along with function name and sqlerrcode. I threw this together rather quickly over the past couple of days between meetings, so not claiming that it is committable (and lacks documentation and regression tests as well), but I would love to get feedback on: 1/ the general concept 2/ the pg_stat_statement-like implementation 3/ contrib vs core vs external project Some samples and data: `make installcheck` with the extension loaded: 8<------------------ # All 215 tests passed. real 2m24.854s user 0m0.086s sys 0m0.283s 8<------------------ `make installcheck` without the extension loaded: 8<------------------ # All 215 tests passed. real 2m26.765s user 0m0.076s sys 0m0.293s 8<------------------ Sample output after running make installcheck a couple times (plus a few manually generated ERRORs): 8<------------------ test=# select sum(count) from pg_stat_logmsg where elevel > 20; sum ------- 10554 (1 row) test=# \x Expanded display is on. test=# select * from pg_stat_logmsg where elevel > 20 order by count desc; -[ RECORD 1 ]------------------------------- filename | aclchk.c lineno | 2811 elevel | 21 funcname | aclcheck_error sqlerrcode | 42501 message | permission denied for schema %s count | 578 -[ RECORD 2 ]------------------------------- filename | scan.l lineno | 1241 elevel | 21 funcname | scanner_yyerror sqlerrcode | 42601 message | %s at or near "%s" count | 265 ... test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode = 'XX000'; -[ RECORD 1 ]--------------------------------------- filename | tid.c lineno | 352 elevel | 21 funcname | currtid_for_view sqlerrcode | XX000 message | ctid isn't of type TID count | 2 -[ RECORD 2 ]--------------------------------------- filename | pg_locale.c lineno | 2493 elevel | 21 funcname | pg_ucol_open sqlerrcode | XX000 message | could not open collator for locale "%s": %s count | 2 ... 8<------------------ Part of the thinking is that people with fleets of postgres instances can use this to scan for various errors that they care about. Additionally it would be useful to look for "should not happen" errors. I will register this in the July CF and will appreciate feedback. Thanks! -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi
so 1. 7. 2023 v 1:57 odesílatel Joe Conway <mail@joeconway.com> napsal:
Greetings,
Attached please find a tarball (rather than a patch) for a proposed new
contrib extension, pg_stat_logmsg.
The basic idea is to mirror how pg_stat_statements works, except the
logged messages keyed by filename, lineno, and elevel are saved with a
aggregate count. The format string is displayed (similar to a query
jumble) for context, along with function name and sqlerrcode.
I threw this together rather quickly over the past couple of days
between meetings, so not claiming that it is committable (and lacks
documentation and regression tests as well), but I would love to get
feedback on:
1/ the general concept
2/ the pg_stat_statement-like implementation
3/ contrib vs core vs external project
Some samples and data:
`make installcheck` with the extension loaded:
8<------------------
# All 215 tests passed.
real 2m24.854s
user 0m0.086s
sys 0m0.283s
8<------------------
`make installcheck` without the extension loaded:
8<------------------
# All 215 tests passed.
real 2m26.765s
user 0m0.076s
sys 0m0.293s
8<------------------
Sample output after running make installcheck a couple times (plus a few
manually generated ERRORs):
8<------------------
test=# select sum(count) from pg_stat_logmsg where elevel > 20;
sum
-------
10554
(1 row)
test=# \x
Expanded display is on.
test=# select * from pg_stat_logmsg where elevel > 20 order by count desc;
-[ RECORD 1 ]-------------------------------
filename | aclchk.c
lineno | 2811
elevel | 21
funcname | aclcheck_error
sqlerrcode | 42501
message | permission denied for schema %s
count | 578
-[ RECORD 2 ]-------------------------------
filename | scan.l
lineno | 1241
elevel | 21
funcname | scanner_yyerror
sqlerrcode | 42601
message | %s at or near "%s"
count | 265
...
test=# select * from pg_stat_logmsg where elevel > 20 and sqlerrcode =
'XX000';
-[ RECORD 1 ]---------------------------------------
filename | tid.c
lineno | 352
elevel | 21
funcname | currtid_for_view
sqlerrcode | XX000
message | ctid isn't of type TID
count | 2
-[ RECORD 2 ]---------------------------------------
filename | pg_locale.c
lineno | 2493
elevel | 21
funcname | pg_ucol_open
sqlerrcode | XX000
message | could not open collator for locale "%s": %s
count | 2
...
8<------------------
Part of the thinking is that people with fleets of postgres instances
can use this to scan for various errors that they care about.
Additionally it would be useful to look for "should not happen" errors.
I will register this in the July CF and will appreciate feedback.
This can be a very interesting feature. I like it.
Regards
Pavel
Thanks!
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On 6/30/23 23:20, Pavel Stehule wrote: > so 1. 7. 2023 v 1:57 odesílatel Joe Conway <mail@joeconway.com > <mailto:mail@joeconway.com>> napsal: > Part of the thinking is that people with fleets of postgres instances > can use this to scan for various errors that they care about. > Additionally it would be useful to look for "should not happen" errors. > > I will register this in the July CF and will appreciate feedback. > > This can be a very interesting feature. I like it. Thanks! FWIW, I just modified it to provide the localized text of the elevel rather than the internal number. I also localized the message format string: 8<------------------------------ psql (16beta2) Type "help" for help. test=# \x Expanded display is on. test=# select * from pg_stat_logmsg where elevel = 'ERROR' and sqlerrcode = 'XX000' and count > 1; -[ RECORD 1 ]--------------------------------------------- filename | tablecmds.c lineno | 10908 elevel | ERROR funcname | ATExecAlterConstraint sqlerrcode | XX000 message | cannot alter constraint "%s" on relation "%s" count | 2 -[ RECORD 2 ]--------------------------------------------- filename | user.c lineno | 2130 elevel | ERROR funcname | check_role_membership_authorization sqlerrcode | XX000 message | role "%s" cannot have explicit members count | 2 test=# set lc_messages ='sv_SE.UTF8'; SET test=# select * from pg_stat_logmsg where elevel = 'FEL' and sqlerrcode = 'XX000' and count > 1; -[ RECORD 1 ]--------------------------------------------- filename | tablecmds.c lineno | 10908 elevel | FEL funcname | ATExecAlterConstraint sqlerrcode | XX000 message | kan inte ändra villkoret "%s" i relation "%s" count | 2 -[ RECORD 2 ]--------------------------------------------- filename | user.c lineno | 2130 elevel | FEL funcname | check_role_membership_authorization sqlerrcode | XX000 message | rollen "%s" kan inte ha explicita medlemmar count | 2 8<------------------------------ New tarball attached. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Вложения
Hi, On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote: > > Greetings, > > Attached please find a tarball (rather than a patch) for a proposed new > contrib extension, pg_stat_logmsg. > > The basic idea is to mirror how pg_stat_statements works, except the > logged messages keyed by filename, lineno, and elevel are saved with a > aggregate count. The format string is displayed (similar to a query > jumble) for context, along with function name and sqlerrcode. > > > Part of the thinking is that people with fleets of postgres instances > can use this to scan for various errors that they care about. > Additionally it would be useful to look for "should not happen" errors. Interesting idea and use cases. I'm concerned that displaying the format string could not be helpful in some cases. For example, when raising an ERROR while reading WAL records, we typically write the error message stored in record->errormsg_buf: in ReadRecord(): if (errormsg) ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), (errmsg_internal("%s", errormsg) /* already translated */ )); In this case, the error message stored in pg_stat_logmsg is just '%s'. The filename and line number columns might help identify the actual error but it requires users to read the source code and cannot know the actual error message. A similar case is where we construct the error message on the fly. For example, in LogRecoveryConflict() the string of the recovery conflict description comes from get_recovery_conflict_desc(): in LogRecoveryConflict(): ereport(LOG, errmsg("recovery still waiting after %ld.%03d ms: %s", msecs, usecs, get_recovery_conflict_desc(reason)), nprocs > 0 ? errdetail_log_plural("Conflicting process: %s.", "Conflicting processes: %s.", nprocs, buf.data) : 0); The user might want to search the error message by the actual conflict reason, but cannot. In this case, I'd like to see the actual error message (I'd like to normalize the number part, though). That being said, using the format string for the error messages like "ERROR: relation "nonexist_table" does not exist" makes sense to me since we can avoid having too many similar entries. So I feel that we might need to figure out what part of the log message should be normalized like pg_stat_statement does for query strings. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Jul 6, 2023 at 12:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote: > > > > The basic idea is to mirror how pg_stat_statements works, except the > > logged messages keyed by filename, lineno, and elevel are saved with a > > aggregate count. The format string is displayed (similar to a query > > jumble) for context, along with function name and sqlerrcode. > > > > > > Part of the thinking is that people with fleets of postgres instances > > can use this to scan for various errors that they care about. > > Additionally it would be useful to look for "should not happen" errors. > I'm concerned that displaying the format string could not be helpful > in some cases. For example, when raising an ERROR while reading WAL > records, we typically write the error message stored in > record->errormsg_buf: > > in ReadRecord(): > if (errormsg) > ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), > (errmsg_internal("%s", errormsg) /* already > translated */ )); > > In this case, the error message stored in pg_stat_logmsg is just '%s'. > The filename and line number columns might help identify the actual > error but it requires users to read the source code and cannot know > the actual error message. I believe that the number of such error messages, the ones with very little descriptive content, is very low in Postgres code. These kinds of messages are not prevalent, and must be used when code hits obscure conditions, like seen in your example above. These are the kinds of errors that Joe is referring to as "should not happen". In these cases, even if the error message was descriptive, the user will very likely have to dive deep into code to find out the real cause. I feel that the unique combination of file name and line number is useful information, even in cases where the format string not very descriptive. So I believe the extension's behaviour in this regard is acceptable. In cases where the extension's output is not descriptive enough, the user can use the filename:lineno pair to look for fully formed error messages in the actual log files; they may have to make appropriate changes to log_* parameters, though. If we still feel strongly about getting the actual message for these cases, perhaps we can develop a heuristic to identify such messages, and use either full or a prefix of the 'message' field, instead of 'message_id' field. The heuristic could be: strlen(message_id) is too short, or that message_id is all/overwhelmingly format specifiers (e.g. '%s: %d'). The core benefit of this extension is to make it easy for the user to discover error messages affecting their workload. The user may be interested in knowing the most common log messages in their server logs, so that they can work on reducing those errors or warnings. Or they may be interested in knowing when their workload hits unexpected/serious error messages, even if they're infrequent. The data exposed by pg_stat_logmsg view would serve as a starting point, but I'm guessing it would not be sufficient for their investigation. On Fri, Jun 30, 2023 at 4:57 PM Joe Conway <mail@joeconway.com> wrote: > I would love to get > feedback on: > > 1/ the general concept > 2/ the pg_stat_statement-like implementation > 3/ contrib vs core vs external project +1 for the idea; a monitoring system trained at this view can bubble up problems to users that may otherwise go unnoticed. Piggybacking on, or using pg_stat_statments implementation as a model seems fine. I believe making this available as a contrib module hits the right balance; not all installations may want this, hence in-core, always collecting data, seems undesirable; if it's an external project, it won't be discoverable, I see that as a very high bar for someone to trust it and begin to use it. Best regards, Gurjeet http://Gurje.et
On 7/7/23 01:38, Gurjeet Singh wrote: > On Thu, Jul 6, 2023 at 12:37 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: >> On Sat, Jul 1, 2023 at 8:57 AM Joe Conway <mail@joeconway.com> wrote: >> > >> > The basic idea is to mirror how pg_stat_statements works, except the >> > logged messages keyed by filename, lineno, and elevel are saved with a >> > aggregate count. The format string is displayed (similar to a query >> > jumble) for context, along with function name and sqlerrcode. >> > >> > >> > Part of the thinking is that people with fleets of postgres instances >> > can use this to scan for various errors that they care about. >> > Additionally it would be useful to look for "should not happen" errors. > >> I'm concerned that displaying the format string could not be helpful >> in some cases. For example, when raising an ERROR while reading WAL >> records, we typically write the error message stored in >> record->errormsg_buf: >> >> in ReadRecord(): >> if (errormsg) >> ereport(emode_for_corrupt_record(emode, xlogreader->EndRecPtr), >> (errmsg_internal("%s", errormsg) /* already >> translated */ )); >> >> In this case, the error message stored in pg_stat_logmsg is just '%s'. >> The filename and line number columns might help identify the actual >> error but it requires users to read the source code and cannot know >> the actual error message. > > I believe that the number of such error messages, the ones with very > little descriptive content, is very low in Postgres code. These kinds > of messages are not prevalent, and must be used when code hits obscure > conditions, like seen in your example above. These are the kinds of > errors that Joe is referring to as "should not happen". In these > cases, even if the error message was descriptive, the user will very > likely have to dive deep into code to find out the real cause. Agreed. As an example, after running `make installcheck` 8<----------------- select sum(count) from pg_stat_logmsg; sum ------ 6005 (1 row) select message,sum(count) from pg_stat_logmsg where length(message) < 30 and elevel in ('ERROR','FATAL','PANIC') and message like '%\%s%' escape '\' group by message order by length(message); message | sum -------------------------------+----- %s | 107 "%s" is a view | 9 "%s" is a table | 4 %s is a procedure | 1 invalid size: "%s" | 13 %s at or near "%s" | 131 %s at end of input | 3 ... 8<----------------- I only see three message formats there that are generic enough to be of concern (the first and last two shown -- FWIW I did not see any more of them as the fmt string gets longer) So out of 6005 log events, 241 fit this concern. Perhaps given the small number of message format strings affected, it would make sense to special case those, but I am not sure it is worth the effort, at least for version 1. > I feel that the unique combination of file name and line number is > useful information, even in cases where the format string not very > descriptive. So I believe the extension's behaviour in this regard is > acceptable. > > In cases where the extension's output is not descriptive enough, the > user can use the filename:lineno pair to look for fully formed error > messages in the actual log files; they may have to make appropriate > changes to log_* parameters, though. Right > If we still feel strongly about getting the actual message for these > cases, perhaps we can develop a heuristic to identify such messages, > and use either full or a prefix of the 'message' field, instead of > 'message_id' field. The heuristic could be: strlen(message_id) is too > short, or that message_id is all/overwhelmingly format specifiers > (e.g. '%s: %d'). Based on the above analysis (though granted, not all inclusive), it seems like just special casing the specific message format strings of interest would work. > The core benefit of this extension is to make it easy for the user to > discover error messages affecting their workload. The user may be > interested in knowing the most common log messages in their server > logs, so that they can work on reducing those errors or warnings. Or > they may be interested in knowing when their workload hits > unexpected/serious error messages, even if they're infrequent. The > data exposed by pg_stat_logmsg view would serve as a starting point, > but I'm guessing it would not be sufficient for their investigation. Yes, exactly. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On 7/9/23 14:13, Joe Conway wrote: > On 7/7/23 01:38, Gurjeet Singh wrote: >>> In this case, the error message stored in pg_stat_logmsg is just '%s'. >>> The filename and line number columns might help identify the actual >>> error but it requires users to read the source code and cannot know >>> the actual error message. >> >> I believe that the number of such error messages, the ones with very >> little descriptive content, is very low in Postgres code. These kinds >> of messages are not prevalent, and must be used when code hits obscure >> conditions, like seen in your example above. These are the kinds of >> errors that Joe is referring to as "should not happen". In these >> cases, even if the error message was descriptive, the user will very >> likely have to dive deep into code to find out the real cause. > > Agreed. As an example, after running `make installcheck` > > 8<----------------- > select sum(count) from pg_stat_logmsg; > sum > ------ > 6005 > (1 row) > > select message,sum(count) > from pg_stat_logmsg > where length(message) < 30 > and elevel in ('ERROR','FATAL','PANIC') > and message like '%\%s%' escape '\' > group by message > order by length(message); > message | sum > -------------------------------+----- > %s | 107 > "%s" is a view | 9 > "%s" is a table | 4 > %s is a procedure | 1 > invalid size: "%s" | 13 > %s at or near "%s" | 131 > %s at end of input | 3 > ... > 8<----------------- > > I only see three message formats there that are generic enough to be of > concern (the first and last two shown -- FWIW I did not see any more of > them as the fmt string gets longer) > > So out of 6005 log events, 241 fit this concern. > > Perhaps given the small number of message format strings affected, it > would make sense to special case those, but I am not sure it is worth > the effort, at least for version 1. Attached is an update, this time as a patch against 17devel. It is not quite complete, but getting close IMHO. Changes: -------- 1. Now is integrated into contrib as a "Additional Supplied Extension" 2. Documentation added 3. I had a verbal conversation with Andres and he reminded me that the original idea for this was to collect data across fleets of pg hosts so we could look for how often "should never happen" errors actually happen. As such, we need to think in terms of aggregating the info from many hosts, potentially with many localized languages for the messages. So I converted the "message" column back to an untranslated string, and added a "translated_message" column which is localized. An alternative approach might be to provide a separate function that accepts the message string and returns the translation. Thoughts on that? 4. In the same vein, I added a pgversion column since the filename and line number for the same log message could vary across major or even minor releases. Not done: --------- 1. The main thing lacking at this point is a regression test. 2. No special casing for message == "%s". I am still not convinced it is worthwhile to do so. Comments gratefully welcomed. Thanks, -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com