Обсуждение: Normalization of utility queries in pg_stat_statements
Hi all, (Adding Bertrand in CC.) $subject is a follow-up of the automation of query jumbling for utilities and DDLs, and attached is a set of patches that apply normalization to DDL queries across the board, for all utilities. This relies on tracking the location of A_Const nodes while removing from the query jumbling computation the values attached to the node, as as utility queries can show be stored as normalized in pg_stat_statements with some $N parameters. The main case behind doing that is of course monitoring, where we have seen some user instances willing to get more information but see pg_stat_statements as a bottleneck because the query ID of utility queries are based on the computation of their string, and is value-sensitive. That's the case mentioned by Bertrand Drouvot for CALL and SET where workloads full of these easily bloat pg_stat_statements, where we concluded about more automation in this area (so here it is): https://www.postgresql.org/message-id/36e5bffe-e989-194f-85c8-06e7bc88e6f7%40amazon.com For example, this makes possible the following grouping: - CALL func(1,2); CALL func(1,3); => CALL func($1,$2) - EXPLAIN SELECT 1; EXPLAIN SELECT 1; => EXPLAIN SELECT $1; - CREATE MATERIALIZED VIEW aam AS SELECT 1; becomes "CREATE MATERIALIZED VIEW aam AS SELECT $1". Query jumbling for DDLs and utilities happens now automatically, still are not represented correctly in pg_stat_statements (one bit of documentation I missed previously refers to the fact that these depend on their query strings, which is not the case yet). By the way, while looking at all that, I have really underestimated the use of Const nodes in utilities, as some queries can finish with the same query ID even if different values are stored in a query, still don't show up as normalized in pg_stat_statements, so the current state of HEAD is not good, though you would need to use the same object name to a conflict for most of them. So that's my mistake here with 3db72eb. If folks think that we'd better have a revert of this automated query jumbling facility based on this argument, that would be fine for me, as well. The main case I have noticed in this area is EXPLAIN, by the way. Note that it is actually easy to move to the ~15 approach of having a query ID depending on the Const node values for DDLs, by having a custom implementation in queryjumblefuncs.c for Const nodes, where we apply the constant value and don't store a location for normalization if a query has a utility once this information is stored in a JumbleState. This rule influences various DDLs, as well, once it gets applied across the board, and it's been some work to identify all of them, but I think that I have caught them all as the regression database offers all the possible patterns: - CREATE VIEW, CTAS, CREATE MATERIALIZED VIEW which have Const nodes depending on their attached queries, for various clauses. - ALTER TABLE/INDEX/FOREIGN with DEFAULT, SET components. - CREATE TABLE with partition bounds. - BEGIN and ABORT, with transaction commands getting grouped together. The attached patch set includes as a set of regression tests for pg_stat_statements for *all* the utility queries that have either Const or A_Const nodes, so as one can see the effect that all this stuff has. This is based on a diff of the contents of pg_stat_statements on the regression database once all these normalization rules are applied. Compilation of a Const can also be made depending on the type node. However, all that makes no sense if applying the same normalization rules to all the queries across the board, because all the queries would follow the same rules. That's the critical bit IMO. From what I get, the bloat of pg_stat_statements for all utilities is something that would be helpful for all such queries, still different things could be done on a per-node basis. Perhaps this is too aggressive as it is and people don't like it, though, so feedback is welcome. I'd like to think that maximizing grouping is nice though, because it leads to no actual loss of information on the workload pattern for the queries involved, AFAIU. This sentence may be overoptimistic. So, attached is a patch set, that does the following: - 0001 is a refactoring of the regression tests of pg_stat_statements by splitting a bit the tests. I bumped into that while getting confused at how the tests are now when it comes to the handling of utilities and track_planning, where these tests silently rely on other parts of the same file with different GUC settings. This refactoring is useful on its own, IMO, and the tests show the same output as previously. - 0002 is the addition of tests in pg_stat_statements for all the DDL and utility patterns that make use of Const and A_Const nodes. Even if query jumbling of utilities is done through their text string or their nodes, this is also useful. - 0003 is the code of the feature, that switches pg_stat_statements to properly normalize utility queries, with a modification to A_Const so as normalization can be applied to it. With the generation of the code for query jumbling being automated based on the node definitions, this is straight-forward as a code change, but the changes are basically impossible to track without all the patterns tracked by 0002. Thoughts and comments are welcome. 0001 and 0002 are useful on their own to keep track of utilities that use Const and A_Const after going through the query jumbling, even if an approach based on query string or the automated query jumbling for utilities is used (the query string approach a bit its value). I'll add that to the next commit fest. Thanks, -- Michael
Вложения
On Wed, Feb 08, 2023 at 12:05:24PM +0900, Michael Paquier wrote: > Thoughts and comments are welcome. 0001 and 0002 are useful on their > own to keep track of utilities that use Const and A_Const after going > through the query jumbling, even if an approach based on query string > or the automated query jumbling for utilities is used (the query > string approach a bit its value). I'll add that to the next commit > fest. While wondering about this stuff about the last few days and discussing with bertrand, I have changed my mind on the point that there is no need to be that aggressive yet with the normalization of the A_Const nodes, because the query string normalization of pg_stat_statements is not prepared yet to handle cases where a A_Const value uses a non-quoted value with whitespaces. The two cases where I saw an impact is on the commands that can define an isolation level: SET TRANSACTION and BEGIN. For example, applying normalization to A_Const nodes does the following as of HEAD: 1) BEGIN TRANSACTION READ ONLY, READ WRITE, DEFERRABLE, NOT DEFERRABLE; BEGIN TRANSACTION $1 ONLY, $2 WRITE, $3, $4 DEFERRABLE 2) SET TRANSACTION ISOLATION LEVEL READ COMMITTED; SET TRANSACTION ISOLATION LEVEL $1 COMMITTED On top of that, specifying a different isolation level may cause these commands to be grouped, which is not really cool. All that could be done incrementally later on, in 17~ or later depending on the adjustments that make sense. Attached is an updated patch set. 0003 is basically the same as v3, that I have kept around for clarity in case one wants to see the effect of a A_Const normalization to all the related commands, though I am not proposing that for an upstream integration. 0002 has been completed with a couple of commands to track all the commands with A_Const, so as we never lose sight of what happens. 0004 is what I think could be done for PG16, where normalization affects only Const. At the end of the day, this reflects the following commands that use Const nodes because they use directly queries, so the same rules as SELECT and DMLs apply to them: - DECLARE - EXPLAIN - CREATE MATERIALIZED VIEW - CTAS, SELECT INTO Comments and thoughts welcome. Thanks, -- Michael
Вложения
Hi, On 2/16/23 1:34 AM, Michael Paquier wrote: > While wondering about this stuff about the last few days and > discussing with bertrand, I have changed my mind on the point that > there is no need to be that aggressive yet with the normalization of > the A_Const nodes, because the query string normalization of > pg_stat_statements is not prepared yet to handle cases where a A_Const > value uses a non-quoted value with whitespaces. The two cases where I > saw an impact is on the commands that can define an isolation level: > SET TRANSACTION and BEGIN. > > For example, applying normalization to A_Const nodes does the > following as of HEAD: > 1) BEGIN TRANSACTION READ ONLY, READ WRITE, DEFERRABLE, NOT DEFERRABLE; > BEGIN TRANSACTION $1 ONLY, $2 WRITE, $3, $4 DEFERRABLE > 2) SET TRANSACTION ISOLATION LEVEL READ COMMITTED; > SET TRANSACTION ISOLATION LEVEL $1 COMMITTED > > On top of that, specifying a different isolation level may cause these > commands to be grouped, which is not really cool. All that could be > done incrementally later on, in 17~ or later depending on the > adjustments that make sense. > Thanks for those patches! Yeah, agree about the proposed approach. > 0002 has been > completed with a couple of commands to track all the commands with > A_Const, so as we never lose sight of what happens. 0004 is what I > think could be done for PG16, where normalization affects only Const. > At the end of the day, this reflects the following commands that use > Const nodes because they use directly queries, so the same rules as > SELECT and DMLs apply to them: > - DECLARE > - EXPLAIN > - CREATE MATERIALIZED VIEW > - CTAS, SELECT INTO > 0001: I like the idea of splitting the existing tests in dedicated files. What do you think about removing: " SET pg_stat_statements.track_utility = FALSE; SET pg_stat_statements.track_planning = TRUE; " In the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave with default values for those (currently we are setting both of them as non default). Then, with the default values in place, if we feel that some tests are missing we could add them in utility.sql or planning.sql accordingly. 0002: Produces: v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace. CREATE VIEW view_stats_1 AS v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace. CREATE VIEW view_stats_1 AS warning: 2 lines add whitespace errors. +-- SET TRANSACTION ISOLATION +BEGIN; +SET TRANSACTION ISOLATION LEVEL READ COMMITTED; +SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; +SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; What about adding things like "SET SESSION CHARACTERISTICS AS TRANSACTION..." too? 0003 and 0004: Thanks for keeping 0003 that's useful to see the impact of A_Const normalization. Looking at the diff they produce, I also do think that 0004 is what could be done for PG16. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Feb 16, 2023 at 10:55:32AM +0100, Drouvot, Bertrand wrote: > In the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave > with default values for those (currently we are setting both of them as non default). > > Then, with the default values in place, if we feel that some tests > are missing we could add them in > utility.sql or planning.sql > accordingly. I am not sure about this part, TBH, so I have left these as they are. Anyway, while having a second look at that, I have noticed that it is possible to extract as an independent piece all the tests related to level tracking. Things are worse than I thought initially, actually, because we had test scenarios mixing planning and level tracking, but the tests don't care about measuring plans at all, see around FETCH FORWARD, meaning that queries on the table pg_stat_statements have just been copy-pasted around for the last few years. There were more tests that used "test" for a table name ;) I have been pondering about this part, and the tracking matters for DO blocks and PL functions, so I have moved all these cases into a new, separate file. There is a bit more that can be done, for WAL tracking and roles near the end of pg_stat_statements.sql, but I have left that out for now. I have checked the output of the tests before and after the refactoring for quite a bit of time, and the outputs match so there is no loss of coverage. 0001 looks quite committable at this stage, and that's independent on the rest. At the end this patch creates four new test files that are extended in the next patches: utility, planning, track and cleanup. > 0002: > > Produces: > v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace. > CREATE VIEW view_stats_1 AS > v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace. > CREATE VIEW view_stats_1 AS > warning: 2 lines add whitespace errors. Thanks, fixed. > +-- SET TRANSACTION ISOLATION > +BEGIN; > +SET TRANSACTION ISOLATION LEVEL READ COMMITTED; > +SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; > +SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; > > What about adding things like "SET SESSION CHARACTERISTICS AS > TRANSACTION..." too? That's a good idea. It is again one of these fancy cases, better to keep a track of them in the long-term.. > 0003 and 0004: > Thanks for keeping 0003 that's useful to see the impact of A_Const normalization. > > Looking at the diff they produce, I also do think that 0004 is what > could be done for PG16. I am wondering if others have an opinion to share about that, but, yes, 0004 seems enough to begin with. We could always study more normalization areas in future releases, taking it slowly. -- Michael
Вложения
Hi, On 2/17/23 3:35 AM, Michael Paquier wrote: > On Thu, Feb 16, 2023 at 10:55:32AM +0100, Drouvot, Bertrand wrote: >> In the new pg_stat_statements.sql? That way pg_stat_statements.sql would always behave >> with default values for those (currently we are setting both of them as non default). >> >> Then, with the default values in place, if we feel that some tests >> are missing we could add them in > utility.sql or planning.sql >> accordingly. > > I am not sure about this part, TBH, so I have left these as they are. > > Anyway, while having a second look at that, I have noticed that it is > possible to extract as an independent piece all the tests related to > level tracking. Things are worse than I thought initially, actually, > because we had test scenarios mixing planning and level tracking, but > the tests don't care about measuring plans at all, see around FETCH > FORWARD, meaning that queries on the table pg_stat_statements have > just been copy-pasted around for the last few years. There were more > tests that used "test" for a table name ;) > > I have been pondering about this part, and the tracking matters for DO > blocks and PL functions, so I have moved all these cases into a new, > separate file. There is a bit more that can be done, for WAL tracking > and roles near the end of pg_stat_statements.sql, but I have left that > out for now. I have checked the output of the tests before and after > the refactoring for quite a bit of time, and the outputs match so > there is no loss of coverage. > > 0001 looks quite committable at this stage, and that's independent on > the rest. At the end this patch creates four new test files that are > extended in the next patches: utility, planning, track and cleanup. > Thanks! LGTM. >> 0002: >> >> Produces: >> v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:834: trailing whitespace. >> CREATE VIEW view_stats_1 AS >> v2-0002-Add-more-test-for-utility-queries-in-pg_stat_stat.patch:838: trailing whitespace. >> CREATE VIEW view_stats_1 AS >> warning: 2 lines add whitespace errors. > > Thanks, fixed. > Thanks! >> +-- SET TRANSACTION ISOLATION >> +BEGIN; >> +SET TRANSACTION ISOLATION LEVEL READ COMMITTED; >> +SET TRANSACTION ISOLATION LEVEL REPEATABLE READ; >> +SET TRANSACTION ISOLATION LEVEL SERIALIZABLE; >> >> What about adding things like "SET SESSION CHARACTERISTICS AS >> TRANSACTION..." too? > > That's a good idea. It is again one of these fancy cases, better to > keep a track of them in the long-term.. > Right. 002 LGTM. >> 0003 and 0004: >> Thanks for keeping 0003 that's useful to see the impact of A_Const normalization. >> >> Looking at the diff they produce, I also do think that 0004 is what >> could be done for PG16. > > I am wondering if others have an opinion to share about that, but, > yes, 0004 seems enough to begin with. We could always study more > normalization areas in future releases, taking it slowly. Agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Fri, Feb 17, 2023 at 09:36:27AM +0100, Drouvot, Bertrand wrote: > On 2/17/23 3:35 AM, Michael Paquier wrote: >> 0001 looks quite committable at this stage, and that's independent on >> the rest. At the end this patch creates four new test files that are >> extended in the next patches: utility, planning, track and cleanup. > > Thanks! LGTM. Thanks. I have applied the set of regression tests in 0001 and 0002. Note that I have changed the order of the attributes when querying pg_stat_statements, to make easier to follow the diffs generated by the normalization. The unaligned mode would be another option, but it makes not much sense as long as there are no more than two fields with variable lengths. Some extra notes about that: - Should the test for the validation WAL generation metrics be moved out? I am not sure that it makes much sense to separate it as it has a short purpose. - Same issue with user activity, which creates a few roles and makes sure that their activity is tracked? We don't look at the userid in this case, which does not make much sense to me. - Same issue with locking clauses, worth a file of their own? The main file is still named pg_stat_statements.sql, perhaps it should be renamed to something more generic, like general.sql? Or perhaps we could just split the main file with a select.sql (with locking clauses) and a dml.sql? >> I am wondering if others have an opinion to share about that, but, >> yes, 0004 seems enough to begin with. We could always study more >> normalization areas in future releases, taking it slowly. > > Agree. These last ones are staying around for a few more weeks, until the middle of the next CF, I guess. After all this is done, the final changes are very short, showing the effects of the normalization, as of: 6 files changed, 45 insertions(+), 35 deletions(-) -- Michael
Вложения
On Mon, Feb 20, 2023 at 11:32:23AM +0900, Michael Paquier wrote: > These last ones are staying around for a few more weeks, until the > middle of the next CF, I guess. After all this is done, the final > changes are very short, showing the effects of the normalization, as > of: > 6 files changed, 45 insertions(+), 35 deletions(-) With the patches.. -- Michael
Вложения
On Mon, Feb 20, 2023 at 11:34:59AM +0900, Michael Paquier wrote: > With the patches.. Attached is an updated patch set, where I have done more refactoring work for the regression tests of pg_stat_statements, splitting pg_stat_statments.sql into the following files: - user_activity.sql for the role-level resets. - wal.sql for the WAL generation tracking. - dml.sql for insert/update/delete/merge and row counts. - The main file is renamed to select.sql, as it now only covers SELECT patterns. There is no change in the code coverage or the patterns tested. And with that, I am rather comfortable with the shape of the regression tests moving forward. 0002 and 0003 are equivalent to the previous 0003 and 0004 in v4, that switch pg_stat_statements to apply the normalization to utilities that use Const nodes. -- Michael
Вложения
Hi, On 3/1/23 5:47 AM, Michael Paquier wrote: > On Mon, Feb 20, 2023 at 11:34:59AM +0900, Michael Paquier wrote: >> With the patches.. > > Attached is an updated patch set, where I have done more refactoring > work for the regression tests of pg_stat_statements, splitting > pg_stat_statments.sql into the following files: > - user_activity.sql for the role-level resets. > - wal.sql for the WAL generation tracking. > - dml.sql for insert/update/delete/merge and row counts. > - The main file is renamed to select.sql, as it now only covers SELECT > patterns. > Thanks! Splitting even more and removing pg_stat_statements.sql/out does make sense to me, so +1 for the patch. Applying 0001 produces: Applying: Split more regression tests of pg_stat_statements .git/rebase-apply/patch:1735: new blank line at EOF. + .git/rebase-apply/patch:2264: new blank line at EOF. + warning: 2 lines add whitespace errors. Nits: +++ b/contrib/pg_stat_statements/sql/wal.sql @@ -0,0 +1,22 @@ +-- +-- Validate WAL generation metrics +-- + +SET pg_stat_statements.track_utility = FALSE; + +-- utility "create table" should not be shown This comment is coming from the previous pg_stat_statements.sql but I wonder if it makes sense here as testing utility is not the initial purpose of wal.sql. Same comment for dml.sql: +-- utility "create table" should not be shown +CREATE TEMP TABLE pgss_dml_tab (a int, b char(20)); What about removing those comments? > There is no change in the code coverage or the patterns tested. I had a look (comparing the new .sql files with the old pg_stat_statements.sql content) and I agree. Except from the Nits above, 0001 LGTM. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Thu, Mar 02, 2023 at 08:12:24AM +0100, Drouvot, Bertrand wrote: > Applying 0001 produces: > > Applying: Split more regression tests of pg_stat_statements > .git/rebase-apply/patch:1735: new blank line at EOF. > + > .git/rebase-apply/patch:2264: new blank line at EOF. > + > warning: 2 lines add whitespace errors. Indeed, removed. > What about removing those comments? Removing these two as well. > Except from the Nits above, 0001 LGTM. Thanks for double-checking, applied 0001 to finish this part of the work. I am attaching the remaining bits as of the attached, combined into a single patch. I am going to look at it again at the beginning of next week and potentially apply it so as the normalization reflects to the reports of pg_stat_statements. -- Michael
Вложения
Hi Michael! I'm rebasing a patch "Tracking statements entry timestamp in pg_stat_statements" for applying after this patch. I've noted that current tests are not quite independent one from another. There is two statements in the end of user_activity.sql test: DROP ROLE regress_stats_user1; DROP ROLE regress_stats_user2; Those are done after the last pg_stat_statements_reset call in this test file and thus, those are included in checks of wal.out file: query | calls | rows | wal_bytes_generated | wal_records_generated | wal_records_ge_rows ----------------------------------------------------------------------- -------+-------+------+---------------------+-----------------------+-- ------------------- DELETE FROM pgss_wal_tab WHERE a > $1 | 1 | 1 | t | t | t DROP ROLE regress_stats_user1 | 1 | 0 | t | t | t DROP ROLE regress_stats_user2 | 1 | 0 | t | t | t Those statements is not related to any WAL tests. It seems a little bit incorrect to me. Are we need some changes here? -- Andrei Zubkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On Mon, Mar 06, 2023 at 03:50:55PM +0300, Andrei Zubkov wrote: > Those statements is not related to any WAL tests. It seems a little bit > incorrect to me. The intention is to have each file run in isolation, so this is incorrect as it stands. Thanks for the report! -- Michael
Вложения
On Fri, Mar 03, 2023 at 09:37:27AM +0900, Michael Paquier wrote: > Thanks for double-checking, applied 0001 to finish this part of the > work. I am attaching the remaining bits as of the attached, combined > into a single patch. Doing so as a single patch was not feeling right as this actually fixes issues with the location calculations for the Const node, so I have split that into three commits and finally applied the whole. As a bonus, please see attached a patch to apply the normalization to CALL statements using the new automated infrastructure. OUT parameters can be passed to a procedure, hence I guess that these had better be silenced as well. This is not aimed at being integrated, just for reference. -- Michael
Вложения
On Wed, Mar 8, 2023 at 2:19 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Mar 03, 2023 at 09:37:27AM +0900, Michael Paquier wrote: > > Thanks for double-checking, applied 0001 to finish this part of the > > work. I am attaching the remaining bits as of the attached, combined > > into a single patch. > > Doing so as a single patch was not feeling right as this actually > fixes issues with the location calculations for the Const node, so I > have split that into three commits and finally applied the whole. > > As a bonus, please see attached a patch to apply the normalization to > CALL statements using the new automated infrastructure. OUT > parameters can be passed to a procedure, hence I guess that these had > better be silenced as well. This is not aimed at being integrated, > just for reference. > -- > Michael I tested it. all normally works fine. only 1 corner case: set pg_stat_statements.track = 'all'; drop table if Exists cp_test; CREATE TABLE cp_test (a int, b text); CREATE or REPLACE PROCEDURE ptest1(x text) LANGUAGE SQL AS $$ INSERT INTO cp_test VALUES (1, x); $$; CREATE or REPLACE PROCEDURE ptest3(y text) LANGUAGE SQL AS $$ CALL ptest1(y); CALL ptest1($1); $$; SELECT pg_stat_statements_reset(); CALL ptest3('b'); SELECT calls, toplevel, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C"; returns: calls | toplevel | rows | query -------+----------+------+------------------------------------ 1 | t | 0 | CALL ptest3($1) 2 | f | 2 | INSERT INTO cp_test VALUES ($2, x) 1 | t | 1 | SELECT pg_stat_statements_reset() here, the intermediate CALL part is optimized away. or should I expect CALL ptest1($1) also in pg_stat_statements?
On Wed, Aug 16, 2023 at 05:11:47PM +0800, jian he wrote: > SELECT calls, toplevel, rows, query FROM pg_stat_statements ORDER BY > query COLLATE "C"; > returns: > calls | toplevel | rows | query > -------+----------+------+------------------------------------ > 1 | t | 0 | CALL ptest3($1) > 2 | f | 2 | INSERT INTO cp_test VALUES ($2, x) > 1 | t | 1 | SELECT pg_stat_statements_reset() > > here, the intermediate CALL part is optimized away. or should I expect > CALL ptest1($1) also in pg_stat_statements? I would have guessed that ptest1() being called as part of ptest3() should show up in the report if you use track = all, as all the nested queries of a function, even if it is pure SQL, ought to show up. Now note that ptest1() not showing up is not a new behavior, ~15 does the same thing by missing it. -- Michael