Thread: BUG #18205: Performance regression with NOT NULL checks.
The following bug has been logged on the website:

Bug reference:      18205
Logged by:          Daniel Migowski
Email address:      dmigowski@ikoffice.de
PostgreSQL version: 15.5
Operating system:   Windows + Linux
Description:

I found out that selecting from a wide table with a single not-null WHERE
clause leads to a severe performance regression when upgrading from
PostgreSQL 9.5 to PostgreSQL 15. I am running the following query without
parallel query enabled, because in a real-world case we cannot go parallel
anyway:

    select count(id) FROM testtable t WHERE t.y IS NULL;

Here are the timings of the query (three runs):

    PG 9.5: 0,33s 0,33s 0,33s
    PG 15:  0,43s 0,44s 0,43s

Please create a test table this way:

    drop table if exists testtable cascade;
    SELECT generate_series::int4 as id,
           null::int4 a, null::int4 b, null::int4 c, null::int4 d,
           null::int4 e, null::int4 f, null::int4 g, null::int4 h,
           null::int4 i, null::int4 j, null::int4 k, null::int4 l,
           null::int4 m, null::int4 n, null::int4 o, null::int4 p,
           null::int4 q, null::int4 r, null::int4 s, null::int4 t,
           null::int4 u, null::int4 v, null::int4 w, null::int4 x,
           null::int4 y, null::int4 z
    into testtable
    FROM generate_series(1,6000000,1);

    -- On PG15:
    set max_parallel_workers = 0;
    set max_parallel_workers_per_gather = 0;

I already talked to depesz about this and he timed the query on different
PG versions, which seem to be the versions from PostgreSQL's Debian
repository:

- 9.5.25 : 590.958 ms
- 9.6.24 : 607.228 ms
- 10.23  : 820.779 ms
- 11.22  : 746.122 ms
- 12.17  : 829.786 ms
- 13.13  : 804.878 ms
- 14.10  : 772.415 ms
- 15.5   : 774.749 ms
- 16.1   : 746.802 ms

Starting with PostgreSQL 10 there is a severe degradation in performance.
I just cannot believe that later versions of PostgreSQL are so much slower
than the original versions.
PG Bug reporting form <noreply@postgresql.org> writes:
> I found out that selecting from a wide table with a single not-null where
> clause leads to severe performance regression when upgrading from PostgreSQL
> 9.5 to PostgreSQL 15.

I spent some time poking into this.  "git bisect" pins the blame on

    commit b8d7f053c5c2bf2a7e8734fe3327f6a8bc711755
    Author: Andres Freund <andres@anarazel.de>
    Date:   Tue Mar 14 15:45:36 2017 -0700

        Faster expression evaluation and targetlist projection.

The time needed for the seqscan gets about 50% worse at that commit
(on my test machine, anyway):

    -> Seq Scan on testtable t (cost=0.00..51892.80 rows=9730 width=4)
       (actual time=0.010..204.937 rows=6000000 loops=1)
         Filter: (y IS NULL)

versus

    -> Seq Scan on testtable t (cost=0.00..51892.80 rows=9730 width=4)
       (actual time=0.013..317.069 rows=6000000 loops=1)
         Filter: (y IS NULL)

"perf" says that the extra time is mostly spent in slot_deform_tuple's
inner loop:

    for (; attnum < natts; attnum++)
    {
        Form_pg_attribute thisatt = att[attnum];

        if (hasnulls && att_isnull(attnum, bp))
        {
            values[attnum] = (Datum) 0;
            isnull[attnum] = true;
            slow = true;        /* can't use attcacheoff anymore */
            continue;
        }
        ...

which confused me, because that code doesn't look materially different
in v10 than 9.6.  I eventually realized that the reason is that we reach
slot_deform_tuple with natts = 26 in the new code, but in the old code
we do not, thanks to this short-circuit in slot_getattr:

    /*
     * check if target attribute is null: no point in groveling through tuple
     */
    if (HeapTupleHasNulls(tuple) && att_isnull(attnum - 1, tup->t_bits))
    {
        *isnull = true;
        return (Datum) 0;
    }

So that results in not having to deconstruct most of the tuple,
whereas in the new code we do have to, thanks to b8d7f053c's
decision to batch all the variable-value-extraction work.

This is a pretty narrow corner case: it would only matter if the
column you're testing for null-ness is far past any other column
the query needs to fetch.  So I'm not sure that it's worth doing
anything about.  You could imagine special processing for
NullTest-on-a-simple-Var: exclude the Var from the ones that we
extract normally and instead compile it into some sort of "is column k
NULL" test on the HeapTuple.  But that seems messy, and it could be a
significant pessimization for storage methods that aren't like heapam.

On the whole I'm inclined to say "sorry, that's the price of
progress".  But it is a bit sad that a patch intended to be a
performance win shows a loss this big on a (presumably) real-world
case.

			regards, tom lane
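[Editor's sketch] To make the contrast concrete, here is a minimal C
sketch of the two strategies being compared, assuming a NullTest on
attribute 26 of the test table. This is illustrative code only, not the
actual executor implementation; the two helper names are invented:

    #include "executor/tuptable.h"

    /*
     * Old-style evaluation: probe just the one attribute.  slot_getattr()
     * can answer from the null bitmap alone, without deforming the tuple.
     */
    static bool
    nulltest_old_style(TupleTableSlot *slot, int attnum)
    {
        bool        isnull;

        (void) slot_getattr(slot, attnum, &isnull);
        return isnull;
    }

    /*
     * New-style evaluation: the EEOP_SCAN_FETCHSOME step batches the
     * work, deforming attributes 1..attnum before the NullTest runs.
     */
    static bool
    nulltest_new_style(TupleTableSlot *slot, int attnum)
    {
        slot_getsomeattrs(slot, attnum);
        return slot->tts_isnull[attnum - 1];    /* 0-based array */
    }

With 25 all-null columns ahead of y, the old style never walks them,
while the new style pays for every one of them on every row.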
Hi,

On 2023-11-19 14:08:05 -0500, Tom Lane wrote:
> So that results in not having to deconstruct most of the tuple,
> whereas in the new code we do have to, thanks to b8d7f053c's
> decision to batch all the variable-value-extraction work.

Yea, I think we were aware at the time that this does have downsides -
it's just that the worst-case behaviour of *not* batching is much worse
than the worst-case downside of batching.

> This is a pretty narrow corner case: it would only matter if the
> column you're testing for null-ness is far past any other column
> the query needs to fetch.  So I'm not sure that it's worth doing
> anything about.  You could imagine special processing for
> NullTest-on-a-simple-Var: exclude the Var from the ones that we
> extract normally and instead compile it into some sort of "is column k
> NULL" test on the HeapTuple.

We actually did add fastpaths for a few similar cases: ExecJustInnerVar()
etc will just use slot_getattr(). These can be used when the result is
just a single variable. However, the goal there was more to avoid
"interpreter startup" overhead, rather than evaluation overhead.

Those fastpaths don't match in this case though, the generated "program"
is:

    EEOP_SCAN_FETCHSOME
    EEOP_SCAN_VAR
    EEOP_NULLTEST_ISNULL
    EEOP_QUAL
    EEOP_DONE

which matches none of the fastpaths we recognize. In fact, I don't think
any of the fastpaths currently can be applied to quals, which is somewhat
sad :(

Right now the "pattern matching" happens in ExecReadyInterpretedExpr().
For portions that makes sense (as they're interpretation specific), but
the transformation from EEOP_SCAN_FETCHSOME to just fetching a single
column imo should be moved into execExpr.c, by making expr_setup_walker()
a bit smarter. I think that might help in a fair number of cases - quals
on a single column aren't exactly rare.

In this specific case we also could elide the EEOP_QUAL, the NULLTEST
cannot return a NULL, so we don't need any qual-specific behaviour. For
this case alone it'd probably not be worth tracking whether something can
be NULL. But it could be worthwhile if we made it more generic, e.g.
eliding strictness checks before function calls could be a nice win.

> But that seems messy, and it could be a significant pessimization for
> storage methods that aren't like heapam.

My concern is more that they're pretty narrow - they apply only if a
single column is referenced. If we just need attributes 100 and 101, it's
still expensive to deform the leading columns.

To address that, I think we could benefit from a tuple deforming routine
that's more aware of which columns are needed. Even just omitting storing
values in tts_values/nulls that aren't needed yielded decent wins when I
experimented in the past.

The hard part of course is to figure out when to use a "selective"
deforming routine, which will have *worse* performance if most columns
are used, and when to just deform everything up to natts.

I think the current deforming logic is actually optimized towards heapam
to a problematic degree - deforming leading columns in a columnar
database hurts really badly.

> On the whole I'm inclined to say "sorry, that's the price of
> progress".

Agreed, we can't really go back to deforming individual columns more
widely - that can get *really* expensive. But I think we can do plenty to
improve the performance of this specific case, both by improving the
generated expression program and:

> But it is a bit sad that a patch intended to be a
> performance win shows a loss this big on a (presumably) real-world
> case.
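[Editor's sketch] A hypothetical detection of that program, written in
the style of the existing fastpath checks in ExecReadyInterpretedExpr().
The surrounding code and the proposed replacement are assumptions, not a
real patch:

    /*
     * Hypothetical fastpath detection for a single-column NULL-test
     * qual.  Sketched after the opcode comparisons the existing
     * fastpath checks perform; not actual PostgreSQL code.
     */
    if (state->steps_len == 5 &&
        state->steps[0].opcode == EEOP_SCAN_FETCHSOME &&
        state->steps[1].opcode == EEOP_SCAN_VAR &&
        state->steps[2].opcode == EEOP_NULLTEST_ISNULL &&
        state->steps[3].opcode == EEOP_QUAL)
    {
        /*
         * Could be evaluated as a single "is column k NULL" probe on
         * the scan tuple, skipping both the batched deforming and,
         * since a NullTest never returns NULL itself, the EEOP_QUAL
         * step.
         */
    }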
I think we might be able to micro-optimize that code a bit to reduce the
decrease in performance. Neither gcc nor clang is able to optimize the
null-check meaningfully, and there are a number of dependent
instructions:

    if (hasnulls && att_isnull(attnum, bp))
    {
        values[attnum] = (Datum) 0;
        isnull[attnum] = true;
        slow = true;        /* can't use attcacheoff anymore */
        continue;
    }

    static inline bool
    att_isnull(int ATT, const bits8 *BITS)
    {
        return !(BITS[ATT >> 3] & (1 << (ATT & 0x07)));
    }

What if we instead load 8 bytes of the bitmap into a uint64 before
entering the loop, and shift an "index" mask into the bitmap by one each
iteration through the loop? Then we'd only need to access the bitmap
every 64 attributes.

I think we could also get rid of the hasnulls check in each iteration
that way, by just checking hasnulls whenever we load portions of the
bitmap into a local variable, and loading "no-nulls" into it when
!hasnulls.

Greetings,

Andres Freund
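[Editor's sketch] A minimal sketch of that idea, assuming a little-endian
machine and that the trailing word of the bitmap can safely be over-read
(a real patch would need a tail guard and an endianness-aware load; this
is untested illustration, not a proposed diff):

    uint64      nullword = 0;   /* cached 64 bits of the null bitmap */
    uint64      nullmask = 0;   /* == 0 forces a (re)load */

    for (; attnum < natts; attnum++)
    {
        if (nullmask == 0)
        {
            /* refill once per 64 attributes; "no nulls" if no bitmap */
            if (hasnulls)
                memcpy(&nullword, bp + (attnum / 64) * 8, sizeof(nullword));
            else
                nullword = ~(uint64) 0;
            nullmask = (uint64) 1 << (attnum % 64);
        }

        /* bit clear => NULL, matching att_isnull()'s convention */
        if ((nullword & nullmask) == 0)
        {
            values[attnum] = (Datum) 0;
            isnull[attnum] = true;
            slow = true;        /* can't use attcacheoff anymore */
            nullmask <<= 1;
            continue;
        }
        nullmask <<= 1;

        /* ... deform the non-null attribute as before ... */
    }

When nullmask overflows to zero after 64 shifts, the next iteration
reloads the following word, so the bitmap is touched once per 64
attributes rather than once per attribute, and the hasnulls branch moves
out of the per-attribute path.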
Thank you for the detailed explanation. It surely has a reason to remove
the previously used short-circuit in slot_getattr, at least in the
special case when we

And yes, in my real-world use case which I tried to understand, I have a
wide table with 81 columns, and I am using columns 1, 43, 18, 75 and
filter by attributes 82, 42, and 24.

Don't know how to handle this, maybe I should rework the layout of all my
tables and move the mostly-used columns to the beginning.
Andres Freund <andres@anarazel.de> writes:
> On 2023-11-19 14:08:05 -0500, Tom Lane wrote:
>> So that results in not having to deconstruct most of the tuple,
>> whereas in the new code we do have to, thanks to b8d7f053c's
>> decision to batch all the variable-value-extraction work.

> Yea, I think we were aware at the time that this does have downsides -
> it's just that the worst-case behaviour of *not* batching is much worse
> than the worst-case downside of batching.

Agreed.  Still ...

> We actually did add fastpaths for a few similar cases: ExecJustInnerVar()
> etc will just use slot_getattr(). These can be used when the result is
> just a single variable. However, the goal there was more to avoid
> "interpreter startup" overhead, rather than evaluation overhead.

Yeah.  Also, if I'm reading the example appropriately, Daniel's case
*does* involve fetching more than a single column --- but the other ones
are up near the start so we didn't use to have to deform very much of
the tuple.

> What if we instead load 8 bytes of the bitmap into a uint64 before
> entering the loop, and shift an "index" mask into the bitmap by one
> each iteration through the loop?

Meh.  Seems like a micro-optimization that does nothing for the big-O
problem.  One thing to think about is that I suspect "all the columns
are null" is just a simple test case and not very representative of
the real-world problem.  In the real case, probably quite a few of
the leading columns are non-null, which would make Daniel's issue
even worse because slot_deform_tuple would have to do significantly
more work that it didn't do before.  Shaving cycles off the null-column
fast path would be proportionally less useful too.

It might well be that what you suggest is worth doing just to cut
the cost of slot_deform_tuple across the board, but I don't think
it's an answer to this complaint specifically.

			regards, tom lane
Hi,

On 2023-11-19 21:15:37 +0000, Daniel Migowski wrote:
> Thank you for the detailed explanation. It surely has a reason to remove
> the previously used short-circuit in slot_getattr, at least in the
> special case when we

The short-circuit in slot_getattr() still exists, we just don't reach it
in this case.

> And yes, in my real-world use case which I tried to understand, I have a
> wide table with 81 columns, and I am using columns 1, 43, 18, 75 and
> filter by attributes 82, 42, and 24.

Are most of the columns NULL or not?

> Don't know how to handle this, maybe I should rework the layout of all my
> tables and move the mostly-used columns to the beginning.

It might also be worth splitting the table into multiple tables and
joining when necessary. Independent of this issue, relatively wide
tuples, where you don't use most of the columns, will cause unnecessary
IO.

Greetings,

Andres
Hi,

On 2023-11-19 14:17:44 -0800, Andres Freund wrote:
> On 2023-11-19 21:15:37 +0000, Daniel Migowski wrote:
> > And yes, in my real-world use case which I tried to understand, I have
> > a wide table with 81 columns, and I am using columns 1, 43, 18, 75 and
> > filter by attributes 82, 42, and 24.
>
> Are most of the columns NULL or not?

Another question: In the real query, how selective is the WHERE clause?
In your test query all rows are returned and you have no columns in the
select list, but it doesn't sound like that's quite your real workload...

Greetings,

Andres Freund
Hi,

On 2023-11-19 16:30:49 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > We actually did add fastpaths for a few similar cases: ExecJustInnerVar()
> > etc will just use slot_getattr(). These can be used when the result is
> > just a single variable. However, the goal there was more to avoid
> > "interpreter startup" overhead, rather than evaluation overhead.
>
> Yeah.  Also, if I'm reading the example appropriately, Daniel's case
> *does* involve fetching more than a single column --- but the other ones
> are up near the start so we didn't use to have to deform very much of
> the tuple.

Hm - I see. The initial report didn't make it sound like that, but the
follow-up clarified that aspect.

> > What if we instead load 8 bytes of the bitmap into a uint64 before
> > entering the loop, and shift an "index" mask into the bitmap by one
> > each iteration through the loop?
>
> Meh.  Seems like a micro-optimization that does nothing for the big-O
> problem.  One thing to think about is that I suspect "all the columns
> are null" is just a simple test case and not very representative of
> the real-world problem.  In the real case, probably quite a few of
> the leading columns are non-null, which would make Daniel's issue
> even worse because slot_deform_tuple would have to do significantly
> more work that it didn't do before.  Shaving cycles off the null-column
> fast path would be proportionally less useful too.

It doesn't make it algorithmically better, you're right - but I think
it's quite noticeable even in the case of the other columns having
values.

I changed the test to insert 0 into all columns other than y, and changed
the WHERE clause to IS NOT NULL, to avoid the overhead of the aggregation
path. Profile:

    - 91.27%  0.00%  postgres  postgres  [.] ExecProcNode (inlined)
         ExecProcNode (inlined)
       - ExecScan
          - 60.21% ExecQual (inlined)
             - ExecEvalExprSwitchContext (inlined)
                - 59.70% ExecInterpExpr
                   - 54.05% slot_getsomeattrs (inlined)
                      - 53.52% slot_getsomeattrs_int
                         - 52.93% tts_buffer_heap_getsomeattrs
                            - 52.88% slot_deform_heap_tuple (inlined)
                               + 12.39% fetch_att (inlined)
                               + 12.26% att_isnull (inlined)
                               + 0.14% asm_sysvec_apic_timer_interrupt
                   + 1.60% BoolGetDatum (inlined)
                   + 0.01% asm_sysvec_apic_timer_interrupt
                     0.35% MemoryContextSwitchTo (inlined)
          - 30.47% ExecScanFetch (inlined)
             + 29.87% SeqNext
             + 0.01% asm_sysvec_apic_timer_interrupt
            0.23% MemoryContextReset
       + 0.01% asm_sysvec_apic_timer_interrupt

So even here we spend a decent amount of the time in null bitmap
handling.

I also am not so sure you're right that accessing multiple columns makes
the slot_attisnull() path even more advantageous - it's considerably
slower when accessing multiple columns, so if you actually access more
columns, the benefit will shrink. Particularly if there are
variable-width columns that need to be skipped over and such.

> I have a wide table with 81 columns, and I am using columns 1, 43, 18, 75
> and filter by attributes 82, 42, and 24.

Projecting four columns and filtering by three gets a fair bit more
expensive if you actually deform them individually.

> It might well be that what you suggest is worth doing just to cut
> the cost of slot_deform_tuple across the board, but I don't think
> it's an answer to this complaint specifically.

I did suggest multiple things for a reason :). slot_deform_tuple() is
generally a pretty big bottleneck, so efficiency improvements would be
interesting...
And if we were to only deform the columns we actually need, handling the
null bitmap more efficiently becomes more important, so it could even
help cases like this.

Greetings,

Andres Freund
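[Editor's sketch] A rough sketch of what such selective deforming could
look like, assuming a caller-supplied set of needed attribute numbers.
This is illustration only; the real slot_deform_tuple() also maintains
attcacheoff and slow-path state, which this omits:

    /*
     * Hypothetical "selective" variant of slot_deform_tuple's inner
     * loop: every attribute must still be walked to find the next
     * offset, but values/isnull are only stored for attributes in
     * "needed_attrs".  Not actual PostgreSQL code.
     */
    for (; attnum < natts; attnum++)
    {
        Form_pg_attribute thisatt = TupleDescAttr(tupleDesc, attnum);
        bool        wanted = bms_is_member(attnum, needed_attrs);

        if (hasnulls && att_isnull(attnum, bp))
        {
            if (wanted)
            {
                values[attnum] = (Datum) 0;
                isnull[attnum] = true;
            }
            continue;           /* NULLs consume no storage */
        }

        /* align to this attribute's start */
        off = att_align_pointer(off, thisatt->attalign, thisatt->attlen,
                                tp + off);

        if (wanted)
        {
            values[attnum] = fetch_att(tp + off, thisatt->attbyval,
                                       thisatt->attlen);
            isnull[attnum] = false;
        }

        /* skip over the value either way */
        off = att_addlength_pointer(off, thisatt->attlen, tp + off);
    }

The saving here is the store into tts_values/tts_isnull (and the fetch)
for unwanted columns; the offset walk remains, which is why it can lose
when most columns are referenced.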
Hi,

On 2023-11-19 14:41:47 -0800, Andres Freund wrote:
> It doesn't make it algorithmically better, you're right - but I think
> it's quite noticeable even in the case of the other columns having
> values.
>
> I changed the test to insert 0 into all columns other than y, and
> changed the WHERE clause to IS NOT NULL, to avoid the overhead of the
> aggregation path. Profile:
>
> [...]
>
> So even here we spend a decent amount of the time in null bitmap
> handling.

If I prewarm the data into shared buffers and change the table so there
is a NULL in one of the leading columns, this changes to:

    - 94.15%  0.00%  postgres  postgres  [.] ExecProcNode (inlined)
         ExecProcNode (inlined)
       - ExecScan
          - 79.15% ExecQual (inlined)
             - ExecEvalExprSwitchContext (inlined)
                - 78.50% ExecInterpExpr
                   - 71.62% slot_getsomeattrs (inlined)
                      - 70.52% slot_getsomeattrs_int
                         - 69.79% tts_buffer_heap_getsomeattrs
                            - 69.56% slot_deform_heap_tuple (inlined)
                               + 20.25% att_isnull (inlined)
                               + 9.45% fetch_att (inlined)
                               + 0.39% asm_sysvec_apic_timer_interrupt
                   + 1.88% BoolGetDatum (inlined)
                   + 0.09% asm_sysvec_apic_timer_interrupt
                     0.51% MemoryContextSwitchTo (inlined)
                + 0.02% asm_sysvec_apic_timer_interrupt
          + 14.39% ExecScanFetch (inlined)
            0.14% MemoryContextReset
       + 0.03% asm_sysvec_apic_timer_interrupt

Reducing the time spent in att_isnull() wouldn't get us back to pre-v10
timings, but it'd certainly help to close the gap. Of course you can make
the difference more extreme by adding a lot more leading columns, but
still.

Greetings,

Andres Freund
Hello Andres,

Half of them are null on average, and 30% are even not null. The test
case I sent you was artificial, to show the problem, but it's the same
slower behaviour when I fill all the columns with values except the one
in the where clause.

Greetings,

Daniel
Hello Andres,

In this case the NOT NULL check matches all columns, but the other where
clauses only match 30%. The software is used by more than a hundred
customers and they all use different features, but my queries must of
course work in every case.

I do indeed have a lot of tables with 80 columns or so, and will now try
to restructure them so the important columns come sooner. But in many
queries we select maybe 5-20 fields and filter by an arbitrary field, so
sparse selection of attributes would be VERY beneficial for us in many of
the expensive queries. Queries returning many rows or joining and
aggregating data mostly return just part of the table, and that is where
the performance is needed for us.

Kind regards,
Daniel Migowski

PS: Btw, many queries that result in showing a list of stuff in the GUI,
like an invoice list with customers, show the name and number of the
customer from the customer table. We will now ensure that these "display
name" attributes are at the start of the table so they can be retrieved
efficiently. Another example of sparse column selection, but with the
right organisation we don't need to optimize here.