Обсуждение: Incorrect cost for MergeAppend
Hello hackers, While investigating some query plans, I noticed some code that seems to be wrong: when create_merge_append_path() estimates the cost of sorting an input, it calls cost_sort() passing subpath->parent->tuples as the number of tuples. Shouldn't it use subpath->parent->rows or even subpath->rows instead? The `tuples` variable doesn't account for the filters on the relation, so this leads to incorrect cost estimates when there are highly selective filters, and Sort + Append is chosen instead of MergeAppend. I don't have a good repro yet because the plans I've been studying involve a third-party extension, but the code looks incorrect so I thought I should ask. Here is a link to this code on GitHub: https://github.com/postgres/postgres/blob/6a1ea02c491d16474a6214603dce40b5b122d4d1/src/backend/optimizer/util/pathnode.c#L1469-L1477 --- Alexander Kuzmenkov Timescale
On Mon, Jan 29, 2024 at 6:11 PM Alexander Kuzmenkov <akuzmenkov@timescale.com> wrote: > > Hello hackers, > > While investigating some query plans, I noticed some code that seems > to be wrong: when create_merge_append_path() estimates the cost of > sorting an input, it calls cost_sort() passing subpath->parent->tuples > as the number of tuples. Shouldn't it use subpath->parent->rows or > even subpath->rows instead? The `tuples` variable doesn't account for > the filters on the relation, so this leads to incorrect cost estimates > when there are highly selective filters, and Sort + Append is chosen > instead of MergeAppend. All other callers of cost_sort() except plan_cluster_use_sort() are using rows instead of tuples. Even plan_cluster_use_sort() has rel->rows = rel->tuples, it's actually passing rows. So agree with your suggestion. However a test will be good since this code is quite old. -- Best Wishes, Ashutosh Bapat
Here is a small patch that reproduces the problem on two tables with inheritance, and fixes it. I'll add it to the Commitfest. On Tue, Jan 30, 2024 at 8:20 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Mon, Jan 29, 2024 at 6:11 PM Alexander Kuzmenkov > <akuzmenkov@timescale.com> wrote: > > > > Hello hackers, > > > > While investigating some query plans, I noticed some code that seems > > to be wrong: when create_merge_append_path() estimates the cost of > > sorting an input, it calls cost_sort() passing subpath->parent->tuples > > as the number of tuples. Shouldn't it use subpath->parent->rows or > > even subpath->rows instead? The `tuples` variable doesn't account for > > the filters on the relation, so this leads to incorrect cost estimates > > when there are highly selective filters, and Sort + Append is chosen > > instead of MergeAppend. > > All other callers of cost_sort() except plan_cluster_use_sort() are > using rows instead of tuples. Even plan_cluster_use_sort() has > rel->rows = rel->tuples, it's actually passing rows. So agree with > your suggestion. However a test will be good since this code is quite > old. > > -- > Best Wishes, > Ashutosh Bapat
Вложения
Hi, > Here is a small patch that reproduces the problem on two tables with > inheritance, and fixes it. I'll add it to the Commitfest. Thanks for the patch. I can confirm that it changes the plan from Sort + Append to MergeAppend. Before: ``` explain (costs off) select * from ma0 where a < 1000 order by a; QUERY PLAN --------------------------------------------------------- Sort Sort Key: ma0.a -> Append -> Index Only Scan using ma0_pkey on ma0 ma0_1 Index Cond: (a < 1000) -> Seq Scan on ma1 ma0_2 Filter: (a < 1000) ``` After: ``` =# explain (costs off) select * from ma0 where a < 1000 order by a; QUERY PLAN --------------------------------------------------- Merge Append Sort Key: ma0.a -> Index Only Scan using ma0_pkey on ma0 ma0_1 Index Cond: (a < 1000) -> Sort Sort Key: ma0_2.a -> Seq Scan on ma1 ma0_2 Filter: (a < 1000) ``` The rest of the tests pass. -- Best regards, Aleksander Alekseev
On Wed, 31 Jan 2024 at 00:06, Alexander Kuzmenkov <akuzmenkov@timescale.com> wrote: > Here is a small patch that reproduces the problem on two tables with > inheritance, and fixes it. I'll add it to the Commitfest. Good catch. It seems to have been broken since MergeAppends were added in 11cad29c9. The code fix looks good to me. The tests look a bit heavy. I wondered if you could have used a UNION ALL query with the tenk1 table so you didn't have to create tables, however, I see that case works ok since the parent->tuples is set in set_subquery_size_estimates() from "rel->tuples = sub_final_rel->cheapest_total_path->rows;". I played around with the following and see your patch changes the plan to a Merge Append away from an Append -> Sort plan. create table tenk_parent (like tenk1); alter table tenk1 inherit tenk_parent; alter table tenk2 inherit tenk_parent; explain (costs off) select * from tenk_parent where thousand = 0 order by tenthous; alter table tenk1 no inherit tenk_parent; alter table tenk2 no inherit tenk_parent; I'm just not sure if we should be messing with tenk1 and tenk2 like that. You should likely focus on trying to find a test that does not require making 2 tables with 10k rows. David
On Tue, Jan 30, 2024 at 1:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > You should likely focus on trying to find a test that does not require > making 2 tables with 10k rows. Is 1k smallint OK? It should fit in one page. Still reproduces the error, and the entire test case runs in under 10 ms.
Вложения
On Wed, 31 Jan 2024 at 02:23, Alexander Kuzmenkov <akuzmenkov@timescale.com> wrote: > > On Tue, Jan 30, 2024 at 1:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > > You should likely focus on trying to find a test that does not require > > making 2 tables with 10k rows. > > Is 1k smallint OK? It should fit in one page. Still reproduces the > error, and the entire test case runs in under 10 ms. I had a go at making it a bit smaller without going dangerously close to where the plan might change. The following seems to work. create table ma0(a int primary key); create table ma1() inherits (ma0); insert into ma0 select generate_series(1, 400); insert into ma1 select generate_series(1, 200); analyze ma0; analyze ma1; explain (costs off) select * from ma0 where a < 100 order by a; drop table ma0 cascade; As for backpatching this. It does risk plans changing in stable versions of PostgreSQL, which we normally try to avoid. However, it seems quite similar to 1e731ed12a, albeit, far more long-standing. I'm currently thinking we should backpatch this back to 12. Does anyone disagree? David
On Wed, Jan 31, 2024 at 4:33 AM David Rowley <dgrowleyml@gmail.com> wrote: > > On Wed, 31 Jan 2024 at 02:23, Alexander Kuzmenkov > <akuzmenkov@timescale.com> wrote: > > > > On Tue, Jan 30, 2024 at 1:20 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > You should likely focus on trying to find a test that does not require > > > making 2 tables with 10k rows. > > > > Is 1k smallint OK? It should fit in one page. Still reproduces the > > error, and the entire test case runs in under 10 ms. > > I had a go at making it a bit smaller without going dangerously close > to where the plan might change. The following seems to work. > > create table ma0(a int primary key); > create table ma1() inherits (ma0); > insert into ma0 select generate_series(1, 400); > insert into ma1 select generate_series(1, 200); > analyze ma0; > analyze ma1; > > explain (costs off) select * from ma0 where a < 100 order by a; > > drop table ma0 cascade; On my laptop with all default settings with patch postgres@231714=#explain select * from ma0 where a < 100 order by a; QUERY PLAN -------------------------------------------------------------------------------------- Merge Append (cost=6.94..18.90 rows=198 width=4) Sort Key: ma0.a -> Index Only Scan using ma0_pkey on ma0 ma0_1 (cost=0.15..9.88 rows=99 width=4) Index Cond: (a < 100) -> Sort (cost=6.78..7.03 rows=99 width=4) Sort Key: ma0_2.a -> Seq Scan on ma1 ma0_2 (cost=0.00..3.50 rows=99 width=4) Filter: (a < 100) (8 rows) without patch #explain select * from ma0 where a < 100 order by a; QUERY PLAN ---------------------------------------------------------------------- Sort (cost=19.04..19.54 rows=198 width=4) Sort Key: ma0.a -> Append (cost=0.00..11.49 rows=198 width=4) -> Seq Scan on ma0 ma0_1 (cost=0.00..7.00 rows=99 width=4) Filter: (a < 100) -> Seq Scan on ma1 ma0_2 (cost=0.00..3.50 rows=99 width=4) Filter: (a < 100) (7 rows) postgres@233864=#select (19.54 - 18.90)/19.54, (19.04 - 18.09)/19.04; ?column? | ?column? ------------------------+------------------------ 0.03275332650972364381 | 0.04989495798319327731 (1 row) Those numbers are higher than 1% (#define STD_FUZZ_FACTOR 1.01) but slight variation in all the GUCs that affect cost, might bring the difference closer to STD_FUZZ_FACTOR. Given how close they are, maybe it's not such a good idea to backpatch. If the plans change for better, users won't complain but otherwise they will complain and will have no way to go back to their good plans in production. -- Best Wishes, Ashutosh Bapat
On Wed, 31 Jan 2024 at 18:58, Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > with patch > Merge Append (cost=6.94..18.90 rows=198 width=4) > without patch > Sort (cost=19.04..19.54 rows=198 width=4) > Those numbers are higher than 1% (#define STD_FUZZ_FACTOR 1.01) but > slight variation in all the GUCs that affect cost, might bring the > difference closer to STD_FUZZ_FACTOR. > > Given how close they are, maybe it's not such a good idea to > backpatch. The reason those numbers are close is because I reduced the row count on the test tables to a point where we only just get the Merge Append plan with a small margin. I don't see the test case costs as a relevant factor for if we backpatch or not. What is relevant are things like: For: * It's a clear bug and what's happening now is clearly wrong. * inheritance/partitioned table plan changes for the better in minor versions Against: * Nobody has complained for 13 years, so maybe it's unlikely anyone is suffering too much. * Possibility of inheritance/partitioned table plans changing for the worse in minor versions For now, I'm on the fence on this one. David
On Wed, Jan 31, 2024 at 12:12 PM David Rowley <dgrowleyml@gmail.com> wrote: > > What is relevant are things like: > > For: > * It's a clear bug and what's happening now is clearly wrong. > * inheritance/partitioned table plan changes for the better in minor versions > > Against: > * Nobody has complained for 13 years, so maybe it's unlikely anyone is > suffering too much. > * Possibility of inheritance/partitioned table plans changing for the > worse in minor versions > That's what I am thinking as well. And the plans that may change for the worse are the ones where the costs with and without the patch are close. Just to be clear, the change is for good and should be committed to the master. It's the backpatching I am worried about. -- Best Wishes, Ashutosh Bapat
I'd be happy to see this backpatched. What kind of regressions are we worried about? I'd say partition-wise sort + merge should be faster than append + sort for reasonably sized tables. That's basically what tuplesort does inside. Moreso, this can enable index scans on partitions, which is an even better plan. On Wed, Jan 31, 2024 at 7:46 AM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote: > > On Wed, Jan 31, 2024 at 12:12 PM David Rowley <dgrowleyml@gmail.com> wrote: > > > > What is relevant are things like: > > > > For: > > * It's a clear bug and what's happening now is clearly wrong. > > * inheritance/partitioned table plan changes for the better in minor versions > > > > Against: > > * Nobody has complained for 13 years, so maybe it's unlikely anyone is > > suffering too much. > > * Possibility of inheritance/partitioned table plans changing for the > > worse in minor versions > > > > That's what I am thinking as well. And the plans that may change for > the worse are the ones where the costs with and without the patch are > close. > > Just to be clear, the change is for good and should be committed to > the master. It's the backpatching I am worried about. > > -- > Best Wishes, > Ashutosh Bapat
On Wed, Jan 31, 2024 at 1:33 PM Alexander Kuzmenkov <akuzmenkov@timescale.com> wrote: > I'd be happy to see this backpatched. What kind of regressions are we > worried about? I'd say partition-wise sort + merge should be faster > than append + sort for reasonably sized tables. That's basically what > tuplesort does inside. Moreso, this can enable index scans on > partitions, which is an even better plan. To put it another way, this change enables our usual cost model for MergeAppend to work correctly in the presence of filters. I think we can consider this model to be reasonably correct, and we don't currently have major problems with MergeAppend being chosen instead of Sort + Append in cases where it's suboptimal, right? So applying it properly in case with filters is not likely to introduce problems.
On 2024-Jan-31, Alexander Kuzmenkov wrote: > To put it another way, this change enables our usual cost model for > MergeAppend to work correctly in the presence of filters. I think we > can consider this model to be reasonably correct, and we don't > currently have major problems with MergeAppend being chosen instead of > Sort + Append in cases where it's suboptimal, right? So applying it > properly in case with filters is not likely to introduce problems. Since we have a minor coming up very soon, I think it's not a good idea to backpatch right now. Maybe you can push to master now, and consider whether to backpatch later. The problem is -- if somebody has an application that gets good plans with the current cost model, and you change the cost model and the plans become worse, what do they do? If you change this in a major release, this is not an issue because they must test their queries before upgrading and if they fail to realize a problem exists then it's their fault. If you change it in a minor release, then those people will be very upset that things were changed suddenly, and they may get wary of future minor upgrades, which we don't want. Plus, they would need to do careful testing before doing the minor upgrade. Maybe plans can only go from bad to good and never from good to bad. But are we 100% certain that that is the case? People who are **desperate** to get this improvement can perhaps run a patched version in the meantime. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Hi,
>Since we have a minor coming up very soon, I think it's not a good idea
>to backpatch right now. Maybe you can push to master now, and consider
>whether to backpatch later.
>The problem is -- if somebody has an application that gets good plans
>with the current cost model, and you change the cost model and the plans
>become worse, what do they do? If you change this in a major release,
>this is not an issue because they must test their queries before
>upgrading and if they fail to realize a problem exists then it's their
>fault. If you change it in a minor release, then those people will be
>very upset that things were changed suddenly, and they may get wary of
>future minor upgrades, which we don't want.
>Since we have a minor coming up very soon, I think it's not a good idea
>to backpatch right now. Maybe you can push to master now, and consider
>whether to backpatch later.
>The problem is -- if somebody has an application that gets good plans
>with the current cost model, and you change the cost model and the plans
>become worse, what do they do? If you change this in a major release,
>this is not an issue because they must test their queries before
>upgrading and if they fail to realize a problem exists then it's their
>fault. If you change it in a minor release, then those people will be
>very upset that things were changed suddenly, and they may get wary of
>future minor upgrades, which we don't want.
I agree with this, especially as we tell our customers that such changes do not happen from one minor release to another.
Regards
Daniel
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > Since we have a minor coming up very soon, I think it's not a good idea > to backpatch right now. Maybe you can push to master now, and consider > whether to backpatch later. As a rule, we don't back-patch changes like this ever. People don't appreciate plans changing in minor releases. regards, tom lane
On Thu, 1 Feb 2024 at 04:32, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > Since we have a minor coming up very soon, I think it's not a good idea > > to backpatch right now. Maybe you can push to master now, and consider > > whether to backpatch later. > > As a rule, we don't back-patch changes like this ever. People don't > appreciate plans changing in minor releases. Pushed to master. Thanks for the report and the fix, Alexander.
On Wed, Jan 31, 2024 at 9:49 PM David Rowley <dgrowleyml@gmail.com> wrote: > Pushed to master. > > Thanks for the report and the fix, Alexander. Thank you!