Discussion: benchmark results comparing versions 15.2 and 16
This is mostly a hobby project for me - my other hobby is removing invasive weeds. I am happy to answer questions and run more tests, but turnaround for answers won't be instant. Getting results from Linux perf for these tests is on my TODO list. For now I am just re-running a subset of these to get more certainty that the regressions are real and not noise.
These are results for the insert benchmark on a small server comparing performance for versions 15.2 and 16. For version 16 I built from source at git sha 1ab763fc.
https://github.com/postgres/postgres/commit/1ab763fc22adc88e5d779817e7b42b25a9dd7c9e
Late in the version 15 beta release cycle this benchmark found a significant regression. I don't see anything significant this time, but there are potential small regressions.
More detail on how I run the benchmark and the hardware is here; the server is small -- a Beelink with 8 cores, 16G RAM and a 1T NVMe SSD.
http://smalldatum.blogspot.com/2023/05/the-insert-benchmark-postgres-versions.html
Performance reports are linked below. But first, disclaimers:
* the goal is to determine whether there are CPU improvements or regressions. To make that easier to spot I disable fsync on commit.
* my scripts compute CPU/operation where the operation is a SQL statement. However, CPU in this case is measured by vmstat and includes CPU from both the benchmark client and the Postgres server
* the regressions here are small, usually less than 5%, which means it can be hard to distinguish normal variance from a regression, but the results are repeatable
* the links below are to the Summary section, which includes throughput (absolute and relative). The relative throughput is (throughput for PG 16 / throughput for PG 15.2)
* I used the same compiler options for the builds of 15.2 and 16
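The two metrics described in the disclaimers above can be sketched in a few lines. This is a hypothetical illustration of the arithmetic, not the author's actual scripts; the function names and the vmstat sampling shape are assumptions:

```python
# Hypothetical sketch of the two metrics described above: relative
# throughput and CPU per operation, where the CPU numbers come from
# vmstat and so include both the client and the Postgres server.

def relative_throughput(tps_new, tps_base):
    """< 1.0 means the new version regressed, > 1.0 means it improved."""
    return tps_new / tps_base

def cpu_per_operation(vmstat_us_sy_samples, operations):
    """Average non-idle CPU (us + sy columns from vmstat) divided by the
    number of SQL statements executed in the measured interval."""
    busy = sum(us + sy for us, sy in vmstat_us_sy_samples)
    return busy / len(vmstat_us_sy_samples) / operations

# Made-up numbers in the same shape as the reports
print(round(relative_throughput(9800.0, 10000.0), 2))  # a 2% regression
print(cpu_per_operation([(10, 5), (20, 5)], 100))
```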
Summary of the results:
* from r1 - insert-heavy (l.i0, l.i1) and create indexes (l.x) steps are ~2% slower in PG 16
* from r2 - create index (l.x) step is ~4% slower in PG 16
* from r3 - regressions are similar to r1
* from r4, r5 and r6 - regressions are mostly worse than r1, r2, r3. Note r4, r5, r6 are the same workload as r1, r2, r3 except the database is cached by PG for r1, r2, r3 so the r4, r5 and r6 benchmarks will do much more copying from the OS page cache into the Postgres buffer pool.
I will repeat r1, r2, r4 and r5 but with the tests run in a different order to confirm this isn't just noise.
Database cached by Postgres:
r1) 1 table, 1 client - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.cached/all.html#summary
r2) 4 tables, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tno.cached/all.html#summary
r3) 1 table, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.cached/all.html#summary
Database cached by OS but not by Postgres:
r4) 1 table, 1 client - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.1g/all.html#summary
r5) 4 tables, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tno.1g/all.html#summary
r6) 1 table, 4 clients - https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/all.html#summary
mdcallag@gmail.com
Hi,

On 2023-05-05 10:45:12 -0700, MARK CALLAGHAN wrote:
> This is mostly a hobby project for me - my other hobby is removing
> invasive weeds.

Hah :)

> Summary of the results:
> * from r1 - insert-heavy (l.i0, l.i1) and create indexes (l.x) steps are
> ~2% slower in PG 16
> * from r2 - create index (l.x) step is ~4% slower in PG 16
> * from r3 - regressions are similar to r1
> * from r4, r5 and r6 - regressions are mostly worse than r1, r2, r3. Note
> r4, r5, r6 are the same workload as r1, r2, r3 except the database is
> cached by PG for r1, r2, r3 so the r4, r5 and r6 benchmarks will do much
> more copying from the OS page cache into the Postgres buffer pool.

One thing that's somewhat odd is that there's very marked changes in l.i0's
p99 latency for the four clients cases - but whether 15 or 16 are better
differs between the runs.

r2) p99
    20m.pg152_o3_native_lto.cx7     300
    20m.pg16prebeta.cx7           23683
r3) p99
    20m.pg152_o3_native_lto.cx7   70245
    20m.pg16prebeta.cx7            8191
r5) p99
    20m.pg152_o3_native_lto.cx7   11188
    20m.pg16prebeta.cx7           72720
r6) p99
    20m.pg152_o3_native_lto.cx7    1898
    20m.pg16prebeta.cx7           31666

I do wonder if there's something getting scheduled in some of these runs,
increasing latency? Or what we're seeing depends on the time between the
start of the server and the start of the benchmark? It is interesting that
the per-second throughput graph shows a lot of up/down at the end:
https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/tput.l.i0.html

Both 15 and 16 have one very high result at 70s, 15 then has one low, but 16
has two low results.

Greetings,

Andres Freund
> One thing that's somewhat odd is that there's very marked changes in l.i0's
> p99 latency for the four clients cases - but whether 15 or 16 are better
> differs between the runs.
From the response time sections for l.i0 the [1ms, 4ms) bucket has 99% or more for all 6 cases.
For example, https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.1u.1tno.cached/all.html#l.i0.rt
But yes, the p99 is as you state. I will wade through my test scripts tomorrow to see how the p99 is computed.
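Until the scripts are checked, here is a hedged sketch of one common way a p99 is derived from response-time histogram buckets, via linear interpolation inside the bucket that crosses the 99th percentile. The bucket bounds and counts below are made up for illustration, not taken from the reports, and the insert benchmark's actual computation may differ:

```python
# Estimate a p99 from response-time histogram buckets by linear
# interpolation within the bucket containing the 99th percentile.
# Generic sketch; the benchmark's real computation may differ.

def p99_from_buckets(buckets, total):
    """buckets: ascending list of (lower_us, upper_us, count)."""
    target = 0.99 * total
    seen = 0
    for lo, hi, count in buckets:
        if seen + count >= target:
            frac = (target - seen) / count   # position inside this bucket
            return lo + frac * (hi - lo)
        seen += count
    return buckets[-1][1]

# 99.5% of 1000 responses fall in [1ms, 4ms), so the estimated p99
# lands inside that bucket despite a few much slower outliers.
buckets = [(1000, 4000, 995), (4000, 16000, 4), (16000, 64000, 1)]
print(p99_from_buckets(buckets, 1000))
```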
> I do wonder if there's something getting scheduled in some of these runs
> increasing latency?
> Or what we're seeing depends on the time between the start of the server and
> the start of the benchmark? It is interesting that the per-second throughput
> graph shows a lot of up/down at the end:
> https://mdcallag.github.io/reports/23_05_04_ibench.beelink.pg16b.4u.1tyes.1g/tput.l.i0.html
> Both 15 and 16 have one very high result at 70s, 15 then has one low, but 16
> has two low results.
The IPS vs time graphs are more interesting for benchmark steps that run longer. Alas, this can't run too long if the resulting database is to fit in <= 16G. But that is a problem for another day. The IPS vs time graphs are not a flat line, but I am not ready to pursue that as a problem unless it shows multi-second write-stalls (fortunately it does not).
Hello Mark,

05.05.2023 20:45, MARK CALLAGHAN wrote:
> This is mostly a hobby project for me - my other hobby is removing invasive weeds. I am happy to answer questions and
> run more tests, but turnaround for answers won't be instant. Getting results from Linux perf for these tests is on my
> TODO list. For now I am just re-running a subset of these to get more certainty that the regressions are real and not
> noise.

It's a very interesting topic to me, too. I had developed some scripts to
measure and compare Postgres performance using miscellaneous public
benchmarks (ycsb, tpcds, benchmarksql_tpcc, htapbench, benchbase, gdprbench,
s64da-benchmark, ...). Having compared 15.3 (56e869a09) with master
(58f5edf84) I haven't seen significant regressions except a few minor ones.

The first regression was observed with a simple pgbench test:
pgbench -i benchdb
pgbench -c 10 -T 300 benchdb
(with default compilation options and fsync = off)

On master I get:
tps = 10349.826645 (without initial connection time)
On 15.3:
tps = 11296.064992 (without initial connection time)

This difference is confirmed by multiple test runs. `git bisect` for this
regression pointed at f193883fc.

Best regards,
Alexander
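A crude way to judge whether a tps gap like the one above exceeds run-to-run noise is to compare the difference in means against the spread of repeated runs. A hedged sketch follows; the repeat numbers are made up (not Alexander's actual runs) and the 2x-spread threshold is an arbitrary choice:

```python
import statistics

def looks_real(runs_a, runs_b, min_effect=0.01):
    """Crude noise filter: flag the difference only if the gap between
    mean tps values exceeds both twice the worst run-to-run spread and
    a minimum relative effect size."""
    mean_a, mean_b = statistics.mean(runs_a), statistics.mean(runs_b)
    spread = max(statistics.stdev(runs_a), statistics.stdev(runs_b))
    gap = abs(mean_a - mean_b)
    return gap > max(2 * spread, min_effect * max(mean_a, mean_b))

master = [10349.8, 10312.4, 10366.1]   # made-up repeats near the quoted tps
rel153 = [11296.1, 11250.7, 11310.9]
print(looks_real(master, rel153))      # → True
```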
Hi,

On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:
> This difference is confirmed by multiple test runs. `git bisect` for this
> regression pointed at f193883fc.

I can reproduce a significant regression due to f193883fc of a workload just
running
SELECT CURRENT_TIMESTAMP;

A single session running it on my workstation via pgbench -Mprepared gets
before:
tps = 89359.128359 (without initial connection time)
after:
tps = 83843.585152 (without initial connection time)

Obviously this is an extreme workload, but that nevertheless seems too large
to just accept...

Michael, the commit message notes that there were no measured performance
regressions - yet I see one in a trivial test. What were you measuring?

I'm a bit surprised by the magnitude of the regression, but it's not
surprising that there is a performance effect. You're replacing something
that doesn't go through the whole generic function rigamarole with something
that does...

Looking at two perf profiles, the biggest noticeable difference is

Before:
- 5.51% 0.13% postgres postgres [.] ExecInitResult
   - 5.38% ExecInitResult
      + 2.29% ExecInitResultTupleSlotTL
      - 2.22% ExecAssignProjectionInfo
         - 2.19% ExecBuildProjectionInfo
              0.47% ExecReadyInterpretedExpr
            - 0.43% ExecInitExprRec
               - 0.10% palloc
                    AllocSetAlloc.localalias (inlined)
            + 0.32% expression_tree_walker_impl.localalias (inlined)
            + 0.28% get_typlen
              0.09% ExecPushExprSlots
            + 0.06% MemoryContextAllocZeroAligned
         + 0.04% MemoryContextAllocZeroAligned
           0.02% exprType.localalias (inlined)
      + 0.41% ExecAssignExprContext
      + 0.35% MemoryContextAllocZeroAligned
        0.11% ExecInitQual.localalias (inlined)
   + 0.11% _start
   + 0.02% 0x55b89c764d7f

After:
- 6.57% 0.17% postgres postgres [.] ExecInitResult
   - 6.40% ExecInitResult
      - 3.00% ExecAssignProjectionInfo
         - ExecBuildProjectionInfo
            - 0.91% ExecInitExprRec
               - 0.65% ExecInitFunc
                    0.23% fmgr_info_cxt_security
                  + 0.18% palloc0
                  + 0.07% object_aclcheck
                    0.04% fmgr_info
                 0.05% check_stack_depth
               + 0.05% palloc
            + 0.58% expression_tree_walker_impl.localalias (inlined)
            + 0.55% get_typlen
              0.37% ExecReadyInterpretedExpr
            + 0.11% MemoryContextAllocZeroAligned
              0.09% ExecPushExprSlots
              0.04% exprType.localalias (inlined)
      + 2.77% ExecInitResultTupleSlotTL
      + 0.50% ExecAssignExprContext
      + 0.09% MemoryContextAllocZeroAligned
        0.05% ExecInitQual.localalias (inlined)
      + 0.10% _start

I.e. we spend more time building the expression state for expression
evaluation, because we now go through the generic ExecInitFunc(), instead of
something dedicated. We also now need to do permission checking etc. I don't
think that's the entirety of the regression...

Greetings,

Andres Freund
Hi,

On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
> On 2023-05-08 16:00:01 +0300, Alexander Lakhin wrote:
> > This difference is confirmed by multiple test runs. `git bisect` for this
> > regression pointed at f193883fc.
>
> I can reproduce a significant regression due to f193883fc of a workload just
> running
> SELECT CURRENT_TIMESTAMP;
>
> A single session running it on my workstation via pgbench -Mprepared gets
> before:
> tps = 89359.128359 (without initial connection time)
> after:
> tps = 83843.585152 (without initial connection time)
>
> Obviously this is an extreme workload, but that nevertheless seems too large
> to just accept...

Added an open item for this.

Greetings,

Andres Freund
I have two more runs of the benchmark in progress so we will have 3 results for each of the test cases to confirm that the small regressions are repeatable.
I was using: ./configure --prefix=$pfx --enable-debug CFLAGS="-O3 -march=native -mtune=native -flto" LDFLAGS="-flto" > o.cf.$x 2> e.cf.$x
Adding -fno-omit-frame-pointer fixes the problem, so I am repeating the benchmark with that to confirm there are still regressions, and then I will get flamegraphs.
08.05.2023 16:00, Alexander Lakhin wrote:
> ... Having compared 15.3 (56e869a09) with master
> (58f5edf84) I haven't seen significant regressions except a few minor ones.
> First regression observed with a simple pgbench test:

Another noticeable, but not critical, performance degradation is revealed by
query 87 from TPC-DS (I use s64da-benchmark):
https://github.com/swarm64/s64da-benchmark-toolkit/blob/master/benchmarks/tpcds/queries/queries_10/87.sql

With `prepare_benchmark --scale-factor=2`, `run_benchmark --scale-factor=10`
I get on master:
2023-05-10 09:27:52,888 INFO : finished 80/103: query 87 of stream 0: 2.26s OK
but on REL_15_STABLE:
2023-05-10 08:13:40,648 INFO : finished 80/103: query 87 of stream 0: 1.94s OK

This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
(both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):

-> Subquery Scan on "*SELECT* 1" (cost=149622.00..149958.68 rows=16834 width=21) (actual time=1018.606..1074.468 rows=93891 loops=1)
  -> Unique (cost=149622.00..149790.34 rows=16834 width=17) (actual time=1018.604..1064.790 rows=93891 loops=1)
    -> Sort (cost=149622.00..149664.09 rows=16834 width=17) (actual time=1018.603..1052.591 rows=94199 loops=1)
      -> Gather (cost=146588.59..148440.33 rows=16834 width=17) (actual time=880.899..913.978 rows=94199 loops=1)
vs
-> Subquery Scan on "*SELECT* 1" (cost=147576.79..149829.53 rows=16091 width=21) (actual time=1126.489..1366.751 rows=93891 loops=1)
  -> Unique (cost=147576.79..149668.62 rows=16091 width=17) (actual time=1126.487..1356.938 rows=93891 loops=1)
    -> Gather Merge (cost=147576.79..149547.94 rows=16091 width=17) (actual time=1126.487..1345.253 rows=94204 loops=1)
      -> Unique (cost=146576.78..146737.69 rows=16091 width=17) (actual time=1124.426..1306.532 rows=47102 loops=2)
        -> Sort (cost=146576.78..146617.01 rows=16091 width=17) (actual time=1124.424..1245.110 rows=533434 loops=2)

-> Subquery Scan on "*SELECT* 2" (cost=52259.82..52428.16 rows=8417 width=21) (actual time=653.640..676.879 rows=62744 loops=1)
  -> Unique (cost=52259.82..52343.99 rows=8417 width=17) (actual time=653.639..670.405 rows=62744 loops=1)
    -> Sort (cost=52259.82..52280.86 rows=8417 width=17) (actual time=653.637..662.428 rows=62744 loops=1)
      -> Gather (cost=50785.20..51711.07 rows=8417 width=17) (actual time=562.158..571.737 rows=62744 loops=1)
        -> HashAggregate (cost=49785.20..49869.37 rows=8417 width=17) (actual time=538.263..544.336 rows=31372 loops=2)
          -> Nested Loop (cost=0.85..49722.07 rows=8417 width=17) (actual time=2.049..469.747 rows=284349 loops=2)
vs
-> Subquery Scan on "*SELECT* 2" (cost=48503.68..49630.12 rows=8046 width=21) (actual time=700.050..828.388 rows=62744 loops=1)
  -> Unique (cost=48503.68..49549.66 rows=8046 width=17) (actual time=700.047..821.836 rows=62744 loops=1)
    -> Gather Merge (cost=48503.68..49489.31 rows=8046 width=17) (actual time=700.047..814.136 rows=62744 loops=1)
      -> Unique (cost=47503.67..47584.13 rows=8046 width=17) (actual time=666.348..763.403 rows=31372 loops=2)
        -> Sort (cost=47503.67..47523.78 rows=8046 width=17) (actual time=666.347..730.336 rows=284349 loops=2)
          -> Nested Loop (cost=0.85..46981.72 rows=8046 width=17) (actual time=1.852..454.111 rows=284349 loops=2)

-> Subquery Scan on "*SELECT* 3" (cost=50608.83..51568.70 rows=7165 width=21) (actual time=302.571..405.305 rows=23737 loops=1)
  -> Unique (cost=50608.83..51497.05 rows=7165 width=17) (actual time=302.568..402.818 rows=23737 loops=1)
    -> Gather Merge (cost=50608.83..51443.31 rows=7165 width=17) (actual time=302.567..372.246 rows=287761 loops=1)
      -> Sort (cost=49608.81..49616.27 rows=2985 width=17) (actual time=298.204..310.075 rows=95920 loops=3)
        -> Nested Loop (cost=2570.65..49436.52 rows=2985 width=17) (actual time=3.205..229.192 rows=95920 loops=3)
vs
-> Subquery Scan on "*SELECT* 3" (cost=50541.84..51329.11 rows=5708 width=21) (actual time=302.042..336.820 rows=23737 loops=1)
  -> Unique (cost=50541.84..51272.03 rows=5708 width=17) (actual time=302.039..334.329 rows=23737 loops=1)
    -> Gather Merge (cost=50541.84..51229.22 rows=5708 width=17) (actual time=302.039..331.296 rows=24128 loops=1)
      -> Unique (cost=49541.81..49570.35 rows=2854 width=17) (actual time=298.771..320.560 rows=8043 loops=3)
        -> Sort (cost=49541.81..49548.95 rows=2854 width=17) (actual time=298.770..309.603 rows=95920 loops=3)
          -> Nested Loop (cost=2570.52..49378.01 rows=2854 width=17) (actual time=3.209..230.291 rows=95920 loops=3)

From the commit message and the discussion [1], it's not clear to me whether
this plan change is expected. Perhaps it's too minor an issue to bring
attention to, but maybe this information could be useful for v16 performance
analysis.

[1] https://postgr.es/m/CAApHDvo8Lz2H=42urBbfP65LTcEUOh288MT7DsG2_EWtW1AXHQ@mail.gmail.com

Best regards,
Alexander
Attachments
On Thu, 11 May 2023 at 01:00, Alexander Lakhin <exclusion@gmail.com> wrote:
> This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
> (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):

Based on what you've sent, I'm uninspired to want to try to do anything
about it. The patched version finds a plan that's cheaper. The row estimates
are miles off with both plans. I'm not sure what we're supposed to do here.
It's pretty hard to make changes to the planner's path generation without
risking that a bad plan is chosen when it wasn't beforehand with bad row
estimates.

Is the new plan still slower if you increase work_mem so that the sort no
longer goes to disk? Maybe the planner would have picked Hash Aggregate if
the row estimates had been such that cost_tuplesort() knew that the sort
would have gone to disk.

David
On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
>> I can reproduce a significant regression due to f193883fc of a workload just
>> running
>> SELECT CURRENT_TIMESTAMP;
>>
>> A single session running it on my workstation via pgbench -Mprepared gets
>> before:
>> tps = 89359.128359 (without initial connection time)
>> after:
>> tps = 83843.585152 (without initial connection time)
>>
>> Obviously this is an extreme workload, but that nevertheless seems too large
>> to just accept...
>
> Added an open item for this.

Thanks for the report, I'll come back to it and look at it at the beginning
of next week. In the worst case, that would mean a revert of this
refactoring, I assume.
--
Michael
Attachments
11.05.2023 01:27, David Rowley wrote:
> On Thu, 11 May 2023 at 01:00, Alexander Lakhin <exclusion@gmail.com> wrote:
>> This time `git bisect` pointed at 3c6fc5820. Having compared execution plans
>> (both attached), I see the following differences (3c6fc5820~1 vs 3c6fc5820):
> Based on what you've sent, I'm uninspired to want to try to do
> anything about it. The patched version finds a plan that's cheaper.
> The row estimates are miles off with both plans.

I've made sure that s64da-benchmark performs analyze before running the
queries (pg_class.reltuples fields for the tables in question contain actual
counts), so it seems that nothing can be done on the benchmark side to
improve those estimates.

> ... It's pretty hard to make changes to the
> planner's path generation without risking that a bad plan is chosen
> when it wasn't beforehand with bad row estimates.

Yeah, I see. It's also interesting to me which tests perform better after
that commit. It takes several hours to run all tests, so I can't present
results quickly, but I'll try to collect this information next week.

> Is the new plan still slower if you increase work_mem so that the sort
> no longer goes to disk? Maybe the planner would have picked Hash
> Aggregate if the row estimates had been such that cost_tuplesort()
> knew that the sort would have gone to disk.

Yes, increasing work_mem to 50MB doesn't affect the plans (new plans
attached), though the sort method changed to quicksort. The former plan is
still executed slightly faster.

Best regards,
Alexander
Attachments
On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
> On Tue, May 09, 2023 at 09:48:24AM -0700, Andres Freund wrote:
>> On 2023-05-08 12:11:17 -0700, Andres Freund wrote:
>>> I can reproduce a significant regression due to f193883fc of a workload just
>>> running
>>> SELECT CURRENT_TIMESTAMP;
>>>
>>> A single session running it on my workstation via pgbench -Mprepared gets
>>> before:
>>> tps = 89359.128359 (without initial connection time)
>>> after:
>>> tps = 83843.585152 (without initial connection time)
>>>
>>> Obviously this is an extreme workload, but that nevertheless seems too large
>>> to just accept...

Extreme is adapted for a worst-case scenario. Looking at my notes from a few
months back, that's kind of what I did on my laptop, which was the only
machine I had at hand back then:
- Compilation of the code with -O2.
- Prepared statement of a simple SELECT combined with a DO block executing
  the query in a loop a few million times on a single backend:

  PREPARE test AS SELECT CURRENT_TIMESTAMP;
  DO $$
  BEGIN
    FOR i IN 1..10000000 LOOP
      EXECUTE 'EXECUTE test';
    END LOOP;
  END $$;

- The second test is mentioned at [1], with a generate_series() on a keyword.
- And actually I recall some pgbench runs similar to that.. But I don't have
  any traces of that in my notes.

This was not showing much difference, and it does not now, either. Funny
thing is that the pre-patch period was showing signs of being a bit slower
in this environment.

Anyway, I have just tested the DO case in a second "bigbox" environment that
I have set up for benchmarking a few days ago, and the DO test is showing me
a 1%~1.5% regression in runtime. That's repeatable, so that's not noise. I
have re-run a bit more pgbench (1 client, prepared query with a single
SELECT on a SQL keyword, etc.). And, TBH, I am not seeing as much difference
as you do (nothing with default pgbench setup, FWIW), still that's able to
show a bit more difference than the other two cases. HEAD shows me an
average output close to 43900 TPS (3 runs of 60s each, for instance), while
relying on SQLValueFunction shows an average of 45000 TPS. That counts as a
~2.4% regression here on bigbox based on these numbers. Not a regression as
high as mentioned above, but still visible.

>> Added an open item for this.
>
> Thanks for the report, I'll come back to it and look at it at the
> beginning of next week. In the worst case, that would mean a revert
> of this refactoring, I assume.

So, this involves commits 7aa81c6, f193883 and fb32748. 7aa81c6 has added
some tests, so I would leave the tests it added on HEAD, as the precision
was not previously tested for the SQL keywords it added coverage for.
fb32748 has removed the dependency of SQLValueFunction on collations by
making the name functions use COERCE_SQL_SYNTAX. One case where these could
be heavily used is RLS, for example, so that could be visible. f193883 has
removed the rest and the timestamp keywords.

I am not going to let that hang in the air with beta1 getting released next
week, though, so attached are two patches to revert the change (these would
be combined, just posted this way for clarity). The only conflict I could
see is caused by the query jumbling, where a SQLValueFunction needs "typmod"
and "op" in the computation, ignoring the rest, so the node definition in
primnodes.h gains the required node_attr. Making the execution path cheaper
while avoiding the collation tweaks for SQLValueFunction would be nice, but
not at this time of the year on this branch.

[1]: https://www.postgresql.org/message-id/Y0+dHDYA46UnnLs/@paquier.xyz
--
Michael
Attachments
Hi,

On 2023-05-15 14:20:24 +0900, Michael Paquier wrote:
> On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
> Extreme is adapted for a worst-case scenario. Looking at my notes
> from a few months back, that's kind of what I did on my laptop, which
> was the only machine I had at hand back then:
> - Compilation of code with -O2.

I assume without assertions as well?

> I have re-run a bit more pgbench (1 client, prepared query with a
> single SELECT on a SQL keyword, etc.). And, TBH, I am not seeing as
> much difference as you do (nothing with default pgbench setup, FWIW),
> still that's able to show a bit more difference than the other two
> cases. HEAD shows me an average output close to 43900 TPS (3 run of
> 60s each, for instance), while relying on SQLValueFunction shows an
> average of 45000TPS. That counts for ~2.4% output regression here
> on bigbox based on these numbers. Not a regression as high as
> mentioned above, still that's visible.

45k seems too low for a modern machine, given that I get > 80k in such a
workload, on a workstation with server CPUs (i.e. many cores, but not that
fast individually). Hence wondering about assertions being enabled...

I get quite variable performance if I don't pin client / server to the same
core, but even the slow performance is faster than 45k.

Greetings,

Andres Freund
On Mon, May 15, 2023 at 05:14:47PM -0700, Andres Freund wrote:
> On 2023-05-15 14:20:24 +0900, Michael Paquier wrote:
>> On Thu, May 11, 2023 at 01:28:40PM +0900, Michael Paquier wrote:
>> Extreme is adapted for a worst-case scenario. Looking at my notes
>> from a few months back, that's kind of what I did on my laptop, which
>> was the only machine I had at hand back then:
>> - Compilation of code with -O2.
>
> I assume without assertions as well?

Yup, no assertions.

> 45k seems too low for a modern machine, given that I get > 80k in such a
> workload, on a workstation with server CPUs (i.e. many cores, but not that
> fast individually). Hence wondering about assertions being enabled...

Nope, disabled.

> I get quite variable performance if I don't pin client / server to the same
> core, but even the slow performance is faster than 45k.

Okay. You mean with something like taskset or similar, I guess?
--
Michael
Attachments
Hi,

On 2023-05-16 09:42:31 +0900, Michael Paquier wrote:
> > I get quite variable performance if I don't pin client / server to the same
> > core, but even the slow performance is faster than 45k.
>
> Okay. You mean with something like taskset or similar, I guess?

Yes. numactl --physcpubind ... in my case. Linux has an optimization where
it does not need to send an IPI when the client and server are scheduled on
the same core. For single threaded ping-pong tasks like pgbench -c1, that
can make a huge difference, particularly on larger CPUs. So you get a lot
better performance when forcing things to be colocated.

Greetings,

Andres Freund
On Mon, May 15, 2023 at 05:54:53PM -0700, Andres Freund wrote:
> Yes. numactl --physcpubind ... in my case. Linux has an optimization where it
> does not need to send an IPI when the client and server are scheduled on the
> same core. For single threaded ping-pong tasks like pgbench -c1, that can make
> a huge difference, particularly on larger CPUs. So you get a lot better
> performance when forcing things to be colocated.

Yes, that's not bringing the numbers higher with the simple cases I reported
previously, either. Anyway, even if I cannot see such a high difference, I
don't see how to bring back the original numbers you are reporting without
doing more inlining and tying COERCE_SQL_SYNTAX more tightly within the
executor's portions for the FuncExprs, and there are the collation
assumptions as well. Perhaps that's not the correct thing to do with
SQLValueFunction remaining around, but nothing can be done for v16, so I am
planning to just revert the change before beta1, and look at it again later,
from scratch.
--
Michael
Attachments
15.05.2023 08:20, Michael Paquier wrote:
> I am not going to let that hang in the air with beta1 getting released
> next week, though, so attached are two patches to revert the change
> (these would be combined, just posted this way for clarity).

I can confirm that the patches improve (restore) performance for my test:
pgbench -i benchdb
pgbench -c 10 -T 300 benchdb

tps (over three runs):
HEAD (08c45ae23):
10238.441580, 10697.202119, 10706.764703

HEAD with the patches:
11134.510118, 11176.554996, 11150.338488

f193883fc~1 (240e0dbac):
11082.561388, 11233.604446, 11087.071768

15.3 (8382864eb):
11328.699555, 11128.057883, 11057.934392

Best regards,
Alexander
On Tue, May 16, 2023 at 06:00:00PM +0300, Alexander Lakhin wrote:
> I can confirm that the patches improve (restore) performance for my test:
> pgbench -i benchdb
> pgbench -c 10 -T 300 benchdb

Thanks for running these!

> tps (over three runs):
> HEAD (08c45ae23):
> 10238.441580, 10697.202119, 10706.764703
>
> HEAD with the patches:
> 11134.510118, 11176.554996, 11150.338488
>
> f193883fc~1 (240e0dbac)
> 11082.561388, 11233.604446, 11087.071768
>
> 15.3 (8382864eb)
> 11328.699555, 11128.057883, 11057.934392

The numbers between f193883fc~1 and HEAD+patch are close to each other. It
does not seem to make the whole difference with 15.3, but most of it. The
difference can also be explained with some noise, based on the number
patterns of the third runs? I have now applied the revert, ready for beta1.
Thanks for the feedback!
--
Michael
Attachments
> The numbers between f193883fc~1 and HEAD+patch are close to each other. It
> does not seem to make the whole difference with 15.3, but most of it. The
> difference can also be explained with some noise, based on the number
> patterns of the third runs? I have now applied the revert, ready for beta1.
> Thanks for the feedback!
Thank you for paying attention to it!
Yes, I ran the benchmark on my workstation, so numbers could vary due to parallel
activity. Now I've compared 15.3 (8382864eb) with d8c3106bb and 1d369c9e9,
this time with the CPU boost mode disabled:
1d369c9e9:
10007.130326, 10047.722122, 9920.612426, 10016.053846, 10060.606408
d8c3106bb:
10492.100485, 10505.326827, 10535.918137, 10625.904871, 10573.608859
15.3:
10458.752330, 10308.677192, 10366.496526, 10489.395275, 10319.458041
The best "1d369c9e9" result is worse than "15.3" by 4.1 percent (10060.61 < 10489.40)
The average "1d369c9e9" result is worse than "15.3" by 3.6 percent (10010.43 < 10388.56)
The best "d8c3106bb" result is better than "15.3" by 1.3 percent (10625.90 > 10489.40)
The average "d8c3106bb" result is better than "15.3" by 1.5 percent (10546.57 > 10388.56)
So it seems that there is nothing left on this plate.
Best regards,
Alexander
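The best/average comparisons above can be recomputed directly from the quoted tps lists. A small checking sketch follows; the convention (deltas relative to 15.3, rounded to one decimal) is an assumption that happens to reproduce the quoted figures:

```python
# Recompute the best/average deltas vs 15.3 from the raw tps runs above.
runs = {
    "1d369c9e9": [10007.130326, 10047.722122, 9920.612426,
                  10016.053846, 10060.606408],
    "d8c3106bb": [10492.100485, 10505.326827, 10535.918137,
                  10625.904871, 10573.608859],
    "15.3":      [10458.752330, 10308.677192, 10366.496526,
                  10489.395275, 10319.458041],
}
base_best = max(runs["15.3"])
base_avg = sum(runs["15.3"]) / len(runs["15.3"])
for name in ("1d369c9e9", "d8c3106bb"):
    best_pct = 100 * (max(runs[name]) - base_best) / base_best
    avg_pct = 100 * (sum(runs[name]) / len(runs[name]) - base_avg) / base_avg
    print(name, round(best_pct, 1), round(avg_pct, 1))
# → 1d369c9e9 -4.1 -3.6
# → d8c3106bb 1.3 1.5
```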
On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> I have two more runs of the benchmark in progress so we will have 3 results
> for each of the test cases to confirm that the small regressions are
> repeatable.
Hi,

On 2023-05-19 15:44:09 -0700, MARK CALLAGHAN wrote:
> On Tue, May 9, 2023 at 10:36 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> > On Fri, May 5, 2023 at 10:01 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> >> I have two more runs of the benchmark in progress so we will have 3
> >> results for each of the test cases to confirm that the small regressions
> >> are repeatable.
>
> I repeated the benchmark a few times using a more recent PG16 build (git
> sha 08c45ae2) and have yet to see any significant changes. So that is good
> news. My testing scripts have been improved so I should be able to finish
> the next round of tests in less time.

With "yet to see any significant changes" do you mean that the runs are
comparable with earlier runs, showing the same regression? Or that the
regression vanished? Or ...?

Greetings,

Andres Freund
> With "yet to see any significant changes" do you mean that the runs are
> comparable with earlier runs, showing the same regression? Or that the
> regression vanished? Or ...?
The workload was in-memory on a small server (8 cores, 16G RAM) with 1 connection (no concurrency).
For some details on past benchmarks like this see: http://smalldatum.blogspot.com/2023/03/searching-for-performance-regressions.html
My focus is on changes >= 10%, so a value <= 0.90 or >= 1.10.
I used 3 builds of Postgres that I call def, o2_nofp and o3_nofp, and ran the benchmark once per build. The results for each build are similar,
so I only share the o2_nofp results here.
Good news that I have not fully explained ...
One of the microbenchmarks gets ~1.5X more transactions/second (TPS) in PG 16 prebeta vs 15.2 and 15.3 for a read-only transaction that does:
* 2 select statements that scans 10k rows from an index (these are Q1 and Q3 below and are slower in PG 16)
* 2 select statements that scans 10k rows from an index and does aggregation (these are Q2 and Q4 below are are a lot faster in PG 16)
The speedup for Q2 and Q4 is larger than the slowdown for Q1/Q3 so TPS is ~1.5X more for PG 16.
Query plans don't appear to have changed. I assume some code got slower and some got faster for the same plan.
Each of these microbenchmarks run a read-only transaction with 4 SQL statements. The statements are here:
---
Throughput relative < 1.0 means perf regressed, > 1.0 means perf improved
col-1 : PG 15.3 with the o2_nofp build
col-2 : PG 16 prebeta build on May 19 at git sha 1c006c0 with the o2_nofp build
col-1 col-2
0.99 1.03 hot-points_range=100
1.02 1.05 point-query.pre_range=100
1.06 1.10 point-query_range=100
0.97 1.01 points-covered-pk.pre_range=100
0.98 1.02 points-covered-pk_range=100
0.98 1.01 points-covered-si.pre_range=100
1.00 1.00 points-covered-si_range=100
1.00 1.01 points-notcovered-pk.pre_range=100
1.00 1.01 points-notcovered-pk_range=100
1.01 1.03 points-notcovered-si.pre_range=100
1.01 1.01 points-notcovered-si_range=100
1.00 0.99 random-points.pre_range=1000
1.00 1.02 random-points.pre_range=100
1.01 1.01 random-points.pre_range=10
1.01 1.00 random-points_range=1000
1.01 1.01 random-points_range=100
1.02 1.01 random-points_range=10
1.00 1.00 range-covered-pk.pre_range=100
1.00 1.00 range-covered-pk_range=100
1.00 0.99 range-covered-si.pre_range=100
1.00 0.99 range-covered-si_range=100
1.03 1.01 range-notcovered-pk.pre_range=100
1.02 1.00 range-notcovered-pk_range=100
1.01 1.01 range-notcovered-si.pre_range=100
1.01 1.01 range-notcovered-si_range=100
1.04 1.54 read-only.pre_range=10000 <<<<<<<<<<
1.00 1.00 read-only.pre_range=100
1.01 1.01 read-only.pre_range=10
1.03 1.45 read-only_range=10000 <<<<<<<<<<
1.01 1.01 read-only_range=100
1.04 1.00 read-only_range=10
1.00 0.99 scan_range=100
1.00 1.02 delete_range=100
1.01 1.02 insert_range=100
1.01 1.00 read-write_range=100
1.01 0.98 read-write_range=10
1.01 1.01 update-index_range=100
1.00 1.00 update-inlist_range=100
1.02 1.02 update-nonindex_range=100
1.03 1.03 update-one_range=100
1.02 1.02 update-zipf_range=100
1.03 1.03 write-only_range=10000
In each case there are two "Execution Time" entries. The top one is from PG 15.2 and the bottom one is from the PG 16 prebeta.
Summary:
* Queries that do a sort show the largest improvement in PG 16 (Q2, Q4)
* Queries that don't do a sort are slower in PG 16 (Q1, Q3)
Q1.10k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;
Execution Time: 4.222 ms
Execution Time: 6.243 ms
Q1.100k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;
Execution Time: 36.508 ms
Execution Time: 49.344 ms
Q2.10k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 order by c;
Execution Time: 38.224 ms
Execution Time: 15.700 ms
Q2.100k: explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000 order by c;
Execution Time: 392.380 ms
Execution Time: 219.022 ms
Q3.10k: explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;
Execution Time: 3.660 ms
Execution Time: 3.994 ms
Q3.100k: explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;
Execution Time: 35.917 ms
Execution Time: 39.055 ms
Q4.10k: explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
Execution Time: 29.998 ms
Execution Time: 18.877 ms
Q4.100k: explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
Execution Time: 29.272 ms
Execution Time: 18.265 ms
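As a rough cross-check of the ~1.5X TPS claim, one can sum the per-query execution times above for the 10k-row variants. This is only a sketch: it ignores client overhead and network round trips, so it should overstate the gain somewhat relative to measured TPS.

```python
# Sum the Q1-Q4 10k-row "Execution Time" values reported above and compare.
pg15_ms = [4.222, 38.224, 3.660, 29.998]   # Q1, Q2, Q3, Q4 on PG 15.2
pg16_ms = [6.243, 15.700, 3.994, 18.877]   # Q1, Q2, Q3, Q4 on PG 16 prebeta
speedup = sum(pg15_ms) / sum(pg16_ms)
print(f"transaction-time ratio ~{speedup:.2f}x")   # ~1.70x
```

A ~1.70x ratio on server-side execution time alone is in the same ballpark as the 1.45-1.54 relative throughput seen for read-only_range=10000 above.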
--- Q1.10k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..525.86 rows=8971 width=121) (actual time=0.061..3.676 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.034 ms
Execution Time: 4.222 ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..589.80 rows=10068 width=121) (actual time=0.094..5.456 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.063 ms
Execution Time: 6.243 ms
--- Q1.100k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..5206.44 rows=89700 width=121) (actual time=0.017..31.166 rows=100001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10100000))
Planning Time: 0.024 ms
Execution Time: 36.508 ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..5845.86 rows=100671 width=121) (actual time=0.029..42.285 rows=100001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10100000))
Planning Time: 0.061 ms
Execution Time: 49.344 ms
--- Q2.10k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 order by c;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1114.85..1137.28 rows=8971 width=121) (actual time=36.561..37.429 rows=10001 loops=1)
Sort Key: c
Sort Method: quicksort Memory: 2025kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..525.86 rows=8971 width=121) (actual time=0.022..3.776 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.059 ms
Execution Time: 38.224 ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=1259.19..1284.36 rows=10068 width=121) (actual time=14.419..15.042 rows=10001 loops=1)
Sort Key: c
Sort Method: quicksort Memory: 1713kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..589.80 rows=10068 width=121) (actual time=0.023..3.473 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.049 ms
Execution Time: 15.700 ms
--- Q2.100k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000 order by c;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=18413.03..18637.28 rows=89700 width=121) (actual time=300.717..385.193 rows=100001 loops=1)
Sort Key: c
Sort Method: external merge Disk: 12848kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..5206.44 rows=89700 width=121) (actual time=0.028..29.590 rows=100001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10100000))
Planning Time: 0.048 ms
Execution Time: 392.380 ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=20749.26..21000.94 rows=100671 width=121) (actual time=154.969..211.572 rows=100001 loops=1)
Sort Key: c
Sort Method: external merge Disk: 12240kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..5845.86 rows=100671 width=121) (actual time=0.026..34.278 rows=100001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10100000))
Planning Time: 0.034 ms
Execution Time: 219.022 ms
--- Q3.10k : explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=548.29..548.30 rows=1 width=8) (actual time=3.645..3.646 rows=1 loops=1)
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..525.86 rows=8971 width=4) (actual time=0.024..2.587 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.036 ms
Execution Time: 3.660 ms
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=614.97..614.98 rows=1 width=8) (actual time=3.980..3.980 rows=1 loops=1)
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..589.80 rows=10068 width=4) (actual time=0.024..2.993 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.038 ms
Execution Time: 3.994 ms
--- Q3.100k : explain analyze SELECT SUM(k) FROM sbtest1 WHERE id BETWEEN 10000000 AND 10100000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=5430.69..5430.70 rows=1 width=8) (actual time=35.901..35.902 rows=1 loops=1)
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..5206.44 rows=89700 width=4) (actual time=0.017..25.256 rows=100001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10100000))
Planning Time: 0.032 ms
Execution Time: 35.917 ms
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=6097.53..6097.55 rows=1 width=8) (actual time=39.034..39.035 rows=1 loops=1)
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..5845.86 rows=100671 width=4) (actual time=0.018..29.291 rows=100001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10100000))
Planning Time: 0.051 ms
Execution Time: 39.055 ms
--- Q4.10k : explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1114.85..1159.71 rows=8971 width=121) (actual time=26.335..29.435 rows=10001 loops=1)
-> Sort (cost=1114.85..1137.28 rows=8971 width=121) (actual time=26.333..27.085 rows=10001 loops=1)
Sort Key: c
Sort Method: quicksort Memory: 2025kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..525.86 rows=8971 width=121) (actual time=0.021..2.968 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.052 ms
Execution Time: 29.998 ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1259.19..1309.53 rows=10068 width=121) (actual time=14.203..18.318 rows=10001 loops=1)
-> Sort (cost=1259.19..1284.36 rows=10068 width=121) (actual time=14.200..14.978 rows=10001 loops=1)
Sort Key: c
Sort Method: quicksort Memory: 1713kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..589.80 rows=10068 width=121) (actual time=0.030..3.475 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.053 ms
Execution Time: 18.877 ms
--- Q4.100k : explain analyze SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1114.85..1159.71 rows=8971 width=121) (actual time=25.567..28.709 rows=10001 loops=1)
-> Sort (cost=1114.85..1137.28 rows=8971 width=121) (actual time=25.565..26.320 rows=10001 loops=1)
Sort Key: c
Sort Method: quicksort Memory: 2025kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..525.86 rows=8971 width=121) (actual time=0.025..2.926 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.052 ms
Execution Time: 29.272 ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1259.19..1309.53 rows=10068 width=121) (actual time=13.620..17.714 rows=10001 loops=1)
-> Sort (cost=1259.19..1284.36 rows=10068 width=121) (actual time=13.618..14.396 rows=10001 loops=1)
Sort Key: c
Sort Method: quicksort Memory: 1713kB
-> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..589.80 rows=10068 width=121) (actual time=0.024..3.478 rows=10001 loops=1)
Index Cond: ((id >= 10000000) AND (id <= 10010000))
Planning Time: 0.039 ms
Execution Time: 18.265 ms
* no regressions
* a few queries are >= 1.5 times faster, which makes the read-only transaction >= 1.5 times faster
http://smalldatum.blogspot.com/2023/05/postgres-16beta1-looks-good-vs-sysbench.html
On Tue, 23 May 2023 at 07:40, MARK CALLAGHAN <mdcallag@gmail.com> wrote:

(pg15)
> --- Q2.10k : explain analyze SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 order by c;
> QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------
> Sort (cost=1114.85..1137.28 rows=8971 width=121) (actual time=36.561..37.429 rows=10001 loops=1)
>   Sort Key: c
>   Sort Method: quicksort Memory: 2025kB
>   -> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..525.86 rows=8971 width=121) (actual time=0.022..3.776 rows=10001 loops=1)
>        Index Cond: ((id >= 10000000) AND (id <= 10010000))
> Planning Time: 0.059 ms
> Execution Time: 38.224 ms

(pg16 b1)
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------------
> Sort (cost=1259.19..1284.36 rows=10068 width=121) (actual time=14.419..15.042 rows=10001 loops=1)
>   Sort Key: c
>   Sort Method: quicksort Memory: 1713kB
>   -> Index Scan using sbtest1_pkey on sbtest1 (cost=0.44..589.80 rows=10068 width=121) (actual time=0.023..3.473 rows=10001 loops=1)
>        Index Cond: ((id >= 10000000) AND (id <= 10010000))
> Planning Time: 0.049 ms
> Execution Time: 15.700 ms

It looks like the improvements here are due to qsort being faster on v16.

To get an idea of the time taken to perform the actual qsort, you can't use the first row time vs last row time in the Sort node, as we must (obviously) have performed the sort before outputting the first row. I think you could get a decent idea of the time taken to perform the qsort by subtracting the actual time for the final Index Scan row from the actual time for the Sort's first row.

That's 36.561 - 3.776 = 32.785 ms for pg15's plan, and 14.419 - 3.473 = 10.946 ms for pg16 b1's.

c6e0fe1f2 might have helped improve some of that performance, but I suspect there must be something else as ~3x seems much more than I'd expect from reducing the memory overheads. Testing versions before and after that commit might give a better indication.

David
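David's estimate above can be sketched as a tiny helper. The inputs are the "actual time" values from the two Q2.10k plans earlier in the thread: the Sort node's first-row time and the child Index Scan's last-row time.

```python
# Estimate sort-only time per David's method: the Sort node's first-row
# "actual time" minus the child Index Scan's last-row "actual time".
def sort_only_ms(sort_first_row_ms: float, child_last_row_ms: float) -> float:
    return sort_first_row_ms - child_last_row_ms

pg15 = sort_only_ms(36.561, 3.776)   # 32.785 ms
pg16 = sort_only_ms(14.419, 3.473)   # 10.946 ms
print(f"pg15 sort ~{pg15:.3f} ms, pg16 sort ~{pg16:.3f} ms, speedup ~{pg15 / pg16:.1f}x")
```

This gives roughly a 3x speedup for the sort step alone, which is the gap David says is larger than memory-overhead savings would explain.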
On Sun, May 28, 2023 at 2:42 PM David Rowley <dgrowleyml@gmail.com> wrote:
> c6e0fe1f2 might have helped improve some of that performance, but I
> suspect there must be something else as ~3x seems much more than I'd
> expect from reducing the memory overheads. Testing versions before
> and after that commit might give a better indication.

I'm virtually certain that this is due to the change in default collation provider, from libc to ICU. Mostly due to the fact that ICU is capable of using abbreviated keys, and the system libc isn't (unless you go out of your way to define TRUST_STRXFRM when building Postgres).

Many individual test cases involving larger non-C collation text sorts showed similar improvements back when I worked on this. Offhand, I believe that 3x - 3.5x improvements in execution times were common with high entropy abbreviated keys on high cardinality input columns at that time (this was with glibc). Low cardinality inputs were more like 2.5x.

I believe that ICU is faster than glibc in general -- even with TRUST_STRXFRM enabled. But the TRUST_STRXFRM thing is bound to be the most important factor here, by far.

--
Peter Geoghegan
Do you want me to try PG 16 without ICU or PG 15 with ICU?
I can do that, but it will take a few days before the server is available.
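For readers unfamiliar with the technique Peter mentions, here is a toy sketch of the abbreviated-keys idea. This is hypothetical Python, not PostgreSQL code: the point is only that a cheap fixed-size prefix key resolves most comparisons, and the expensive collation-aware comparison runs only on ties.

```python
import functools

def full_compare(a: str, b: str) -> int:
    # stand-in for an expensive collation-aware comparison (e.g. strcoll)
    return (a > b) - (a < b)

def abbrev(s: str) -> bytes:
    # stand-in for the small fixed-size abbreviated key PostgreSQL derives
    # from ICU sort keys (or strxfrm with libc)
    return s.encode("utf-8")[:8]

def abbrev_compare(a: str, b: str) -> int:
    ka, kb = abbrev(a), abbrev(b)
    if ka != kb:                      # resolved by the cheap prefix key
        return (ka > kb) - (ka < kb)
    return full_compare(a, b)         # tie: fall back to the full comparison

rows = ["banana", "apple", "apricot", "cherry"]
print(sorted(rows, key=functools.cmp_to_key(abbrev_compare)))
# ['apple', 'apricot', 'banana', 'cherry']
```

With high-cardinality, high-entropy keys (as in the sysbench `c` column) almost every comparison is settled by the prefix, which is consistent with the larger speedups Peter describes for high-cardinality inputs.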
Hi Mark,

On Tue, May 30, 2023 at 1:03 PM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> Do you want me to try PG 16 without ICU or PG 15 with ICU?
> I can do that, but it will take a few days before the server is available.

Sorry for the late reply. Most of the Postgres developers (myself included) are attending pgCon right now.

It would be nice to ascertain just how much of a boost we're getting from our use of ICU as far as sysbench goes. I'd appreciate having that information. We discussed the choice of ICU as default collation provider at yesterday's developer meeting:

https://wiki.postgresql.org/wiki/PgCon_2023_Developer_Meeting#High_level_thoughts_and_feedback_on_moving_toward_ICU_as_the_preferred_collation_provider
https://wiki.postgresql.org/wiki/StateOfICU

Just confirming my theory about abbreviated keys (without rerunning the benchmark) should be simple - perhaps that would be a useful place to start. You could just rerun the two individual queries of interest from an interactive psql session. There are even low-level debug messages available through the trace_sort GUC.

From a psql session you'd run something along the lines of "set client_min_messages=log; set trace_sort=on; $QUERY". You'll see lots of LOG messages with specific information about the use of abbreviated keys and the progress of each sort.

Thanks
--
Peter Geoghegan
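Peter's psql recipe, spelled out as a session fragment (a sketch only; `trace_sort` is a developer option, and the table and query are the ones from earlier in the thread):

```sql
-- enable sort tracing for this session only
SET client_min_messages = log;
SET trace_sort = on;

-- re-run one of the affected queries; the LOG output reports whether
-- abbreviated keys were used and how each sort progressed
SELECT c FROM sbtest1 WHERE id BETWEEN 10000000 AND 10010000 ORDER BY c;
```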
Hello David,

11.05.2023 16:00, Alexander Lakhin wrote:
> Yeah, I see. It's also interesting to me, which tests perform better after
> that commit. It takes several hours to run all tests, so I can't present
> results quickly, but I'll try to collect this information next week.

To my regret, I could not find such tests that week, so I decided to try later, after reviewing my benchmark running infrastructure. But for now, as the Postgres Pro company graciously shared the benchmarking infrastructure project [1], it's possible for anyone to confirm or deny my results. (You can also see which tests were performed and how.)

Having done one more round of comprehensive testing, I still couldn't find winning tests for commit 3c6fc5820 (but reassured that test s64da_tpcds.query87 loses a little (and also s64da_tpcds.query38)). So it seems to me that the tests I performed or their parameters are not representative enough for that improvement, unfortunately.

[1] https://github.com/postgrespro/pg-mark

Best regards,
Alexander