Discussion: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)
The following bug has been logged on the website:

Bug reference:      13493
Logged by:          Graeme Bell
Email address:      grb@skogoglandskap.no
PostgreSQL version: 9.3.9
Operating system:   Linux
Description:

Hi,

pl/pgsql doesn't scale properly on postgres 9.3/9.4 with multiple processors beyond 2-3 processors for me, regardless of the machine I use or the benchmark/project.

I discovered this painfully during a project here where we were running a small 'forest simulator' in pl/pgsql on different datasets simultaneously.

To highlight the problem, I've provided a benchmark that demonstrates two problems (on my machines).

1. The first problem is scaling when you have lots of pl/pgsql code running and doing real work.

2. The second problem is scaling when you have a table column as an input parameter.

The benchmark results & source code are here:

https://github.com/gbb/ppppt

Another set of benchmarks showing the same phenomena on PG9.3 and PG9.4 can be found here, under 'BENCHMARKS.md':

https://github.com/gbb/par_psql

I would be grateful if others could run the benchmark and confirm/disconfirm the result.

If confirmed, the result may be of special interest to e.g. the postgis & pgrouting communities.

This result is from a 16-core machine with 128GB of memory and a lot of random I/O capacity (there's no writing involved though other than a bit of WAL, and not much data, so this shouldn't matter).
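For illustration, a minimal sketch of the two kinds of functions being described (hypothetical examples with made-up names f_compute and f_column; the actual benchmark code is in the ppppt repository linked above):

-- Problem 1: a pl/pgsql function doing real work in a loop (illustrative sketch).
CREATE OR REPLACE FUNCTION f_compute() RETURNS bigint AS $$
DECLARE
    total bigint := 0;
BEGIN
    FOR i IN 1..10000 LOOP
        total := total + i;
    END LOOP;
    RETURN total;
END;
$$ LANGUAGE plpgsql;

-- Problem 2: a function taking a table column as its input parameter,
-- called once per row (assumes a table t(x int) exists).
CREATE OR REPLACE FUNCTION f_column(x int) RETURNS int AS $$
BEGIN
    RETURN x + 1;
END;
$$ LANGUAGE plpgsql;

-- Run concurrently from many client connections, e.g.:
-- SELECT sum(f_column(x)) FROM t;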
On July 8, 2015 12:27:39 AM GMT+02:00, grb@skogoglandskap.no wrote:
> pl/pgsql doesn't scale properly on postgres 9.3/9.4 with multiple
> processors beyond 2-3 processors for me, regardless of the machine
> I use or the benchmark/project.

That sounds odd.. Could you provide perf profiles with this running? Also have you reproduced on 9.5?

Do you have transparent huge pages disabled?

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
grb@skogoglandskap.no writes:
> pl/pgsql doesn't scale properly on postgres 9.3/9.4 with multiple processors
> beyond 2-3 processors for me, regardless of the machine I use or the
> benchmark/project.

> The benchmark results & source code are here:
> https://github.com/gbb/ppppt

First off, thanks for providing a concrete test case! It's always a lot easier to investigate when a problem can be reproduced locally.

Having said that ...

plpgsql is really designed as a glue language for SQL queries, not for heavy-duty computation, so these examples aren't exactly showing it at its best. It would be worth your while to consider using some other convenient programming language, perhaps plperl or plpython or plv8, if you want to do self-contained calculations on the server side.

But I think that the main problem you are seeing here is from snapshot acquisition contention. By default, plpgsql acquires a new snapshot for each statement inside a function, and that results in a lot of contention for the ProcArray if you're maxing out a multicore machine. Depending on what you're actually doing inside the function, you might well be able to mark it stable or even immutable, which would suppress the per-statement snapshot acquisitions. On my machine (admittedly only 8 cores), the scalability problems in this example pretty much vanish when I attach "stable" to the function definitions.

There is some discussion going on about improving the scalability of snapshot acquisition, but nothing will happen in that line before 9.6 at the earliest.

			regards, tom lane
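To make the suggestion concrete, a minimal sketch of attaching "stable" to a function definition (using the hypothetical f_compute sketch from above, not the benchmark's own code). STABLE declares that the function makes no database modifications and returns the same result for the same arguments within a single statement, which is what allows plpgsql to skip the per-statement snapshot acquisition described here:

CREATE OR REPLACE FUNCTION f_compute() RETURNS bigint AS $$
DECLARE
    total bigint := 0;
BEGIN
    FOR i IN 1..10000 LOOP
        total := total + i;
    END LOOP;
    RETURN total;
END;
$$ LANGUAGE plpgsql STABLE;   -- or IMMUTABLE if the result depends only on the arguments

-- An existing function can also be relabelled in place:
ALTER FUNCTION f_compute() STABLE;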
On the machine used for this set of benchmarks, THP is set to never. I've seen this on several machines with slight variations in their configuration.

I spent days poring over best practice guides from google and others when setting these up, and I try to stay up to date with discussion on pgsql-performance for kernel and postgresql.conf.

- Don't have any spare machines currently, or time, to test on pg9.5. Going into hospital soon and then to a conference. Wanted to get this published first.

- May not be able to provide perf data just now. Lots of work, little time, bad health. Will try if I can though!

Graeme Bell

> On 8 Jul 2015, at 02:16, Andres Freund <andres@anarazel.de> wrote:
>
> That sounds odd.. Could you provide perf profiles with this running? Also have you reproduced on 9.5?
>
> Do you have transparent huge pages disabled?
> On 8 Jul 2015, at 03:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> plpgsql is really designed as a glue language for SQL queries, not for
> heavy-duty computation, so these examples aren't exactly showing it at
> its best. It would be worth your while to consider using some other
> convenient programming language, perhaps plperl or plpython or plv8,
> if you want to do self-contained calculations on the server side.

I'm very interested in trying out plv8, but it is a problem for projects when you have a team of people who know plpgsql and a lot of plpgsql legacy code.

It would be interesting to know what happens when these functions are ported into those other languages.

> But I think that the main problem you are seeing here is from snapshot
> acquisition contention. By default, plpgsql acquires a new snapshot
> for each statement inside a function, and that results in a lot of
> contention for the ProcArray if you're maxing out a multicore machine.
> Depending on what you're actually doing inside the function, you might
> well be able to mark it stable or even immutable, which would suppress
> the per-statement snapshot acquisitions. On my machine (admittedly only
> 8 cores), the scalability problems in this example pretty much vanish
> when I attach "stable" to the function definitions.

Thanks for testing it on your own machine and looking into a possible cause.

Do you think that explains both problems (one occurs with table data; one occurs with work)?

Several of the functions that were killing me last year were dynamic SQL with some internal state, so declaring them harmless and predictable probably isn't possible. I will need to go back and check.

It sounds like marking up functions may help (at least as far as 8 cores), but there must be a lot of people out there running plpgsql they have ported over from oracle (postgis? pgrouting? not sure about their internals) etc., and various libraries where the functions aren't marked up and can't easily be marked up. A broader solution in the far-off future would be awesome, if it is ever possible. I am not in a position currently to provide it, I regret.

> There is some discussion going on about improving the scalability of
> snapshot acquisition, but nothing will happen in that line before 9.6
> at the earliest.

Thanks, this is interesting to know.

Thanks again for your time looking into this and for the stable/immutable tip.

Graeme.
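For contrast, a hypothetical sketch (made-up names) of the kind of dynamic-SQL function with side effects being described here: it writes to a table chosen at runtime, so it genuinely has to stay VOLATILE and keeps paying the per-statement snapshot cost.

-- Dynamic SQL with a side effect: cannot be marked STABLE or IMMUTABLE.
CREATE OR REPLACE FUNCTION log_step(tbl text, step int) RETURNS void AS $$
BEGIN
    EXECUTE format('INSERT INTO %I (step, at) VALUES ($1, now())', tbl)
        USING step;
END;
$$ LANGUAGE plpgsql VOLATILE;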
On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> There is some discussion going on about improving the scalability of
> snapshot acquisition, but nothing will happen in that line before 9.6
> at the earliest.

9.5 should be less bad at it than 9.4, at least if it's mostly read-only ProcArrayLock acquisitions, which sounds like it should be the case here.
On 2015-07-08 11:12:38 +0200, Andres Freund wrote:
> On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> > There is some discussion going on about improving the scalability of
> > snapshot acquisition, but nothing will happen in that line before 9.6
> > at the earliest.
>
> 9.5 should be less bad at it than 9.4, at least if it's mostly read-only
> ProcArrayLock acquisitions, which sounds like it should be the case here.

test 3:

master:
 1 clients: 3112.7
 2 clients: 6806.7
 4 clients: 13441.2
 8 clients: 15765.4
16 clients: 21102.2

9.4:
 1 clients: 2524.2
 2 clients: 5903.2
 4 clients: 11756.8
 8 clients: 14583.3
16 clients: 19309.2

So there's an interesting "dip" between 4 and 8 clients. A perf profile doesn't show any actual lock contention on master. Not that surprising, there shouldn't be any exclusive locks here.

One interesting thing in exactly such cases is to consider intel's turboboost. Disabling it (echo 0 > /sys/devices/system/cpu/cpufreq/boost) gives us these results:

test 3:

master:
 1 clients: 2926.6
 2 clients: 6634.3
 4 clients: 13905.2
 8 clients: 15718.9

so that's not it in this case.

comparing stats between the 4 and 8 client runs shows (removing boring data):

4 clients:
      90859.517328      task-clock (msec)        #    3.428 CPUs utilized
   109,655,985,749      stalled-cycles-frontend  #   54.27% frontend cycles idle     (27.79%)
    62,906,918,008      stalled-cycles-backend   #   31.14% backend cycles idle      (27.78%)
   219,063,494,214      instructions             #    1.08 insns per cycle
                                                 #    0.50 stalled cycles per insn   (33.32%)
    41,664,400,828      branches                 #  458.558 M/sec                    (33.32%)
       374,426,805      branch-misses            #    0.90% of all branches          (33.32%)
    62,504,845,665      L1-dcache-loads          #  687.928 M/sec                    (27.78%)
     1,224,842,848      L1-dcache-load-misses    #    1.96% of all L1-dcache hits    (27.81%)
       321,981,924      LLC-loads                #    3.544 M/sec                    (22.33%)
        23,219,438      LLC-load-misses          #    7.21% of all LL-cache hits     (5.52%)

      26.507528305 seconds time elapsed

8 clients:
     165168.247631      task-clock (msec)        #    6.824 CPUs utilized
   247,231,674,170      stalled-cycles-frontend  #   67.04% frontend cycles idle     (27.84%)
   101,354,900,788      stalled-cycles-backend   #   27.48% backend cycles idle      (27.83%)
   285,829,642,649      instructions             #    0.78 insns per cycle
                                                 #    0.86 stalled cycles per insn   (33.39%)
    54,503,992,461      branches                 #  329.991 M/sec                    (33.39%)
       761,911,056      branch-misses            #    1.40% of all branches          (33.38%)
    81,373,091,784      L1-dcache-loads          #  492.668 M/sec                    (27.74%)
     4,419,307,036      L1-dcache-load-misses    #    5.43% of all L1-dcache hits    (27.72%)
       510,940,577      LLC-loads                #    3.093 M/sec                    (21.86%)
        26,963,120      LLC-load-misses          #    5.28% of all LL-cache hits     (5.37%)

      24.205675255 seconds time elapsed

It's quite visible that all caches have considerably worse characteristics in the 8 clients case, and that "instructions per cycle" has gone down considerably. Presumably because more frontend cycles were idle, which in turn is probably caused by the higher cache miss ratios. L1 going from 1.96% misses to 5.43% misses is quite a drastic difference.

Now, looking at where cache misses happen:

4 clients:
+    7.64%  postgres  postgres       [.] AllocSetAlloc
+    3.90%  postgres  postgres       [.] LWLockAcquire
+    3.40%  postgres  plpgsql.so     [.] plpgsql_exec_function
+    2.64%  postgres  postgres       [.] GetCachedPlan
+    2.20%  postgres  postgres       [.] slot_deform_tuple
+    2.16%  postgres  libc-2.19.so   [.] _int_free
+    2.08%  postgres  libc-2.19.so   [.] __memcpy_sse2_unaligned

8 clients:
+    6.34%  postgres  postgres       [.] AllocSetAlloc
+    4.89%  postgres  plpgsql.so     [.] plpgsql_exec_function
+    2.63%  postgres  libc-2.19.so   [.] _int_free
+    2.60%  postgres  libc-2.19.so   [.] __memcpy_sse2_unaligned
+    2.50%  postgres  postgres       [.] ExecLimit
+    2.47%  postgres  postgres       [.] LWLockAcquire
+    2.18%  postgres  postgres       [.] ExecProject

So the characteristics interestingly change quite a bit between 4/8. I reproduced this a number of times to make sure it's not just a temporary issue.

The memcpy rising is mainly:
+   80.27% SearchCatCache
+   10.56% appendBinaryStringInfo
+    6.51% socket_putmessage
+    0.78% pgstat_report_activity

So at least on the hardware available to me right now this isn't caused by actual lock contention. Hm.

I've a patch addressing the SearchCatCache memcpy() cost somewhere...

Andres
On 2015-07-08 14:55:12 +0200, Andres Freund wrote:
> comparing stats between the 4 and 8 client runs shows (removing boring data):
> 4 clients:
>    109,655,985,749      stalled-cycles-frontend  #   54.27% frontend cycles idle  (27.79%)
>     41,664,400,828      branches                 #  458.558 M/sec                 (33.32%)
>        374,426,805      branch-misses            #    0.90% of all branches       (33.32%)
> 8 clients:
>    247,231,674,170      stalled-cycles-frontend  #   67.04% frontend cycles idle  (27.84%)
>     54,503,992,461      branches                 #  329.991 M/sec                 (33.39%)
>        761,911,056      branch-misses            #    1.40% of all branches       (33.38%)

looking at stalled-cycles-frontend, and branch-misses shows:

-e branch-misses

4:
+    7.34%  postgres  plpgsql.so          [.] plpgsql_exec_function
+    6.33%  postgres  postgres            [.] standard_ExecutorRun
+    5.49%  postgres  postgres            [.] SPI_push
+    4.24%  postgres  libc-2.19.so        [.] __memcpy_sse2_unaligned
+    3.43%  postgres  postgres            [.] ReleaseCachedPlan
+    2.72%  postgres  plpgsql.so          [.] exec_stmt
+    2.07%  postgres  postgres            [.] SPI_pop
+    1.94%  postgres  postgres            [.] ExecLimit
+    1.94%  postgres  libc-2.19.so        [.] __strcpy_sse2_unaligned
+    1.59%  postgres  libc-2.19.so        [.] _int_malloc
+    1.46%  postgres  postgres            [.] LWLockRelease
+    1.33%  postgres  postgres            [.] AllocSetAlloc
+    1.17%  postgres  libc-2.19.so        [.] _int_free
+    1.08%  postgres  libc-2.19.so        [.] memset
+    1.00%  postgres  postgres            [.] hash_search_with_hash_value
+    0.99%  postgres  postgres            [.] GetSnapshotData

8:
+   10.66%  pgbench   libpq.so.5.9        [.] PQsocket
+    8.40%  pgbench   libpthread-2.19.so  [.] __libc_recv
+    5.03%  postgres  plpgsql.so          [.] plpgsql_exec_function
+    3.00%  postgres  postgres            [.] AllocSetAlloc
+    2.86%  postgres  postgres            [.] standard_ExecutorRun
+    2.54%  postgres  plpgsql.so          [.] exec_stmt
+    2.45%  postgres  libc-2.19.so        [.] __memcpy_sse2_unaligned
+    2.27%  postgres  postgres            [.] GetSnapshotData
+    2.22%  postgres  postgres            [.] ReleaseCachedPlan
+    2.14%  postgres  postgres            [.] OverrideSearchPathMatchesCurrent
+    2.06%  postgres  postgres            [.] SPI_push
+    2.02%  postgres  postgres            [.] SPI_pop
+    1.92%  postgres  libc-2.19.so        [.] _int_malloc
+    1.86%  postgres  postgres            [.] ExecLimit
+    1.77%  postgres  postgres            [.] SPI_connect
+    1.74%  postgres  libc-2.19.so        [.] __strcpy_sse2_unaligned

It's interesting to see that the limited size of the trace buffer leads to previously perfectly predicted functions like PQsocket regularly causing cache misses now... Interesting to see how GetSnapshotData() rises in comparison. OverrideSearchPathMatchesCurrent() is also curious, but perhaps not so surprising considering it's chasing down a linked list - almost impossible to predict.

-e stalled-cycles-frontend:

4:
+   21.50%  swapper   [kernel.vmlinux]    [k] intel_idle
+    4.21%  pgbench   libpthread-2.19.so  [.] __libc_recv
+    2.68%  postgres  postgres            [.] LWLockAcquire
+    2.35%  pgbench   [kernel.vmlinux]    [k] fput
+    2.34%  pgbench   [kernel.vmlinux]    [k] unix_stream_recvmsg
+    2.03%  pgbench   [kernel.vmlinux]    [k] system_call
+    2.02%  pgbench   pgbench             [.] threadRun
+    1.82%  pgbench   [kernel.vmlinux]    [k] system_call_after_swapgs
+    1.73%  postgres  plpgsql.so          [.] plpgsql_exec_function
+    1.56%  postgres  postgres            [.] LWLockRelease
+    1.33%  pgbench   [kernel.vmlinux]    [k] sys_recvfrom
+    1.24%  postgres  libc-2.19.so        [.] _int_malloc
+    1.16%  pgbench   [vdso]              [.] 0x00000000000008c9
+    1.06%  pgbench   [kernel.vmlinux]    [k] __fget
+    1.04%  postgres  libc-2.19.so        [.] _int_free
+    1.02%  pgbench   libpthread-2.19.so  [.] __pthread_enable_asynccancel

8:
+    8.41%  swapper   [kernel.vmlinux]    [k] intel_idle
+    4.35%  pgbench   pgbench             [.] threadRun
+    3.56%  pgbench   libpthread-2.19.so  [.] __libc_recv
+    2.58%  pgbench   [kernel.vmlinux]    [k] unix_stream_recvmsg
+    1.99%  postgres  plpgsql.so          [.] plpgsql_exec_function
+    1.98%  postgres  postgres            [.] AllocSetAlloc
+    1.94%  pgbench   [kernel.vmlinux]    [k] sys_recvfrom
+    1.72%  postgres  postgres            [.] LWLockAcquire
+    1.66%  pgbench   pgbench             [.] doCustom
+    1.59%  pgbench   [kernel.vmlinux]    [k] system_call_after_swapgs
+    1.58%  pgbench   [kernel.vmlinux]    [k] system_call
+    1.51%  pgbench   [vdso]              [.] __vdso_gettimeofday
+    1.50%  pgbench   [kernel.vmlinux]    [k] __fget
+    1.21%  postgres  libc-2.19.so        [.] memset
+    1.11%  postgres  libc-2.19.so        [.] _int_malloc
+    1.08%  postgres  postgres            [.] GetSnapshotData

(intel_idle is executed on a cpu when it's idle. Not surprising that it shows up prominently, especially when not all cores are busy.)

It's interesting to see how the locking functions are less prominent in the -c8 case, and how the overhead of allocation and plpgsql_exec_function rises.
Andres Freund <andres@anarazel.de> writes:
> So there's an interesting "dip" between 4 and 8 clients. A perf profile
> doesn't show any actual lock contention on master. Not that surprising,
> there shouldn't be any exclusive locks here.

What size of machine are you testing on?

I ran Graeme's tests on a 2-socket, 4-core-per-socket, no-hyperthreading machine, which has separate NUMA zones for the 2 sockets. What I saw (after fixing the "stable" issue) was that all the 8-client and 16-client cases were about 8x faster than 1-client, and 2-client was generally within hailing distance of 2x faster, but 4-client was often noticeably worse than the expected 4x faster.

I figured this was likely some weird NUMA effect, possibly compounded by brutally stupid scheduling on the part of my kernel. But I didn't have time to look closer.

You might be seeing the same kind of effect, or something different. It's hard to tell without knowing more about your machine.

			regards, tom lane
On 2015-07-08 09:56:51 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So there's an interesting "dip" between 4 and 8 clients. A perf profile
> > doesn't show any actual lock contention on master. Not that surprising,
> > there shouldn't be any exclusive locks here.
>
> What size of machine are you testing on?

2 x E5520 (=> 2 sockets x 4 cores, 8 threads each); numa.

(note that I intentionally did not fix the volatility of the function)

> I ran Graeme's tests on a 2-socket, 4-core-per-socket, no-hyperthreading
> machine, which has separate NUMA zones for the 2 sockets. What I saw
> (after fixing the "stable" issue) was that all the 8-client and 16-client
> cases were about 8x faster than 1-client, and 2-client was generally
> within hailing distance of 2x faster, but 4-client was often noticeably
> worse than the expected 4x faster.
>
> I figured this was likely some weird NUMA effect, possibly compounded
> by brutally stupid scheduling on the part of my kernel. But I didn't
> have time to look closer.
>
> You might be seeing the same kind of effect, or something different.
> It's hard to tell without knowing more about your machine.

I think it's likely to be some scheduler effect. The number of cpu migrations between 4 and 8 is very different:

4:
            64,599      context-switches          #    0.003 M/sec    (100.00%)
               172      cpu-migrations            #    0.007 K/sec    (100.00%)
               537      page-faults               #    0.023 K/sec
8:
           381,383      context-switches          #    0.002 M/sec    (100.00%)
             1,279      cpu-migrations            #    0.008 K/sec    (100.00%)
             3,869      page-faults               #    0.024 K/sec
16:
           514,426      context-switches          #    0.003 M/sec    (100.00%)
             1,166      cpu-migrations            #    0.007 K/sec    (100.00%)
             6,308      page-faults               #    0.039 K/sec

There's a pretty large increase in the number of migrations between 4 and 8, but none between 8 and 16.

My guess is that the kernel tries to move processes around to idle nodes too aggressively.

second-by-second pgbench is quite interesting:
progress: 1.0 s, 22915.3 tps, lat 0.346 ms stddev 0.078
progress: 2.0 s, 15596.8 tps, lat 0.512 ms stddev 0.185
progress: 3.0 s, 15519.2 tps, lat 0.514 ms stddev 0.499
progress: 4.0 s, 15535.7 tps, lat 0.512 ms stddev 0.306
progress: 5.0 s, 15494.3 tps, lat 0.515 ms stddev 0.162

so at -j8 the first second is routinely much faster than the later ones.

Comparing perf stat pgbench -j8 -T 1 and -T 8:
-T 1:    46 cpu-migrations
-T 8:   534 cpu-migrations

so indeed the number of migrations rises noticeably after the first second...
By the way,

It may be worth increasing the length of the for loop in the pl/pgsql by a factor of 100-1000x and seeing if you get the same kind of behaviour. I intentionally left it very low (10000 iterations of nothing is not that much work) to ensure people would see a non-zero number there for the TPS the first time they ran the bench. However, in practice the functions that have the worst behaviour tend to be longer-running ones.

Graeme.

> On 8 Jul 2015, at 16:22, Andres Freund <andres@anarazel.de> wrote:
>
> I think it's likely to be some scheduler effect. The number of cpu
> migrations between 4 and 8 is very different: [...]
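A hypothetical way to scale the work per call without editing the function body each time is to parameterise the iteration count, as sketched below (illustrative names only; the real benchmark functions are in the ppppt repository):

-- Variant of the earlier f_compute sketch with the iteration count as a
-- parameter, so the same function can be made to run 100-1000x longer.
CREATE OR REPLACE FUNCTION f_compute(iterations int DEFAULT 10000)
RETURNS bigint AS $$
DECLARE
    total bigint := 0;
BEGIN
    FOR i IN 1..iterations LOOP
        total := total + i;
    END LOOP;
    RETURN total;
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT f_compute(10000 * 100);   -- 100x more work per call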
apologies, I'm posting in a hurry here and forgot to trim the fat from the quoted text of the last message.