Discussion: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)


BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
grb@skogoglandskap.no
Date:
The following bug has been logged on the website:

Bug reference:      13493
Logged by:          Graeme Bell
Email address:      grb@skogoglandskap.no
PostgreSQL version: 9.3.9
Operating system:   Linux
Description:

Hi,

pl/pgsql doesn't scale properly on postgres 9.3/9.4 beyond 2-3 processors
for me, regardless of the machine I use or the benchmark/project.

I discovered this painfully during a project here where we were running a
small 'forest simulator' in pl/pgsql on different datasets simultaneously.

To highlight the problem, I've provided a benchmark that demonstrates two
problems (on my machines).

1. The first problem is scaling when you have lots of pl/pgsql code running
and doing real work.

2. The second problem is scaling when you have a table column as an input
parameter.
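
To give a feel for the shape of both (a hypothetical sketch only; the real
benchmark code is in the repo below):

    -- Problem 1: self-contained pl/pgsql computation (sketch).
    CREATE OR REPLACE FUNCTION burn_cycles() RETURNS int AS $$
    DECLARE
        s int := 0;
    BEGIN
        FOR i IN 1..10000 LOOP
            s := s + 1;
        END LOOP;
        RETURN s;
    END
    $$ LANGUAGE plpgsql;

    -- Problem 2: the same work, but parameterised by a table column,
    -- e.g. SELECT burn_cycles_n(v) FROM test_data;  (names hypothetical)
    CREATE OR REPLACE FUNCTION burn_cycles_n(n int) RETURNS int AS $$
    DECLARE
        s int := 0;
    BEGIN
        FOR i IN 1..n LOOP
            s := s + 1;
        END LOOP;
        RETURN s;
    END
    $$ LANGUAGE plpgsql;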

The benchmark results & source code are here:

https://github.com/gbb/ppppt

Another set of benchmarks showing the same phenomena on PG9.3 and PG9.4 can
be found here, under 'BENCHMARKS.md':

https://github.com/gbb/par_psql

I would be grateful if others could run the benchmark and confirm/disconfirm
the result.

If confirmed, the result may be of special interest to e.g. the postgis &
pgrouting communities.

This result is from a 16-core machine with 128GB of memory and plenty of
random I/O capacity (though there's no writing involved other than a bit of
WAL, and not much data, so this shouldn't matter).

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Andres Freund
Date:
On July 8, 2015 12:27:39 AM GMT+02:00, grb@skogoglandskap.no wrote:
>The following bug has been logged on the website:
>
>Bug reference:      13493
>
>[...]
>
>pl/pgsql doesn't scale properly on postgres 9.3/9.4 beyond 2-3 processors
>for me, regardless of the machine I use or the benchmark/project.
>
>[...]

That sounds odd. Could you provide perf profiles with this running? Also, have you reproduced it on 9.5?

Do you have transparent huge pages disabled?
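
I.e. something along these lines (a sketch; exact paths and flags vary by
distro and kernel):

    # Check transparent huge page status:
    cat /sys/kernel/mm/transparent_hugepage/enabled

    # System-wide profile with call graphs while the benchmark runs:
    perf record -a -g -- sleep 30
    perf report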

---
Please excuse brevity and formatting - I am writing this on my mobile phone.

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Tom Lane
Date:
grb@skogoglandskap.no writes:
> pl/pgsql doesn't scale properly on postgres 9.3/9.4 beyond 2-3 processors
> for me, regardless of the machine I use or the benchmark/project.

> The benchmark results & source code are here:
> https://github.com/gbb/ppppt

First off, thanks for providing a concrete test case!  It's always a lot
easier to investigate when a problem can be reproduced locally.

Having said that ...

plpgsql is really designed as a glue language for SQL queries, not for
heavy-duty computation, so these examples aren't exactly showing it at
its best.  It would be worth your while to consider using some other
convenient programming language, perhaps plperl or plpython or plv8,
if you want to do self-contained calculations on the server side.

But I think that the main problem you are seeing here is from snapshot
acquisition contention.  By default, plpgsql acquires a new snapshot
for each statement inside a function, and that results in a lot of
contention for the ProcArray if you're maxing out a multicore machine.
Depending on what you're actually doing inside the function, you might
well be able to mark it stable or even immutable, which would suppress
the per-statement snapshot acquisitions.  On my machine (admittedly only
8 cores), the scalability problems in this example pretty much vanish
when I attach "stable" to the function definitions.
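
Concretely, that's just the volatility marker on the function, e.g.
(function name here is hypothetical; use your own):

    -- Suppress per-statement snapshot acquisition inside the function:
    ALTER FUNCTION burn_cycles() STABLE;
    -- or declare it that way up front:
    --   CREATE FUNCTION ... LANGUAGE plpgsql STABLE;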

There is some discussion going on about improving the scalability of
snapshot acquisition, but nothing will happen in that line before 9.6
at the earliest.

            regards, tom lane

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Graeme
Date:
On the machine used for this set of benchmarks, THP is set to never. I've
seen this on several machines with slight variations in their
configuration.

I spent days poring over best practice guides from google and others when
setting these up, and I try to stay up to date with discussion on
pgsql-performance for kernel and postgresql.conf.

- Don't have any spare machines currently, or time, to test on pg9.5.
Going into hospital soon and then to a conference. Wanted to get this
published first.
- May not be able to provide perf data just now. Lots of work, little
time, bad health. Will try if I can though!

Graeme Bell


> On 8 Jul 2015, at 02:16, Andres Freund <andres@anarazel.de> wrote:
>
> [...]
>
> That sounds odd. Could you provide perf profiles with this running?
> Also, have you reproduced it on 9.5?
>
> Do you have transparent huge pages disabled?

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Graeme
Date:
> On 8 Jul 2015, at 03:13, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> grb@skogoglandskap.no writes:
>> pl/pgsql doesn't scale properly on postgres 9.3/9.4 beyond 2-3 processors
>> for me, regardless of the machine I use or the benchmark/project.
>
>> The benchmark results & source code are here:
>> https://github.com/gbb/ppppt
>
> First off, thanks for providing a concrete test case!  It's always a lot
> easier to investigate when a problem can be reproduced locally.
>
> Having said that ...
>
> plpgsql is really designed as a glue language for SQL queries, not for
> heavy-duty computation, so these examples aren't exactly showing it at
> its best.  It would be worth your while to consider using some other
> convenient programming language, perhaps plperl or plpython or plv8,
> if you want to do self-contained calculations on the server side.

I'm very interested in trying out plv8, but it is a problem for projects
when you have a team of people who know plpgsql and a lot of plpgsql
legacy code.

It would be interesting to know what happens when these functions are
ported into those other languages.

> But I think that the main problem you are seeing here is from snapshot
> acquisition contention.  By default, plpgsql acquires a new snapshot
> for each statement inside a function, and that results in a lot of
> contention for the ProcArray if you're maxing out a multicore machine.
> Depending on what you're actually doing inside the function, you might
> well be able to mark it stable or even immutable, which would suppress
> the per-statement snapshot acquisitions.  On my machine (admittedly only
> 8 cores), the scalability problems in this example pretty much vanish
> when I attach "stable" to the function definitions.

Thanks for testing it on your own machine and looking into a possible
cause. Do you think that explains both problems (one occurs with table
data; one occurs with work)?

Several of the functions that were killing me last year were dynamic SQL
with some internal state, so declaring them harmless and predictable
probably isn't possible. I will need to go back and check.
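Roughly this sort of shape (a hypothetical sketch, not our actual code):

    -- Dynamic SQL that mutates evolving state: this has to stay VOLATILE,
    -- since it modifies data and must see mid-transaction changes
    -- (table and column names hypothetical):
    CREATE OR REPLACE FUNCTION step_simulation(tbl text) RETURNS void AS $$
    BEGIN
        EXECUTE format('UPDATE %I SET age = age + 1 WHERE alive', tbl);
    END
    $$ LANGUAGE plpgsql VOLATILE;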

It sounds like marking up functions may help (at least as far as 8 cores),
but there must be a lot of people out there running plpgsql they have
ported over from oracle (postgis? pgrouting? not sure about their
internals), etc., and various libraries where the functions aren't marked
up and can't easily be marked up. A broader solution in the far-off future
would be awesome, if it is ever possible. I regret I'm not currently in a
position to provide it.

> There is some discussion going on about improving the scalability of
> snapshot acquisition, but nothing will happen in that line before 9.6
> at the earliest.

Thanks, this is interesting to know.

Thanks again for your time looking into this and for the =
stable/immutable tip.

Graeme.


Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Andres Freund
Date:
On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> There is some discussion going on about improving the scalability of
> snapshot acquisition, but nothing will happen in that line before 9.6
> at the earliest.

9.5 should be less bad at it than 9.4, at least if it's mostly read-only
ProcArrayLock acquisitions which sounds like it should be the case here.

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Andres Freund
Date:
On 2015-07-08 11:12:38 +0200, Andres Freund wrote:
> On 2015-07-07 21:13:04 -0400, Tom Lane wrote:
> > There is some discussion going on about improving the scalability of
> > snapshot acquisition, but nothing will happen in that line before 9.6
> > at the earliest.
>
> 9.5 should be less bad at it than 9.4, at least if it's mostly read-only
> ProcArrayLock acquisitions which sounds like it should be the case here.

test 3:
master:
1 clients: 3112.7
2 clients: 6806.7
4 clients: 13441.2
8 clients: 15765.4
16 clients: 21102.2

9.4:
1 clients: 2524.2
2 clients: 5903.2
4 clients: 11756.8
8 clients: 14583.3
16 clients: 19309.2

So there's an interesting "dip" between 4 and 8 clients. A perf profile
doesn't show any actual lock contention on master. Not that surprising,
there shouldn't be any exclusive locks here.

One interesting thing to consider in exactly such cases is Intel's turbo
boost. Disabling it (echo 0 >
/sys/devices/system/cpu/cpufreq/boost) gives us these results:
test 3:
master:
1 clients: 2926.6
2 clients: 6634.3
4 clients: 13905.2
8 clients: 15718.9

so that's not it in this case.

comparing stats between the 4 and 8 client runs shows (removing boring data):

4 clients:
      90859.517328      task-clock (msec)         #    3.428 CPUs utilized
   109,655,985,749      stalled-cycles-frontend   #   54.27% frontend cycles idle     (27.79%)
    62,906,918,008      stalled-cycles-backend    #   31.14% backend  cycles idle     (27.78%)
   219,063,494,214      instructions              #    1.08  insns per cycle
                                                  #    0.50  stalled cycles per insn  (33.32%)
    41,664,400,828      branches                  #  458.558 M/sec                    (33.32%)
       374,426,805      branch-misses             #    0.90% of all branches          (33.32%)
    62,504,845,665      L1-dcache-loads           #  687.928 M/sec                    (27.78%)
     1,224,842,848      L1-dcache-load-misses     #    1.96% of all L1-dcache hits    (27.81%)
       321,981,924      LLC-loads                 #    3.544 M/sec                    (22.33%)
        23,219,438      LLC-load-misses           #    7.21% of all LL-cache hits     (5.52%)

      26.507528305 seconds time elapsed

8 clients:
     165168.247631      task-clock (msec)         #    6.824 CPUs utilized
   247,231,674,170      stalled-cycles-frontend   #   67.04% frontend cycles idle     (27.84%)
   101,354,900,788      stalled-cycles-backend    #   27.48% backend  cycles idle     (27.83%)
   285,829,642,649      instructions              #    0.78  insns per cycle
                                                  #    0.86  stalled cycles per insn  (33.39%)
    54,503,992,461      branches                  #  329.991 M/sec                    (33.39%)
       761,911,056      branch-misses             #    1.40% of all branches          (33.38%)
    81,373,091,784      L1-dcache-loads           #  492.668 M/sec                    (27.74%)
     4,419,307,036      L1-dcache-load-misses     #    5.43% of all L1-dcache hits    (27.72%)
       510,940,577      LLC-loads                 #    3.093 M/sec                    (21.86%)
        26,963,120      LLC-load-misses           #    5.28% of all LL-cache hits     (5.37%)

      24.205675255 seconds time elapsed


It's quite visible that all caches have considerably worse characteristics
in the 8-client case, and that "instructions per cycle" has gone down
considerably, presumably because more frontend cycles were idle, which in
turn is probably caused by the higher cache miss ratios. L1 going from
1.96% misses to 5.43% misses is quite a drastic difference.
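
(For reference, counters like the above come from something along these
lines; exact invocation assumed, script name hypothetical:)

    # -d adds the L1-dcache/LLC events to perf's default counter set:
    perf stat -d -- pgbench -n -c 8 -j 8 -T 25 -f test3.sql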

Now, looking at where cache misses happen:
4 clients:
+    7.64%  postgres         postgres                       [.] AllocSetAlloc
+    3.90%  postgres         postgres                       [.] LWLockAcquire
+    3.40%  postgres         plpgsql.so                     [.] plpgsql_exec_function
+    2.64%  postgres         postgres                       [.] GetCachedPlan
+    2.20%  postgres         postgres                       [.] slot_deform_tuple
+    2.16%  postgres         libc-2.19.so                   [.] _int_free
+    2.08%  postgres         libc-2.19.so                   [.] __memcpy_sse2_unaligned

8 clients:
+    6.34%  postgres       postgres                      [.] AllocSetAlloc
+    4.89%  postgres       plpgsql.so                    [.] plpgsql_exec_function
+    2.63%  postgres       libc-2.19.so                  [.] _int_free
+    2.60%  postgres       libc-2.19.so                  [.] __memcpy_sse2_unaligned
+    2.50%  postgres       postgres                      [.] ExecLimit
+    2.47%  postgres       postgres                      [.] LWLockAcquire
+    2.18%  postgres       postgres                      [.] ExecProject

So the characteristics interestingly change quite a bit between 4/8. I
reproduced this a number of times to make sure it's not just a temporary
issue.

The rise in memcpy is mainly:
      + 80.27% SearchCatCache
      + 10.56% appendBinaryStringInfo
      + 6.51% socket_putmessage
      + 0.78% pgstat_report_activity

So at least on the hardware available to me right now this isn't caused
by actual lock contention.


Hm. I've a patch addressing the SearchCatCache memcpy() cost
somewhere...

Andres

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Andres Freund
Date:
On 2015-07-08 14:55:12 +0200, Andres Freund wrote:
> comparing stats between the 4 and 8 client runs shows (removing boring data):

> 4 clients:
>    109,655,985,749      stalled-cycles-frontend   #   54.27% frontend cycles idle     (27.79%)
>     41,664,400,828      branches                  #  458.558 M/sec                    (33.32%)
>        374,426,805      branch-misses             #    0.90% of all branches          (33.32%)

> 8 clients:
>    247,231,674,170      stalled-cycles-frontend   #   67.04% frontend cycles idle     (27.84%)
>     54,503,992,461      branches                  #  329.991 M/sec                    (33.39%)
>        761,911,056      branch-misses             #    1.40% of all branches          (33.38%)

Looking at stalled-cycles-frontend and branch-misses shows:

-e branch-misses
4:
+    7.34%  postgres         plpgsql.so                     [.] plpgsql_exec_function
+    6.33%  postgres         postgres                       [.] standard_ExecutorRun
+    5.49%  postgres         postgres                       [.] SPI_push
+    4.24%  postgres         libc-2.19.so                   [.] __memcpy_sse2_unaligned
+    3.43%  postgres         postgres                       [.] ReleaseCachedPlan
+    2.72%  postgres         plpgsql.so                     [.] exec_stmt
+    2.07%  postgres         postgres                       [.] SPI_pop
+    1.94%  postgres         postgres                       [.] ExecLimit
+    1.94%  postgres         libc-2.19.so                   [.] __strcpy_sse2_unaligned
+    1.59%  postgres         libc-2.19.so                   [.] _int_malloc
+    1.46%  postgres         postgres                       [.] LWLockRelease
+    1.33%  postgres         postgres                       [.] AllocSetAlloc
+    1.17%  postgres         libc-2.19.so                   [.] _int_free
+    1.08%  postgres         libc-2.19.so                   [.] memset
+    1.00%  postgres         postgres                       [.] hash_search_with_hash_value
+    0.99%  postgres         postgres                       [.] GetSnapshotData

8:

+   10.66%  pgbench          libpq.so.5.9                   [.] PQsocket
+    8.40%  pgbench          libpthread-2.19.so             [.] __libc_recv
+    5.03%  postgres         plpgsql.so                     [.] plpgsql_exec_function
+    3.00%  postgres         postgres                       [.] AllocSetAlloc
+    2.86%  postgres         postgres                       [.] standard_ExecutorRun
+    2.54%  postgres         plpgsql.so                     [.] exec_stmt
+    2.45%  postgres         libc-2.19.so                   [.] __memcpy_sse2_unaligned
+    2.27%  postgres         postgres                       [.] GetSnapshotData
+    2.22%  postgres         postgres                       [.] ReleaseCachedPlan
+    2.14%  postgres         postgres                       [.] OverrideSearchPathMatchesCurrent
+    2.06%  postgres         postgres                       [.] SPI_push
+    2.02%  postgres         postgres                       [.] SPI_pop
+    1.92%  postgres         libc-2.19.so                   [.] _int_malloc
+    1.86%  postgres         postgres                       [.] ExecLimit
+    1.77%  postgres         postgres                       [.] SPI_connect
+    1.74%  postgres         libc-2.19.so                   [.] __strcpy_sse2_unaligned

It's interesting to see that the limited size of the trace buffer leads
to previously perfectly predicted functions like PQsocket now regularly
causing branch misses... It's also interesting to see how
GetSnapshotData() rises in comparison.

OverrideSearchPathMatchesCurrent() is also curious, but perhaps not so
surprising considering it's chasing down a linked list - almost
impossible to predict.


-e stalled-cycles-frontend:
4:
+   21.50%  swapper          [kernel.vmlinux]               [k] intel_idle
+    4.21%  pgbench          libpthread-2.19.so             [.] __libc_recv
+    2.68%  postgres         postgres                       [.] LWLockAcquire
+    2.35%  pgbench          [kernel.vmlinux]               [k] fput
+    2.34%  pgbench          [kernel.vmlinux]               [k] unix_stream_recvmsg
+    2.03%  pgbench          [kernel.vmlinux]               [k] system_call
+    2.02%  pgbench          pgbench                        [.] threadRun
+    1.82%  pgbench          [kernel.vmlinux]               [k] system_call_after_swapgs
+    1.73%  postgres         plpgsql.so                     [.] plpgsql_exec_function
+    1.56%  postgres         postgres                       [.] LWLockRelease
+    1.33%  pgbench          [kernel.vmlinux]               [k] sys_recvfrom
+    1.24%  postgres         libc-2.19.so                   [.] _int_malloc
+    1.16%  pgbench          [vdso]                         [.] 0x00000000000008c9
+    1.06%  pgbench          [kernel.vmlinux]               [k] __fget
+    1.04%  postgres         libc-2.19.so                   [.] _int_free
+    1.02%  pgbench          libpthread-2.19.so             [.] __pthread_enable_asynccancel

8:

+    8.41%  swapper          [kernel.vmlinux]               [k] intel_idle
+    4.35%  pgbench          pgbench                        [.] threadRun
+    3.56%  pgbench          libpthread-2.19.so             [.] __libc_recv
+    2.58%  pgbench          [kernel.vmlinux]               [k] unix_stream_recvmsg
+    1.99%  postgres         plpgsql.so                     [.] plpgsql_exec_function
+    1.98%  postgres         postgres                       [.] AllocSetAlloc
+    1.94%  pgbench          [kernel.vmlinux]               [k] sys_recvfrom
+    1.72%  postgres         postgres                       [.] LWLockAcquire
+    1.66%  pgbench          pgbench                        [.] doCustom
+    1.59%  pgbench          [kernel.vmlinux]               [k] system_call_after_swapgs
+    1.58%  pgbench          [kernel.vmlinux]               [k] system_call
+    1.51%  pgbench          [vdso]                         [.] __vdso_gettimeofday
+    1.50%  pgbench          [kernel.vmlinux]               [k] __fget
+    1.21%  postgres         libc-2.19.so                   [.] memset
+    1.11%  postgres         libc-2.19.so                   [.] _int_malloc
+    1.08%  postgres         postgres                       [.] GetSnapshotData

(intel_idle is executed on a cpu when it's idle, so it's not surprising
that it shows up prominently, especially when not all cores are busy.)
It's interesting to see how the locking functions are less prominent in
the -c8 case, and how the overhead of allocation and plpgsql_exec_function
rises.

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> So there's an interesting "dip" between 4 and 8 clients. A perf profile
> doesn't show any actual lock contention on master. Not that surprising,
> there shouldn't be any exclusive locks here.

What size of machine are you testing on?

I ran Graeme's tests on a 2-socket, 4-core-per-socket, no-hyperthreading
machine, which has separate NUMA zones for the 2 sockets.  What I saw
(after fixing the "stable" issue) was that all the 8-client and 16-client
cases were about 8x faster than 1-client, and 2-client was generally
within hailing distance of 2x faster, but 4-client was often noticeably
worse than the expected 4x faster.

I figured this was likely some weird NUMA effect, possibly compounded
by brutally stupid scheduling on the part of my kernel.  But I didn't
have time to look closer.

You might be seeing the same kind of effect, or something different.
It's hard to tell without knowing more about your machine.

            regards, tom lane

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Andres Freund
Date:
On 2015-07-08 09:56:51 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > So there's an interesting "dip" between 4 and 8 clients. A perf profile
> > doesn't show any actual lock contention on master. Not that surprising,
> > there shouldn't be any exclusive locks here.
>
> What size of machine are you testing on?

2x E5520 (i.e. 2 sockets x 4 cores, 8 threads per socket); NUMA.

(note that I intentionally did not fix the volatility of the function)

> I ran Graeme's tests on a 2-socket, 4-core-per-socket, no-hyperthreading
> machine, which has separate NUMA zones for the 2 sockets.  What I saw
> (after fixing the "stable" issue) was that all the 8-client and 16-client
> cases were about 8x faster than 1-client, and 2-client was generally
> within hailing distance of 2x faster, but 4-client was often noticeably
> worse than the expected 4x faster.

> I figured this was likely some weird NUMA effect, possibly compounded
> by brutally stupid scheduling on the part of my kernel.  But I didn't
> have time to look closer.
>
> You might be seeing the same kind of effect, or something different.
> It's hard to tell without knowing more about your machine.

I think it's likely to be some scheduler effect. The number of cpu
migrations between 4 and 8 is very different:

4:

            64,599      context-switches          #    0.003 M/sec                    (100.00%)
               172      cpu-migrations            #    0.007 K/sec                    (100.00%)
               537      page-faults               #    0.023 K/sec
8:
           381,383      context-switches          #    0.002 M/sec                    (100.00%)
             1,279      cpu-migrations            #    0.008 K/sec                    (100.00%)
             3,869      page-faults               #    0.024 K/sec
16:

           514,426      context-switches          #    0.003 M/sec                    (100.00%)
             1,166      cpu-migrations            #    0.007 K/sec                    (100.00%)
             6,308      page-faults               #    0.039 K/sec

There's a pretty large increase in the number of migrations between 4
and 8, but none between 8 and 16.

My guess is that the kernel tries to move around processes to idle nodes
too aggressively.
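
One way to test that would be to pin the server to a single node and
re-run, e.g. (a sketch; node numbering and data directory assumed):

    # Pin the postmaster (and thus its children) to NUMA node 0 to take
    # cross-node scheduler migrations out of the picture:
    numactl --cpunodebind=0 --membind=0 pg_ctl -D "$PGDATA" start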

second-by-second pgbench is quite interesting:
progress: 1.0 s, 22915.3 tps, lat 0.346 ms stddev 0.078
progress: 2.0 s, 15596.8 tps, lat 0.512 ms stddev 0.185
progress: 3.0 s, 15519.2 tps, lat 0.514 ms stddev 0.499
progress: 4.0 s, 15535.7 tps, lat 0.512 ms stddev 0.306
progress: 5.0 s, 15494.3 tps, lat 0.515 ms stddev 0.162

so in the first second we're routinely much faster than later on.
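
(The per-second lines come from pgbench's progress reporting, i.e. roughly
the following; script name assumed:)

    # -P 1 prints the per-second progress lines shown above:
    pgbench -n -c 8 -j 8 -T 8 -P 1 -f test3.sql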

Comparing perf stat pgbench -j8 -T 1 and -T 8:
-T 1
                46      cpu-migrations
-T 8
               534      cpu-migrations
so indeed the number of migrations rises noticeably after the first
second...

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Graeme
Date:
By the way,

It may be worth increasing the length of the for loop in the pl/pgsql by
a factor of 100-1000x and seeing if you get the same kind of behaviour.
I intentionally left it very low (10000 iterations of nothing is not that
much work) to ensure people would see a non-zero number there for the TPS
the first time they ran the bench. However, in practice the functions that
have the worst behaviour tend to be longer-running ones.
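
I.e., roughly (a sketch against the hypothetical function shape from
earlier in the thread, not the actual benchmark code):

    -- Scale the busy loop up 100x:
    CREATE OR REPLACE FUNCTION burn_cycles() RETURNS int AS $$
    DECLARE
        s int := 0;
    BEGIN
        FOR i IN 1..1000000 LOOP   -- was 1..10000
            s := s + 1;
        END LOOP;
        RETURN s;
    END
    $$ LANGUAGE plpgsql;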


Graeme.

> On 8 Jul 2015, at 16:22, Andres Freund <andres@anarazel.de> wrote:
>
> [...]

Re: BUG #13493: pl/pgsql doesn't scale with cpus (PG9.3, 9.4)

From:
Graeme
Date:
Apologies, I'm posting in a hurry here and forgot to trim the fat from
the quoted text of the last message.