Discussion: Performance implications of 8K pread()s


Performance implications of 8K pread()s

From:
Dimitrios Apostolou
Date:
Hello list,

I have noticed that sequential reads during a SELECT COUNT(*) command are
much slower than what the device can provide. Parallel workers improve the
situation, but for simplicity's sake I disable parallelism for my
measurements here by setting max_parallel_workers_per_gather to 0.

Stracing the postgresql process shows that all reads happen in 8KB blocks
at explicit offsets, using pread():

   pread64(172, ..., 8192, 437370880) = 8192

The read rate I see on the device is only 10-20 MB/s. My case is special
though, as this is on a zstd-compressed btrfs filesystem, on a very fast
(1GB/s) direct attached storage system. Given the decompression ratio is around
10x, the above rate corresponds to about 100 to 200 MB/s of data going into the
postgres process.

Can the 8K block size cause slowdown? Here are my observations:

+ Reading a 1GB postgres file using dd (which uses read() internally) in
    8K and 32K chunks:

      # dd if=4156889.4 of=/dev/null bs=8k
      1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.18829 s, 174 MB/s

      # dd if=4156889.4 of=/dev/null bs=8k    # 2nd run, data is cached
      1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.287623 s, 3.7 GB/s

      # dd if=4156889.8 of=/dev/null bs=32k
      1073741824 bytes (1.1 GB, 1.0 GiB) copied, 1.02688 s, 1.0 GB/s

      # dd if=4156889.8 of=/dev/null bs=32k    # 2nd run, data is cached
      1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.264049 s, 4.1 GB/s

    The rates displayed are after decompression (the fs does it
    transparently) and the results have been verified with multiple runs.

    Notice that the read rate with bs=8k is 174MB/s (I see ~20MB/s on the
    device): slow, and similar to what PostgreSQL gave us above. With bs=32k
    the rate increases to 1GB/s (I see ~80MB/s on the device, but the run is
    too short for the rate to register properly).

   The cached reads are fast in both cases.

Note that I suspect my setup is a factor here (btrfs compression behaving
suboptimally), since the raw device can give me up to a 1GB/s rate. It is
however evident that reading in bigger chunks would mitigate such setup
inefficiencies. On a system where reads are already optimal and the read rate
remains the same, a bigger block size would probably still reduce the sys time
postgresql consumes, because of the fewer system calls.
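
For anyone who wants to check the syscall-overhead claim independently of
dd, here is a minimal, untested C sketch of mine that times sequential
read()s of a file in a configurable chunk size (the file name and sizes
are whatever you pass in; build with `cc -O2 -o readbench readbench.c`):

    /* readbench.c - time sequential read()s of a file in a given chunk size */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        if (argc != 3) {
            fprintf(stderr, "usage: %s <file> <chunk_bytes>\n", argv[0]);
            return 1;
        }
        size_t chunk = (size_t) atol(argv[2]);
        char *buf = malloc(chunk);
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0 || buf == NULL) { perror("setup"); return 1; }

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long long total = 0;
        ssize_t n;
        while ((n = read(fd, buf, chunk)) > 0)   /* one syscall per chunk */
            total += n;

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%lld bytes in %.3f s = %.1f MB/s\n",
               total, secs, total / secs / 1e6);
        close(fd);
        free(buf);
        return 0;
    }

Running it with 8192 vs 32768 as the second argument should show whether
the chunk size alone makes the difference, independently of dd.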

So would it make sense for postgres to perform reads in bigger blocks? Is it
easy-ish to implement (where would one look for that)? Or must the I/O unit be
tied to postgres' page size?

Regards,
Dimitris




Re: Performance implications of 8K pread()s

From:
Thomas Munro
Date:
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> Note that I suspect my setup is a factor here (btrfs compression behaving
> suboptimally), since the raw device can give me up to a 1GB/s rate. It is
> however evident that reading in bigger chunks would mitigate such setup
> inefficiencies. On a system where reads are already optimal and the read rate
> remains the same, a bigger block size would probably still reduce the sys time
> postgresql consumes, because of the fewer system calls.

I don't know about btrfs but maybe it can be tuned to prefetch
sequential reads better...

> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

It is hard to implement.  But people are working on it.  One of the
problems is that the 8KB blocks that we want to read data into aren't
necessarily contiguous so you can't just do bigger pread() calls
without solving a lot more problems first.  The project at
https://wiki.postgresql.org/wiki/AIO aims to deal with the
"clustering" you seek plus the "gathering" required for non-contiguous
buffers by allowing multiple block-sized reads to be prepared and
collected on a pending list up to some size that triggers merging and
submission to the operating system at a sensible rate, so we can build
something like a single large preadv() call.  In the current
prototype, if io_method=worker then that becomes a literal preadv()
call running in a background "io worker" process, but it could also be
OS-specific stuff (io_uring, ...) that starts an asynchronous IO
depending on settings.  If you take that branch and run your test you
should see 128KB-sized preadv() calls.
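
Just to illustrate the primitive involved (this is not the patch's code,
and the file name is made up): a single preadv() can pull one contiguous
128KB file range into 16 non-contiguous 8KB buffers, which is the
scatter/gather aspect mentioned above:

    /* Sketch only: one 128KB scatter read into non-contiguous 8KB buffers */
    #include <sys/types.h>
    #include <sys/uio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define BLCKSZ  8192
    #define NBLOCKS 16                    /* 16 x 8KB = 128KB */

    int main(void)
    {
        int fd = open("relation_segment", O_RDONLY);   /* placeholder */
        if (fd < 0) { perror("open"); return 1; }

        struct iovec iov[NBLOCKS];
        for (int i = 0; i < NBLOCKS; i++) {
            /* The buffers need not be adjacent in memory; only the file
             * range being read has to be contiguous on disk. */
            iov[i].iov_base = malloc(BLCKSZ);
            iov[i].iov_len = BLCKSZ;
        }

        ssize_t n = preadv(fd, iov, NBLOCKS, 0);   /* one system call */
        printf("read %zd bytes with a single preadv()\n", n);
        close(fd);
        return 0;
    }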



Re: Performance implications of 8K pread()s

From:
Thomas Munro
Date:
On Wed, Jul 12, 2023 at 5:12 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> "gathering"

(Oops, for reads, that's "scattering".  As in scatter/gather I/O but I
picked the wrong one...).



Re: Performance implications of 8K pread()s

From:
Dimitrios Apostolou
Date:
Hello and thanks for the feedback!

On Wed, 12 Jul 2023, Thomas Munro wrote:

> On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>> Note that I suspect my setup is a factor here (btrfs compression behaving
>> suboptimally), since the raw device can give me up to a 1GB/s rate. It is
>> however evident that reading in bigger chunks would mitigate such setup
>> inefficiencies. On a system where reads are already optimal and the read rate
>> remains the same, a bigger block size would probably still reduce the sys time
>> postgresql consumes, because of the fewer system calls.
>
> I don't know about btrfs but maybe it can be tuned to prefetch
> sequential reads better...

I tried a lot to tweak the kernel's block-layer read-ahead and to switch
between different I/O schedulers, but it made no difference. I'm now convinced
that the problem manifests specifically on compressed btrfs: the filesystem
doesn't do any read-ahead (prefetch), so no I/O requests get "merged" at the
block layer.

Iostat gives an interesting insight into the above measurements. For both
postgres doing a sequential scan and dd with bs=8k, the kernel block layer
does not appear to merge the I/O requests. `iostat -x` shows a 16-sector
average read request size, 0 merged requests, and a very high reads/s IOPS
number.

The dd commands with bs=32k show fewer IOPS on `iostat -x` but higher
speed(!), a larger average request size, and a high number of merged
requests.

Example output for a random second during dd bs=8k:

     Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz
     sdc           1313.00     20.93     2.00   0.15    0.53    16.32

with dd bs=32k:

     Device            r/s     rMB/s   rrqm/s  %rrqm r_await rareq-sz
     sdc            290.00     76.44  4528.00  93.98    1.71   269.92

On the same filesystem, doing dd bs=8k reads from a file that has not been
compressed by the filesystem I get 1GB/s device read throughput!

I sent this feedback to the btrfs list, but have received no reply yet:

https://www.spinics.net/lists/linux-btrfs/msg137200.html

>
>> So would it make sense for postgres to perform reads in bigger blocks? Is it
>> easy-ish to implement (where would one look for that)? Or must the I/O unit be
>> tied to postgres' page size?
>
> It is hard to implement.  But people are working on it.  One of the
> problems is that the 8KB blocks that we want to read data into aren't
> necessarily contiguous so you can't just do bigger pread() calls
> without solving a lot more problems first.

This kind of overhaul is good, but goes much deeper. The same goes for async
I/O, of course. But what I have in mind should be much simpler (take it with
a grain of salt, since I don't know postgres internals :-); a rough C sketch
follows the list below:

+ A process wants to read a block from a file.
+ Postgres' buffer cache layer (shared_buffers?) looks it up in the cache;
   if not found, it passes the request down to
+ postgres' block layer, which submits an I/O request for the 32KB chunk
   that includes the requested 8K block, and returns the 32K chunk to
+ postgres' buffer cache layer, which stores all 4 blocks read from disk
   into the buffer cache and returns only the 1 block requested.
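
To make that concrete, here is the promised sketch of the flow; the cache
functions are hypothetical names of mine and everything is simplified:

    /* Sketch: on a cache miss, read the aligned 32KB chunk containing the
     * wanted 8KB block, cache all its blocks, return only the one asked for. */
    #include <sys/types.h>
    #include <unistd.h>

    #define BLCKSZ  8192
    #define CHUNKSZ (4 * BLCKSZ)

    /* hypothetical buffer-cache API */
    extern char *cache_lookup(unsigned blkno);
    extern char *cache_store(unsigned blkno, const char *data);

    char *read_block(int fd, unsigned blkno)
    {
        char *hit = cache_lookup(blkno);
        if (hit != NULL)
            return hit;                       /* served from the buffer cache */

        unsigned first = blkno - (blkno % 4); /* align down to a 32KB boundary */
        char chunk[CHUNKSZ];
        ssize_t n = pread(fd, chunk, CHUNKSZ, (off_t) first * BLCKSZ);
        if (n < BLCKSZ)
            return NULL;                      /* error handling elided */

        /* Cache every full block that came back; a short read near end of
         * file simply caches fewer blocks. */
        char *wanted = NULL;
        for (unsigned i = 0; i < (unsigned) (n / BLCKSZ); i++) {
            char *stored = cache_store(first + i, chunk + i * BLCKSZ);
            if (first + i == blkno)
                wanted = stored;
        }
        return wanted;
    }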

The danger here is that with random non-contiguous 8K reads, the buffer
cache gets saturated by 4x the amount of data because of the 32K reads, and
75% of that data is useless but may still evict useful data. The answer is
that it should then be marked as unused (by putting it at the eviction end
of the cache's LRU, for example) so that those unused read-ahead pages are
re-used for upcoming read-ahead, without evicting too many useful pages.
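
That eviction-order idea could look like this (a strict-LRU sketch with
made-up names; as far as I know postgres actually uses a clock sweep with
usage counts, but the principle is the same):

    /* Demand-fetched pages enter at the hot end and are evicted last;
     * speculative read-ahead pages enter at the cold end and are the
     * first eviction candidates, so they cannot flush out hot data. */
    typedef struct Page Page;
    struct Page { Page *prev, *next; unsigned blkno; };

    typedef struct {
        Page *hot;    /* most recently used */
        Page *cold;   /* next eviction victim */
    } LruList;

    void lru_insert_hot(LruList *l, Page *p)    /* demand read */
    {
        p->prev = NULL;
        p->next = l->hot;
        if (l->hot) l->hot->prev = p; else l->cold = p;
        l->hot = p;
    }

    void lru_insert_cold(LruList *l, Page *p)   /* speculative read-ahead */
    {
        p->next = NULL;
        p->prev = l->cold;
        if (l->cold) l->cold->next = p; else l->hot = p;
        l->cold = p;
    }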

> The project at
> https://wiki.postgresql.org/wiki/AIO aims to deal with the
> "clustering" you seek plus the "gathering" required for non-contiguous
> buffers by allowing multiple block-sized reads to be prepared and
> collected on a pending list up to some size that triggers merging and
> submission to the operating system at a sensible rate, so we can build
> something like a single large preadv() call.  In the current
> prototype, if io_method=worker then that becomes a literal preadv()
> call running in a background "io worker" process, but it could also be
> OS-specific stuff (io_uring, ...) that starts an asynchronous IO
> depending on settings.  If you take that branch and run your test you
> should see 128KB-sized preadv() calls.
>

Interesting, and kind of sad, that the last update on the wiki page is from
2021. What is the latest prototype? I'm not sure I'm up to the task of
putting my database to the test. ;-)


Thanks and regards,
Dimitris


Re: Performance implications of 8K pread()s

From:
Thomas Munro
Date:
On Thu, Jul 13, 2023 at 6:50 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> Interesting and kind of sad that the last update on the wiki page is from
> 2021. What is the latest prototype? I'm not sure I'm up to the task of
> putting my database to the test. ;-)

It works pretty well, certainly well enough to try out, and work is
happening.  I'll try to update the wiki with some more up-to-date
information soon.  Basically, compare these two slides (you could also
look at slide 11, which is what most people are probably interested in,
but then you can't really see what's going on with system call-level
tools):

https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=7
https://speakerdeck.com/macdice/aio-and-dio-for-postgresql-on-freebsd?slide=9

Not only are the IOs converted into 128KB preadv() calls, they are
issued concurrently and ahead of time while your backend is chewing on
the last lot of pages.  So even if your file system completely fails
at prefetching, we'd have a fighting chance at getting closer to
device/line speed.  That's basically what you have to do to support
direct I/O, where there is no system-provided prefetching.



Re: Performance implications of 8K pread()s

From:
Dimitrios Apostolou
Date:
Thanks, it sounds promising! Are the changes already in the 16 branch,
i.e. is it enough to fetch the sources for 16-beta2? If so, do I just
configure --with-liburing (I'm on linux) and run with io_method=io_uring?
Otherwise, if I use io_method=worker, what is a sensible number of worker
threads? Should I also set all the flags for direct I/O?
(io_data_direct=on io_wal_direct=on).




Re: Performance implications of 8K pread()s

From:
Andres Freund
Date:
Hi,

On 2023-07-17 16:42:31 +0200, Dimitrios Apostolou wrote:
> Thanks, it sounds promising! Are the changes already in the 16 branch,
> i.e. is it enough to fetch the sources for 16-beta2?

No, this is in a separate branch.

https://github.com/anarazel/postgres/tree/aio


> If so, do I just configure --with-liburing (I'm on linux) and run with
> io_method=io_uring?

It's probably worth trying out both io_uring and worker. I've not looked at
performance on btrfs. I know that some of the optimized paths for io_uring
(being able to perform filesystem IO without doing so synchronously in an
in-kernel thread) require filesystem cooperation, and I do not know how much
attention btrfs has received for that.


> Otherwise, if I use io_method=worker, what is a sensible number of worker
> threads?

Depends on your workload :/. If you just want to measure whether it fixes your
single-threaded query execution issue, the default should be just fine.


> Should I also set all the flags for direct I/O?  (io_data_direct=on
> io_wal_direct=on).

FWIW, I just pushed a rebased version to the aio branch, and there the config
for direct io is
io_direct = 'data, wal, wal_init'
(or a subset thereof).

From what I know of btrfs, I don't think you want direct IO though. Possibly
for WAL, but definitely not for data. IIRC it currently can cause corruption.

Greetings,

Andres Freund



Re: Performance implications of 8K pread()s

From:
Thomas Munro
Date:
On Wed, Jul 12, 2023 at 1:11 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
> So would it make sense for postgres to perform reads in bigger blocks? Is it
> easy-ish to implement (where would one look for that)? Or must the I/O unit be
> tied to postgres' page size?

FYI as of last week we can do a little bit of that on the master branch:

postgres=# select count(*) from t;

preadv(46, ..., 8, 256237568) = 131072
preadv(46, ..., 5, 256368640) = 131072
preadv(46, ..., 8, 256499712) = 131072
preadv(46, ..., 5, 256630784) = 131072

postgres=# set io_combine_limit = '256k';
postgres=# select count(*) from t;

preadv(47, ..., 5, 613728256) = 262144
preadv(47, ..., 5, 613990400) = 262144
preadv(47, ..., 5, 614252544) = 262144
preadv(47, ..., 5, 614514688) = 262144

Here's hoping the commits implementing this stick, for the PostgreSQL
17 release.  It's just the beginning though, we can only do this for
full table scans so far (plus a couple of other obscure places).
Hopefully in the coming year we'll get the "streaming I/O" mechanism
that powers this hooked up to lots more places... index scans and
other stuff.  And writing.  Then eventually pushing the I/O into the
background.  Your questions actually triggered us to talk about why we
couldn't switch a few things around in our project and get the I/O
combining piece done sooner.  Thanks!



Re: Performance implications of 8K pread()s

From:
Dimitrios Apostolou
Date:
Exciting! Since I still have the same performance issues on compressed btrfs,
I'm looking forward to testing the patches, probably when a 17 Beta is out
and I can find binaries for my platform (OpenSUSE). It looks like it will
make a huge difference.

Thank you for persisting and getting this through.

Dimitris

