Discussion: Flushing large data immediately in pqcomm


Flushing large data immediately in pqcomm

From: Melih Mutlu
Date:
Hi hackers

I've been looking into ways to reduce the overhead we're having in pqcomm and I'd like to propose a small patch to modify how socket_putmessage works.

Currently socket_putmessage copies any input data into the pqcomm send buffer (PqSendBuffer) and the size of this buffer is 8K. When the send buffer gets full, it's flushed and we continue to copy more data into the send buffer until we have no data left to be sent.
Since the send buffer is flushed whenever it's full, I think it's safe to say that if the input data is larger than the buffer size, which is 8K, then the buffer will be flushed at least once (or more, depending on the input size) in order to store all the input data.

The proposed change modifies socket_putmessage to send any data larger than 8K immediately, without copying it into the send buffer. Assuming that the send buffer would be flushed anyway due to reaching its limit, the patch just gets rid of the copy part, which seems unnecessary, and sends data without waiting.
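
To illustrate the idea, here is a rough sketch against internal_putbytes() in pqcomm.c (simplified, not the actual patch; internal_flush_buffer() stands in for an assumed helper that writes a caller-supplied buffer to the socket directly):

static int
internal_putbytes(const char *s, size_t len)
{
    /*
     * Input larger than the 8K buffer would force at least one flush
     * anyway, so skip the copy: flush whatever is pending, then send
     * the input directly from the caller's memory.
     */
    if (len >= (size_t) PqSendBufferSize)
    {
        if (internal_flush())
            return EOF;
        return internal_flush_buffer(s, len) ? EOF : 0; /* assumed helper */
    }

    /* Small input: copy into PqSendBuffer exactly as before */
    while (len > 0)
    {
        size_t  amount;

        if (PqSendPointer >= PqSendBufferSize)
        {
            if (internal_flush())
                return EOF;
        }
        amount = PqSendBufferSize - PqSendPointer;
        if (amount > len)
            amount = len;
        memcpy(PqSendBuffer + PqSendPointer, s, amount);
        PqSendPointer += amount;
        s += amount;
        len -= amount;
    }
    return 0;
}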

This change affects places where pq_putmessage is used such as pg_basebackup, COPY TO, walsender etc.

I did some experiments to see how the patch performs. 
Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO STDOUT". Here are perf results of both the patch and HEAD

HEAD:
-   94,13%     0,22%  postgres  postgres           [.] DoCopyTo 
  - 93,90% DoCopyTo                                           
      - 91,80% CopyOneRowTo                                    
         + 47,35% CopyAttributeOutText                         
         - 26,49% CopySendEndOfRow                             
            - 25,97% socket_putmessage                         
               - internal_putbytes                             
                  - 24,38% internal_flush                      
                     + secure_write                            
                  + 1,47% memcpy (inlined)                     
         + 14,69% FunctionCall1Coll                            
         + 1,94% appendBinaryStringInfo                        
         + 0,75% MemoryContextResetOnly                        
      + 1,54% table_scan_getnextslot (inlined)

Patch:
-   94,40%     0,30%  postgres  postgres           [.] DoCopyTo
   - 94,11% DoCopyTo                                           
      - 92,41% CopyOneRowTo                                    
         + 51,20% CopyAttributeOutText                         
         - 20,87% CopySendEndOfRow                             
            - 20,45% socket_putmessage                         
               - internal_putbytes                             
                  - 18,50% internal_flush (inlined)            
                       internal_flush_buffer                   
                     + secure_write                            
                  + 1,61% memcpy (inlined)                     
         + 17,36% FunctionCall1Coll                            
         + 1,33% appendBinaryStringInfo                        
         + 0,93% MemoryContextResetOnly                        
      + 1,36% table_scan_getnextslot (inlined)

The patch brings a ~5% gain in socket_putmessage.

I also timed pg_basebackup like:
time pg_basebackup -p 5432 -U replica_user -X none -c fast --no-manifest -D test

HEAD:
real    0m10,040s
user    0m0,768s
sys     0m7,758s

Patch:
real    0m8,882s
user    0m0,699s
sys     0m6,980s

It seems ~11% faster in this specific case.

I'd appreciate any feedback/thoughts.


[1]
CREATE TABLE test(id int, name text, time TIMESTAMP);
INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) AS name, NOW() AS time FROM generate_series(1, 100000000) AS i;


Thanks,
--
Melih Mutlu
Microsoft
Attachments

Re: Flushing large data immediately in pqcomm

From: Heikki Linnakangas
Date:
On 20/11/2023 14:21, Melih Mutlu wrote:
> Hi hackers
> 
> I've been looking into ways to reduce the overhead we're having in 
> pqcomm and I'd like to propose a small patch to modify how 
> socket_putmessage works.
> 
> Currently socket_putmessage copies any input data into the pqcomm send 
> buffer (PqSendBuffer) and the size of this buffer is 8K. When the send 
> buffer gets full, it's flushed and we continue to copy more data into 
> the send buffer until we have no data left to be sent.
> Since the send buffer is flushed whenever it's full, I think it's safe 
> to say that if the input data is larger than the buffer size, 
> which is 8K, then the buffer will be flushed at least once (or more, 
> depending on the input size) in order to store all the input data.

Agreed, that's silly.

> The proposed change modifies socket_putmessage to send any data larger than 
> 8K immediately, without copying it into the send buffer. Assuming that 
> the send buffer would be flushed anyway due to reaching its limit, the 
> patch just gets rid of the copy part, which seems unnecessary, and sends 
> data without waiting.

If there's already some data in PqSendBuffer, I wonder if it would be 
better to fill it up with data, flush it, and then send the rest of the 
data directly, instead of flushing the partial data first. I'm afraid 
that you'll make a tiny call to secure_write(), followed by a large one, 
then a tiny one again, and so forth. Especially when socket_putmessage 
itself writes the msgtype and len, which are tiny, before the payload.

Perhaps we should invent a new pq_putmessage() function that would take 
an input buffer with 5 bytes of space reserved before the payload. 
pq_putmessage() could then fill in the msgtype and len bytes in the 
input buffer and send that directly. (Not wedded to that particular API, 
but something that would have the same effect)
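
Purely as an illustration of the shape such an API could take (names and details invented here, building on pqcomm.c internals):

/*
 * Hypothetical variant of pq_putmessage(): the caller reserves 5 bytes
 * of headroom in front of the payload, so the msgtype and length can be
 * filled in place and header + payload sent together in one call.
 */
int
pq_putmessage_reserved(char msgtype, char *buf, size_t payload_len)
{
    uint32      n32 = pg_hton32((uint32) (payload_len + 4));

    buf[0] = msgtype;           /* byte 0: message type */
    memcpy(buf + 1, &n32, 4);   /* bytes 1-4: length, counting itself */

    /* one send for header and payload; no separate tiny header write */
    return internal_putbytes(buf, payload_len + 5);
}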

> This change affects places where pq_putmessage is used such as 
> pg_basebackup, COPY TO, walsender etc.
> 
> I did some experiments to see how the patch performs.
> Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO 
> STDOUT". Here are perf results of both the patch and HEAD > ...
> The patch brings a ~5% gain in socket_putmessage.
> 
> [1]
> CREATE TABLE test(id int, name text, time TIMESTAMP);
> INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100) 
> AS name, NOW() AS time FROM generate_series(1, 100000000) AS i;

I'm surprised by these results, because each row in that table is < 600 
bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in 
that test. Am I missing something?

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Mon, Jan 29, 2024 at 11:12 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Agreed, that's silly.

+1.

> If there's already some data in PqSendBuffer, I wonder if it would be
> better to fill it up with data, flush it, and then send the rest of the
> data directly, instead of flushing the partial data first. I'm afraid
> that you'll make a tiny call to secure_write(), followed by a large one,
> then a tiny one again, and so forth. Especially when socket_putmessage
> itself writes the msgtype and len, which are tiny, before the payload.
>
> Perhaps we should invent a new pq_putmessage() function that would take
> an input buffer with 5 bytes of space reserved before the payload.
> pq_putmessage() could then fill in the msgtype and len bytes in the
> input buffer and send that directly. (Not wedded to that particular API,
> but something that would have the same effect)

I share the concern; I'm not sure about the best solution. I wonder if
it would be useful to have pq_putmessagev() in the style of writev()
et al. Or maybe what we need is secure_writev().

I also wonder if the threshold for sending data directly should be
smaller than the buffer size, and/or whether it should depend on the
buffer being empty. If we have an 8kB buffer that currently has
nothing in it, and somebody writes 2kB, I suspect it might be wrong to
copy that into the buffer. If the same buffer had 5kB used and 3kB
free, copying sounds a lot more likely to work out. The goal here is
probably to condense sequences of short messages into a single
transmission while sending long messages individually. I'm just not
quite sure what heuristic would do that most effectively.
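
Purely illustrative, one way to encode that intuition; the threshold value and the exact conditions are precisely the open questions:

/*
 * Sketch of a buffering heuristic (not a proposal of concrete values).
 * Coalesce when it helps batch small messages; send directly when the
 * buffer is empty and the message is long.
 */
static bool
should_copy_into_buffer(size_t len, size_t threshold)
{
    size_t      pending = PqSendPointer - PqSendStart;

    /* Something already buffered and this still fits: coalesce. */
    if (pending > 0 && len <= (size_t) (PqSendBufferSize - PqSendPointer))
        return true;

    /* Empty buffer: copy only messages below some threshold. */
    return len < threshold;
}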

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Melih Mutlu
Date:
Hi Heikki,

On Mon, 29 Jan 2024 at 19:12, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > The proposed change modifies socket_putmessage to send any data larger than
> > 8K immediately, without copying it into the send buffer. Assuming that
> > the send buffer would be flushed anyway due to reaching its limit, the
> > patch just gets rid of the copy part, which seems unnecessary, and sends
> > data without waiting.
>
> If there's already some data in PqSendBuffer, I wonder if it would be
> better to fill it up with data, flush it, and then send the rest of the
> data directly, instead of flushing the partial data first. I'm afraid
> that you'll make a tiny call to secure_write(), followed by a large one,
> then a tiny one again, and so forth. Especially when socket_putmessage
> itself writes the msgtype and len, which are tiny, before the payload.

I agree that I could do better there and avoid flushing twice, once for PqSendBuffer and once for the input data. PqSendBuffer always has some data in it, even if tiny, since the msgtype and len are added first.
 
> Perhaps we should invent a new pq_putmessage() function that would take
> an input buffer with 5 bytes of space reserved before the payload.
> pq_putmessage() could then fill in the msgtype and len bytes in the
> input buffer and send that directly. (Not wedded to that particular API,
> but something that would have the same effect)

I thought about doing this. The reason why I didn't was because I think that such a change would require adjusting all input buffers wherever pq_putmessage is called, and I did not want to touch that many different places. These places where we need pq_putmessage might not be that many though, I'm not sure. 
 

> > This change affects places where pq_putmessage is used such as
> > pg_basebackup, COPY TO, walsender etc.
> >
> > I did some experiments to see how the patch performs.
> > Firstly, I loaded ~5GB data into a table [1], then ran "COPY test TO
> > STDOUT". Here are perf results of both the patch and HEAD > ...
> > The patch brings a ~5% gain in socket_putmessage.
> >
> > [1]
> > CREATE TABLE test(id int, name text, time TIMESTAMP);
> > INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 100)
> > AS name, NOW() AS time FROM generate_series(1, 100000000) AS i;
>
> I'm surprised by these results, because each row in that table is < 600
> bytes. PqSendBufferSize is 8kB, so the optimization shouldn't kick in in
> that test. Am I missing something?

You're absolutely right. I made a silly mistake there. I also think that the way I did perf analysis does not make much sense, even if one row of the table is greater than 8kB.
Here are some quick timing results after making sure that the test triggers this patch's optimization. I need to think more about how to profile this with perf. I hope to share proper results soon.

I just added a few more zeros [1] and ran [2] (hopefully measuring the correct thing).

HEAD:
real    2m48,938s
user    0m9,226s
sys     1m35,342s

Patch:
real    2m40,690s
user    0m8,492s
sys     1m31,001s

[1]
 INSERT INTO test (id, name, time) SELECT i AS id, repeat('dummy', 10000) AS name, NOW() AS time FROM generate_series(1, 1000000) AS i;

[2]
 rm /tmp/dummy && echo 3 | sudo tee /proc/sys/vm/drop_caches && time psql -d postgres -c "COPY test TO STDOUT;" > /tmp/dummy

Thanks,
--
Melih Mutlu
Microsoft

Re: Flushing large data immediately in pqcomm

From: Melih Mutlu
Date:
Hi Robert,

On Mon, 29 Jan 2024 at 20:48, Robert Haas <robertmhaas@gmail.com> wrote:
> > If there's already some data in PqSendBuffer, I wonder if it would be
> > better to fill it up with data, flush it, and then send the rest of the
> > data directly, instead of flushing the partial data first. I'm afraid
> > that you'll make a tiny call to secure_write(), followed by a large one,
> > then a tiny one again, and so forth. Especially when socket_putmessage
> > itself writes the msgtype and len, which are tiny, before the payload.
> >
> > Perhaps we should invent a new pq_putmessage() function that would take
> > an input buffer with 5 bytes of space reserved before the payload.
> > pq_putmessage() could then fill in the msgtype and len bytes in the
> > input buffer and send that directly. (Not wedded to that particular API,
> > but something that would have the same effect)
>
> I share the concern; I'm not sure about the best solution. I wonder if
> it would be useful to have pq_putmessagev() in the style of writev()
> et al. Or maybe what we need is secure_writev().

I thought about using writev() not only for pq_putmessage() but for pq_putmessage_noblock() too. Currently, pq_putmessage_noblock() repallocs PqSendBuffer and copies the input buffer, which can easily be larger than 8kB, into PqSendBuffer. I also discussed it with Thomas off-list. The thing is that I believe we would need secure_writev() with the SSL/GSS cases handled properly. I'm just not sure if the effort would be worthwhile considering what we gain from it.
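
For illustration, the shape such an interface could take; this is a hypothetical sketch of the plain-TCP case only, and the SSL/GSS variants are exactly the effort in question:

#include <sys/uio.h>

/* Hypothetical: gather several buffers into one syscall, writev()-style. */
static ssize_t
secure_writev_plain(Port *port, const struct iovec *iov, int iovcnt)
{
    return writev(port->sock, iov, iovcnt);
}

/*
 * Usage idea: send the 5-byte header and the payload together without
 * copying either into PqSendBuffer:
 *
 *   struct iovec iov[2];
 *   iov[0].iov_base = header;   iov[0].iov_len = 5;
 *   iov[1].iov_base = payload;  iov[1].iov_len = payload_len;
 *   secure_writev_plain(MyProcPort, iov, 2);
 */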
 
> I also wonder if the threshold for sending data directly should be
> smaller than the buffer size, and/or whether it should depend on the
> buffer being empty.

You might be right. I'm not sure what the ideal threshold would be.
 
> If we have an 8kB buffer that currently has
> nothing in it, and somebody writes 2kB, I suspect it might be wrong to
> copy that into the buffer. If the same buffer had 5kB used and 3kB
> free, copying sounds a lot more likely to work out. The goal here is
> probably to condense sequences of short messages into a single
> transmission while sending long messages individually. I'm just not
> quite sure what heuristic would do that most effectively.

Sounds like it's difficult to come up with a heuristic that would work well enough for most cases.
One catch with sending data directly instead of copying it whenever the buffer is empty is that the buffer is empty initially, so I believe it would stay empty forever if we never copy anything into it while it's empty. We could maybe simply set the threshold to half the buffer size (4kB) and hope that works better. Or copy the data only if it fits into the remaining space in the buffer. What do you think?


An additional note while I mentioned pq_putmessage_noblock(): I've been testing sending input data immediately in pq_putmessage_noblock() without blocking, and copying the data into PqSendBuffer only if the socket would block and cannot send it. Unfortunately, I don't have strong numbers to demonstrate any improvement in perf or timing yet. But I'd still like to know what you think about it.

Thanks,
--
Melih Mutlu
Microsoft

Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Tue, Jan 30, 2024 at 12:58 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> Sounds like it's difficult to come up with a heuristic that would work well enough for most cases.
> One catch with sending data directly instead of copying it whenever the buffer is empty is that the buffer is empty initially, so I believe it would stay empty forever if we never copy anything into it while it's empty. We could maybe simply set the threshold to half the buffer size (4kB) and hope that works better. Or copy the data only if it fits into the remaining space in the buffer. What do you think?
>
> An additional note while I mentioned pq_putmessage_noblock(): I've been testing sending input data immediately in pq_putmessage_noblock() without blocking, and copying the data into PqSendBuffer only if the socket would block and cannot send it. Unfortunately, I don't have strong numbers to demonstrate any improvement in perf or timing yet. But I'd still like to know what you think about it.

I think this is an area where it's very difficult to foresee on
theoretical grounds what will be right in practice. The problem is
that the best algorithm probably depends on what usage patterns are
common in practice. I think one common usage pattern will be a bunch
of roughly equal-sized messages in a row, like CopyData or DataRow
messages -- but those messages won't have a consistent width. It would
probably be worth testing what behavior you see in such cases -- start
with say a stream of 100 byte messages and then gradually increase and
see how the behavior evolves.

But you can also have other patterns, with messages of different sizes
interleaved. In the case of FE-to-BE traffic, the extended query
protocol might be a good example of that: the Parse message could be
quite long, or not, but the Bind Describe Execute Sync messages that
follow are probably all short. That case doesn't arise in this
direction, but I can't think of exactly what cases do. It seems
like someone would need to play around and try some different cases
and maybe log the sizes of the secure_write() calls with various
algorithms, and then try to figure out what's best. For example, if
the alternating short-write, long-write behavior that Heikki mentioned
is happening, and I do think that particular thing is a very real
risk, then you haven't got it figured out yet...

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Jelte Fennema-Nio
Date:
On Tue, 30 Jan 2024 at 19:48, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Jan 30, 2024 at 12:58 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> > Sounds like it's difficult to come up with a heuristic that would work well enough for most cases.
> > One catch with sending data directly instead of copying it whenever the buffer is empty is that the buffer is empty initially, so I believe it would stay empty forever if we never copy anything into it while it's empty. We could maybe simply set the threshold to half the buffer size (4kB) and hope that works better. Or copy the data only if it fits into the remaining space in the buffer. What do you think?
> >
> > An additional note while I mentioned pq_putmessage_noblock(): I've been testing sending input data immediately in pq_putmessage_noblock() without blocking, and copying the data into PqSendBuffer only if the socket would block and cannot send it. Unfortunately, I don't have strong numbers to demonstrate any improvement in perf or timing yet. But I'd still like to know what you think about it.
>
> I think this is an area where it's very difficult to foresee on
> theoretical grounds what will be right in practice

I agree that it's hard to prove that such heuristics will always be
better in practice than the status quo. But I feel like we shouldn't
let perfect be the enemy of good here. I think one approach that is a clear
improvement over the status quo is:
1. If the buffer is empty AND the data we are trying to send is larger
than the buffer size, then don't use the buffer.
2. If not, fill up the buffer first (just like we do now) then send
that. And if the leftover data is then still larger than the buffer,
then now the buffer is empty so 1. applies.
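
A sketch of those two rules inside internal_putbytes()'s copy loop (again assuming an internal_flush_buffer() helper that sends a caller-supplied buffer directly):

while (len > 0)
{
    /* If the buffer is full, flush it first */
    if (PqSendPointer >= PqSendBufferSize)
    {
        if (internal_flush())
            return EOF;
    }

    if (PqSendStart == PqSendPointer && len >= (size_t) PqSendBufferSize)
    {
        /* Rule 1: empty buffer, oversized payload: send it directly */
        if (internal_flush_buffer(s, len))
            return EOF;
        break;
    }
    else
    {
        /*
         * Rule 2: fill the buffer first. Once it flushes and empties,
         * any oversized remainder hits rule 1 on the next iteration.
         */
        size_t  amount = PqSendBufferSize - PqSendPointer;

        if (amount > len)
            amount = len;
        memcpy(PqSendBuffer + PqSendPointer, s, amount);
        PqSendPointer += amount;
        s += amount;
        len -= amount;
    }
}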



Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> I agree that it's hard to prove that such heuristics will always be
> better in practice than the status quo. But I feel like we shouldn't
> let perfect be the enemy of good here.

Sure, I agree.

> I think one approach that is a clear
> improvement over the status quo is:
> 1. If the buffer is empty AND the data we are trying to send is larger
> than the buffer size, then don't use the buffer.
> 2. If not, fill up the buffer first (just like we do now) then send
> that. And if the leftover data is then still larger than the buffer,
> then now the buffer is empty so 1. applies.

That seems like it might be a useful refinement of Melih Mutlu's
original proposal, but consider a message stream that consists of
messages exactly 8kB in size. If that message stream begins when the
buffer is empty, all messages are sent directly. If it begins when
there are any number of bytes in the buffer, we buffer every message
forever. That's kind of an odd artifact, but maybe it's fine in
practice. I say again that it's good to test out a bunch of scenarios
and see what shakes out.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Jelte Fennema-Nio
Date:
On Wed, 31 Jan 2024 at 18:23, Robert Haas <robertmhaas@gmail.com> wrote:
> That's kind of an odd artifact, but maybe it's fine in
> practice.

I agree it's an odd artifact, but it's not a regression over the
status quo. Achieving that was the intent of my suggestion: A change
that improves some cases, but regresses nowhere.

> I say again that it's good to test out a bunch of scenarios
> and see what shakes out.

Testing a bunch of scenarios to find a good one sounds like a good
idea, which can probably give us a more optimal heuristic. But it also
sounds like a lot of work, and probably results in a lot of
discussion. That extra effort might mean that we're not going to
commit any change for PG17 (or even at all). If so, then I'd rather
have a modest improvement from my refinement of Melih's proposal, than
none at all.



Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Wed, Jan 31, 2024 at 12:49 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Testing a bunch of scenarios to find a good one sounds like a good
> idea, which can probably give us a more optimal heuristic. But it also
> sounds like a lot of work, and probably results in a lot of
> discussion. That extra effort might mean that we're not going to
> commit any change for PG17 (or even at all). If so, then I'd rather
> have a modest improvement from my refinement of Melih's proposal, than
> none at all.

Personally, I don't think it's likely that anything will get committed
here without someone doing more legwork than I've seen on the thread
so far. I don't have any plan to pick up this patch anyway, but if I
were thinking about it, I would abandon the idea unless I were
prepared to go test a bunch of stuff myself. I agree with the core
idea of this work, but not with the idea that the bar is as low as "if
it can't lose relative to today, it's good enough."

Of course, another committer may see it differently.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Melih Mutlu
Date:

On Wed, 31 Jan 2024 at 20:23, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 30, 2024 at 6:39 PM Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> > I agree that it's hard to prove that such heuristics will always be
> > better in practice than the status quo. But I feel like we shouldn't
> > let perfect be the enemy of good here.
>
> Sure, I agree.
>
> > I think one approach that is a clear
> > improvement over the status quo is:
> > 1. If the buffer is empty AND the data we are trying to send is larger
> > than the buffer size, then don't use the buffer.
> > 2. If not, fill up the buffer first (just like we do now) then send
> > that. And if the leftover data is then still larger than the buffer,
> > then now the buffer is empty so 1. applies.
>
> That seems like it might be a useful refinement of Melih Mutlu's
> original proposal, but consider a message stream that consists of
> messages exactly 8kB in size. If that message stream begins when the
> buffer is empty, all messages are sent directly. If it begins when
> there are any number of bytes in the buffer, we buffer every message
> forever. That's kind of an odd artifact, but maybe it's fine in
> practice. I say again that it's good to test out a bunch of scenarios
> and see what shakes out.

Isn't this already the case? Imagine sending exactly 8kB messages: the first pq_putmessage() call will buffer 8kB. Any call after this point simply sends an 8kB message already buffered from the previous call and buffers a new 8kB message. The only difference is that we keep the message in the buffer for a while instead of sending it directly. In theory, the proposed idea should not bring any difference in the number of flushes or the size of data we send each time, but it can remove unnecessary copies to the buffer in this case. I guess the behaviour is also the same with or without the patch in case the buffer already has some bytes.

On Wed, 31 Jan 2024 at 21:28, Robert Haas <robertmhaas@gmail.com> wrote:
> Personally, I don't think it's likely that anything will get committed
> here without someone doing more legwork than I've seen on the thread
> so far. I don't have any plan to pick up this patch anyway, but if I
> were thinking about it, I would abandon the idea unless I were
> prepared to go test a bunch of stuff myself. I agree with the core
> idea of this work, but not with the idea that the bar is as low as "if
> it can't lose relative to today, it's good enough."

You're right and I'm open to doing more legwork. I'd also appreciate any suggestion about how to test this properly and/or useful scenarios to test. That would be really helpful.

I understand that I should provide more/better analysis around this change to prove that it doesn't hurt and (hopefully) improves some cases, even if not all of them. That may even help us find a better approach than what's already proposed. Just to clarify, I don't think anyone here suggests that the bar should be at "if it can't lose relative to today, it's good enough". IMHO "a change that improves some cases, but regresses nowhere" does not translate to that.

Thanks,
--
Melih Mutlu
Microsoft

Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Wed, Jan 31, 2024 at 2:23 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>> That seems like it might be a useful refinement of Melih Mutlu's
>> original proposal, but consider a message stream that consists of
>> messages exactly 8kB in size. If that message stream begins when the
>> buffer is empty, all messages are sent directly. If it begins when
>> there are any number of bytes in the buffer, we buffer every message
>> forever. That's kind of an odd artifact, but maybe it's fine in
>> practice. I say again that it's good to test out a bunch of scenarios
>> and see what shakes out.
>
> Isn't this already the case? Imagine sending exactly 8kB messages: the first pq_putmessage() call will buffer 8kB. Any call after this point simply sends an 8kB message already buffered from the previous call and buffers a new 8kB message. The only difference is that we keep the message in the buffer for a while instead of sending it directly. In theory, the proposed idea should not bring any difference in the number of flushes or the size of data we send each time, but it can remove unnecessary copies to the buffer in this case. I guess the behaviour is also the same with or without the patch in case the buffer already has some bytes.

Yes, it's never worse than today in terms of number of buffer flushes,
but it doesn't feel like great behavior, either. Users tend not to
like it when the behavior of an algorithm depends heavily on
incidental factors that shouldn't really be relevant, like whether the
buffer starts with 1 byte in it or 0 at the beginning of a long
sequence of messages. They see the performance varying "for no reason"
and they dislike it. They don't say "even the bad performance is no
worse than earlier versions so it's fine."

> You're right and I'm open to doing more legwork. I'd also appreciate any suggestion about how to test this properly and/or useful scenarios to test. That would be really helpful.

I think experimenting to see whether the long-short-long-short
behavior that Heikki postulated emerges in practice would be a really
good start.

Another experiment that I think would be interesting is: suppose you
create a patch that sends EVERY message without buffering and compare
that to master. My naive expectation would be that this will lose if
you pump short messages through that connection and win if you pump
long messages through that connection. Is that true? If yes, at what
point do we break even on performance? Does it depend on whether the
connection is local or over a network? Does it depend on whether it's
with or without SSL? Does it depend on Linux vs. Windows vs.
whateverBSD? What happens if you twiddle the 8kB buffer size up or,
say, down to just below the Ethernet frame size?

I think that what we really want to understand here is under what
circumstances the extra layer of buffering is a win vs. being a loss.
If all the stuff I just mentioned doesn't really matter and the answer
is, say, that an 8kB buffer is great and the breakpoint where extra
buffering makes sense is also 8kB, and that's consistent regardless of
other variables, then your algorithm or Jelte's variant or something
of that nature is probably just right. But if it turns out, say, that
the extra buffering is only a win for sub-1kB messages, that would be
rather nice to know before we finalize the approach. Also, if it turns
out that the answer differs dramatically based on whether you're using
a UNIX socket or TCP, that would also be nice to know before
finalizing an algorithm.

> I understand that I should provide more/better analysis around this change to prove that it doesn't hurt and (hopefully) improves some cases, even if not all of them. That may even help us find a better approach than what's already proposed. Just to clarify, I don't think anyone here suggests that the bar should be at "if it can't lose relative to today, it's good enough". IMHO "a change that improves some cases, but regresses nowhere" does not translate to that.

Well, I thought those were fairly similar sentiments, so maybe I'm not
quite understanding the statement in the way it was meant.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Andres Freund
Date:
Hi,

On 2024-01-31 14:57:35 -0500, Robert Haas wrote:
> > You're right and I'm open to doing more legwork. I'd also appreciate any
> > suggestion about how to test this properly and/or useful scenarios to
> > test. That would be really helpful.
>
> I think experimenting to see whether the long-short-long-short
> behavior that Heikki postulated emerges in practice would be a really
> good start.
>
> Another experiment that I think would be interesting is: suppose you
> create a patch that sends EVERY message without buffering and compare
> that to master. My naive expectation would be that this will lose if
> you pump short messages through that connection and win if you pump
> long messages through that connection. Is that true? If yes, at what
> point do we break even on performance? Does it depend on whether the
> connection is local or over a network? Does it depend on whether it's
> with or without SSL? Does it depend on Linux vs. Windows vs.
> whateverBSD? What happens if you twiddle the 8kB buffer size up or,
> say, down to just below the Ethernet frame size?

I feel like you're setting too high a bar for something that can be a
pretty clear improvement on its own, without a downside. The current behaviour
is pretty absurd, and doing all this research across all platforms isn't going to
disprove that - and it's a lot of work.  ISTM we can analyze this easily enough
without taking concrete hardware into account.


One thing that I haven't seen mentioned here that's relevant around using
small buffers: Postgres uses TCP_NODELAY and has to do so. That means doing
tiny sends can hurt substantially.


> I think that what we really want to understand here is under what
> circumstances the extra layer of buffering is a win vs. being a loss.

It's quite easy to see that doing no buffering isn't viable - we end up with
tiny tiny TCP packets, one for each send(). And then there's the syscall
overhead.


Here's a quickly thrown together benchmark using netperf. First with -D, which
instructs it to use TCP_NODELAY, as we do.

10gbit network, remote host:

$ (fields="request_size,throughput"; echo "$fields"; for i in $(seq 0 16); do s=$((2**$i)); netperf -P0 -t TCP_STREAM -l1 -H alap5-10gbe -- -r $s,$s -D 1 -o "$fields"; done) | column -t -s,

request_size  throughput
1             22.73
2             45.77
4             108.64
8             225.78
16            560.32
32            1035.61
64            2177.91
128           3604.71
256           5878.93
512           9334.70
1024          9031.13
2048          9405.35
4096          9334.60
8192          9275.33
16384         9406.29
32768         9385.52
65536         9399.40


localhost:
request_size  throughput
1             2.76
2             5.10
4             9.89
8             20.51
16            43.42
32            87.13
64            173.72
128           343.70
256           647.89
512           1328.79
1024          2550.14
2048          4998.06
4096          9482.06
8192          17130.76
16384         29048.02
32768         42106.33
65536         48579.95

I'm slightly baffled by the poor performance of localhost with tiny packet
sizes. Ah, I see - it's the NODELAY; without that:

localhost:
1             32.02
2             60.58
4             114.32
8             262.71
16            558.42
32            1053.66
64            2099.39
128           3815.60
256           6566.19
512           11751.79
1024          18976.11
2048          27222.99
4096          33838.07
8192          38219.60
16384         39146.37
32768         44784.98
65536         44214.70


NODELAY triggers many more context switches, because there's immediately data
available for the receiving side, whereas with a real network the interrupts
get coalesced.


I think that's pretty clear evidence that we need buffering.  But I think we
can probably be smarter than we are right now, and than what's been proposed
in the patch. Because of TCP_NODELAY we shouldn't send a tiny buffer on its
own; it may trigger sending a small TCP packet, which is quite inefficient.


While not perfect - e.g. because networks might use jumbo packets / large MTUs
and we don't know how many outstanding bytes there are locally, I think a
decent heuristic could be to always try to send at least one packet worth of
data at once (something like ~1400 bytes), even if that requires copying some
of the input data. It might not be sent on its own, but it should make it
reasonably unlikely to end up with tiny tiny packets.
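
A rough sketch of that direct-send rule, with ~1400 bytes as an assumed typical Ethernet payload size rather than a tested constant (internal_flush_buffer() again stands in for an assumed direct-send helper):

#define MIN_SEND_SIZE 1400      /* assumed ~one packet's payload */

static int
put_bytes_min_packet(const char *s, size_t len)
{
    size_t      buffered = PqSendPointer - PqSendStart;

    if (buffered > 0 && buffered < MIN_SEND_SIZE)
    {
        /* Top pending data up to at least one packet's worth. */
        size_t      fill = Min((size_t) MIN_SEND_SIZE - buffered, len);

        fill = Min(fill, (size_t) (PqSendBufferSize - PqSendPointer));
        memcpy(PqSendBuffer + PqSendPointer, s, fill);
        PqSendPointer += fill;
        s += fill;
        len -= fill;
    }

    if (internal_flush())       /* send what's buffered */
        return EOF;
    if (len > 0 && internal_flush_buffer(s, len))   /* rest directly */
        return EOF;
    return 0;
}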


Greetings,

Andres Freund



Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:
> While not perfect - e.g. because networks might use jumbo packets / large MTUs
> and we don't know how many outstanding bytes there are locally, I think a
> decent heuristic could be to always try to send at least one packet worth of
> data at once (something like ~1400 bytes), even if that requires copying some
> of the input data. It might not be sent on its own, but it should make it
> reasonably unlikely to end up with tiny tiny packets.

I think that COULD be a decent heuristic but I think it should be
TESTED, including against the ~3 or so other heuristics proposed on
this thread, before we make a decision.

I literally mentioned the Ethernet frame size as one of the things
that we should test whether it's relevant in the exact email to which
you're replying, and you replied by proposing that as a heuristic, but
also criticizing me for wanting more research before we settle on
something. Are we just supposed to assume that your heuristic is
better than the others proposed here without testing anything, or,
like, what? I don't think this needs to be a completely exhaustive or
exhausting process, but I think trying a few different things out and
seeing what happens is smart.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Thu, Feb 1, 2024 at 10:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:
> > While not perfect - e.g. because networks might use jumbo packets / large MTUs
> > and we don't know how many outstanding bytes there are locally, I think a
> > decent heuristic could be to always try to send at least one packet worth of
> > data at once (something like ~1400 bytes), even if that requires copying some
> > of the input data. It might not be sent on its own, but it should make it
> > reasonably unlikely to end up with tiny tiny packets.
>
> I think that COULD be a decent heuristic but I think it should be
> TESTED, including against the ~3 or so other heuristics proposed on
> this thread, before we make a decision.
>
> I literally mentioned the Ethernet frame size as one of the things
> that we should test whether it's relevant in the exact email to which
> you're replying, and you replied by proposing that as a heuristic, but
> also criticizing me for wanting more research before we settle on
> something. Are we just supposed to assume that your heuristic is
> better than the others proposed here without testing anything, or,
> like, what? I don't think this needs to be a completely exhaustive or
> exhausting process, but I think trying a few different things out and
> seeing what happens is smart.

There was probably a better way to phrase this email ... the sentiment
is sincere, but there was almost certainly a way of writing it that
didn't sound like I'm super-annoyed.

Apologies for that.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Andres Freund
Date:
Hi,

On 2024-02-01 15:02:57 -0500, Robert Haas wrote:
> On Thu, Feb 1, 2024 at 10:52 AM Robert Haas <robertmhaas@gmail.com> wrote:
> There was probably a better way to phrase this email ... the sentiment
> is sincere, but there was almost certainly a way of writing it that
> didn't sound like I'm super-annoyed.

NP - I could have phrased mine better as well...


> > On Wed, Jan 31, 2024 at 10:24 PM Andres Freund <andres@anarazel.de> wrote:
> > > While not perfect - e.g. because networks might use jumbo packets / large MTUs
> > > and we don't know how many outstanding bytes there are locally, I think a
> > > decent heuristic could be to always try to send at least one packet worth of
> > > data at once (something like ~1400 bytes), even if that requires copying some
> > > of the input data. It might not be sent on its own, but it should make it
> > > reasonably unlikely to end up with tiny tiny packets.
> >
> > I think that COULD be a decent heuristic but I think it should be
> > TESTED, including against the ~3 or so other heuristics proposed on
> > this thread, before we make a decision.
> >
> > I literally mentioned the Ethernet frame size as one of the things
> > that we should test whether it's relevant in the exact email to which
> > you're replying, and you replied by proposing that as a heuristic, but
> > also criticizing me for wanting more research before we settle on
> > something.

I mentioned the frame size thing because afaict nobody in the thread had
mentioned our use of TCP_NODELAY (which basically forces the kernel to send
out data immediately instead of waiting for further data to be sent). Without
that it'd be a lot less problematic to occasionally send data in small
increments inbetween larger sends. Nor would packet sizes be as relevant.


> > Are we just supposed to assume that your heuristic is better than the
> > others proposed here without testing anything, or, like, what? I don't
> > think this needs to be a completely exhaustive or exhausting process, but
> > I think trying a few different things out and seeing what happens is
> > smart.

I wasn't trying to say that my heuristic necessarily is better. What I was
trying to get at - and expressed badly - was that I doubt that testing can get
us all that far here. It's not too hard to test the effects of our buffering
with regards to syscall overhead, but once you actually take network effects
into account it gets quite hard. Bandwidth, latency, the specific network
hardware and operating systems involved all play a significant role. Given
how, uh, naive our current approach is, I think analyzing the situation from
first principles and then doing some basic validation of the results makes
more sense.

Separately, I think we shouldn't aim for perfect here. It's obviously
extremely inefficient to send a larger amount of data by memcpy()ing and
send()ing it in 8kB chunks. As mentioned by several folks upthread, we can
improve upon that without having worse behaviour than today.  Medium-long term
I suspect we're going to want to use asynchronous network interfaces, in
combination with zero-copy sending, which requires larger changes. Not that
relevant for things like query results, quite relevant for base backups etc.


It's perhaps also worth mentioning that the small send buffer isn't great for
SSL performance, the encryption overhead increases when sending in small
chunks.


I hacked up Melih's patch to send the pending data together with the first bit
of the large "to be sent" data and also added a patch to increase
SINK_BUFFER_LENGTH by 16x. With a 12GB database I tested the time for
  pg_basebackup -c fast -Ft --compress=none -Xnone -D - -d "$conn" > /dev/null

                       time via
test                   unix     tcp     tcp+ssl
master                 6.305s   9.436s  15.596s
master-larger-buffer   6.535s   9.453s  15.208s
patch                  5.900s   7.465s  13.634s
patch-larger-buffer    5.233s   5.439s  11.730s


The increase when using tcp is pretty darn impressive.  If I had remembered in
time to disable manifest checksums, the win would have been even bigger.


The bottleneck for SSL is that it still ends up with ~16kB sends, not sure
why.

Greetings,

Andres Freund



Re: Flushing large data immediately in pqcomm

From: Melih Mutlu
Date:
Hi hackers,

I did some experiments with this patch, after the previous discussions. This probably does not answer all questions, but I'd be happy to do more if needed.

First, I updated the patch according to what was suggested here [1]. PSA v2.
I tweaked the master branch a bit to not allow any buffering. I compared HEAD, this patch and no buffering at all.
I also added a simple GUC to control PqSendBufferSize; this change only allows modifying the buffer size and should not have any impact on performance.

I again ran the COPY TO STDOUT command and timed it. AFAIU COPY sends data row by row, and I tried running the command under different scenarios with different # of rows and row sizes. You can find the test script attached (see test.sh).
All timings are in ms.

1- row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 1036       │ 998  │ 940  │ 910  │ 894  │ 874  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 1107       │ 1032 │ 980  │ 957  │ 917  │ 909  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 6230       │ 6125 │ 6282 │ 6279 │ 6255 │ 6221 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘


2- row size = half of the rows are 1KB and the rest is 10KB, # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 25197      │ 23414 │ 20612 │ 19206 │ 18334 │ 18033 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 19843      │ 19889 │ 19798 │ 19129 │ 18578 │ 18260 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 23752      │ 23565 │ 23602 │ 23622 │ 23541 │ 23599 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3- row size = half of the rows are 1KB and the rest is 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 3137       │ 2937 │ 2687 │ 2551 │ 2456 │ 2465 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 2399       │ 2390 │ 2402 │ 2415 │ 2417 │ 2422 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 2417       │ 2414 │ 2429 │ 2418 │ 2435 │ 2404 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4- row size = all rows are 1MB, # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 6113       │ 5764 │ 5281 │ 5009 │ 4885 │ 4872 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 4759       │ 4754 │ 4754 │ 4758 │ 4782 │ 4805 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 4756       │ 4774 │ 4793 │ 4766 │ 4770 │ 4774 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘


Some quick observations:
1- Even though I expect the patch and HEAD to behave similarly in the case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
2- In cases where the data does not fit into the buffer, the patch starts performing better than HEAD. For example, in case 2, the patch seems faster until the buffer size exceeds the row size. When the buffer size is set to something larger than 10KB (16KB/32KB in this case), there is again a slight performance loss with the patch, as in case 1.
3- With large row sizes (i.e. sizes that do not fit into the buffer), not buffering at all starts performing better than HEAD. Similarly, the patch performs better too, as it stops buffering if data does not fit into the buffer.



[1]


Thanks,
--
Melih Mutlu
Microsoft
Attachments

Re: Flushing large data immediately in pqcomm

From: Robert Haas
Date:
On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> 1- Even though I expect the patch and HEAD to behave similarly in the case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.

I wonder why this happens. It seems like maybe something that could be fixed.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From: Jelte Fennema-Nio
Date:
On Thu, 14 Mar 2024 at 12:22, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I did some experiments with this patch, after previous discussions

One thing I noticed is that the buffer sizes don't seem to matter much
in your experiments, even though Andres's expectation was that 1400
would be better. I think I know the reason for that:

afaict from your test.sh script you connect psql over localhost or
maybe even a unix socket to postgres. Neither of those would have an
MTU of 1500. You'd probably want to do those tests over an actual
network or at least change the MTU of the loopback interface. e.g. my
"lo" interface mtu is 65536 by default:

❯ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever



Re: Flushing large data immediately in pqcomm

From: Heikki Linnakangas
Date:
On 14/03/2024 13:22, Melih Mutlu wrote:
> @@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
>              if (internal_flush())
>                  return EOF;
>          }
> -        amount = PqSendBufferSize - PqSendPointer;
> -        if (amount > len)
> -            amount = len;
> -        memcpy(PqSendBuffer + PqSendPointer, s, amount);
> -        PqSendPointer += amount;
> -        s += amount;
> -        len -= amount;
> +
> +        /*
> +         * If the buffer is empty and data length is larger than the buffer
> +         * size, send it without buffering. Otherwise, put as much data as
> +         * possible into the buffer.
> +         */
> +        if (!pq_is_send_pending() && len >= PqSendBufferSize)
> +        {
> +            int start = 0;
> +
> +            socket_set_nonblocking(false);
> +            if (internal_flush_buffer(s, &start, (int *)&len))
> +                return EOF;
> +        }
> +        else
> +        {
> +            amount = PqSendBufferSize - PqSendPointer;
> +            if (amount > len)
> +                amount = len;
> +            memcpy(PqSendBuffer + PqSendPointer, s, amount);
> +            PqSendPointer += amount;
> +            s += amount;
> +            len -= amount;
> +        }
>      }
> +
>      return 0;
>  }

Two small bugs:

- the "(int *) &len)" cast is not ok, and will break visibly on 
big-endian systems where sizeof(int) != sizeof(size_t).

- If internal_flush_buffer() cannot write all the data in one call, it 
updates 'start' for how much it wrote, and leaves 'end' unchanged. You 
throw the updated 'start' value away, and will send the same data again 
on next iteration.

Not a correctness issue, but instead of pq_is_send_pending(), I think it 
would be better to check "PqSendStart == PqSendPointer" directly, or 
call socket_is_send_pending() directly here. pq_is_send_pending() does 
the same, but it's at a higher level of abstraction.
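
For what it's worth, a rough sketch of how both fixes (plus the direct comparison) could look, assuming internal_flush_buffer()'s positions are changed to size_t as discussed below:

if (PqSendStart == PqSendPointer && len >= (size_t) PqSendBufferSize)
{
    size_t      start = 0;
    size_t      end = len;

    socket_set_nonblocking(false);

    /* keep going: internal_flush_buffer() advances 'start' on partial writes */
    while (start < end)
    {
        if (internal_flush_buffer(s, &start, &end))
            return EOF;
    }
    return 0;
}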

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Flushing large data immediately in pqcomm

From: Jelte Fennema-Nio
Date:
On Thu, 14 Mar 2024 at 13:12, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> > 1- Even though I expect the patch and HEAD to behave similarly in the case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
>
> I wonder why this happens. It seems like maybe something that could be fixed.

some wild guesses:
1. maybe it's the extra call overhead of the new internal_flush
implementation. What happens if you make that an inline function?
2. maybe swap these conditions around (the call seems heavier than a
simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

BTW, the improvements for the larger rows are awesome!



Re: Flushing large data immediately in pqcomm

From: David Rowley
Date:
On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Thu, 14 Mar 2024 at 13:12, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> > > 1- Even though I expect the patch and HEAD to behave similarly in the case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
> >
> > I wonder why this happens. It seems like maybe something that could be fixed.
>
> some wild guesses:
> 1. maybe it's the extra call overhead of the new internal_flush
> implementation. What happens if you make that an inline function?
> 2. maybe swap these conditions around (the call seems heavier than a
> simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize

I agree these are both worth trying.  For #2, I wonder if the
pq_is_send_pending() call is even worth checking at all. It seems to
me that the internal_flush_buffer() code will just do nothing if
nothing is pending.  Also, isn't there almost always going to be
something pending when the "len >= PqSendBufferSize" condition is met?
 We've just added the msgtype and number of bytes to the buffer which
is 5 bytes. If the previous message was also more than
PqSendBufferSize, then the buffer is likely to have 5 bytes due to the
previous flush, otherwise isn't it a 1 in 8192 chance that the buffer
is empty?

If that fails to resolve the regression, maybe it's worth memcpy()ing
enough bytes out of the message to fill the buffer then flush it and
check if we still have > PqSendBufferSize remaining and skip the
memcpy() for the rest.  That way there are no small flushes of just 5
bytes and only ever the possibility of reducing the flushes as no
pattern should cause the number of flushes to increase.
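
A sketch of that approach, using a simplified direct-send helper signature for brevity (illustrative, not the patch):

static int
put_large_message(const char *s, size_t len)
{
    size_t      fill = PqSendBufferSize - PqSendPointer;

    /* Top the buffer up and flush a full 8kB, never a 5-byte header alone. */
    if (fill > len)
        fill = len;
    memcpy(PqSendBuffer + PqSendPointer, s, fill);
    PqSendPointer += fill;
    s += fill;
    len -= fill;
    if (internal_flush())
        return EOF;

    /* If the remainder still exceeds the buffer, skip the memcpy(). */
    if (len >= (size_t) PqSendBufferSize)
        return internal_flush_buffer(s, len) ? EOF : 0;

    /* Otherwise buffer the tail as usual. */
    memcpy(PqSendBuffer + PqSendPointer, s, len);
    PqSendPointer += len;
    return 0;
}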

David



Re: Flushing large data immediately in pqcomm

From: David Rowley
Date:
On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> - the "(int *) &len)" cast is not ok, and will break visibly on
> big-endian systems where sizeof(int) != sizeof(size_t).

I think fixing this requires adjusting the signature of
internal_flush_buffer() to use size_t instead of int.   That also
means that PqSendStart and PqSendPointer must also become size_t, or
internal_flush() must add local size_t variables to pass to
internal_flush_buffer and assign these back again to the global after
the call.  Upgrading the globals might be the cleaner option.

David



Re: Flushing large data immediately in pqcomm

From: Melih Mutlu
Date:
On Thu, 21 Mar 2024 at 00:54, David Rowley <dgrowleyml@gmail.com> wrote:
> On Fri, 15 Mar 2024 at 02:03, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> >
> > On Thu, 14 Mar 2024 at 13:12, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Thu, Mar 14, 2024 at 7:22 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> > > > 1- Even though I expect the patch and HEAD to behave similarly in the case of small data (case 1: 100 bytes), the patch runs slightly slower than HEAD.
> > >
> > > I wonder why this happens. It seems like maybe something that could be fixed.
> >
> > some wild guesses:
> > 1. maybe it's the extra call overhead of the new internal_flush
> > implementation. What happens if you make that an inline function?
> > 2. maybe swap these conditions around (the call seems heavier than a
> > simple comparison): !pq_is_send_pending() && len >= PqSendBufferSize
>
> I agree these are both worth trying.  For #2, I wonder if the
> pq_is_send_pending() call is even worth checking at all. It seems to
> me that the internal_flush_buffer() code will just do nothing if
> nothing is pending.  Also, isn't there almost always going to be
> something pending when the "len >= PqSendBufferSize" condition is met?
> We've just added the msgtype and number of bytes to the buffer which
> is 5 bytes. If the previous message was also more than
> PqSendBufferSize, then the buffer is likely to have 5 bytes due to the
> previous flush, otherwise isn't it a 1 in 8192 chance that the buffer
> is empty?
>
> If that fails to resolve the regression, maybe it's worth memcpy()ing
> enough bytes out of the message to fill the buffer then flush it and
> check if we still have > PqSendBufferSize remaining and skip the
> memcpy() for the rest.  That way there are no small flushes of just 5
> bytes and only ever the possibility of reducing the flushes as no
> pattern should cause the number of flushes to increase.

In len > PqSendBufferSize cases, the buffer should be filled as much as possible if we're sure that it will be flushed at some point. Otherwise we might end up with small flushes. The cases where we're sure that the buffer will be flushed is when the buffer is not empty. If it's empty, there is no need to fill it unnecessarily as it might cause an additional flush. AFAIU from what you said, we shouldn't be worried about such a case since it's unlikely to have the buffer empty due to the first 5 bytes. I guess the only case where the buffer can be empty is when the buffer has PqSendBufferSize-5 bytes from previous messages and adding 5 bytes of the current message will flush the buffer. I'm not sure if removing the check may cause any regression in any case, but it's just there to be safe.

What if I do a simple comparison like PqSendStart == PqSendPointer, as Heikki suggested, instead of calling pq_is_send_pending()? Then this check should not hurt that much. Right? Does that make sense?

--
Melih Mutlu
Microsoft

Re: Flushing large data immediately in pqcomm

From: David Rowley
Date:
On Thu, 21 Mar 2024 at 13:24, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> What if I do a simple comparison like PqSendStart == PqSendPointer, as Heikki suggested, instead of calling pq_is_send_pending()? Then this check should not hurt that much. Right? Does that make sense?

As I understand the code, there's no problem calling
internal_flush_buffer() when the buffer is empty and I suspect that if
we're sending a few buffers with "len > PqSendBufferSize" that it's
just so unlikely that the buffer is empty that we should just do the
function call and let internal_flush_buffer() handle doing nothing if
the buffer really is empty.  I think the chances of
internal_flush_buffer() having to do exactly nothing here is less than
1 in 8192, so I just don't think the check is worthwhile.  The reason
I don't think the odds are exactly 1 in 8192 is because if we're
sending a large number of bytes then it will be common that the buffer
will contain exactly 5 bytes due to the previous flush and command
prefix just having been added.

It's worth testing both, however. I might be wrong. Performance is
hard to predict. It would be good to see your test.sh script run with
and without the PqSendStart == PqSendPointer condition.

David



Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Thu, 21 Mar 2024 at 01:45, David Rowley <dgrowleyml@gmail.com> wrote:
> As I understand the code, there's no problem calling
> internal_flush_buffer() when the buffer is empty and I suspect that if
> we're sending a few buffers with "len > PqSendBufferSize" that it's
> just so unlikely that the buffer is empty that we should just do the
> function call and let internal_flush_buffer() handle doing nothing if
> the buffer really is empty.  I think the chances of
> internal_flush_buffer() having to do exactly nothing here are less than
> 1 in 8192, so I just don't think the check is worthwhile.

I think you're missing the exact case that we're trying to improve
here: Calls to internal_putbytes with a very large len, e.g. 1MB.
With the new code the buffer will be empty ~50% of the time (not less
than 1 in 8192) with such large buffers, because of the flow that will
happen:

1. We check len > PqSendBufferSize. There are some bytes in the buffer,
e.g. the 5-byte message header. So we fill up the buffer, but have
many bytes left in len.
2. We loop again, because len is not 0.
3. We flush the buffer (at the top of the loop) because the buffer is full.
4. We check len > PqSendBufferSize. Now the buffer is empty, so we
call internal_flush_buffer directly

As you can see we check len > PqSendBufferSize twice (in steps 1 and
4), and 1 out of 2 times the buffer turns out to be empty.

To be clear, the code is done this way so our behaviour would only
ever be better than the status quo, and cause no regressions. For
instance, flushing the 5-byte header separately and then flushing the
full input buffer might result in more IP packets being sent in total
in some cases, due to our use of TCP_NODELAY.
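
To make the fast path concrete, here is a condensed sketch of the patched internal_putbytes() loop (simplified for illustration; not the exact patch text):

static int
internal_putbytes(const char *s, size_t len)
{
    while (len > 0)
    {
        /* If the buffer is full, flush it out first */
        if (PqSendPointer >= PqSendBufferSize)
        {
            socket_set_nonblocking(false);
            if (internal_flush())
                return EOF;
        }

        if (len >= PqSendBufferSize && PqSendStart == PqSendPointer)
        {
            /*
             * Buffer is empty and the data wouldn't fit in it anyway:
             * send it directly, skipping the memcpy() into PqSendBuffer.
             */
            size_t      start = 0;

            socket_set_nonblocking(false);
            if (internal_flush_buffer(s, &start, &len))
                return EOF;
            len = 0;            /* everything was written (blocking mode) */
        }
        else
        {
            /* Fill the buffer with as much data as possible */
            size_t      amount = PqSendBufferSize - PqSendPointer;

            if (amount > len)
                amount = len;
            memcpy(PqSendBuffer + PqSendPointer, s, amount);
            PqSendPointer += amount;
            s += amount;
            len -= amount;
        }
    }
    return 0;
}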



Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Thu, 21 Mar 2024 at 01:24, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> What if I do a simple comparison like PqSendStart == PqSendPointer instead of calling pq_is_send_pending()

Yeah, that sounds worth trying out. So the new suggestions to fix the
perf issues on small message sizes would be:

1. add "inline" to internal_flush function
2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
3. (optional) swap the order of PqSendStart == PqSendPointer and len
>= PqSendBufferSize



Re: Flushing large data immediately in pqcomm

From
David Rowley
Date:
On Thu, 21 Mar 2024 at 22:44, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Thu, 21 Mar 2024 at 01:45, David Rowley <dgrowleyml@gmail.com> wrote:
> > As I understand the code, there's no problem calling
> > internal_flush_buffer() when the buffer is empty and I suspect that if
> > we're sending a few buffers with "len > PqSendBufferSize" that it's
> > just so unlikely that the buffer is empty that we should just do the
> > function call and let internal_flush_buffer() handle doing nothing if
> > the buffer really is empty.  I think the chances of
> > internal_flush_buffer() having to do exactly nothing here are less than
> > 1 in 8192, so I just don't think the check is worthwhile.
>
> I think you're missing the exact case that we're trying to improve
> here: Calls to internal_putbytes with a very large len, e.g. 1MB.
> With the new code the buffer will be empty ~50% of the time (not less
> than 1 in 8192) with such large buffers, because the flow that will
> happen:

It was the code I misread. I understand what the aim is. I failed to
notice the while loop in internal_putbytes().  So what I mentioned
about trying to fill the buffer before flushing already happens.  I
now agree with the PqSendStart == PqSendPointer test.  I'd say, since
the reported regression was with 100-byte rows, that testing "len >=
PqSendBufferSize" before PqSendStart == PqSendPointer makes sense.

David



Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:


Heikki Linnakangas <hlinnaka@iki.fi> wrote on Thu, 14 Mar 2024 at 15:46:
On 14/03/2024 13:22, Melih Mutlu wrote:
> @@ -1282,14 +1283,32 @@ internal_putbytes(const char *s, size_t len)
>                       if (internal_flush())
>                               return EOF;
>               }
> -             amount = PqSendBufferSize - PqSendPointer;
> -             if (amount > len)
> -                     amount = len;
> -             memcpy(PqSendBuffer + PqSendPointer, s, amount);
> -             PqSendPointer += amount;
> -             s += amount;
> -             len -= amount;
> +
> +             /*
> +              * If the buffer is empty and data length is larger than the buffer
> +              * size, send it without buffering. Otherwise, put as much data as
> +              * possible into the buffer.
> +              */
> +             if (!pq_is_send_pending() && len >= PqSendBufferSize)
> +             {
> +                     int start = 0;
> +
> +                     socket_set_nonblocking(false);
> +                     if (internal_flush_buffer(s, &start, (int *)&len))
> +                             return EOF;
> +             }
> +             else
> +             {
> +                     amount = PqSendBufferSize - PqSendPointer;
> +                     if (amount > len)
> +                             amount = len;
> +                     memcpy(PqSendBuffer + PqSendPointer, s, amount);
> +                     PqSendPointer += amount;
> +                     s += amount;
> +                     len -= amount;
> +             }
>       }
> +
>       return 0;
>  }

Two small bugs:

- the "(int *) &len)" cast is not ok, and will break visibly on
big-endian systems where sizeof(int) != sizeof(size_t).

- If internal_flush_buffer() cannot write all the data in one call, it
updates 'start' for how much it wrote, and leaves 'end' unchanged. You
throw the updated 'start' value away, and will send the same data again
on next iteration.
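
To illustrate the first bug with a contrived example (not from the patch):

size_t len = 100;            /* 0x0000000000000064 on an LP64 system */
int   *p = (int *) &len;

/*
 * On little-endian hardware *p happens to alias the low-order four
 * bytes (100), so the bug stays hidden.  On big-endian hardware *p
 * aliases the high-order four bytes (0), and a write through p
 * updates the wrong half of len.
 */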

There are two possible options for internal_flush_buffer() in the internal_putbytes() case:
1- Write all the data and return 0. We don't need the start or end of the data in this case.
2- Cannot write all the data and return EOF. In this case internal_putbytes() also returns EOF immediately and does not really retry. There will be no next iteration.

If it were non-blocking, then we might need to keep the new value. But I think we do not need the updated start value in either case here. What do you think?

Thanks,
--
Melih Mutlu
Microsoft

Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:
Hi,

PSA v3.

Jelte Fennema-Nio <postgres@jeltef.nl> wrote on Thu, 21 Mar 2024 at 12:58:
On Thu, 21 Mar 2024 at 01:24, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> What if I do a simple comparison like PqSendStart == PqSendPointer instead of calling pq_is_send_pending()

Yeah, that sounds worth trying out. So the new suggestions to fix the
perf issues on small message sizes would be:

1. add "inline" to internal_flush function
2. replace pq_is_send_pending() with PqSendStart == PqSendPointer
3. (optional) swap the order of PqSendStart == PqSendPointer and len
>= PqSendBufferSize

I did all of the above changes and it seems like those resolved the regression issue. 
Since the previous results were with unix sockets, I'm sharing the v3 results over unix sockets here for comparison.
Sharing only the case where all messages are 100 bytes, since this was when the regression was most visible.  

row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 1106       │ 1006 │ 947  │ 920  │ 899  │ 888  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 1094       │ 997  │ 943  │ 913  │ 894  │ 881  │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 6389       │ 6195 │ 6214 │ 6271 │ 6325 │ 6211 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

David Rowley <dgrowleyml@gmail.com> wrote on Thu, 21 Mar 2024 at 00:57:
On Fri, 15 Mar 2024 at 01:46, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> - the "(int *) &len)" cast is not ok, and will break visibly on
> big-endian systems where sizeof(int) != sizeof(size_t).

I think fixing this requires adjusting the signature of
internal_flush_buffer() to use size_t instead of int.   That also
means that PqSendStart and PqSendPointer must also become size_t, or
internal_flush() must add local size_t variables to pass to
internal_flush_buffer and assign them back to the globals after
the call.  Upgrading the globals might be the cleaner option.

David

This is done too.
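
For readers skimming the thread, the affected declarations then look roughly like this (a sketch of the shape only, not the exact hunk):

static char    *PqSendBuffer;
static int      PqSendBufferSize;   /* allocated size of PqSendBuffer */
static size_t   PqSendPointer;      /* next index to store a byte */
static size_t   PqSendStart;        /* next index to send a byte */

static int internal_flush_buffer(const char *s, size_t *start, size_t *end);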

I actually tried to test it over a real network for a while. However, I couldn't get reliable-enough numbers with either HEAD or the patch due to network-related issues.
I've decided to go with Jelte's suggestion [1], which is decreasing the MTU of the loopback interface to 1500 and using localhost.

Here are the results:

1- row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 1351       │ 1233  │ 1074  │ 988   │ 944   │ 916   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 1369       │ 1232  │ 1073  │ 981   │ 928   │ 907   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 14949      │ 14533 │ 14791 │ 14864 │ 14612 │ 14751 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2-  row size = half of the rows are 1KB and rest is 10KB , # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 37212      │ 31372 │ 25520 │ 21980 │ 20311 │ 18864 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 23006      │ 23127 │ 23147 │ 22229 │ 20367 │ 19155 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 30725      │ 31090 │ 30917 │ 30796 │ 30984 │ 30813 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3-  row size = half of the rows are 1KB and rest is 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 4296       │ 3713 │ 3040 │ 2711 │ 2528 │ 2449 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 2401       │ 2411 │ 2404 │ 2374 │ 2395 │ 2408 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 2399       │ 2403 │ 2408 │ 2389 │ 2402 │ 2403 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4-  row size = all rows are 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 8335       │ 7370 │ 6017 │ 5368 │ 5009 │ 4843 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 4711       │ 4722 │ 4708 │ 4693 │ 4724 │ 4717 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 4704       │ 4712 │ 4746 │ 4728 │ 4709 │ 4730 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘
 

[1]

Thanks,
--
Melih Mutlu
Microsoft
Attachments

Re: Flushing large data immediately in pqcomm

From
David Rowley
Date:
On Fri, 22 Mar 2024 at 12:46, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I did all of the above changes and it seems like those resolved the regression issue.

Thanks for adjusting the patch.   The numbers do look better, but on
looking at your test.sh script from [1], I see:

meson setup --buildtype debug -Dcassert=true
--prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \

can you confirm if the test was done in debug with casserts on?   If
so, it would be much better to have asserts off and have
-Dbuildtype=release.

I'm planning to run some benchmarks tomorrow.   My thoughts are that
the patch allows the memcpy() to be skipped without adding any
additional buffer flushes and demonstrates a good performance increase
in various scenarios from doing so.  I think that is a satisfactory
goal. If I don't see any issues from reviewing and benchmarking
tomorrow, I'd like to commit this.

Robert, I understand you'd like a bit more from this patch. I'm
wondering if you're planning on blocking another committer from going
ahead with this? Or if you have a reason why the current state of the
patch is not a meaningful enough improvement that would justify
possibly not getting any improvements in this area for PG17?

David

[1] https://www.postgresql.org/message-id/CAGPVpCSX8bTF61ZL9jOgh1AaY3bgsWnQ6J7WmJK4TV0f2LPnJQ%40mail.gmail.com



Re: Flushing large data immediately in pqcomm

From
Robert Haas
Date:
On Wed, Mar 27, 2024 at 7:39 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Robert, I understand you'd like a bit more from this patch. I'm
> wondering if you're planning on blocking another committer from going
> ahead with this? Or if you have a reason why the current state of the
> patch is not a meaningful enough improvement that would justify
> possibly not getting any improvements in this area for PG17?

So, I think that the first version of the patch, when it got a big
chunk of data, would just flush whatever was already in the buffer and
then send the rest without copying. The current version, as I
understand it, only does that if the buffer is empty; otherwise, it
copies as much data as it can into the partially-filled buffer. I
think that change addresses most of my concern about the approach; the
old way could, I believe, lead to an increased total number of flushes
with the right usage pattern, but I don't believe that's possible with
the revised approach. I do kind of wonder whether there is some more
fine-tuning of the approach that would improve things further, but I
realize that we have very limited time to figure this out, and there's
no sense letting the perfect be the enemy of the good.

So in short... no, I don't have big concerns at this point. Melih's
latest benchmarks look fairly promising to me, too.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:


On Wed, Mar 27, 2024 at 14:39 David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 22 Mar 2024 at 12:46, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I did all of the above changes and it seems like those resolved the regression issue.

Thanks for adjusting the patch.   The numbers do look better, but on
looking at your test.sh script from [1], I see:

meson setup --buildtype debug -Dcassert=true
--prefix="$DESTDIR/usr/local/pgsql" $DESTDIR && \

can you confirm if the test was done in debug with casserts on?   If
so, it would be much better to have asserts off and have
-Dbuildtype=release.

Yes, previous numbers were with --buildtype debug -Dcassert=true. I can share new numbers with a release build and asserts off soon.

Thanks, 
Melih

Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:


On Wed, Mar 27, 2024 at 18:54 Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Mar 27, 2024 at 7:39 AM David Rowley <dgrowleyml@gmail.com> wrote:
> Robert, I understand you'd like a bit more from this patch. I'm
> wondering if you're planning on blocking another committer from going
> ahead with this? Or if you have a reason why the current state of the
> patch is not a meaningful enough improvement that would justify
> possibly not getting any improvements in this area for PG17?

So, I think that the first version of the patch, when it got a big
chunk of data, would just flush whatever was already in the buffer and
then send the rest without copying.

Correct.

The current version, as I
understand it, only does that if the buffer is empty; otherwise, it
copies as much data as it can into the partially-filled buffer.

Yes, currently it should fill and flush the buffer first, if it's not already empty. Only then does it send the rest without copying.

Thanks,
Melih

Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:
Hi,

Melih Mutlu <m.melihmutlu@gmail.com> wrote on Thu, 28 Mar 2024 at 22:44:
>
> On Wed, Mar 27, 2024 at 14:39 David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Fri, 22 Mar 2024 at 12:46, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>> can you confirm if the test was done in debug with casserts on?   If
>> so, it would be much better to have asserts off and have
>> -Dbuildtype=release.
>
>
> Yes, previous numbers were with --buildtype debug -Dcassert=true. I can share new numbers with release build and asserts off soon.

While testing the patch without --buildtype debug -Dcassert=true, I felt like there was still a slight regression. I changed internal_flush() to an inline function; results look better this way.


1- row size = 100 bytes, # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 861        │ 765   │ 612   │ 521   │ 477   │ 480   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 869        │ 766   │ 612   │ 519   │ 482   │ 467   │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 13978      │ 13746 │ 13909 │ 13956 │ 13920 │ 13895 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

2-  row size = half of the rows are 1KB and rest is 10KB , # of rows = 1000000
┌───────────┬────────────┬───────┬───────┬───────┬───────┬───────┐
│           │ 1400 bytes │ 2KB   │ 4KB   │ 8KB   │ 16KB  │ 32KB  │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ HEAD      │ 30195      │ 26455 │ 17338 │ 14562 │ 12844 │ 11652 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ patch     │ 14744      │ 15830 │ 15697 │ 14273 │ 12794 │ 11652 │
├───────────┼────────────┼───────┼───────┼───────┼───────┼───────┤
│ no buffer │ 24054      │ 23992 │ 24162 │ 23951 │ 23901 │ 23925 │
└───────────┴────────────┴───────┴───────┴───────┴───────┴───────┘

3-  row size = half of the rows are 1KB and rest is 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 3546       │ 3029 │ 2373 │ 2032 │ 1873 │ 1806 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 1715       │ 1723 │ 1724 │ 1731 │ 1729 │ 1709 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 1749       │ 1748 │ 1742 │ 1744 │ 1757 │ 1744 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

4-  row size = all rows are 1MB , # of rows = 1000
┌───────────┬────────────┬──────┬──────┬──────┬──────┬──────┐
│           │ 1400 bytes │ 2KB  │ 4KB  │ 8KB  │ 16KB │ 32KB │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ HEAD      │ 7089       │ 5987 │ 4697 │ 4048 │ 3737 │ 3523 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ patch     │ 3438       │ 3411 │ 3400 │ 3416 │ 3399 │ 3429 │
├───────────┼────────────┼──────┼──────┼──────┼──────┼──────┤
│ no buffer │ 3432       │ 3432 │ 3416 │ 3424 │ 3378 │ 3429 │
└───────────┴────────────┴──────┴──────┴──────┴──────┴──────┘

Thanks,
--
Melih Mutlu
Microsoft
Attachments

Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Thu, 4 Apr 2024 at 13:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I changed internal_flush() to an inline function; results look better this way.

It seems you also changed internal_flush_buffer to be inline (but only
in the actual function definition, not the declaration at the top). I
don't think inlining internal_flush_buffer should be necessary to
avoid the perf regressions; internal_flush is adding extra
indirection compared to master and is only a single line, so that one
makes sense to inline.

Other than that the code looks good to me.

The new results look great.

One thing that is quite interesting about these results is that
increasing the buffer size results in even better performance (by
quite a bit). I don't think we can easily choose a perfect number, as
it seems to be a trade-off between memory usage and perf. But allowing
people to configure it through a GUC like in your second patchset
would be quite useful I think, especially because larger buffers could
be configured for connections that would benefit most from it (e.g.
replication connections or big COPYs).

I think your add-pq_send_buffer_size-GUC.patch is essentially what we
would need there but it would need some extra changes to actually be
merge-able:
1. needs docs
2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
maybe also remove the PQ_ prefix)
3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
after initial allocation
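
For illustration, a hypothetical guc_tables.c entry along those lines could look like the following (every name, group and limit here is an assumption, and per point 3 it would presumably have to be PGC_POSTMASTER unless grow/shrink logic is added):

{
    {"pq_send_buffer_size", PGC_POSTMASTER, CLIENT_CONN_OTHER,
        gettext_noop("Sets the size of the buffer used for sending data to the client."),
        NULL,
        GUC_UNIT_BYTE
    },
    &pq_send_buffer_size,       /* hypothetical backing variable */
    8192, 1024, INT_MAX,        /* boot/min/max are placeholder values */
    NULL, NULL, NULL
},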



Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:


Jelte Fennema-Nio <postgres@jeltef.nl> wrote on Thu, 4 Apr 2024 at 16:34:
On Thu, 4 Apr 2024 at 13:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> I changed internal_flush() to an inline function; results look better this way.

It seems you also changed internal_flush_buffer to be inline (but only
in the actual function definition, not the declaration at the top). I
don't think inlining internal_flush_buffer should be necessary to
avoid the perf regressions; internal_flush is adding extra
indirection compared to master and is only a single line, so that one
makes sense to inline.

Right. It was a mistake, forgot to remove that. Fixed it in v5.

 
Other than that the code looks good to me.

The new results look great.

One thing that is quite interesting about these results is that
increasing the buffer size results in even better performance (by
quite a bit). I don't think we can easily choose a perfect number, as
it seems to be a trade-off between memory usage and perf. But allowing
people to configure it through a GUC like in your second patchset
would be quite useful I think, especially because larger buffers could
be configured for connections that would benefit most from it (e.g.
replication connections or big COPYs).

I think your add-pq_send_buffer_size-GUC.patch is essentially what we
would need there but it would need some extra changes to actually be
merge-able:
1. needs docs
2. rename PQ_SEND_BUFFER_SIZE (at least make it not UPPER_CASE, but
maybe also remove the PQ_ prefix)
3. It's marked as PGC_USERSET, but there's no logic to grow/shrink it
after initial allocation

I agree that the GUC patch requires more work to be in good shape. I created that for testing purposes. But if we decide to make the buffer size customizable, then I'll start polishing up that patch and address your suggestions.

One case that could benefit from increased COPY performance is the table sync phase of logical replication. It might make sense to let users configure the buffer size to speed up table sync. I'm not sure what kind of problems this GUC would bring though.

Thanks,
--
Melih Mutlu
Microsoft
Attachments

Re: Flushing large data immediately in pqcomm

From
David Rowley
Date:
On Fri, 5 Apr 2024 at 03:28, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
> Jelte Fennema-Nio <postgres@jeltef.nl> wrote on Thu, 4 Apr 2024 at 16:34:
>>
>> On Thu, 4 Apr 2024 at 13:08, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>> > I changed internal_flush() to an inline function; results look better this way.
>>
>> It seems you also change internal_flush_buffer to be inline (but only
>> in the actual function definition, not declaration at the top). I
>> don't think inlining internal_flush_buffer should be necessary to
>> avoid the perf regressions, i.e. internal_flush is adding extra
>> indirection compared to master and is only a single line, so that one
>> makes sense to inline.
>
> Right. It was a mistake, forgot to remove that. Fixed it in v5.

I don't see any issues with v5, so based on the performance numbers
shown on this thread for the latest patch, it would make sense to push
it.  The problem is, I just can't recreate the performance numbers.

I've tried both on my AMD 3990x machine and an Apple M2 with a script
similar to the test.sh from above.  I mostly just stripped out the
buffer size stuff and adjusted the timing code to something that would
work on macOS.

The script runs each copy 30 times and takes the average time,
reported here in seconds.

With AMD 3990x:

master
Run 100 100 5000000: 1.032264113 sec
Run 1024 10240 200000: 1.016229105 sec
Run 1024 1048576 2000: 1.242267116 sec
Run 1048576 1048576 1000: 1.245425089 sec

v5
Run 100 100 5000000: 1.068543053 sec
Run 1024 10240 200000: 1.026298571 sec
Run 1024 1048576 2000: 1.231169669 sec
Run 1048576 1048576 1000: 1.236355567 sec

With the M2 mini:

master
Run 100 100 5000000: 1.167851249 sec
Run 1024 10240 200000: 1.962466987 sec
Run 1024 1048576 2000: 2.052836275 sec
Run 1048576 1048576 1000: 2.057908066 sec

v5
Run 100 100 5000000: 1.149636571 sec
Run 1024 10240 200000: 2.158487741 sec
Run 1024 1048576 2000: 2.046627068 sec
Run 1048576 1048576 1000: 2.039329068 sec

From looking at the perf reports, the top function is:

  57.62%  postgres   [.] CopyAttributeOutText

I messed around with trying to speed up the string escaping in that
function with the attached hacky patch and got the following on the
AMD 3990x machine:

CopyAttributeOutText_speedup.patch.txt
Run 100 100 5000000: 0.821673910
Run 1024 10240 200000: 0.546632147
Run 1024 1048576 2000: 0.848492694
Run 1048576 1048576 1000: 0.840870293

I don't think we could actually do this unless we modified the output
function API to have it somehow output the number of bytes. The patch
may look beyond the NUL byte with pg_lfind8, which I don't think is
safe.

Does anyone else want to try the attached script on the v5 patch to
see if their numbers are better?

David

Attachments

Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Sat, 6 Apr 2024 at 03:34, David Rowley <dgrowleyml@gmail.com> wrote:
> Does anyone else want to try the attached script on the v5 patch to
> see if their numbers are better?

On my machine (i9-10900X, in Ubuntu 22.04 on WSL on Windows) v5
consistently beats master by ~0.25 seconds:

master:
Run 100 100 5000000: 1.948975205
Run 1024 10240 200000: 3.039986587
Run 1024 1048576 2000: 2.444176276
Run 1048576 1048576 1000: 2.475328596

v5:
Run 100 100 5000000: 1.997170909
Run 1024 10240 200000: 3.057802598
Run 1024 1048576 2000: 2.199449857
Run 1048576 1048576 1000: 2.210328762

The first two runs are pretty much equal; I ran your script a few
more times and this seems like just random variance (sometimes v5 wins
those, sometimes master does, always quite close to each other). But
the last two runs v5 consistently wins.

Weird that on your machines you don't see a difference. Are you sure
you didn't make a silly mistake, like not restarting postgres or
something?



Re: Flushing large data immediately in pqcomm

From
David Rowley
Date:
On Sat, 6 Apr 2024 at 23:17, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
> Weird that on your machines you don't see a difference. Are you sure
> you didn't make a silly mistake, like not restarting postgres or
> something?

I'm sure. I spent quite a long time trying on both the AMD and the Apple M2.

I did see the same regression as you on the smaller numbers.  I
experimented with the attached, which macro-ifies internal_flush() and
marks internal_flush_buffer as pg_noinline.

Can you try that to see if it gets rid of the regression on the first two tests?

David

Attachments

Re: Flushing large data immediately in pqcomm

From
Andres Freund
Date:
Hi,

On 2024-04-06 14:34:17 +1300, David Rowley wrote:
> I don't see any issues with v5, so based on the performance numbers
> shown on this thread for the latest patch, it would make sense to push
> it.  The problem is, I just can't recreate the performance numbers.
>
> I've tried both on my AMD 3990x machine and an Apple M2 with a script
> similar to the test.sh from above.  I mostly just stripped out the
> buffer size stuff and adjusted the timing code to something that would
> work on macOS.

I think there are a few issues with the test script leading to not seeing a
gain:

1) I think using the textual protocol, with the text datatype, will make it
   harder to spot differences. That's a lot of overhead.

2) Afaict the test is connecting over the unix socket, I think we expect
   bigger wins for tcp

3) Particularly the larger string is bottlenecked due to pglz compression in
   toast.


Where I had noticed the overhead of the current approach badly was streaming
out basebackups, which is all binary, of course.


I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket and
localhost. I also reduced row counts and iteration counts, because I am
impatient, and I don't think it matters much here. Attached is the modified
version.


On a dual xeon Gold 5215, turbo boost disabled, server pinned to one core,
script pinned to another:


unix:

master:
Run 100 100 1000000: 0.058482377
Run 1024 10240 100000: 0.120909810
Run 1024 1048576 2000: 0.153027916
Run 1048576 1048576 1000: 0.154953512

v5:
Run 100 100 1000000: 0.058760126
Run 1024 10240 100000: 0.118831396
Run 1024 1048576 2000: 0.124282503
Run 1048576 1048576 1000: 0.123894962


localhost:

master:
Run 100 100 1000000: 0.067088000
Run 1024 10240 100000: 0.170894273
Run 1024 1048576 2000: 0.230346632
Run 1048576 1048576 1000: 0.230336078

v5:
Run 100 100 1000000: 0.067144036
Run 1024 10240 100000: 0.167950948
Run 1024 1048576 2000: 0.135167027
Run 1048576 1048576 1000: 0.135347867


The perf difference for 1MB via TCP is really impressive.

The small regression for small results is still kinda visible, I haven't yet
tested the patch downthread.

Greetings,

Andres Freund

Attachments

Re: Flushing large data immediately in pqcomm

From
Ranier Vilela
Date:
Hi,

On Fri, 5 Apr 2024 at 03:28, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> Right. It was a mistake, forgot to remove that. Fixed it in v5.

If you don't mind, I have some suggestions for patch v5.

1.  Shouldn't PqSendBufferSize be of type size_t?
There are several comparisons with other size_t variables.
static size_t PqSendBufferSize; /* Size send buffer */

I think this would prevent possible overflows.

2. If PqSendBufferSize is changed to size_t, then in the function
socket_putmessage_noblock, the variable whose name is *required* should
be changed to size_t as well.

static void
socket_putmessage_noblock(char msgtype, const char *s, size_t len)
{
int res PG_USED_FOR_ASSERTS_ONLY;
size_t required;

3. In the internal_putbytes function, the *amount* variable could
have its scope safely reduced.

else
{
size_t amount;

amount = PqSendBufferSize - PqSendPointer;

4. In the function internal_flush_buffer, the variables named
*bufptr* and *bufend* could be const char * type, like:

static int
internal_flush_buffer(const char *s, size_t *start, size_t *end)
{
static int last_reported_send_errno = 0;

const char *bufptr = s + *start;
const char *bufend = s + *end;

best regards,
Ranier Vilela

Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Sat, 6 Apr 2024 at 22:21, Andres Freund <andres@anarazel.de> wrote:
> The small regression for small results is still kinda visible, I haven't yet
> tested the patch downthread.

Thanks a lot for the faster test script, I'm also impatient. I still
saw the small regression with David's patch. Here's a v6 where I
think it is now gone. I added inline to internal_putbytes too. I
think that helped especially because for two calls to
internal_putbytes len is a constant (1 and 4) that is smaller than
PqSendBufferSize. So for those calls the compiler can now statically
eliminate the new codepath because "len >= PqSendBufferSize" is known
to be false at compile time.

Also I incorporated all of Ranier's comments.
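
For context, the two constant-len calls are the header writes in socket_putmessage(), roughly (simplified from pqcomm.c):

/* message type: len is the compile-time constant 1 */
if (internal_putbytes(&msgtype, 1))
    goto fail;

/* 4-byte length word: len is the compile-time constant 4 */
n32 = pg_hton32((uint32) len + 4);
if (internal_putbytes((char *) &n32, 4))
    goto fail;

/* only the payload call can take the new direct-send path */
if (internal_putbytes(s, len))
    goto fail;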

Attachments

Re: Flushing large data immediately in pqcomm

From
Andres Freund
Date:
Hi,

On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:
> On Sat, 6 Apr 2024 at 22:21, Andres Freund <andres@anarazel.de> wrote:
> > The small regression for small results is still kinda visible, I haven't yet
> > tested the patch downthread.
> 
> Thanks a lot for the faster test script, I'm also impatient. I still
> saw the small regression with David's patch. Here's a v6 where I
> think it is now gone. I added inline to internal_putbytes too. I
> think that helped especially because for two calls to
> internal_putbytes len is a constant (1 and 4) that is smaller than
> PqSendBufferSize. So for those calls the compiler can now statically
> eliminate the new codepath because "len >= PqSendBufferSize" is known
> to be false at compile time.

Nice.


> Also I incorporated all of Ranier's comments.

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.

Greetings,

Andres Freund



Re: Flushing large data immediately in pqcomm

From
David Rowley
Date:
On Sun, 7 Apr 2024 at 08:21, Andres Freund <andres@anarazel.de> wrote:
> I added WITH BINARY, SET STORAGE EXTERNAL and tested both unix socket and
> localhost. I also reduced row counts and iteration counts, because I am
> impatient, and I don't think it matters much here. Attached is the modified
> version.

Thanks for the script.  I'm able to reproduce the speedup with it.

I looked over the patch again and ended up making internal_flush an
inline function rather than a macro.  I compared the assembly produced
from each and it's the same with the exception of the label names
being different.

I've now pushed the patch.

One thing that does not seem ideal is having to cast away the
const-ness of the buffer in internal_flush_buffer().  Previously this
wasn't an issue as we always copied the buffer and passed that to
secure_write().  I wonder if it's worth seeing if we can keep this
buffer constant all the way to the socket write.

That seems to require modifying the following function signatures:
secure_write(), be_tls_write(), be_gssapi_write().  That's not an area
I'm familiar with, however.
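
For reference, the signature changes being contemplated would be roughly (a sketch; the "before" signatures are the current ones):

/* before */
ssize_t secure_write(Port *port, void *ptr, size_t len);
ssize_t be_tls_write(Port *port, void *ptr, size_t len, int *waitfor);
ssize_t be_gssapi_write(Port *port, void *ptr, size_t len);

/* after: the buffer stays const all the way to the socket write */
ssize_t secure_write(Port *port, const void *ptr, size_t len);
ssize_t be_tls_write(Port *port, const void *ptr, size_t len, int *waitfor);
ssize_t be_gssapi_write(Port *port, const void *ptr, size_t len);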

David



Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Sun, 7 Apr 2024 at 03:39, Andres Freund <andres@anarazel.de> wrote:
> Changing the global vars to size_t seems mildly bogus to me. All it's
> achieving is to use slightly more memory. It also just seems unrelated to the
> change.

I took a closer look at this. I agree that changing PqSendBufferSize
to size_t is unnecessary: given the locations that it is used I see no
risk of overflow anywhere. Changing the type of PqSendPointer and
PqSendStart is needed though, because (as described by Heikki and David
upthread) the argument type of internal_flush_buffer is size_t*. So if
you actually pass int* there, and the sizes are not the same then you
will start writing out of bounds. And because internal_flush_buffer is
introduced in this patch, it is related to this change.

This is what David just committed too.

However, the "required" var actually should be of type size_t to avoid
overflow if len is larger than int even without this change. So
attached is a tiny patch that does that.

Attachments

Re: Flushing large data immediately in pqcomm

From
David Rowley
Date:
On Sun, 7 Apr 2024 at 22:05, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
>
> On Sun, 7 Apr 2024 at 03:39, Andres Freund <andres@anarazel.de> wrote:
> > Changing the global vars to size_t seems mildly bogus to me. All it's
> > achieving is to use slightly more memory. It also just seems unrelated to the
> > change.
>
> I took a closer look at this. I agree that changing PqSendBufferSize
> to size_t is unnecessary: given the locations that it is used I see no
> risk of overflow anywhere. Changing the type of PqSendPointer and
> PqSendStart is needed though, because (as described by Heikki and David
> upthread) the argument type of internal_flush_buffer is size_t*. So if
> you actually pass int* there, and the sizes are not the same then you
> will start writing out of bounds. And because internal_flush_buffer is
> introduced in this patch, it is related to this change.
>
> This is what David just committed too.
>
> However, the "required" var actually should be of type size_t to avoid
> overflow if len is larger than int even without this change. So
> attached is a tiny patch that does that.

Looking at the code in socket_putmessage_noblock(), I don't understand
why it's ok for PqSendBufferSize to be int but "required" must be
size_t.  There's a line that does "PqSendBufferSize = required;". It
kinda looks like they both should be size_t.  Am I missing something
that you've thought about?

David



Re: Flushing large data immediately in pqcomm

From
Ranier Vilela
Date:
On Sat, 6 Apr 2024 at 22:39, Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2024-04-07 00:45:31 +0200, Jelte Fennema-Nio wrote:
> On Sat, 6 Apr 2024 at 22:21, Andres Freund <andres@anarazel.de> wrote:
> > The small regression for small results is still kinda visible, I haven't yet
> > tested the patch downthread.
>
> Thanks a lot for the faster test script, I'm also impatient. I still
> saw the small regression with David's patch. Here's a v6 where I
> think it is now gone. I added inline to internal_putbytes too. I
> think that helped especially because for two calls to
> internal_putbytes len is a constant (1 and 4) that is smaller than
> PqSendBufferSize. So for those calls the compiler can now statically
> eliminate the new codepath because "len >= PqSendBufferSize" is known
> to be false at compile time.

Nice.


> Also I incorporated all of Ranier's comments.

Changing the global vars to size_t seems mildly bogus to me. All it's
achieving is to use slightly more memory. It also just seems unrelated to the
change.
I don't agree with this thought.
Actually size_t uses 4 bytes more memory than int, right.
But mixing up int and size_t is a sure way to write non-portable code.
And the compilers will start showing messages such as "signed/unsigned mismatch".

The global vars PqSendPointer and PqSendStart were changed in the v5 patch, 
so for the sake of style and consistency, I understand that it is better not to mix the types.

The compiler will promote PqSendBufferSize to size_t in all comparisons.

And finally the correct type to deal with char * variables is size_t.

Best regards,
Ranier Vilela


Greetings,

Andres Freund

Re: Flushing large data immediately in pqcomm

From
Melih Mutlu
Date:
David Rowley <dgrowleyml@gmail.com> wrote on Sat, 6 Apr 2024 at 04:34:
Does anyone else want to try the attached script on the v5 patch to
see if their numbers are better?

I'm seeing the below results with your script on my machine. I ran it several times; results were similar each time.

master:
Run 100 100 5000000: 1.627905512
Run 1024 10240 200000: 1.603231684
Run 1024 1048576 2000: 2.962812352
Run 1048576 1048576 1000: 2.940766748

v5:
Run 100 100 5000000: 1.611508155
Run 1024 10240 200000: 1.603505596
Run 1024 1048576 2000: 2.727241937
Run 1048576 1048576 1000: 2.721268988

 

Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Sun, 7 Apr 2024 at 14:41, David Rowley <dgrowleyml@gmail.com> wrote:
> Looking at the code in socket_putmessage_noblock(), I don't understand
> why it's ok for PqSendBufferSize to be int but "required" must be
> size_t.  There's a line that does "PqSendBufferSize = required;". It
> kinda looks like they both should be size_t.  Am I missing something
> that you've thought about?


You and Ranier are totally right (I missed this assignment). Attached is v8.
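
With both variables as size_t, the function would look roughly like this (a sketch that assumes PqSendBufferSize also becomes size_t, as discussed; not the exact v8 hunk):

static void
socket_putmessage_noblock(char msgtype, const char *s, size_t len)
{
    int     res PG_USED_FOR_ASSERTS_ONLY;
    size_t  required;

    /*
     * Ensure the output buffer can hold the message plus the 5-byte
     * header (msgtype + length word), growing it if necessary.  With
     * required and PqSendBufferSize both size_t, a len wider than int
     * no longer truncates or overflows here.
     */
    required = PqSendPointer + 1 + 4 + len;
    if (required > PqSendBufferSize)
    {
        PqSendBuffer = repalloc(PqSendBuffer, required);
        PqSendBufferSize = required;
    }
    res = pq_putmessage(msgtype, s, len);
    Assert(res == 0);           /* should not fail when the message fits */
}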

Attachments

Re: Flushing large data immediately in pqcomm

From
Ranier Vilela
Date:

On Mon, 8 Apr 2024 at 07:42, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Sun, 7 Apr 2024 at 14:41, David Rowley <dgrowleyml@gmail.com> wrote:
> Looking at the code in socket_putmessage_noblock(), I don't understand
> why it's ok for PqSendBufferSize to be int but "required" must be
> size_t.  There's a line that does "PqSendBufferSize = required;". It
> kinda looks like they both should be size_t.  Am I missing something
> that you've thought about?


You and Ranier are totally right (I missed this assignment). Attached is v8.
+1
LGTM.

best regards,
Ranier Vilela

Re: Flushing large data immediately in pqcomm

From
Jelte Fennema-Nio
Date:
On Sun, 7 Apr 2024 at 11:34, David Rowley <dgrowleyml@gmail.com> wrote:
> That seems to require modifying the following function signatures:
> secure_write(), be_tls_write(), be_gssapi_write().  That's not an area
> I'm familiar with, however.

Attached is a new patchset where 0003 does exactly that. The only
place where we need to cast to non-const is for GSS, but that seems
fine (commit message has more details).

I also added patch 0002, which is a small addition to the function
comment of internal_flush_buffer that seemed useful to me to
differentiate it from internal_flush (feel free to ignore/rewrite).

Attachments

Re: Flushing large data immediately in pqcomm

From
Ranier Vilela
Date:
On Mon, 8 Apr 2024 at 09:27, Jelte Fennema-Nio <postgres@jeltef.nl> wrote:
On Sun, 7 Apr 2024 at 11:34, David Rowley <dgrowleyml@gmail.com> wrote:
> That seems to require modifying the following function signatures:
> secure_write(), be_tls_write(), be_gssapi_write().  That's not an area
> I'm familiar with, however.

Attached is a new patchset where 0003 does exactly that. The only
place where we need to cast to non-const is for GSS, but that seems
fine (commit message has more details).
+1.
Looks ok to me.
The GSS pointer *ptr is already cast to char * where it is needed,
so the code is already correct.

best regards,
Ranier Vilela