Discussion: BUG #13484: Performance problem with logical decoding
The following bug has been logged on the website:

Bug reference: 13484
Logged by: Olivier Gosseaume
Email address: olivier.gosseaume@free.fr
PostgreSQL version: 9.4.4
Operating system: Windows 7 64 bits (dev system)

Description:

Problem: when a transaction involves more than 4095 operations, logical
decoding on the receiver end becomes very, very slow.

Repro:
Open two psql sessions, 1 and 2.

On session 1:
Prepare:
CREATE TABLE data(id serial primary key, data text);

On session 2 (receiver):
Prepare:
\timing on
SELECT pg_create_logical_replication_slot('my_slot','test_decoding');

Run:
On session 1: insert into data (data) values (generate_series(1,4095));
On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
--> returns in 80 ms (plenty fast)

On session 1: insert into data (data) values (generate_series(1,4095));
--> repeat this 10 times to insert 40950 rows
On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
--> returns in 380 ms (plenty fast)

Now the problem:
On session 1: insert into data (data) values (generate_series(1,4096));
On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
--> returns in 4204 ms (i.e. 52x slower than "normal")

On session 1: insert into data (data) values (generate_series(1,40950));
On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
--> returns in 34998 ms (i.e. 92x slower than "normal")
On July 3, 2015 11:02:17 AM GMT+02:00, olivier.gosseaume@free.fr wrote:
>The following bug has been logged on the website:
>
>Bug reference: 13484
>Logged by: Olivier Gosseaume
>Email address: olivier.gosseaume@free.fr
>PostgreSQL version: 9.4.4
>Operating system: Windows 7 64 bits (dev system)
>Description:
>
>Problem: when a transaction involves more than 4095 operations, logical
>decoding on the receiver end becomes very, very slow.

Around 4096 changes, transactions are getting spilled to disk.

>Repro:
>Open two psql sessions, 1 and 2.
>
>On session 1:
>Prepare:
>CREATE TABLE data(id serial primary key, data text);
>
>On session 2 (receiver):
>Prepare:
>\timing on
>SELECT pg_create_logical_replication_slot('my_slot','test_decoding');
>
>Run:
>On session 1: insert into data (data) values (generate_series(1,4095));
>On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
>--> returns in 80 ms (plenty fast)
>
>On session 1: insert into data (data) values (generate_series(1,4095));
>--> repeat this 10 times to insert 40950 rows
>On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
>--> returns in 380 ms (plenty fast)

That's less than linear growth...

>Now the problem:
>On session 1: insert into data (data) values (generate_series(1,4096));
>On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
>--> returns in 4204 ms (i.e. 52x slower than "normal")

So you just had 4096 changes here?

>On session 1: insert into data (data) values (generate_series(1,40950));
>On session 2: select pg_logical_slot_get_changes('my_slot', NULL, NULL);
>--> returns in 34998 ms (i.e. 92x slower than "normal")

The SQL interface isn't really the best thing to test this - the output as
a whole is stored first in memory and then, when getting too large, spilled
to disk. Additionally, the starting/stopping of the slot can take a long
while because the WAL may need to be reread. Please test the same using the
streaming interface.
You can use pg_recvlogical.

Regards,

Andres

---
Please excuse brevity and formatting - I am writing this on my mobile phone.
Just after having submitted the bug report, I saw that when a transaction
has 4096 or more operations (inserts for example), I see postgres
generating files in the pg_replslot/my_slot directory. Is that what you
call "spilling to disk"? BTW, my hard disk is an SSD.

To be more concise:
- ONE transaction with 4095 operations -> consumed in 80 ms
- TEN transactions with 4095 operations each (so 40950 operations) -> 380 ms (which, as you said, is very good - less than linear growth)
- ONE transaction with 4096 operations -> consumed in 4204 ms (ouch ...). I confirm there are only 4096 changes in one transaction
- TEN transactions with 4096 operations each (so 40960 operations) -> 34998 ms, ouch again

I know the SQL interface is not the best thing, but it matches my use case.
Let me explain.

I'm currently evaluating postgres for my company (a quite big one) as a
mixed JSON/SQL storage + notification system based on data changes feeding
an HTTP infrastructure. We use a .NET client to communicate with the
database.

I've evaluated postgres against Oracle 12c (we have been using Oracle for
18 years), SQL Server, MongoDB (not ACID, which is a real pain), OrientDB
(promising but unreliable) and RethinkDB. The JSON implementation of
postgres is, in my opinion, very good: cleverly made, with good performance
(much better than Oracle in many cases). I'm very impressed overall by
everything I have tried in postgres.

I need some sort of change data capture (CDC) mechanism to detect changes
on data (async is OK for me as long as I get very low latency). I could use
triggers, but performance would suffer. Reading the WAL seems like it could
be a good approach.

The current "simple" interface to consume a slot is through this SQL
interface; that's why I used it while prototyping things. I also tried
pg_recvlogical feeding a file, but got the exact same results, so I'm
stuck.
What I observe is that the spilling occurs, and when the .snap file is
created then pg_recvlogical will consume data, but it does take a long
time - exactly the same time as pg_logical_slot_get_changes, in fact.

BTW, there is also a bug in pg_recvlogical with option -f - (output to
stdout): pg_recvlogical tries to flush with fsync, which does not work on
Windows and displays an error message.

Best regards,
Olivier
Hi,

On 2015-07-06 19:29:30 +0200, olivier.gosseaume@free.fr wrote:
> Just after having submitted the bug report, I saw that when a
> transaction has 4096 or more operations (inserts for example), I see
> postgres generating files in the pg_replslot/my_slot directory. Is that
> what you call "spilling to disk"?

Yes.

> To be more concise:
> - ONE transaction with 4095 operations -> consumed in 80 ms
> - TEN transactions with 4095 operations each (so 40950 operations) -> 380 ms (which, as you said, is very good - less than linear growth)
> - ONE transaction with 4096 operations -> consumed in 4204 ms (ouch ...). I confirm there are only 4096 changes in one transaction
> - TEN transactions with 4096 operations each (so 40960 operations) -> 34998 ms, ouch again

FWIW, I can't reproduce those results at all. I just tried, and for me all
results are 34ms, 121ms, 149ms, 374ms. When using the streaming interface
there are no discernible delays at all.

> I need some sort of change data capture (CDC) mechanism to detect
> changes on data (async is OK for me as long as I get very low
> latency). I could use triggers, but performance would suffer. Reading
> the WAL seems like it could be a good approach.

Right, that sounds pretty typical.

> I also tried pg_recvlogical feeding a file, but got the exact same
> results, so I'm stuck.

Are you stopping pg_recvlogical between the runs, or are you letting it
run? The point of using it is that it's streaming, i.e. that you do not
need to pay the "startup" costs of logical decoding, which can be
noticeable.

> What I observe is that the spilling occurs, and
> when the .snap file is created then pg_recvlogical will consume data,
> but it does take a long time - exactly the same time as
> pg_logical_slot_get_changes, in fact.

Another possibility is that there's some Windows-specific problem here.
> BTW, there is also a bug in pg_recvlogical with option -f - (output to
> stdout): pg_recvlogical tries to flush with fsync, which does not work
> on Windows and displays an error message.

Uh, that should obviously not be the case. Thanks for the report, will fix.

Greetings,

Andres Freund
On 2015-07-06 19:29:30 +0200, olivier.gosseaume@free.fr wrote:
> To be more concise:
> - ONE transaction with 4095 operations -> consumed in 80 ms
> - TEN transactions with 4095 operations each (so 40950 operations) -> 380 ms (which, as you said, is very good - less than linear growth)
> - ONE transaction with 4096 operations -> consumed in 4204 ms (ouch ...). I confirm there are only 4096 changes in one transaction
> - TEN transactions with 4096 operations each (so 40960 operations) -> 34998 ms, ouch again

Given the third result it's unlikely, but just to make sure: Are you sure
network issues aren't a factor here? 40950 changes are about 3.8MB; over a
slow network that can take a second. Additionally, psql will display all
the changes in a pager, which'll not be fast either (but shouldn't be
included in \timing's output).

Greetings,

Andres Freund
Hi

>> To be more concise:
>> - ONE transaction with 4095 operations -> consumed in 80 ms
>> - TEN transactions with 4095 operations each (so 40950 operations) -> 380 ms (which, as you said, is very good - less than linear growth)
>> - ONE transaction with 4096 operations -> consumed in 4204 ms (ouch ...). I confirm there are only 4096 changes in one transaction
>> - TEN transactions with 4096 operations each (so 40960 operations) -> 34998 ms, ouch again
>
>FWIW, I can't reproduce those results at all. I just tried, and for me
>all results are 34ms, 121ms, 149ms, 374ms. When using the streaming
>interface there are no discernible delays at all.

Very interesting

>> I also tried pg_recvlogical feeding a file, but got the exact same
>> results, so I'm stuck.
>
>Are you stopping pg_recvlogical between the runs, or are you letting it
>run? The point of using it is that it's streaming, i.e. that you do
>not need to pay the "startup" costs of logical decoding, which can be
>noticeable.

No, I just let it run. In my case the bottleneck seems to be on the server
side when spilling to disk. I can see the server build the files, but very
slowly.

>> What I observe is that the spilling occurs, and
>> when the .snap file is created then pg_recvlogical will consume data,
>> but it does take a long time - exactly the same time as
>> pg_logical_slot_get_changes, in fact.
>
>Another possibility is that there's some Windows-specific problem here.

I do agree. It must be a Windows issue then. I reproduced it on two
different dev machines, both with SSD and Windows 7. I can make a try on a
real Windows 2008 or 2012 server with SAN access just to check. Will do it
tomorrow.

>> BTW, there is also a bug in pg_recvlogical with option -f - (output to
>> stdout): pg_recvlogical tries to flush with fsync, which does not work
>> on Windows and displays an error message.
>
>Uh, that should obviously not be the case. Thanks for the report, will
>fix.

Thanks

Best regards,
Olivier
>> To be more concise:
>> - ONE transaction with 4095 operations -> consumed in 80 ms
>> - TEN transactions with 4095 operations each (so 40950 operations) -> 380 ms (which, as you said, is very good - less than linear growth)
>> - ONE transaction with 4096 operations -> consumed in 4204 ms (ouch ...). I confirm there are only 4096 changes in one transaction
>> - TEN transactions with 4096 operations each (so 40960 operations) -> 34998 ms, ouch again
>
>Given the third result it's unlikely, but just to make sure: Are you
>sure network issues aren't a factor here? 40950 changes are about 3.8MB;
>over a slow network that can take a second. Additionally, psql will
>display all the changes in a pager, which'll not be fast either (but
>shouldn't be included in \timing's output).

I'm doing everything on the same host, so the network is not involved at
all (localhost).

br
Olivier
On 2015-07-06 22:56:22 +0200, olivier.gosseaume@free.fr wrote:
> >Are you stopping pg_recvlogical between the runs, or are you letting it
> >run? The point of using it is that it's streaming, i.e. that you do
> >not need to pay the "startup" costs of logical decoding, which can be
> >noticeable.
>
> No, I just let it run. In my case the bottleneck seems to be on the
> server side when spilling to disk. I can see the server build the
> files, but very slowly.

That's odd. It's really just a plain series of writes. I wonder if for
some reason Windows disables write buffering or such?

Ugh, I have a theory. I guess you can't easily recompile postgres with a
patch and test again? I don't have access to windows...

> >> What I observe is that the spilling occurs, and
> >> when the .snap file is created then pg_recvlogical will consume data,
> >> but it does take a long time - exactly the same time as
> >> pg_logical_slot_get_changes, in fact.
> >
> >Another possibility is that there's some Windows-specific problem here.
>
> I do agree. It must be a Windows issue then. I reproduced it on two
> different dev machines, both with SSD and Windows 7. I can make a try
> on a real Windows 2008 or 2012 server with SAN access just to
> check. Will do it tomorrow.

Any chance you could cross-check on a unixoid OS? Just to rule out it's
not some configuration/setup/? issue?

Regards,

Andres
On 2015-07-06 23:05:27 +0200, Andres Freund wrote:
> Ugh, I have a theory. I guess you can't easily recompile postgres with a
> patch and test again? I don't have access to windows...

So the problem I'm seeing is that there's a typo/bug in
ReorderBufferSerializeTXN(). It closes the file handle after each
individual spilled change instead of keeping it open for up to 16MB of
WAL. On Linux that doesn't hurt particularly much, since the file isn't
flushed to disk - which presumably is why we haven't noticed. But if
Windows does that differently...

Greetings,

Andres Freund
On 2015-07-06 23:21:26 +0200, Andres Freund wrote:
> On 2015-07-06 23:05:27 +0200, Andres Freund wrote:
> > Ugh, I have a theory. I guess you can't easily recompile postgres with a
> > patch and test again? I don't have access to windows...
>
> So the problem I'm seeing is that there's a typo/bug in
> ReorderBufferSerializeTXN(). It closes the file handle after each
> individual spilled change instead of keeping it open for up to 16MB of
> WAL. On Linux that doesn't hurt particularly much, since the file isn't
> flushed to disk - which presumably is why we haven't noticed. But if
> Windows does that differently...

Hrmpf, just sent the last part of this message to a different thread.

The fix is just to change

    if (fd == -1 || XLByteInSeg(change->lsn, curOpenSegNo))

into

    if (fd == -1 || !XLByteInSeg(change->lsn, curOpenSegNo))

The bug doesn't have any correctness implications AFAICS, just
performance. We read all the spilled files till the end, so even a change
spilled to the wrong segment gets picked up.
>> >Are you stopping pg_recvlogical between the runs, or are you letting it
>> >run? The point of using it is that it's streaming, i.e. that you do
>> >not need to pay the "startup" costs of logical decoding, which can be
>> >noticeable.
>>
>> No, I just let it run. In my case the bottleneck seems to be on the
>> server side when spilling to disk. I can see the server build the
>> files, but very slowly.
>
>That's odd. It's really just a plain series of writes. I wonder if for
>some reason Windows disables write buffering or such?
>
>Ugh, I have a theory. I guess you can't easily recompile postgres with a
>patch and test again? I don't have access to windows...

Thanks for your finding. I can try to recompile on Windows (will look into
how). In fact, I did a test with a tool called "APIMonitor", which displays
all Windows API calls during postgres execution. What I see is:
- approx 8000 writes
- for each individual write, a call to FlushFileBuffers(), which is very
bad for performance - so it is a Windows-only issue

>Any chance you could cross-check on a unixoid OS? Just to rule out it's
>not some configuration/setup/? issue?

I can try in a VirtualBox, but I would prefer to test your patch before,
if possible.

br
Olivier
>> So the problem I'm seeing is that there's a typo/bug in
>> ReorderBufferSerializeTXN(). It closes the file handle after each
>> individual spilled change instead of keeping it open for up to 16MB of
>> WAL. On Linux that doesn't hurt particularly much, since the file isn't
>> flushed to disk - which presumably is why we haven't noticed. But if
>> Windows does that differently...
>
>Hrmpf, just sent the last part of this message to a different thread.
>
>The fix is just to change
>    if (fd == -1 || XLByteInSeg(change->lsn, curOpenSegNo))
>into
>    if (fd == -1 || !XLByteInSeg(change->lsn, curOpenSegNo))
>
>The bug doesn't have any correctness implications AFAICS, just
>performance. We read all the spilled files till the end, so even a change
>spilled to the wrong segment gets picked up.

Great. I will try to rebuild on Windows with this patch and let you know.

Olivier
On 2015-07-06 23:41, Andres Freund wrote:
> On 2015-07-06 19:29:30 +0200, olivier.gosseaume@free.fr wrote:
>> BTW, there is also a bug in pg_recvlogical with option -f - (output to
>> stdout): pg_recvlogical tries to flush with fsync, which does not work
>> on Windows and displays an error message.
>
> Uh, that should obviously not be the case. Thanks for the report, will
> fix.

Same problem on at least OS X, if stdout is piped to e.g. hexdump.

.m
Hi Andres

>>The fix is just to change
>>    if (fd == -1 || XLByteInSeg(change->lsn, curOpenSegNo))
>>into
>>    if (fd == -1 || !XLByteInSeg(change->lsn, curOpenSegNo))
>>
>>The bug doesn't have any correctness implications AFAICS, just
>>performance. We read all the spilled files till the end, so even a change
>>spilled to the wrong segment gets picked up.
>
>Great. I will try to rebuild on Windows with this patch and let you know.

You got it :)

I rebuilt postgres 9.4.4 x64 this morning with your patch in
reorderbuffer.c, and now it FLIES (140 ms for a transaction with 4096
inserts, 600 ms for 40960 inserts). Spilling to disk is now very fast as
expected (not even noticeable).

Thanks for your great support. Will this patch be included in 9.4.5?

PS: the build process on Windows is very simple and straightforward,
congrats.

Best regards,
Olivier
On 2015-07-07 09:55:50 +0200, olivier.gosseaume@free.fr wrote:
> >Great. I will try to rebuild on Windows with this patch and let you know.
>
> You got it :)

Phew. Good to know, thanks.

> I rebuilt postgres 9.4.4 x64 this morning with your patch in
> reorderbuffer.c, and now it FLIES (140 ms for a transaction with 4096
> inserts, 600 ms for 40960 inserts).

Sorry for being boring and reiterating: This'll still be faster when
using the streaming interface ;)

> Will this patch be included in 9.4.5?

Yes.

> PS: the build process on Windows is very simple and straightforward, congrats.

Good to know. The last time I built PG on Windows (2006?) it wasn't ;)
Hi Olivier, Marko,

On 2015-07-07 00:56:27 +0300, Marko Tiikkaja wrote:
> On 2015-07-06 23:41, Andres Freund wrote:
> >On 2015-07-06 19:29:30 +0200, olivier.gosseaume@free.fr wrote:
> >>BTW, there is also a bug in pg_recvlogical with option -f - (output to
> >>stdout): pg_recvlogical tries to flush with fsync, which does not work
> >>on Windows and displays an error message.
> >
> >Uh, that should obviously not be the case. Thanks for the report, will
> >fix.
>
> Same problem on at least OS X, if stdout is piped to e.g. hexdump.

Interesting - I apparently foresaw that case and decided to handle it by
not erroring out on EINVAL:

    /* Accept EINVAL, in case output is writing to a pipe or similar. */
    if (fsync(outfd) != 0 && errno != EINVAL)
    {
        fprintf(stderr, _("%s: could not fsync log file \"%s\": %s\n"),
                progname, outfile, strerror(errno));
        return false;
    }

but that's obviously not good enough. I think testing whether the output
fd is a regular file and not a tty when opening it should work across
platforms:

    if (fstat(outfd, &statbuf) != 0)
        fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
                progname, outfile, strerror(errno));

    output_isfile = S_ISREG(statbuf.st_mode) && !isatty(outfd);

I pushed the patch and tested it locally (Linux), but I'd be thankful if
you two could test whether it works now on your respective platforms.

Regards,

Andres
On 2015-07-07 12:14:53 +0200, Andres Freund wrote:
> On 2015-07-07 09:55:50 +0200, olivier.gosseaume@free.fr wrote:
> > Will this patch be included in 9.4.5?
>
> Yes.

Fix pushed to the relevant branches. Thanks for the report!
On 2015-07-07 14:15, Andres Freund wrote:
> I think testing whether the output fd is a regular file and not a tty
> when opening it should work across platforms:
>
> I pushed the patch and tested it locally (Linux), but I'd be thankful if
> you two could test whether it works now on your respective platforms.

Fixes the issue on OS X. Thank you.

.m
>I pushed the patch and tested it locally (Linux), but I'd be thankful if
>you two could test whether it works now on your respective platforms.

Tested it, and that also does fix the bug on Windows. Thanks :)

Regards,
Olivier