Thread: [HACKERS] Race-like failure in recovery/t/009_twophase.pl
Since I put in that patch to reduce pg_ctl's reaction time, we've seen
several buildfarm failures with identical symptoms:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2012%3A56%3A08
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2017-06-30%2023%3A42%3A11
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2017-07-01%2000%3A22%3A09
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2017-07-01%2003%3A30%3A01

to wit that the recovery/009 test blows up with

Bailout called.  Further testing stopped:  pg_ctl start failed
FAILED--Further testing stopped: pg_ctl start failed

Fortunately, it reports the relevant log files, and we can see things
like this in the "slave" postmaster's log:

2017-07-01 02:31:51.131 UTC [11665568] LOG:  entering standby mode
2017-07-01 02:31:51.198 UTC [11665568] LOG:  consistent recovery state reached at 0/301CB10
2017-07-01 02:31:51.198 UTC [11665568] LOG:  invalid record length at 0/301CB10: wanted 24, got 0
2017-07-01 02:31:51.200 UTC [12779594] LOG:  database system is ready to accept read only connections
2017-07-01 02:31:51.214 UTC [5111896] LOG:  fetching timeline history file for timeline 3 from primary server
2017-07-01 02:31:51.246 UTC [5111896] LOG:  started streaming WAL from primary at 0/3000000 on timeline 2
2017-07-01 02:31:51.248 UTC [5111896] LOG:  replication terminated by primary server
2017-07-01 02:31:51.248 UTC [5111896] DETAIL:  End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.263 UTC [11665568] LOG:  new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.263 UTC [5111896] LOG:  restarted WAL streaming at 0/3000000 on timeline 2
2017-07-01 02:31:51.264 UTC [5111896] LOG:  replication terminated by primary server
2017-07-01 02:31:51.264 UTC [5111896] DETAIL:  End of WAL reached on timeline 2 at 0/301C168.
2017-07-01 02:31:51.264 UTC [11665568] LOG:  new timeline 3 forked off current database system timeline 2 before current recovery point 0/301CB10
2017-07-01 02:31:51.579 UTC [12779594] LOG:  received immediate shutdown request
2017-07-01 02:31:51.629 UTC [12779594] LOG:  database system is shut down
2017-07-01 02:31:51.790 UTC [7471446] LOG:  listening on Unix socket "/tmp/ELoHNJ0v5u/.s.PGSQL.49810"
2017-07-01 02:31:51.826 UTC [9502748] LOG:  database system was interrupted while in recovery at log time 2017-07-01 02:31:49 UTC
2017-07-01 02:31:51.826 UTC [9502748] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2017-07-01 02:31:51.826 UTC [9502748] LOG:  entering standby mode
2017-07-01 02:31:51.827 UTC [9502748] FATAL:  requested timeline 3 is not a child of this server's history
2017-07-01 02:31:51.827 UTC [9502748] DETAIL:  Latest checkpoint is at 0/301CAA0 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/301C168.
2017-07-01 02:31:51.828 UTC [7471446] LOG:  startup process (PID 9502748) exited with exit code 1
2017-07-01 02:31:51.829 UTC [7471446] LOG:  aborting startup due to startup process failure

What seems to have happened here is that the slave has WAL up to
0/301CB10, but the master's WAL ends at 0/301C168, so we've got trouble.
How did the slave get ahead of the master?

The reporting critters are all on the slow side, so I suspected a timing
problem, especially since it only started to show up after changing
pg_ctl's timing behavior.
I can't reproduce it locally on unmodified sources, but I could after
putting my thumb on the scales like this:

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index ea9d21a..7ef22c1 100644
*** a/src/backend/replication/walreceiver.c
--- b/src/backend/replication/walreceiver.c
*************** WalReceiverMain(void)
*** 446,451 ****
--- 446,454 ----
  					 * Something was received from master, so reset
  					 * timeout
  					 */
+ 
+ 					pg_usleep(100000);	/* be slow about processing */
+ 
  					last_recv_timestamp = GetCurrentTimestamp();
  					ping_sent = false;
  					XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

(If this doesn't do it for you alone, you might try also increasing
WAITS_PER_SEC in pg_ctl.c, so as to make pg_ctl faster yet.)

I'm not entirely sure what's happening here, although the gist of it has
to be that we're killing off a slave server before it's quite absorbed
all the WAL the master has sent it, and then when we exchange the roles
of the master and slave, all of a sudden it's the slave that's ahead,
not the master.

Part of the reason I'm confused is that the programming technique being
used in 009_twophase.pl, namely doing

($node_master, $node_slave) = ($node_slave, $node_master);

and then working with the reversed variable names, is ENTIRELY TOO CUTE
FOR ITS OWN GOOD.  The objects continue to have their original "name"
fields and continue to refer to the same datadirs and log files
internally, which means that when you're trying to trace through the log
files to understand what's happening, you have to remember how many of
these swaps have been done so far in order to understand whether script
operations on "$node_master" are actually affecting the "master" or the
"slave" installation, and whether when the log says that it started the
"slave" server, that corresponds to a script action on "$node_slave" or
"$node_master".
This might not be the worst, most confusing code I've ever seen
committed into Postgres, but it's right up there, and you can bet that
I'm going to be holding somebody's feet to the fire for a rewrite.

Anyway, having vented about that ... it's not very clear to me whether
the test script is at fault for not being careful to let the slave catch
up to the master before we promote it (and then deem the master to be
usable as a slave without rebuilding it first), or whether we actually
imagine that should work, in which case there's a replication logic bug
here someplace.  The failure messages are also less clear than they
could be, perhaps.

			regards, tom lane
Tom Lane wrote:

> Part of the reason I'm confused is that the programming technique
> being used in 009_twophase.pl, namely doing
>
> ($node_master, $node_slave) = ($node_slave, $node_master);
>
> and then working with the reversed variable names, is ENTIRELY TOO CUTE
> FOR ITS OWN GOOD.

This is my fault.  I noticed this in the submitted test (and was pretty
confused about it too) when I reviewed it for commit, but somehow it
didn't reach my threshold to require a rewrite.  I'll fix it sometime
during the week.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> Part of the reason I'm confused is that the programming technique
>> being used in 009_twophase.pl, namely doing
>> ($node_master, $node_slave) = ($node_slave, $node_master);
>> and then working with the reversed variable names, is ENTIRELY TOO CUTE
>> FOR ITS OWN GOOD.

> This is my fault.  I noticed this in the submitted test (and was pretty
> confused about it too) when I reviewed it for commit, but somehow it
> didn't reach my threshold to require a rewrite.  I'll fix it sometime
> during the week.

I'd kind of like to fix it now, so I can reason in a less confused way
about the actual problem.  Last night I didn't have a clear idea of how
to make it better, but what I'm thinking this morning is:

* Naming the underlying server objects "master" and "slave" is just
wrong, because the script makes them swap those roles repeatedly.
Better to choose neutral names, like "alice" and "bob" or "london"
and "paris".

* We could simply make the test script refer directly to the appropriate
server at each step, ie s/node_master/node_london/ in relevant parts of
the script and s/node_slave/node_paris/ elsewhere.  Maybe that's the
best way, but there is some value in identifying commands as to whether
we're issuing them to the current master or the current slave.  Plan B
might be to do

($cur_master, $cur_slave) = ($node_paris, $node_london);

at the swap points and use those names where it seems clearer.

* Some effort should be put into emitting text to the log showing
what's going on, eg print("Now london is master."); as appropriate.

* Another reason why I had the feeling of being lost in a maze of
twisty little passages all alike was the relentless sameness of the
commands being sent to the servers.  There is no good reason for the
prepared transactions to be all alike; they should be all different
so that you can match up postmaster log entries to points in the
script.

If anyone has other ideas please speak up.
Barring objection I'll go do something about this shortly.

			regards, tom lane
Tom Lane wrote:
> I'd kind of like to fix it now, so I can reason in a less confused way
> about the actual problem.

OK, no objections here.

> Last night I didn't have a clear idea of how
> to make it better, but what I'm thinking this morning is:
>
> * Naming the underlying server objects "master" and "slave" is just
> wrong, because the script makes them swap those roles repeatedly.
> Better to choose neutral names, like "alice" and "bob" or "london"
> and "paris".

Check.

> * We could simply make the test script refer directly to the appropriate
> server at each step, ie s/node_master/node_london/ in relevant parts of
> the script and s/node_slave/node_paris/ elsewhere.  Maybe that's the
> best way, but there is some value in identifying commands as to whether
> we're issuing them to the current master or the current slave.  Plan B
> might be to do
> ($cur_master, $cur_slave) = ($node_paris, $node_london);
> at the swap points and use those names where it seems clearer.

Hmm, yeah, using role-based aliases seems like a good idea too.

> * Some effort should be put into emitting text to the log showing
> what's going on, eg print("Now london is master."); as appropriate.

Check.  Not "print" though; I think using note(" .. ") (from Test::More)
is more appropriate.

> * Another reason why I had the feeling of being lost in a maze of
> twisty little passages all alike was the relentless sameness of the
> commands being sent to the servers.  There is no good reason for the
> prepared transactions to be all alike; they should be all different
> so that you can match up postmaster log entries to points in the
> script.

Check.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Tom Lane wrote:
>> * Some effort should be put into emitting text to the log showing
>> what's going on, eg print("Now london is master."); as appropriate.

> Check.  Not "print" though; I think using note(" .. ") (from
> Test::More) is more appropriate.

Will do, thanks for the suggestion.

			regards, tom lane
I wrote:
> The reporting critters are all on the slow side, so I suspected
> a timing problem, especially since it only started to show up
> after changing pg_ctl's timing behavior.  I can't reproduce it
> locally on unmodified sources, but I could after putting my thumb
> on the scales like this:
>
> diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
> index ea9d21a..7ef22c1 100644
> *** a/src/backend/replication/walreceiver.c
> --- b/src/backend/replication/walreceiver.c
> *************** WalReceiverMain(void)
> *** 446,451 ****
> --- 446,454 ----
>   					 * Something was received from master, so reset
>   					 * timeout
>   					 */
> + 
> + 					pg_usleep(100000);	/* be slow about processing */
> + 
>   					last_recv_timestamp = GetCurrentTimestamp();
>   					ping_sent = false;
>   					XLogWalRcvProcessMsg(buf[0], &buf[1], len - 1);

BTW, I neglected to mention that this hack also produces a second
failure in the recovery test suite:

[20:28:31] t/001_stream_rep.pl .................. 11/28
# Failed test 'xmin of cascaded slot null with no hs_feedback'
# at t/001_stream_rep.pl line 178.
#          got: undef
#     expected: ''
# Failed test 'catalog xmin of cascaded slot null with no hs_feedback'
# at t/001_stream_rep.pl line 179.
#          got: undef
#     expected: ''
[20:28:31] t/001_stream_rep.pl .................. 28/28
# Looks like you failed 2 tests of 28.
[20:28:31] t/001_stream_rep.pl .................. Dubious, test returned 2 (wstat 512, 0x200)
Failed 2/28 subtests

I had supposed last night that this was merely a matter of needing to
insert a wait_slot_xmins() call before those tests, but doing so doesn't
help.  And, looking closer, the report is complaining that $xmin and
$catalog_xmin are coming back as undef, not empty strings.  My Perl-fu
is too weak to figure out how that could possibly be --- the
documentation for PostgresNode::query_hash certainly states flatly that
it will never happen.
And it even more certainly shouldn't happen if we already successfully
probed the pg_replication_slots row in wait_slot_xmins().  The
corresponding entries in the postmaster's log don't look abnormal:

2017-07-02 20:07:29.313 UTC [30862] LOG:  database system is ready to accept read only connections
2017-07-02 20:07:29.318 UTC [30867] LOG:  started streaming WAL from primary at 0/3000000 on timeline 1
2017-07-02 20:07:29.415 UTC [30870] t/001_stream_rep.pl LOG:  statement: SELECT pg_create_physical_replication_slot('standby_2');
2017-07-02 20:07:29.619 UTC [30863] LOG:  redo starts at 0/3029458
2017-07-02 20:07:29.665 UTC [30882] t/001_stream_rep.pl LOG:  statement: SELECT xmin IS NULL AND catalog_xmin IS NULL FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2';
2017-07-02 20:07:29.679 UTC [30884] t/001_stream_rep.pl LOG:  statement: SELECT plugin, slot_type, datoid, database, active, active_pid, xmin, catalog_xmin, restart_lsn FROM pg_catalog.pg_replication_slots WHERE slot_name = 'standby_2'

(This is from a run with a wait_slot_xmins() call added, else the
statement logged by PID 30882 would not have been issued.)

Any ideas what's wrong there?

			regards, tom lane
I wrote:
> Any ideas what's wrong there?

Hah: the answer is that query_hash's split() call is broken.
"man perlfunc" quoth

    split   Splits the string EXPR into a list of strings and returns
            that list.  By default, empty leading fields are preserved,
            and empty trailing ones are deleted.
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the case at hand, the SQL query returns something like
"|physical|||t|10338|||0/302B7E8" with normal timing, but with
enough delay in there, you get "|physical|||t|11542|||" which
triggers split's default behavior of ignoring the trailing empty
fields.  It looks like the way to get split to not do that is
to pass it a "limit" of -1.

			regards, tom lane
I wrote:
> Anyway, having vented about that ... it's not very clear to me whether
> the test script is at fault for not being careful to let the slave
> catch up to the master before we promote it (and then deem the master
> to be usable as a slave without rebuilding it first), or whether we
> actually imagine that should work, in which case there's a replication
> logic bug here someplace.

OK, now that I can make some sense of what's going on in the 009 test
script ... it seems like that test script is presuming synchronous
replication behavior, but it's only actually set up sync rep in one
direction, namely the london->paris direction.  The failure occurs when
we lose data in the paris->london direction.  Specifically, with the
delay hack in place, I find this in the log before things go south
completely:

# Now london is master and paris is slave
ok 11 - Restore prepared transactions from files with master down
### Enabling streaming replication for node "paris"
### Starting node "paris"
# Running: pg_ctl -D /home/postgres/pgsql/src/test/recovery/tmp_check/data_paris_xSFF/pgdata -l /home/postgres/pgsql/src/test/recovery/tmp_check/log/009_twophase_paris.log start
waiting for server to start.... done
server started
# Postmaster PID for node "paris" is 30930
psql:<stdin>:1: ERROR:  prepared transaction with identifier "xact_009_11" does not exist

That ERROR is being reported by the london node, at line 267 of the
current script:

$cur_master->psql('postgres', "COMMIT PREPARED 'xact_009_11'");

So london is missing a prepared transaction that was created while paris
was master, a few lines earlier.  (It's not real good that the test
script isn't bothering to check the results of any of these queries,
although the end-state test I just added should close the loop on that.)
london has no idea that it's missing data, but when we restart the paris
node a little later, it notices that its WAL is past where london is.
I'm now inclined to think that the correct fix is to ensure that we run
synchronous rep in both directions, rather than to insert delays to
substitute for that.  Just setting synchronous_standby_names for node
paris at the top of the script doesn't work, because there's at least
one place where the script intentionally issues commands to paris while
london is stopped.  But we could turn off sync rep for that step,
perhaps.

Anyone have a different view of what to fix here?

			regards, tom lane
(catching up test threads)

On Mon, Jul 3, 2017 at 7:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm now inclined to think that the correct fix is to ensure that we
> run synchronous rep in both directions, rather than to insert delays
> to substitute for that.  Just setting synchronous_standby_names for
> node paris at the top of the script doesn't work, because there's
> at least one place where the script intentionally issues commands
> to paris while london is stopped.

I bet that using syncrep in both directions will likely avoid
inconsistencies in the future if the test suite is extended one way or
another.

> But we could turn off sync rep for that step, perhaps.

Yeah, by using synchronous_commit = off.

> Anyone have a different view of what to fix here?

No, this sounds like a good plan.  What do you think about the attached?

-- 
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Attachments
Michael Paquier <michael.paquier@gmail.com> writes:
> On Mon, Jul 3, 2017 at 7:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyone have a different view of what to fix here?

> No, this sounds like a good plan.  What do you think about the attached?

Oh, that's a good way.  I just finished testing a fix that involved not
turning on the second server's sync commit until later (it seems that
only the first action on "paris" is really at risk currently).  But
disabling sync commit for individual transactions is clearly cleaner and
more extensible to future test script changes.

FWIW, I just got done doing a few check-world cycles with the delay in
WalReceiverMain plus speeding up pg_ctl.c to WAITS_PER_SEC = 1000.  No
other problems seem to be revealed this way.

			regards, tom lane
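For readers following along: the per-transaction escape hatch works
because synchronous_commit is an ordinary session-settable GUC, so a
single at-risk step can opt out of waiting for the (stopped) synchronous
standby while every other step keeps full sync-rep guarantees.  A sketch
of the idea in plain SQL — not necessarily the committed patch, and the
transaction identifier is just the one quoted earlier in the thread:

```sql
-- Stop this session's commits from waiting for the synchronous standby,
-- run the step that must succeed while the standby is down, then
-- restore the default behavior for subsequent steps.
SET synchronous_commit TO off;
COMMIT PREPARED 'xact_009_11';
RESET synchronous_commit;
```

Note that synchronous_commit = off only skips the wait for durability
acknowledgment; the WAL is still generated and streamed, so the standby
catches up normally once it returns.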
On 3 July 2017 at 05:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> Any ideas what's wrong there?
>
> Hah: the answer is that query_hash's split() call is broken.
> "man perlfunc" quoth
>
>     split   Splits the string EXPR into a list of strings and returns
>             that list.  By default, empty leading fields are preserved,
>             and empty trailing ones are deleted.
>                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> In the case at hand, the SQL query returns something like
> "|physical|||t|10338|||0/302B7E8" with normal timing, but with
> enough delay in there, you get "|physical|||t|11542|||" which
> triggers split's default behavior of ignoring the trailing empty
> fields.  It looks like the way to get split to not do that is
> to pass it a "limit" of -1.

That's my bad.

(Insert dark muttering about Perl here).

Thanks for spotting it.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Craig Ringer <craig@2ndquadrant.com> writes:
> That's my bad.
> (Insert dark muttering about Perl here).

Yeah, pretty much the only good thing about Perl is that it's
ubiquitous.  But you could say the same of C.  Or SQL.  For a profession
that's under 70 years old, we sure spend a lot of time dealing with
legacy stuff.

			regards, tom lane