Обсуждение: [PATCH] Logical decoding timeline following take II
Hi all Attached is a rebased and updated logical decoding timeline following patch for 10.0. This is a pre-requisite for the pending work on logical decoding on standby servers and simplified failover of logical decoding. Restating the commit message: __________ Follow timeline switches in logical decoding When decoding from a logical slot, it's necessary for xlog reading to be able to read xlog from historical (i.e. not current) timelines. Otherwise decoding fails after failover to a physical replica because the oldest still-needed archives are in the historical timeline. Supporting logical decoding timeline following is a pre-requisite for logical decoding on physical standby servers. It also makes it possible to promote a replica with logical slots to a master and replay from those slots, allowing logical decoding applications to follow physical failover. Logical slots cannot actually be created on a replica without use of the low-level C slot management APIs so this is mostly foundation work for subsequent changes to enable logical decoding on standbys. This commit includes a module in src/test/modules with functions to manipulate the slots (which is not otherwise possible in SQL code) in order to enable testing, and a new test in src/test/recovery to ensure that the behavior is as expected. Note that an earlier version of logical decoding timeline following was committed to 9.5 as 24c5f1a103ce, 3a3b309041b0, 82c83b337202, and f07d18b6e94d. It was then reverted by c1543a81a7a8 just after 9.5 feature freeze when issues were discovered too late to safely fix them in the 9.5 release cycle. The prior approach failed to consider that a record could be split across pages that are on different segments, where the new segment contains the start of a new timeline. In that case the old segment might be missing or renamed with a .partial suffix. This patch reworks the logic to be page-based and in the process simplify how the last timeline for a segment is looked up. Slot timeline following only works in a backend. Frontend support can be aded separately, where it could be useful for pg_xlogdump etc once support for timeline.c, List, etc is added for frontend code. __________ I'm hoping to find time to refactor timeline following so that we avoid passing timeline information around the xlogreader using globals, but that'd be a separate change that can be made after this. I've omitted the --endpos changes for pg_recvlogical, which again can be added separately. The test harness code will become unnecessary when proper support for logical failover or logical decoding on standby is added, so I'm not really sure it should be committed. Prior threads: * https://www.postgresql.org/message-id/CAMsr+YG_1FU_-L8QWSk6oKFT4Jt8dpORy2RHXDyMy0B5ZfkpGA@mail.gmail.com * https://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzAPnR0wwqJa-rUmVHSYyZaNSn93MUBMQ@mail.gmail.com * http://www.postgresql.org/message-id/20160503165812.GA29604@alvherre.pgsql -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
On Thu, Sep 1, 2016 at 1:08 PM, Craig Ringer <craig@2ndquadrant.com> wrote: > Attached is a rebased and updated logical decoding timeline following > patch for 10.0. Moved to next CF, nothing has happened since submission. -- Michael
Hi Craig, On 01/09/16 06:08, Craig Ringer wrote: > Hi all > > Attached is a rebased and updated logical decoding timeline following > patch for 10.0. > > This is a pre-requisite for the pending work on logical decoding on > standby servers and simplified failover of logical decoding. > I went over this and it looks fine to me, I only rebased the patch on top of current version (we renamed pg_xlog which broke the tests) and split the test harness to separate patch. Otherwise I would consider this to be ready for committer. I think this should go in early so that there is enough time in the cycle to uncover potential issues if there are any, even though it looks all correct to me. > > The test harness code will become unnecessary when proper support for > logical failover or logical decoding on standby is added, so I'm not > really sure it should be committed. Yeah as I said above I split out the test harness for this reason. The good thing is that when the followup patches get in the test harness should be easy to removed as the changes are very localized. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
On 24 October 2016 at 23:49, Petr Jelinek <petr@2ndquadrant.com> wrote: > Hi Craig, > > On 01/09/16 06:08, Craig Ringer wrote: >> Hi all >> >> Attached is a rebased and updated logical decoding timeline following >> patch for 10.0. >> >> This is a pre-requisite for the pending work on logical decoding on >> standby servers and simplified failover of logical decoding. >> > > I went over this and it looks fine to me, I only rebased the patch on > top of current version (we renamed pg_xlog which broke the tests) and > split the test harness to separate patch. Otherwise I would consider > this to be ready for committer. > > I think this should go in early so that there is enough time in the > cycle to uncover potential issues if there are any, even though it looks > all correct to me. Thanks for the review and update. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2016-10-24 17:49:13 +0200, Petr Jelinek wrote: > + * Determine which timeline to read an xlog page from and set the > + * XLogReaderState's state->currTLI to that timeline ID. "XLogReaderState's state->currTLI" - the state's a bit redundant. > + * The caller must also make sure it doesn't read past the current redo pointer > + * so it doesn't fail to notice that the current timeline became historical. > + */ Not sure what that means? The redo pointer usually menas the "logical start" of the last checkpoint, but I don't see how thta could be meant here? > +static void > +XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength) > +{ > + const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff; > + > + elog(DEBUG4, "Determining timeline for read at %X/%X+%X", > + (uint32)(wantPage>>32), (uint32)wantPage, wantLength); Doing this on every single read from a page seems quite verbose. It's also (like most or all the following debug elogs) violating the casing prescribed by the message style guidelines. > + /* > + * If the desired page is currently read in and valid, we have nothing to do. > + * > + * The caller should've ensured that it didn't previously advance readOff > + * past the valid limit of this timeline, so it doesn't matter if the current > + * TLI has since become historical. > + */ > + if (lastReadPage == wantPage && > + state->readLen != 0 && > + lastReadPage + state->readLen >= wantPage + Min(wantLength,XLOG_BLCKSZ-1)) > + { > + elog(DEBUG4, "Wanted data already valid"); //XXX > + return; > + } With that kind of comment/XXX present, this surely can't be ready for committer? > + /* > + * If we're reading from the current timeline, it hasn't become historical > + * and the page we're reading is after the last page read, we can again > + * just carry on. (Seeking backwards requires a check to make sure the older > + * page isn't on a prior timeline). > + */ How can ThisTimeLineID become historical? > + if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage) > + { > + Assert(state->currTLIValidUntil == InvalidXLogRecPtr); > + elog(DEBUG4, "On current timeline"); > + return; > + } Also, is it actually ok to rely on ThisTimeLineID here? That's IIRC not maintained outside of the startup process, until recovery ended (cf it being set in InitXLOGAccess() called via RecoveryInProgress()). > /* > - * TODO: we're going to have to do something more intelligent about > - * timelines on standbys. Use readTimeLineHistory() and > - * tliOfPointInHistory() to get the proper LSN? For now we'll catch > - * that case earlier, but the code and TODO is left in here for when > - * that changes. > + * Check which timeline to get the record from. > + * > + * We have to do it each time through the loop because if we're in > + * recovery as a cascading standby, the current timeline might've > + * become historical. > */ I guess you mean cascading as "replicating logically from a physical standby"? I don't think it's good to use cascading here, because that's usually thought to mean doing physical SR from a standby... > diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c > index 318726e..4315fb3 100644 > --- a/src/backend/replication/logical/logicalfuncs.c > +++ b/src/backend/replication/logical/logicalfuncs.c > @@ -234,12 +234,6 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin > rsinfo->setResult = p->tupstore; > rsinfo->setDesc = p->tupdesc; > > - /* compute the current end-of-wal */ > - if (!RecoveryInProgress()) > - end_of_wal = GetFlushRecPtr(); > - else > - end_of_wal = GetXLogReplayRecPtr(NULL); > - > ReplicationSlotAcquire(NameStr(*name)); > > PG_TRY(); > @@ -279,6 +273,12 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin > /* invalidate non-timetravel entries */ > InvalidateSystemCaches(); > > + if (!RecoveryInProgress()) > + end_of_wal = GetFlushRecPtr(); > + else > + end_of_wal = GetXLogReplayRecPtr(NULL); > + > + /* Decode until we run out of records */ > while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) || > (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal)) > { That seems like a pretty random change? > diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h > index deaa7f5..caff9a6 100644 > --- a/src/include/access/xlogreader.h > +++ b/src/include/access/xlogreader.h > @@ -27,6 +27,10 @@ > > #include "access/xlogrecord.h" > > +#ifndef FRONTEND > +#include "nodes/pg_list.h" > +#endif Why is that needed? > diff --git a/src/test/modules/test_slot_timelines/README b/src/test/modules/test_slot_timelines/README > new file mode 100644 > index 0000000..585f02f > --- /dev/null > +++ b/src/test/modules/test_slot_timelines/README > @@ -0,0 +1,19 @@ > +A test module for logical decoding failover and timeline following. > + > +This module provides a minimal way to maintain logical slots on replicas > +that mirror the state on the master. It doesn't make decoding possible, > +just tracking slot state so that a decoding client that's using the master > +can follow a physical failover to the standby. The master doesn't know > +about the slots on the standby, they're synced by a client that connects > +to both. > + > +This is intentionally not part of the test_decoding module because that's meant > +to serve as example code, where this module exercises internal server features > +by unsafely exposing internal state to SQL. It's not the right way to do > +failover, it's just a simple way to test it from the perl TAP framework to > +prove the feature works. > + > +In a practical implementation of this approach a bgworker on the master would > +monitor slot positions and relay them to a bgworker on the standby that applies > +the position updates without exposing slot internals to SQL. That's too complex > +for this test framework though. I still don't want to have this code in postgres, it's just an example for doing something bad. Lets get proper feedback control in, and go from there. > diff --git a/src/test/modules/test_slot_timelines/expected/load_extension_1.out b/src/test/modules/test_slot_timelines/expected/load_extension_1.out > new file mode 100644 > index 0000000..0db21e4 > --- /dev/null > +++ b/src/test/modules/test_slot_timelines/expected/load_extension_1.out > @@ -0,0 +1,7 @@ > +CREATE EXTENSION test_slot_timelines; > +SELECT test_slot_timelines_create_logical_slot('test_slot', 'test_decoding'); > +ERROR: replication slots can only be used if max_replication_slots > 0 > +SELECT test_slot_timelines_advance_logical_slot('test_slot', txid_current()::text::xid, txid_current()::text::xid, pg_current_xlog_location(),pg_current_xlog_location()); > +ERROR: replication slots can only be used if max_replication_slots > 0 > +SELECT pg_drop_replication_slot('test_slot'); > +ERROR: replication slots can only be used if max_replication_slots > 0 > diff --git a/src/test/modules/test_slot_timelines/sql/load_extension.sql b/src/test/modules/test_slot_timelines/sql/load_extension.sql > new file mode 100644 > index 0000000..2440355 It doesn't seem worthwhlie to maintain a test for this - why do we run the tests with those settings in the first place? Greetings, Andres Freund
On 12 November 2016 at 23:07, Andres Freund <andres@anarazel.de> wrote > On 2016-10-24 17:49:13 +0200, Petr Jelinek wrote: >> + * Determine which timeline to read an xlog page from and set the >> + * XLogReaderState's state->currTLI to that timeline ID. > > "XLogReaderState's state->currTLI" - the state's a bit redundant. Thanks for taking a look at the patch. Agreed re above, fixed. You know, having returned to this work after a long break doing other things, it's clear that so much of the same work is being done in XLogSendPhysical(...) and walsender.c's XLogRead(...) that maybe it is worth facing the required refactoring so that we can use that in logical decoding from both the walsender and the SQL interface. The approach I originally took focused above all else on being minimally intrusive, rather than clean and clear. Maybe it's better to tidy this up instead. Instead of coupling timeline following in logical decoding to the XLogReader struct and having effectively duplicated logic to that for physical walsender, move the walsender.c globals static TimeLineID sendTimeLine = 0; static TimeLineID sendTimeLineNextTLI = 0; static bool sendTimeLineIsHistoric = false; static XLogRecPtr sendTimeLineValidUpto = InvalidXLogRecPtr; into new strut in timeline.h and move the logic into timeline.c . So we stop relying on so many magic globals in the walsender. Don't keep this state as a global, instead init it in StartReplication and pass it as a param to WalSndLoop. For logical decoding from walsender, store the walsender's copy in the XLogReaderState. For logical decoding from SQL, set it up in pg_logical_slot_get_changes_guts(...) and again store it in XLogReaderState. In the process, finally unify the two XLogRead(...) functions in xlogutils.c and walsender.c and merge walsender's logical_read_xlog_page(...) with xlogutils' logical_read_xlog_page(...) . Sure, we can't rely on a normal backend's latch being set when wal is written like we can for the walsender, but do we have to duplicate everything else? Can always add a miscadmin.h var like IsWALSender and test that to see if we can expect to have our latch set when new WAL comes in and adjust our latch wait timeout interval appropriately. The downside is of course that it touches physical replication, unlike the current approach which avoids touching anything that could affect physical replication at the cost of increasing the duplication between physical and logical replication logic. >> + * The caller must also make sure it doesn't read past the current redo pointer >> + * so it doesn't fail to notice that the current timeline became historical. >> + */ > > Not sure what that means? The redo pointer usually menas the "logical > start" of the last checkpoint, but I don't see how thta could be meant > here? What I was trying to say was the current replay position on a standby. Amended. >> +static void >> +XLogReadDetermineTimeline(XLogReaderState *state, XLogRecPtr wantPage, uint32 wantLength) >> +{ >> + const XLogRecPtr lastReadPage = state->readSegNo * XLogSegSize + state->readOff; >> + >> + elog(DEBUG4, "Determining timeline for read at %X/%X+%X", >> + (uint32)(wantPage>>32), (uint32)wantPage, wantLength); > > Doing this on every single read from a page seems quite verbose. It's > also (like most or all the following debug elogs) violating the casing > prescribed by the message style guidelines. Agreed. It's unnecessary now. It's the sort of thing I'd want to keep to have if Pg had fine grained module level log control, but we don't. >> + /* >> + * If we're reading from the current timeline, it hasn't become historical >> + * and the page we're reading is after the last page read, we can again >> + * just carry on. (Seeking backwards requires a check to make sure the older >> + * page isn't on a prior timeline). >> + */ > > How can ThisTimeLineID become historical? It can't, right now. Though I admit I didn't realise that at the time. Presently ThisTimeLineID is only set in the walsender by GetStandbyFlushRecPtr as called by IdentifySystem, and in StartReplication, but not afterwards for logical replication, so logical rep won't notice timeline transitions when on a standby unless a decoding session is restarted. We don't support decoding sessions in recovery, so it can't happen yet. It will, when we're on a cascading standby and the upstream is promoted, once we support logical decoding on standby. As part of that we'll need to maintain the timeline in the walsender in logical decoding like we do in physical, and limit logical decoding to the currently replayed position with something like walsender's GetStandbyFlushRecPtr(). But usable form the SQL interface too, of course. I'm currently implementing logical decoding on standby on top of this and the catalog_xmin feedback patch, and in the process will be adding tests for logical decoding on a physical replica where its upstream is promoted from cascading standby to master, so I'll nail down any issues for sure in that process. I think it makes sense to consider this patch as part of the same set as catalog_xmin feedback and decoding on standby, really. Without those it's hard to come up with good real world tests and it's not very useful. As you've correctly noted, the bundled tests are contrived in the extreme. Despite that, I've attached a revised version of the current approach incorporating fixes for the issues you mentioned. It's preceded by the patch to add an --endpos option to pg_recvlogical so that we can properly test the walsender interface too. >> + if (state->currTLI == ThisTimeLineID && wantPage >= lastReadPage) >> + { >> + Assert(state->currTLIValidUntil == InvalidXLogRecPtr); >> + elog(DEBUG4, "On current timeline"); >> + return; >> + } > > Also, is it actually ok to rely on ThisTimeLineID here? That's IIRC not > maintained outside of the startup process, until recovery ended (cf it > being set in InitXLOGAccess() called via RecoveryInProgress()). We don't yet allow decoding in recovery, so yes. We can be reading xlog from a prior timeline to the server's current timeline, since we're replaying xlog generated on the old master before promotion. That's when this test fails and we carry on. Amended for clarity. Added this to func header comment, hopefully helps: * It's necessary to care about timelines in xlogreader and logical decoding * when we might be reading xlog generated prior to a promotion, either if * we're currently a standby in recovery or if we're a promoted master reading * xlogs generated by the old master before our promotion. Notably, logical * decoding on a standby needs to be able to replay any remaining pending data * from the old timeline when the standby or one of its upstreams being * promoted. >> /* >> - * TODO: we're going to have to do something more intelligent about >> - * timelines on standbys. Use readTimeLineHistory() and >> - * tliOfPointInHistory() to get the proper LSN? For now we'll catch >> - * that case earlier, but the code and TODO is left in here for when >> - * that changes. >> + * Check which timeline to get the record from. >> + * >> + * We have to do it each time through the loop because if we're in >> + * recovery as a cascading standby, the current timeline might've >> + * become historical. >> */ > > I guess you mean cascading as "replicating logically from a physical > standby"? I don't think it's good to use cascading here, because that's > usually thought to mean doing physical SR from a standby... I actually do mean cascading physical standby here. If we're a physical standby and get promoted. IsInRecovery() becomes false. Fine. But if the timeline can also change while we remain in recovery, if a configuration like A => B => C -> L exists, where => is a physical replication link and -> is a logical decoding session. If A dies and B is promoted, C's timeline will change. Comment now explains this. >> diff --git a/src/backend/replication/logical/logicalfuncs.c b/src/backend/replication/logical/logicalfuncs.c >> index 318726e..4315fb3 100644 >> --- a/src/backend/replication/logical/logicalfuncs.c >> +++ b/src/backend/replication/logical/logicalfuncs.c >> @@ -234,12 +234,6 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin >> rsinfo->setResult = p->tupstore; >> rsinfo->setDesc = p->tupdesc; >> >> - /* compute the current end-of-wal */ >> - if (!RecoveryInProgress()) >> - end_of_wal = GetFlushRecPtr(); >> - else >> - end_of_wal = GetXLogReplayRecPtr(NULL); >> - >> ReplicationSlotAcquire(NameStr(*name)); >> >> PG_TRY(); >> @@ -279,6 +273,12 @@ pg_logical_slot_get_changes_guts(FunctionCallInfo fcinfo, bool confirm, bool bin >> /* invalidate non-timetravel entries */ >> InvalidateSystemCaches(); >> >> + if (!RecoveryInProgress()) >> + end_of_wal = GetFlushRecPtr(); >> + else >> + end_of_wal = GetXLogReplayRecPtr(NULL); >> + >> + /* Decode until we run out of records */ >> while ((startptr != InvalidXLogRecPtr && startptr < end_of_wal) || >> (ctx->reader->EndRecPtr != InvalidXLogRecPtr && ctx->reader->EndRecPtr < end_of_wal)) >> { > > That seems like a pretty random change? You're right. It's a change from when the timeline determination logic happened inside pg_logical_slot_get_changes_guts that I didn't realise had become irrelevant now. Reverted, and in the process: - end_of_wal = GetXLogReplayRecPtr(NULL); + end_of_wal = GetXLogReplayRecPtr(&ThisTimeLineID); so we maintain the current timeline when called in recovery, where it's not inherited from the startup process. >> diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h >> index deaa7f5..caff9a6 100644 >> --- a/src/include/access/xlogreader.h >> +++ b/src/include/access/xlogreader.h >> @@ -27,6 +27,10 @@ >> >> #include "access/xlogrecord.h" >> >> +#ifndef FRONTEND >> +#include "nodes/pg_list.h" >> +#endif > > Why is that needed? It isn't anymore. Thanks for the catch. It was used when the timeline history had to be looked up more frequently, so it was kept in the xlogreader state. Since pg_xlogdump uses the xlogreader this meant it had to be conditional on backend use. Now that the timeline history only has to be looked up when first examining the segment in which the timeline transition occurs there's no point caching it there anymore. (I'd still like to teach pg_xlogdump to follow timelines but that's a "later"). >> diff --git a/src/test/modules/test_slot_timelines/README b/src/test/modules/test_slot_timelines/README > I still don't want to have this code in postgres, it's just an example > for doing something bad. Lets get proper feedback control in, and go > from there. 100% agree, and I'm going to cut it from the patch since I'm submitting updated feedback control now and have a PoC of decoding on standby too. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
On 16 November 2016 at 12:44, Craig Ringer <craig@2ndquadrant.com> wrote: > Despite that, I've attached a revised version of the current approach > incorporating fixes for the issues you mentioned. It's preceded by the > patch to add an --endpos option to pg_recvlogical so that we can > properly test the walsender interface too. I didn't rebase the patch that made the timeline following tests use the recvlogical support in PostgreNode.pm. Now attached. Even if timeline following isn't accepted as-is, I'd greatly appreciate inclusion of the first two patches as they add basic coverage of pg_recvlogical and a helper to make using it in tests simple and reliable. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
It's unlikely this will get in as a standalone patch, so I'm closing the CF entry for it as RwF https://commitfest.postgresql.org/11/779/ It is now being tracked as part of logical decoding on standby at https://commitfest.postgresql.org/12/788/ in thread "Logical decoding on standby", which began as https://www.postgresql.org/message-id/flat/CAMsr+YFi-LV7S8ehnwUiZnb=1h_14PwQ25d-vyUNq-f5S5r=Zg@mail.gmail.com#CAMsr+YFi-LV7S8ehnwUiZnb=1h_14PwQ25d-vyUNq-f5S5r=Zg@mail.gmail.com . -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services