Thread: Cascading replication and recovery_target_timeline='latest'
When a cascading standby launches a new walsender, it fetches the current
recovery timeline:

    /*
     * Use the recovery target timeline ID during recovery
     */
    if (am_cascading_walsender)
        ThisTimeLineID = GetRecoveryTargetTLI();

The comment in GetRecoveryTargetTLI() says this:

    /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
    return XLogCtl->RecoveryTargetTLI;

That comment is not true. RecoveryTargetTLI can change during recovery, if
you set recovery_target_timeline='latest'. In 'latest' mode, when the
(apparent) end of WAL is reached, the archive is scanned for any new
timeline history files that may have appeared. If a new timeline is found,
RecoveryTargetTLI is updated, and recovery continues on the new timeline.

Aside from the missing locking, I wonder what that does to a cascaded
standby. If there is an active walsender running while RecoveryTargetTLI
is changed, I think what will happen is that the walsender will continue
to stream WAL from the old timeline, but because the startup process is
now actually replaying from a different timeline, the walsender will send
bogus WAL to the standby.

When a standby ends recovery, creates a new timeline, and switches to
normal operation, the postmaster terminates all walsenders because of the
timeline change. But don't we have a race condition there, with a similar
effect? It might take a while for a walsender to die, and in that window,
it might send bogus WAL to the cascaded standby.

- Heikki
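A minimal sketch of the locked read being asked for here, following the
volatile-pointer/info_lck spinlock idiom used elsewhere in xlog.c; this is
an illustration of the idea, not necessarily the committed fix verbatim:

    /*
     * Sketch: read RecoveryTargetTLI under the shared-memory spinlock,
     * since it can change when recovery_target_timeline='latest'.
     */
    TimeLineID
    GetRecoveryTargetTLI(void)
    {
        /* use volatile pointer to prevent code rearrangement */
        volatile XLogCtlData *xlogctl = XLogCtl;
        TimeLineID  result;

        SpinLockAcquire(&xlogctl->info_lck);
        result = xlogctl->RecoveryTargetTLI;
        SpinLockRelease(&xlogctl->info_lck);

        return result;
    }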
On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> When a cascading standby launches a new walsender, it fetches the current
> recovery timeline:
>
>     /*
>      * Use the recovery target timeline ID during recovery
>      */
>     if (am_cascading_walsender)
>         ThisTimeLineID = GetRecoveryTargetTLI();
>
> The comment in GetRecoveryTargetTLI() says this:
>
>     /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
>     return XLogCtl->RecoveryTargetTLI;
>
> That comment is not true. RecoveryTargetTLI can change during recovery, if
> you set recovery_target_timeline='latest'. In 'latest' mode, when the
> (apparent) end of WAL is reached, the archive is scanned for any new
> timeline history files that may have appeared. If a new timeline is found,
> RecoveryTargetTLI is updated, and recovery continues on the new timeline.

Right. We need a lock there for now.

> Aside from the missing locking, I wonder what that does to a cascaded
> standby. If there is an active walsender running while RecoveryTargetTLI
> is changed, I think what will happen is that the walsender will continue
> to stream WAL from the old timeline, but because the startup process is
> now actually replaying from a different timeline, the walsender will send
> bogus WAL to the standby.

Good catch! That's a real problem. To address it, should we terminate all
cascading walsenders when a timeline history file is read and the recovery
target timeline is changed?

> When a standby ends recovery, creates a new timeline, and switches to
> normal operation, the postmaster terminates all walsenders because of the
> timeline change. But don't we have a race condition there, with a similar
> effect? It might take a while for a walsender to die, and in that window,
> it might send bogus WAL to the cascaded standby.

I don't think so. The cascading walsender can send WAL data only up to
min(replay_location, receive_location, restore_location). IOW, it sends
only replayed, received (i.e., streamed), and restored WAL files. These
WAL files belong to the old timeline, so ISTM that the cascading walsender
cannot send any new-timeline WAL files.

Regards,

--
Fujii Masao
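The send-limit rule described above, as a self-contained sketch; WAL
positions are simplified to plain byte offsets rather than the real
XLogRecPtr type, and the function name is invented for illustration:

    #include <stdint.h>

    /* Simplified stand-in for a WAL position (the real type is XLogRecPtr). */
    typedef uint64_t XLogPos;

    /*
     * A cascading walsender streams WAL only up to the minimum of what
     * this standby has replayed, received via streaming, and restored
     * from the archive, so it cannot get ahead of what the startup
     * process has seen on the current timeline.
     */
    static XLogPos
    cascading_send_limit(XLogPos replay_loc, XLogPos receive_loc,
                         XLogPos restore_loc)
    {
        XLogPos     limit = replay_loc;

        if (receive_loc < limit)
            limit = receive_loc;
        if (restore_loc < limit)
            limit = restore_loc;
        return limit;
    }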
On Sat, Sep 1, 2012 at 2:32 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> When a cascading standby launches a new walsender, it fetches the current
>> recovery timeline:
>>
>>     /*
>>      * Use the recovery target timeline ID during recovery
>>      */
>>     if (am_cascading_walsender)
>>         ThisTimeLineID = GetRecoveryTargetTLI();
>>
>> The comment in GetRecoveryTargetTLI() says this:
>>
>>     /* RecoveryTargetTLI doesn't change so we need no lock to copy it */
>>     return XLogCtl->RecoveryTargetTLI;
>>
>> That comment is not true. RecoveryTargetTLI can change during recovery,
>> if you set recovery_target_timeline='latest'. In 'latest' mode, when the
>> (apparent) end of WAL is reached, the archive is scanned for any new
>> timeline history files that may have appeared. If a new timeline is
>> found, RecoveryTargetTLI is updated, and recovery continues on the new
>> timeline.
>
> Right. We need a lock there for now.
>
>> Aside from the missing locking, I wonder what that does to a cascaded
>> standby. If there is an active walsender running while RecoveryTargetTLI
>> is changed, I think what will happen is that the walsender will continue
>> to stream WAL from the old timeline, but because the startup process is
>> now actually replaying from a different timeline, the walsender will
>> send bogus WAL to the standby.
>
> Good catch! That's a real problem. To address it, should we terminate all
> cascading walsenders when a timeline history file is read and the
> recovery target timeline is changed?

This is not the right fix. After terminating the cascading walsenders, it
might take them some time to come to an end, and during that time they
might send bogus WAL from the old timeline. Currently there is no
safeguard against sending bogus WAL from the old timeline. To implement
such a safeguard, a cascading walsender needs to know when the timeline is
updated and which is the last valid WAL file of the timeline, as the
startup process does. IOW, we need to change cascading walsenders so that
they also read and understand the timeline history files. This is not an
easy fix at this stage (9.2.0 is about to be released...).

So, as one idea, I'm thinking to just forbid cascading replication when
recovery_target_timeline is set to 'latest'. Thoughts?

Regards,

--
Fujii Masao
On 03.09.2012 10:43, Fujii Masao wrote:
> On Sat, Sep 1, 2012 at 2:32 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>> On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Aside from the missing locking, I wonder what that does to a cascaded
>>> standby. If there is an active walsender running while RecoveryTargetTLI
>>> is changed, I think what will happen is that the walsender will continue
>>> to stream WAL from the old timeline, but because the startup process is
>>> now actually replaying from a different timeline, the walsender will
>>> send bogus WAL to the standby.
>>
>> Good catch! That's a real problem. To address it, should we terminate
>> all cascading walsenders when a timeline history file is read and the
>> recovery target timeline is changed?
>
> This is not the right fix. After terminating the cascading walsenders, it
> might take them some time to come to an end, and during that time they
> might send bogus WAL from the old timeline. Currently there is no
> safeguard against sending bogus WAL from the old timeline. To implement
> such a safeguard, a cascading walsender needs to know when the timeline
> is updated and which is the last valid WAL file of the timeline, as the
> startup process does. IOW, we need to change cascading walsenders so that
> they also read and understand the timeline history files. This is not an
> easy fix at this stage (9.2.0 is about to be released...).
>
> So, as one idea, I'm thinking to just forbid cascading replication when
> recovery_target_timeline is set to 'latest'. Thoughts?

Hmm, I was thinking that when the walsender gets the position it can send
WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
current recovery timeline. If it has changed, refuse to send the new WAL
and terminate. That would be a fairly small change; it would just close
the window between requesting walsenders to terminate and them actually
terminating.

- Heikki
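Roughly the shape of that check in the walsender's send path. The
am_cascading_walsender and walsender_ready_to_stop names follow
walsender.c conventions, but treat this fragment as a sketch of the idea
rather than the eventual patch:

    /*
     * Sketch: before sending, a cascading walsender re-reads the recovery
     * target timeline (under the lock) and stops if it has moved.
     */
    if (am_cascading_walsender)
    {
        TimeLineID  currentTargetTLI = GetRecoveryTargetTLI();

        if (currentTargetTLI != ThisTimeLineID)
        {
            /*
             * The startup process has switched to a new target timeline;
             * anything more we sent from the old timeline could be bogus.
             * Stop sending and terminate, forcing the cascaded standby
             * to reconnect and pick up the new timeline.
             */
            walsender_ready_to_stop = true;
            return;
        }
    }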
On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> On 03.09.2012 10:43, Fujii Masao wrote:
>> On Sat, Sep 1, 2012 at 2:32 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>> On Fri, Aug 31, 2012 at 5:03 PM, Heikki Linnakangas <hlinnaka@iki.fi>
>>> wrote:
>>>> Aside from the missing locking, I wonder what that does to a cascaded
>>>> standby. If there is an active walsender running while
>>>> RecoveryTargetTLI is changed, I think what will happen is that the
>>>> walsender will continue to stream WAL from the old timeline, but
>>>> because the startup process is now actually replaying from a
>>>> different timeline, the walsender will send bogus WAL to the standby.
>>>
>>> Good catch! That's a real problem. To address it, should we terminate
>>> all cascading walsenders when a timeline history file is read and the
>>> recovery target timeline is changed?
>>
>> This is not the right fix. After terminating the cascading walsenders,
>> it might take them some time to come to an end, and during that time
>> they might send bogus WAL from the old timeline. Currently there is no
>> safeguard against sending bogus WAL from the old timeline. To implement
>> such a safeguard, a cascading walsender needs to know when the timeline
>> is updated and which is the last valid WAL file of the timeline, as the
>> startup process does. IOW, we need to change cascading walsenders so
>> that they also read and understand the timeline history files. This is
>> not an easy fix at this stage (9.2.0 is about to be released...).
>>
>> So, as one idea, I'm thinking to just forbid cascading replication when
>> recovery_target_timeline is set to 'latest'. Thoughts?
>
> Hmm, I was thinking that when the walsender gets the position it can send
> WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
> current recovery timeline. If it has changed, refuse to send the new WAL
> and terminate. That would be a fairly small change; it would just close
> the window between requesting walsenders to terminate and them actually
> terminating.

Yeah, sounds good. Could you implement the patch? If you don't have time,
I will....

Regards,

--
Fujii Masao
On 03.09.2012 16:25, Fujii Masao wrote:
> On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Hmm, I was thinking that when the walsender gets the position it can
>> send WAL up to, in GetStandbyFlushRecPtr(), it could atomically check
>> the current recovery timeline. If it has changed, refuse to send the new
>> WAL and terminate. That would be a fairly small change; it would just
>> close the window between requesting walsenders to terminate and them
>> actually terminating.
>
> Yeah, sounds good. Could you implement the patch? If you don't have time,
> I will....

I'll give it a shot..

- Heikki
On 03.09.2012 16:26, Heikki Linnakangas wrote:
> On 03.09.2012 16:25, Fujii Masao wrote:
>> On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi>
>> wrote:
>>> Hmm, I was thinking that when the walsender gets the position it can
>>> send WAL up to, in GetStandbyFlushRecPtr(), it could atomically check
>>> the current recovery timeline. If it has changed, refuse to send the
>>> new WAL and terminate. That would be a fairly small change; it would
>>> just close the window between requesting walsenders to terminate and
>>> them actually terminating.
>>
>> Yeah, sounds good. Could you implement the patch? If you don't have
>> time, I will....
>
> I'll give it a shot..

So, this is what I came up with, please review.

- Heikki
Attachments
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> Hmm, I was thinking that when the walsender gets the position it can send
> WAL up to, in GetStandbyFlushRecPtr(), it could atomically check the
> current recovery timeline. If it has changed, refuse to send the new WAL
> and terminate. That would be a fairly small change; it would just close
> the window between requesting walsenders to terminate and them actually
> terminating.

It looks to me like a bug fix that also applies to the non-cascading
situation. Is that right?

Regards,

--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
Guys,

Is this a patch to 9.3? i.e. do we need to delay the release for this?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On 04.09.2012 03:02, Dimitri Fontaine wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> Hmm, I was thinking that when the walsender gets the position it can
>> send WAL up to, in GetStandbyFlushRecPtr(), it could atomically check
>> the current recovery timeline. If it has changed, refuse to send the new
>> WAL and terminate. That would be a fairly small change; it would just
>> close the window between requesting walsenders to terminate and them
>> actually terminating.
>
> It looks to me like a bug fix that also applies to the non-cascading
> situation. Is that right?

No, only cascading replication is affected. In a non-cascading situation,
the timeline never changes in the master. It's only in cascading mode that
you have a problem, where the standby can cross timelines while it's
replaying the WAL, and also sending it over to a cascading standby.

- Heikki
On 04.09.2012 15:41, Josh Berkus wrote:
> Guys,
>
> Is this a patch to 9.3?
>
> i.e. do we need to delay the release for this?

It is for 9.2. I'll do a little bit more testing, and barring any issues,
commit the patch. What exactly is the schedule? Do we need to do an RC2
because of this?

- Heikki
Heikki,

> It is for 9.2. I'll do a little bit more testing, and barring any issues,
> commit the patch. What exactly is the schedule? Do we need to do an RC2
> because of this?

We're currently scheduled to release next week. If we need to do an RC2,
we're going to have to do some fast rescheduling; we've already started
the publicity machine.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
Josh Berkus <josh@agliodbs.com> writes:
> Heikki,
>> It is for 9.2. I'll do a little bit more testing, and barring any
>> issues, commit the patch. What exactly is the schedule? Do we need to do
>> an RC2 because of this?

> We're currently scheduled to release next week. If we need to do an RC2,
> we're going to have to do some fast rescheduling; we've already started
> the publicity machine.

At this point I would argue that the only thing that should abort the
launch is a bad regression. Minor bugs in new features (and this must be
minor if it wasn't noticed before) don't qualify.

Having said that, it'd be good to get it fixed if we can. The schedule
says to wrap 9.2.0 Thursday evening --- Heikki, can you get this fixed
tomorrow (Wednesday)?

            regards, tom lane
On 03.09.2012 17:40, Heikki Linnakangas wrote:
> On 03.09.2012 16:26, Heikki Linnakangas wrote:
>> On 03.09.2012 16:25, Fujii Masao wrote:
>>> On Tue, Sep 4, 2012 at 7:07 AM, Heikki Linnakangas <hlinnaka@iki.fi>
>>> wrote:
>>>> Hmm, I was thinking that when the walsender gets the position it can
>>>> send WAL up to, in GetStandbyFlushRecPtr(), it could atomically check
>>>> the current recovery timeline. If it has changed, refuse to send the
>>>> new WAL and terminate. That would be a fairly small change; it would
>>>> just close the window between requesting walsenders to terminate and
>>>> them actually terminating.
>>>
>>> Yeah, sounds good. Could you implement the patch? If you don't have
>>> time, I will....
>>
>> I'll give it a shot..
>
> So, this is what I came up with, please review.

While testing, I bumped into another related bug: when a WAL segment is
restored from the archive, we let a walsender send that whole WAL segment
to a cascading standby. However, there is no guarantee that the restored
WAL segment is complete. In particular, if a timeline changes within that
segment, e.g. 000000010000000000000004, that segment will be only
partially full, and the WAL continues at segment 000000020000000000000004,
on the next timeline.

This can also happen if you copy a partial WAL segment to the archive, for
example from a crashed master server. Or if you have set up record-based
WAL shipping without using streaming replication, per
http://www.postgresql.org/docs/devel/static/log-shipping-alternative.html#WARM-STANDBY-RECORD.
That manual page says you can only deal with whole WAL files that way, but
I think with standby_mode='on', that's actually no longer true.

So all in all, it seems like a shaky assumption that once you've restored
a WAL file from the archive, you're free to stream it to a cascading
standby. I think it would be more robust to limit it to streaming the file
only up to the point that it's been replayed - and thus verified - in the
1st standby. If everyone is OK with that change in behavior, the fix is
simple.

- Heikki
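The proposed behavior change as a self-contained sketch, again with WAL
positions simplified to byte offsets; the function and parameter names are
invented for illustration:

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t XLogPos;   /* simplified stand-in for XLogRecPtr */

    /*
     * A segment restored from the archive may be only partially valid
     * (a timeline switch mid-segment, or a partially copied file), so a
     * cascading walsender should not stream past what the startup
     * process has actually replayed, and thereby verified.
     */
    static XLogPos
    clamp_send_ptr(XLogPos segment_end, XLogPos replay_loc,
                   bool restored_from_archive)
    {
        if (restored_from_archive && replay_loc < segment_end)
            return replay_loc;
        return segment_end;
    }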
On 04.09.2012 16:50, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> Heikki,
>>> It is for 9.2. I'll do a little bit more testing, and barring any
>>> issues, commit the patch. What exactly is the schedule? Do we need to
>>> do an RC2 because of this?
>
>> We're currently scheduled to release next week. If we need to do an RC2,
>> we're going to have to do some fast rescheduling; we've already started
>> the publicity machine.
>
> At this point I would argue that the only thing that should abort the
> launch is a bad regression. Minor bugs in new features (and this must be
> minor if it wasn't noticed before) don't qualify.
>
> Having said that, it'd be good to get it fixed if we can. The schedule
> says to wrap 9.2.0 Thursday evening --- Heikki, can you get this fixed
> tomorrow (Wednesday)?

The attached patch fixes it for me. It fixes the original problem, by
adding the missing locking and terminating walsenders on a target timeline
change, and also changes the behavior wrt. WAL segments restored from the
archive, as I just suggested in another email
(http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).

The test case I've been using is a master and two standbys. The first
standby is set up to connect to the master with streaming replication, and
the other standby is set up to connect to the 1st standby, i.e. it's a
cascading standby. In addition, the master is set up to do WAL archiving
to a directory, and both standbys have a restore_command to read from that
archive, and recovery_target_timeline='latest'.

After the master and both standbys are running, I create a dummy
recovery.conf file in the master's data directory, with just
"restore_command='/bin/false'" in it, and restart the master. That forces
a timeline change in the master. With the patch, the 1st standby will
notice the new timeline in the archive, switch to that, and reconnect to
the master. The cascading connection to the 2nd standby is terminated
because of the timeline change; the 2nd standby will also scan the
archive, pick up the new timeline, reconnect to the 1st standby, and be in
sync again.

- Heikki
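A reconstruction of that test setup as configuration; the host names,
ports, and archive paths below are invented for illustration:

    # postgresql.conf on the master (archiving + streaming):
    wal_level = hot_standby
    max_wal_senders = 5
    archive_mode = on
    archive_command = 'cp %p /mnt/archive/%f'       # path is hypothetical

    # recovery.conf on standby 1 (streams from the master):
    standby_mode = 'on'
    primary_conninfo = 'host=master port=5432'      # hypothetical conninfo
    restore_command = 'cp /mnt/archive/%f %p'
    recovery_target_timeline = 'latest'

    # recovery.conf on standby 2 (cascades from standby 1):
    standby_mode = 'on'
    primary_conninfo = 'host=standby1 port=5432'    # hypothetical conninfo
    restore_command = 'cp /mnt/archive/%f %p'
    recovery_target_timeline = 'latest'

    # To force a timeline change, put a dummy recovery.conf in the
    # *master's* data directory and restart it:
    restore_command = '/bin/false'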
Attachments
On 04.09.2012 17:34, Heikki Linnakangas wrote:
> On 04.09.2012 16:50, Tom Lane wrote:
>> Josh Berkus <josh@agliodbs.com> writes:
>>> Heikki,
>>>> It is for 9.2. I'll do a little bit more testing, and barring any
>>>> issues, commit the patch. What exactly is the schedule? Do we need to
>>>> do an RC2 because of this?
>>
>>> We're currently scheduled to release next week. If we need to do an
>>> RC2, we're going to have to do some fast rescheduling; we've already
>>> started the publicity machine.
>>
>> At this point I would argue that the only thing that should abort the
>> launch is a bad regression. Minor bugs in new features (and this must be
>> minor if it wasn't noticed before) don't qualify.
>>
>> Having said that, it'd be good to get it fixed if we can. The schedule
>> says to wrap 9.2.0 Thursday evening --- Heikki, can you get this fixed
>> tomorrow (Wednesday)?
>
> The attached patch fixes it for me. It fixes the original problem, by
> adding the missing locking and terminating walsenders on a target
> timeline change, and also changes the behavior wrt. WAL segments restored
> from the archive, as I just suggested in another email
> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).

Committed that.

- Heikki
On Tue, 2012-09-04 at 19:34 -0700, Heikki Linnakangas wrote:
>> The attached patch fixes it for me. It fixes the original problem, by
>> adding the missing locking and terminating walsenders on a target
>> timeline change, and also changes the behavior wrt. WAL segments
>> restored from the archive, as I just suggested in another email
>> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).
>
> Committed that.

New compiler warnings:

xlog.c: In function ‘XLogFileRead’:
xlog.c:2785:14: error: unused variable ‘endptr’ [-Werror=unused-variable]
xlog.c:2784:25: error: unused variable ‘xlogctl’ [-Werror=unused-variable]
On 04.09.2012 21:56, Peter Eisentraut wrote:
> On Tue, 2012-09-04 at 19:34 -0700, Heikki Linnakangas wrote:
>>> The attached patch fixes it for me. It fixes the original problem, by
>>> adding the missing locking and terminating walsenders on a target
>>> timeline change, and also changes the behavior wrt. WAL segments
>>> restored from the archive, as I just suggested in another email
>>> (http://archives.postgresql.org/pgsql-hackers/2012-09/msg00206.php).
>>
>> Committed that.
>
> New compiler warnings:
>
> xlog.c: In function ‘XLogFileRead’:
> xlog.c:2785:14: error: unused variable ‘endptr’ [-Werror=unused-variable]
> xlog.c:2784:25: error: unused variable ‘xlogctl’ [-Werror=unused-variable]

Fixed, thanks.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> On 04.09.2012 03:02, Dimitri Fontaine wrote:
>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>> Hmm, I was thinking that when the walsender gets the position it can
>>> send WAL up to, in GetStandbyFlushRecPtr(), it could atomically check
>>> the current recovery timeline. If it has changed, refuse to send the
>>> new WAL and terminate. That would be a fairly small change; it would
>>> just close the window between requesting walsenders to terminate and
>>> them actually terminating.
>
> No, only cascading replication is affected. In a non-cascading situation,
> the timeline never changes in the master. It's only in cascading mode
> that you have a problem, where the standby can cross timelines while it's
> replaying the WAL, and also sending it over to a cascading standby.

It seems to me that it applies to connecting a standby to a newly promoted
standby too, as the timeline did change in this case too.

Regards,

--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On 05.09.2012 01:03, Dimitri Fontaine wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> On 04.09.2012 03:02, Dimitri Fontaine wrote:
>>> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>>>> Hmm, I was thinking that when the walsender gets the position it can
>>>> send WAL up to, in GetStandbyFlushRecPtr(), it could atomically check
>>>> the current recovery timeline. If it has changed, refuse to send the
>>>> new WAL and terminate. That would be a fairly small change; it would
>>>> just close the window between requesting walsenders to terminate and
>>>> them actually terminating.
>>
>> No, only cascading replication is affected. In a non-cascading
>> situation, the timeline never changes in the master. It's only in
>> cascading mode that you have a problem, where the standby can cross
>> timelines while it's replaying the WAL, and also sending it over to a
>> cascading standby.
>
> It seems to me that it applies to connecting a standby to a newly
> promoted standby too, as the timeline did change in this case too.

I was worried about that too at first, but Fujii pointed out that it's OK:
see the last paragraph at
http://archives.postgresql.org/pgsql-hackers/2012-08/msg01203.php.

If you connect to a standby that was already promoted to the new master,
it's no different from connecting to a master in general. It works.

If you connect just before a standby is promoted, it works because a
cascading standby pays attention to the recovery target timeline and the
pointer to the last replayed WAL record. Promoting a standby doesn't
change the recovery target timeline or the last replayed WAL record; it
sets XLogCtl->ThisTimeLineID. So the walsender in cascading mode will send
WAL up to where the promotion happened, but will stop there until it's
terminated by the signal.

- Heikki
Heikki Linnakangas <hlinnaka@iki.fi> writes:
> I was worried about that too at first, but Fujii pointed out that it's
> OK: see the last paragraph at
> http://archives.postgresql.org/pgsql-hackers/2012-08/msg01203.php.

Mmm, ok.

I'm worried about a master-standby-standby setup where the master
disappears, we promote a standby, and the second standby now feeds from
the newly promoted standby. Well, we have to reconnect manually in this
case, but don't we need some similar stopgaps?

Regards,

--
Dimitri Fontaine
http://2ndQuadrant.fr     PostgreSQL : Expertise, Formation et Support
On 05.09.2012 07:55, Dimitri Fontaine wrote:
> Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I was worried about that too at first, but Fujii pointed out that it's
>> OK: see the last paragraph at
>> http://archives.postgresql.org/pgsql-hackers/2012-08/msg01203.php.
>
> Mmm, ok.
>
> I'm worried about a master-standby-standby setup where the master
> disappears, we promote a standby, and the second standby now feeds from
> the newly promoted standby. Well, we have to reconnect manually in this
> case, but don't we need some similar stopgaps?

The second standby will have to reconnect, but it will happen
automatically.

- Heikki