Discussion: Potential data loss due to race condition during logical replication slot creation
Hello,
We discovered a race condition during logical replication slot creation that can result in the changes of transactions running at the time of slot creation being only partially replicated. The cause is the slot transitioning from an inconsistent or partially consistent state to a fully consistent state by restoring a snapshot that had been persisted to disk by a different logical slot. We provide a simple reproduction of this issue below:
Session 1:
SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');
CREATE TABLE test (a int);
BEGIN;
INSERT INTO test VALUES (1);
Session 2:
SELECT pg_create_logical_replication_slot('slot2', 'test_decoding');
<query hangs>
Session 3:
CHECKPOINT;
select pg_logical_slot_get_changes('slot1', NULL, NULL);
<should return nothing of interest>
Session 1:
INSERT INTO test VALUES (2);
COMMIT;
<Session 2 query no longer hangs and successfully creates slot2>
Session 2:
select pg_logical_slot_get_changes('slot1', NULL, NULL);
select pg_logical_slot_get_changes('slot2', NULL, NULL);
<expected: no rows of the txn are returned for slot2>
<actual: The 2nd row of the txn is returned for slot2>
A newly created logical replication slot initializes its restart LSN to the current insert position in the WAL and also forces the current set of running transactions on the system to be logged in a running-xacts record. The create process then waits for all of the transactions in that running-xacts record to complete before it can transition to the next snapbuild state. During this window, if another running-xacts record is written and a different logical replication process decodes it, a globally accessible snapshot is persisted to disk.
Once all of the transactions from the initial running-xacts record have finished, the process performing the slot creation becomes unblocked and consumes the new running-xacts record. The process sees a valid snapshot, restores that snapshot from disk, and transitions immediately to the consistent state. The confirmed flush LSN of the slot is then set to the start of the next record after that running-xacts record.
We now have a logical replication slot whose restart LSN lies after some changes of transactions that will commit after its confirmed flush LSN, in any case where the two running-xacts records are not fully disjoint. Once those transactions commit, their changes are streamed only partially.
The attached fix addresses the issue by giving the snapshot builder enough context to know whether it is building a snapshot for a slot for the first time or for a previously existing slot. This is similar to the "need_full_snapshot" parameter, which is already set by the caller to control when and how the snapbuilder is allowed to become consistent.
With this context, the snapshot builder can always skip restoring a persisted snapshot in order to become fully consistent. Since the issue only occurs when the logical replication slot consumes a persisted snapshot to become fully consistent, disallowing that behavior during slot creation prevents the issue.
Thanks,
Drew
Attachments
Re: Potential data loss due to race condition during logical replication slot creation
Hi,

Thank you for the report!

On Fri, Feb 2, 2024 at 7:28 AM Callahan, Drew <callaan@amazon.com> wrote:
> We discovered a race condition during logical replication slot creation
> that can result in the changes for transactions running at the time of
> the slot creation to only be partially replicated. [...]
> <expected: no rows of the txn are returned for slot2>
> <actual: The 2nd row of the txn is returned for slot2>

I've confirmed this happens on HEAD and all supported back branches. And I think it's a (long-standing) bug.

I think we can prepare an isolation test of the above steps and include it in contrib/test_decoding/specs.

> Once all of the transactions from the initial running xact have finished,
> the process performing the slot creation will become unblocked and will
> then consume the new running xact record. [...]

My analysis is the same. After the slot creation, the slot's LSNs become like:

(tx-begin) < (tx's some changes) < slot's restart_lsn < (tx's further some changes) < slot's confirmed_flush_lsn < (tx-commit)

> With this context, the snapshot builder can always skip performing
> snapshot restore in order to become fully consistent. [...]

I basically agree with the basic idea of skipping snapshot restore while finding the initial start point. But I'd like to hear other opinions too.

Another simple way to fix this is that we always set need_full_snapshot to true at slot creation. But it would then make the snapshot builder include non-catalog-change transactions too while finding the initial start point, which is not necessary.

As for the proposed patch, it uses a global variable, InCreate, and a memory context callback to reset it. While I believe this is for not breaking any ABI, as this bug exists on back branches too, I think it's better to consider how to fix it on HEAD and then how to back-patch it. We don't necessarily need to have the same fixes for all branches all the time.

IIUC we need to skip snapshot restore only when we're finding the initial start point. So probably we don't need to set InCreate when calling create_logical_replication_slot() with find_startpoint = false.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
> I think we can prepare an isolation test of the above steps and
> include it in contrib/test_decoding/specs.

Thanks for the suggestion. Included an isolation test for the exact repro and a couple of other tests for fully-before and fully-after. These two tests don't add a whole lot of value since this is already heavily tested elsewhere, so I'm good with removing them. However, I thought it made sense to include them from a perspective of totality given the test name.

> Another simple way to fix this is that we always set
> need_full_snapshot to true when slot creation. But it would then make
> the snapshot builder include non-catalog-change transactions too while
> finding the initial startpoint. Which is not necessary.

Since we're not accumulating changes anyway and we never distribute snapshots while inconsistent, it wouldn't cause any memory pressure in the ReorderBuffer. This seems pretty contained and wouldn't require the use of a global variable and memory context callbacks, which is a bit nicer from a readability standpoint. It's a little inefficient with memory and CPU, especially in the presence of long-running transactions, but likely would not be noticeable to most users.

> As for the proposed patch, it uses a global variable, InCreate, and
> memory context callback to reset it. While I believe this is for not
> breaking any ABI as this bug exists on back branches too, I think it's
> better to consider how to fix it in HEAD and then how to backpatch it.
> We don't necessarily need to have the same fixes for all branches all
> the time.

Attached a patch that is ABI breaking but removes the need for a global variable. This relies on pushing the context of whether we are in create mode into the logical decoding context, which is cleaner, but not as clean as pushing it all the way down to the snapbuilder. I would push it down to the snap builder, but the entire object is persisted to disk. This is problematic to me since the create flag would be persisted, causing existing slots to restore a snapbuild that could have the create flag set to true. This is not problematic as-is, since the variable would only be used in SnapBuildRestore(), which is only called while inconsistent, but it seems like a future bug waiting to happen.

We could always clear the flag when persisting, but that seems a bit hacky. We could also fork the snap builder into on-disk and in-memory structs, which would be convenient since we would not need to invalidate existing snapshots. However, I'm not sure how people feel about the fork.

> IIUC we need to skip snapshot restore only when we're finding the
> initial start point. So probably we don't need to set InCreate when
> calling create_logical_replication_slot() with find_startpoint =
> false.

Yep, that's correct. We can safely skip setting InCreate. We could move the setter function to slotfuncs.c as one way of doing this. I lean towards setting it no matter what, since it is harmless if we never try to find a decoding point and it removes the need for future callers of CreateInitDecodingContext() to determine whether it's safe to be false. Though I agree that InCreate may be a bigger hammer than needed, which may need to be modified in the future.

Thanks for the quick response,
Drew Callahan
Attachments
Re: Potential data loss due to race condition during logical replication slot creation
On Fri, 2 Feb 2024 at 06:24, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> As for the proposed patch, it uses a global variable, InCreate, and
> memory context callback to reset it. While I believe this is for not
> breaking any ABI as this bug exists on back branches too, I think it's
> better to consider how to fix it in HEAD and then how to backpatch it.
> We don't necessarily need to have the same fixes for all branches all
> the time.

I agree to having a separate patch for HEAD in this case. How about a fix like the attached patch?

Regards,
Vignesh
Attachments
Re: Potential data loss due to race condition during logical replication slot creation
On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> Thanks for the suggestion. Included an isolation test for the
> exact repro and a couple other tests for fully before and fully after. [...]

Thank you for the patch for the isolation test! Since Vignesh also shared the isolation patch, I've merged them and modified some comments. I've attached the patch. I'm going to merge the isolation test patch into the patch that fixes the issue after we get a consensus on how to fix it.

> Attached a patch that is ABI breaking, but removes the need for a global variable.
> [...] This is problematic to me since the create flag would be persisted
> causing existing slots to restore a snapbuild that could have the create
> flag set to true. [...]

But IIUC in SnapBuildRestore(), we selectively restore variables from the serialized SnapBuild. For instance, building_full_snapshot and initial_xmin_horizon are serialized to disk but ignored when restoring a snapshot. I think that even if we add the in_create flag to SnapBuild, it would be ignored on snapshot restore.

I think it's cleaner if we could pass the in_create flag to AllocateSnapshotBuilder() than passing the flag down to SnapBuildFindSnapshot() from standby_decode().

> Yep, that's correct. We can safely skip setting InCreate. We could move
> the setter function slotfuncs.c as one way of doing this. I lean towards
> setting it no matter what [...]

I agree with setting the flag no matter what. The flag is needed for finding the initial start point, but it would be harmless to set it even if the caller doesn't call DecodingContextFindStartpoint(). If the caller skips calling DecodingContextFindStartpoint(), it's the caller's responsibility to set sane LSNs on the slot before starting the logical decoding. So it makes sense to me to remove the need for callers to determine whether to set the flag, as you mentioned.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments
Re: Potential data loss due to race condition during logical replication slot creation
On Sat, Feb 3, 2024 at 10:48 PM vignesh C <vignesh21@gmail.com> wrote:
> I agree to having a separate patch for HEAD in this case. How about a
> fix like the attached patch?

Thank you for the patch! I've just sent some comments on how to fix this issue and updated the isolation test patch[1].

While I basically agree with adding the flag to SnapBuild, I think we can set the flag in AllocateSnapshotBuilder(). Also, we need to bump SNAPBUILD_VERSION.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoCsSkwcAYgcUMi2zKp8F%3DmoJL4K1beGD6AV_%3DpHhfAM9Q%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> Since we're not accumulating changes anyway and we never distribute snapshots
> while inconsistent it wouldn't cause any memory pressure in the ReorderBuffer.
> This seems pretty contained and wouldn't require the use of a global variable
> and memory context callbacks which is a bit nicer from a readability standpoint.

Sorry, I missed this comment.

I think it's a good point that this fix doesn't lead to any on-disk incompatibility while not affecting existing users much, especially for back-patching. If we want to choose this approach for back branches, we need to carefully consider the side impacts.

> It's a little inefficient with memory and CPU, especially in the presence of
> long running transactions, but likely would not be noticeable to most users.

This matches my analysis.

Here is the summary of several proposals we've discussed:

a) Have CreateInitDecodingContext() always pass need_full_snapshot = true to AllocateSnapshotBuilder().

This is the simplest approach and doesn't break the on-disk SnapBuildOnDisk compatibility. A downside is that it's a little inefficient with memory and CPU since it includes non-catalog-change transactions too. Also, passing need_full_snapshot to CreateInitDecodingContext() will no longer have any meaning.

b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.

This bumps SNAPBUILD_VERSION and therefore breaks the on-disk SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple versions. However, since different branches are using different versions of SnapBuildOnDisk, it would probably be hard to have a SnapBuildOnDisk version that is consistent across all branches.

c) Add a global variable, say in_create, to snapbuild.c.

This would also require adding a setter function for in_create. There are several ideas on where to set/reset the flag. One idea is that we reset the flag in AllocateSnapshotBuilder() and set the flag anywhere before starting to find the start point, for example at the beginning of DecodingContextFindStartpoint(). It probably won't require a memory context callback to make sure the flag gets cleared. This idea doesn't have a downside from the perspective of users and extensions, but I'm slightly hesitant to add a global variable.

What do you think? Any other ideas?

As a separate topic, I think this back-patching complexity comes from the fact that we're serializing the whole SnapBuild to disk although we don't need to serialize all of its fields. In order to make future back-patching easy, it might be worth considering decoupling the fields that need to be serialized from the SnapBuild struct.

Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Potential data loss due to race condition during logical replication slot creation
On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Sat, Feb 3, 2024 at 12:12 PM Callahan, Drew <callaan@amazon.com> wrote:
> >
> > > I think we can prepare an isolation test of the above steps and
> > > include it in contrib/test_decoding/specs.
> >
> > Thanks for the suggestion. Included an isolation test for the
> > exact repro and a couple of other tests for fully before and fully after.
> > These two tests don't add a whole lot of value since this
> > is already heavily tested elsewhere, so I'm good with removing them.
> > However, I thought it made sense to include them from a perspective of
> > totality given the test name.
> >
> > > Another simple way to fix this is that we always set
> > > need_full_snapshot to true at slot creation. But it would then make
> > > the snapshot builder include non-catalog-change transactions too while
> > > finding the initial startpoint, which is not necessary.
> >
> > Since we're not accumulating changes anyway and we never distribute snapshots
> > while inconsistent, it wouldn't cause any memory pressure in the ReorderBuffer.
> > This seems pretty contained and wouldn't require the use of a global variable
> > and memory context callbacks, which is a bit nicer from a readability standpoint.
>
> Sorry I missed this comment.
>
> I think it's a good point that this fix doesn't lead to any on-disk
> incompatibility while not affecting existing users much, especially for
> back-patching. If we want to choose this approach for back branches,
> we need to carefully consider the side effects.
>
> > It's a little inefficient with memory and CPU, especially in the presence of
> > long-running transactions, but likely would not be noticeable to most users.
>
> This matches my analysis.
>
> Here is a summary of the several proposals we've discussed:
>
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().
>
> This is the simplest approach and doesn't break the on-disk
> SnapBuildOnDisk compatibility. A downside is that it's a little
> inefficient with memory and CPU since it includes non-catalog-change
> transactions too. Also, passing need_full_snapshot to
> CreateInitDecodingContext() would no longer have any meaning.
>
> b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.
>
> This bumps SNAPBUILD_VERSION and therefore breaks the on-disk
> SnapBuildOnDisk compatibility, but snapbuild.c can handle multiple
> versions. However, since different branches are using different versions
> of SnapBuildOnDisk, it would probably be hard to have a SnapBuildOnDisk
> version that is consistent across all branches.
>
> c) Add a global variable, say in_create, to snapbuild.c
>
> It would also require adding a setter function for in_create. There
> are several ideas on where to set/reset the flag. One idea is that we
> reset the flag in AllocateSnapshotBuilder() and set the flag anywhere
> before starting to find the start point, for example at the beginning
> of DecodingContextFindStartpoint(). It probably won't require a memory
> context callback to make sure the flag is cleared. This idea doesn't
> have a downside from a users' and extensions' perspective. But I'm
> slightly hesitant to add a global variable.
>
> What do you think? and any other ideas?

I've drafted the idea (c) for discussion (for master and v16 for now).
I also liked the idea (a) but I'm concerned a bit about its future impact.

> As a separate topic, I think that this backpatching complexity comes
> from the fact that we're serializing a whole SnapBuild to the disk
> although we don't need to serialize all of its fields. In order to
> make future back-patching easy, it might be worth considering
> decoupling the fields that need to be serialized from the SnapBuild
> struct.

If we implement this idea, we won't need to bump SNAPBUILD_VERSION for HEAD.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments
RE: Potential data loss due to race condition during logical replication slot creation
Dear hackers,

While analyzing another failure [1], I found this thread. I think both failures occur
for the same reason. The reported failure happens when a replication slot is created
in the middle of a transaction and it reuses the snapshot serialized by another slot.
The reproducer is:

```
Session0
SELECT pg_create_logical_replication_slot('slot0', 'test_decoding');
BEGIN;
INSERT INTO foo ...

Session1
SELECT pg_create_logical_replication_slot('slot1', 'test_decoding');

Session2
CHECKPOINT;
SELECT pg_logical_slot_get_changes('slot0', NULL, NULL);

Session0
INSERT INTO var ... -- var is defined with (user_catalog_table = true)
COMMIT;

Session1
SELECT pg_logical_slot_get_changes('slot1', NULL, NULL);
-> Assertion failure.
```

> Here is the summary of several proposals we've discussed:
> a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> true to AllocateSnapshotBuilder().
> b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.
> c) Add a global variable, say in_create, to snapbuild.c

Regarding the three options raised by Sawada-san, I prefer approach a). Since the
issue can happen on all supported branches, we should choose the conservative
approach. Also, it would be quite painful to maintain different code for handling
the same issue across branches. The attached patch implements approach a), since
no one had done so yet. I also added a test that can trigger the assertion failure,
but I'm not sure it should be included.

[1]: https://www.postgresql.org/message-id/TYCPR01MB1207717063D701F597EF98A0CF5272%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachments
RE: Potential data loss due to race condition during logical replication slot creation
> The attached patch implements approach a), since no one had done so yet. I also
> added a test that can trigger the assertion failure, but I'm not sure it should
> be included.

The previously attached patch had some unnecessary changes. PSA the corrected version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachments
On Wed, Mar 13, 2024 at 3:17 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> > The attached patch implements approach a), since no one had done so yet. I also
> > added a test that can trigger the assertion failure, but I'm not sure it should
> > be included.

I feel that setting "needs_full_snapshot" to true for decoding means the
snapshot will start tracking non-catalog committed xacts as well, which
is costly. See SnapBuildCommitTxn(). Can we avoid this problem if we
had the list of all running xacts at the time we serialize the snapshot,
by not decoding any xact whose xid lies in that list? If so, one idea to
achieve this could be that we maintain the highest_running_xid while
serializing the snapshot, and then during restore, if that
highest_running_xid is <= builder->initial_xmin_horizon, we ignore
restoring the snapshot. We already have a few such cases handled in
SnapBuildRestore().

--
With Regards,
Amit Kapila.
On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 8, 2024 at 2:54 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > [...]
> >
> > Here is a summary of the several proposals we've discussed:
> >
> > a) Have CreateInitDecodingContext() always pass need_full_snapshot =
> > true to AllocateSnapshotBuilder().
> >
> > b) Have snapbuild.c be able to handle multiple SnapBuildOnDisk versions.
> >
> > c) Add a global variable, say in_create, to snapbuild.c
> >
> > What do you think? and any other ideas?
>
> I've drafted the idea (c) for discussion (for master and v16 for now).
> I also liked the idea (a) but I'm concerned a bit about its future impact.

I agree that (c) would be better than (a) as it avoids collecting
non-catalog xacts in the snapshot. However, I think we shouldn't avoid
restoring the snapshot unless really required. See my previous response.

--
With Regards,
Amit Kapila.
RE: Potential data loss due to race condition during logical replication slot creation
Dear Amit,

> I feel that setting "needs_full_snapshot" to true for decoding means the
> snapshot will start tracking non-catalog committed xacts as well, which
> is costly.

I think that approach was the most conservative one, since it does not require
changing the version of the serialized snapshot. However, I understand that you
want to consider an optimized solution for HEAD first.

> See SnapBuildCommitTxn(). Can we avoid this problem if we
> had the list of all running xacts at the time we serialize the snapshot,
> by not decoding any xact whose xid lies in that list? If so, one idea to
> achieve this could be that we maintain the highest_running_xid while
> serializing the snapshot, and then during restore, if that
> highest_running_xid is <= builder->initial_xmin_horizon, we ignore
> restoring the snapshot. We already have a few such cases handled in
> SnapBuildRestore().

Based on this idea, I made a prototype. It passes the tests added by others and
by me. What do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachments
On Tue, Mar 19, 2024 at 7:46 AM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> I think that approach was the most conservative one, since it does not require
> changing the version of the serialized snapshot. However, I understand that you
> want to consider an optimized solution for HEAD first.

Right, let's see if we can have a solution other than always avoiding
restoring snapshots during slot creation, even if that is just for HEAD.

> > See SnapBuildCommitTxn(). Can we avoid this problem if we
> > had the list of all running xacts at the time we serialize the snapshot,
> > by not decoding any xact whose xid lies in that list? [...]
>
> Based on this idea, I made a prototype. It passes the tests added by others and
> by me. What do others think?

Won't it be possible to achieve the same thing if we just save
(serialize) the highest xid among all running xacts?

--
With Regards,
Amit Kapila.
RE: Potential data loss due to race condition during logical replication slot creation
Dear Amit,

> Won't it be possible to achieve the same thing if we just save
> (serialize) the highest xid among all running xacts?

Indeed; here is an updated version. Since the xids array in xl_running_xacts is
not ordered, its entries must be scanned to find the highest one.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachments
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Mar 18, 2024 at 6:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> If so, one idea to achieve this could be that we maintain the
> highest_running_xid while serializing the snapshot, and then during
> restore, if that highest_running_xid is <= builder->initial_xmin_horizon,
> we ignore restoring the snapshot. We already have a few such cases
> handled in SnapBuildRestore().

I think that builder->initial_xmin_horizon could be older than
highest_running_xid, for example, when there is a logical replication
slot whose catalog_xmin is old. However, even in this case, we might
need to ignore restoring the snapshot. For example, a slightly modified
test case can still cause the same problem.

The test case in Kuroda-san's v2 patch:

permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Modified version of the test case (add "s0_insert1" between "s0_init" and "s0_begin"):

permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
RE: Potential data loss due to race condition during logical replication slot creation
Dear Sawada-san,

Thanks for the comments, and sorry for the late reply.

> I think that builder->initial_xmin_horizon could be older than
> highest_running_xid, for example, when there is a logical replication
> slot whose catalog_xmin is old. However, even in this case, we might
> need to ignore restoring the snapshot. For example, a slightly modified
> test case can still cause the same problem.
>
> The test case in Kuroda-san's v2 patch:
> permutation "s0_init" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"
>
> Modified version of the test case (add "s0_insert1" between "s0_init" and "s0_begin"):
> permutation "s0_init" "s0_insert1" "s0_begin" "s0_insert1" "s1_init" "s2_checkpoint" "s2_get_changes_slot0" "s0_insert2" "s0_commit" "s1_get_changes_slot0" "s1_get_changes_slot1"

Good catch. I confirmed that the variant still partially outputs the transaction:

```
step s1_get_changes_slot1: SELECT data FROM pg_logical_slot_get_changes('slot1', NULL, NULL, 'skip-empty-xacts', '1', 'include-xids', '0');
data
-----------------------------------------
BEGIN
table public.tbl: INSERT: val1[integer]:2
COMMIT
(3 rows)
```

/////

I analyzed this a bit. In the description below, I use these notations:

txn0 - has a single insert
txn1 - has begin-insert-commit operations; slot1 is created in parallel

While creating slot0, initial_xmin_horizon was also set, and it came from
GetOldestSafeDecodingTransactionId(). Since there were no slots in the database
before, the value was just the next xid. Then, that pinned xid was assigned to
txn0. This is correct behavior, because slot0 must decode txn0.

Finally, while creating slot1 after txn1 had started, the snapshot serialized by
slot0 was restored. txn1 was recorded as the highest_running_xid, but
initial_xmin_horizon was older than it, so the snapshot was restored and txn1 was
partially decoded:

initial_xmin_horizon (txn0) < highest_running_xid (txn1)

/////

So, how should we fix this? One idea is based on (c): add a flag which indicates
the existence of concurrent transactions, and skip the restoration if both flags
are true. PSA the PoC patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Attachments
Re: Potential data loss due to race condition during logical replication slot creation
On Wed, Mar 27, 2024 at 4:54 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote:
>
> [...]
>
> Good catch. I confirmed that the variant still partially outputs the transaction.
>
> [...]
>
> So, how should we fix this? One idea is based on (c): add a flag which indicates
> the existence of concurrent transactions, and skip the restoration if both flags
> are true. PSA the PoC patch.

With the PoC patch, we check ondisk.builder.is_there_running_xact in
SnapBuildRestore(), but can we just check running->xcnt in
SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is, if
builder->initial_xmin_horizon is valid (or builder->finding_start_point
is true) and running->xcnt > 0, we skip the snapshot restore. However,
I think there are still cases where we would unnecessarily skip snapshot
restores.

Probably, what we would like to avoid is this: we compute
initial_xmin_horizon and start to find the initial start point while
there is a concurrently running transaction, and then jump to the
consistent state by restoring the consistent snapshot before the
concurrent transaction commits. So we can ignore snapshot restores if
(the oldest XID among transactions running at the time of
CreateInitDecodingContext()) >= (oldestRunningXid in the
xl_running_xacts record).

I've drafted this idea in the attached patch just for discussion.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments
RE: Potential data loss due to race condition during logical replication slot creation
Dear Sawada-san,

> With the PoC patch, we check ondisk.builder.is_there_running_xact in
> SnapBuildRestore(),

Yes, the PoC requires that the state of the snapshot in the file be read.

> but can we just check running->xcnt in
> SnapBuildFindSnapshot() to skip calling SnapBuildRestore()? That is, if
> builder->initial_xmin_horizon is valid (or builder->finding_start_point
> is true) and running->xcnt > 0, we skip the snapshot restore.

IIUC, it does not require API modifications, which may be an advantage.

> However, I think there are still cases where we would unnecessarily
> skip snapshot restores.
>
> Probably, what we would like to avoid is this: we compute
> initial_xmin_horizon and start to find the initial start point while
> there is a concurrently running transaction, and then jump to the
> consistent state by restoring the consistent snapshot before the
> concurrent transaction commits.

Yeah, information from before the concurrent txns have committed should
not be used. I think that's why SnapBuildWaitSnapshot() waits until the
listed transactions have finished.

> So we can ignore snapshot restores if
> (the oldest XID among transactions running at the time of
> CreateInitDecodingContext()) >= (oldestRunningXid in the
> xl_running_xacts record).
>
> I've drafted this idea in the attached patch just for discussion.

Thanks for sharing the patch. I confirmed that all the tests and workloads
you pointed out in [1] pass. I will post here if I find other issues.

[1]: https://www.postgresql.org/message-id/CAD21AoDzLY9vRpo%2Bxb2qPtfn46ikiULPXDpT94sPyFH4GE8bYg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Potential data loss due to race condition during logical replication slot creation
On Mon, Mar 18, 2024 at 6:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 19, 2024 at 2:43 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > [...]
> >
> > I've drafted the idea (c) for discussion (for master and v16 for now).
> > I also liked the idea (a) but I'm concerned a bit about its future impact.
>
> I agree that (c) would be better than (a) as it avoids collecting
> non-catalog xacts in the snapshot. However, I think we shouldn't avoid
> restoring the snapshot unless really required.

Pondering further, I came across the question: in what case would we
need to restore the snapshot and jump to the consistent state when
finding the initial start point?

When creating a slot, in ReplicationSlotReserveWal(), we determine the
restart_lsn and write a RUNNING_XACT record. This record would be the
first RUNNING_XACT record that the logical decoding decodes in most
cases. In SnapBuildFindSnapshot(), if the record satisfies (a) (i.e.
running->oldestRunningXid == running->nextXid), it can jump to the
consistent state. If not, it means there were transactions running at
the time the RUNNING_XACT record was produced. Therefore, we must not
restore the snapshot and jump to the consistent state, because
otherwise we end up deciding the start point in the middle of a
transaction that started before the RUNNING_XACT record. Probably the
same is true for all subsequent snapbuild states.

I might be missing something, but I could not find a case where we can
or want to restore the serialized snapshot when finding the initial
start point. If my analysis is correct, we can go with either (a) or
(c) that I proposed before[1]. Between these two options, it also could
be an option that (a) is for back branches for safety and (c) is for
master.

Regards,

[1] https://www.postgresql.org/message-id/CAD21AoBH4NWdRcEjXpBFHSKuVO6wia-vHHHaKuEi-h4i4wbi8A%40mail.gmail.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com