Thread: Deadlock between logrep apply worker and tablesync worker
On my machine, the src/test/subscription/t/002_types.pl test
usually takes right about 1.5 seconds:

$ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
...
[14:22:12] t/002_types.pl .. ok 1550 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.25 csys = 0.95 CPU)
[14:22:13]

I noticed however that sometimes (at least one try in ten, for me)
it takes about 2.5 seconds:

[14:22:16] t/002_types.pl .. ok 2591 ms ( 0.00 usr 0.00 sys + 0.69 cusr 0.28 csys = 0.97 CPU)
[14:22:18]

and I've even seen 3.5 seconds.  I dug into this and eventually
identified the cause: it's a deadlock between a subscription's apply
worker and a tablesync worker that it's spawned.  Sometimes the apply
worker calls wait_for_relation_state_change (to wait for the tablesync
worker to finish) while it's holding a lock on pg_replication_origin.
If that's the case, then when the tablesync worker reaches
process_syncing_tables_for_sync it is able to perform
UpdateSubscriptionRelState and reach the transaction commit below
that; but when it tries to do replorigin_drop_by_name a little further
down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
So we have an undetected deadlock.  We escape that because
wait_for_relation_state_change has a 1-second timeout, after which
it rechecks GetSubscriptionRelState and is able to see the committed
relation state change; so it continues, and eventually releases its
transaction and the lock, permitting the tablesync worker to finish.

I've not tracked down the exact circumstances in which the apply
worker ends up holding a problematic lock, but it seems likely
that it corresponds to cases where its main loop has itself called
replorigin_drop_by_name, a bit further up, for some other concurrent
tablesync operation.  (In all the cases I've traced through, the apply
worker is herding multiple tablesync workers when this happens.)

I experimented with having the apply worker release its locks
before waiting for the tablesync worker, as attached.
This passes check-world and it seems to eliminate the test runtime
instability, but I wonder whether it's semantically correct.  This whole
business of taking table-wide ExclusiveLock on pg_replication_origin
looks like a horrid kluge that we should try to get rid of, not least
because I don't see any clear documentation of what hazard it's trying
to prevent.

Another thing that has a bad smell about it is the fact that
process_syncing_tables_for_sync uses two transactions in the first
place.  There's a comment there claiming that it's for crash safety,
but I can't help suspecting it's really because this case becomes a
hard deadlock without that mid-function commit.

It's not great in any case that the apply worker can move on in the
belief that the tablesync worker is done when in fact the latter still
has catalog state updates to make.  And I wonder what we're doing with
having both of them calling replorigin_drop_by_name ... shouldn't that
responsibility belong to just one of them?

So I think this whole area deserves a hard look, and I'm not at all
sure that what's attached is a good solution.

			regards, tom lane

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index 4647837b82..bb45c2107f 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -539,15 +539,27 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 				LWLockRelease(LogicalRepWorkerLock);
 
 				/*
-				 * Enter busy loop and wait for synchronization worker to
-				 * reach expected state (or die trying).
+				 * If we have a transaction, we must commit it to release
+				 * any locks we have (it's quite likely we hold lock on
+				 * pg_replication_origin, which the sync worker will need
+				 * to update).  Then start a new transaction so we can
+				 * examine catalog state.
 				 */
-				if (!started_tx)
+				if (started_tx)
+				{
+					CommitTransactionCommand();
+					StartTransactionCommand();
+				}
+				else
 				{
 					StartTransactionCommand();
 					started_tx = true;
 				}
+
+				/*
+				 * Enter busy loop and wait for synchronization worker to
+				 * reach expected state (or die trying).
+				 */
 				wait_for_relation_state_change(rstate->relid,
 											   SUBREL_STATE_SYNCDONE);
 			}
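The lock-held-while-polling shape described above, and the escape through
wait_for_relation_state_change's timeout recheck, can be modeled in
miniature with two plain pthreads.  This is only an illustrative sketch,
not PostgreSQL's lock manager or worker code; all names in it
(origin_lock, rel_state, apply_round, and so on) are invented for the
example:

```c
/*
 * Toy model of the interaction: the "apply" thread waits for the "sync"
 * thread's committed state change while (optionally) holding the lock the
 * sync thread still needs to finish up.  Because the waiter polls shared
 * state instead of waiting on the sync thread itself, the system still
 * completes even in the bad ordering -- it just wastes polling ticks,
 * mirroring the 1-second timeout escapes seen in the test.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

static pthread_mutex_t origin_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_int rel_state;        /* 0 = syncing, 1 = "SYNCDONE committed" */
static atomic_int sync_finished;    /* sync thread fully done */

static void *sync_thread(void *arg)
{
    (void) arg;
    atomic_store(&rel_state, 1);        /* commit the state change first */
    pthread_mutex_lock(&origin_lock);   /* then drop the origin: needs lock */
    pthread_mutex_unlock(&origin_lock);
    atomic_store(&sync_finished, 1);
    return NULL;
}

/*
 * Run one round; returns 1 if both sides completed.  With
 * hold_lock_while_waiting set, the sync thread can finish only after the
 * apply side's polling loop has observed the committed state and released
 * the lock -- an undetected deadlock broken only by the poll.
 */
int apply_round(int hold_lock_while_waiting)
{
    pthread_t sync;
    struct timespec tick = {0, 10 * 1000 * 1000};   /* 10 ms poll interval */

    atomic_store(&rel_state, 0);
    atomic_store(&sync_finished, 0);
    pthread_mutex_lock(&origin_lock);
    if (!hold_lock_while_waiting)
        pthread_mutex_unlock(&origin_lock);         /* the patched behavior */
    pthread_create(&sync, NULL, sync_thread, NULL);

    while (!atomic_load(&rel_state))    /* wait_for_relation_state_change() */
        nanosleep(&tick, NULL);

    if (hold_lock_while_waiting)
        pthread_mutex_unlock(&origin_lock);         /* only now can sync end */
    pthread_join(sync, NULL);
    return atomic_load(&sync_finished);
}
```

Had the apply side waited on the sync thread itself (say, via
pthread_join) while holding origin_lock, this would be a hard deadlock;
polling the committed state is what lets the real workers limp through,
one timeout at a time.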
On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> On my machine, the src/test/subscription/t/002_types.pl test
> usually takes right about 1.5 seconds:
>
> $ time make check PROVE_FLAGS=--timer PROVE_TESTS=t/002_types.pl
> ...
> [14:22:12] t/002_types.pl .. ok 1550 ms ( 0.00 usr 0.00 sys + 0.70 cusr 0.25 csys = 0.95 CPU)
> [14:22:13]
>
> I noticed however that sometimes (at least one try in ten, for me)
> it takes about 2.5 seconds:
>
> [14:22:16] t/002_types.pl .. ok 2591 ms ( 0.00 usr 0.00 sys + 0.69 cusr 0.28 csys = 0.97 CPU)
> [14:22:18]
>
> and I've even seen 3.5 seconds.  I dug into this and eventually
> identified the cause: it's a deadlock between a subscription's apply
> worker and a tablesync worker that it's spawned.  Sometimes the apply
> worker calls wait_for_relation_state_change (to wait for the tablesync
> worker to finish) while it's holding a lock on pg_replication_origin.
> If that's the case, then when the tablesync worker reaches
> process_syncing_tables_for_sync it is able to perform
> UpdateSubscriptionRelState and reach the transaction commit below
> that; but when it tries to do replorigin_drop_by_name a little further
> down, it blocks on acquiring ExclusiveLock on pg_replication_origin.
> So we have an undetected deadlock.  We escape that because
> wait_for_relation_state_change has a 1-second timeout, after which
> it rechecks GetSubscriptionRelState and is able to see the committed
> relation state change; so it continues, and eventually releases its
> transaction and the lock, permitting the tablesync worker to finish.
>
> I've not tracked down the exact circumstances in which the apply
> worker ends up holding a problematic lock, but it seems likely
> that it corresponds to cases where its main loop has itself called
> replorigin_drop_by_name, a bit further up, for some other concurrent
> tablesync operation.  (In all the cases I've traced through, the apply
> worker is herding multiple tablesync workers when this happens.)
>
> I experimented with having the apply worker release its locks
> before waiting for the tablesync worker, as attached.
>

I don't see any problem with your proposed change, but I was wondering
if it would be better to commit the transaction and release locks
immediately after performing the replication origin drop? By doing
that, we would minimize the amount of time the transaction holds the
lock.

> This passes check-world and it seems to eliminate the test runtime
> instability, but I wonder whether it's semantically correct.  This
> whole business of taking table-wide ExclusiveLock on
> pg_replication_origin looks like a horrid kluge that we should try to
> get rid of, not least because I don't see any clear documentation of
> what hazard it's trying to prevent.
>

IIRC, this is done to prevent concurrent drops of an origin, say via
the exposed API pg_replication_origin_drop(). See the discussion in
[1] related to it. If we want, we can optimize it so that we acquire
the lock on the specific origin, as mentioned in the comments in
replorigin_drop_by_name(), but it was not clear that this operation
would be frequent enough.

[1] - https://www.postgresql.org/message-id/CAHut%2BPuW8DWV5fskkMWWMqzt-x7RPcNQOtJQBp6SdwyRghCk7A%40mail.gmail.com

--
With Regards,
Amit Kapila.
On Mon, Jan 23, 2023 at 1:29 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Another thing that has a bad smell about it is the fact that
> process_syncing_tables_for_sync uses two transactions in the first
> place.  There's a comment there claiming that it's for crash safety,
> but I can't help suspecting it's really because this case becomes a
> hard deadlock without that mid-function commit.
>
> It's not great in any case that the apply worker can move on in
> the belief that the tablesync worker is done when in fact the latter
> still has catalog state updates to make.  And I wonder what we're
> doing with having both of them calling replorigin_drop_by_name
> ... shouldn't that responsibility belong to just one of them?
>

Originally, the origin was dropped in one place only (by the tablesync
worker), but we found a race condition, described in the comments in
process_syncing_tables_for_sync() before the start of the second
transaction, which led to this change. See the report and discussion of
that race condition in the email [1].

[1] - https://www.postgresql.org/message-id/CAD21AoAw0Oofi4kiDpJBOwpYyBBBkJj=sLUOn4Gd2GjUAKG-fw@mail.gmail.com

--
With Regards,
Amit Kapila.
On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> IIRC, this is done to prevent concurrent drops of an origin, say via
> the exposed API pg_replication_origin_drop(). See the discussion in
> [1] related to it. If we want, we can optimize it so that we acquire
> the lock on the specific origin, as mentioned in the comments in
> replorigin_drop_by_name(), but it was not clear that this operation
> would be frequent enough.

Here is an attached patch to lock the replication origin record using
LockSharedObject instead of locking the pg_replication_origin relation
in ExclusiveLock mode. Now a tablesync worker will wait only if it is
trying to drop the same replication origin that is concurrently being
dropped by the apply worker; the other tablesync workers will be able
to drop their replication origins without any wait.

Regards,
Vignesh
Attachments
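The difference this patch aims at can be sketched with ordinary mutexes:
a single coarse lock serializes every origin drop, while a lock keyed by
the origin's identity serializes only drops of the same origin.  This is
a hedged illustration with invented names (an origin id mapped onto a
small mutex array); PostgreSQL's LockSharedObject works on lock tags in
the shared lock manager, not on mutexes:

```c
/*
 * Coarse vs. per-object locking, in miniature.  try_begin_drop() stands
 * in for "acquire the lock needed to drop origin N": with per-object
 * locks, two workers dropping different origins never contend; only
 * drops of the same origin serialize.  (Ids colliding modulo
 * N_LOCK_SLOTS would still share a slot -- a striping compromise of this
 * toy model, not part of the real fix.)
 */
#include <pthread.h>

#define N_LOCK_SLOTS 4

static pthread_mutex_t origin_locks[N_LOCK_SLOTS] = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
};

static pthread_mutex_t *slot_for(unsigned int origin_id)
{
    return &origin_locks[origin_id % N_LOCK_SLOTS];
}

/* Returns 1 if the per-origin lock was free (the drop may proceed now),
 * 0 if another worker already holds it for this origin. */
int try_begin_drop(unsigned int origin_id)
{
    return pthread_mutex_trylock(slot_for(origin_id)) == 0;
}

void end_drop(unsigned int origin_id)
{
    pthread_mutex_unlock(slot_for(origin_id));
}
```

With one table-wide lock, a drop of origin 2 would block behind an
in-progress drop of origin 1 even though they are unrelated; that
cross-origin contention is what surfaced as the timeout stalls in the
test.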
On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > IIRC, this is done to prevent concurrent drops of an origin, say via
> > the exposed API pg_replication_origin_drop(). See the discussion in
> > [1] related to it. If we want, we can optimize it so that we acquire
> > the lock on the specific origin, as mentioned in the comments in
> > replorigin_drop_by_name(), but it was not clear that this operation
> > would be frequent enough.
>
> Here is an attached patch to lock the replication origin record using
> LockSharedObject instead of locking the pg_replication_origin relation
> in ExclusiveLock mode. Now a tablesync worker will wait only if it is
> trying to drop the same replication origin that is concurrently being
> dropped by the apply worker; the other tablesync workers will be able
> to drop their replication origins without any wait.
>

There is code in the function replorigin_drop_guts() that duplicates
the functionality introduced by replorigin_exists(). Can we reuse this
function for the same?

Also, it would be good if you can share the numbers for different runs
of "src/test/subscription/t/002_types.pl" before and after the patch.

--
With Regards,
Amit Kapila.
On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of an origin, say via
> > > the exposed API pg_replication_origin_drop(). See the discussion in
> > > [1] related to it. If we want, we can optimize it so that we acquire
> > > the lock on the specific origin, as mentioned in the comments in
> > > replorigin_drop_by_name(), but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking the pg_replication_origin relation
> > in ExclusiveLock mode. Now a tablesync worker will wait only if it is
> > trying to drop the same replication origin that is concurrently being
> > dropped by the apply worker; the other tablesync workers will be able
> > to drop their replication origins without any wait.
>
> There is code in the function replorigin_drop_guts() that duplicates
> the functionality introduced by replorigin_exists(). Can we reuse this
> function for the same?

Maybe we can use SearchSysCacheExists1 to check the existence instead
of adding a new function.

One comment about the patch:

@@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
+	/* Drop the replication origin if it has not been dropped already */
+	if (replorigin_exists(roident))
 		replorigin_drop_guts(rel, roident, nowait);

If the developer passes missing_ok as false, should we report an ERROR
here instead of silently returning?

Best Regards,
Hou zj
On Sat, Jan 28, 2023 at 9:36 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > IIRC, this is done to prevent concurrent drops of an origin, say via
> > > > the exposed API pg_replication_origin_drop(). See the discussion in
> > > > [1] related to it. If we want, we can optimize it so that we acquire
> > > > the lock on the specific origin, as mentioned in the comments in
> > > > replorigin_drop_by_name(), but it was not clear that this operation
> > > > would be frequent enough.
> > >
> > > Here is an attached patch to lock the replication origin record using
> > > LockSharedObject instead of locking the pg_replication_origin relation
> > > in ExclusiveLock mode. Now a tablesync worker will wait only if it is
> > > trying to drop the same replication origin that is concurrently being
> > > dropped by the apply worker; the other tablesync workers will be able
> > > to drop their replication origins without any wait.
> >
> > There is code in the function replorigin_drop_guts() that duplicates
> > the functionality introduced by replorigin_exists(). Can we reuse this
> > function for the same?
>
> Maybe we can use SearchSysCacheExists1 to check the existence instead
> of adding a new function.
>

Yeah, I think that would be better.

> One comment about the patch:
>
> @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
> +	/* Drop the replication origin if it has not been dropped already */
> +	if (replorigin_exists(roident))
> 		replorigin_drop_guts(rel, roident, nowait);
>
> If the developer passes missing_ok as false, should we report an ERROR
> here instead of silently returning?

One thing that looks a bit odd is that we will anyway have a similar
check in replorigin_drop_guts(), which is a static function called from
only one place; so will the check be required at both places?

--
With Regards,
Amit Kapila.
On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Jan 28, 2023 at 9:36 AM houzj.fnst@fujitsu.com
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Friday, January 27, 2023 8:16 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >
> > > > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > IIRC, this is done to prevent concurrent drops of an origin, say via
> > > > > the exposed API pg_replication_origin_drop(). See the discussion in
> > > > > [1] related to it. If we want, we can optimize it so that we acquire
> > > > > the lock on the specific origin, as mentioned in the comments in
> > > > > replorigin_drop_by_name(), but it was not clear that this operation
> > > > > would be frequent enough.
> > > >
> > > > Here is an attached patch to lock the replication origin record using
> > > > LockSharedObject instead of locking the pg_replication_origin relation
> > > > in ExclusiveLock mode. Now a tablesync worker will wait only if it is
> > > > trying to drop the same replication origin that is concurrently being
> > > > dropped by the apply worker; the other tablesync workers will be able
> > > > to drop their replication origins without any wait.
> > >
> > > There is code in the function replorigin_drop_guts() that duplicates
> > > the functionality introduced by replorigin_exists(). Can we reuse this
> > > function for the same?
> >
> > Maybe we can use SearchSysCacheExists1 to check the existence instead
> > of adding a new function.
>
> Yeah, I think that would be better.
>
> > One comment about the patch:
> >
> > @@ -430,23 +445,21 @@ replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> > ...
> > +	/* Drop the replication origin if it has not been dropped already */
> > +	if (replorigin_exists(roident))
> > 		replorigin_drop_guts(rel, roident, nowait);
> >
> > If the developer passes missing_ok as false, should we report an ERROR
> > here instead of silently returning?
>
> One thing that looks a bit odd is that we will anyway have a similar
> check in replorigin_drop_guts(), which is a static function called from
> only one place; so will the check be required at both places?

There is a possibility that the initial check to verify that the
replication origin exists in replorigin_drop_by_name succeeds, but that
one of the other processes (a tablesync worker or the apply worker)
then drops the replication origin; so it is better to check again
before calling replorigin_drop_guts. Ideally the tuple should still be
valid in replorigin_drop_guts, but can we keep the check as it is, to
maintain consistency before calling CatalogTupleDelete?

Regards,
Vignesh
On Fri, 27 Jan 2023 at 17:46, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of an origin, say via
> > > the exposed API pg_replication_origin_drop(). See the discussion in
> > > [1] related to it. If we want, we can optimize it so that we acquire
> > > the lock on the specific origin, as mentioned in the comments in
> > > replorigin_drop_by_name(), but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking the pg_replication_origin relation
> > in ExclusiveLock mode. Now a tablesync worker will wait only if it is
> > trying to drop the same replication origin that is concurrently being
> > dropped by the apply worker; the other tablesync workers will be able
> > to drop their replication origins without any wait.
>
> There is code in the function replorigin_drop_guts() that duplicates
> the functionality introduced by replorigin_exists(). Can we reuse this
> function for the same?
>
> Also, it would be good if you can share the numbers for different runs
> of "src/test/subscription/t/002_types.pl" before and after the patch.

Using only Tom Lane's fix, the execution time varies between 3.6 and
4.4 seconds. Using only the "change to LockSharedObject" fix, it varies
between 3.8 and 4.6 seconds. Using the combined fix (Tom Lane's fix
plus the change to LockSharedObject), it varies between 3.5 and 3.8
seconds. I feel both changes will be required, as the combination also
handles the scenario where both the apply worker and the tablesync
worker try to drop the same replication origin.

The execution results for the same:

With only Tom Lane's fix:
[12:25:32] t/002_types.pl .. ok 3604 ms ( 0.00 usr 0.00 sys + 2.26 cusr 0.37 csys = 2.63 CPU)
[12:25:48] t/002_types.pl .. ok 3788 ms ( 0.00 usr 0.00 sys + 2.24 cusr 0.39 csys = 2.63 CPU)
[12:26:01] t/002_types.pl .. ok 3783 ms ( 0.00 usr 0.00 sys + 2.42 cusr 0.37 csys = 2.79 CPU)
[12:26:14] t/002_types.pl .. ok 3845 ms ( 0.00 usr 0.00 sys + 2.38 cusr 0.44 csys = 2.82 CPU)
[12:26:29] t/002_types.pl .. ok 3923 ms ( 0.00 usr 0.00 sys + 2.54 cusr 0.39 csys = 2.93 CPU)
[12:26:42] t/002_types.pl .. ok 4416 ms ( 0.00 usr 0.00 sys + 2.73 cusr 0.48 csys = 3.21 CPU)
[12:26:55] t/002_types.pl .. ok 4310 ms ( 0.00 usr 0.00 sys + 2.62 cusr 0.39 csys = 3.01 CPU)
[12:27:09] t/002_types.pl .. ok 4168 ms ( 0.00 usr 0.00 sys + 2.67 cusr 0.46 csys = 3.13 CPU)
[12:27:21] t/002_types.pl .. ok 4167 ms ( 0.00 usr 0.00 sys + 2.46 cusr 0.53 csys = 2.99 CPU)
[12:27:34] t/002_types.pl .. ok 4144 ms ( 0.00 usr 0.00 sys + 2.59 cusr 0.41 csys = 3.00 CPU)
[12:27:46] t/002_types.pl .. ok 3982 ms ( 0.00 usr 0.00 sys + 2.52 cusr 0.41 csys = 2.93 CPU)
[12:28:03] t/002_types.pl .. ok 4190 ms ( 0.01 usr 0.00 sys + 2.67 cusr 0.46 csys = 3.14 CPU)

With only the "change to LockSharedObject" fix:
[12:33:02] t/002_types.pl .. ok 3815 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.38 csys = 2.68 CPU)
[12:33:16] t/002_types.pl .. ok 4295 ms ( 0.00 usr 0.00 sys + 2.66 cusr 0.42 csys = 3.08 CPU)
[12:33:31] t/002_types.pl .. ok 4270 ms ( 0.00 usr 0.00 sys + 2.72 cusr 0.44 csys = 3.16 CPU)
[12:33:44] t/002_types.pl .. ok 4460 ms ( 0.00 usr 0.00 sys + 2.78 cusr 0.45 csys = 3.23 CPU)
[12:33:58] t/002_types.pl .. ok 4340 ms ( 0.01 usr 0.00 sys + 2.67 cusr 0.45 csys = 3.13 CPU)
[12:34:11] t/002_types.pl .. ok 4142 ms ( 0.00 usr 0.00 sys + 2.58 cusr 0.42 csys = 3.00 CPU)
[12:34:24] t/002_types.pl .. ok 4459 ms ( 0.00 usr 0.00 sys + 2.76 cusr 0.49 csys = 3.25 CPU)
[12:34:38] t/002_types.pl .. ok 4427 ms ( 0.00 usr 0.00 sys + 2.68 cusr 0.48 csys = 3.16 CPU)
[12:35:10] t/002_types.pl .. ok 4642 ms ( 0.00 usr 0.00 sys + 2.84 cusr 0.55 csys = 3.39 CPU)
[12:35:22] t/002_types.pl .. ok 4047 ms ( 0.01 usr 0.00 sys + 2.49 cusr 0.46 csys = 2.96 CPU)
[12:35:32] t/002_types.pl .. ok 4505 ms ( 0.01 usr 0.00 sys + 2.90 cusr 0.45 csys = 3.36 CPU)
[12:36:03] t/002_types.pl .. ok 4088 ms ( 0.00 usr 0.00 sys + 2.51 cusr 0.42 csys = 2.93 CPU)

002_types with the combination of Tom Lane's and the "change to LockSharedObject" fix:
[10:22:04] t/002_types.pl .. ok 3730 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.41 csys = 2.71 CPU)
[10:23:40] t/002_types.pl .. ok 3666 ms ( 0.00 usr 0.00 sys + 2.16 cusr 0.42 csys = 2.58 CPU)
[10:23:31] t/002_types.pl .. ok 3665 ms ( 0.00 usr 0.00 sys + 2.31 cusr 0.40 csys = 2.71 CPU)
[10:23:23] t/002_types.pl .. ok 3500 ms ( 0.00 usr 0.00 sys + 2.20 cusr 0.36 csys = 2.56 CPU)
[10:23:14] t/002_types.pl .. ok 3704 ms ( 0.00 usr 0.00 sys + 2.36 cusr 0.35 csys = 2.71 CPU)
[10:23:05] t/002_types.pl .. ok 3594 ms ( 0.00 usr 0.00 sys + 2.32 cusr 0.31 csys = 2.63 CPU)
[10:24:10] t/002_types.pl .. ok 3702 ms ( 0.00 usr 0.00 sys + 2.27 cusr 0.42 csys = 2.69 CPU)
[10:24:22] t/002_types.pl .. ok 3741 ms ( 0.00 usr 0.00 sys + 2.39 cusr 0.36 csys = 2.75 CPU)
[10:24:38] t/002_types.pl .. ok 3676 ms ( 0.00 usr 0.00 sys + 2.28 cusr 0.43 csys = 2.71 CPU)
[10:24:50] t/002_types.pl .. ok 3843 ms ( 0.00 usr 0.00 sys + 2.36 cusr 0.43 csys = 2.79 CPU)
[10:25:03] t/002_types.pl .. ok 3710 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.36 csys = 2.66 CPU)
[10:25:12] t/002_types.pl .. ok 3695 ms ( 0.00 usr 0.00 sys + 2.34 cusr 0.35 csys = 2.69 CPU)

002_types on HEAD:
[10:31:05] t/002_types.pl .. ok 5687 ms ( 0.00 usr 0.00 sys + 2.35 cusr 0.45 csys = 2.80 CPU)
[10:31:31] t/002_types.pl .. ok 6815 ms ( 0.00 usr 0.00 sys + 2.61 cusr 0.43 csys = 3.04 CPU)
[10:31:47] t/002_types.pl .. ok 5561 ms ( 0.00 usr 0.00 sys + 2.24 cusr 0.47 csys = 2.71 CPU)
[10:32:05] t/002_types.pl .. ok 4542 ms ( 0.00 usr 0.00 sys + 2.27 cusr 0.39 csys = 2.66 CPU)
[10:32:20] t/002_types.pl .. ok 3663 ms ( 0.00 usr 0.00 sys + 2.30 cusr 0.38 csys = 2.68 CPU)
[10:32:33] t/002_types.pl .. ok 3627 ms ( 0.00 usr 0.00 sys + 2.27 cusr 0.32 csys = 2.59 CPU)
[10:32:45] t/002_types.pl .. ok 3808 ms ( 0.00 usr 0.00 sys + 2.41 cusr 0.39 csys = 2.80 CPU)
[10:32:59] t/002_types.pl .. ok 4536 ms ( 0.00 usr 0.00 sys + 2.24 cusr 0.38 csys = 2.62 CPU)
[10:33:13] t/002_types.pl .. ok 3638 ms ( 0.00 usr 0.00 sys + 2.25 cusr 0.41 csys = 2.66 CPU)
[10:33:35] t/002_types.pl .. ok 4796 ms ( 0.00 usr 0.00 sys + 2.38 cusr 0.38 csys = 2.76 CPU)
[10:33:51] t/002_types.pl .. ok 4695 ms ( 0.00 usr 0.00 sys + 2.40 cusr 0.37 csys = 2.77 CPU)
[10:34:06] t/002_types.pl .. ok 5738 ms ( 0.00 usr 0.00 sys + 2.44 cusr 0.43 csys = 2.87 CPU)

Regards,
Vignesh
On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > One thing that looks a bit odd is that we will anyway have a similar
> > check in replorigin_drop_guts(), which is a static function called
> > from only one place; so will the check be required at both places?
>
> There is a possibility that the initial check to verify that the
> replication origin exists in replorigin_drop_by_name succeeds, but
> that one of the other processes (a tablesync worker or the apply
> worker) then drops the replication origin,
>

Won't locking the particular origin prevent concurrent drops? IIUC,
the drop happens after the patch acquires the lock on the origin.

--
With Regards,
Amit Kapila.
On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > One thing that looks a bit odd is that we will anyway have a similar
> > > check in replorigin_drop_guts(), which is a static function called
> > > from only one place; so will the check be required at both places?
> >
> > There is a possibility that the initial check to verify that the
> > replication origin exists in replorigin_drop_by_name succeeds, but
> > that one of the other processes (a tablesync worker or the apply
> > worker) then drops the replication origin,
>
> Won't locking the particular origin prevent concurrent drops? IIUC,
> the drop happens after the patch acquires the lock on the origin.

Yes, I think the existence check in replorigin_drop_guts() is
unnecessary, as we already lock the origin before that. The check in
replorigin_drop_guts() is a custom check after calling SearchSysCache1
to get the tuple, but the error should not happen, as no concurrent
drop can be performed.

To make it simpler, one idea is to move the code that gets the tuple
from the system cache into replorigin_drop_by_name(). After locking the
origin, we can try to get the tuple and do the existence check, and we
can reuse this tuple to perform the origin delete. In this approach we
only need to check origin existence once, after locking. BTW, if we do
this, then we'd better rename replorigin_drop_guts() to something like
replorigin_state_clear(), as the function only clears the in-memory
information after that.

The code could be like:

-------
replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
...
	/*
	 * Lock the origin to prevent concurrent drops. We keep the lock until
	 * the end of transaction.
	 */
	LockSharedObject(ReplicationOriginRelationId, roident, 0,
					 AccessExclusiveLock);

	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
	if (!HeapTupleIsValid(tuple))
	{
		if (!missing_ok)
			elog(ERROR, "cache lookup failed for replication origin with ID %d",
				 roident);

		return;
	}

	replorigin_state_clear(rel, roident, nowait);

	/*
	 * Now, we can delete the catalog entry.
	 */
	CatalogTupleDelete(rel, &tuple->t_self);
	ReleaseSysCache(tuple);

	CommandCounterIncrement();
...
-------

Best Regards,
Hou zj
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > >
> > > > One thing that looks a bit odd is that we will anyway have a similar
> > > > check in replorigin_drop_guts(), which is a static function called
> > > > from only one place; so will the check be required at both places?
> > >
> > > There is a possibility that the initial check to verify that the
> > > replication origin exists in replorigin_drop_by_name succeeds, but
> > > that one of the other processes (a tablesync worker or the apply
> > > worker) then drops the replication origin,
> >
> > Won't locking the particular origin prevent concurrent drops? IIUC,
> > the drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts() is
> unnecessary, as we already lock the origin before that. The check in
> replorigin_drop_guts() is a custom check after calling SearchSysCache1
> to get the tuple, but the error should not happen, as no concurrent
> drop can be performed.

This scenario is possible while creating a subscription: the apply
worker will try to drop the replication origin if the state is
SUBREL_STATE_SYNCDONE. The tablesync worker will set the state to
SUBREL_STATE_SYNCDONE and update the relation state before calling
replorigin_drop_by_name. Since the transaction is committed by the
tablesync worker, the state is visible to the apply worker, and the
apply worker will now try to drop the replication origin in parallel.

There is a race condition in this case: one of the two processes (the
tablesync worker or the apply worker) will acquire the lock and drop
the replication origin; the other process will get the lock only after
the first one has dropped the origin and committed its transaction.
Once the other process acquires the lock, it will try to drop the
replication origin and will get the error (from replorigin_drop_guts):
"cache lookup failed for replication origin with ID". So a concurrent
drop is possible in this case.

Regards,
Vignesh
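The interleaving described in that race can be replayed deterministically
in a single thread. The sketch below is only a model with invented names
(a boolean standing in for the catalog row), contrasting the two shapes
discussed in this thread: trusting the pre-lock existence check versus
rechecking existence after the lock is acquired:

```c
/*
 * Replay of the check-then-drop race.  Both workers see the origin row
 * before locking; the first to get the lock deletes it.  What the second
 * worker gets depends on whether it rechecks existence under the lock
 * (a clean "missing" outcome) or trusts its stale pre-lock check (the
 * "cache lookup failed" ERROR path).
 */
#include <stdbool.h>

static bool origin_row = true;  /* the pg_replication_origin row, in miniature */

enum drop_outcome { DROPPED, MISSING, CACHE_LOOKUP_ERROR };

void reset_origin(void)
{
    origin_row = true;
}

enum drop_outcome drop_attempt(bool saw_row_before_lock,
                               bool recheck_under_lock)
{
    /* ...the per-origin lock is acquired at this point in the real code... */
    bool row_now = origin_row;

    if (recheck_under_lock ? !row_now : !saw_row_before_lock)
        return MISSING;              /* missing_ok-style early return */
    if (!row_now)
        return CACHE_LOOKUP_ERROR;   /* delete finds no tuple: elog(ERROR) */
    origin_row = false;              /* CatalogTupleDelete, in miniature */
    return DROPPED;
}
```

The first worker always succeeds; the second, arriving with a stale
pre-lock check, errors out unless it rechecks under the lock, which is
why the existence test has to live after the lock acquisition.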
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Won't locking the particular origin prevent concurrent drops? IIUC,
> > the drop happens after the patch acquires the lock on the origin.
>
> Yes, I think the existence check in replorigin_drop_guts() is
> unnecessary, as we already lock the origin before that. The check in
> replorigin_drop_guts() is a custom check after calling SearchSysCache1
> to get the tuple, but the error should not happen, as no concurrent
> drop can be performed.
>
> To make it simpler, one idea is to move the code that gets the tuple
> from the system cache into replorigin_drop_by_name(). After locking the
> origin, we can try to get the tuple and do the existence check, and we
> can reuse this tuple to perform the origin delete. In this approach we
> only need to check origin existence once, after locking. BTW, if we do
> this, then we'd better rename replorigin_drop_guts() to something like
> replorigin_state_clear(), as the function only clears the in-memory
> information after that.
>
> The code could be like:
>
> -------
> replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait)
> ...
> 	/*
> 	 * Lock the origin to prevent concurrent drops. We keep the lock until
> 	 * the end of transaction.
> 	 */
> 	LockSharedObject(ReplicationOriginRelationId, roident, 0,
> 					 AccessExclusiveLock);
>
> 	tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident));
> 	if (!HeapTupleIsValid(tuple))
> 	{
> 		if (!missing_ok)
> 			elog(ERROR, "cache lookup failed for replication origin with ID %d",
> 				 roident);
>
> 		return;
> 	}
>
> 	replorigin_state_clear(rel, roident, nowait);
>
> 	/*
> 	 * Now, we can delete the catalog entry.
> 	 */
> 	CatalogTupleDelete(rel, &tuple->t_self);
> 	ReleaseSysCache(tuple);
>
> 	CommandCounterIncrement();
> ...
> -------

+1 for this change, as it removes the redundant check. I will post an
updated version for this.

Regards,
Vignesh
On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > One thing that looks a bit odd is that we will anyway have a similar > > > > check in replorigin_drop_guts() which is a static function and > > > > called from only one place, so, will it be required to check at both places? > > > > > > There is a possibility that the initial check to verify if replication > > > origin exists in replorigin_drop_by_name was successful but later one > > > of either table sync worker or apply worker process might have dropped > > > the replication origin, > > > > > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the > > drop happens after the patch acquires the lock on the origin. > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we > already lock the origin before that. I think the check in replorigin_drop_guts > is a custom check after calling SearchSysCache1 to get the tuple, but the error > should not happen as no concurrent drop can be performed. > > To make it simpler, one idea is to move the code that getting the tuple from > system cache to the replorigin_drop_by_name(). After locking the origin, we > can try to get the tuple and do the existence check, and we can reuse > this tuple to perform origin delete. In this approach we only need to check > origin existence once after locking. BTW, if we do this, then we'd better rename the > replorigin_drop_guts() to something like replorigin_state_clear() as the function > only clear the in-memory information after that. > > The code could be like: > > ------- > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > ... 
> /* > * Lock the origin to prevent concurrent drops. We keep the lock until the > * end of transaction. > */ > LockSharedObject(ReplicationOriginRelationId, roident, 0, > AccessExclusiveLock); > > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); > if (!HeapTupleIsValid(tuple)) > { > if (!missing_ok) > elog(ERROR, "cache lookup failed for replication origin with ID %d", > roident); > > return; > } > > replorigin_state_clear(rel, roident, nowait); > > /* > * Now, we can delete the catalog entry. > */ > CatalogTupleDelete(rel, &tuple->t_self); > ReleaseSysCache(tuple); > > CommandCounterIncrement(); The attached updated patch has the changes to handle the same. Regards, Vignesh
Attachments
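The restructuring proposed in the message above (lock the origin first, then do a single existence check, then clear state and delete the catalog row) can be sketched outside the server. The following is a stand-alone C model with a toy in-memory catalog; the names and the "lock" are stand-ins for the real syscache and lock manager, not the actual PostgreSQL code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAX_ORIGINS 4

/* Toy stand-in for a pg_replication_origin catalog row. */
typedef struct
{
    char name[32];
    bool valid;                 /* false once the row has been deleted */
} OriginRow;

static OriginRow origins[MAX_ORIGINS];

static OriginRow *
origin_lookup(const char *name)
{
    for (int i = 0; i < MAX_ORIGINS; i++)
        if (origins[i].valid && strcmp(origins[i].name, name) == 0)
            return &origins[i];
    return NULL;
}

/*
 * Sketch of the proposed replorigin_drop_by_name() shape: the object
 * lock (LockSharedObject(..., AccessExclusiveLock) in the real code)
 * is taken first, so one existence check after locking suffices --
 * no concurrent drop can slip in between the check and the delete.
 * Returns true iff a row was actually dropped.
 */
static bool
origin_drop_by_name(const char *name, bool missing_ok)
{
    OriginRow *row;

    /* lock would be acquired here, before the only lookup */
    row = origin_lookup(name);
    if (row == NULL)
    {
        if (!missing_ok)
            fprintf(stderr, "replication origin \"%s\" does not exist\n",
                    name);
        return false;
    }

    /* replorigin_state_clear() + CatalogTupleDelete() stand-in */
    row->valid = false;
    return true;
}
```

A second drop of the same origin with missing_ok set is then a silent no-op, which is the behavior the thread relies on when two workers race to drop the same origin.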
On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > On Monday, January 30, 2023 2:32 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> wrote: > > > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > One thing that looks a bit odd is that we will anyway have a similar > > > > > check in replorigin_drop_guts() which is a static function and > > > > > called from only one place, so, will it be required to check at both places? > > > > > > > > There is a possibility that the initial check to verify if replication > > > > origin exists in replorigin_drop_by_name was successful but later one > > > > of either table sync worker or apply worker process might have dropped > > > > the replication origin, > > > > > > > > > > Won't locking on the particular origin prevent concurrent drops? IIUC, the > > > drop happens after the patch acquires the lock on the origin. > > > > Yes, I think the existence check in replorigin_drop_guts is unnecessary as we > > already lock the origin before that. I think the check in replorigin_drop_guts > > is a custom check after calling SearchSysCache1 to get the tuple, but the error > > should not happen as no concurrent drop can be performed. > > > > To make it simpler, one idea is to move the code that getting the tuple from > > system cache to the replorigin_drop_by_name(). After locking the origin, we > > can try to get the tuple and do the existence check, and we can reuse > > this tuple to perform origin delete. In this approach we only need to check > > origin existence once after locking. BTW, if we do this, then we'd better rename the > > replorigin_drop_guts() to something like replorigin_state_clear() as the function > > only clear the in-memory information after that. 
> > > > The code could be like: > > > > ------- > > replorigin_drop_by_name(const char *name, bool missing_ok, bool nowait) > > ... > > /* > > * Lock the origin to prevent concurrent drops. We keep the lock until the > > * end of transaction. > > */ > > LockSharedObject(ReplicationOriginRelationId, roident, 0, > > AccessExclusiveLock); > > > > tuple = SearchSysCache1(REPLORIGIDENT, ObjectIdGetDatum(roident)); > > if (!HeapTupleIsValid(tuple)) > > { > > if (!missing_ok) > > elog(ERROR, "cache lookup failed for replication origin with ID %d", > > roident); > > > > return; > > } > > > > replorigin_state_clear(rel, roident, nowait); > > > > /* > > * Now, we can delete the catalog entry. > > */ > > CatalogTupleDelete(rel, &tuple->t_self); > > ReleaseSysCache(tuple); > > > > CommandCounterIncrement(); > > The attached updated patch has the changes to handle the same. I had not merged one of the local changes that was present, please find the updated patch including that change. Sorry for missing that change. Regards, Vignesh
Attachments
On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote: > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote: > > > > On Mon, 30 Jan 2023 at 13:00, houzj.fnst@fujitsu.com > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Monday, January 30, 2023 2:32 PM Amit Kapila > <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Jan 30, 2023 at 9:20 AM vignesh C <vignesh21@gmail.com> > wrote: > > > > > > > > > > On Sat, 28 Jan 2023 at 11:26, Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > > > > > > > > > One thing that looks a bit odd is that we will anyway have a > > > > > > similar check in replorigin_drop_guts() which is a static > > > > > > function and called from only one place, so, will it be required to > check at both places? > > > > > > > > > > There is a possibility that the initial check to verify if > > > > > replication origin exists in replorigin_drop_by_name was > > > > > successful but later one of either table sync worker or apply > > > > > worker process might have dropped the replication origin, > > > > > > > > > > > > > Won't locking on the particular origin prevent concurrent drops? > > > > IIUC, the drop happens after the patch acquires the lock on the origin. > > > > > > Yes, I think the existence check in replorigin_drop_guts is > > > unnecessary as we already lock the origin before that. I think the > > > check in replorigin_drop_guts is a custom check after calling > > > SearchSysCache1 to get the tuple, but the error should not happen as no > concurrent drop can be performed. > > > > > > To make it simpler, one idea is to move the code that getting the > > > tuple from system cache to the replorigin_drop_by_name(). After > > > locking the origin, we can try to get the tuple and do the existence > > > check, and we can reuse this tuple to perform origin delete. In this > > > approach we only need to check origin existence once after locking. 
> > > BTW, if we do this, then we'd better rename the > > > replorigin_drop_guts() to something like replorigin_state_clear() as > > > the function only clear the in-memory information after that. > > > > > > > The attached updated patch has the changes to handle the same. > > I had not merged one of the local changes that was present, please find the > updated patch including that change. Sorry for missing that change. >

I also tried to measure the time of "src/test/subscription/t/002_types.pl" before and after the patch (which changes the lock level) and Tom's patch (which splits the transaction), like what Vignesh has shared on -hackers. I ran each case about 100 times. Tom's patch and the lock level patch behave similarly on my machine[1]:

HEAD: 3426 ~ 6425 ms
HEAD + Tom: 3404 ~ 3462 ms
HEAD + Vignesh: 3419 ~ 3474 ms
HEAD + Tom + Vignesh: 3408 ~ 3454 ms

Apart from the reduction in test time, lowering the lock level and locking only the specific object also reduces lock contention, from which users of the exposed function as well as the table sync worker and apply worker can benefit. So, I think pushing the patch to change the lock level makes sense. And the patch looks good to me.

While on it, after pushing the patch, I think there is another case that might also be worth improving: the table sync worker and the apply worker can try to drop the same origin, which might cause some delay. This is a different case from the deadlock, so I feel we can try to improve it in a separate patch.

[1] CentOS 8.2, 128G RAM, 40 processors Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz

Best Regards, Hou zj
On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote: > > > > I also tried to test the time of "src/test/subscription/t/002_types.pl" > before and after the patch(change the lock level) and Tom's patch(split > transaction) like what Vignesh has shared on -hackers. > > I run about 100 times for each case. Tom's and the lock level patch > behave similarly on my machines[1]. > > HEAD: 3426 ~ 6425 ms > HEAD + Tom: 3404 ~ 3462 ms > HEAD + Vignesh: 3419 ~ 3474 ms > HEAD + Tom + Vignesh: 3408 ~ 3454 ms > > Even apart from the testing time reduction, reducing the lock level and lock > the specific object can also help improve the lock contention which user(that > use the exposed function) , table sync worker and apply worker can also benefit > from it. So, I think pushing the patch to change the lock level makes sense. > > And the patch looks good to me. > Thanks for the tests. I also see a reduction in test time variability with Vignesh's patch. I think we can release the locks in case the origin is concurrently dropped as in the attached patch. I am planning to commit this patch tomorrow unless there are more comments or objections. > While on it, after pushing the patch, I think there is another case might also > worth to be improved, that is the table sync and apply worker try to drop the > same origin which might cause some delay. This is another case(different from > the deadlock), so I feel we can try to improve this in another patch. > Right, I think that case could be addressed by Tom's patch to some extent but I am thinking we should also try to analyze if we can completely avoid the need to remove origins from both processes. 
One idea could be to introduce another relstate something like PRE_SYNCDONE and set it in a separate transaction before we set the state as SYNCDONE and remove the slot and origin in tablesync worker. Now, if the tablesync worker errors out due to some reason during the second transaction, it can remove the slot and origin after restart by checking the state. However, it would add another relstate which may not be the best way to address this problem. Anyway, that can be accomplished as a separate patch. -- With Regards, Amit Kapila.
Attachments
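The extra-relstate idea sketched in the message above (commit an intermediate state before dropping the slot and origin, so a restarted tablesync worker can redo just the cleanup) can be modeled as a small state machine. This is a stand-alone C sketch; the state names and transitions are assumptions based on the description, not the actual tablesync code:

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
    ST_INIT,
    ST_DATASYNC,
    ST_FINISHEDCOPY,
    ST_PRE_SYNCDONE,            /* data synced; slot/origin cleanup pending */
    ST_SYNCDONE,
    ST_READY
} RelState;

typedef struct
{
    RelState state;             /* committed relstate */
    bool     slot_exists;
    bool     origin_exists;
} SubRel;

/*
 * Second transaction of the sync phase: drop slot and origin, then
 * advance to SYNCDONE.  May "error out" before committing, in which
 * case the committed state is left at ST_PRE_SYNCDONE.
 */
static void
finish_sync(SubRel *rel, bool crash_before_commit)
{
    assert(rel->state == ST_PRE_SYNCDONE);
    if (crash_before_commit)
        return;                     /* worker died; nothing committed */
    rel->slot_exists = false;       /* drop tablesync slot */
    rel->origin_exists = false;     /* replorigin_drop_by_name() stand-in */
    rel->state = ST_SYNCDONE;
}

/*
 * On (re)start, the worker checks the committed state and only redoes
 * whatever the previous attempt left behind -- the key property of
 * the proposal: the apply worker never has to drop the origin.
 */
static void
tablesync_worker(SubRel *rel, bool crash_before_commit)
{
    if (rel->state <= ST_FINISHEDCOPY)
        rel->state = ST_PRE_SYNCDONE;   /* copy and catch-up elided */
    finish_sync(rel, crash_before_commit);
}
```

Running one crashed attempt followed by a restart shows the cleanup being redone by the tablesync worker itself, so the origin drop never has to happen in the apply worker.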
On Thu, Feb 2, 2023 at 4:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > Thanks for the tests. I also see a reduction in test time variability > with Vignesh's patch. I think we can release the locks in case the > origin is concurrently dropped as in the attached patch. I am planning > to commit this patch tomorrow unless there are more comments or > objections. > Pushed. -- With Regards, Amit Kapila.
On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com > <houzj.fnst@fujitsu.com> wrote: > > > > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> > wrote: > > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote: > > > > > > > I also tried to test the time of "src/test/subscription/t/002_types.pl" > > before and after the patch(change the lock level) and Tom's > > patch(split > > transaction) like what Vignesh has shared on -hackers. > > > > I run about 100 times for each case. Tom's and the lock level patch > > behave similarly on my machines[1]. > > > > HEAD: 3426 ~ 6425 ms > > HEAD + Tom: 3404 ~ 3462 ms > > HEAD + Vignesh: 3419 ~ 3474 ms > > HEAD + Tom + Vignesh: 3408 ~ 3454 ms > > > > Even apart from the testing time reduction, reducing the lock level > > and lock the specific object can also help improve the lock contention > > which user(that use the exposed function) , table sync worker and > > apply worker can also benefit from it. So, I think pushing the patch to change > the lock level makes sense. > > > > And the patch looks good to me. > > > > Thanks for the tests. I also see a reduction in test time variability with Vignesh's > patch. I think we can release the locks in case the origin is concurrently > dropped as in the attached patch. I am planning to commit this patch > tomorrow unless there are more comments or objections. > > > While on it, after pushing the patch, I think there is another case > > might also worth to be improved, that is the table sync and apply > > worker try to drop the same origin which might cause some delay. This > > is another case(different from the deadlock), so I feel we can try to improve > this in another patch. 
> > > > Right, I think that case could be addressed by Tom's patch to some extent but > I am thinking we should also try to analyze if we can completely avoid the need > to remove origins from both processes. One idea could be to introduce > another relstate something like PRE_SYNCDONE and set it in a separate > transaction before we set the state as SYNCDONE and remove the slot and > origin in tablesync worker. > Now, if the tablesync worker errors out due to some reason during the second > transaction, it can remove the slot and origin after restart by checking the state. > However, it would add another relstate which may not be the best way to > address this problem. Anyway, that can be accomplished as a separate patch.

Here is an attempt to achieve the same. Basically, the patch removes the code that drops the origin in the apply worker, and adds a new state PRE_SYNCDONE that is set after synchronization has finished in front of the apply (sublsn set), but before dropping the origin and the other final cleanups. The tablesync worker will restart and redo the cleanup if it failed after reaching the new state. Besides, since changes can already be applied to the table in the PRE_SYNCDONE state, I also modified the check in should_apply_changes_for_rel(). Some other conditions for the origin drop in the subscription commands were adjusted in this patch as well.

Best Regards, Hou zj
Attachments
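The should_apply_changes_for_rel() tweak described in the patch summary above amounts to treating the new intermediate state like SYNCDONE for the purpose of applying changes. A rough stand-alone model follows; the real function takes a LogicalRepRelMapEntry and global LSN state, so the signature below is an assumption for illustration only:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum
{
    ST_INIT,
    ST_DATASYNC,
    ST_FINISHEDCOPY,
    ST_PRE_SYNCDONE,
    ST_SYNCDONE,
    ST_READY
} RelState;

/*
 * Should the apply worker replicate a change for this relation?
 * READY: always.  In the two "sync finished" states: only for commits
 * past the point where the tablesync worker stopped (statelsn), since
 * earlier changes were already applied by the tablesync worker.
 * Earlier states: never -- the tablesync worker owns the table.
 */
static bool
should_apply_changes_for_rel(RelState state, uint64_t statelsn,
                             uint64_t remote_final_lsn)
{
    if (state == ST_READY)
        return true;
    if (state == ST_PRE_SYNCDONE || state == ST_SYNCDONE)
        return remote_final_lsn > statelsn;
    return false;
}
```

The point of the change is the ST_PRE_SYNCDONE arm: once sublsn is committed, the apply worker can take over applying changes even though the tablesync worker's slot and origin have not been dropped yet.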
On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > ... > > Right, I think that case could be addressed by Tom's patch to some extent but > > I am thinking we should also try to analyze if we can completely avoid the need > > to remove origins from both processes. One idea could be to introduce > > another relstate something like PRE_SYNCDONE and set it in a separate > > transaction before we set the state as SYNCDONE and remove the slot and > > origin in tablesync worker. > > Now, if the tablesync worker errors out due to some reason during the second > > transaction, it can remove the slot and origin after restart by checking the state. > > However, it would add another relstate which may not be the best way to > > address this problem. Anyway, that can be accomplished as a separate patch. > > Here is an attempt to achieve the same. > Basically, the patch removes the code that drop the origin in apply worker. And > add a new state PRE_SYNCDONE after synchronization finished in front of apply > (sublsn set), but before dropping the origin and other final cleanups. The > tablesync will restart and redo the cleanup if it failed after reaching the new > state. Besides, since the changes can already be applied on the table in > PRE_SYNCDONE state, so I also modified the check in > should_apply_changes_for_rel(). And some other conditions for the origin drop > in subscription commands are were adjusted in this patch. > Here are some review comments for the 0001 patch ====== General Comment 0. The idea of using the extra relstate for clean-up seems OK, but the implementation of the new state in this patch appears misordered and misnamed to me. The state name should indicate what it is doing (PRE_SYNCDONE is meaningless). The patch describes in several places that this state means "synchronized, but not yet cleaned up" therefore IMO it means the SYNCDONE state should be *before* this new state. 
And since this new state is for "cleanup" then let's call it something like that. To summarize, I don’t think the meaning of SYNCDONE should be touched. SYNCDONE means the synchronization is done, same as before. And your new "cleanup" state belongs directly *after* that. IMO it should be like this: 1. STATE_INIT 2. STATE_DATASYNC 3. STATE_FINISHEDCOPY 4. STATE_SYNCDONE 5. STATE_CLEANUP <-- new relstate 6. STATE_READY Of course, this is going to impact almost every aspect of the patch, but I think everything will be basically the same as you have it now -- only all the state names and comments need to be adjusted according to the above. ====== Commit Message 1. Previously, we allowed the apply worker to drop the origin to avoid the case that the tablesync worker fails to the origin(due to crash). In this case we don't restart the tablesync worker, and the apply worker can clean the origin. ~ There seem to be some words missing in this paragraph. SUGGESTION Previously, we allowed the apply worker to drop the origin as a way to recover from the scenario where the tablesync worker failed to drop it (due to crash). ~~~ 2. To improve this, we introduce a new relstate SUBREL_STATE_PRE_SYNCDONE which will be set after synchronization finished in front of apply (sublsn set), but before dropping the origin and other final cleanups. The apply worker will restart tablesync worker if the relstate is SUBREL_STATE_PRE_SYNCDONE. This way, even if the tablesync worker error out in the transaction that tries to drop the origin, the apply worker will restart the tablesync worker to redo the cleanup(for origin and other stuff) and then directly exit. ~ 2a. This is going to be impacted by my "General Comment". Notice how you describe again "will be set after synchronization finished". Evidence again this means the new CLEANUP state should directly follow the SYNCDONE state. 2b. "error out” --> "encounters an error" 2c. 
"cleanup(for origin" --> space before the "(" ====== doc/src/sgml/catalogs.sgml 3. @@ -8071,7 +8071,8 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l <literal>i</literal> = initialize, <literal>d</literal> = data is being copied, <literal>f</literal> = finished table copy, - <literal>s</literal> = synchronized, + <literal>p</literal> = synchronized but not yet cleaned up, + <literal>s</literal> = synchronization done, <literal>r</literal> = ready (normal replication) </para></entry> </row> @@ -8082,8 +8083,8 @@ SCRAM-SHA-256$<replaceable><iteration count></replaceable>:<replaceable>&l </para> <para> Remote LSN of the state change used for synchronization coordination - when in <literal>s</literal> or <literal>r</literal> states, - otherwise null + when in <literal>p</literal>, <literal>s</literal> or + <literal>r</literal> states, otherwise null </para></entry> </row> </tbody> This state order and choice of the letter are impacted by my "General Comment". IMO it should be more like this: State code: i = initialize, d = data is being copied, f = finished table copy, s = synchronization done, c = clean-up done, r = ready (normal replication) ====== src/backend/commands/subscriptioncmds.c 4. AlterSubscription_refresh Some adjustments are needed according to my "General Comment". ~~~ 5. DropSubscription Some adjustments are needed according to my "General Comment". ====== src/backend/replication/logical/tablesync.c 6. + * Update the state of the table to SUBREL_STATE_SYNCDONE and cleanup the + * tablesync slot and drop the tablesync's origin tracking. + */ +static void +finish_synchronization(bool restart_after_crash) 6a. Suggest calling this function something like 'cleanup_after_synchronization' ~ 6b. Some adjustments to states and comments are needed according to my "General Comment". ~~~ 7. 
process_syncing_tables_for_sync - MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCDONE; + MyLogicalRepWorker->relstate = SUBREL_STATE_PRE_SYNCDONE; MyLogicalRepWorker->relstate_lsn = current_lsn; This should just be setting SUBREL_STATE_SYNCDONE how it previously did. Other states/comments in this function to change according to my "General Comments". ~ 8. if (rstate->state == SUBREL_STATE_SYNCDONE) { /* * Apply has caught up to the position where the table sync has * finished. Mark the table as ready so that the apply will just * continue to replicate it normally. */ That should now be checking for SUBREL_STATE_CLEANUPDONE according to my "General Comment" ~~~ 9. process_syncing_tables_for_apply Some adjustments to states and comments are needed according to my "General Comment". ~~~ 10. LogicalRepSyncTableStart Some adjustments to states and comments are needed according to my "General Comment". ====== src/backend/replication/logical/worker.c 11. should_apply_changes_for_rel Some adjustments to states according to my "General Comment". ====== src/include/catalog/pg_subscription_rel.h 12. @@ -62,8 +62,10 @@ DECLARE_UNIQUE_INDEX_PKEY(pg_subscription_rel_srrelid_srsubid_index, 6117, Subsc * NULL) */ #define SUBREL_STATE_FINISHEDCOPY 'f' /* tablesync copy phase is completed * (sublsn NULL) */ -#define SUBREL_STATE_SYNCDONE 's' /* synchronization finished in front of - * apply (sublsn set) */ +#define SUBREL_STATE_PRE_SYNCDONE 'p' /* synchronization finished in front of + * apply (sublsn set), but the final + * cleanup has not yet been performed */ +#define SUBREL_STATE_SYNCDONE 's' /* synchronization complete */ #define SUBREL_STATE_READY 'r' /* ready (sublsn set) */ Some adjustments to states and comments are needed according to my "General Comment". ------ Kind Regards, Peter Smith. Fujitsu Australia
On Tuesday, February 7, 2023 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote: > On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > wrote: > > > ... > > > Right, I think that case could be addressed by Tom's patch to some > > > extent but I am thinking we should also try to analyze if we can > > > completely avoid the need to remove origins from both processes. One > > > idea could be to introduce another relstate something like > > > PRE_SYNCDONE and set it in a separate transaction before we set the > > > state as SYNCDONE and remove the slot and origin in tablesync worker. > > > Now, if the tablesync worker errors out due to some reason during > > > the second transaction, it can remove the slot and origin after restart by > checking the state. > > > However, it would add another relstate which may not be the best way > > > to address this problem. Anyway, that can be accomplished as a separate > patch. > > > > Here is an attempt to achieve the same. > > Basically, the patch removes the code that drop the origin in apply > > worker. And add a new state PRE_SYNCDONE after synchronization > > finished in front of apply (sublsn set), but before dropping the > > origin and other final cleanups. The tablesync will restart and redo > > the cleanup if it failed after reaching the new state. Besides, since > > the changes can already be applied on the table in PRE_SYNCDONE state, > > so I also modified the check in should_apply_changes_for_rel(). And > > some other conditions for the origin drop in subscription commands are > were adjusted in this patch. > > > > Here are some review comments for the 0001 patch > > ====== > General Comment > > 0. > The idea of using the extra relstate for clean-up seems OK, but the > implementation of the new state in this patch appears misordered and > misnamed to me. > > The state name should indicate what it is doing (PRE_SYNCDONE is > meaningless). 
The patch describes in several places that this state means > "synchronized, but not yet cleaned up" therefore IMO it means the SYNCDONE > state should be *before* this new state. And since this new state is for > "cleanup" then let's call it something like that. > > To summarize, I don’t think the meaning of SYNCDONE should be touched. > SYNCDONE means the synchronization is done, same as before. And your new > "cleanup" state belongs directly *after* that. IMO it should be like this: > > 1. STATE_INIT > 2. STATE_DATASYNC > 3. STATE_FINISHEDCOPY > 4. STATE_SYNCDONE > 5. STATE_CLEANUP <-- new relstate > 6. STATE_READY > > Of course, this is going to impact almost every aspect of the patch, but I think > everything will be basically the same as you have it now > -- only all the state names and comments need to be adjusted according to the > above.

Although I agree that CLEANUP is easier to understand, I am a bit concerned that the changes would be a bit invasive.

If we add a CLEANUP state at the end as suggested, it will change the meaning of the existing SYNCDONE state: before the change it means both the data sync and the cleanup are done, but after the change it only means the data sync is over. This also means all the current C code that treats SYNCDONE as the final state of table sync would need to be changed. Moreover, it's common for users to query the relation state from pg_subscription_rel to identify whether the table sync of a table is finished (e.g. check relstate IN ('r', 's')), but if we add a new state (CLEANUP) as the final state, then all such SQL would need to be changed to check something like relstate IN ('r', 'x'(new cleanup state)).

Best Regards, Hou zj
On Tue, Feb 7, 2023 at 6:46 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> wrote: > > On Tuesday, February 7, 2023 12:12 PM Peter Smith <smithpb2250@gmail.com> wrote: > > On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com> > > wrote: > > > > > ... > > > > Right, I think that case could be addressed by Tom's patch to some > > > > extent but I am thinking we should also try to analyze if we can > > > > completely avoid the need to remove origins from both processes. One > > > > idea could be to introduce another relstate something like > > > > PRE_SYNCDONE and set it in a separate transaction before we set the > > > > state as SYNCDONE and remove the slot and origin in tablesync worker. > > > > Now, if the tablesync worker errors out due to some reason during > > > > the second transaction, it can remove the slot and origin after restart by > > checking the state. > > > > However, it would add another relstate which may not be the best way > > > > to address this problem. Anyway, that can be accomplished as a separate > > patch. > > > > > > Here is an attempt to achieve the same. > > > Basically, the patch removes the code that drop the origin in apply > > > worker. And add a new state PRE_SYNCDONE after synchronization > > > finished in front of apply (sublsn set), but before dropping the > > > origin and other final cleanups. The tablesync will restart and redo > > > the cleanup if it failed after reaching the new state. Besides, since > > > the changes can already be applied on the table in PRE_SYNCDONE state, > > > so I also modified the check in should_apply_changes_for_rel(). And > > > some other conditions for the origin drop in subscription commands are > > were adjusted in this patch. > > > > > > > Here are some review comments for the 0001 patch > > > > ====== > > General Comment > > > > 0. 
> > The idea of using the extra relstate for clean-up seems OK, but the
> > implementation of the new state in this patch appears misordered and
> > misnamed to me.
> >
> > The state name should indicate what it is doing (PRE_SYNCDONE is
> > meaningless). The patch describes in several places that this state means
> > "synchronized, but not yet cleaned up", therefore IMO the SYNCDONE state
> > should be *before* this new state. And since this new state is for
> > "cleanup", let's call it something like that.
> >
> > To summarize, I don't think the meaning of SYNCDONE should be touched.
> > SYNCDONE means the synchronization is done, same as before. And your new
> > "cleanup" state belongs directly *after* that. IMO it should be like this:
> >
> > 1. STATE_INIT
> > 2. STATE_DATASYNC
> > 3. STATE_FINISHEDCOPY
> > 4. STATE_SYNCDONE
> > 5. STATE_CLEANUP <-- new relstate
> > 6. STATE_READY
> >
> > Of course, this is going to impact almost every aspect of the patch, but I
> > think everything will be basically the same as you have it now -- only all
> > the state names and comments need to be adjusted according to the above.
>
> Although I agree the CLEANUP naming is easier to understand, I am a bit
> concerned that the changes would be a bit invasive.
>
> If we add a CLEANUP state at the end as suggested, it will change the meaning
> of the existing SYNCDONE state: before the change it means both the data sync
> and the cleanup have been done, but after the change it only means the data
> sync is over. This also means all the current C code that treats SYNCDONE as
> the final state of table sync will need to be changed. Moreover, it's common
> for users to query the relation state from pg_subscription_rel to identify
> whether the table sync of a table is finished (e.g. check relstate IN
> ('r', 's')), but if we add a new state (CLEANUP) as the final state, then all
> these SQLs would need to be changed to check something like relstate IN
> ('r', 'x' (new cleanup state)).

IIUC, you are saying that we still want to keep the SYNCDONE state as the last
state before READY, mainly because otherwise there is too much impact on
user/test SQL that is currently checking those ('s', 'r') states.

OTOH, in the current 0001 patch you had SUBREL_STATE_PRE_SYNCDONE meaning
"synchronized but not yet cleaned up" (that's verbatim from your PGDOCS). And
there is C code where you are checking SUBREL_STATE_PRE_SYNCDONE and
essentially giving the state before SYNCDONE an equal status to SYNCDONE
(e.g. should_apply_changes_for_rel seemed to be doing this). It seems to be
trying to have an each-way bet...

~~~

But I think there may be an easy way out of this problem:

Current HEAD
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE 's'
5. STATE_READY 'r'

The patch 0001
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_PRESYNCDONE 'p' <-- new relstate
5. STATE_SYNCDONE 's'
6. STATE_READY 'r'

My previous suggestion (which you acknowledge is easier to understand, but
might cause hassles for existing SQL)
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE 's'
5. STATE_CLEANUP 'x' <-- new relstate
6. STATE_READY 'r'

SUGGESTED (hack to solve everything?)
1. STATE_INIT 'i'
2. STATE_DATASYNC 'd'
3. STATE_FINISHEDCOPY 'f'
4. STATE_SYNCDONE_PRE_CLEANUP 'x' <-- change the char code for this existing relstate (was SYNCDONE 's')
5. STATE_SYNCDONE_WITH_CLEANUP 's' <-- new relstate using 's'
6. STATE_READY 'r'

By commandeering the 's' flag for the new CLEANUP state, no existing user code
or test code needs to change -- IIUC everything will work the same as before.

~

Hmmm -- in hindsight, perhaps I have gone around in a big circle here, and the
solution I am describing is almost exactly the same as your patch 0001, only
with better names for the relstates.

Thoughts?

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, Feb 3, 2023 at 6:58 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:
>
> On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Feb 2, 2023 at 12:05 PM houzj.fnst@fujitsu.com
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21@gmail.com> wrote:
> > > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21@gmail.com> wrote:
> > > > >
> > > I also tried to measure the time taken by "src/test/subscription/t/002_types.pl"
> > > before and after the patch (changing the lock level) and Tom's patch
> > > (splitting the transaction), like what Vignesh has shared on -hackers.
> > >
> > > I ran each case about 100 times. Tom's patch and the lock level patch
> > > behave similarly on my machine[1]:
> > >
> > > HEAD: 3426 ~ 6425 ms
> > > HEAD + Tom: 3404 ~ 3462 ms
> > > HEAD + Vignesh: 3419 ~ 3474 ms
> > > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> > >
> > > Even apart from the testing time reduction, reducing the lock level and
> > > locking the specific object also reduces lock contention, which users (of
> > > the exposed function), the tablesync worker, and the apply worker can all
> > > benefit from. So, I think pushing the patch to change the lock level
> > > makes sense.
> > >
> > > And the patch looks good to me.
> >
> > Thanks for the tests. I also see a reduction in test time variability with
> > Vignesh's patch. I think we can release the locks in case the origin is
> > concurrently dropped, as in the attached patch. I am planning to commit
> > this patch tomorrow unless there are more comments or objections.
> >
> > > While on it, after pushing the patch, I think there is another case that
> > > might also be worth improving: the tablesync worker and apply worker both
> > > try to drop the same origin, which might cause some delay. This is a
> > > different case (from the deadlock), so I feel we can try to improve it in
> > > another patch.
> >
> > Right, I think that case could be addressed by Tom's patch to some extent,
> > but I am thinking we should also try to analyze whether we can completely
> > avoid the need to remove origins from both processes. One idea could be to
> > introduce another relstate, something like PRE_SYNCDONE, and set it in a
> > separate transaction before we set the state as SYNCDONE and remove the
> > slot and origin in the tablesync worker.
> > Now, if the tablesync worker errors out for some reason during the second
> > transaction, it can remove the slot and origin after restart by checking
> > the state. However, it would add another relstate, which may not be the
> > best way to address this problem. Anyway, that can be accomplished as a
> > separate patch.
>
> Here is an attempt to achieve the same.
> Basically, the patch removes the code that drops the origin in the apply
> worker, and adds a new state PRE_SYNCDONE that is reached after
> synchronization has finished in front of the apply position (sublsn set),
> but before dropping the origin and the other final cleanup. The tablesync
> worker will restart and redo the cleanup if it failed after reaching the new
> state. Besides, since changes can already be applied to the table in the
> PRE_SYNCDONE state, I also modified the check in
> should_apply_changes_for_rel(). Some other conditions for the origin drop in
> subscription commands were also adjusted in this patch.

BTW, tablesync.c has a large file header comment which describes all about the
relstates, including some examples. So this patch needs to include
modifications to that comment.

------
Kind Regards,
Peter Smith.
Fujitsu Australia