Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
Alexander Lakhin
Date:
Hello hackers,

After determining a possible cause for intermittent failures of the test
subscription/031_column_list [1], I was wondering what makes another
subscription test (014_binary) fail on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39

All those failures were caused by a timeout when waiting for a message expected
in _subscriber.log. For example, in snakefly's case:
[01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at 
/home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

_subscriber.log contains:
2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-12-21 01:17:46.756 UTC [409999] ERROR:  could not receive data from WAL stream: server closed the connection 
unexpectedly
         This probably means the server terminated abnormally
         before or while processing the request.
2023-12-21 01:17:46.760 UTC [405057] LOG:  background worker "logical replication apply worker" (PID 409999) exited with exit code 1
2023-12-21 01:17:46.779 UTC [532857] LOG:  logical replication apply worker for subscription "tsub" has started
...

While _subscriber.log from a successful test run contains:
2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: session time: 0:00:00.014 user=postgres 
database=postgres host=[local]
2024-01-26 03:49:07.558 UTC [9729:1] LOG:  logical replication apply worker for subscription "tsub" has started
2024-01-26 03:49:07.563 UTC [9731:1] LOG:  logical replication table synchronization worker for subscription "tsub", 
table "test_mismatching_types" has started
2024-01-26 03:49:07.585 UTC [9731:2] ERROR:  incorrect binary data format
2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT:  COPY test_mismatching_types, line 1, column a

In this case, "logical replication apply worker for subscription "tsub" has
started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later.

I've managed to reproduce this failure locally by running multiple tests in
parallel, and my analysis shows that it is caused by a race condition when
accessing variable table_states_valid inside tablesync.c.

tablesync.c does the following with table_states_valid:
/*
  * Callback from syscache invalidation.
  */
void
invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
{
     table_states_valid = false;
}
...
static bool
FetchTableStates(bool *started_tx)
{
...
     if (!table_states_valid)
     {
...
         /* Fetch all non-ready tables. */
         rstates = GetSubscriptionRelations(MySubscription->oid, true);
...
         table_states_valid = true;
     }

So, when a syscache invalidation occurs inside the code block
"if (!table_states_valid)", that invalidation is effectively ignored, because
table_states_valid is unconditionally set back to true at the end of the block.

In a failed case I observe the following events:
1. logical replication apply worker performs
  LogicalRepApplyLoop() -> process_syncing_tables() ->
  process_syncing_tables_for_apply() -> FetchTableStates() periodically.

2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates syscache
  for SUBSCRIPTIONRELMAP, and that leads to calling
  invalidate_syncing_table_states().

3. If the apply worker manages to fetch no non-ready tables in
  FetchTableStates() and to ignore the "table_states_valid = false" set by
  invalidate_syncing_table_states(), then it just misses the invalidation
  event, so it keeps working without noticing that non-ready tables appeared
  as a result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
  the loop "foreach(lc, table_states_not_ready) ..." until some other event
  occurs).

pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my
analysis -- without it, I need tens of iterations (with 50 tests running in
parallel) to catch the failure, but with it, I get the failure on the first
iteration.
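For illustration, the sleep goes roughly here (just a sketch with unrelated
code elided, not a proposed change):

     if (!table_states_valid)
     {
...
         /* Fetch all non-ready tables. */
         rstates = GetSubscriptionRelations(MySubscription->oid, true);

         /*
          * Added only to widen the race window: an invalidation processed
          * during this sleep sets table_states_valid to false, which is
          * then overwritten below, so the invalidation is lost.
          */
         pg_usleep(100000);      /* 100 ms */
...
         table_states_valid = true;
     }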

(Reproduced on REL_16_STABLE..master, where the test 014_binary tries to
"Refresh the publication to trigger the tablesync", but I think the race
condition exists in the previous versions as well.)

[1] https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com

Best regards,
Alexander



On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
> [...]
> pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my
> analysis -- without it, I need tens of iterations (with 50 tests running in
> parallel) to catch the failure, but with it, I get the failure on the first
> iteration.

Thanks for the analysis, I was able to reproduce the issue with the
steps you had shared. I agree with your analysis. I added some logs to
verify that the invalidation was getting missed.

I felt that this invalidation is getting ignored because we have used
a boolean variable here. How about changing it slightly so that
table_states_invalid gets incremented for every invalidation and then
decremented after getting the non-ready tables, as in the attached
patch? I was able to verify that the test passes with the attached
patch.
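Roughly, the idea is the following (only a sketch of the approach as
described above; the attached patch may differ in details):

/* Count invalidations instead of collapsing them into a boolean. */
static int table_states_invalid = 0;

void
invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
{
     table_states_invalid++;
}

static bool
FetchTableStates(bool *started_tx)
{
...
     if (table_states_invalid > 0)
     {
...
         /* Fetch all non-ready tables. */
         rstates = GetSubscriptionRelations(MySubscription->oid, true);
...
         /* Decrement only now that the list has been rebuilt; if more
          * invalidations arrived in the meantime, the count stays above
          * zero and the list is rebuilt again on a later call. */
         table_states_invalid--;
     }
...
}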

Regards,
Vignesh

Attachments

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
"Zhijie Hou (Fujitsu)"
Date:
On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> [...]
> 
> Thanks for the analysis, I was able to reproduce the issue with the steps you had
> shared. I agree with your analysis. I added some logs to verify that the
> invalidation was getting missed.
> 
> I felt that this invalidation is getting ignored because we have used a boolean
> variable here, how about changing it slightly so that table_states_invalid gets
> incremented for every invalidation and then decrementing table_states_invalid
> after getting the non-ready tables like in the attached patch.  I was able to
> verify that the test passes with the attached patch.

Thanks for the patch.

I am not sure if counting the invalidation number is needed, as even if there
are hundreds of invalidations outside of FetchTableStates, one FetchTableStates
call should reset the count to 0, since it is checking the latest catalog. I
think a better approach is to make table_states_valid a tristate
{TABLE_STATE_NEEDS_REBUILD, TABLE_STATE_REBUILD_START, TABLE_STATE_VALID}. The
cache invalidation callback will set the state to TABLE_STATE_NEEDS_REBUILD.
FetchTableStates will set the state to TABLE_STATE_VALID only if the state was
still TABLE_STATE_REBUILD_START. See the existing EventTriggerCacheStateType
for reference.
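Something like the following sketch (the enum type name is only
illustrative):

typedef enum
{
     TABLE_STATE_NEEDS_REBUILD,
     TABLE_STATE_REBUILD_START,
     TABLE_STATE_VALID,
} SyncingTablesState;

static SyncingTablesState table_states_valid = TABLE_STATE_NEEDS_REBUILD;

void
invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
{
     /* An invalidation at any point forces a rebuild. */
     table_states_valid = TABLE_STATE_NEEDS_REBUILD;
}

static bool
FetchTableStates(bool *started_tx)
{
...
     if (table_states_valid != TABLE_STATE_VALID)
     {
         table_states_valid = TABLE_STATE_REBUILD_START;
...
         /* Fetch all non-ready tables. */
         rstates = GetSubscriptionRelations(MySubscription->oid, true);
...
         /* Declare the list valid only if no invalidation arrived while it
          * was being built; otherwise the next call rebuilds it. */
         if (table_states_valid == TABLE_STATE_REBUILD_START)
             table_states_valid = TABLE_STATE_VALID;
     }
...
}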

Best Regards,
Hou zj


On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> [...]
>
> Thanks for the patch.
>
> I am not sure if counting the invalidation number is needed, as even if there
> are hundreds of invalidations outside of FetchTableStates, one FetchTableStates
> call should reset the count to 0 as it is checking the latest catalog.

Another approach I was thinking of is to reset table_states_valid
immediately at the beginning of FetchTableStates, so any new
invalidations will take care of setting table_states_valid again,
which will be handled in the next iteration of fetching non-ready
tables, like in the attached patch.

Regards,
Vignesh

Attachments

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, January 30, 2024 11:21 AM vignesh C <vignesh21@gmail.com> wrote:
> 
> [...]

I think this is not the standard approach for cache building, because if any
ERROR happens during the cache build, then we end up in a situation where
table_states_valid = true while the cached data is invalid. Even though we
currently don't access this cache after erroring out, I think we'd better
avoid this risk.

Best Regards,
Hou zj

On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> [...]

Here is an updated patch which changes the boolean variable to a
tri-state enum and sets the table state to valid only if no invalidations
have occurred while the list is being prepared.

Regards,
Vignesh

Attachments

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
Alexander Lakhin
Date:
Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:
> Here is an updated patch which changes the boolean variable to a
> tri-state enum and set stable state to valid only if no invalidations
> have been occurred while the list is being prepared.
>

While testing the v3 patch, I observed another anomaly with the 014_binary
test. When I run the test in parallel, I see that sometimes one of the test
instances runs much longer than others. For example:
49      All tests successful.
49      Files=1, Tests=8,  4 wallclock secs ( 0.03 usr  0.01 sys + 0.43 cusr  0.30 csys =  0.77 CPU)
49      Result: PASS
12      All tests successful.
12      Files=1, Tests=8, 184 wallclock secs ( 0.02 usr  0.01 sys + 0.46 cusr  0.40 csys =  0.89 CPU)
12      Result: PASS

As far as I can see, this happens due to another race condition, this time
in launcher.c.
For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG:  relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR:  logical replication target relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT:  processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG:  statement: CREATE TABLE public.test_mismatching_types (
             a int PRIMARY KEY
         );
2024-02-01 14:33:13.605 UTC [942451] DEBUG:  unregistering background worker "logical replication apply worker for subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG:  background worker "logical replication apply worker" (PID 949255) exited with exit code 1
...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG:  statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG:  starting logical replication worker for subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE had not
been executed on the subscriber yet, and a new replication worker started only
because a timeout occurred in WaitLatch(..., wait_time, ...) inside
ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE,
i.e. 180 seconds).

But in a normal (fast) case, the same WaitLatch exits due to MyLatch being set
as a result of:
logical replication apply worker| logicalrep_worker_onexit() ->
   ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) ->
launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)).

In a bad case, I see that SetLatch() is called as well, but then the latch
gets reset by the following code in WaitForReplicationWorkerAttach():
         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                        10L, WAIT_EVENT_BGWORKER_STARTUP);

         if (rc & WL_LATCH_SET)
         {
             ResetLatch(MyLatch);
             CHECK_FOR_INTERRUPTS();
         }

With pg_usleep(300000); added just before ResetLatch and
  $node_subscriber->safe_psql(
      'postgres', qq(
+SELECT pg_sleep(0.5);
      CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel,
just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8,  5 wallclock secs ( 0.00 usr  0.00 sys +  0.24 cusr  0.18 csys =  0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr  0.00 sys +  0.28 cusr  0.25 csys =  0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the
following events:
1. logicalrep worker launcher launches replication worker and waits for it
  to attach:
  ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
2. logicalrep worker exits due to some error (logical replication target
  relation does not exist, in our case) and sends a signal to set the latch
  for launcher
3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets
  it inside WaitForReplicationWorkerAttach()
4. launcher gets back to ApplyLauncherMain() where it waits for the latch
  (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the
test 027_nosuperuser executes for 3+ minutes on each run for me.
And we can find such abnormal execution on the buildfarm too:

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check

So, if PG_TEST_TIMEOUT_DEFAULT were less than DEFAULT_NAPTIME_PER_CYCLE,
we would notice this situation as a test failure, but those timeouts are
equal (accidentally?), hence wait_for_log() just waits long enough
for an expected replication worker to start eventually.

As to the initial issue fixes, v3 and v2 look good to me, but v2 is more
appealing for back-patching (if this fix is going to be back-patched), IMHO.

Best regards,
Alexander



On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> [...]

Thanks for the steps; I was able to reproduce this issue in my
environment with the steps provided. The attached patch has a
proposed fix where the latch will not be set in case the apply
worker exits immediately after starting.

Regards,
Vignesh

Attachments

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
Alexander Lakhin
Date:
05.02.2024 13:13, vignesh C wrote:
> Thanks for the steps for the issue, I was able to reproduce this issue
> in my environment with the steps provided. The attached patch has a
> proposed fix where the latch will not be set in case of the apply
> worker exiting immediately after starting.

It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is
called by a backend executing a CREATE SUBSCRIPTION command.
That is, with the v4-0002 patch applied and pg_usleep(300000L); added
just below
             if (!worker_in_use)
                 return worker_in_use;
I still observe the test 027_nosuperuser running for 3+ minutes:
t/027_nosuperuser.pl .. ok
All tests successful.
Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 csys =  9.30 CPU)

IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
regression_sub ...", gets missed while the launcher waits for the start of
another worker (the logical replication worker for subscription "admin_sub"),
launched just before that command.

Best regards,
Alexander



On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > [...]
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup()
> called by a backend executing CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(300000L); added
> just below
>              if (!worker_in_use)
>                  return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr  0.00 sys +  4.82 cusr  4.47 csys =  9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed when launcher waits for start of another
> worker (logical replication worker for subscription "admin_sub"), launched
> just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here, as that
ends up delaying the start of the apply worker until the 180-second
timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not
reset the latch and lets ApplyLauncherMain reset it and check whether
any new or missing worker needs to be started.
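Roughly, the change amounts to the following in
WaitForReplicationWorkerAttach() (a sketch of the idea, not the exact patch):

         rc = WaitLatch(MyLatch,
                        WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                        10L, WAIT_EVENT_BGWORKER_STARTUP);

         if (rc & WL_LATCH_SET)
         {
             /*
              * Deliberately no ResetLatch(MyLatch) here: the latch stays
              * set, so ApplyLauncherMain() wakes up immediately, resets it
              * itself and rechecks whether any new or missing worker needs
              * to be started.  That way a wakeup sent by CREATE/ALTER
              * SUBSCRIPTION or by an exiting worker is not lost.
              */
             CHECK_FOR_INTERRUPTS();
         }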

Regards,
Vignesh

Attachments

Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
Alexander Lakhin
Date:
08.02.2024 12:25, vignesh C wrote:
> Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
> getting missed in this case. The wakeup call can be sent during
> subscription creation/modification and when the apply worker exits.
> WaitForReplicationWorkerAttach should not reset the latch here as it
> will end up delaying the apply worker to get started after 180 seconds
> timeout(DEFAULT_NAPTIME_PER_CYCLE). The attached patch does not reset
> the latch and lets ApplyLauncherMain to reset the latch and checks if
> any new worker or missing worker needs to be started.

Thank you for the updated patch!
I ran all the subscription tests in a loop (with the sleeps added as
before) and observed no failures and no 180+ second durations.

Best regards,
Alexander



On Thu, 8 Feb 2024 at 23:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 08.02.2024 12:25, vignesh C wrote:
> > [...]
>
> Thank you for the updated patch!
> I ran all the subscription tests in a loop (with the sleeps added as
> before) and observed no failures and 180+ seconds duration.

Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/

Regards,
Vignesh





On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:

> Thanks, I have created the following Commitfest entry for this:
> https://commitfest.postgresql.org/47/4816/
>
> Regards,
> Vignesh

Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem. 
Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.

regards,
Ajin Cherian
Fujitsu Australia
On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> [...]
> Thanks for the patch, I have verified that the fix works well by following
> the steps mentioned to reproduce the problem.
> Reviewing the patch, it seems good and is well documented. Just one minor
> comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense
> when it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes
for the same.

Regards,
Vignesh

Attachments

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

From:
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
> [...]
> 
> Thanks for reviewing the patch, the attached v6 patch has the changes for the
> same.

Thanks for the patches.

I saw a recent similar BF error [1] which seems related to the issue that the
0001 patch is trying to solve, i.e. the table sync worker is somehow not
started after refreshing the publication on the subscriber. I didn't see other
related ERRORs in the log, so I think the reason is the same as the one being
discussed in this thread, which is that the table state invalidation got lost.
The 0001 patch looks good to me.

For 0002, instead of avoiding the latch reset, is it possible to let the
logical rep worker wake up the launcher once after attaching?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2024-03-11%2000%3A52%3A42

Best Regards,
Hou zj

On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >>
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > >
> > > Thanks for the patch, I have verified that the fix works well by following the
> > steps mentioned to reproduce the problem.
> > > Reviewing the patch, it seems good and is well documented. Just one minor
> > comment I had was probably to change the name of the variable
> > table_states_valid to table_states_validity. The current name made sense when
> > it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes for the
> > same.
>
> Thanks for the patches.
>
> I saw a recent similar BF error[1] which seems related to the issue that 0001
> patch is trying to solve. i.e. The table sync worker is somehow not started
> after refreshing the publication on the subscriber. I didn't see other related ERRORs in
> the log, so I think the reason is the same as the one being discussed in this
> thread, which is the table state invalidation got lost. And the 0001 patch
> looks good to me.
>
> For 0002, instead of avoid resetting the latch, is it possible to let the
> logical rep worker wake up the launcher once after attaching ?

Waking up the launcher process uses the same latch that is used to signal
subscription creation/modification and apply worker process exit. As
the handling of this latch for subscription creation/modification and
worker process exit can be done only by ApplyLauncherMain, we will not
be able to reset the latch in WaitForReplicationWorkerAttach. I feel
waking up the launcher process might not help in this case, as
currently we will not be able to differentiate between a worker
attaching, subscription creation/modification, and an apply worker
process exiting.

Regards,
Vignesh



RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
> 
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following the
> steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor
> comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense when
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> 
> Thanks for reviewing the patch, the attached v6 patch has the changes for the
> same.

FYI, I noticed that there is one open item on
https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related to
the fix in this thread.

--
Intermittent failures in 040_standby_failover_slots_sync test
    Possible solution in this thread: Race condition in FetchTableStates
--

AFAICS, the bug discussed here is not a new issue in
PG17, so I am thinking of moving the item to the "Older bugs affecting stable
branches" section if there are no objections.

Best Regards,
Hou zj


On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> >
> > For 0002, instead of avoid resetting the latch, is it possible to let the
> > logical rep worker wake up the launcher once after attaching ?
>
> Waking up of the launch process uses the same latch that is used for
> subscription creation/modification and apply worker process exit. As
> the handling of this latch for subscription creation/modification and
> worker process exit can be done only by ApplyLauncherMain, we will not
> be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> waking up the launcher process might not help in this case as
> currently we will not be able to differentiate between worker
> attached, subscription creation/modification and apply worker process
> exit.
>

IIUC, even if we set the latch once the worker attaches, the other
SetLatch() from subscription creation/modification or apply_worker_exit
could also be consumed due to the latch reset in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach(), say a condition variable? The current
proposal can fix the issue but looks a bit ad hoc.
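
A rough sketch of that alternative, purely for illustration (worker_attach_cv
and worker_has_attached() are made-up names; no such condition variable exists
in the tree today):

    /* Hypothetical: a dedicated condition variable for "worker attached". */
    ConditionVariable worker_attach_cv;    /* would live in shared memory */

    /* Launcher side, instead of the latch wait in WaitForReplicationWorkerAttach(): */
    ConditionVariablePrepareToSleep(&worker_attach_cv);
    while (!worker_has_attached(worker))
        ConditionVariableSleep(&worker_attach_cv, WAIT_EVENT_BGWORKER_STARTUP);
    ConditionVariableCancelSleep();

    /* Worker side, once it has attached to its slot: */
    ConditionVariableBroadcast(&worker_attach_cv);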

--
With Regards,
Amit Kapila.



On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the
> > problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the
> > name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but
> > now that it is a tri-state enum, it doesn't fit well.
>
> Thanks for reviewing the patch, the attached v6 patch has the changes
> for the same.
>

v6_0001* looks good to me. This should be backpatched unless you or
others think otherwise.

--
With Regards,
Amit Kapila.



On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > >
> > > For 0002, instead of avoid resetting the latch, is it possible to let the
> > > logical rep worker wake up the launcher once after attaching ?
> >
> > Waking up of the launch process uses the same latch that is used for
> > subscription creation/modification and apply worker process exit. As
> > the handling of this latch for subscription creation/modification and
> > worker process exit can be done only by ApplyLauncherMain, we will not
> > be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> > waking up the launcher process might not help in this case as
> > currently we will not be able to differentiate between worker
> > attached, subscription creation/modification and apply worker process
> > exit.
> >
>
> IIUC, even if we set the latch once the worker attaches, the other
> set_latch by subscription creation/modification or apply_worker_exit
> could also be consumed due to reset of latch in
> WaitForReplicationWorkerAttach(). Is that understanding correct? If
> so, can we use some other way to wake up
> WaitForReplicationWorkerAttach() say condition variable?
>

The other possibility is to have a GUC, launcher_retry_time or
something like that, instead of using DEFAULT_NAPTIME_PER_CYCLE. This
still may not resolve the issue if launcher_retry_time is set to a long
value, but at least users would be able to configure it. I am not sure
if this is a good idea or not, but I am just trying to brainstorm
different ideas to solve this problem.
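
Just to illustrate what such a knob could look like (launcher_retry_time does
not exist; this is a hypothetical entry for the ConfigureNamesInt[] table in
guc_tables.c, with the default matching DEFAULT_NAPTIME_PER_CYCLE):

    {
        {"launcher_retry_time", PGC_SIGHUP, REPLICATION_SUBSCRIBERS,
            gettext_noop("Sets the maximum time the logical replication launcher "
                         "waits before rechecking which workers need to be started."),
            NULL,
            GUC_UNIT_MS
        },
        &launcher_retry_time,
        180000, 1, INT_MAX,
        NULL, NULL, NULL
    },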

BTW, as far as I understand, this is an improvement in the existing
code, so should be done only for HEAD (probably PG18) and should be
discussed in a separate thread.

--
With Regards,
Amit Kapila.



On Wed, 24 Apr 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > >
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >>
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > >
> > > Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the
> > > problem.
> > > Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change
> > > the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool,
> > > but now that it is a tri-state enum, it doesn't fit well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes
> > for the same.
> >
>
> v6_0001* looks good to me. This should be backpatched unless you or
> others think otherwise.

This patch needs to be committed in master, PG16 and PG15.
It is not required for PG14 and earlier branches because they don't have
the HasSubscriptionRelations call before updating table_states_valid:
    /*
     * Does the subscription have tables?
     *
     * If there were not-READY relations found then we know it does. But
     * if table_states_not_ready was empty we still need to check again to
     * see if there are 0 tables.
     */
    has_subrels = (table_states_not_ready != NIL) ||
      HasSubscriptionRelations(MySubscription->oid);

So the invalidation function will not be called here.

Whereas for PG15 and newer versions this is applicable:
HasSubscriptionRelations calls the table_open function, which can trigger the
invalidation callback, as in:
HasSubscriptionRelations -> table_open -> relation_open ->
LockRelationOid -> AcceptInvalidationMessages ->
ReceiveSharedInvalidMessages -> LocalExecuteInvalidationMessage ->
CallSyscacheCallbacks -> invalidate_syncing_table_states
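
For reference, a minimal sketch of the tri-state validity approach reviewed
above (the names are illustrative, following the review comments in this
thread; the committed patch may differ in detail):

    typedef enum
    {
        SYNC_TABLE_STATE_NEEDS_REBUILD,
        SYNC_TABLE_STATE_REBUILD_STARTED,
        SYNC_TABLE_STATE_VALID,
    } SyncingTablesState;

    static SyncingTablesState table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;

    void
    invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
    {
        table_states_validity = SYNC_TABLE_STATE_NEEDS_REBUILD;
    }

    static bool
    FetchTableStates(bool *started_tx)
    {
    ...
        if (table_states_validity != SYNC_TABLE_STATE_VALID)
        {
            /* Mark the rebuild as in progress before touching the catalogs. */
            table_states_validity = SYNC_TABLE_STATE_REBUILD_STARTED;
    ...
            /* Fetch all non-ready tables. */
            rstates = GetSubscriptionRelations(MySubscription->oid, true);
    ...
            /*
             * Declare the cache valid only if no invalidation (for example
             * via the HasSubscriptionRelations -> table_open path above)
             * arrived while we were rebuilding; otherwise the next call
             * rebuilds again instead of losing the invalidation.
             */
            if (table_states_validity == SYNC_TABLE_STATE_REBUILD_STARTED)
                table_states_validity = SYNC_TABLE_STATE_VALID;
        }
    ...
    }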

The attached patch
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
applies for master and PG16 branch,
v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
applies for PG15 branch.

Regards,
Vignesh

Attachments

RE: Race condition in FetchTableStates() breaks synchronization of subscription tables

From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> 
> On Wed, 24 Apr 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > >
> > > >
> > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com>
> wrote:
> > > >>
> > > >>
> > > >> Thanks, I have created the following Commitfest entry for this:
> > > >> https://commitfest.postgresql.org/47/4816/
> > > >>
> > > >> Regards,
> > > >> Vignesh
> > > >
> > > >
> > > > Thanks for the patch, I have verified that the fix works well by following
> the steps mentioned to reproduce the problem.
> > > > Reviewing the patch, it seems good and is well documented. Just one
> minor comment I had was probably to change the name of the variable
> table_states_valid to table_states_validity. The current name made sense when
> it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> > >
> > > Thanks for reviewing the patch, the attached v6 patch has the
> > > changes for the same.
> > >
> >
> > v6_0001* looks good to me. This should be backpatched unless you or
> > others think otherwise.
> 
> This patch needs to be committed in master,PG16 and PG15.
> This is not required from PG14 onwards because they don't have
> HasSubscriptionRelations call before updating table_states_valid:
>     /*
>      * Does the subscription have tables?
>      *
>      * If there were not-READY relations found then we know it does. But
>      * if table_states_not_ready was empty we still need to check again to
>      * see if there are 0 tables.
>      */
>     has_subrels = (table_states_not_ready != NIL) ||
>       HasSubscriptionRelations(MySubscription->oid);
> 
> So the invalidation function will not be called here.
> 
> Whereas for PG15 and newer versions this is applicable:
> HasSubscriptionRelations calls table_open function which will get the
> invalidate callback like in:
> HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid
> -> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages ->
> LocalExecuteInvalidationMessage -> CallSyscacheCallbacks ->
> invalidate_syncing_table_states
> 
> The attached patch
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> applies for master and PG16 branch,
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> applies for PG15 branch.

Thanks, I have verified that the patches can be applied cleanly and fix the
issue on each branch. The regression test can also pass after applying the patch
on my machine.

Best Regards,
Hou zj

On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> >
> > The attached patch
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > applies for master and PG16 branch,
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > applies for PG15 branch.
>
> Thanks, I have verified that the patches can be applied cleanly and fix the
> issue on each branch. The regression test can also pass after applying the patch
> on my machine.
>

Pushed.

--
With Regards,
Amit Kapila.



On Wed, 24 Apr 2024 at 15:49, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > >
> > > > For 0002, instead of avoid resetting the latch, is it possible to let the
> > > > logical rep worker wake up the launcher once after attaching ?
> > >
> > > Waking up of the launch process uses the same latch that is used for
> > > subscription creation/modification and apply worker process exit. As
> > > the handling of this latch for subscription creation/modification and
> > > worker process exit can be done only by ApplyLauncherMain, we will not
> > > be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> > > waking up the launcher process might not help in this case as
> > > currently we will not be able to differentiate between worker
> > > attached, subscription creation/modification and apply worker process
> > > exit.
> > >
> >
> > IIUC, even if we set the latch once the worker attaches, the other
> > set_latch by subscription creation/modification or apply_worker_exit
> > could also be consumed due to reset of latch in
> > WaitForReplicationWorkerAttach(). Is that understanding correct? If
> > so, can we use some other way to wake up
> > WaitForReplicationWorkerAttach() say condition variable?
> >
>
> The other possibility is to have a GUC launcher_retry_time or
> something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This
> still may not resolve the issue if launcher_retry_time is longer but
> at least users would be able to configure it. I am not sure if this is
> a good idea or not but just trying to brainstorm different ideas to
> solve this problem.
>
> BTW, as far as I understand, this is an improvement in the existing
> code, so should be done only for HEAD (probably PG18) and should be
> discussed in a separate thread.

I have started a new thread at [1] to discuss this:
https://www.postgresql.org/message-id/CALDaNm01_KEgHM1tKtgXkCGLJ5209SMSmGw3UmhZbOz365_%3DeA%40mail.gmail.com

Regards,
Vignesh



On Thu, 25 Apr 2024 at 16:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > >
> > > The attached patch
> > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > > applies for master and PG16 branch,
> > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > > applies for PG15 branch.
> >
> > Thanks, I have verified that the patches can be applied cleanly and fix the
> > issue on each branch. The regression test can also pass after applying the patch
> > on my machine.
> >
>
> Pushed.

Thanks for pushing this, I have marked the commitfest entry at [1] as committed.
[1] - https://commitfest.postgresql.org/48/4816/

Regards,
Vignesh