Discussion: Race condition in FetchTableStates() breaks synchronization of subscription tables
Race condition in FetchTableStates() breaks synchronization of subscription tables
From
Alexander Lakhin
Date:
Hello hackers,

After determining a possible cause for intermittent failures of the test
subscription/031_column_list [1], I was wondering what makes another
subscription test (014_binary) fail on the buildfarm:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=snakefly&dt=2024-01-22%2001%3A19%3A03
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2024-01-14%2018%3A19%3A20
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=piculet&dt=2023-12-21%2001%3A11%3A52
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mylodon&dt=2023-11-27%2001%3A42%3A39

All those failures were caused by a timeout when waiting for a message expected
in _subscriber.log. For example, in the snakefly case:
[01:14:46.158](1.937s) ok 7 - check synced data on subscriber with custom type
timed out waiting for match: (?^:ERROR: ( [A-Z0-9]+:)? incorrect binary data format) at
/home/bf/bf-build/piculet/HEAD/pgsql/src/test/subscription/t/014_binary.pl line 269.

_subscriber.log contains:
2023-12-21 01:14:46.215 UTC [410039] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2023-12-21 01:17:46.756 UTC [409999] ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2023-12-21 01:17:46.760 UTC [405057] LOG: background worker "logical replication apply worker" (PID 409999) exited with exit code 1
2023-12-21 01:17:46.779 UTC [532857] LOG: logical replication apply worker for subscription "tsub" has started
...

While _subscriber.log from a successful test run contains:
2024-01-26 03:49:07.065 UTC [9726:5] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
2024-01-26 03:49:07.075 UTC [9726:6] 014_binary.pl LOG: disconnection: session time: 0:00:00.014 user=postgres database=postgres host=[local]
2024-01-26 03:49:07.558 UTC [9729:1] LOG: logical replication apply worker for subscription "tsub" has started
2024-01-26 03:49:07.563 UTC [9731:1] LOG: logical replication table synchronization worker for subscription "tsub", table "test_mismatching_types" has started
2024-01-26 03:49:07.585 UTC [9731:2] ERROR: incorrect binary data format
2024-01-26 03:49:07.585 UTC [9731:3] CONTEXT: COPY test_mismatching_types, line 1, column a

In this case, "logical replication apply worker for subscription "tsub" has
started" appears just after "ALTER SUBSCRIPTION", not 3 minutes later.

I've managed to reproduce this failure locally by running multiple tests in
parallel, and my analysis shows that it is caused by a race condition when
accessing the variable table_states_valid inside tablesync.c.

tablesync.c does the following with table_states_valid:

    /*
     * Callback from syscache invalidation.
     */
    void
    invalidate_syncing_table_states(Datum arg, int cacheid, uint32 hashvalue)
    {
        table_states_valid = false;
    }
    ...
    static bool
    FetchTableStates(bool *started_tx)
    {
    ...
        if (!table_states_valid)
        {
    ...
            /* Fetch all non-ready tables. */
            rstates = GetSubscriptionRelations(MySubscription->oid, true);
    ...
            table_states_valid = true;
        }

So, when a syscache invalidation occurs inside the code block
"if (!table_states_valid)", that invalidation is effectively ignored.

In a failed case I observe the following sequence of events:

1. The logical replication apply worker periodically performs
LogicalRepApplyLoop() -> process_syncing_tables() ->
process_syncing_tables_for_apply() -> FetchTableStates().

2. ALTER SUBSCRIPTION tsub REFRESH PUBLICATION invalidates the syscache for
SUBSCRIPTIONRELMAP, which leads to a call of invalidate_syncing_table_states().

3. If the apply worker manages to fetch no non-ready tables in
FetchTableStates() and overwrite the "table_states_valid = false" set by
invalidate_syncing_table_states(), then it simply misses the invalidation
event, so it keeps working without noticing the non-ready tables that appeared
as a result of ALTER SUBSCRIPTION (process_syncing_tables_for_apply() skips
the loop "foreach(lc, table_states_not_ready) ..." until some other event
occurs).

A pg_usleep(100000) added just below GetSubscriptionRelations(...) confirms my
analysis: without it, I need tens of iterations (with 50 tests running in
parallel) to catch the failure, but with it, I get the failure on the first
iteration.

(Reproduced on REL_16_STABLE..master, where the test 014_binary tries to
"Refresh the publication to trigger the tablesync", but I think the race
condition exists in previous versions as well.)

[1] https://www.postgresql.org/message-id/16d6d9cc-f97d-0b34-be65-425183ed3721@gmail.com

Best regards,
Alexander
On Fri, 26 Jan 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> [...]
>
> pg_usleep(100000) added just below GetSubscriptionRelations(...) proves my
> analysis -- without it, I need tens of iterations (with 50 tests running in
> parallel) to catch the failure, but with it, I get the failure on the first
> iteration.

Thanks for the analysis, I was able to reproduce the issue with the steps you
had shared. I agree with your analysis. I added some logs to verify that the
invalidation was getting missed.

I felt that this invalidation is getting ignored because we have used a
boolean variable here; how about changing it slightly so that
table_states_invalid gets incremented for every invalidation and then
decremented after getting the non-ready tables, like in the attached patch?
I was able to verify that the test passes with the attached patch.

Regards,
Vignesh
Attachments
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
"Zhijie Hou (Fujitsu)"
Date:
On Monday, January 29, 2024 9:22 PM vignesh C <vignesh21@gmail.com> wrote:
>
> [...]
>
> I felt that this invalidation is getting ignored because we have used a boolean
> variable here, how about changing it slightly so that table_states_invalid gets
> incremented for every invalidation and then decrementing table_states_invalid
> after getting the non-ready tables like in the attached patch. I was able to
> verify that the test passes with the attached patch.

Thanks for the patch.

I am not sure if counting the number of invalidations is needed, as even if
there are hundreds of invalidations outside of FetchTableStates, one
FetchTableStates call should reset the count to 0, since it is checking the
latest catalog.

I think a better approach is to make table_states_valid a tristate:
{TABLE_STATE_NEEDS_REBUILD, TABLE_STATE_REBUILD_START, TABLE_STATE_VALID}.
The cache invalidation callback will set the state to
TABLE_STATE_NEEDS_REBUILD. FetchTableStates will set the state to
TABLE_STATE_VALID only if the state was TABLE_STATE_REBUILD_START. See the
existing EventTriggerCacheStateType for reference.

Best Regards,
Hou zj
On Tue, 30 Jan 2024 at 07:24, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> [...]
>
> I am not sure if counting the invalidation number is needed, as even if there
> are hundreds of invalidations outside of FetchTableStates, one FetchTableStates
> call should reset the count to 0 as it is checking the latest catalog.

Another approach I was thinking of is to reset table_states_valid immediately
at the beginning of FetchTableStates, so any new invalidation will clear
table_states_valid again and be handled in the next iteration of fetching
non-ready tables, like in the attached patch.

Regards,
Vignesh
Attachments
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
"Zhijie Hou (Fujitsu)"
Date:
On Tuesday, January 30, 2024 11:21 AM vignesh C <vignesh21@gmail.com> wrote:
>
> [...]
>
> Another approach I was thinking of is to reset table_states_valid immediately in
> the beginning of FetchTableStates, so any new invalidations will take care of
> setting table_states_valid again, which will be handled in the next iteration of
> fetching non-ready tables like in the attached patch.

I think this is not the standard approach for cache building, because if any
ERROR happens during the cache build, we will end up in a situation where
table_states_valid = true while the cached data is invalid. Even though we
currently don't access this cache after erroring out, I think we'd better
avoid that risk.

Best Regards,
Hou zj
On Tue, 30 Jan 2024 at 17:22, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> [...]
>
> > I am not sure if counting the invalidation number is needed, as even
> > if there are hundreds of invalidations outside of FetchTableStates,
> > one FetchTableStates call should reset the count to 0 as it is checking the
> > latest catalog.
> > Another approach I was thinking of is to reset table_states_valid immediately at the beginning of FetchTableStates, so any new invalidation will set it again and be handled in the next iteration of fetching non-ready tables, as in the attached patch.
>
> I think this is not the standard approach for cache building, because if any ERROR happens during the cache build, we would end up in a situation where table_states_valid=true while the cache data is invalid. Even though we currently don't access this cache after erroring out, I think we'd better avoid this risk.

Here is an updated patch which changes the boolean variable to a tri-state enum and sets the table state to valid only if no invalidations have occurred while the list is being prepared.

Regards,
Vignesh
Attachments
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
Alexander Lakhin
Date:
Hello Vignesh and Hou-san,

01.02.2024 07:59, vignesh C wrote:
> Here is an updated patch which changes the boolean variable to a tri-state enum and set stable state to valid only if no invalidations have been occurred while the list is being prepared.

While testing the v3 patch, I observed another anomaly with the 014_binary test. When I run the test in parallel, I see that sometimes one of the test instances runs much longer than others. For example:
49 All tests successful.
49 Files=1, Tests=8, 4 wallclock secs ( 0.03 usr 0.01 sys + 0.43 cusr 0.30 csys = 0.77 CPU)
49 Result: PASS
12 All tests successful.
12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr 0.01 sys + 0.46 cusr 0.40 csys = 0.89 CPU)
12 Result: PASS

As far as I can see, this happens due to another race condition, this time in launcher.c. For such a three-minute case I see in _subscriber.log:
2024-02-01 14:33:13.604 UTC [949255] DEBUG: relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
2024-02-01 14:33:13.604 UTC [949255] ERROR: logical replication target relation "public.test_mismatching_types" does not exist
2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message type "INSERT" in transaction 757, finished at 0/153C838
...
2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG: statement: CREATE TABLE public.test_mismatching_types ( a int PRIMARY KEY );
2024-02-01 14:33:13.605 UTC [942451] DEBUG: unregistering background worker "logical replication apply worker for subscription 16398"
2024-02-01 14:33:13.605 UTC [942451] LOG: background worker "logical replication apply worker" (PID 949255) exited with exit code 1
...
2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION;
...
2024-02-01 14:36:13.642 UTC [942527] DEBUG: starting logical replication worker for subscription "tsub"
(there is no interesting activity between 14:33:13 and 14:36:13)

So the logical replication apply worker exited because CREATE TABLE had not been executed on the subscriber yet, and a new replication worker started because a timeout occurred in WaitLatch(..., wait_time, ...) inside ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE (180 sec)).

But in a normal (fast) case the same WaitLatch exits because MyLatch is set as a result of:
logical replication apply worker| logicalrep_worker_onexit() -> ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) -> launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch).

In a bad case, I see that SetLatch() is called as well, but then the latch gets reset by the following code in WaitForReplicationWorkerAttach():

rc = WaitLatch(MyLatch,
               WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
               10L, WAIT_EVENT_BGWORKER_STARTUP);

if (rc & WL_LATCH_SET)
{
    ResetLatch(MyLatch);
    CHECK_FOR_INTERRUPTS();
}

With pg_usleep(300000); added just before ResetLatch, and
$node_subscriber->safe_psql(
    'postgres', qq(
+SELECT pg_sleep(0.5);
 CREATE TABLE public.test_mismatching_types (
in 014_binary.pl, I can see the anomaly without running tests in parallel, just when running that test in a loop:
for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done
...
iteration 2
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 5 wallclock secs ( 0.00 usr 0.00 sys + 0.24 cusr 0.18 csys = 0.42 CPU)
Result: PASS
iteration 3
# +++ tap check in src/test/subscription +++
t/014_binary.pl .. ok
All tests successful.
Files=1, Tests=8, 183 wallclock secs ( 0.02 usr 0.00 sys + 0.28 cusr 0.25 csys = 0.55 CPU)
Result: PASS
...

In other words, the abnormal test execution takes place due to the following events:
1.
the logicalrep worker launcher launches a replication worker and waits for it to attach:
ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach()
2. the logicalrep worker exits due to some error (logical replication target relation does not exist, in our case) and sends a signal to set the latch for the launcher
3. the launcher sets the latch in procsignal_sigusr1_handler(), but then resets it inside WaitForReplicationWorkerAttach()
4. the launcher gets back to ApplyLauncherMain(), where it waits for the latch (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms).

Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the test 027_nosuperuser executes for 3+ minutes on each run for me. And we can find such abnormal execution on the buildfarm too:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=snakefly&dt=2024-02-01%2020%3A34%3A03&stg=subscription-check

So, if we had PG_TEST_TIMEOUT_DEFAULT less than DEFAULT_NAPTIME_PER_CYCLE, we would notice this situation as a test failure, but those timeouts are equal (accidentally?), hence, say, wait_for_log() just waits long enough for an expected replication worker to start eventually.

As to the initial issue fixes, v3 and v2 look good to me, but v2 is more appealing for back-patching (if this fix is going to be back-patched), IMHO.

Best regards,
Alexander
On Fri, 2 Feb 2024 at 11:30, Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Vignesh and Hou-san, > > 01.02.2024 07:59, vignesh C wrote: > > Here is an updated patch which changes the boolean variable to a > > tri-state enum and set stable state to valid only if no invalidations > > have been occurred while the list is being prepared. > > > > While testing the v3 patch, I observed another anomaly with the 014_binary > test. When I run the test in parallel, I see that sometimes one of the test > instances runs much longer than others. For example: > 49 All tests successful. > 49 Files=1, Tests=8, 4 wallclock secs ( 0.03 usr 0.01 sys + 0.43 cusr 0.30 csys = 0.77 CPU) > 49 Result: PASS > 12 All tests successful. > 12 Files=1, Tests=8, 184 wallclock secs ( 0.02 usr 0.01 sys + 0.46 cusr 0.40 csys = 0.89 CPU) > 12 Result: PASS > > As far as I can see, this happens due to another race condition, this time > in launcher.c. > For such a three-minute case I see in _subscriber.log: > 2024-02-01 14:33:13.604 UTC [949255] DEBUG: relation "public.test_mismatching_types" does not exist > 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message > type "INSERT" in transaction 757, finished at 0/153C838 > 2024-02-01 14:33:13.604 UTC [949255] ERROR: logical replication target relation "public.test_mismatching_types" does > not exist > 2024-02-01 14:33:13.604 UTC [949255] CONTEXT: processing remote data for replication origin "pg_16398" during message > type "INSERT" in transaction 757, finished at 0/153C838 > ... 
> 2024-02-01 14:33:13.605 UTC [949276] 014_binary.pl LOG: statement: CREATE TABLE public.test_mismatching_types ( > a int PRIMARY KEY > ); > 2024-02-01 14:33:13.605 UTC [942451] DEBUG: unregistering background worker "logical replication apply worker for > subscription 16398" > 2024-02-01 14:33:13.605 UTC [942451] LOG: background worker "logical replication apply worker" (PID 949255) exited with > exit code 1 > ... > 2024-02-01 14:33:13.607 UTC [949276] 014_binary.pl LOG: statement: ALTER SUBSCRIPTION tsub REFRESH PUBLICATION; > ... > 2024-02-01 14:36:13.642 UTC [942527] DEBUG: starting logical replication worker for subscription "tsub" > (there is no interesting activity between 14:33:13 and 14:36:13) > > So logical replication apply worker exited because CREATE TABLE was not > executed on subscriber yet, and new replication worker started because of a > timeout occurred in WaitLatch(..., wait_time, ...) inside > ApplyLauncherMain() (wait_time in this case is DEFAULT_NAPTIME_PER_CYCLE > (180 sec)). > > But in a normal (fast) case the same WaitLatch exits due to MyLatch set as > a result of: > logical replication apply worker| logicalrep_worker_onexit() -> > ApplyLauncherWakeup() -> kill(LogicalRepCtx->launcher_pid, SIGUSR1) -> > launcher| procsignal_sigusr1_handler() -> SetLatch(MyLatch)). 
> > In a bad case, I see that the SetLatch() called as well, but then the latch > gets reset by the following code in WaitForReplicationWorkerAttach(): > rc = WaitLatch(MyLatch, > WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, > 10L, WAIT_EVENT_BGWORKER_STARTUP); > > if (rc & WL_LATCH_SET) > { > ResetLatch(MyLatch); > CHECK_FOR_INTERRUPTS(); > } > > With pg_usleep(300000); added just before ResetLatch and > $node_subscriber->safe_psql( > 'postgres', qq( > +SELECT pg_sleep(0.5); > CREATE TABLE public.test_mismatching_types ( > in 014_binary.pl, I can see the anomaly without running tests in parallel, > just when running that test in a loop: > for ((i=1;i<=10;i++)); do echo "iteration $i"; make -s check -C src/test/subscription/ PROVE_TESTS="t/014*"; done > ... > iteration 2 > # +++ tap check in src/test/subscription +++ > t/014_binary.pl .. ok > All tests successful. > Files=1, Tests=8, 5 wallclock secs ( 0.00 usr 0.00 sys + 0.24 cusr 0.18 csys = 0.42 CPU) > Result: PASS > iteration 3 > # +++ tap check in src/test/subscription +++ > t/014_binary.pl .. ok > All tests successful. > Files=1, Tests=8, 183 wallclock secs ( 0.02 usr 0.00 sys + 0.28 cusr 0.25 csys = 0.55 CPU) > Result: PASS > ... > > In other words, the abnormal test execution takes place due to the > following events: > 1. logicalrep worker launcher launches replication worker and waits for it > to attach: > ApplyLauncherMain() -> logicalrep_worker_launch() -> WaitForReplicationWorkerAttach() > 2. logicalrep worker exits due to some error (logical replication target > relation does not exist, in our case) and sends a signal to set the latch > for launcher > 3. launcher sets the latch in procsignal_sigusr1_handler(), but then resets > it inside WaitForReplicationWorkerAttach() > 4. launcher gets back to ApplyLauncherMain() where it waits for the latch > (not set) or a timeout (which happens after DEFAULT_NAPTIME_PER_CYCLE ms). 
> > Moreover, with that sleep in WaitForReplicationWorkerAttach() added, the > test 027_nosuperuser executes for 3+ minutes on each run for me. Thanks for the steps for the issue, I was able to reproduce this issue in my environment with the steps provided. The attached patch has a proposed fix where the latch will not be set in case of the apply worker exiting immediately after starting. Regards, Vignesh
Attachments
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
Alexander Lakhin
Date:
05.02.2024 13:13, vignesh C wrote:
> Thanks for the steps for the issue, I was able to reproduce this issue in my environment with the steps provided. The attached patch has a proposed fix where the latch will not be set in case of the apply worker exiting immediately after starting.

It looks like the proposed fix doesn't help when ApplyLauncherWakeup() is called by a backend executing a CREATE SUBSCRIPTION command. That is, with the v4-0002 patch applied and pg_usleep(300000L); added just below
if (!worker_in_use)
    return worker_in_use;
I still observe the test 027_nosuperuser running for 3+ minutes:
t/027_nosuperuser.pl .. ok
All tests successful.
Files=1, Tests=19, 187 wallclock secs ( 0.01 usr 0.00 sys + 4.82 cusr 4.47 csys = 9.30 CPU)

IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION regression_sub ...", gets missed while the launcher waits for another worker (the logical replication worker for subscription "admin_sub", launched just before that command) to start.

Best regards,
Alexander
On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue in my environment with the steps provided. The attached patch has a proposed fix where the latch will not be set in case of the apply worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup() called by a backend executing CREATE SUBSCRIPTION command. That is, with the v4-0002 patch applied and pg_usleep(300000L); added just below
> if (!worker_in_use)
>     return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr 0.00 sys + 4.82 cusr 4.47 csys = 9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION regression_sub ...", gets missed when launcher waits for start of another worker (logical replication worker for subscription "admin_sub"), launched just before that command.

Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was getting missed in this case. The wakeup call can be sent during subscription creation/modification and when the apply worker exits. WaitForReplicationWorkerAttach should not reset the latch here, as that would delay the apply worker's start until the 180-second timeout (DEFAULT_NAPTIME_PER_CYCLE) expires. The attached patch does not reset the latch there and instead lets ApplyLauncherMain reset the latch and check whether any new or missing worker needs to be started.

Regards,
Vignesh
Attachments
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
Alexander Lakhin
Date:
08.02.2024 12:25, vignesh C wrote:
> Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was getting missed in this case. The wakeup call can be sent during subscription creation/modification and when the apply worker exits. WaitForReplicationWorkerAttach should not reset the latch here as it will end up delaying the apply worker to get started after 180 seconds timeout(DEFAULT_NAPTIME_PER_CYCLE). The attached patch does not reset the latch and lets ApplyLauncherMain to reset the latch and checks if any new worker or missing worker needs to be started.

Thank you for the updated patch! I ran all the subscription tests in a loop (with the sleeps added as before) and observed no failures and no 180+ second durations.

Best regards,
Alexander
On Thu, 8 Feb 2024 at 23:30, Alexander Lakhin <exclusion@gmail.com> wrote: > > 08.02.2024 12:25, vignesh C wrote: > > Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was > > getting missed in this case. The wakeup call can be sent during > > subscription creation/modification and when the apply worker exits. > > WaitForReplicationWorkerAttach should not reset the latch here as it > > will end up delaying the apply worker to get started after 180 seconds > > timeout(DEFAULT_NAPTIME_PER_CYCLE). The attached patch does not reset > > the latch and lets ApplyLauncherMain to reset the latch and checks if > > any new worker or missing worker needs to be started. > > Thank you for the updated patch! > I ran all the subscription tests in a loop (with the sleeps added as > before) and observed no failures and 180+ seconds duration. Thanks, I have created the following Commitfest entry for this: https://commitfest.postgresql.org/47/4816/ Regards, Vignesh
Re: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
Ajin Cherian
Date:
On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
Thanks, I have created the following Commitfest entry for this:
https://commitfest.postgresql.org/47/4816/
Regards,
Vignesh
Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.
regards,
Ajin Cherian
Fujitsu Australia
On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
>
> On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
>>
>> Thanks, I have created the following Commitfest entry for this:
>> https://commitfest.postgresql.org/47/4816/
>>
>> Regards,
>> Vignesh
>
> Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
> Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.

Thanks for reviewing the patch, the attached v6 patch has the changes for the same.

Regards,
Vignesh
Attachments
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote:
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> > Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.
>
> Thanks for reviewing the patch, the attached v6 patch has the changes for the same.

Thanks for the patches.

I saw a recent similar BF error[1] which seems related to the issue the 0001 patch is trying to solve, i.e., the table sync worker is somehow not started after refreshing the publication on the subscriber. I didn't see other related ERRORs in the log, so I think the reason is the same as the one being discussed in this thread: the table state invalidation got lost. The 0001 patch looks good to me.

For 0002, instead of avoiding resetting the latch, is it possible to let the logical rep worker wake up the launcher once after attaching?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=francolin&dt=2024-03-11%2000%3A52%3A42

Best Regards,
Hou zj
On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wednesday, March 13, 2024 11:49 AMvignesh C <vignesh21@gmail.com> wrote: > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > > > > > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote: > > >> > > >> > > >> Thanks, I have created the following Commitfest entry for this: > > >> https://commitfest.postgresql.org/47/4816/ > > >> > > >> Regards, > > >> Vignesh > > > > > > > > > Thanks for the patch, I have verified that the fix works well by following the > > steps mentioned to reproduce the problem. > > > Reviewing the patch, it seems good and is well documented. Just one minor > > comment I had was probably to change the name of the variable > > table_states_valid to table_states_validity. The current name made sense when > > it was a bool, but now that it is a tri-state enum, it doesn't fit well. > > > > Thanks for reviewing the patch, the attached v6 patch has the changes for the > > same. > > Thanks for the patches. > > I saw a recent similar BF error[1] which seems related to the issue that 0001 > patch is trying to solve. i.e. The table sync worker is somehow not started > after refreshing the publication on the subscriber. I didn't see other related ERRORs in > the log, so I think the reason is the same as the one being discussed in this > thread, which is the table state invalidation got lost. And the 0001 patch > looks good to me. > > For 0002, instead of avoid resetting the latch, is it possible to let the > logical rep worker wake up the launcher once after attaching ? Waking up of the launch process uses the same latch that is used for subscription creation/modification and apply worker process exit. As the handling of this latch for subscription creation/modification and worker process exit can be done only by ApplyLauncherMain, we will not be able to reset the latch in WaitForReplicationWorkerAttach. 
I feel waking up the launcher process might not help in this case, as currently we will not be able to differentiate between a worker attaching, subscription creation/modification, and apply worker exit.

Regards,
Vignesh
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, March 13, 2024 11:49 AM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote: > >> > >> > >> Thanks, I have created the following Commitfest entry for this: > >> https://commitfest.postgresql.org/47/4816/ > >> > >> Regards, > >> Vignesh > > > > > > Thanks for the patch, I have verified that the fix works well by following the > steps mentioned to reproduce the problem. > > Reviewing the patch, it seems good and is well documented. Just one minor > comment I had was probably to change the name of the variable > table_states_valid to table_states_validity. The current name made sense when > it was a bool, but now that it is a tri-state enum, it doesn't fit well. > > Thanks for reviewing the patch, the attached v6 patch has the changes for the > same. FYI, I noticed that there is one open item on https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items which is related to the fix in this thread. -- Intermittent failures in 040_standby_failover_slots_sync test Possible solution in this thread: Race condition in FetchTableStates -- AFAICS, the bug discussed here is not a new issue on PG17, so I am thinking to move the item to the "Older bugs affecting stable branches" section if no objections. Best Regards, Hou zj
On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> >
> > For 0002, instead of avoid resetting the latch, is it possible to let the logical rep worker wake up the launcher once after attaching ?
>
> Waking up of the launch process uses the same latch that is used for subscription creation/modification and apply worker process exit. As the handling of this latch for subscription creation/modification and worker process exit can be done only by ApplyLauncherMain, we will not be able to reset the latch in WaitForReplicationWorkerAttach. I feel waking up the launcher process might not help in this case as currently we will not be able to differentiate between worker attached, subscription creation/modification and apply worker process exit.

IIUC, even if we set the latch once the worker attaches, the other set_latch by subscription creation/modification or apply_worker_exit could also be consumed due to the reset of the latch in WaitForReplicationWorkerAttach(). Is that understanding correct? If so, can we use some other way to wake up WaitForReplicationWorkerAttach(), say, a condition variable? The current proposal can fix the issue but looks a bit ad hoc.

--
With Regards,
Amit Kapila.
On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote: > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote: > > > > > > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote: > >> > >> > >> Thanks, I have created the following Commitfest entry for this: > >> https://commitfest.postgresql.org/47/4816/ > >> > >> Regards, > >> Vignesh > > > > > > Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem. > > Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the nameof the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but nowthat it is a tri-state enum, it doesn't fit well. > > Thanks for reviewing the patch, the attached v6 patch has the changes > for the same. > v6_0001* looks good to me. This should be backpatched unless you or others think otherwise. -- With Regards, Amit Kapila.
On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > For 0002, instead of avoid resetting the latch, is it possible to let the > > > logical rep worker wake up the launcher once after attaching ? > > > > Waking up of the launch process uses the same latch that is used for > > subscription creation/modification and apply worker process exit. As > > the handling of this latch for subscription creation/modification and > > worker process exit can be done only by ApplyLauncherMain, we will not > > be able to reset the latch in WaitForReplicationWorkerAttach. I feel > > waking up the launcher process might not help in this case as > > currently we will not be able to differentiate between worker > > attached, subscription creation/modification and apply worker process > > exit. > > > > IIUC, even if we set the latch once the worker attaches, the other > set_latch by subscription creation/modification or apply_worker_exit > could also be consumed due to reset of latch in > WaitForReplicationWorkerAttach(). Is that understanding correct? If > so, can we use some other way to wake up > WaitForReplicationWorkerAttach() say condition variable? > The other possibility is to have a GUC launcher_retry_time or something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This still may not resolve the issue if launcher_retry_time is longer but at least users would be able to configure it. I am not sure if this is a good idea or not but just trying to brainstorm different ideas to solve this problem. BTW, as far as I understand, this is an improvement in the existing code, so should be done only for HEAD (probably PG18) and should be discussed in a separate thread. -- With Regards, Amit Kapila.
On Wed, 24 Apr 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> > >
> > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > >>
> > >> Thanks, I have created the following Commitfest entry for this:
> > >> https://commitfest.postgresql.org/47/4816/
> > >>
> > >> Regards,
> > >> Vignesh
> > >
> > > Thanks for the patch, I have verified that the fix works well by following the steps mentioned to reproduce the problem.
> > > Reviewing the patch, it seems good and is well documented. Just one minor comment I had was probably to change the name of the variable table_states_valid to table_states_validity. The current name made sense when it was a bool, but now that it is a tri-state enum, it doesn't fit well.
> >
> > Thanks for reviewing the patch, the attached v6 patch has the changes for the same.
>
> v6_0001* looks good to me. This should be backpatched unless you or others think otherwise.

This patch needs to be committed in master, PG16 and PG15. It is not required for PG14 and earlier, because those branches don't have the HasSubscriptionRelations call before table_states_valid is updated:

/*
 * Does the subscription have tables?
 *
 * If there were not-READY relations found then we know it does. But
 * if table_states_not_ready was empty we still need to check again to
 * see if there are 0 tables.
 */
has_subrels = (table_states_not_ready != NIL) ||
    HasSubscriptionRelations(MySubscription->oid);

So the invalidation function will not be called here.
Whereas for PG15 and newer versions this is applicable: HasSubscriptionRelations calls table_open function which will get the invalidate callback like in: HasSubscriptionRelations -> table_open -> relation_open -> LockRelationOid -> AcceptInvalidationMessages -> ReceiveSharedInvalidMessages -> LocalExecuteInvalidationMessage -> CallSyscacheCallbacks -> invalidate_syncing_table_states The attached patch v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch applies for master and PG16 branch, v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch applies for PG15 branch. Regards, Vignesh
Attachments
RE: Race condition in FetchTableStates() breaks synchronization of subscription tables
From
"Zhijie Hou (Fujitsu)"
Date:
On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Wed, 24 Apr 2024 at 11:59, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 13, 2024 at 9:19 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Tue, 12 Mar 2024 at 09:34, Ajin Cherian <itsajin@gmail.com> wrote:
> > > >
> > > > On Tue, Mar 12, 2024 at 2:59 PM vignesh C <vignesh21@gmail.com> wrote:
> > > >>
> > > >> Thanks, I have created the following Commitfest entry for this:
> > > >> https://commitfest.postgresql.org/47/4816/
> > > >>
> > > >> Regards,
> > > >> Vignesh
> > > >
> > > > Thanks for the patch, I have verified that the fix works well by
> > > > following the steps mentioned to reproduce the problem.
> > > > Reviewing the patch, it seems good and is well documented. Just one
> > > > minor comment I had was probably to change the name of the variable
> > > > table_states_valid to table_states_validity. The current name made
> > > > sense when it was a bool, but now that it is a tri-state enum, it
> > > > doesn't fit well.
> > >
> > > Thanks for reviewing the patch, the attached v6 patch has the
> > > changes for the same.
> >
> > v6_0001* looks good to me. This should be backpatched unless you or
> > others think otherwise.
>
> This patch needs to be committed in master, PG16 and PG15.
> This is not required for PG14 and earlier because they don't have the
> HasSubscriptionRelations call before updating table_states_valid:
> /*
>  * Does the subscription have tables?
>  *
>  * If there were not-READY relations found then we know it does. But
>  * if table_states_not_ready was empty we still need to check again to
>  * see if there are 0 tables.
>  */
> has_subrels = (table_states_not_ready != NIL) ||
>     HasSubscriptionRelations(MySubscription->oid);
>
> So the invalidation function will not be called here.
>
> Whereas for PG15 and newer versions this is applicable:
> HasSubscriptionRelations calls the table_open function, which reaches
> the invalidation callback like so:
> HasSubscriptionRelations -> table_open -> relation_open ->
> LockRelationOid -> AcceptInvalidationMessages ->
> ReceiveSharedInvalidMessages -> LocalExecuteInvalidationMessage ->
> CallSyscacheCallbacks -> invalidate_syncing_table_states
>
> The attached patch
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> applies to the master and PG16 branches, and
> v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> applies to the PG15 branch.

Thanks, I have verified that the patches can be applied cleanly and fix
the issue on each branch. The regression test also passes after applying
the patch on my machine.

Best Regards,
Hou zj
On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > The attached patch
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > applies for master and PG16 branch,
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > applies for PG15 branch.
>
> Thanks, I have verified that the patches can be applied cleanly and fix
> the issue on each branch. The regression test can also pass after
> applying the patch on my machine.

Pushed.

--
With Regards,
Amit Kapila.
On Wed, 24 Apr 2024 at 15:49, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Mar 13, 2024 at 11:59 AM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > For 0002, instead of avoid resetting the latch, is it possible to let
> > > > the logical rep worker wake up the launcher once after attaching ?
> > >
> > > Waking up of the launcher process uses the same latch that is used for
> > > subscription creation/modification and apply worker process exit. As
> > > the handling of this latch for subscription creation/modification and
> > > worker process exit can be done only by ApplyLauncherMain, we will not
> > > be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> > > waking up the launcher process might not help in this case as
> > > currently we will not be able to differentiate between worker
> > > attached, subscription creation/modification and apply worker process
> > > exit.
> >
> > IIUC, even if we set the latch once the worker attaches, the other
> > set_latch by subscription creation/modification or apply_worker_exit
> > could also be consumed due to the reset of the latch in
> > WaitForReplicationWorkerAttach(). Is that understanding correct? If
> > so, can we use some other way to wake up
> > WaitForReplicationWorkerAttach(), say a condition variable?
>
> The other possibility is to have a GUC launcher_retry_time or something
> like that instead of using DEFAULT_NAPTIME_PER_CYCLE. This still may
> not resolve the issue if launcher_retry_time is longer, but at least
> users would be able to configure it. I am not sure if this is a good
> idea or not, but I am just trying to brainstorm different ideas to
> solve this problem.
>
> BTW, as far as I understand, this is an improvement in the existing
> code, so it should be done only for HEAD (probably PG18) and should be
> discussed in a separate thread.

I have started a new thread at [1] to discuss this.

[1] https://www.postgresql.org/message-id/CALDaNm01_KEgHM1tKtgXkCGLJ5209SMSmGw3UmhZbOz365_%3DeA%40mail.gmail.com

Regards,
Vignesh
On Thu, 25 Apr 2024 at 16:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
> > On Wednesday, April 24, 2024 6:29 PM vignesh C <vignesh21@gmail.com> wrote:
> > >
> > > The attached patch
> > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > > applies for master and PG16 branch,
> > > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > > applies for PG15 branch.
> >
> > Thanks, I have verified that the patches can be applied cleanly and
> > fix the issue on each branch. The regression test can also pass after
> > applying the patch on my machine.
>
> Pushed.

Thanks for pushing this, I have marked the commitfest entry at [1] as
committed.

[1] https://commitfest.postgresql.org/48/4816/

Regards,
Vignesh