Re: 001_rep_changes.pl fails due to publisher stuck on shutdown

Поиск
Список
Период
Сортировка
От Peter Smith
Тема Re: 001_rep_changes.pl fails due to publisher stuck on shutdown
Дата
Msg-id CAHut+PtrtW27eME9bRekULy_RHfyT8JG1aLdHyT_TY-rBJ9Dgg@mail.gmail.com
обсуждение исходный текст
Ответ на 001_rep_changes.pl fails due to publisher stuck on shutdown  (Alexander Lakhin <exclusion@gmail.com>)
Список pgsql-hackers
On Wed, May 29, 2024 at 9:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Hello hackers,
>
> As a recent buildfarm test failure [1] shows:
> [14:33:02.374](0.333s) ok 23 - update works with dropped subscriber column
> ### Stopping node "publisher" using mode fast
> # Running: pg_ctl -D
>
/home/bf/bf-build/adder/HEAD/pgsql.build/testrun/subscription/001_rep_changes/data/t_001_rep_changes_publisher_data/pgdata
> -m fast stop
> waiting for server to shut down.. ... ... ... .. failed
> pg_ctl: server does not shut down
> # pg_ctl stop failed: 256
> # Postmaster PID for node "publisher" is 2222549
> [14:39:04.375](362.001s) Bail out!  pg_ctl stop failed
>
> 001_rep_changes_publisher.log
> 2024-05-16 14:33:02.907 UTC [2238704][client backend][4/22:0] LOG: statement: DELETE FROM tab_rep
> 2024-05-16 14:33:02.925 UTC [2238704][client backend][:0] LOG: disconnection: session time: 0:00:00.078 user=bf
> database=postgres host=[local]
> 2024-05-16 14:33:02.939 UTC [2222549][postmaster][:0] LOG:  received fast shutdown request
> 2024-05-16 14:33:03.000 UTC [2222549][postmaster][:0] LOG:  aborting any active transactions
> 2024-05-16 14:33:03.049 UTC [2222549][postmaster][:0] LOG: background worker "logical replication launcher" (PID
> 2223110) exited with exit code 1
> 2024-05-16 14:33:03.062 UTC [2222901][checkpointer][:0] LOG: shutting down
> 2024-05-16 14:39:04.377 UTC [2222549][postmaster][:0] LOG:  received immediate shutdown request
> 2024-05-16 14:39:04.382 UTC [2222549][postmaster][:0] LOG:  database system is shut down
>
> the publisher node may hang on stopping.
>
> I reproduced the failure (with aggressive autovacuum) locally and
> discovered that it happens because:
> 1) checkpointer calls WalSndInitStopping() (which sends
>   PROCSIG_WALSND_INIT_STOPPING to walsender), and then spins inside
>   WalSndWaitStopping() indefinitely, because:
> 2) walsender receives the signal, sets got_STOPPING = true, but can't exit
> WalSndLoop():
> 3) it never sets got_SIGUSR2 (to get into WalSndDone()) in
>   XLogSendLogical():
> 4) it never sets WalSndCaughtUp to true:
> 5) logical_decoding_ctx->reader->EndRecPtr can't reach flushPtr in
>   XLogSendLogical():
> 6) EndRecPtr doesn't advance in XLogNextRecord():
> 7) XLogDecodeNextRecord() fails do decode a record that crosses a page
>   boundary:
> 8) ReadPageInternal() (commented "Wait for the next page to become
>   available") constantly returns XLREAD_FAIL:
> 9) state->routine.page_read()/logical_read_xlog_page() constantly returns
>   -1:
> 10) flushptr = WalSndWaitForWal() stops advancing, because
>   got_STOPPING == true (see 2).
>
> That is, walsender doesn't let itself to catch up, if it gets the stop
> signal when it's lagging behind and decoding a record requires reading
> the next wal page.
>
> Please look at the reproducing test (based on 001_rep_changes.pl) attached.
> If fails for me as below:
> # 17
> Bailout called.  Further testing stopped:  pg_ctl stop failed
> FAILED--Further testing stopped: pg_ctl stop failed
> make: *** [Makefile:21: check] Ошибка 255
>
> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-05-16%2014%3A22%3A38
> [2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=dikkop&dt=2024-04-24%2014%3A38%3A35 (apparently the same)
>

Hi Alexander,

FYI, by injecting a lot of logging, I’ve confirmed your findings that
for the failing scenario, the ‘got_SIGUSR2’ flag never gets set to
true, meaning the WalSndLoop() cannot finish. Furthermore, I agree
with your step 8 finding that when it fails the ReadPageInternal
function call (the one in XLogDecodeNextRecord with the comment "Wait
for the next page to become available") constantly returns -1.

I will continue digging next week...

======
Kind Regards,
Peter Smith.
Fujitsu Australia



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Daniel Gustafsson
Дата:
Сообщение: Re: Fix possible dereference null pointer (PQprint)
Следующее
От: "ISHAN CHHANGANI ."
Дата:
Сообщение: Switch background worker on/off in runtime.