Re: [BUG] non archived WAL removed during production crash recovery
From:         Kyotaro Horiguchi
Subject:      Re: [BUG] non archived WAL removed during production crash recovery
Date:
Msg-id:       20200427.182107.1145997462405167356.horikyota.ntt@gmail.com
In reply to:  Re: [BUG] non archived WAL removed during production crash recovery  (Michael Paquier <michael@paquier.xyz>)
Replies:      Re: [BUG] non archived WAL removed during production crash recovery
              Re: [BUG] non archived WAL removed during production crash recovery
List:         pgsql-bugs
At Mon, 27 Apr 2020 16:49:45 +0900, Michael Paquier <michael@paquier.xyz> wrote in
> On Fri, Apr 24, 2020 at 03:03:00PM +0200, Jehan-Guillaume de Rorthais wrote:
> > I agree the three tests could be removed, as they were not covering the
> > bug we were chasing. However, they might still be useful to detect future
> > unexpected behavior changes. If you agree with this, please find attached
> > a patch proposal against HEAD that recreates these three tests **after** a
> > waiting loop on both standby1 and standby2. This waiting loop is inspired
> > by the tests in 9.5 -> 10.
>
> FWIW, I would prefer keeping all three tests as well.
>
> So.. I have spent more time on this problem, and mereswine here is a
> very good sample because it failed all three tests:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mereswine&dt=2020-04-24%2006%3A03%3A53
>
> For standby2, we get this failure:
> ok 11 - .ready file for WAL segment 000000010000000000000001 existing
> in backup is kept with archive_mode=always on standby
> not ok 12 - .ready file for WAL segment 000000010000000000000002
> created with archive_mode=always on standby
>
> Then, looking at 020_archive_status_standby2.log, we have the
> following logs:
> 2020-04-24 02:08:32.032 PDT [9841:3] 020_archive_status.pl LOG:
> statement: CHECKPOINT
> [...]
> 2020-04-24 02:08:32.303 PDT [9821:7] LOG:  restored log file
> "000000010000000000000002" from archive
>
> In this case, the test forced a checkpoint to test the segment
> recycling *before* the extra restored segment we'd like to work on was
> actually restored. So it looks like my initial feeling about the
> timing issue was right, and I am also able to reproduce the original
> set of failures by adding a manual sleep to delay restores of
> segments, like this for example:
>
> --- a/src/backend/access/transam/xlogarchive.c
> +++ b/src/backend/access/transam/xlogarchive.c
> @@ -74,6 +74,8 @@ RestoreArchivedFile(char *path, const char *xlogfname,
>  	if (recoveryRestoreCommand == NULL ||
>  		strcmp(recoveryRestoreCommand, "") == 0)
>  		goto not_available;
>
> +	pg_usleep(10 * 1000000);	/* 10s */
> +
>  	/*
>
> With your patch the problem does not show up anymore even with the
> delay added, so I would like to apply what you have sent and add back
> those tests. For now, I would just patch HEAD though, as that's not
> worth the risk of destabilizing stable branches in the buildfarm.

I agree with the diagnosis and the fix. The fix reliably causes a restart
point, and the restart point then manipulates the status files the right
way before the CHECKPOINT command returns, in both cases.

If I were to add anything to the fix, the following lines may need a
comment:

+# Wait until the checkpoint record is replayed so that the following
+# CHECKPOINT reliably causes a restart point.
+$standby1->poll_query_until('postgres',
+	qq{ SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0 }

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
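[Editor's sketch] To make the wait-then-checkpoint pattern the thread
converges on concrete, below is a minimal, self-contained sketch written
against the PostgresNode TAP API of that era, meant to run under
PostgreSQL's TAP test harness. It is not the committed patch: the node
setup, the backup name 'bkp', and the init options are illustrative
assumptions; only the poll_query_until() condition is taken from the
snippet quoted above.

use strict;
use warnings;
use PostgresNode;

# Assumed setup: a primary with WAL archiving, and a standby restoring
# WAL from the primary's archive.
my $primary = get_new_node('primary');
$primary->init(has_archiving => 1, allows_streaming => 1);
$primary->start;
$primary->backup('bkp');

my $standby1 = get_new_node('standby1');
$standby1->init_from_backup($primary, 'bkp', has_restoring => 1);
$standby1->start;

# Force a checkpoint on the primary and note the WAL position after it.
# The standby must replay past this point before a CHECKPOINT on it can
# reliably create a restart point.
$primary->safe_psql('postgres', 'CHECKPOINT');
my $primary_lsn =
  $primary->safe_psql('postgres', 'SELECT pg_current_wal_lsn()');

# Wait until the checkpoint record is replayed so that the following
# CHECKPOINT causes a restart point reliably. poll_query_until() retries
# the query until it returns true or the poll times out.
$standby1->poll_query_until('postgres',
	qq{SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '$primary_lsn') >= 0})
  or die "timed out waiting for standby1 to replay up to $primary_lsn";

# Only now is it meaningful to test the segment recycling and .ready
# file handling driven by the restart point.
$standby1->safe_psql('postgres', 'CHECKPOINT');

Without the poll, the standby's CHECKPOINT can run before the extra
restored segment arrives, which is exactly the race mereswine hit.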