Re: Corruption during WAL replay

From Tom Lane
Subject Re: Corruption during WAL replay
Msg-id 3192026.1648185780@sss.pgh.pa.us
In reply to Re: Corruption during WAL replay  (Andres Freund <andres@anarazel.de>)
Responses Re: Corruption during WAL replay  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
Andres Freund <andres@anarazel.de> writes:
> I do see that the LSN that ends up on the page is the same across a few runs
> of the test on serinus. Which presumably differs between different
> animals. Surprised that it's this predictable - but I guess the run is short
> enough that there's no variation due to autovacuum, checkpoints etc.

Uh-huh.  I'm not surprised that it's repeatable on a given animal.
What remains to be explained:

1. Why'd it start failing now?  I'm guessing that ce95c5437 *was* the
culprit after all, by slightly changing the amount of catalog data
written during initdb, and thus moving the initial LSN.

2. Why just these two animals?  If initial LSN is the critical thing,
then the results of "locale -a" would affect it, so platform
dependence is hardly surprising ... but I'd have thought that all
the animals on that host would use the same initial set of
collations.  OTOH, I see petalura and pogona just fell over too.
Do you have some of those animals --with-icu and others not?

> 16bit checksums for the win.

Yay :-(

As for a fix, would damaging more of the page help?  I guess
it'd just move around the one-in-64K chance of failure.
Maybe we have to intentionally corrupt (e.g. invert) the
checksum field specifically.
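
To illustrate the last point, here is a minimal sketch of why inverting the
stored checksum is deterministic while damaging other bytes is not. It is
an illustration only: toy_checksum is a stand-in (CRC32 folded to 16 bits),
not PostgreSQL's page checksum algorithm, and all helper names are
hypothetical. The only layout fact assumed is that the 16-bit pd_checksum
field sits at byte offset 8 of the page header, right after the 8-byte
pd_lsn, and that the checksum is computed with that field treated as zero.

```python
import struct
import zlib

PAGE_SIZE = 8192
CHECKSUM_OFFSET = 8  # pd_checksum follows the 8-byte pd_lsn in the page header


def toy_checksum(page: bytes) -> int:
    """Stand-in 16-bit checksum (NOT PostgreSQL's algorithm).

    The checksum field itself is zeroed before hashing, so the result
    depends only on the rest of the page.
    """
    buf = bytearray(page)
    buf[CHECKSUM_OFFSET:CHECKSUM_OFFSET + 2] = b"\x00\x00"
    crc = zlib.crc32(bytes(buf))
    return (crc ^ (crc >> 16)) & 0xFFFF


def set_checksum(page: bytearray) -> None:
    """Store the checksum of the current page contents in the header."""
    struct.pack_into("<H", page, CHECKSUM_OFFSET, toy_checksum(page))


def verify(page: bytes) -> bool:
    """Return True if the stored checksum matches the page contents."""
    (stored,) = struct.unpack_from("<H", page, CHECKSUM_OFFSET)
    return stored == toy_checksum(page)


def invert_checksum(page: bytearray) -> None:
    """Flip every bit of the stored checksum, leaving the rest untouched."""
    (stored,) = struct.unpack_from("<H", page, CHECKSUM_OFFSET)
    struct.pack_into("<H", page, CHECKSUM_OFFSET, stored ^ 0xFFFF)
```

Because the computed value ignores the checksum field, inverting the stored
value leaves the computed checksum unchanged while guaranteeing the stored
one differs from it, so verification fails every time. Overwriting other
bytes instead changes the computed value to something effectively random,
which still matches the old stored value about once in 65536 cases.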

            regards, tom lane


