Re: Wall shiping replica failed to recover database with error:invalid contrecord length 1956 at FED/38FFE208

From Aleš Zelený
Subject Re: Wall shiping replica failed to recover database with error:invalid contrecord length 1956 at FED/38FFE208
Date
Msg-id CAODqTUZMNQ223Dtr9zJpcMSvZRLRo8qcj2OLbc0_1yFAdZsGGQ@mail.gmail.com
In reply to Re: Wall shiping replica failed to recover database with error:invalid contrecord length 1956 at FED/38FFE208  (Stephen Frost <sfrost@snowman.net>)
List pgsql-general
Hello,

On Thu, Oct 3, 2019 at 0:09, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Aleš Zelený (zeleny.ales@gmail.com) wrote:
> > But recovery on replica failed to proceed WAL file
> > 0000000100000FED00000039  with log message: " invalid contrecord length
> > 1956 at FED/38FFE208".
>
> Err- you've drawn the wrong conclusion from that message (and you're
> certainly not alone- it's a terrible message and we should really have a
> HINT there or something).  That's an INFO-level message, not an error,
> and basically just means "oh, look, there's an invalid WAL record, guess
> we got to the end of the WAL available from this source."  If you had
> had primary_conninfo configured in your recovery.conf, PG would likely
> have connected to the primary and started replication.  One other point
> is that if you actually did a promotion in this process somewhere, then
> you might want to set recovery_target_timeline=latest, to make sure the
> replica follows along on the timeline switch that happens when a
> promotion happens.

Thanks for this explanation (and thanks to the others for their comments as well). I failed to describe our case properly. We are recovering the replica instance from WAL files only, since the replica is in a separate network (it is used as the source for ZFS cloning for development), so there is no primary_conninfo (the replica can't influence the primary instance in any way, it just consumes the primary instance's WAL files), and we did not perform a replica promotion.

I'd guess that, given the out-of-disk-space issue, the last WAL file might be incomplete, but its size on the primary instance's disk was the expected 16777216 bytes, and it was binary identical to the file restored on the replica from backup. The issue was that the replica emitted this INFO message but was not able to move on to the next WAL file, and it started falling behind the primary instance.
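As a side note on how the reported position relates to the file names, here is a minimal Python sketch (not anything taken from PostgreSQL itself) that maps an LSN such as FED/38FFE208 to a WAL segment file name. It assumes the default 16 MB segment size (the 16777216 bytes mentioned above), timeline 1 (the leading 00000001 in the file name), and the 24-hex-digit naming layout used by current PostgreSQL releases; the function names are made up for illustration.

```python
# Sketch: map an LSN like "FED/38FFE208" to a WAL segment file name.
# Assumptions: 16 MB segments, timeline 1, naming layout
# <timeline:8 hex><high part:8 hex><segment within high part:8 hex>.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # 16777216 bytes


def parse_lsn(lsn: str) -> int:
    """Convert 'XXX/YYYYYYYY' into a single 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_segment_name(lsn: str, timeline: int = 1,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Name of the segment file that contains the given LSN."""
    segno = parse_lsn(lsn) // segment_size
    segments_per_high_part = 0x100000000 // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_high_part,
        segno % segments_per_high_part,
    )


def offset_in_segment(lsn: str, segment_size: int = WAL_SEGMENT_SIZE) -> int:
    """Byte offset of the LSN inside its segment file."""
    return parse_lsn(lsn) % segment_size
```

Under these assumptions, wal_segment_name("FED/38FFE208") gives 0000000100000FED00000038, with the position 7672 bytes before the end of that segment. That would be consistent with a record that starts near the end of segment ...38 and is expected to continue at the beginning of ...39, which is the file the replica stops at.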

If the WAL file was incomplete while the disk was full, it might have been appended to during instance start (though I doubt archive_command would be invoked on an incomplete WAL file); that is why I compared the file on the primary (after it was back up and running) with the one restored on the replica instance.
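For reference, the kind of check described above can be sketched in Python as follows. This is only an illustration of comparing two copies of a segment by size and checksum, not the exact commands that were run; the function names and the example paths in the comment are hypothetical.

```python
# Sketch: verify that two copies of a WAL segment have the expected size
# and identical contents. Paths passed in are up to the caller.
import hashlib
import os

EXPECTED_SIZE = 16 * 1024 * 1024  # 16777216 bytes, default segment size


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so a large file is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_wal_copies(primary_copy: str, replica_copy: str,
                       expected_size: int = EXPECTED_SIZE) -> dict:
    """Report the sizes, whether both match expected_size, and whether
    the two files are byte-for-byte identical (by SHA-256)."""
    sizes = (os.path.getsize(primary_copy), os.path.getsize(replica_copy))
    return {
        "sizes": sizes,
        "size_ok": all(size == expected_size for size in sizes),
        "identical": sizes[0] == sizes[1]
        and sha256_of(primary_copy) == sha256_of(replica_copy),
    }


# Hypothetical usage:
# compare_wal_copies("/primary/pg_wal/0000000100000FED00000039",
#                    "/replica/restore/0000000100000FED00000039")
```

A result with both size_ok and identical set to True only shows that the archived copy matches what is on the primary; it says nothing about whether the contents of the segment are valid WAL.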

In other words, if this log message were emitted only once and recovery continued restoring subsequent WAL files, I'd be OK with that. But since recovery is stuck at this WAL file, I'm not sure whether I did something wrong (e.g. an improper recovery.conf ...) or what a possible workaround would be to let the replica (if possible) process this WAL file and continue with recovery. The database size is almost 2 TB, which is why I'd like to avoid full restores to create DEV environments and use ZFS clones instead.

Thanks for any hints on how to let the replica continue applying WAL files.

Kind regards Ales Zeleny
