Unarchived WALs deleted after crash

Поиск
Список
Период
Сортировка
От Jehan-Guillaume de Rorthais
Тема Unarchived WALs deleted after crash
Дата
Msg-id 511D06A6.8050106@dalibo.com
обсуждение исходный текст
Ответы Re: Unarchived WALs deleted after crash  (Daniel Farina <daniel@heroku.com>)
Re: Unarchived WALs deleted after crash  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Список pgsql-hackers
Hi,

I am facing an unexpected behavior on a 9.2.2 cluster that I can
reproduce on current HEAD.

On a cluster with archive enabled but failing, after a crash of
postmaster, the checkpoint occurring before leaving the recovery mode
deletes any additional WALs, even those waiting to be archived.

Because of this, after recovering from the crash, previous PITR backup
can not be used to restore the instance to a time where archiving was
failing. Any slaves fed by WAL or lagging in SR need to be recreated.

AFAICT, this is not documented and I would expect the WALs to be
archived by the archiver process when the cluster exits the recovery step.

Here is a simple scenario to reproduce this.

Configuration:
 wal_level = archive archive_mode = on archive_command = '/bin/false' log_checkpoints = on

Scenario: createdb test psql -c 'create table test as select i, md5(i::text) from
generate_series(1,3000000) as i;' test kill -9 $(head -1 $PGDATA/postmaster.pid) pg_ctl start

Using this scenario, log files shows:
 LOG:  archive command failed with exit code 1 DETAIL:  The failed archive command was: /bin/false WARNING:
transactionlog file "000000010000000000000001" could not
 
be archived: too many failures LOG:  database system was interrupted; last known up at 2013-02-14
16:12:58 CET LOG:  database system was not properly shut down; automatic recovery
in progress LOG:  crash recovery starts in timeline 1 and has target timeline 1 LOG:  redo starts at 0/11400078 LOG:
recordwith zero length at 0/13397190 LOG:  redo done at 0/13397160 LOG:  last completed transaction was at log time
2013-02-14
16:12:58.49303+01 LOG:  checkpoint starting: end-of-recovery immediate LOG:  checkpoint complete: wrote 2869 buffers
(17.5%);0 transaction
 
log file(s) added, 9 removed, 7 recycled; write=0.023 s, sync=0.468 s,
total=0.739 s; sync files=2, longest=0.426 s, average=0.234 s LOG:  autovacuum launcher started LOG:  database system
isready to accept connections LOG:  archive command failed with exit code 1 DETAIL:  The failed archive command was:
/bin/falseLOG:  archive command failed with exit code 1 DETAIL:  The failed archive command was: /bin/false LOG:
archivecommand failed with exit code 1 DETAIL:  The failed archive command was: /bin/false WARNING:  transaction log
file"000000010000000000000011" could not
 
be archived: too many failures

Before the kill, "000000010000000000000001" was the WAL to archive.
After the kill, the checkpoint deleted 9 files before exiting recovery
mode and "000000010000000000000011" become the first WAL to archive.
"000000010000000000000001" through "000000010000000000000010" were
removed or recycled.

Is it expected ?
-- 
Jehan-Guillaume de Rorthais
http://www.dalibo.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Alvaro Herrera
Дата:
Сообщение: Re: proposal or just idea for psql - show first N rows from relation backslash statement
Следующее
От: Merlin Moncure
Дата:
Сообщение: Re: proposal or just idea for psql - show first N rows from relation backslash statement