Обсуждение: question about warm standby databases in 8.2.5

Поиск
Список
Период
Сортировка

question about warm standby databases in 8.2.5

От
"Brett Neumeier"
Дата:
Hi,

I set up a warm standby failover system on Redhat, using built-from-source postgresql 8.2.5 on (of course) both the master and standby systems.

The setup of the system was very easy, and the recovery script we have in place on the standby system correctly copies in the archived WAL log files, which are then applied.

What seems odd is what happens when we abort the continuous recovery so the standby database becomes primary.

It seems that the recovery command always copies the source WAL file (with a name like 00000001000000020000009C) to a file path "pg_xlog/RECOVERYXLOG", which is fine. However, then when we abort recovery, postgresql seems to expect that the most recent WAL log should be in pg_xlog with its original filename, e.g., the 0....9C filename from above.

This seems broken -- if the WAL file should wind up in the pg_xlog directory with the 0...9C name, why isn't postgresql copying it there?

Here are the log messages that show what I'm talking about. Note that everything is fine for quite a while; then we triggered the standby database to come online before 0...B4 was archived...and postgresql then bails out because 0...B3 (which has already been restored) doesn't exist!

We're working around this, for now, by having the recovery command script copy archived WAL files to the specified location pg_xlog/RECOVERYXLOG, and also to the pg_xlog directory with the file's original basename. But that seems awfully sloppy, and isn't the process documented in the manual.

Advice is eagerly solicited!

LOG:  starting archive recovery
LOG:  restore_command = "/home/pgsql/bin/recover_script.rb %f %p"
LOG:  restored log file "0000000100000002000000A1.001FAD68.backup " from archive
LOG:  restored log file "0000000100000002000000A1" from archive
LOG:  checkpoint record is at 2/A11FAD68
LOG:  redo record is at 2/A11FAD68; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/82464990; next OID: 45282
LOG:  next MultiXactId: 28; next MultiXactOffset: 55
LOG:  automatic recovery in progress
LOG:  redo starts at 2/A11FADB0
LOG:  restored log file "0000000100000002000000A2" from archive
[a bunch of similar messages omitted]
LOG:  restored log file "0000000100000002000000B3" from archive
LOG:  could not open file "pg_xlog/0000000100000002000000B4" (log file 2, segment 180): No such file or directory
LOG:  redo done at 2/B354BDD0
PANIC:  could not open file "pg_xlog/0000000100000002000000B3" (log file 2, segment 179): No such file or directory
LOG:  startup process (PID 17604) was terminated by signal 6
LOG:  aborting startup due to startup process failure
LOG:  database system was interrupted while in recovery at log time 2007-12-10 16:57:42 EST
HINT:  If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.

Cheers,

bn

--
Brett Neumeier (bneumeier@gmail.com)

Re: question about warm standby databases in 8.2.5

От
Greg Smith
Дата:
On Mon, 10 Dec 2007, Brett Neumeier wrote:

> It seems that the recovery command always copies the source WAL file (with a
> name like 00000001000000020000009C) to a file path "pg_xlog/RECOVERYXLOG",
> which is fine. However, then when we abort recovery, postgresql seems to
> expect that the most recent WAL log should be in pg_xlog with its original
> filename, e.g., the 0....9C filename from above.

I think your expectation for what the recovery command script you provide
does and what the server actually requires are a little mismatched.
RECOVERYXLOG is strictly a temporary file and as you've discovered the
server may want the original back again by its original name.  This has
come up before--check out this thread, from this message to the end:

http://archives.postgresql.org/pgsql-admin/2007-08/msg00425.php

It's pointed out there that the documentation is little sparse in this
area.

Not that many people run into this because there is a reference
implementation of a recovery command that correctly implements the
required behavior:

http://developer.postgresql.org/pgdocs/postgres/pgstandby.html

pg_standby ships with the upcoming 8.3, but if you grab that code you can
run it just fine against an 8.2 system.  You'd be better off in the long
run (and probably the short run too) replacing whatever script you've
developed with that one, so you can just get updates to it rather than
keeping one updated in-house.  Consider the time you've spend working on
your own not wasted but educational--you can never know too much about
disaster recovery of your database.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: question about warm standby databases in 8.2.5

От
Simon Riggs
Дата:
On Mon, 2007-12-10 at 22:43 -0600, Brett Neumeier wrote:

> It seems that the recovery command always copies the source WAL file
> (with a name like 00000001000000020000009C) to a file path
> "pg_xlog/RECOVERYXLOG", which is fine. However, then when we abort
> recovery, postgresql seems to expect that the most recent WAL log
> should be in pg_xlog with its original filename, e.g., the 0....9C
> filename from above.

This allows the recovery to be restartable, which you'll want even if if
you haven't realised it yet.

> This seems broken -- if the WAL file should wind up in the pg_xlog
> directory with the 0...9C name, why isn't postgresql copying it there?

This part of the design specifically allows infinitely long recoveries.

You can only delete WAL files prior to the last restartpoint.

--
  Simon Riggs
  2ndQuadrant  http://www.2ndQuadrant.com