Обсуждение: WAL ends before end time of backup dump
Version: 8.1.4 I am having a problem restoring one of my base backups. I took a successful backup of the production DB already since this one, and this is just a routine test, so it's fortunately not an emergency. I think that I either have a corrupted base backup or corrupted WAL segments, or maybe I hit some strange bug. When I try to restore, I point recovery.conf to the full set of archived WAL segments, and get the following result: [snip] LOG: restored log file "00000001000000170000002B.004A3CAC.backup" from archive LOG: restored log file "00000001000000170000002B" from archive LOG: checkpoint record is at 17/2B4CDC58 LOG: redo record is at 17/2B4A3CAC; undo record is at 0/0; shutdown FALSE LOG: next transaction ID: 41438715; next OID: 42280 LOG: next MultiXactId: 1; next MultiXactOffset: 0 LOG: automatic recovery in progress LOG: redo starts at 17/2B4A3CAC LOG: record with zero length at 17/2B6EACC8 LOG: redo done at 17/2B6EAC84 LOG: restored log file "00000001000000170000002B" from archive PANIC: WAL ends before end time of backup dump LOG: startup process (PID 88979) was terminated by signal 6 LOG: aborting startup due to startup process failure If I restore from the earlier base backup, which grinds slowly through a week's worth of WAL segments, it stops at segment 00000001000000170000002B, like so: [ snip ] LOG: restored log file "00000001000000170000002B" from archive LOG: record with zero length at 17/2B6EACC8 LOG: redo done at 17/2B6EAC84 LOG: restored log file "00000001000000170000002B" from archive LOG: archive recovery complete LOG: database system is ready LOG: transaction ID wrap limit is 1094453440, limited by database "postgres" If I restore from a later backup, everything works fine. The thing that stands out to me about the base backup that doesn't work is that it took several WAL segments to complete. Here's the .backup file for the base backup that fails: $ cat wal/00000001000000170000002B.004A3CAC.backup START WAL LOCATION: 17/2B4A3CAC (file 00000001000000170000002B) STOP WAL LOCATION: 17/397B7D64 (file 000000010000001700000039) CHECKPOINT LOCATION: 17/2B4CDC58 START TIME: 2006-11-05 01:00:01 PST LABEL: 20061105010001.27375.tar.gz STOP TIME: 2006-11-05 01:14:03 PST I noticed the 8.2beta3 included a fix for WAL replay, is that related? Can someone link to the thread about that bug? I can't test newer versions of postgres because all my other backups seem to work. Basically, I'd just like to know what happened to prevent it in the future. I am archiving to an NFS mount, I don't know whether that carries a risk of corruption or not. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > LOG: restored log file "00000001000000170000002B" from archive > LOG: record with zero length at 17/2B6EACC8 > LOG: redo done at 17/2B6EAC84 It looks to me like you archived this log file before it was fully written. You should take a close look at your archiving procedures. regards, tom lane
On Tue, 2006-11-07 at 11:15 -0800, Jeff Davis wrote: > Version: 8.1.4 > [ snip ] > OG: restored log file "00000001000000170000002B" from archive > LOG: record with zero length at 17/2B6EACC8 > LOG: redo done at 17/2B6EAC84 > LOG: restored log file "00000001000000170000002B" from archive > LOG: archive recovery complete > LOG: database system is ready > LOG: transaction ID wrap limit is 1094453440, limited by database > "postgres" > From the 8.1.4 source I see that: /* * Currently, xl_len == 0 must be bad data, but that might not be true Does that mean that I must have a corrupt WAL segment file 00000001000000170000002B? Is it possible this was caused by a bug? Would an already known bug possibly cause this problem? I am sending the transaction logs to an NFS mount. I consider it to be reliable storage, so I can't see how that would have been corrupted unless it was sent that way by PostgreSQL. Regards, Jeff Davis
On Tue, 2006-11-07 at 17:20 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > LOG: restored log file "00000001000000170000002B" from archive > > LOG: record with zero length at 17/2B6EACC8 > > LOG: redo done at 17/2B6EAC84 > > It looks to me like you archived this log file before it was fully > written. You should take a close look at your archiving procedures. > Hmm... the only way the file gets there is archive command. I didn't do any manual moves of the segments, certainly not last Sunday at 1am (which is when this segment was put there). From postgresql.conf: archive_command = '/usr/local/pgsql/bin/archive_command.bash "%p" "%f"' and: ----------- $ cat /usr/local/pgsql/bin/archive_command.bash #!/usr/local/bin/bash # $1 is the full path to the file to archive # $2 is the filename WAL=$1 WAL_NAME=$2 WAL_ARCHIVE=/path/to/archive echo test ! -f $WAL_ARCHIVE/$WAL_NAME && gzip -c $WAL > $WAL_ARCHIVE/$WAL_NAME.gz ------------ Any obvious holes in there? Is gzip doing something weird? It worked with all my other WAL segments, and I've tested the backups multiple times. Also, the WAL segment is exactly the right size, although I suppose there's no way for it not to be. Regards, Jeff Davis
Which Sunday? Could you have been bitten by some DST time shift? There were 2 1ams a couple of weeks ago. Woody IGLASS Networks -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Jeff Davis Sent: Tuesday, November 07, 2006 5:38 PM To: Tom Lane Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] WAL ends before end time of backup dump On Tue, 2006-11-07 at 17:20 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > LOG: restored log file "00000001000000170000002B" from archive > > LOG: record with zero length at 17/2B6EACC8 > > LOG: redo done at 17/2B6EAC84 > > It looks to me like you archived this log file before it was fully > written. You should take a close look at your archiving procedures. > Hmm... the only way the file gets there is archive command. I didn't do any manual moves of the segments, certainly not last Sunday at 1am (which is when this segment was put there). From postgresql.conf: archive_command = '/usr/local/pgsql/bin/archive_command.bash "%p" "%f"' and: ----------- $ cat /usr/local/pgsql/bin/archive_command.bash #!/usr/local/bin/bash # $1 is the full path to the file to archive # $2 is the filename WAL=$1 WAL_NAME=$2 WAL_ARCHIVE=/path/to/archive echo test ! -f $WAL_ARCHIVE/$WAL_NAME && gzip -c $WAL > $WAL_ARCHIVE/$WAL_NAME.gz ------------ Any obvious holes in there? Is gzip doing something weird? It worked with all my other WAL segments, and I've tested the backups multiple times. Also, the WAL segment is exactly the right size, although I suppose there's no way for it not to be. Regards, Jeff Davis ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster
On Wed, 2006-11-08 at 09:39 -0500, Woody Woodring wrote: > Which Sunday? Could you have been bitten by some DST time shift? There > were 2 1ams a couple of weeks ago. > Hah, interesting idea, but it wasn't that the basebackup didn't exist, and it wasn't that I had two of them close together. I had one base backup, and apparently one of the logs archived at that time is corrupt (or perhaps archived too soon). The log file has bytes throughout, but I suppose that doesn't tell me anything because of the recycling. It would be handy if there was a way to check whether archived log files are good or not without trying to go through recovery. I still really don't know what caused this. I think it may have something to do with a lot of activity happening during my base backup. I am sending syslog data to the database, and I imagine a lot of syslog data is generated by other cron scripts at that time. Regards, Jeff Davis
On Tue, 2006-11-07 at 17:20 -0500, Tom Lane wrote: > Jeff Davis <pgsql@j-davis.com> writes: > > LOG: restored log file "00000001000000170000002B" from archive > > LOG: record with zero length at 17/2B6EACC8 > > LOG: redo done at 17/2B6EAC84 > > It looks to me like you archived this log file before it was fully > written. You should take a close look at your archiving procedures. Ok, I found the problem. Completely my mistake, because I was running through the test without being careful. On the first recovery attempt, I left the postmaster.conf, including the archive_command setting, in the data dir along with some archives in pg_xlog. When I started recovery, it overwrote some of my archived segments. The archive script isn't supposed to overwrite existing files, but I had a typo in the check and didn't realize it because in the normal case everything was archiving correctly. Would it make sense to throw an error if there's a recovery.conf and files already exist in pg_xlog? Thanks for narrowing down the problem for me. Regards, Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > Would it make sense to throw an error if there's a recovery.conf and > files already exist in pg_xlog? No; that's actually a feature in some scenarios (eg, your last few segments didn't get archived yet). There's no substitute for debugging your archiver script ;-) regards, tom lane