Обсуждение: Failing to recover after panic shutdown
Hi
I am running Postgres 8.2 on Windows 2003 server SP2.
Every now and then (2-3 times a year) our Postgres service is down and we need to manually start it. This is what we find:
In log when going down:
2008-06-02 13:40:02 PANIC: could not open file "pg_xlog/000000010000001C00000081" (log file 28, segment 129): Invalid argument
This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
2008-06-02 13:40:02 LOG: server process (PID 5792) exited with exit code 3
2008-06-02 13:40:02 LOG: terminating any other active server processes
2008-06-02 13:40:02 WARNING: terminating connection because of crash of another server process
2008-06-02 13:40:02 DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2008-06-02 13:40:02 HINT: In a moment you should be able to reconnect to the database and repeat your command.
(WARNING, DETAIL and HINT is repeated a few times…)
2008-06-02 13:40:02 LOG: all server processes terminated; reinitializing
2008-06-02 13:40:02 LOG: database system was interrupted at 2008-06-02 13:39:39 W. Europe Daylight Time
2008-06-02 13:40:02 LOG: checkpoint record is at 1C/80F646B0
2008-06-02 13:40:02 LOG: redo record is at 1C/80F646B0; undo record is at 0/0; shutdown FALSE
2008-06-02 13:40:02 LOG: next transaction ID: 0/316291661; next OID: 44447343
2008-06-02 13:40:02 LOG: next MultiXactId: 1929; next MultiXactOffset: 4093
2008-06-02 13:40:02 LOG: database system was not properly shut down; automatic recovery in progress
2008-06-02 13:40:02 LOG: redo starts at 1C/80F64700
2008-06-02 13:40:02 LOG: unexpected pageaddr 1C/79000000 in log file 28, segment 129, offset 0
2008-06-02 13:40:02 LOG: redo done at 1C/80FFD6E0
2008-06-02 13:40:02 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 FATAL: could not open file "pg_xlog/000000010000001C00000081" (log file 28, segment 129): Invalid argument
2008-06-02 13:40:03 FATAL: the database system is starting up
2008-06-02 13:40:03 LOG: startup process (PID 4420) exited with exit code 1
2008-06-02 13:40:03 LOG: aborting startup due to startup process failure
2008-06-02 13:40:05 LOG: archived transaction log file "000000010000001C00000080"
2008-06-02 13:40:05 LOG: logger shutting down
In log when manually started:
2008-06-02 13:50:34 LOG: database system was shut down at 2008-06-02 13:40:03 W. Europe Daylight Time
2008-06-02 13:50:34 LOG: checkpoint record is at 1C/80FFE990
2008-06-02 13:50:34 LOG: redo record is at 1C/80FFE990; undo record is at 0/0; shutdown TRUE
2008-06-02 13:50:34 LOG: next transaction ID: 0/316292191; next OID: 44447343
2008-06-02 13:50:34 LOG: next MultiXactId: 1929; next MultiXactOffset: 4093
2008-06-02 13:50:34 LOG: database system is ready
2008-06-02 13:50:35 LOG: archived transaction log file "000000010000001C00000080"
2008-06-02 13:50:54 LOG: archived transaction log file "000000010000001C00000080"
We are archiving WAL-segments at a remote machine, and we are copying non-filled WAL-segments every 10 minutes to be able to rebuild the DB with a maximum of 10 minutes of missing data. (I don’t know if that has anything to do with it).
Does anyone have a clue on what the problem could be?
Vennlig hilsen
Per Lauvås
Systemutvikler
Fax: +47 22 41 60 61
Direct: +47 24 15 55 51
Mintra as
Storgata 1
P.O. Box 8945 Youngstorget
N-0028 Oslo
Tel: +47 24 15 55 00
Fax: +47 22 41 60 61
http://www.mintra.no
Faglig forum: Helhetlig læring og kompetanse
Motta Mintras faglige e-postmagasin Helhetlig LÆRING & KOMPETANSE - gratis hver måned. Svar på denne e-posten med "HLK" i emnefeltet eller se www.mintra.no for å melde din interesse.
Hyllevare e-læringskurs på Trainingportal.no
Få tilgang til Mintras hyllevarer av e-læringskurs innen HMS, applikasjonsopplæring, prosjektledelse, teknisk opplæring med mer på www.trainingportal.no
Per Lauvås wrote: > Hi > > I am running Postgres 8.2 on Windows 2003 server SP2. > > Every now and then (2-3 times a year) our Postgres service is down > and we need to manually start it. This is what we find: > > In log when going down: > 2008-06-02 13:40:02 PANIC: could not open file > "pg_xlog/000000010000001C00000081" (log file 28, segment 129): > Invalid argument Are you by any chance running an antivirus or other "security software" on this server? > We are archiving WAL-segments at a remote machine, and we are copying > non-filled WAL-segments every 10 minutes to be able to rebuild the DB > with a maximum of 10 minutes of missing data. (I don't know if that > has anything to do with it). How are you copying these files? Are you saying you're actually copying the files out of the pg_xlog directory, or are you using the archive_command along with archive_timeout? //Magnus
Yes, we are copying from pg_xlog. By doing so we let the WAL-segments fill up (not using timeout) and we are able to recoverwithin a 10 minute interval. Could it be that this copy operation is causing the problem? Per -----Original Message----- From: Magnus Hagander [mailto:magnus@hagander.net] Sent: 3. juni 2008 15:47 To: Per Lauvås Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Failing to recover after panic shutdown Per Lauvås wrote: > Hi > > I am running Postgres 8.2 on Windows 2003 server SP2. > > Every now and then (2-3 times a year) our Postgres service is down > and we need to manually start it. This is what we find: > > In log when going down: > 2008-06-02 13:40:02 PANIC: could not open file > "pg_xlog/000000010000001C00000081" (log file 28, segment 129): > Invalid argument Are you by any chance running an antivirus or other "security software" on this server? > We are archiving WAL-segments at a remote machine, and we are copying > non-filled WAL-segments every 10 minutes to be able to rebuild the DB > with a maximum of 10 minutes of missing data. (I don't know if that > has anything to do with it). How are you copying these files? Are you saying you're actually copying the files out of the pg_xlog directory, or are you using the archive_command along with archive_timeout? //Magnus
Hi! Yes, almost certianly. Windows has major issues with more than one process opening the same file, so it's very likely that this is your issue. The only way you can safely get the file off the system without affecting the running PostgreSQL instance is to use a Volume Shadow Copy snapshot. That said, I believe what you are trying to do is not safe even if you do that. You can't just copy WAL segments out of there - if that was actually safe, you wouldn't really need archive_command at all. To be safe to just "grab files out of the $PGDATA directory" you can again use a VSS snapshot, but that will require you to copy all of PGDATA - both the data and the xlog directories. Bottom line: you really should be using archive_command and archive_timeout for this :-) //Magnus Per Lauvås wrote: > Yes, we are copying from pg_xlog. By doing so we let the WAL-segments > fill up (not using timeout) and we are able to recover within a 10 > minute interval. > > Could it be that this copy operation is causing the problem? > > Per > > -----Original Message----- > From: Magnus Hagander [mailto:magnus@hagander.net] > Sent: 3. juni 2008 15:47 > To: Per Lauvås > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Failing to recover after panic shutdown > > Per Lauvås wrote: > > Hi > > > > I am running Postgres 8.2 on Windows 2003 server SP2. > > > > Every now and then (2-3 times a year) our Postgres service is down > > and we need to manually start it. This is what we find: > > > > In log when going down: > > 2008-06-02 13:40:02 PANIC: could not open file > > "pg_xlog/000000010000001C00000081" (log file 28, segment 129): > > Invalid argument > > Are you by any chance running an antivirus or other "security > software" on this server? > > > We are archiving WAL-segments at a remote machine, and we are > > copying non-filled WAL-segments every 10 minutes to be able to > > rebuild the DB with a maximum of 10 minutes of missing data. (I > > don't know if that has anything to do with it). > > How are you copying these files? Are you saying you're actually > copying the files out of the pg_xlog directory, or are you using the > archive_command along with archive_timeout? > > //Magnus >
Hi, and thanks for the replies! OK. I think we will reconsider this. The backup procedure was set up a few years ago. I have personally made several pointin time recoveries using this technique (for testing purposes), and it works. But I guess an undesirable side-effectis a recovery failure every now and then. And, Magnus: The DB is producing about 25 WALs each day (I guess it will increase to at least 144 with a 10 min timeout).Do you know how often a base backup is taken out there by the average administrator? I am getting fed up with doinga new base backup each week. Could the base backup operation be automated? And good luck with the Euro championship (if you are from Sweden). Per -----Original Message----- From: Magnus Hagander [mailto:magnus@hagander.net] Sent: 4. juni 2008 09:04 To: Per Lauvås Cc: pgsql-general@postgresql.org Subject: Re: [GENERAL] Failing to recover after panic shutdown Hi! Yes, almost certianly. Windows has major issues with more than one process opening the same file, so it's very likely that this is your issue. The only way you can safely get the file off the system without affecting the running PostgreSQL instance is to use a Volume Shadow Copy snapshot. That said, I believe what you are trying to do is not safe even if you do that. You can't just copy WAL segments out of there - if that was actually safe, you wouldn't really need archive_command at all. To be safe to just "grab files out of the $PGDATA directory" you can again use a VSS snapshot, but that will require you to copy all of PGDATA - both the data and the xlog directories. Bottom line: you really should be using archive_command and archive_timeout for this :-) //Magnus Per Lauvås wrote: > Yes, we are copying from pg_xlog. By doing so we let the WAL-segments > fill up (not using timeout) and we are able to recover within a 10 > minute interval. > > Could it be that this copy operation is causing the problem? > > Per > > -----Original Message----- > From: Magnus Hagander [mailto:magnus@hagander.net] > Sent: 3. juni 2008 15:47 > To: Per Lauvås > Cc: pgsql-general@postgresql.org > Subject: Re: [GENERAL] Failing to recover after panic shutdown > > Per Lauvås wrote: > > Hi > > > > I am running Postgres 8.2 on Windows 2003 server SP2. > > > > Every now and then (2-3 times a year) our Postgres service is down > > and we need to manually start it. This is what we find: > > > > In log when going down: > > 2008-06-02 13:40:02 PANIC: could not open file > > "pg_xlog/000000010000001C00000081" (log file 28, segment 129): > > Invalid argument > > Are you by any chance running an antivirus or other "security > software" on this server? > > > We are archiving WAL-segments at a remote machine, and we are > > copying non-filled WAL-segments every 10 minutes to be able to > > rebuild the DB with a maximum of 10 minutes of missing data. (I > > don't know if that has anything to do with it). > > How are you copying these files? Are you saying you're actually > copying the files out of the pg_xlog directory, or are you using the > archive_command along with archive_timeout? > > //Magnus >
Per Lauvås wrote: > Hi, and thanks for the replies! > > OK. I think we will reconsider this. The backup procedure was set up > a few years ago. I have personally made several point in time > recoveries using this technique (for testing purposes), and it works. > But I guess an undesirable side-effect is a recovery failure every > now and then. I'm afraid it may be one of those things that appears to work, because it works much of the time, but once you have an actual failure, it might not work :-( I wouldn't trust it, really. > And, Magnus: The DB is producing about 25 WALs each day (I guess it > will increase to at least 144 with a 10 min timeout). Do you know how > often a base backup is taken out there by the average administrator? > I am getting fed up with doing a new base backup each week. Could the > base backup operation be automated? Uh, yeah, sure. You can script it (though scripting on Windows using .bat files certainly isn't fun, you can use any kind of scripting language to do it). A pretty nice way is actually to use VSS - you just do a pg_start_backup() and pg_stop_backup(). If your regular backup software uses VSS and supports pre/post scripts, you can use those to implement it... > And good luck with the Euro championship (if you are from Sweden). I am - thanks :-) //Magnus