Обсуждение: Failing to recover after panic shutdown

Поиск
Список
Период
Сортировка

Failing to recover after panic shutdown

От
Per Lauvås
Дата:

Hi

I am running Postgres 8.2 on Windows 2003 server SP2.

Every now and then (2-3 times a year) our Postgres service is down and we need to manually start it. This is what we find:

In log when going down:

2008-06-02 13:40:02 PANIC:  could not open file "pg_xlog/000000010000001C00000081" (log file 28, segment 129): Invalid argument

This application has requested the Runtime to terminate it in an unusual way.

Please contact the application's support team for more information.

2008-06-02 13:40:02 LOG:  server process (PID 5792) exited with exit code 3

2008-06-02 13:40:02 LOG:  terminating any other active server processes

2008-06-02 13:40:02 WARNING:  terminating connection because of crash of another server process

2008-06-02 13:40:02 DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.

2008-06-02 13:40:02 HINT:  In a moment you should be able to reconnect to the database and repeat your command.

(WARNING, DETAIL and HINT is repeated a few times)

2008-06-02 13:40:02 LOG:  all server processes terminated; reinitializing

2008-06-02 13:40:02 LOG:  database system was interrupted at 2008-06-02 13:39:39 W. Europe Daylight Time

2008-06-02 13:40:02 LOG:  checkpoint record is at 1C/80F646B0

2008-06-02 13:40:02 LOG:  redo record is at 1C/80F646B0; undo record is at 0/0; shutdown FALSE

2008-06-02 13:40:02 LOG:  next transaction ID: 0/316291661; next OID: 44447343

2008-06-02 13:40:02 LOG:  next MultiXactId: 1929; next MultiXactOffset: 4093

2008-06-02 13:40:02 LOG:  database system was not properly shut down; automatic recovery in progress

2008-06-02 13:40:02 LOG:  redo starts at 1C/80F64700

2008-06-02 13:40:02 LOG:  unexpected pageaddr 1C/79000000 in log file 28, segment 129, offset 0

2008-06-02 13:40:02 LOG:  redo done at 1C/80FFD6E0

2008-06-02 13:40:02 FATAL:  the database system is starting up

2008-06-02 13:40:03 FATAL:  the database system is starting up

2008-06-02 13:40:03 FATAL:  the database system is starting up

2008-06-02 13:40:03 FATAL:  the database system is starting up

2008-06-02 13:40:03 FATAL:  could not open file "pg_xlog/000000010000001C00000081" (log file 28, segment 129): Invalid argument

2008-06-02 13:40:03 FATAL:  the database system is starting up

2008-06-02 13:40:03 LOG:  startup process (PID 4420) exited with exit code 1

2008-06-02 13:40:03 LOG:  aborting startup due to startup process failure

2008-06-02 13:40:05 LOG:  archived transaction log file "000000010000001C00000080"

2008-06-02 13:40:05 LOG:  logger shutting down

In log when manually started:

2008-06-02 13:50:34 LOG:  database system was shut down at 2008-06-02 13:40:03 W. Europe Daylight Time

2008-06-02 13:50:34 LOG:  checkpoint record is at 1C/80FFE990

2008-06-02 13:50:34 LOG:  redo record is at 1C/80FFE990; undo record is at 0/0; shutdown TRUE

2008-06-02 13:50:34 LOG:  next transaction ID: 0/316292191; next OID: 44447343

2008-06-02 13:50:34 LOG:  next MultiXactId: 1929; next MultiXactOffset: 4093

2008-06-02 13:50:34 LOG:  database system is ready

2008-06-02 13:50:35 LOG:  archived transaction log file "000000010000001C00000080"

2008-06-02 13:50:54 LOG:  archived transaction log file "000000010000001C00000080"

We are archiving WAL-segments at a remote machine, and we are copying non-filled WAL-segments every 10 minutes to be able to rebuild the DB with a maximum of 10 minutes of missing data. (I dont know if that has anything to do with it).

Does anyone have a clue on what the problem could be?

Vennlig hilsen

Per Lauvås
Systemutvikler

Fax: +47 22 41 60 61
Direct: +47 24 15 55 51
Mintra as
Storgata 1
P.O. Box 8945 Youngstorget
N-0028 Oslo
Tel: +47 24 15 55 00
Fax: +47 22 41 60 61
http://www.mintra.no

Faglig forum: Helhetlig læring og kompetanse

Motta Mintras faglige e-postmagasin Helhetlig LÆRING & KOMPETANSE - gratis hver måned. Svar på denne e-posten med "HLK" i emnefeltet eller se www.mintra.no for å melde din interesse.

Hyllevare e-læringskurs på Trainingportal.no

Få tilgang til Mintras hyllevarer av e-læringskurs innen HMS, applikasjonsopplæring, prosjektledelse, teknisk opplæring med mer på www.trainingportal.no

Re: Failing to recover after panic shutdown

От
Magnus Hagander
Дата:
Per Lauvås wrote:
> Hi
>
> I am running Postgres 8.2 on Windows 2003 server SP2.
>
> Every now and then (2-3 times a year) our Postgres service is down
> and we need to manually start it. This is what we find:
>
> In log when going down:
> 2008-06-02 13:40:02 PANIC:  could not open file
> "pg_xlog/000000010000001C00000081" (log file 28, segment 129):
> Invalid argument

Are you by any chance running an antivirus or other "security software"
on this server?

> We are archiving WAL-segments at a remote machine, and we are copying
> non-filled WAL-segments every 10 minutes to be able to rebuild the DB
> with a maximum of 10 minutes of missing data. (I don't know if that
> has anything to do with it).

How are you copying these files? Are you saying you're actually copying
the files out of the pg_xlog directory, or are you using the
archive_command along with archive_timeout?

//Magnus

Re: Failing to recover after panic shutdown

От
Per Lauvås
Дата:
Yes, we are copying from pg_xlog. By doing so we let the WAL-segments fill up (not using timeout) and we are able to
recoverwithin a 10 minute interval.
 

Could it be that this copy operation is causing the problem?

Per

-----Original Message-----
From: Magnus Hagander [mailto:magnus@hagander.net] 
Sent: 3. juni 2008 15:47
To: Per Lauvås
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Failing to recover after panic shutdown

Per Lauvås wrote:
> Hi
> 
> I am running Postgres 8.2 on Windows 2003 server SP2.
> 
> Every now and then (2-3 times a year) our Postgres service is down
> and we need to manually start it. This is what we find:
> 
> In log when going down:
> 2008-06-02 13:40:02 PANIC:  could not open file
> "pg_xlog/000000010000001C00000081" (log file 28, segment 129):
> Invalid argument

Are you by any chance running an antivirus or other "security software"
on this server?

> We are archiving WAL-segments at a remote machine, and we are copying
> non-filled WAL-segments every 10 minutes to be able to rebuild the DB
> with a maximum of 10 minutes of missing data. (I don't know if that
> has anything to do with it).

How are you copying these files? Are you saying you're actually copying
the files out of the pg_xlog directory, or are you using the
archive_command along with archive_timeout?

//Magnus


Re: Failing to recover after panic shutdown

От
Magnus Hagander
Дата:
Hi!

Yes, almost certianly. Windows has major issues with more than one
process opening the same file, so it's very likely that this is your
issue. The only way you can safely get the file off the system without
affecting the running PostgreSQL instance is to use a Volume Shadow
Copy snapshot.

That said, I believe what you are trying to do is not safe even if you
do that. You can't just copy WAL segments out of there - if that was
actually safe, you wouldn't really need archive_command at all. To be
safe to just "grab files out of the $PGDATA directory" you can again
use a VSS snapshot, but that will require you to copy all of PGDATA -
both the data and the xlog directories.

Bottom line: you really should be using archive_command and
archive_timeout for this :-)

//Magnus


Per Lauvås wrote:
> Yes, we are copying from pg_xlog. By doing so we let the WAL-segments
> fill up (not using timeout) and we are able to recover within a 10
> minute interval.
>
> Could it be that this copy operation is causing the problem?
>
> Per
>
> -----Original Message-----
> From: Magnus Hagander [mailto:magnus@hagander.net]
> Sent: 3. juni 2008 15:47
> To: Per Lauvås
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Failing to recover after panic shutdown
>
> Per Lauvås wrote:
> > Hi
> >
> > I am running Postgres 8.2 on Windows 2003 server SP2.
> >
> > Every now and then (2-3 times a year) our Postgres service is down
> > and we need to manually start it. This is what we find:
> >
> > In log when going down:
> > 2008-06-02 13:40:02 PANIC:  could not open file
> > "pg_xlog/000000010000001C00000081" (log file 28, segment 129):
> > Invalid argument
>
> Are you by any chance running an antivirus or other "security
> software" on this server?
>
> > We are archiving WAL-segments at a remote machine, and we are
> > copying non-filled WAL-segments every 10 minutes to be able to
> > rebuild the DB with a maximum of 10 minutes of missing data. (I
> > don't know if that has anything to do with it).
>
> How are you copying these files? Are you saying you're actually
> copying the files out of the pg_xlog directory, or are you using the
> archive_command along with archive_timeout?
>
> //Magnus
>


Re: Failing to recover after panic shutdown

От
Per Lauvås
Дата:
Hi, and thanks for the replies!

OK. I think we will reconsider this. The backup procedure was set up a few years ago. I have personally made several
pointin time recoveries using this technique (for testing purposes), and it works. But I guess an undesirable
side-effectis a recovery failure every now and then.
 

And, Magnus: The DB is producing about 25 WALs each day (I guess it will increase to at least 144 with a 10 min
timeout).Do you know how often a base backup is taken out there by the average administrator? I am getting fed up with
doinga new base backup each week. Could the base backup operation be automated?
 

And good luck with the Euro championship (if you are from Sweden).

Per

-----Original Message-----
From: Magnus Hagander [mailto:magnus@hagander.net] 
Sent: 4. juni 2008 09:04
To: Per Lauvås
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] Failing to recover after panic shutdown

Hi!

Yes, almost certianly. Windows has major issues with more than one
process opening the same file, so it's very likely that this is your
issue. The only way you can safely get the file off the system without
affecting the running PostgreSQL instance is to use a Volume Shadow
Copy snapshot.

That said, I believe what you are trying to do is not safe even if you
do that. You can't just copy WAL segments out of there - if that was
actually safe, you wouldn't really need archive_command at all. To be
safe to just "grab files out of the $PGDATA directory" you can again
use a VSS snapshot, but that will require you to copy all of PGDATA -
both the data and the xlog directories.

Bottom line: you really should be using archive_command and
archive_timeout for this :-)

//Magnus


Per Lauvås wrote:
> Yes, we are copying from pg_xlog. By doing so we let the WAL-segments
> fill up (not using timeout) and we are able to recover within a 10
> minute interval.
> 
> Could it be that this copy operation is causing the problem?
> 
> Per
> 
> -----Original Message-----
> From: Magnus Hagander [mailto:magnus@hagander.net] 
> Sent: 3. juni 2008 15:47
> To: Per Lauvås
> Cc: pgsql-general@postgresql.org
> Subject: Re: [GENERAL] Failing to recover after panic shutdown
> 
> Per Lauvås wrote:
> > Hi
> > 
> > I am running Postgres 8.2 on Windows 2003 server SP2.
> > 
> > Every now and then (2-3 times a year) our Postgres service is down
> > and we need to manually start it. This is what we find:
> > 
> > In log when going down:
> > 2008-06-02 13:40:02 PANIC:  could not open file
> > "pg_xlog/000000010000001C00000081" (log file 28, segment 129):
> > Invalid argument
> 
> Are you by any chance running an antivirus or other "security
> software" on this server?
> 
> > We are archiving WAL-segments at a remote machine, and we are
> > copying non-filled WAL-segments every 10 minutes to be able to
> > rebuild the DB with a maximum of 10 minutes of missing data. (I
> > don't know if that has anything to do with it).
> 
> How are you copying these files? Are you saying you're actually
> copying the files out of the pg_xlog directory, or are you using the
> archive_command along with archive_timeout?
> 
> //Magnus
> 



Re: Failing to recover after panic shutdown

От
Magnus Hagander
Дата:
Per Lauvås wrote:
> Hi, and thanks for the replies!
>
> OK. I think we will reconsider this. The backup procedure was set up
> a few years ago. I have personally made several point in time
> recoveries using this technique (for testing purposes), and it works.
> But I guess an undesirable side-effect is a recovery failure every
> now and then.

I'm afraid it may be one of those things that appears to work, because
it works much of the time, but once you have an actual failure, it
might not work :-( I wouldn't trust it, really.


> And, Magnus: The DB is producing about 25 WALs each day (I guess it
> will increase to at least 144 with a 10 min timeout). Do you know how
> often a base backup is taken out there by the average administrator?
> I am getting fed up with doing a new base backup each week. Could the
> base backup operation be automated?

Uh, yeah, sure. You can script it (though scripting on Windows
using .bat files certainly isn't fun, you can use any kind of scripting
language to do it). A pretty nice way is actually to use VSS - you just
do a pg_start_backup() and pg_stop_backup(). If your regular backup
software uses VSS and supports pre/post scripts, you can use those to
implement it...

> And good luck with the Euro championship (if you are from Sweden).

I am - thanks :-)

//Magnus