Thread: FATAL: the database system is in recovery mode


FATAL: the database system is in recovery mode

From
"Laura Hornbeck"
Date:


Hi,

I am getting "FATAL:  the database system is in recovery mode" when trying to reconnect after a crash.  This has been happening for about an hour now.  Is this something I need to wait out, or can I safely kill postgres and restart?

Thanks, Laura

Re: FATAL: the database system is in recovery mode

From
Tom Lane
Date:
"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
> I am getting "FATAL:  the database system is in recovery mode" when trying
> to reconnect after a crash.  This has been happening for about an hour
> now.  Is this something I need to wait out or can I safely kill postgres and
> restart?

Hmmm ... unless you had extremely high settings for both checkpoint_segments
and checkpoint_timeout, it shouldn't take an hour to recover from a
crash.  Does it appear that the startup subprocess is making progress at
all?  (Use "ps" to find the postmaster's startup process child, then see
if it's doing anything using "strace" or some such.)

            regards, tom lane
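
The two steps suggested above (find the startup child with "ps", then watch it with "strace") could be sketched roughly as follows. This assumes a Linux system and that the process title contains "postgres:" followed by "startup"; the exact title wording differs between PostgreSQL versions, and the helper name find_startup_pid is made up for illustration.

```shell
#!/bin/sh
# Print the PID of any process whose title looks like the PostgreSQL
# startup (crash-recovery) process. Reads "PID COMMAND..." lines on stdin.
# Assumption: the title contains "postgres:" and, later, "startup".
find_startup_pid() {
    # Skip the awk process itself, whose arguments also contain the pattern.
    awk '/postgres:.*startup/ && !/awk/ { print $1 }'
}

# Usage on a live system (strace needs the same user or root):
#   pid=$(ps -eo pid,args | find_startup_pid)
#   strace -p "$pid"
```

If strace shows a steady stream of read/write calls, recovery is probably still progressing; a single call that never returns suggests the process is stuck.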

Re: FATAL: the database system is in recovery mode

From
"Laura Hornbeck"
Date:

-----Original Message-----
From: pgsql-novice-owner@postgresql.org
[mailto:pgsql-novice-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, October 12, 2006 12:43 PM
To: lhornbeck@oppunl.com
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] FATAL: the database system is in recovery mode

"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
> I am getting "FATAL:  the database system is in recovery mode" when
> trying to reconnect after a crash.  This has been happening for about
> an hour now.  Is this something I need to wait out or can I safely
> kill postgres and restart?

Hmmm ... unless you had extremely high settings for both checkpoint_segments
and checkpoint_timeout, it shouldn't take an hour to recover from a crash.
Does it appear that the startup subprocess is making progress at all?  (Use
"ps" to find the postmaster's startup process child, then see if it's doing
anything using "strace" or some such.)

            regards, tom lane

checkpoint_segments is 8.

strace -p26891
Process 26891 attached - interrupt to quit
futex(0xb7db2880, FUTEX_WAIT, 2, NULL





Re: FATAL: the database system is in recovery mode

From
Tom Lane
Date:
"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
> Hmmm ... unless you had extremely high settings for both checkpoint_segments
> and checkpoint_timeout, it shouldn't take an hour to recover from a crash.

> checkpoint_segments is 8.

That's certainly not out of line --- I'd expect a max recovery time on
the order of a minute or so for that much WAL.
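
For a sense of scale, "that much WAL" can be estimated with a little arithmetic. This sketch assumes the default 16 MB segment size and the rule of thumb from the PostgreSQL documentation of that era, that there are normally no more than 2 * checkpoint_segments + 1 segment files on disk. Recovery replays only from the last checkpoint, so the amount actually replayed is usually smaller than this bound.

```shell
#!/bin/sh
# Rough upper bound on the WAL kept on disk, and hence on what crash
# recovery might have to replay.
checkpoint_segments=8   # value reported in this thread
segment_mb=16           # assumed default WAL segment size

max_segments=$(( 2 * checkpoint_segments + 1 ))
max_wal_mb=$(( max_segments * segment_mb ))

echo "at most ${max_segments} segments, about ${max_wal_mb} MB of WAL"
```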

> strace -p26891
> Process 26891 attached - interrupt to quit
> futex(0xb7db2880, FUTEX_WAIT, 2, NULL

Interesting.  We don't use futexes directly, so this smells like a
problem in glibc or some such.  Can you get a stack trace?

    $ gdb /path/to/postgres-executable 26891
    gdb> bt
    gdb> quit

            regards, tom lane

Re: FATAL: the database system is in recovery mode

From
"Laura Hornbeck"
Date:

-----Original Message-----
From: pgsql-novice-owner@postgresql.org
[mailto:pgsql-novice-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, October 12, 2006 1:10 PM
To: lhornbeck@oppunl.com
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] FATAL: the database system is in recovery mode

"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
> Hmmm ... unless you had extremely high settings for both
> checkpoint_segments and checkpoint_timeout, it shouldn't take an hour to
> recover from a crash.

> checkpoint_segments is 8.

That's certainly not out of line --- I'd expect a max recovery time on the
order of a minute or so for that much WAL.

> strace -p26891
> Process 26891 attached - interrupt to quit
> futex(0xb7db2880, FUTEX_WAIT, 2, NULL

Interesting.  We don't use futexes directly, so this smells like a problem
in glibc or some such.  Can you get a stack trace?

    $ gdb /path/to/postgres-executable 26891
    gdb> bt
    gdb> quit

            regards, tom lane



#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7d6031e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2  0xb7cfe2b4 in _L_mutex_lock_2495 () from /lib/tls/libc.so.6
#3  0xb7da2946 in __PRETTY_FUNCTION__.2189 () from /lib/tls/libc.so.6
#4  0x00000000 in ?? ()




Re: FATAL: the database system is in recovery mode

From
Tom Lane
Date:
"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
>> Interesting.  We don't use futexes directly, so this smells like a problem
>> in glibc or some such.  Can you get a stack trace?

> #0  0xffffe410 in __kernel_vsyscall ()
> #1  0xb7d6031e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2  0xb7cfe2b4 in _L_mutex_lock_2495 () from /lib/tls/libc.so.6
> #3  0xb7da2946 in __PRETTY_FUNCTION__.2189 () from /lib/tls/libc.so.6
> #4  0x00000000 in ?? ()

Hm, that's pretty unhelpful :-( ... I suppose you are using stripped
Postgres executables, so we're not going to be able to learn more here.
But it's definitely glibc getting wedged for some reason.
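
One way to see at a glance where such a trace is stuck is to count frames per shared library. The sketch below is only an illustration, not a diagnostic tool; it assumes gdb's usual "#N addr in func () from /path/lib" frame layout, and the helper name frames_by_library is made up.

```shell
#!/bin/sh
# Count backtrace frames per shared library. Reads gdb "bt" output on stdin.
# Frames without a "from <library>" suffix are reported as "(unknown)".
frames_by_library() {
    awk '/^#[0-9]+/ {
        lib = "(unknown)"
        for (i = 1; i < NF; i++) if ($i == "from") lib = $(i + 1)
        count[lib]++
    }
    END { for (lib in count) print count[lib], lib }' | sort -rn
}

# Usage: save the "bt" output to a file, then
#   frames_by_library < backtrace.txt
```

For the trace quoted above, three of the five frames resolve to /lib/tls/libc.so.6 and none to the Postgres executable, which fits the reading that the process is waiting inside a libc mutex.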

At this point I'd agree with kill -9'ing the subprocess, which will make
its parent postmaster quit, and then you can try again.  It seems quite
possible that it won't lock up the next time.  If it does lock up
repeatably, perhaps we could learn more with strace (try launching the
postmaster under strace -f).  The last hundred or so lines of the strace
output before it stops at the futex call should give a hint what it's
doing.

            regards, tom lane
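
If the hang does recur, the "strace -f" suggestion above might look something like this. With -f and -o, strace prefixes each line of the log with the PID of the process that made the call, so the log can be filtered down to the hung process. The helper name last_calls_of_pid, the log path, and the data directory are all assumptions for illustration.

```shell
#!/bin/sh
# Print the last N lines (default 100) that a given PID wrote to an
# "strace -f -o FILE" log, where every line starts with the traced PID.
last_calls_of_pid() {
    log=$1
    pid=$2
    n=${3:-100}
    awk -v pid="$pid" '$1 == pid' "$log" | tail -n "$n"
}

# Hypothetical usage; paths are assumptions, and on newer releases the
# server binary is "postgres" rather than "postmaster":
#   strace -f -o /tmp/postmaster.trace postmaster -D /path/to/data
#   last_calls_of_pid /tmp/postmaster.trace PID_OF_HUNG_PROCESS
```

The final lines before the blocked futex call are the interesting ones, since they show what the process was doing just before it stopped.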

Re: FATAL: the database system is in recovery mode

From
"Laura Hornbeck"
Date:
I have killed it and it restarted fine. Thank you so much for your help.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, October 12, 2006 1:32 PM
To: lhornbeck@oppunl.com
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] FATAL: the database system is in recovery mode

"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
>> Interesting.  We don't use futexes directly, so this smells like a
>> problem in glibc or some such.  Can you get a stack trace?

> #0  0xffffe410 in __kernel_vsyscall ()
> #1  0xb7d6031e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2  0xb7cfe2b4 in _L_mutex_lock_2495 () from /lib/tls/libc.so.6
> #3  0xb7da2946 in __PRETTY_FUNCTION__.2189 () from /lib/tls/libc.so.6
> #4  0x00000000 in ?? ()

Hm, that's pretty unhelpful :-( ... I suppose you are using stripped
Postgres executables, so we're not going to be able to learn more here.
But it's definitely glibc getting wedged for some reason.

At this point I'd agree with kill -9'ing the subprocess, which will make its
parent postmaster quit, and then you can try again.  It seems quite possible
that it won't lock up the next time.  If it does lock up repeatably, perhaps
we could learn more with strace (try launching the postmaster under strace
-f).  The last hundred or so lines of the strace output before it stops at
the futex call should give a hint what it's doing.

            regards, tom lane