Thread: FATAL: the database system is in recovery mode
Hi,
I am getting "FATAL: the database system is in recovery mode" when trying to reconnect after a crash. This has been happening for about an hour now. Is this something I need to wait out, or can I safely kill postgres and restart?
Thanks, Laura
"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
> I am getting "FATAL: the database system is in recovery mode" when trying
> to reconnect after a crash. This has been happening for about an hour
> now. Is this something I need to wait out or can I safely kill postgres and
> restart?

Hmmm ... unless you had extremely high settings for both checkpoint_segments and checkpoint_timeout, it shouldn't take an hour to recover from a crash. Does it appear that the startup subprocess is making progress at all? (Use "ps" to find the postmaster's startup process child, then see if it's doing anything using "strace" or some such.)

regards, tom lane
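The ps-then-strace check suggested above might be sketched roughly as follows. The process-title pattern "postgres: startup" is an assumption (the exact title varies with PostgreSQL version and platform), and find_startup_pid is a hypothetical helper name, not something shipped with PostgreSQL.

```shell
# Hypothetical helper: print the PID of every process whose title contains
# "postgres: startup". It reads "PID COMMAND" lines on stdin, so it can be
# tried on saved ps output as well as on a live system.
# The [s] bracket keeps the awk command line itself from matching.
find_startup_pid() {
    awk '/postgres: [s]tartup/ { print $1 }'
}

# On a live system (assumed usage; needs permission to trace the process):
#   ps -eo pid,args | find_startup_pid
#   strace -p <pid printed above>
```

If strace shows a steady stream of read/write calls, recovery is probably still replaying WAL; a single call that never returns suggests the process is stuck.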
-----Original Message-----
From: pgsql-novice-owner@postgresql.org [mailto:pgsql-novice-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, October 12, 2006 12:43 PM
To: lhornbeck@oppunl.com
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] FATAL: the database system is in recovery mode

> Does it appear that the startup subprocess is making progress at all?

checkpoint_segments is 8.

strace -p26891
Process 26891 attached - interrupt to quit
futex(0xb7db2880, FUTEX_WAIT, 2, NULL
"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
> Hmmm ... unless you had extremely high settings for both checkpoint_segments
> and checkpoint_timeout, it shouldn't take an hour to recover from a crash.

> checkpoint_segments is 8.

That's certainly not out of line --- I'd expect a max recovery time on the order of a minute or so for that much WAL.

> strace -p26891
> Process 26891 attached - interrupt to quit
> futex(0xb7db2880, FUTEX_WAIT, 2, NULL

Interesting. We don't use futexes directly, so this smells like a problem in glibc or some such. Can you get a stack trace?

$ gdb /path/to/postgres-executable 26891
gdb> bt
gdb> quit

regards, tom lane
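For anyone who prefers not to type into an interactive gdb session, the same backtrace can probably be captured in one command using gdb's batch mode. This is only a sketch: stack_trace and the DRY_RUN switch are hypothetical names introduced here, and the executable path and PID are the placeholders from the message above.

```shell
# Hypothetical helper: print a backtrace of a running process without an
# interactive gdb session. Arguments: path to the executable, then the PID.
# With DRY_RUN=1 it only prints the command it would run.
stack_trace() {
    exe=$1
    pid=$2
    # Reject an empty or non-numeric PID before handing it to gdb.
    case $pid in
        ''|*[!0-9]*) echo "usage: stack_trace EXECUTABLE PID" >&2; return 2 ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "gdb -batch -ex bt $exe $pid"
    else
        # -batch exits after running the commands; -ex bt prints the stack.
        gdb -batch -ex bt "$exe" "$pid"
    fi
}

# Assumed usage:
#   stack_trace /path/to/postgres-executable 26891
```

Attaching normally requires running as the same user as the server process, or as root.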
-----Original Message-----
From: pgsql-novice-owner@postgresql.org [mailto:pgsql-novice-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Thursday, October 12, 2006 1:10 PM
To: lhornbeck@oppunl.com
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] FATAL: the database system is in recovery mode

> Can you get a stack trace?

#0 0xffffe410 in __kernel_vsyscall ()
#1 0xb7d6031e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
#2 0xb7cfe2b4 in _L_mutex_lock_2495 () from /lib/tls/libc.so.6
#3 0xb7da2946 in __PRETTY_FUNCTION__.2189 () from /lib/tls/libc.so.6
#4 0x00000000 in ?? ()
"Laura Hornbeck" <lhornbeck@oppunl.com> writes:
>> Interesting. We don't use futexes directly, so this smells like a problem
>> in glibc or some such. Can you get a stack trace?

> #0 0xffffe410 in __kernel_vsyscall ()
> #1 0xb7d6031e in __lll_mutex_lock_wait () from /lib/tls/libc.so.6
> #2 0xb7cfe2b4 in _L_mutex_lock_2495 () from /lib/tls/libc.so.6
> #3 0xb7da2946 in __PRETTY_FUNCTION__.2189 () from /lib/tls/libc.so.6
> #4 0x00000000 in ?? ()

Hm, that's pretty unhelpful :-( ... I suppose you are using stripped Postgres executables, so we're not going to be able to learn more here. But it's definitely glibc getting wedged for some reason.

At this point I'd agree with kill -9'ing the subprocess, which will make its parent postmaster quit, and then you can try again. It seems quite possible that it won't lock up the next time.

If it does lock up repeatably, perhaps we could learn more with strace (try launching the postmaster under strace -f). The last hundred or so lines of the strace output before it stops at the futex call should give a hint what it's doing.

regards, tom lane
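If the hang does recur, the strace -f suggestion above could be carried out along these lines. This is a sketch under assumptions: trace_tail is a hypothetical helper, and the log path and data directory are placeholders. It relies on strace prefixing each line with the calling PID when -f is combined with -o.

```shell
# Hypothetical helper: from an strace log written with "strace -f -o FILE",
# print the last N lines (default 100) that belong to one PID.
# With -f and -o, strace prefixes every line with the PID that made the call.
trace_tail() {
    file=$1
    pid=$2
    n=${3:-100}
    awk -v pid="$pid" '$1 == pid' "$file" | tail -n "$n"
}

# Assumed usage; the data directory and log path are placeholders:
#   strace -f -o /tmp/postmaster.strace postmaster -D /path/to/data
#   trace_tail /tmp/postmaster.strace <pid of the stuck process>
```

Filtering by PID first keeps system calls from other child processes out of the hundred lines leading up to the futex wait.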
I have killed it and it restarted fine. Thank you so much for your help.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, October 12, 2006 1:32 PM
To: lhornbeck@oppunl.com
Cc: pgsql-novice@postgresql.org
Subject: Re: [NOVICE] FATAL: the database system is in recovery mode

> At this point I'd agree with kill -9'ing the subprocess, which will make
> its parent postmaster quit, and then you can try again.