Обсуждение: System in Recovery Mode But No Activity

Поиск
Список
Период
Сортировка

System in Recovery Mode But No Activity

От
"John Cheng"
Дата:
We had a run away process on our database box that used up all the
physical and all the virtual memory (swap). This caused the RedHat
Linux oom-killer to kill many processes, including some Postgres ones.
Postgres went into a funky state after that time:

2008-06-20 14:19:10 CDT [unknown] LOG:  invalid length of startup packet
2008-06-20 14:20:50 CDT  ERROR:  canceling autovacuum task
2008-06-20 14:20:50 CDT  CONTEXT:  automatic vacuum of table
"lms_nna.pg_catalog.pg_listener"
2008-06-20 14:21:48 CDT  ERROR:  canceling autovacuum task
2008-06-20 14:21:49 CDT  CONTEXT:  automatic vacuum of table
"lms_infiniti.pg_catalog.pg_listener"
2008-06-20 14:27:33 CDT  WARNING:  worker took too long to start; cancelled
2008-06-20 14:45:53 CDT  LOG:  server process (PID 22435) was
terminated by signal 9: Killed
...
2008-06-20 14:50:28 CDT <dbname> DETAIL:  The postmaster has commanded
this server process to roll back the current transaction and exit,
because another server process exited abnormally and possibly
corrupted shared memory.
2008-06-20 14:50:28 CDT <dbname> HINT:  In a moment you should be able
to reconnect to the database and repeat your command.

...
2008-06-20 14:50:36 CDT <dbname> FATAL:  the database system is in recovery mode

Now any attempt to connect to the database yield the message "the
database system is in recovery mode".

The processes that are still running are:

postgres 31103  0.0  0.0 149024 492 ?        S    Apr19   8:18
/opt/lms/64/postgres/8.3/bin/postmaster -p 6544 -D /pgdata/8.3/data
postgres 31109  0.0  0.0 65408  104 ?        Ss   Apr19   1:54  \_
postgres: logger process
postgres 32645  0.0  0.0 152708 564 ?        Ss   Jun15   3:14  \_
postgres: slony <databasename> 172.16.172.246(41114) idle in
transaction
postgres  1179  0.0  0.0 152708 348 ?        Ss   Jun15   3:30  \_
postgres: slony  <databasename> 172.16.172.246(41138) idle in
transaction

I executed strace and gdb on each process except for 31109 (the logger process)

Process 31103 attached - interrupt to quit # Main postmaster
select(6, [3 5], NULL, NULL, {15, 795000} <unfinished ...>
Process 31103 detached
[postgres@126053-db2 64]$ strace -p 32645 # connection #1
Process 32645 attached - interrupt to quit
futex(0x3954d32930, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 32645 detached
[postgres@126053-db2 64]$ strace -p 1179 # connection #2
Process 1179 attached - interrupt to quit
futex(0x3954d32930, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 1179 detached

I then used gdb to gather more information:

#
# For main postmaster process
#

[postgres@126053-db2 64]$ gdb 31103
GNU gdb Red Hat Linux (6.3.0.0-1.153.el4_6.2rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...31103: No such
file or directory.

[postgres@126053-db2 64]$ gdb /opt/lms/postgres-8.3_64/bin/postmaster 31103
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging
symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /opt/lms/64/postgres/8.3/bin/postmaster, process 31103
Reading symbols from /usr/lib64/libxslt.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libxslt.so.1
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/tls/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/tls/libc.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging
symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182894088192 (LWP 31103)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libnss_files.so.2

0x0000003954bc0125 in __select_nocancel () from /lib64/tls/libc.so.6

#
# For connection no. 1
#

[postgres@126053-db2 64]$ gdb /opt/lms/postgres-8.3_64/bin/postmaster 32645
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging
symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /opt/lms/64/postgres/8.3/bin/postmaster, process 32645
Reading symbols from /usr/lib64/libxslt.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libxslt.so.1
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/tls/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/tls/libc.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging
symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182894088192 (LWP 32645)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /opt/lms/64/postgres/8.3/lib/plpgsql.so...
(no debugging symbols found)...done.
Loaded symbols for /opt/lms/64/postgres/8.3/lib/plpgsql.so
Reading symbols from /opt/lms/64/postgres/8.3/lib/xxid.so...done.
Loaded symbols for /opt/lms/64/postgres/8.3/lib/xxid.so
0x0000003954bd28db in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6

#
# For connection no. 3
#

[postgres@126053-db2 64]$ gdb /opt/lms/postgres-8.3_64/bin/postmaster 1179
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging
symbols found)
Using host libthread_db library "/lib64/tls/libthread_db.so.1".

Attaching to program: /opt/lms/64/postgres/8.3/bin/postmaster, process 1179
Reading symbols from /usr/lib64/libxslt.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libxslt.so.1
Reading symbols from /usr/lib64/libxml2.so.2...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libxml2.so.2
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/tls/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libm.so.6
Reading symbols from /lib64/tls/libc.so.6...
(no debugging symbols found)...done.
Loaded symbols for /lib64/tls/libc.so.6
Reading symbols from /usr/lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libz.so.1
Reading symbols from /lib64/tls/libpthread.so.0...(no debugging
symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 182894088192 (LWP 1179)]
Loaded symbols for /lib64/tls/libpthread.so.0
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging
symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /opt/lms/64/postgres/8.3/lib/plpgsql.so...
(no debugging symbols found)...done.
Loaded symbols for /opt/lms/64/postgres/8.3/lib/plpgsql.so
Reading symbols from /opt/lms/64/postgres/8.3/lib/xxid.so...done.
Loaded symbols for /opt/lms/64/postgres/8.3/lib/xxid.so
0x0000003954bd28db in __lll_mutex_lock_wait () from /lib64/tls/libc.so.6

I think the fact that a process used up all the available memory
(physical and virtual) caused Postgres to go into a weird state. Now
it will not respond to kill, or pg_ctl for shutdown. Would the right
thing to do be using kill -9 to stop the server?


--
- John L Cheng

Re: System in Recovery Mode But No Activity

От
"Scott Marlowe"
Дата:
On Fri, Jun 20, 2008 at 7:12 PM, John Cheng <chonger.cheng@gmail.com> wrote:
> We had a run away process on our database box that used up all the
> physical and all the virtual memory (swap). This caused the RedHat
> Linux oom-killer to kill many processes, including some Postgres ones.
> Postgres went into a funky state after that time:

SNIP

> I think the fact that a process used up all the available memory
> (physical and virtual) caused Postgres to go into a weird state. Now
> it will not respond to kill, or pg_ctl for shutdown. Would the right
> thing to do be using kill -9 to stop the server?

When you say it won't respond to pg_ctl for shutdown, have you tried
the three options for the -m switch in order?  Are you running a
pretty recent pg version?  Which one?

Re: System in Recovery Mode But No Activity

От
"John Cheng"
Дата:
I am running postgresql 8.3, I was not aware of the 3 options (smart,
fast, or immediate). So it used the default - "fast".

The state of the server when I sent this e-mail was that there were
two remaining connections/postgres subprocesses. I used kill -9 to
stop those two subprocesses. Then postgres was able to stop normally.
After that, I restarted postgresql normally and it went into recovery
mode for about 30 seconds. After that, it started to behave normally
again.


On Fri, Jun 20, 2008 at 9:34 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Fri, Jun 20, 2008 at 7:12 PM, John Cheng <chonger.cheng@gmail.com> wrote:
>> We had a run away process on our database box that used up all the
>> physical and all the virtual memory (swap). This caused the RedHat
>> Linux oom-killer to kill many processes, including some Postgres ones.
>> Postgres went into a funky state after that time:
>
> SNIP
>
>> I think the fact that a process used up all the available memory
>> (physical and virtual) caused Postgres to go into a weird state. Now
>> it will not respond to kill, or pg_ctl for shutdown. Would the right
>> thing to do be using kill -9 to stop the server?
>
> When you say it won't respond to pg_ctl for shutdown, have you tried
> the three options for the -m switch in order?  Are you running a
> pretty recent pg version?  Which one?
>



--
- John L Cheng

Re: System in Recovery Mode But No Activity

От
"Scott Marlowe"
Дата:
On Fri, Jun 20, 2008 at 7:56 PM, John Cheng <chonger.cheng@gmail.com> wrote:
> I am running postgresql 8.3, I was not aware of the 3 options (smart,
> fast, or immediate). So it used the default - "fast".
>
> The state of the server when I sent this e-mail was that there were
> two remaining connections/postgres subprocesses. I used kill -9 to
> stop those two subprocesses. Then postgres was able to stop normally.
> After that, I restarted postgresql normally and it went into recovery
> mode for about 30 seconds. After that, it started to behave normally
> again.

Yeah, there's some danger in kill -9 to the postgres processes, but
I've had to killall -9 -f postgres once or twice in the distant past.
The -m immediate is the signal just this side of kill -9, forget the
name.  You should be ok.

Definitely look into what's causing the oom killer to come out and
play, and look at turning off overcommit (I think the setting is 2 to
turn it off)

Re: System in Recovery Mode But No Activity

От
Tom Lane
Дата:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> On Fri, Jun 20, 2008 at 7:56 PM, John Cheng <chonger.cheng@gmail.com> wrote:
>> The state of the server when I sent this e-mail was that there were
>> two remaining connections/postgres subprocesses. I used kill -9 to
>> stop those two subprocesses. Then postgres was able to stop normally.
>> After that, I restarted postgresql normally and it went into recovery
>> mode for about 30 seconds. After that, it started to behave normally
>> again.

> Definitely look into what's causing the oom killer to come out and
> play, and look at turning off overcommit (I think the setting is 2 to
> turn it off)

If you see this again, please get stack backtraces ('bt' command
to gdb).  The fact that both stuck processes were in
__lll_mutex_lock_wait() suggests some sort of deadlock, but it's
impossible to guess more without seeing how they got there.

Also, the reference to libpthread is a bit worrisome; we've seen
deadlocks in the past that were a direct result of the backend
unexpectedly becoming multithreaded, eg see this thread:
http://archives.postgresql.org/pgsql-general/2007-11/msg00580.php
You should look into what's causing libpthread to get loaded, and
see if you can stop it.  I don't see libperl mentioned in your
gdb output, but maybe something else is pulling it in --- ldd
might help track that down.

            regards, tom lane