Thread: Backends stuck in LISTEN

Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
Hi people,

Today we had a problem where around 30-40 backends were stuck in LISTEN
for a long time.  I snatched a core dump before we had to kill the
server, and the backtrace looks like this:

#0  0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f26d79e5c70 in ?? ()
#2  0x00007f26d79e5c80 in ?? ()
#3  0x00007f254d772b60 in ?? ()
#4  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#5  0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized
out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306
#6  0x000000000049fe00 in SlruSelectLRUPage (ctl=<optimized out>,
pageno=34331) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:994
#7  0x00000000004a02ab in SimpleLruReadPage (ctl=0xb046a0, pageno=34331,
write_ok=1 '\001', xid=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:366
#8  0x0000000000487fb3 in TransactionIdGetStatus (xid=1124964531,
lsn=0x7fffe8503750) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/clog.c:400
#9  0x0000000000488460 in TransactionLogFetch (transactionId=1124964531)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:83
#10 TransactionLogFetch (transactionId=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:56
#11 0x00000000004885b1 in TransactionIdDidCommit
(transactionId=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:133
#12 0x00000000005159f4 in asyncQueueProcessPageEntries
(page_buffer=0x7fffe85038b0 "$", stop=..., current=<synthetic pointer>)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1998
#13 asyncQueueReadAllNotifications () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1923
#14 0x0000000000516926 in Exec_ListenPreCommit () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:977
#15 PreCommit_Notify () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:801
#16 0x000000000048c863 in CommitTransaction () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:1827
#17 0x000000000048ce8d in CommitTransactionCommand () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:2563
#18 0x000000000063fafb in finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2452
#19 finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2441
#20 0x0000000000641e92 in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=0xc5dee0 "gluepay", username=<optimized
out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:4153
#21 0x0000000000603f83 in BackendRun (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3612
#22 BackendStartup (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3302
#23 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1466
#24 0x00000000006049f1 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1127
#25 0x000000000045a600 in main (argc=5, argv=0xc5d180) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/main/main.c:199


This is on 64-bit Linux, version 9.1.9 from the apt repositories.

Any idea what could cause this?  It took around 30 seconds for the
backends to respond to pg_terminate_backend().


Regards,
Marko Tiikkaja

Re: Backends stuck in LISTEN

From: Alvaro Herrera
Date:
Marko Tiikkaja wrote:
> Hi people,
>
> Today we had a problem where around 30-40 backends were stuck in
> LISTEN for a long time.  I snatched a core dump before we had to
> kill the server, and the backtrace looks like this:
>
> #0  0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007f26d79e5c70 in ?? ()
> #2  0x00007f26d79e5c80 in ?? ()
> #3  0x00007f254d772b60 in ?? ()
> #4  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
> at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464

Hm, looks like either they got stuck trying to activate profiling (but
then Debian doesn't build its packages with gcc -pg, does it?), or the
backtrace is flawed for those upper frames.  Maybe installing the libc
-dbg (I imagine it'd be libc6-dbg) package would be useful, if this
reoccurs.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
On 1/8/14 3:00 PM, Alvaro Herrera wrote:
> Marko Tiikkaja wrote:
>>
>> #0  0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x00007f26d79e5c70 in ?? ()
>> #2  0x00007f26d79e5c80 in ?? ()
>> #3  0x00007f254d772b60 in ?? ()
>> #4  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
>> at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
>
> Hm, looks like either they got stuck trying to activate profiling (but
> then Debian doesn't build its packages with gcc -pg, does it?), or the
> backtrace is flawed for those upper frames.  Maybe installing the libc
> -dbg (I imagine it'd be libc6-dbg) package would be useful, if this
> reoccurs.

libc6-dbg only changed it a bit:

#0  0x00007f26dd17d5b7 in __monstartup (lowpc=<optimized out>,
highpc=140737090958784) at gmon.c:130
#1  0x00007f26d79fcae0 in ?? ()
#2  0x00007f26d79fcaf0 in ?? ()
#3  0x00007f254d772aa0 in ?? ()

I still have the core dump available; is there something else I can do?
From what I heard, this is the second time this has happened already :-(



Regards,
Marko Tiikkaja

Re: Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
On 1/8/14 3:24 PM, I wrote:
>
> libc6-dbg only changed it a bit:

Oops, looks like there was something wonky about that dump.  This should
make more sense:

#0  0x00007f26dd17d5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005f6128 in PGSemaphoreLock (sema=0x7f26d79e5c80,
interruptOK=0 '\000') at pg_sema.c:418
#2  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#3  0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized
out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306
#4  0x000000000049fe00 in SlruSelectLRUPage (ctl=<optimized out>,
pageno=34331) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:994
#5  0x00000000004a02ab in SimpleLruReadPage (ctl=0xb046a0, pageno=34331,
write_ok=1 '\001', xid=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:366
#6  0x0000000000487fb3 in TransactionIdGetStatus (xid=1124964531,
lsn=0x7fffe8503750) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/clog.c:400
#7  0x0000000000488460 in TransactionLogFetch (transactionId=1124964531)
at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:83
#8  TransactionLogFetch (transactionId=1124964531) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:56
#9  0x00000000004885b1 in TransactionIdDidCommit
(transactionId=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:133
#10 0x00000000005159f4 in asyncQueueProcessPageEntries
(page_buffer=0x7fffe85038b0 "$", stop=..., current=<synthetic pointer>)
     at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1998
#11 asyncQueueReadAllNotifications () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1923
#12 0x0000000000516926 in Exec_ListenPreCommit () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:977
#13 PreCommit_Notify () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:801
#14 0x000000000048c863 in CommitTransaction () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:1827
#15 0x000000000048ce8d in CommitTransactionCommand () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:2563
#16 0x000000000063fafb in finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2452
#17 finish_xact_command () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2441
#18 0x0000000000641e92 in PostgresMain (argc=<optimized out>,
argv=<optimized out>, dbname=0xc5dee0 "gluepay", username=<optimized out>)
     at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:4153
#19 0x0000000000603f83 in BackendRun (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3612
#20 BackendStartup (port=0xca6f50) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3302
#21 ServerLoop () at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1466
#22 0x00000000006049f1 in PostmasterMain (argc=<optimized out>,
argv=<optimized out>) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1127
#23 0x000000000045a600 in main (argc=5, argv=0xc5d180) at
/build/buildd/postgresql-9.1-9.1.9/build/../src/backend/main/main.c:199


Regards,
Marko Tiikkaja

Re: Backends stuck in LISTEN

From: Alvaro Herrera
Date:
Marko Tiikkaja wrote:
> On 1/8/14 3:24 PM, I wrote:
> >
> >libc6-dbg only changed it a bit:
>
> Oops, looks like there was something wonky about that dump.  This
> should make more sense:
>
> #0  0x00007f26dd17d5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00000000005f6128 in PGSemaphoreLock (sema=0x7f26d79e5c80,
> interruptOK=0 '\000') at pg_sema.c:418
> #2  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
> at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
> #3  0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized
> out>) at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306

Yeah, this makes a lot more sense.  What this is saying is that it's
waiting for the I/O lock on the pg_clog page to be released.  In other
words, some other process is holding that lock and is supposed to be
reading the page, but for some reason it's failing to do so in a timely
fashion.  I guess you would need to get backtraces from all processes to
see which one differs (i.e. which one has the lock and is trying to do
something but not doing it).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Re: Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
On 1/8/14 3:50 PM, Alvaro Herrera wrote:
> Yeah, this makes a lot more sense.  What this is saying is that it's
> waiting for the I/O lock on the pg_clog page to be released.  In other
> words, some other process is holding that lock and is supposed to be
> reading the page, but for some reason it's failing to do so in a timely
> fashion.  I guess you would need to get backtraces from all processes to
> see which one differs (i.e. which one has the lock and is trying to do
> something but not doing it).

I got only two, and they both looked like this, unfortunately.  I'll try
to find the offender should this happen again.
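
Something along these lines ought to do it next time; just a rough
sketch, assuming gdb and pgrep are installed and that whoever runs it
is allowed to attach to the backends:

    #!/usr/bin/env python
    # Rough sketch: collect a backtrace from every running postgres process.
    # Assumes gdb and pgrep are available and that we may ptrace the
    # backends (e.g. run as root, or as the postgres user).
    import subprocess

    pids = subprocess.check_output(["pgrep", "-u", "postgres"]).split()
    for pid in pids:
        pid = pid.decode()
        print("=== backtrace for pid %s ===" % pid)
        # --batch makes gdb exit after running the given commands;
        # -ex bt prints the stack of the attached process.
        subprocess.call(["gdb", "--batch", "-p", pid, "-ex", "bt"])

That should make it easier to spot the backend that's holding the lock
rather than waiting on it.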

In the meanwhile I'll be happy to provide more information if someone
has any ideas.


Regards,
Marko Tiikkaja

Re: Backends stuck in LISTEN

From: Tom Lane
Date:
Marko Tiikkaja <marko@joh.to> writes:
> In the meanwhile I'll be happy to provide more information if someone
> has any ideas.

I don't suppose there's anything in the postmaster log suggesting trouble
with accessing pg_clog/ files shortly before the lockup?

I concur with Alvaro's assessment that this looks like a bunch of
processes all waiting for somebody else to finish reading the clog page
they want.  If somebody had forgotten to unlock the I/O lock after loading
a clog buffer, we could get this symptom later on; but I find it hard to
see how that could happen.

BTW ... just looking at the code a bit ... I wonder whether there is any
interlock that ensures that listen-queue messages will be cleared before
the originating transaction's entry is truncated away from pg_clog?
It doesn't seem like an ancient XID in the message queue could cause this
particular symptom, but I could see it leading to "could not access
status of transaction" failures.

            regards, tom lane

Re: Backends stuck in LISTEN

From: Tom Lane
Date:
Marko Tiikkaja <marko@joh.to> writes:
>> In the meanwhile I'll be happy to provide more information if someone
>> has any ideas.

Oh, after further thought: this trace says the process was trying to look
up the status of XID 1124964531 (see the TransactionIdGetStatus and
TransactionLogFetch calls).  Do all the backtraces show this same XID?

If I did the math right, this corresponds to pg_clog file 010C.  Do you
have such a file?  Does it look to have the right size and permissions?
What file names are present in pg_clog?

It'd also be interesting to see the current XID counter according to
pg_controldata, especially if you have some idea how far it's advanced
since the problem happened.

            regards, tom lane

Re: Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
On 1/8/14 4:38 PM, Tom Lane wrote:
> I don't suppose there's anything in the postmaster log suggesting trouble
> with accessing pg_clog/ files shortly before the lockup?

Nope, nothing out of the ordinary in the logs that I can find.



Regards,
Marko Tiikkaja

Re: Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
On 1/8/14 4:55 PM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>>> In the meanwhile I'll be happy to provide more information if someone
>>> has any ideas.
>
> Oh, after further thought: this trace says the process was trying to look
> up the status of XID 1124964531 (see the TransactionIdGetStatus and
> TransactionLogFetch calls).  Do all the backtraces show this same XID?

I only got two, and the other one wants the status of 1133447326.

> If I did the math right, this corresponds to pg_clog file 010C.  Do you
> have such a file?  Does it look to have the right size and permissions?
> What file names are present in pg_clog?

No, it's not there.  The files look like this:

   -rw------- 1 postgres postgres 256K Nov  3 15:55 0390
   -rw------- 1 postgres postgres 256K Nov  3 22:26 0391
     ..
   -rw------- 1 postgres postgres 256K Jan  8 16:43 044D
   -rw------- 1 postgres postgres  24K Jan  8 17:28 044E

> It'd also be interesting to see the current XID counter according to
> pg_controldata, especially if you have some idea how far it's advanced
> since the problem happened.

Latest checkpoint's NextXID:          0/1155614649

(I hope that's the number you were looking for.)

Judging from a base backup that's a few days old, that number would have
been (very roughly) around 1148225540 when this incident happened.



Regards,
Marko Tiikkaja

Re: Backends stuck in LISTEN

From: Tom Lane
Date:
Marko Tiikkaja <marko@joh.to> writes:
> On 1/8/14 4:55 PM, Tom Lane wrote:
>> If I did the math right, this corresponds to pg_clog file 010C.  Do you
>> have such a file?  Does it look to have the right size and permissions?
>> What file names are present in pg_clog?

> No, it's not there.  The files look like this:

>    -rw------- 1 postgres postgres 256K Nov  3 15:55 0390
>    -rw------- 1 postgres postgres 256K Nov  3 22:26 0391
>      ..
>    -rw------- 1 postgres postgres 256K Jan  8 16:43 044D
>    -rw------- 1 postgres postgres  24K Jan  8 17:28 044E

Oh, I fat-fingered the math somehow, because when I do it over again,
I get 0430 (or 0438 for your other observed XID).  That's right in the
range of your available clog files, so the notion of a crazy XID in
the message queue doesn't seem to be the explanation here.
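
Spelling the arithmetic out (a quick sketch, assuming the stock 9.1
constants: CLOG_XACTS_PER_PAGE = 8192 * 4 = 32768, SLRU_PAGES_PER_SEGMENT
= 32, segment files named as %04X):

    # XID -> pg_clog segment file, using the 9.1 defaults from clog.c/slru.c
    def clog_segment(xid):
        page = xid // 32768              # CLOG_XACTS_PER_PAGE
        return "%04X" % (page // 32)     # SLRU_PAGES_PER_SEGMENT

    print(clog_segment(1124964531))  # 0430 (page 34331, matching the backtrace)
    print(clog_segment(1133447326))  # 0438
    print(clog_segment(1155614649))  # 044E, the newest file in your listing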

Since the referenced XIDs are not close together, I assume that the two
traces also show different pagenos (SimpleLruReadPage) and slotnos
(SimpleLruWaitIO) and lockids (LWLockAcquire)?

That leads to the idea that we've got some condition that broke
all of the clog SLRU slots at the same time, though it's really hard
to see what that might be ...

            regards, tom lane

Re: Backends stuck in LISTEN

From: Marko Tiikkaja
Date:
On 1/8/14 6:04 PM, Tom Lane wrote:
> Since the referenced XIDs are not close together, I assume that the two
> traces also show different pagenos (SimpleLruReadPage) and slotnos
> (SimpleLruWaitIO) and lockids (LWLockAcquire)?

Correct.  The other one has a lockno of NumFixedLWLocks, if that's of
any significance.


Regards,
Marko Tiikkaja