Thread: Backends stuck in LISTEN
Hi people,

Today we had a problem where around 30-40 backends were stuck in LISTEN for a long time. I snatched a core dump before we had to kill the server, and the backtrace looks like this:

#0  0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f26d79e5c70 in ?? ()
#2  0x00007f26d79e5c80 in ?? ()
#3  0x00007f254d772b60 in ?? ()
#4  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#5  0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306
#6  0x000000000049fe00 in SlruSelectLRUPage (ctl=<optimized out>, pageno=34331)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:994
#7  0x00000000004a02ab in SimpleLruReadPage (ctl=0xb046a0, pageno=34331, write_ok=1 '\001', xid=1124964531)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:366
#8  0x0000000000487fb3 in TransactionIdGetStatus (xid=1124964531, lsn=0x7fffe8503750)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/clog.c:400
#9  0x0000000000488460 in TransactionLogFetch (transactionId=1124964531)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:83
#10 TransactionLogFetch (transactionId=1124964531)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:56
#11 0x00000000004885b1 in TransactionIdDidCommit (transactionId=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:133
#12 0x00000000005159f4 in asyncQueueProcessPageEntries (page_buffer=0x7fffe85038b0 "$", stop=..., current=<synthetic pointer>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1998
#13 asyncQueueReadAllNotifications ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1923
#14 0x0000000000516926 in Exec_ListenPreCommit ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:977
#15 PreCommit_Notify ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:801
#16 0x000000000048c863 in CommitTransaction ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:1827
#17 0x000000000048ce8d in CommitTransactionCommand ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:2563
#18 0x000000000063fafb in finish_xact_command ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2452
#19 finish_xact_command ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2441
#20 0x0000000000641e92 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0xc5dee0 "gluepay", username=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:4153
#21 0x0000000000603f83 in BackendRun (port=0xca6f50)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3612
#22 BackendStartup (port=0xca6f50)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3302
#23 ServerLoop ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1466
#24 0x00000000006049f1 in PostmasterMain (argc=<optimized out>, argv=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1127
#25 0x000000000045a600 in main (argc=5, argv=0xc5d180)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/main/main.c:199

This is on 64-bit Linux, version 9.1.9 from the apt repositories. Any idea what could cause this? It took around 30 seconds for the backends to respond to pg_terminate_backend().

Regards,
Marko Tiikkaja
Marko Tiikkaja wrote:
> Hi people,
>
> Today we had a problem where around 30-40 backends were stuck in
> LISTEN for a long time. I snatched a core dump before we had to
> kill the server, and the backtrace looks like this:
>
> #0  0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007f26d79e5c70 in ?? ()
> #2  0x00007f26d79e5c80 in ?? ()
> #3  0x00007f254d772b60 in ?? ()
> #4  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
>     at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464

Hm, looks like either they got stuck trying to activate profiling (but then Debian doesn't build its packages with gcc -pg, does it?), or the backtrace is flawed for those upper frames. Maybe installing the libc -dbg package (I imagine it'd be libc6-dbg) would be useful, if this reoccurs.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 1/8/14 3:00 PM, Alvaro Herrera wrote:
> Marko Tiikkaja wrote:
>>
>> #0  0x00007f26dd17d5b7 in monstartup () from /lib/x86_64-linux-gnu/libc.so.6
>> #1  0x00007f26d79e5c70 in ?? ()
>> #2  0x00007f26d79e5c80 in ?? ()
>> #3  0x00007f254d772b60 in ?? ()
>> #4  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
>>     at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
>
> Hm, looks like either they got stuck trying to activate profiling (but
> then Debian doesn't build its packages with gcc -pg, does it?), or the
> backtrace is flawed for those upper frames. Maybe installing the libc
> -dbg (I imagine it'd be libc6-dbg) package would be useful, if this
> reoccurs.

libc6-dbg only changed it a bit:

#0  0x00007f26dd17d5b7 in __monstartup (lowpc=<optimized out>, highpc=140737090958784) at gmon.c:130
#1  0x00007f26d79fcae0 in ?? ()
#2  0x00007f26d79fcaf0 in ?? ()
#3  0x00007f254d772aa0 in ?? ()

I still have the core dump available; is there something else I can do? From what I heard, this is the second time this has happened already :-(

Regards,
Marko Tiikkaja
On 1/8/14 3:24 PM, I wrote:
>
> libc6-dbg only changed it a bit:

Oops, looks like there was something wonky about that dump. This should make more sense:

#0  0x00007f26dd17d5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005f6128 in PGSemaphoreLock (sema=0x7f26d79e5c80, interruptOK=0 '\000') at pg_sema.c:418
#2  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
#3  0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306
#4  0x000000000049fe00 in SlruSelectLRUPage (ctl=<optimized out>, pageno=34331)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:994
#5  0x00000000004a02ab in SimpleLruReadPage (ctl=0xb046a0, pageno=34331, write_ok=1 '\001', xid=1124964531)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:366
#6  0x0000000000487fb3 in TransactionIdGetStatus (xid=1124964531, lsn=0x7fffe8503750)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/clog.c:400
#7  0x0000000000488460 in TransactionLogFetch (transactionId=1124964531)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:83
#8  TransactionLogFetch (transactionId=1124964531)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:56
#9  0x00000000004885b1 in TransactionIdDidCommit (transactionId=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/transam.c:133
#10 0x00000000005159f4 in asyncQueueProcessPageEntries (page_buffer=0x7fffe85038b0 "$", stop=..., current=<synthetic pointer>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1998
#11 asyncQueueReadAllNotifications ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:1923
#12 0x0000000000516926 in Exec_ListenPreCommit ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:977
#13 PreCommit_Notify ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/commands/async.c:801
#14 0x000000000048c863 in CommitTransaction ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:1827
#15 0x000000000048ce8d in CommitTransactionCommand ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/xact.c:2563
#16 0x000000000063fafb in finish_xact_command ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2452
#17 finish_xact_command ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:2441
#18 0x0000000000641e92 in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0xc5dee0 "gluepay", username=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/tcop/postgres.c:4153
#19 0x0000000000603f83 in BackendRun (port=0xca6f50)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3612
#20 BackendStartup (port=0xca6f50)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:3302
#21 ServerLoop ()
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1466
#22 0x00000000006049f1 in PostmasterMain (argc=<optimized out>, argv=<optimized out>)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/postmaster/postmaster.c:1127
#23 0x000000000045a600 in main (argc=5, argv=0xc5d180)
    at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/main/main.c:199

Regards,
Marko Tiikkaja
Marko Tiikkaja wrote:
> On 1/8/14 3:24 PM, I wrote:
>>
>> libc6-dbg only changed it a bit:
>
> Oops, looks like there was something wonky about that dump. This
> should make more sense:
>
> #0  0x00007f26dd17d5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00000000005f6128 in PGSemaphoreLock (sema=0x7f26d79e5c80,
>     interruptOK=0 '\000') at pg_sema.c:418
> #2  0x00000000006359c5 in LWLockAcquire (lockid=87, mode=LW_SHARED)
>     at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/storage/lmgr/lwlock.c:464
> #3  0x000000000049fabb in SimpleLruWaitIO (slotno=6, ctl=<optimized out>)
>     at /build/buildd/postgresql-9.1-9.1.9/build/../src/backend/access/transam/slru.c:306

Yeah, this makes a lot more sense. What this is saying is that it's waiting for the I/O lock on the pg_clog page to be released. In other words, there is some other process holding that lock which is supposed to be reading the page, but for some reason it's failing to do so in a timely fashion. I guess you would need to get backtraces from all processes to see which one differs (i.e. which one has the lock and is trying to do something but not doing it).

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
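[Editor's note: the wait Alvaro describes can be boiled down to a toy model. This is a sketch only, not PostgreSQL's actual C code; `SlruSlotModel`, `read_page`, and `fetch` are made-up names. The idea: each SLRU buffer slot has a per-slot I/O lock, the first backend that needs the page takes the lock and reads it from disk, and every other backend wanting that slot blocks on the lock until the read completes.]

```python
import threading

class SlruSlotModel:
    """Toy model (hypothetical names) of one SLRU buffer slot: a
    per-slot I/O lock serializes the physical page read, mirroring
    the SimpleLruWaitIO wait seen in the backtraces."""

    def __init__(self):
        self.io_lock = threading.Lock()  # analogue of the slot's buffer-I/O LWLock
        self.page = None                 # None means the read hasn't finished yet

    def read_page(self, pageno, fetch):
        # The first caller takes the I/O lock and performs the read;
        # every concurrent caller blocks right here until the holder
        # releases the lock.  If the holder never finishes its read,
        # all later arrivals stack up on this acquire.
        with self.io_lock:
            if self.page is None:
                self.page = fetch(pageno)
            return self.page
```

In this simplified picture, one process sitting on the I/O lock indefinitely would produce exactly the observed symptom: dozens of backends stopped at the same `LWLockAcquire` inside `SimpleLruWaitIO`.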
On 1/8/14 3:50 PM, Alvaro Herrera wrote:
> Yeah, this makes a lot more sense. What this is saying is that it's
> waiting for the I/O-lock on the pg_clog page to be released. In other
> words there is some other process holding that lock, which is supposed
> to be reading the page, but for some reason it's failing to do that
> timely. I guess you would need to get backtraces from all processes to
> see which one differs (i.e. which one has the lock and is trying to do
> something but not doing it).

I got only two, and they both looked like this, unfortunately. I'll try to find the offender should this happen again. In the meanwhile I'll be happy to provide more information if someone has any ideas.

Regards,
Marko Tiikkaja
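[Editor's note: should this recur, grabbing a backtrace from every backend at once is straightforward to script. The following is a hypothetical helper, not something from this thread; it assumes `pgrep(1)` and `gdb(1)` are installed and that backends are direct children of the postmaster.]

```python
# Hypothetical helper for collecting a backtrace from each backend of a
# running postmaster, so the odd one out (the lock holder) can be spotted.
import subprocess

def gdb_bt_command(pid):
    """Build the gdb invocation for a one-shot backtrace of one process."""
    return ["gdb", "--batch", "-p", str(pid), "-ex", "bt"]

def backend_pids(postmaster_pid):
    """Backends are direct children of the postmaster (pgrep -P)."""
    out = subprocess.run(["pgrep", "-P", str(postmaster_pid)],
                         capture_output=True, text=True)
    return [int(line) for line in out.stdout.split()]

def collect_backtraces(postmaster_pid):
    """Map each backend PID to its gdb backtrace text."""
    traces = {}
    for pid in backend_pids(postmaster_pid):
        result = subprocess.run(gdb_bt_command(pid),
                                capture_output=True, text=True)
        traces[pid] = result.stdout
    return traces
```

Attaching gdb briefly stops each backend, so on a production server this is best run only while the system is already stuck.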
Marko Tiikkaja <marko@joh.to> writes:
> In the meanwhile I'll be happy to provide more information if someone
> has any ideas.

I don't suppose there's anything in the postmaster log suggesting trouble with accessing pg_clog/ files shortly before the lockup?

I concur with Alvaro's assessment that this looks like a bunch of processes all waiting for somebody else to finish reading the clog page they want. If somebody had forgotten to unlock the I/O lock after loading a clog buffer, we could get this symptom later on; but I find it hard to see how that could happen.

BTW ... just looking at the code a bit ... I wonder whether there is any interlock that ensures that listen-queue messages will be cleared before the originating transaction's entry is truncated away from pg_clog? It doesn't seem like an ancient XID in the message queue could cause this particular symptom, but I could see it leading to "could not access status of transaction" failures.

			regards, tom lane
Marko Tiikkaja <marko@joh.to> writes:
>> In the meanwhile I'll be happy to provide more information if someone
>> has any ideas.

Oh, after further thought: this trace says the process was trying to look up the status of XID 1124964531 (see the TransactionIdGetStatus and TransactionLogFetch calls). Do all the backtraces show this same XID?

If I did the math right, this corresponds to pg_clog file 010C. Do you have such a file? Does it look to have the right size and permissions? What file names are present in pg_clog?

It'd also be interesting to see the current XID counter according to pg_controldata, especially if you have some idea how far it's advanced since the problem happened.

			regards, tom lane
On 1/8/14 4:38 PM, Tom Lane wrote:
> I don't suppose there's anything in the postmaster log suggesting trouble
> with accessing pg_clog/ files shortly before the lockup?

Nope, nothing out of the ordinary in the logs that I can find.

Regards,
Marko Tiikkaja
On 1/8/14 4:55 PM, Tom Lane wrote:
> Marko Tiikkaja <marko@joh.to> writes:
>>> In the meanwhile I'll be happy to provide more information if someone
>>> has any ideas.
>
> Oh, after further thought: this trace says the process was trying to look
> up the status of XID 1124964531 (see the TransactionIdGetStatus and
> TransactionLogFetch calls). Do all the backtraces show this same XID?

I only got two, and the other one wants the status of 1133447326.

> If I did the math right, this corresponds to pg_clog file 010C. Do you
> have such a file? Does it look to have the right size and permissions?
> What file names are present in pg_clog?

No, it's not there. The files look like this:

-rw------- 1 postgres postgres 256K Nov  3 15:55 0390
-rw------- 1 postgres postgres 256K Nov  3 22:26 0391
..
-rw------- 1 postgres postgres 256K Jan  8 16:43 044D
-rw------- 1 postgres postgres  24K Jan  8 17:28 044E

> It'd also be interesting to see the current XID counter according to
> pg_controldata, especially if you have some idea how far it's advanced
> since the problem happened.

Latest checkpoint's NextXID:          0/1155614649

(I hope that's the number you were looking for.) Judging from a few days old base backup, that number would have been (very roughly) around 1148225540 when this incident happened.

Regards,
Marko Tiikkaja
Marko Tiikkaja <marko@joh.to> writes:
> On 1/8/14 4:55 PM, Tom Lane wrote:
>> If I did the math right, this corresponds to pg_clog file 010C. Do you
>> have such a file? Does it look to have the right size and permissions?
>> What file names are present in pg_clog?

> No, it's not there. The files look like this:

> -rw------- 1 postgres postgres 256K Nov  3 15:55 0390
> -rw------- 1 postgres postgres 256K Nov  3 22:26 0391
> ..
> -rw------- 1 postgres postgres 256K Jan  8 16:43 044D
> -rw------- 1 postgres postgres  24K Jan  8 17:28 044E

Oh, I fat-fingered the math somehow, because when I do it over again, I get 0430 (or 0438 for your other observed XID). That's right in the range of your available clog files, so the notion of a crazy XID in the message queue doesn't seem to be the explanation here.

Since the referenced XIDs are not close together, I assume that the two traces also show different pagenos (SimpleLruReadPage) and slotnos (SimpleLruWaitIO) and lockids (LWLockAcquire)? That leads to the idea that we've got some condition that broke all of the clog SLRU slots at the same time, though it's really hard to see what that might be ...

			regards, tom lane
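[Editor's note: for anyone re-deriving the math — with the default build settings, clog stores 2 status bits per transaction, so one 8 kB page covers 32768 XIDs and one 32-page segment file covers 1048576 XIDs; the file name under pg_clog/ is the segment number in four uppercase hex digits. A quick sketch of the arithmetic, assuming those default constants rather than reading them from this particular server:]

```python
# Sketch of the XID -> pg_clog segment-file arithmetic, assuming the
# default build options (BLCKSZ = 8192, 2 status bits per transaction,
# 32 SLRU pages per segment file).
BLCKSZ = 8192
CLOG_XACTS_PER_BYTE = 4                               # 2 bits of status per xact
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE    # 32768 XIDs per page
SLRU_PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = CLOG_XACTS_PER_PAGE * SLRU_PAGES_PER_SEGMENT  # 1048576

def clog_segment_name(xid):
    """Name of the pg_clog/ file holding the status of this XID."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)

print(clog_segment_name(1124964531))  # -> 0430
print(clog_segment_name(1133447326))  # -> 0438
```

This reproduces both corrected file names (0430 and 0438), and as a sanity check, 1124964531 // 32768 = 34331 matches the pageno seen in the SimpleLruReadPage frame of the backtrace.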
On 1/8/14 6:04 PM, Tom Lane wrote:
> Since the referenced XIDs are not close together, I assume that the two
> traces also show different pagenos (SimpleLruReadPage) and slotnos
> (SimpleLruWaitIO) and lockids (LWLockAcquire)?

Correct. The other one has a lockid of NumFixedLWLocks, if that's of any significance.

Regards,
Marko Tiikkaja