Discussion: SR standby hangs
[this time from the right address - apologies if we get a duplicate]

PostgreSQL Experts Inc has a client with a 9.0.2 streaming replication
server that somehow becomes wedged after running for some time. The
server is running as a warm standby, and the client's application tries
to connect to both the master and the slave, accepting whichever lets it
connect (hence hot standby is not turned on). Archive files are being
shipped as well as WAL streaming.

The symptom is that the recovery process blocks forever on a semaphore.
We've crashed it and got the following backtrace:

#0  0x0000003493ed5337 in semop () from /lib64/libc.so.6
#1  0x00000000005bd103 in PGSemaphoreLock (sema=0x2b14986aec38, interruptOK=1 '\001') at pg_sema.c:420
#2  0x00000000005de645 in LockBufferForCleanup () at bufmgr.c:2432
#3  0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4168
#4  heap2_redo (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4858
#5  0x0000000000488780 in StartupXLOG () at xlog.c:6250
#6  0x000000000048a888 in StartupProcessMain () at xlog.c:9254
#7  0x00000000004a11ef in AuxiliaryProcessMain (argc=2, argv=<value optimized out>) at bootstrap.c:412
#8  0x00000000005c66c9 in StartChildProcess (type=StartupProcess) at postmaster.c:4427
#9  0x00000000005c8ab7 in PostmasterMain (argc=1, argv=0x17858bb0) at postmaster.c:1088
#10 0x00000000005725fe in main (argc=1, argv=<value optimized out>) at main.c:188

The platform is CentOS 5.5 x86-64, kernel version 2.6.18-194.11.4.el5.

I'm not quite sure where to start digging. Has anyone else seen
something similar? Our consultant reports having seen a similar problem
elsewhere, at a client who was running hot standby on 9.0.1, but the
problem did not recur, as it does fairly regularly with this client.

cheers

andrew
Andrew Dunstan <andrew@dunslane.net> writes:
> The symptom is that the recovery process blocks forever on a semaphore.
> We've crashed it and got the following backtrace:
> #0  0x0000003493ed5337 in semop () from /lib64/libc.so.6
> #1  0x00000000005bd103 in PGSemaphoreLock (sema=0x2b14986aec38, interruptOK=1 '\001') at pg_sema.c:420
> #2  0x00000000005de645 in LockBufferForCleanup () at bufmgr.c:2432
> #3  0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4168
> #4  heap2_redo (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4858
> #5  0x0000000000488780 in StartupXLOG () at xlog.c:6250

So who's holding the buffer lock that it wants? Are you sure this is an
actual hang, and not just recovery waiting for a standby query to
complete?

			regards, tom lane
On Fri, Feb 18, 2011 at 2:16 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> I'm not quite sure where to start digging. Has anyone else seen
> something similar? Our consultant reports having seen a similar problem
> elsewhere, at a client who was running hot standby on 9.0.1, but the
> problem did not recur, as it does fairly regularly with this client.

I've seen a very similar backtrace that only involved one system (no
Hot Standby). The problem in that case appears to have been an open
cursor holding a buffer pin. LockBufferForCleanup() has logic that's
supposed to prevent that from going on too long during HS - it should
nuke the guy with the buffer pin when the timeout expires - but maybe
there's a bug in that mechanism.

As a side matter, it would be good to improve this in the non-Hot
Standby case also. An open cursor can tie down an autovacuum worker
forever, which is not a good thing, as it's easily possible for the
number of open cursors to be larger than the number of available
autovacuum workers...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/18/2011 02:23 PM, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> The symptom is that the recovery process blocks forever on a semaphore.
>> We've crashed it and got the following backtrace:
>> #0  0x0000003493ed5337 in semop () from /lib64/libc.so.6
>> #1  0x00000000005bd103 in PGSemaphoreLock (sema=0x2b14986aec38, interruptOK=1 '\001') at pg_sema.c:420
>> #2  0x00000000005de645 in LockBufferForCleanup () at bufmgr.c:2432
>> #3  0x0000000000463733 in heap_xlog_clean (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4168
>> #4  heap2_redo (lsn=<value optimized out>, record=0x1787e1c0) at heapam.c:4858
>> #5  0x0000000000488780 in StartupXLOG () at xlog.c:6250
>
> So who's holding the buffer lock that it wants? Are you sure this is an
> actual hang, and not just recovery waiting for a standby query to complete?

It's not running HS, so there's no query to wait on.

cheers

andrew
On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> It's not running HS, so there's no query to wait on.

That seems to imply that recovery has leaked a buffer pin.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>> It's not running HS, so there's no query to wait on.

> That seems to imply that recovery has leaked a buffer pin.

No, because then the sanity check in LockBufferForCleanup would have
fired:

	/* There should be exactly one local pin */
	if (PrivateRefCount[buffer - 1] != 1)
		elog(ERROR, "incorrect local pin count: %d",
			 PrivateRefCount[buffer - 1]);

Some sort of deadly embrace with the bgwriter, maybe?

			regards, tom lane
On Fri, Feb 18, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> It's not running HS, so there's no query to wait on.
>
>> That seems to imply that recovery has leaked a buffer pin.
>
> No, because then the sanity check in LockBufferForCleanup would have
> fired:
>
>	/* There should be exactly one local pin */
>	if (PrivateRefCount[buffer - 1] != 1)
>		elog(ERROR, "incorrect local pin count: %d",
>			 PrivateRefCount[buffer - 1]);

Hmm, yeah.

> Some sort of deadly embrace with the bgwriter, maybe?

Maybe.

I think it'd be useful to know what the buffer header thinks the
refcount on that buffer is, and what the startup process and the
bgwriter each have for PrivateRefCount[buffer].

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/18/2011 03:42 PM, Robert Haas wrote:
> On Fri, Feb 18, 2011 at 2:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> On Fri, Feb 18, 2011 at 2:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
>>>> It's not running HS, so there's no query to wait on.
>>> That seems to imply that recovery has leaked a buffer pin.
>> No, because then the sanity check in LockBufferForCleanup would have
>> fired:
>>
>>	/* There should be exactly one local pin */
>>	if (PrivateRefCount[buffer - 1] != 1)
>>		elog(ERROR, "incorrect local pin count: %d",
>>			 PrivateRefCount[buffer - 1]);
>
> Hmm, yeah.
>
>> Some sort of deadly embrace with the bgwriter, maybe?
>
> Maybe.
>
> I think it'd be useful to know what the buffer header thinks the
> refcount on that buffer is, and what the startup process and the
> bgwriter each have for PrivateRefCount[buffer].

I'll see what I can find out (damn I hate driving debuggers).

cheers

andrew