Обсуждение: Hang on NOTIFY

Поиск
Список
Период
Сортировка

Hang on NOTIFY

От
Mark Simonetti
Дата:
The system I am developing makes extensive use of the async
NOTIFY/LISTEN system.

I am currently experiencing a problem on 2 production servers:

Server 1:
Virtual Windows Server 2008 R2 (VMWare)
PostgreSQL 9.3.5

Server 2:
Virtual Windows Server 2008 R2 (VMWare)
PostgreSQL 9.4.2

After the system has been running for a period of time, sometimes a few
days sometimes a few weeks, any calls to NOTIFY
will hang.

After in depth investigation it appears to happen when a listening
backend has been connected for some time (days).

Any other backend trying to inform that backend will hang on
"CallNamedPipe" in pgkill (kill.c).

Here is a stack trace from the hung SENDING backend, main thread : -

      ntdll.dll!_NtFsControlFile@40()  + 0x15 bytes
      ntdll.dll!_NtFsControlFile@40()  + 0x15 bytes
      kernel32.dll!_CallNamedPipeW@28()  + 0xf4 bytes
      postgres.exe!pgkill(int pid, int sig)  Line 43 + 0x2b bytes  C
      postgres.exe!SendProcSignal(int pid, ProcSignalReason reason, int
backendId)  Line 198 + 0x10 bytes    C
      postgres.exe!SignalBackends()  Line 1497 + 0xe bytes    C
 >    postgres.exe!ProcessCompletedNotifies()  Line 1092    C
      postgres.exe!PostgresMain(int argc, char * * argv, const char *
dbname, const char * username)  Line 3947    C
      postgres.exe!BackendRun(Port * port)  Line 4011 + 0x21 bytes  C
      postgres.exe!SubPostmasterMain(int argc, char * * argv)  Line 4515
+ 0x8 bytes    C
      postgres.exe!main(int argc, char * * argv)  Line 203 + 0x7 bytes    C
      postgres.exe!__tmainCRTStartup()  Line 555 + 0x17 bytes    C
      kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes
      ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes
      ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes

Here is a stack trace from the signalling thread (I know its irrelevent
as this is for incomming signals) : -

      ntdll.dll!_NtFsControlFile@40()  + 0x15 bytes
      ntdll.dll!_NtFsControlFile@40()  + 0x15 bytes
 >    postgres.exe!pg_signal_thread(void * param)  Line 279 + 0x9 bytes    C


Now for the RECIPIENT backend : -

      ntdll.dll!_ZwWaitForMultipleObjects@20()  + 0x15 bytes
      ntdll.dll!_ZwWaitForMultipleObjects@20()  + 0x15 bytes
      KERNELBASE.dll!_WaitForMultipleObjectsEx@20()  + 0x36 bytes
      kernel32.dll!_WaitForMultipleObjectsExImplementation@20()  + 0x8e
bytes
 >     postgres.exe!pgwin32_waitforsinglesocket(unsigned int s, int
what, int timeout)  Line 216 + 0x14 bytes    C
      postgres.exe!pgwin32_recv(unsigned int s, char * buf, int len, int
f)  Line 352 + 0xa bytes    C
      postgres.exe!secure_read(Port * port, void * ptr, unsigned int
len)  Line 304 + 0x12 bytes    C
      postgres.exe!pq_getbyte()  Line 895 + 0x67 bytes    C
      postgres.exe!SocketBackend(StringInfoData * inBuf)  Line 344 + 0x5
bytes    C
     postgres.exe!PostgresMain(int argc, char * * argv, const char *
dbname, const char * username)  Line 3968 + 0x1c bytes    C
      postgres.exe!BackendRun(Port * port)  Line 4011 + 0x21 bytes  C
      postgres.exe!SubPostmasterMain(int argc, char * * argv)  Line 4515
+ 0x8 bytes    C
      postgres.exe!main(int argc, char * * argv)  Line 203 + 0x7 bytes    C
      postgres.exe!__tmainCRTStartup()  Line 555 + 0x17 bytes    C
      kernel32.dll!@BaseThreadInitThunk@12()  + 0x12 bytes
      ntdll.dll!___RtlUserThreadStart@8()  + 0x27 bytes
      ntdll.dll!__RtlUserThreadStart@8()  + 0x1b bytes

This is the usual place for it to wait, so this seems okay.

      ntdll.dll!_NtFsControlFile@40()  + 0x15 bytes
      ntdll.dll!_NtFsControlFile@40()  + 0x15 bytes
 >    postgres.exe!pg_signal_thread(void * param)  Line 279 + 0x9 bytes    C

Also looks fine.

This seems like a possible Windows bug, as the call to CallNamedPipe has
a timeout of 1000 milliseconds, but it is clearly not timing out.  It
only seems to exit if I exit the backend it is trying to signal.

NOTE: it is trying to send to many backends, but on all the stuck
backends I checked, they all were stuck sending to the same recipient.
Closing that particular recipient DOES free everything up and signals
start flowing again.

I've searched around and cannot find a similar bug report.  Is it
possibly something I'm doing wrong?

Thanks,
Mark.
--

Re: Hang on NOTIFY

От
Tom Lane
Дата:
Mark Simonetti <marks@opalsoftware.co.uk> writes:
> [ Our Windows kill() emulation sometimes hangs ]

For the record, I was helping Mark track this down off-list, up till the
point where it was unmistakably a Windows-specific issue.  I can't do much
more with it now.

I see no good reason to think that the problem is especially
NOTIFY-specific; it seems likely that any cross-backend signal attempt,
such as pg_cancel_backend(), would have the same hazard.  So if anyone
is motivated to try to reproduce it, you might be able to do so in less
time than it takes Mark's system to have a problem by doing a boatload
of pg_cancel_backend()'s.

I had originally guessed that it might be associated with the target
backend exiting during or just before the signal attempt; however, in
Mark's tests the target process is clearly still there.  Some other
theories that would be worth pursuing are (1) it's triggered by multiple
processes trying to signal the same target process concurrently, or
(2) it's got something to do with the target being very long-lived
and therefore having received a lot of signals in its lifetime.
(32-bit overflow anybody?  Mark, can you estimate how many signals
the target might have received before things go south?)

> This seems like a possible Windows bug, as the call to CallNamedPipe has
> a timeout of 1000 milliseconds, but it is clearly not timing out.  It
> only seems to exit if I exit the backend it is trying to signal.

I did some googling and found this old report:
http://www.postgresql.org/message-id/1262016302.3302.37.camel@arc-dev2.wsicorp.com

We did apply a variant of the patch recommended there, cf commits
04a4413c2 and f27a4696f, but it's worth reading in this connection anyway.
In particular I noted the suggestion that CallNamedPipe's timeout only
applies to the initial wait to obtain an instance of the pipe.  If that's
accurate, it'd suggest that we're hanging after that step within the
multiple steps of CallNamedPipe.

            regards, tom lane