Обсуждение: CREATE REPLICATION SLOT fails on a timeout
I am finding that my logical walsender connections are being terminated due to a timeout on the CREATE REPLICATION SLOT command. with "terminating walsender process due to replication timeout" Below is the stack trace when this happens #3 0x000000000067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871) at walsender.c:1748 #4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at walsender.c:1216 #5 logical_read_xlog_page (state=<optimized out>, targetPagePtr=691724288, reqLen=<optimized out>, targetRecPtr=<optimized out>, cur_page=0x18476e0 "}\320\005", pageTLI=<optimized out>) at walsender.c:741 #6 0x00000000004f41bf in ReadPageInternal (state=state@entry=0x17aa140, pageptr=pageptr@entry=691724288, reqLen=reqLen@entry=3600) at xlogreader.c:525 #7 0x00000000004f454a in XLogReadRecord (state=0x17aa140, RecPtr=691727856, RecPtr@entry=0, errormsg=errormsg@entry=0x7fff7f668c28) at xlogreader.c:228 #8 0x0000000000675e5c in DecodingContextFindStartpoint (ctx=ctx@entry=0x17a0358) at logical.c:460 #9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at walsender.c:800 #10 exec_replication_command ( cmd_string=cmd_string@entry=0x17f1478 "CREATE_REPLICATION_SLOT \"slon_4_1\" LOGICAL \"slony1_funcs.2.2.0\"") at walsender.c:1291 #11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x177db50, dbname=0x177db30 "test1", (gdb) p last_reply_timestamp $1 = 0 I propose the attached patch sets last_reply_timestamp to now() it starts processing a command. Since receiving a command is hearing something from the client. Steve
Вложения
Hi, On 2014-05-16 16:37:16 -0400, Steve Singer wrote: > I am finding that my logical walsender connections are being terminated due > to a timeout on the CREATE REPLICATION SLOT command. with "terminating > walsender process due to replication timeout" > > Below is the stack trace when this happens > > #3 0x000000000067df28 in WalSndCheckTimeOut (now=now@entry=453585463823871) > at walsender.c:1748 > #4 0x000000000067eedc in WalSndWaitForWal (loc=691727888) at > walsender.c:1216 > ... > #9 0x0000000000680f16 in CreateReplicationSlot (cmd=0x1798b50) at > walsender.c:800 > #10 exec_replication_command () at walsender.c:1291 > #11 0x00000000006bf4a1 in PostgresMain (argc=<optimized out>, > argv=argv@entry=0x177db50, dbname=0x177db30 "test1", > > (gdb) p last_reply_timestamp > $1 = 0 > > > I propose the attached patch sets last_reply_timestamp to now() it starts > processing a command. Since receiving a command is hearing something from > the client. Hm. Yes, that's a problem. > diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c > new file mode 100644 > index 5c11d68..56a2f10 > *** a/src/backend/replication/walsender.c > --- b/src/backend/replication/walsender.c > *************** exec_replication_command(const char *cmd > *** 1276,1281 **** > --- 1276,1282 ---- > parse_rc)))); > > cmd_node = replication_parse_result; > + last_reply_timestamp = GetCurrentTimestamp(); > > switch (cmd_node->type) > { I don't think that's going to cut it though. The creation can take longer than whatever wal_sender_timeout is set to (when there's lots of longrunning transactions). I think checking whether last_reply_timestamp = 0 during timeout checking is more robust. Greetings, Andres Freund
On 05/16/2014 04:43 PM, Andres Freund wrote: > Hi, > > I don't think that's going to cut it though. The creation can take > longer than whatever wal_sender_timeout is set to (when there's lots of > longrunning transactions). I think checking whether last_reply_timestamp > = 0 during timeout checking is more robust. > > Greetings, > > Andres Freund > > That makes sense. A patch that does that is attached. Steve
Вложения
On 2014-05-16 17:02:33 -0400, Steve Singer wrote: > >I don't think that's going to cut it though. The creation can take > >longer than whatever wal_sender_timeout is set to (when there's lots of > >longrunning transactions). I think checking whether last_reply_timestamp > >= 0 during timeout checking is more robust. > That makes sense. > A patch that does that is attached. I've attached a heavily revised version of that patch. Didn't touch all the necessary places... Survives creation of logical replication slots under 'intensive pressure', with a wal_sender_timeout=10ms. Thanks for the bugreport! Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Вложения
Hi, On 2014-05-17 01:34:25 +0200, Andres Freund wrote: > On 2014-05-16 17:02:33 -0400, Steve Singer wrote: > > >I don't think that's going to cut it though. The creation can take > > >longer than whatever wal_sender_timeout is set to (when there's lots of > > >longrunning transactions). I think checking whether last_reply_timestamp > > >= 0 during timeout checking is more robust. > > > That makes sense. > > A patch that does that is attached. > > I've attached a heavily revised version of that patch. Didn't touch all > the necessary places... > > Survives creation of logical replication slots under 'intensive > pressure', with a wal_sender_timeout=10ms. > > Thanks for the bugreport! Pushed a fix for it. I am pretty sure it will, but could you still test that it fixes your problem? Thanks! Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 05/28/2014 06:41 PM, Andres Freund wrote: > Hi, > > > Pushed a fix for it. I am pretty sure it will, but could you still test > that it fixes your problem? > > Thanks! The fix seems to work (I am no longer getting the timeout on slot creation) Thanks > Andres Freund >