Discussion: walsender doesn't send keepalives when writes are pending
Hi,

In WalSndLoop() we do:

    wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT |
        WL_SOCKET_READABLE;

    if (pq_is_send_pending())
        wakeEvents |= WL_SOCKET_WRITEABLE;
    else if (wal_sender_timeout > 0 && !ping_sent)
    {
        ...
        if (GetCurrentTimestamp() >= timeout)
            WalSndKeepalive(true);
        ...

I think those two if's should simply be separate. There's no reason not
to ask for a ping when we're writing. On a busy server that might be the
case most of the time.

The if (pq_is_send_pending()) should also come *after* sending the
keepalive: after all, it might not have been flushed immediately,
unlikely as that is.

The attached patch is unsurprisingly simple.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Attachments
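For readers without the attachment, here is a minimal sketch of the
reordering being proposed, reconstructed from the snippet quoted above.
The ping_sent bookkeeping and the timeout computation are elided ("...")
in the quoted original, so this is an illustration of the idea, not the
attached patch itself:

    /* Sketch only: reconstructed from the quoted WalSndLoop() excerpt. */
    wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT |
        WL_SOCKET_READABLE;

    /* First: ask for a reply even while outbound data is pending. */
    if (wal_sender_timeout > 0 && !ping_sent)
    {
        if (GetCurrentTimestamp() >= timeout)
        {
            WalSndKeepalive(true);      /* requestReply = true */
            ping_sent = true;
        }
    }

    /* Only then check for pending output, so a just-queued keepalive
     * that hasn't been flushed yet is also waited on. */
    if (pq_is_send_pending())
        wakeEvents |= WL_SOCKET_WRITEABLE;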
On Fri, Feb 14, 2014 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> There's no reason not
> to ask for a ping when we're writing.

Is there a reason to ask for a ping? The point of keepalives is to
ensure there's some traffic on idle connections, so that if the
connection is dead it doesn't linger forever, and so that any on-demand
links (or more recently NAT routers or stateful firewalls) don't time
out, disconnect, and have to reconnect (or more recently just fail
outright).

By analogy, TCP doesn't send any keepalives if there is any traffic on
the link. On the other hand, I happen to know (the hard way) that
typical PPPoE routers *do* send LCP pings even when the link is busy, so
there's precedent for either behaviour. I guess it comes down to why you
want the keepalives.

--
greg
On 2014-02-14 12:55:06 +0000, Greg Stark wrote:
> On Fri, Feb 14, 2014 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > There's no reason not
> > to ask for a ping when we're writing.
>
> Is there a reason to ask for a ping? The point of keepalives is to
> ensure there's some traffic on idle connections, so that if the
> connection is dead it doesn't linger forever, and so that any on-demand
> links (or more recently NAT routers or stateful firewalls) don't time
> out, disconnect, and have to reconnect (or more recently just fail
> outright).

This ain't TCP keepalives. The reason is that we want to kill walsenders
if they haven't responded to a ping inside wal_sender_timeout. That's
rather important, e.g. for synchronous replication, so we can quickly
fail over to the next standby. In such scenarios you'll usually want a
timeout *far* below anything TCP provides.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On 2014-02-14 13:58:59 +0100, Andres Freund wrote:
> On 2014-02-14 12:55:06 +0000, Greg Stark wrote:
> > On Fri, Feb 14, 2014 at 12:05 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > There's no reason not
> > > to ask for a ping when we're writing.
> >
> > Is there a reason to ask for a ping? The point of keepalives is to
> > ensure there's some traffic on idle connections, so that if the
> > connection is dead it doesn't linger forever, and so that any on-demand
> > links (or more recently NAT routers or stateful firewalls) don't time
> > out, disconnect, and have to reconnect (or more recently just fail
> > outright).
>
> This ain't TCP keepalives. The reason is that we want to kill walsenders
> if they haven't responded to a ping inside wal_sender_timeout. That's
> rather important, e.g. for synchronous replication, so we can quickly
> fail over to the next standby. In such scenarios you'll usually want a
> timeout *far* below anything TCP provides.

walreceiver sends pings every time it receives a 'w' message, so it's
probably not an issue there, but pg_receivexlog/pg_basebackup don't; they
use their own configured interval. So this might explain the latter two
being disconnected too early. I've seen reports of that...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 14, 2014 at 5:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> In WalSndLoop() we do:
>
>     wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT |
>         WL_SOCKET_READABLE;
>
>     if (pq_is_send_pending())
>         wakeEvents |= WL_SOCKET_WRITEABLE;
>     else if (wal_sender_timeout > 0 && !ping_sent)
>     {
>         ...
>         if (GetCurrentTimestamp() >= timeout)
>             WalSndKeepalive(true);
>         ...
>
> I think those two if's should simply be separate. There's no reason not
> to ask for a ping when we're writing. On a busy server that might be the
> case most of the time.

I think the main reason for the ping is to detect a network break sooner.
On a busy server, wouldn't the WALSender detect that the next time it
tries to send the remaining data? On each pass through the loop below, it
sleeps for some time and then tries to send data again, and at that point
it can detect the network failure.

    if ((caughtup && !streamingDoneSending) || pq_is_send_pending())
    {
        ..
        if (wal_sender_timeout > 0)
        {
            ..
            sleeptime = 1 + (wal_sender_timeout / 10);
        }
        ..
        WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
                          MyProcPort->sock, sleeptime);
    }

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2014-02-21 10:08:44 +0530, Amit Kapila wrote:
> On Fri, Feb 14, 2014 at 5:35 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Hi,
> >
> > In WalSndLoop() we do:
> >
> >     wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT |
> >         WL_SOCKET_READABLE;
> >
> >     if (pq_is_send_pending())
> >         wakeEvents |= WL_SOCKET_WRITEABLE;
> >     else if (wal_sender_timeout > 0 && !ping_sent)
> >     {
> >         ...
> >         if (GetCurrentTimestamp() >= timeout)
> >             WalSndKeepalive(true);
> >         ...
> >
> > I think those two if's should simply be separate. There's no reason not
> > to ask for a ping when we're writing. On a busy server that might be the
> > case most of the time.
>
> I think the main reason for the ping is to detect a network break sooner.
> On a busy server, wouldn't the WALSender detect that the next time it
> tries to send the remaining data?

Well, especially on a pipelined connection, that can take a fair
bit. TCP timeouts aren't fun. There's a reason we have the keepalives,
and that they measure application-to-application performance. And
detecting systems as down is important, e.g. for synchronous rep.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 21, 2014 at 2:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-02-21 10:08:44 +0530, Amit Kapila wrote:
>>
>> I think the main reason for the ping is to detect a network break sooner.
>> On a busy server, wouldn't the WALSender detect that the next time it
>> tries to send the remaining data?
>
> Well, especially on a pipelined connection, that can take a fair
> bit. TCP timeouts aren't fun.

Okay, but the behaviour should be the same for both a keepalive message
and the WAL data it needs to send. What I mean to say here is that if the
network is down, the WAL sender will detect it early by sending some data
(either a keepalive or WAL data). Also, the ping is sent only after
wal_sender_timeout/2 with respect to the last reply time, whereas WAL
data will be sent after each sleeptime (1 + wal_sender_timeout / 10),
which I think should be less than the time until a ping is sent.

> There's a reason we have the keepalives,
> and that they measure application-to-application performance.

Do you mean to say the info about the receiver (up to where it has
flushed ...)? If yes, then won't we get this information in other reply
messages, or does sending a keepalive achieve some other purpose (maybe
it can get the info more quickly)?

> And
> detecting systems as down is important, e.g. for synchronous rep.

If you are right, then I am missing some point, which is how, on a busy
server, sending a keepalive can detect a network break more quickly than
the current way.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2014-02-21 19:03:29 +0530, Amit Kapila wrote:
> On Fri, Feb 21, 2014 at 2:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-02-21 10:08:44 +0530, Amit Kapila wrote:
> >>
> >> I think the main reason for the ping is to detect a network break sooner.
> >> On a busy server, wouldn't the WALSender detect that the next time it
> >> tries to send the remaining data?
> >
> > Well, especially on a pipelined connection, that can take a fair
> > bit. TCP timeouts aren't fun.
>
> Okay, but the behaviour should be the same for both a keepalive message
> and the WAL data it needs to send. What I mean to say here is that if the
> network is down, the WAL sender will detect it early by sending some data
> (either a keepalive or WAL data). Also, the ping is sent only after
> wal_sender_timeout/2 with respect to the last reply time, whereas WAL
> data will be sent after each sleeptime (1 + wal_sender_timeout / 10),
> which I think should be less than the time until a ping is sent.

The danger is rather that *no* keepalive is sent (with requestReply =
true triggering a reply by the client) by the walsender. Try to run
pg_receivexlog against a busy server with a low walsender timeout. Since
we'll never get to sending a keepalive, we'll never trigger a reply by
the receiver.

> > There's a reason we have the keepalives,
> > and that they measure application-to-application performance.
>
> Do you mean to say the info about the receiver (up to where it has
> flushed ...)?

The important bit is updating walsender.c's last_reply_timestamp, so we
know that the standby has responded and we won't kill it.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
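To make the mechanism concrete, the liveness protocol under discussion
works roughly like this. A sketch using the names quoted elsewhere in
this thread (last_reply_timestamp, ping_sent, WalSndKeepalive()); it is
an illustration of the described behaviour, not the literal walsender.c
source:

    /*
     * Sketch: request a reply at half the timeout, terminate at the full
     * timeout. last_reply_timestamp only advances when the standby sends
     * a reply, and a quiet standby is only guaranteed to reply when we
     * request one via a keepalive.
     */
    if (wal_sender_timeout > 0 && !ping_sent &&
        GetCurrentTimestamp() >=
        TimestampTzPlusMilliseconds(last_reply_timestamp,
                                    wal_sender_timeout / 2))
    {
        WalSndKeepalive(true);      /* requestReply = true */
        ping_sent = true;
    }

    /* If no reply ever arrives, the timeout check kills the walsender. */
    if (wal_sender_timeout > 0 &&
        GetCurrentTimestamp() >=
        TimestampTzPlusMilliseconds(last_reply_timestamp,
                                    wal_sender_timeout))
        ereport(COMMERROR,
                (errmsg("terminating walsender process due to replication timeout")));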
On Fri, Feb 21, 2014 at 7:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-02-21 19:03:29 +0530, Amit Kapila wrote:
>> On Fri, Feb 21, 2014 at 2:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Well, especially on a pipelined connection, that can take a fair
>> > bit. TCP timeouts aren't fun.
>>
>> Okay, but the behaviour should be the same for both a keepalive message
>> and the WAL data it needs to send. What I mean to say here is that if the
>> network is down, the WAL sender will detect it early by sending some data
>> (either a keepalive or WAL data). Also, the ping is sent only after
>> wal_sender_timeout/2 with respect to the last reply time, whereas WAL
>> data will be sent after each sleeptime (1 + wal_sender_timeout / 10),
>> which I think should be less than the time until a ping is sent.
>
> The danger is rather that *no* keepalive is sent (with requestReply =
> true triggering a reply by the client) by the walsender. Try to run
> pg_receivexlog against a busy server with a low walsender timeout. Since
> we'll never get to sending a keepalive, we'll never trigger a reply by
> the receiver.

Looking at the code of pg_receivexlog in the function HandleCopyStream(),
it seems that this can only happen if the user has not configured
--status-interval in pg_receivexlog. The code referred to is below:

HandleCopyStream()
{
    ..
    /*
     * Potentially send a status message to the master
     */
    now = localGetCurrentTimestamp();
    if (still_sending && standby_message_timeout > 0 &&
        localTimestampDifferenceExceeds(last_status, now,
                                        standby_message_timeout))
    {
        /* Time to send feedback! */
        if (!sendFeedback(conn, blockpos, now, false))
            goto error;
        last_status = now;
    }

Even if this is not happening for some reason, shouldn't it anyway be the
responsibility of pg_receivexlog to send the status from time to time,
rather than sending it only when the server asks for it?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2014-02-22 09:08:39 +0530, Amit Kapila wrote:
> > The danger is rather that *no* keepalive is sent (with requestReply =
> > true triggering a reply by the client) by the walsender. Try to run
> > pg_receivexlog against a busy server with a low walsender timeout. Since
> > we'll never get to sending a keepalive, we'll never trigger a reply by
> > the receiver.
>
> Looking at the code of pg_receivexlog in the function HandleCopyStream(),
> it seems that this can only happen if the user has not configured
> --status-interval in pg_receivexlog.

The interval is configured independently from the primary's, and
pg_receivexlog doesn't tune it automatically to the one configured for
the walsender.

> Even if this is not happening for some reason, shouldn't it anyway be the
> responsibility of pg_receivexlog to send the status from time to time,
> rather than sending it only when the server asks for it?

It does. At its own interval. I don't see what's to discuss here,
sorry. There's really barely any cost to doing the keepalive correctly;
otherwise it'd be problematic in the half dozen cases where *we* do send
it. The keepalive mechanism doesn't work in one edge case. So let's fix
it, and not discuss why we think the entire mechanism might be useless.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Feb 14, 2014 at 7:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> In WalSndLoop() we do:
>
>     wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT |
>         WL_SOCKET_READABLE;
>
>     if (pq_is_send_pending())
>         wakeEvents |= WL_SOCKET_WRITEABLE;
>     else if (wal_sender_timeout > 0 && !ping_sent)
>     {
>         ...
>         if (GetCurrentTimestamp() >= timeout)
>             WalSndKeepalive(true);
>         ...
>
> I think those two if's should simply be separate. There's no reason not
> to ask for a ping when we're writing. On a busy server that might be the
> case most of the time.
> The if (pq_is_send_pending()) should also come *after* sending the
> keepalive: after all, it might not have been flushed immediately,
> unlikely as that is.

I spent an inordinate amount of time looking at this patch. I'm not
sure your proposed change will accomplish the desired effect. The
thing is that the code you quote is within an if-block gated by
(caughtup && !streamingDoneSending) || pq_is_send_pending().
Currently, therefore, the behavior is that we wait on the latch
*either* when we're caught up (and thus need to wait for more WAL) or
when the outbound queue is full (and thus we need to wait for it to
drain), but we send a keep-alive only in the former case (namely,
we're caught up).

Your proposed change would cause us to send keep-alives when we're
caught up, or when we're not caught up but the write queue happens to
be full. That doesn't make much sense. There might be a reason to
start sending keep-alives when we're not caught up, but even if we
want that behavior change, there's no reason to do it only when the
write queue is full. At least, that's how it looks to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
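Annotating the gating condition may make the argument easier to follow.
A simplified sketch of the loop shape being discussed, with the cases
labelled as Robert characterizes them (illustrative only, not the literal
source):

    /* Sketch: the keepalive code sits inside this gated block. */
    if ((caughtup && !streamingDoneSending) || pq_is_send_pending())
    {
        if (pq_is_send_pending())
        {
            /* Outbound queue full: wait for it to drain. With the
             * proposed patch a keepalive could also be sent here -- but
             * only because the queue happens to be full, not because
             * the server is busy per se. */
            wakeEvents |= WL_SOCKET_WRITEABLE;
        }
        else
        {
            /* Caught up, waiting for more WAL: today this is the only
             * path that sends a keepalive requesting a reply. */
        }

        WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
                          MyProcPort->sock, sleeptime);
    }
    /* A walsender that is busy sending, with room in the queue, never
     * enters the block at all -- so it sends no keepalive either way. */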
On 2014-02-25 10:45:55 -0500, Robert Haas wrote:
> On Fri, Feb 14, 2014 at 7:05 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > In WalSndLoop() we do:
> >
> >     wakeEvents = WL_LATCH_SET | WL_POSTMASTER_DEATH | WL_TIMEOUT |
> >         WL_SOCKET_READABLE;
> >
> >     if (pq_is_send_pending())
> >         wakeEvents |= WL_SOCKET_WRITEABLE;
> >     else if (wal_sender_timeout > 0 && !ping_sent)
> >     {
> >         ...
> >         if (GetCurrentTimestamp() >= timeout)
> >             WalSndKeepalive(true);
> >         ...
> >
> > I think those two if's should simply be separate. There's no reason not
> > to ask for a ping when we're writing. On a busy server that might be the
> > case most of the time.
> > The if (pq_is_send_pending()) should also come *after* sending the
> > keepalive: after all, it might not have been flushed immediately,
> > unlikely as that is.
>
> I spent an inordinate amount of time looking at this patch. I'm not
> sure your proposed change will accomplish the desired effect. The
> thing is that the code you quote is within an if-block gated by
> (caughtup && !streamingDoneSending) || pq_is_send_pending().
> Currently, therefore, the behavior is that we wait on the latch
> *either* when we're caught up (and thus need to wait for more WAL) or
> when the outbound queue is full (and thus we need to wait for it to
> drain), but we send a keep-alive only in the former case (namely,
> we're caught up).
>
> Your proposed change would cause us to send keep-alives when we're
> caught up, or when we're not caught up but the write queue happens to
> be full. That doesn't make much sense. There might be a reason to
> start sending keep-alives when we're not caught up, but even if we
> want that behavior change, there's no reason to do it only when the
> write queue is full.

I am not sure I can follow. Why doesn't it make sense to send out the
keepalive (with replyRequested = true) when we're busy sending stuff
(which will be the case most of the time on a busy server)?

The point is that clients like pg_receivexlog and pg_basebackup don't
send a keepalive response every time they receive something (which is
perfectly fine), but only when their internal timer says it's time to do
so, or when the walsender asks them to. So, if the server has a *lower*
timeout than the one configured for pg_receivexlog and the server is a
busy one, you'll get timeouts relatively often.

This is a pretty frequent situation in synchronous replication setups,
because the default timeout is *way* too high for that.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
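The receiver-side policy Andres describes can be sketched by extending
the HandleCopyStream() excerpt quoted earlier. Here replyRequested stands
for the reply-requested flag carried by a server keepalive ('k') message;
the name is introduced for illustration, and the snippet is a sketch of
the described behaviour, not the actual pg_receivexlog source:

    /*
     * Sketch: the client replies on its own --status-interval timer, or
     * immediately when a server keepalive requests a reply -- it never
     * replies for every 'w' message it receives.
     */
    now = localGetCurrentTimestamp();
    if (replyRequested ||
        (still_sending && standby_message_timeout > 0 &&
         localTimestampDifferenceExceeds(last_status, now,
                                         standby_message_timeout)))
    {
        if (!sendFeedback(conn, blockpos, now, replyRequested))
            goto error;
        last_status = now;
    }

If the server's wal_sender_timeout is lower than the client's timer, the
replyRequested path is the only thing that keeps the connection alive --
which is why the keepalive must actually be sent on busy servers.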
On Tue, Feb 25, 2014 at 10:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> I am not sure I can follow. Why doesn't it make sense to send out the
> keepalive (with replyRequested = true) when we're busy sending stuff
> (which will be the case most of the time on a busy server)?

It may very well make sense, but your patch won't generally have that
effect, because with the patch you proposed, the keep-alive can only
be sent when the server is busy if the write queue is also full.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-02-25 11:15:46 -0500, Robert Haas wrote:
> On Tue, Feb 25, 2014 at 10:54 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I am not sure I can follow. Why doesn't it make sense to send out the
> > keepalive (with replyRequested = true) when we're busy sending stuff
> > (which will be the case most of the time on a busy server)?
>
> It may very well make sense, but your patch won't generally have that
> effect, because with the patch you proposed, the keep-alive can only
> be sent when the server is busy if the write queue is also full.

Well, it either needs to be caught up *and/or* have a busy write queue,
right? Usually that state will be reached very quickly, because before
that we're writing data to the network as fast as it can be read from
disk.

Also, there are no timeout checks outside that
if (caughtup || send_pending()) block, so there's not much of a window in
which to hit problems when that loop isn't entered.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Feb 25, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-02-25 11:15:46 -0500, Robert Haas wrote:
> > It may very well make sense, but your patch won't generally have that
> > effect, because with the patch you proposed, the keep-alive can only
> > be sent when the server is busy if the write queue is also full.
>
> Well, it either needs to be caught up *and/or* have a busy write queue,
> right?

Right.

> Usually that state will be reached very quickly, because before
> that we're writing data to the network as fast as it can be read from
> disk.

I'm unimpressed. Even if that is true in practice, making the code
self-consistent is a goal of non-trivial value. The timing of sending
keep-alives has no business depending on the state of the write queue,
and right now it doesn't. Your patch would make it depend on that,
mostly by accident AFAICS.

> Also, there are no timeout checks outside that
> if (caughtup || send_pending()) block, so there's not much of a window
> in which to hit problems when that loop isn't entered.

That might be true, but waiting until the write queue is full to send
a ping, and then checking whether we've timed out before the queue has
drained, isn't a sane design pattern.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/25/2014 06:41 PM, Robert Haas wrote:
> On Tue, Feb 25, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Usually that state will be reached very quickly, because before
>> that we're writing data to the network as fast as it can be read from
>> disk.
>
> I'm unimpressed. Even if that is true in practice, making the code
> self-consistent is a goal of non-trivial value. The timing of sending
> keep-alives has no business depending on the state of the write queue,
> and right now it doesn't. Your patch would make it depend on that,
> mostly by accident AFAICS.

Yeah, the timeout should be checked regardless of the status of the
write queue. Per the attached patch.

While looking at this, I noticed that the time we sleep between
iterations is calculated in a funny way. It's not this patch's fault,
but moving the code made it more glaring. After the patch, it looks like
this:

>     TimestampTz timeout;
>     long        sleeptime = 10000;      /* 10 s */
>     ...
>     /*
>      * If wal_sender_timeout is active, sleep in smaller increments
>      * to not go over the timeout too much. XXX: Why not just sleep
>      * until the timeout has elapsed?
>      */
>     if (wal_sender_timeout > 0)
>         sleeptime = 1 + (wal_sender_timeout / 10);
>
>     /* Sleep until something happens or we time out */
>     ...
>     WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
>                       MyProcPort->sock, sleeptime);
>     ...
>     /*
>      * Check for replication timeout. Note we ignore the corner case
>      * possibility that the client replied just as we reached the
>      * timeout ... he's supposed to reply *before* that.
>      */
>     timeout = TimestampTzPlusMilliseconds(last_reply_timestamp,
>                                           wal_sender_timeout);
>     if (wal_sender_timeout > 0 && GetCurrentTimestamp() >= timeout)
>     {
>         ...
>     }

The logic was the same before the patch, but I added the XXX comment
above. Why do we sleep in increments of 1/10 of wal_sender_timeout?
Originally, the code calculated when the next wakeup should happen, by
adding wal_sender_timeout (or replication_timeout, as it was called back
then) to the time of the last reply. Why don't we do that? The code
seems to have just slowly devolved into that.

It was changed from sleep-until-timeout to sleeping 1/10th of
wal_sender_timeout by a patch that made walsender send a keep-alive
message to the client every time it wakes up, and I guess the idea was
to send a message at an interval that's 1/10th of the timeout. But that
idea is long gone by now: first, commit
da4efa13d801ccc179f1d2c24d8a60c4a2f8ede9 turned off sending the
keep-alive messages by default, and then
6f60fdd7015b032bf49273c99f80913d57eac284 turned them back on, but so
that a keep-alive is only sent once, when 1/2 of wal_sender_timeout has
elapsed.

I think that should be changed back to sleeping until the next deadline
at which something needs to happen, instead of polling. But that can be
a separate patch.

- Heikki
Attachments
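A sketch of the deadline-based sleep Heikki is suggesting, using
TimestampDifference() from PostgreSQL's timestamp utilities together with
the names quoted above. This is an illustration of the idea, not the
attached patch:

    /*
     * Sketch: sleep until the next deadline instead of polling at
     * 1/10th of wal_sender_timeout.
     */
    long        sleeptime = 10000;      /* default: 10 s */

    if (wal_sender_timeout > 0)
    {
        TimestampTz wakeup;
        long        secs;
        int         usecs;

        /* next deadline: when a reply from the standby is due */
        wakeup = TimestampTzPlusMilliseconds(last_reply_timestamp,
                                             wal_sender_timeout);
        TimestampDifference(GetCurrentTimestamp(), wakeup, &secs, &usecs);
        sleeptime = secs * 1000 + usecs / 1000;
    }

    WaitLatchOrSocket(&MyWalSnd->latch, wakeEvents,
                      MyProcPort->sock, sleeptime);

As Andres notes below, the wakeup would more usefully be scheduled at
last_reply + wal_sender_timeout/2, so that a reply requested at that
point still has time to arrive before the full timeout.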
On 2014-03-05 18:26:13 +0200, Heikki Linnakangas wrote:
> On 02/25/2014 06:41 PM, Robert Haas wrote:
> > On Tue, Feb 25, 2014 at 11:23 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> Usually that state will be reached very quickly, because before
> >> that we're writing data to the network as fast as it can be read from
> >> disk.
> >
> > I'm unimpressed. Even if that is true in practice, making the code
> > self-consistent is a goal of non-trivial value. The timing of sending
> > keep-alives has no business depending on the state of the write queue,
> > and right now it doesn't. Your patch would make it depend on that,
> > mostly by accident AFAICS.

I still don't see how my proposed patch increases the dependency; rather
the contrary, it's less dependent on the flushing behaviour. But I agree
that a more sweeping change is a good idea.

> The logic was the same before the patch, but I added the XXX comment
> above. Why do we sleep in increments of 1/10 of wal_sender_timeout?
> Originally, the code calculated when the next wakeup should happen, by
> adding wal_sender_timeout (or replication_timeout, as it was called back
> then) to the time of the last reply. Why don't we do that?
> [ archeology ]

It imo makes sense to wake up after last_reply + wal_sender_timeout/2, so
a requested reply actually has time to arrive, but otherwise I agree.

I think your patch makes sense. Additionally, imo the timeout checking
should be moved outside the if (caughtup || pq_is_send_pending()) block,
but that's probably a separate patch.

Any chance you could apply your patch soon? I have a patch pending that'll
surely conflict with this, and it seems better to fix this first.

Greetings,

Andres Freund
On 03/05/2014 10:57 PM, Andres Freund wrote:
> On 2014-03-05 18:26:13 +0200, Heikki Linnakangas wrote:
>> The logic was the same before the patch, but I added the XXX comment
>> above. Why do we sleep in increments of 1/10 of wal_sender_timeout?
>> Originally, the code calculated when the next wakeup should happen, by
>> adding wal_sender_timeout (or replication_timeout, as it was called back
>> then) to the time of the last reply. Why don't we do that?
>> [ archeology ]
>
> It imo makes sense to wake up after last_reply + wal_sender_timeout/2, so
> a requested reply actually has time to arrive, but otherwise I agree.
>
> I think your patch makes sense. Additionally, imo the timeout checking
> should be moved outside the if (caughtup || pq_is_send_pending()) block,
> but that's probably a separate patch.
>
> Any chance you could apply your patch soon? I have a patch pending that'll
> surely conflict with this, and it seems better to fix this first.

Ok, pushed. I left the polling-style sleep in place for now. Thanks!

- Heikki