Обсуждение: BUG #17690: Nonresponsive client on replica can halt replication indefinitely

Поиск
Список
Период
Сортировка

BUG #17690: Nonresponsive client on replica can halt replication indefinitely

От
PG Bug reporting form
Дата:
The following bug has been logged on the website:

Bug reference:      17690
Logged by:          Jacob Baskin
Email address:      jacob.baskin@gmail.com
PostgreSQL version: 13.0
Operating system:   Linux (CentOS 7)
Description:

We have discovered that a badly-behaved client connected to a database hot
replica can indefinitely block replication from progressing. The client's
back-end gets into a state where it does not stop when the recovery process
tries to cancel conflicting queries, as long as there is still pending data
to be written.

To trigger this, the client needs to be:
- Actively running a query which conflicts with recovery
- Not reading data from its socket about query results (e.g., run "select *
from large_table" in psql and then Ctrl-Z as results are being streamed)

We believe this failure mode is deterministic. This bug definitely affects
postgres 13, and we believe it is still present in HEAD. We are running
Linux (Centos 7).

The sequence of events is as follows:

1. Postmaster tries to kill a connection that conflicts with recovery
(standby.c:393)
2. The connection process gets SIGUSR1.
3. This invokes RecoveryConflictInterrupt, which sets QueryCancelPending,
but NOT (generally) ProcDiePending (postgres.c:3039)
4. The connection process repeatedly processes ProcessClientWriteInterrupt,
which will handle interrupts if ProcDiePending is set but not otherwise
(postgres.c:526)

We believe the appropriate fix is to check for RecoveryConflictPending in
addition to ProcDiePending on postgres.c:526.

This fix would be a one-line patch which we are happy to submit but first
want to make sure that this is the correct approach.

Thanks!


Re: BUG #17690: Nonresponsive client on replica can halt replication indefinitely

От
Heikki Linnakangas
Дата:
Sorry for the very late response, I happened to notice this just now 
while browsing for bugs that got no answers.

On 18/11/2022 21:21, PG Bug reporting form wrote:
> The following bug has been logged on the website:
> 
> Bug reference:      17690
> Logged by:          Jacob Baskin
> Email address:      jacob.baskin@gmail.com
> PostgreSQL version: 13.0
> Operating system:   Linux (CentOS 7)
> Description:
> 
> We have discovered that a badly-behaved client connected to a database hot
> replica can indefinitely block replication from progressing. The client's
> back-end gets into a state where it does not stop when the recovery process
> tries to cancel conflicting queries, as long as there is still pending data
> to be written.
> 
> To trigger this, the client needs to be:
> - Actively running a query which conflicts with recovery
> - Not reading data from its socket about query results (e.g., run "select *
> from large_table" in psql and then Ctrl-Z as results are being streamed)
> 
> We believe this failure mode is deterministic. This bug definitely affects
> postgres 13, and we believe it is still present in HEAD. We are running
> Linux (Centos 7).
> 
> The sequence of events is as follows:
> 
> 1. Postmaster tries to kill a connection that conflicts with recovery
> (standby.c:393)
> 2. The connection process gets SIGUSR1.
> 3. This invokes RecoveryConflictInterrupt, which sets QueryCancelPending,
> but NOT (generally) ProcDiePending (postgres.c:3039)
> 4. The connection process repeatedly processes ProcessClientWriteInterrupt,
> which will handle interrupts if ProcDiePending is set but not otherwise
> (postgres.c:526)

Yes, I concur that's a problem.

> We believe the appropriate fix is to check for RecoveryConflictPending in
> addition to ProcDiePending on postgres.c:526.
> 
> This fix would be a one-line patch which we are happy to submit but first
> want to make sure that this is the correct approach.

That makes sense at quick glance, although you need to be careful to not 
mess with the protocol state by erroring out in the middle of writing a 
message to the socket. There is some protection against that, see 
PqCommBusy in pqcomm.c, but I'm not sure if it can get the connection to 
a sane state if you error out from ProcessClientWriteInterrupt. That's 
not a problem for ProcDiePending because the connection is being killed 
anyway, but with a recovery conflict you only want to cancel the running 
query.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: BUG #17690: Nonresponsive client on replica can halt replication indefinitely

От
Jacob Baskin
Дата:
Ah, this makes sense, I can see how we got that way. In the repro mode above, the socket buffer may be full so there's likely no way to return an error, we would have to kill the connection. It looks like there are other cases where a recovery conflict is fatal due to bad client state (e.g. if the client is idle in transaction) so our preference would probably be to do this, but I don't know if we're typical in this regard.

On Thu, Apr 25, 2024 at 6:18 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
Sorry for the very late response, I happened to notice this just now
while browsing for bugs that got no answers.

On 18/11/2022 21:21, PG Bug reporting form wrote:
> The following bug has been logged on the website:
>
> Bug reference:      17690
> Logged by:          Jacob Baskin
> Email address:      jacob.baskin@gmail.com
> PostgreSQL version: 13.0
> Operating system:   Linux (CentOS 7)
> Description:
>
> We have discovered that a badly-behaved client connected to a database hot
> replica can indefinitely block replication from progressing. The client's
> back-end gets into a state where it does not stop when the recovery process
> tries to cancel conflicting queries, as long as there is still pending data
> to be written.
>
> To trigger this, the client needs to be:
> - Actively running a query which conflicts with recovery
> - Not reading data from its socket about query results (e.g., run "select *
> from large_table" in psql and then Ctrl-Z as results are being streamed)
>
> We believe this failure mode is deterministic. This bug definitely affects
> postgres 13, and we believe it is still present in HEAD. We are running
> Linux (Centos 7).
>
> The sequence of events is as follows:
>
> 1. Postmaster tries to kill a connection that conflicts with recovery
> (standby.c:393)
> 2. The connection process gets SIGUSR1.
> 3. This invokes RecoveryConflictInterrupt, which sets QueryCancelPending,
> but NOT (generally) ProcDiePending (postgres.c:3039)
> 4. The connection process repeatedly processes ProcessClientWriteInterrupt,
> which will handle interrupts if ProcDiePending is set but not otherwise
> (postgres.c:526)

Yes, I concur that's a problem.

> We believe the appropriate fix is to check for RecoveryConflictPending in
> addition to ProcDiePending on postgres.c:526.
>
> This fix would be a one-line patch which we are happy to submit but first
> want to make sure that this is the correct approach.

That makes sense at quick glance, although you need to be careful to not
mess with the protocol state by erroring out in the middle of writing a
message to the socket. There is some protection against that, see
PqCommBusy in pqcomm.c, but I'm not sure if it can get the connection to
a sane state if you error out from ProcessClientWriteInterrupt. That's
not a problem for ProcDiePending because the connection is being killed
anyway, but with a recovery conflict you only want to cancel the running
query.

--
Heikki Linnakangas
Neon (https://neon.tech)