Re: autovacuum on primary blocking queries on replica?

From: Laurenz Albe
Subject: Re: autovacuum on primary blocking queries on replica?
Date:
Msg-id: 0ffa96ab1ec3b22bb738b15a93612e69d1e5c11c.camel@cybertec.at
In reply to: autovacuum on primary blocking queries on replica?  (Don Seiler <don@seiler.us>)
List: pgsql-general
On Fri, 2022-05-27 at 14:00 -0500, Don Seiler wrote:
>  * PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
>  * Streaming physical replication
>  * hot_standby_feedback = on
> We use a read replica to offload a lot of (what should be) quick queries. This morning we had an incident
> where these queries were all blocking on AccessShareLock waits, written to the log as:
> 
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp LOG:  process 8185 still waiting for AccessShareLock on relation 16834 of database 16401 after 1000.228 ms at character 204
> 2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp DETAIL:  Process holding the lock: 10822. Wait queue: 32373, 8185, 13782, [...]
> This went on for 30 seconds (the value of max_standby_streaming_delay) until PG killed the blocking process:
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp FATAL:  terminating connection due to conflict with recovery
> 2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp DETAIL:  User was holding a relation lock for too long.
> 
> I'm trying to find the root cause of why this started. We did see an UPDATE [...]

An UPDATE by itself cannot be the problem: it only takes a ROW EXCLUSIVE lock on the table, and only ACCESS EXCLUSIVE locks are replicated to the standby.

> 15 seconds later we then see an aggressive autovacuum on this table:
> 
> 2022-05-27 15:23:52.507 UTC [30513] LOG:  automatic aggressive vacuum of table "foo_all.public.industry": index scans: 1
>         pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped frozen
>         tuples: 8252 removed, 8252 remain, 0 are dead but not yet removable, oldest xmin: 1670999292
>         buffer usage: 12219 hits, 137 misses, 54 dirtied
>         avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
>         system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s
> 
> and less than a second after that is when we see the first AccessShareLock message on the replica.
> 
> I've been reading tales of autovacuum taking an AccessExclusiveLock when truncating empty pages at the end of a table.
> I'm imagining that updating every row of a table and then rolling back would leave all of those new row versions dead
> at the end, qualifying the trailing pages for truncation and leading to the scenario I saw this morning.
> 
> I'm still not entirely satisfied, since the table in question was so small (only 8252 rows) that I wouldn't imagine it
> would hold things up as long as it did. Also puzzling: the blocking session on the replica was an application session,
> not any background/recovery process.

I think you are on the right trail.

VACUUM will truncate trailing empty pages if it can get a (very short) ACCESS EXCLUSIVE lock on the table.
Now, that lock and the truncation are replicated, and they can create a replication conflict just like
the one you describe.  Even if the lock is only held for a very short time on the primary, replaying it
will conflict with any query on that table on the standby.
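
If you want to confirm that this is what you hit, the standby keeps per-database counters of queries
canceled by recovery conflicts; a conflict caused by a replayed relation lock is counted in "confl_lock".
A minimal check, using the database name from your log:

   -- run on the standby: queries canceled by recovery conflicts, per database
   SELECT datname, confl_lock, confl_snapshot
   FROM pg_stat_database_conflicts
   WHERE datname = 'foo_all';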

You can disable autovacuum truncation on the table with

   ALTER TABLE public.industry SET (vacuum_truncate = off);

if you know that you can do without autovacuum truncation for that table.
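
As a related sketch (using the same table name as above), you can also skip the truncation step for a
single manual run, and check what is stored on the table afterwards:

   -- one-off vacuum without the truncation step (option available since v12)
   VACUUM (TRUNCATE false) public.industry;

   -- verify the per-table storage option
   SELECT relname, reloptions
   FROM pg_class
   WHERE relname = 'industry';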

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com


