autovacuum on primary blocking queries on replica?

From: Don Seiler
Subject: autovacuum on primary blocking queries on replica?
Date:
Msg-id: CAHJZqBCc0mf0vBoiiG0Q7QR2K-WnaMciQyVKz16AAZX3jr7e+Q@mail.gmail.com
Replies: Re: autovacuum on primary blocking queries on replica?  (Laurenz Albe <laurenz.albe@cybertec.at>)
         Re: autovacuum on primary blocking queries on replica?  (Jeff Janes <jeff.janes@gmail.com>)
List: pgsql-general
  • PostgreSQL 12.9 - PGDG Ubuntu 18.04 image
  • Streaming physical replication
  • hot_standby_feedback = on
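For reference, these can be confirmed on the replica with a quick catalog query (this also shows max_standby_streaming_delay, which matters below):

    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('hot_standby_feedback', 'max_standby_streaming_delay');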
We use a read replica to offload a lot of (what should be) quick queries. This morning we had an incident where these queries were all stuck waiting on AccessShareLock, logged as:

2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp LOG:  process 8185 still waiting for AccessShareLock on relation 16834 of database 16401 after 1000.228 ms at character 204
2022-05-27 15:23:53.476 UTC [8185] foo@foo_all - myapp DETAIL:  Process holding the lock: 10822. Wait queue: 32373, 8185, 13782, 14290, 13215, 4427, 16056, 13446, 13448, 13445, 9268, 11784, 14469, 14502, 14340, 6510, 8730, 10876, 13012, 15760, 6645, 14712, 13006, 15923, 14485, 15924, 13447, 9926, 13005, 11803, 13013, 13298, 16059, 9266, 4943, 14470, 13148, 12118, 14491, 9927, 11791, 15097, 11804, 16060, 14711, 5469, 8113, 16061, 14698, 14475, 10620, 13150, 14501, 14465, 14341, 16062, 16065, 14484, 7060, 11785, 16068, 16063, 16064, 16066.
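
For next time, something like this on the replica should map those PIDs to the lock holder and the relation by name (a sketch; pg_blocking_pids() has been available since 9.6):

    SELECT pid, pg_blocking_pids(pid) AS blocked_by,
           relation::regclass AS rel, mode, granted
    FROM pg_locks
    WHERE relation = 16834;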


This went on for 30 seconds (the value of max_standby_streaming_delay) until PostgreSQL terminated the blocking session:

2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp FATAL:  terminating connection due to conflict with recovery
2022-05-27 15:24:22.474 UTC [10822] foo@foo_all - anotherapp DETAIL:  User was holding a relation lock for too long.
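
This kind of cancellation should also show up in the replica's cumulative conflict counters, counted under confl_lock:

    SELECT datname, confl_lock, confl_snapshot
    FROM pg_stat_database_conflicts
    WHERE datname = 'foo_all';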

I'm trying to find the root cause of why this started. We did see an UPDATE executed at 2022-05-27 15:23:37.000 UTC against the table in question (relation 16834) that touched every row of the table (only ~8200 rows, not huge), but the transaction was rolled back due to a syntax error later in the transaction.
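
In other words, the session did something equivalent to this (the column name is hypothetical):

    BEGIN;
    -- rewrites all ~8200 rows, creating new row versions at the end of the table
    UPDATE industry SET name = name;
    -- the rollback leaves every one of those new versions dead
    ROLLBACK;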

Fifteen seconds later we see an aggressive autovacuum on this table:

2022-05-27 15:23:52.507 UTC [30513] LOG:  automatic aggressive vacuum of table "foo_all.public.industry": index scans: 1
        pages: 252 removed, 323 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 8252 removed, 8252 remain, 0 are dead but not yet removable, oldest xmin: 1670999292
        buffer usage: 12219 hits, 137 misses, 54 dirtied
        avg read rate: 2.372 MB/s, avg write rate: 0.935 MB/s
        system usage: CPU: user: 0.14 s, system: 0.00 s, elapsed: 0.45 s


and less than a second after that we see the first AccessShareLock wait message on the replica.
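
For completeness, the table's vacuum activity can be cross-checked on the primary with something like:

    SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, autovacuum_count
    FROM pg_stat_user_tables
    WHERE relname = 'industry';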

I've been reading tales of autovacuum taking an AccessExclusiveLock when it truncates empty pages from the end of a table. I'm imagining that updating every row wrote new row versions at the end of the table, and rolling back left all of those versions dead; once autovacuum removed them, the trailing pages were empty and qualified for truncation, leading to the scenario I saw this morning.

I'm still not entirely satisfied, since the table in question is so small (only 8252 rows) that I wouldn't expect it to hold things up as long as it did. It also strikes me that the blocking session on the replica was an application session, not a background/recovery process.

I'm wondering if I'm on the right trail or if there is a much firmer explanation for what happened.
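
If the truncation theory holds, I understand that 12 added a per-table storage parameter to skip the truncation step (and with it the AccessExclusiveLock), at the cost of not giving the empty pages back to the OS:

    ALTER TABLE public.industry SET (vacuum_truncate = off);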

Thanks,
Don.

--
Don Seiler
www.seiler.us
