Re: Backend handling replication slot stuck using 100% cpu, unkillable

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: Backend handling replication slot stuck using 100% cpu, unkillable
Дата
Msg-id d25141d4-c390-b74b-53c1-97064d07b7c9@enterprisedb.com
обсуждение исходный текст
Ответ на Re: Backend handling replication slot stuck using 100% cpu, unkillable  (hubert depesz lubaczewski <depesz@depesz.com>)
Ответы Re: Backend handling replication slot stuck using 100% cpu, unkillable  (hubert depesz lubaczewski <depesz@depesz.com>)
Список pgsql-bugs

On 7/4/23 13:53, hubert depesz lubaczewski wrote:
> On Tue, Jul 04, 2023 at 01:30:21PM +0200, Tomas Vondra wrote:
>> On 7/3/23 14:58, hubert depesz lubaczewski wrote:
>> So is it an infinite loop in ReorderBufferExecuteInvalidations, or is it
>> just the case that there are many invalidations? I can't really deduce
>> that from the backtraces.
>>
>> How many invalidations does the transaction have? Should be enough to
>>
>>   print txn->ninvalidations
>>
>> Also, is there anything interesting about the transaction? You know the
>> XID (2741814901) so maybe use pg_waldump to see what it did.
> 
> I don't have the process anymore. Pg was restarted, and the app was
> forced to not take data from this place.
> 

OK. So the slot is still there, and we could try reading the data again
(hopefully getting stuck in the same place).

> I can run pg_waldump, no problem, but I'm not an expert on this. Which
> wal file(s) should I try to dump? and what to look for (or should I just
> provide full output of dump?
> 

The backtrace has this:

and 187650155969544 should be LSN AAAA/B4E37C08, which maps to

select pg_walfile_name('AAAA/B4E37C08');
     pg_walfile_name
--------------------------
 000000010000AAAA000000B4

So maybe try dumping that WAL segment and see if the XID 2741814901 is
there. If yes, maybe share the whole dump, it's difficult to say what to
share without knowing what the transaction did.

There's also the LSNs in pg_replication_slots - it might be interesting
to see dump for all the WAL between restart_lsn and confirmed_flush_lsn.
So like this segments between:

  SELECT pg_walfile_name(restart_lsn),
         pg_walfile_name(confirmed_flush_lsn)
    FROM pg_replication_slots WHERE slot_name = 'slot_name';


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



В списке pgsql-bugs по дате отправления:

Предыдущее
От: hubert depesz lubaczewski
Дата:
Сообщение: Re: Backend handling replication slot stuck using 100% cpu, unkillable
Следующее
От: Vamshikrishna T
Дата:
Сообщение: Re: BUG #18009: Postgres Recovery not happening