Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae

From: Melanie Plageman
Subject: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
Date:
Msg-id: CAAKRu_bioPMfwpA2zgK=eNq402YsvTCrxfQ_o0PvCBejFsTu=A@mail.gmail.com
In reply to: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae  (Bowen Shi <zxwsbg12138@gmail.com>)
Responses: Re: relfrozenxid may disagree with row XIDs after 1ccc1e05ae
List: pgsql-bugs
On Mon, May 13, 2024 at 11:42 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>
> On Mon, May 13, 2024 at 10:42 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>>
>> On Sun, May 12, 2024 at 11:19 PM Bowen Shi <zxwsbg12138@gmail.com> wrote:
>> >
>> > Hi,
>> >>
>> >> Obviously we should actually fix this on back branches, but could we
>> >> at least make the retry loop interruptible in some way so people could
>> >> use pg_cancel/terminate_backend() on a stuck autovacuum worker or
>> >> vacuum process?
>> >
>> >
>> > If the problem happens in versions <= PG 16, we don't have a good
>> > solution (the vacuum process holds the exclusive lock, causing
>> > checkpoint hangs).
>> >
>> > Maybe we can make the retry loop interruptible first. However, since
>> > we are using START_CRIT_SECTION, we cannot simply use
>> > CHECK_FOR_INTERRUPTS to handle it.
>>
>> As far as I can tell, in 14 and 15, the versions where the issue
>> reported here is present, there is not a critical section in the
>> section of code looped through in the retry loop in lazy_scan_prune().
>
>
> You are correct. I tried again to add CHECK_FOR_INTERRUPTS in the retry
> loop, and when attempting to interrupt the current loop using the
> pg_terminate_backend function, the value of InterruptHoldoffCount is 1,
> which causes the vacuum to not end.

Yes, great point. Actually, Andres and I discussed this today
off-list, and he reminded me that because vacuum is holding a content
lock on the page here, InterruptHoldoffCount will be at least 1. We
could RESUME_INTERRUPTS() but we probably don't want to process
interrupts while holding the page lock here if we don't do it in other
places. And it's hard to say under what conditions we would want to
drop the page lock here.
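
To illustrate the InterruptHoldoffCount point, here is a minimal standalone
sketch. It is a simplified model, not PostgreSQL source: every model_* name
and the three static variables are hypothetical stand-ins. It assumes only
that taking the page content lock increments the holdoff count, and that an
interrupt check does nothing while that count is nonzero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for backend state; this is a model only. */
static int  holdoff_count = 0;            /* models InterruptHoldoffCount */
static bool interrupt_pending = false;    /* models a pending terminate request */
static bool interrupt_processed = false;  /* set once the request is serviced */

static void model_hold_interrupts(void)
{
    holdoff_count++;
}

static void model_resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* Assumption: acquiring the page content lock holds off interrupts,
 * and releasing it resumes them. */
static void model_lock_page(void)   { model_hold_interrupts(); }
static void model_unlock_page(void) { model_resume_interrupts(); }

/* Models an interrupt check: a pending request is serviced only when
 * the holdoff count is zero. Returns true if it was serviced. */
static bool model_check_for_interrupts(void)
{
    if (interrupt_pending && holdoff_count == 0)
    {
        interrupt_pending = false;
        interrupt_processed = true;
        return true;
    }
    return false;
}

/* Models a retry loop that runs with the page locked. The loop is bounded
 * here so the sketch terminates; returns the number of iterations run. */
static int model_retry_loop(int max_iterations)
{
    int i;

    model_lock_page();
    for (i = 0; i < max_iterations; i++)
    {
        /* Never succeeds here: the page lock keeps holdoff_count at 1. */
        if (model_check_for_interrupts())
            break;
    }
    model_unlock_page();
    return i;
}
```

In this model, a request that is pending before model_retry_loop() starts is
still pending when the loop gives up, and is serviced only by a check made
after the page lock has been released.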

Are you reproducing the hang locally with my repro? Or do you have
your own repro? How are you testing pg_terminate_backend() and seeing
that the InterruptHoldoffCount is 1?

>> We can actually fix the particular issue I reproduced with the
>> attached patch. However, I think it is still worth making the retry
>> loop interruptible in case there are other ways to end up infinitely
>> looping in the retry loop in lazy_scan_prune().
>
>
> I attempted to apply the patch on the REL_15_STABLE branch, but
> encountered some conflicts. May I ask which branch you are using?

Sorry, I should have mentioned that patch was against REL_14_STABLE.
Attached patch has the same functionality but should apply cleanly
against REL_15_STABLE.

- Melanie
