Re: Fix race condition in InvalidatePossiblyObsoleteSlot()

Поиск
Список
Период
Сортировка
От Michael Paquier
Тема Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Дата
Msg-id ZekQQHCrIqLVpGz5@paquier.xyz
обсуждение исходный текст
Ответ на Re: Fix race condition in InvalidatePossiblyObsoleteSlot()  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Список pgsql-hackers
On Wed, Mar 06, 2024 at 05:45:56PM +0000, Bertrand Drouvot wrote:
> Thank you both for the report! I did a few test manually and can see the issue
> from times to times. When the issue occurs, the logical decoding was able to
> go through the place where LogicalConfirmReceivedLocation() updates the
> slot's catalog_xmin before being killed. In that case I can see that the
> catalog_xmin is updated to the xid horizon.

Two buildfarm machines have complained here, and one of them twice in
a row.  That's quite amazing, because a couple of dozen runs done in a
row on the same host as these animals all pass.  The CI did not
complain either (did 2~3 runs there yesterday).

> Means in a failed test we have something like:
>
> slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."
>
> While when the test is ok you'll see something like:
>
> slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

Perhaps we should also make the test report the catalog_xmin of the
slot.  That may make debugging a bit easier.

> In the failing test the call to SELECT pg_logical_slot_get_changes() does
> not advance the slot's catalog xmin anymore.

Is that something that we could enforce in the test in a stronger way,
cross-checking the xmin horizon before and after the call?

> To fix this, I think we need a new transacion to decode from the primary before
> executing pg_logical_slot_get_changes(). But this transaction has to be replayed
> on the standby first by the startup process. Which means we need to wakeup
> "terminate-process-holding-slot" and that we probably need another injection
> point somewehere in this test.
>
> I'll look at it unless you've another idea?

I am wondering if there is something else lurking here, actually, so
for now I am going to revert the change as it is annoying to get
sporadic failures in the CF bot at this time of the year and there are
a lot of patches under discussion.  Let's give it more time and more
thoughts, without pressure.
--
Michael

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Thomas Munro
Дата:
Сообщение: Re: Combine headerscheck and cpluspluscheck scripts
Следующее
От: Peter Smith
Дата:
Сообщение: Re: Synchronizing slots from primary to standby