[BUG] FailedAssertion in SnapBuildPurgeOlderTxn

From Maxim Orlov
Subject [BUG] FailedAssertion in SnapBuildPurgeOlderTxn
Date
Msg-id CACG=ezZoz_KG+Ryh9MrU_g5e0HiVoHocEvqFF=NRrhrwKmEQJQ@mail.gmail.com
Replies Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn  (Andres Freund <andres@anarazel.de>)
Re: [BUG] FailedAssertion in SnapBuildPurgeOlderTxn  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hi!


PREAMBLE

Over the last couple of months, I kept running into a problem while running tests on ARM (Debian, aarch64) and some other weird platforms
for my 64-bit XIDs patch set. The contrib/test_decoding test (catalog_change_snapshot) occasionally failed with the following message:

TRAP: FailedAssertion("TransactionIdIsNormal(InitialRunningXacts[0]) && TransactionIdIsNormal(builder->xmin)", File: "snapbuild.c"

I have seen plenty of failures on ARM, a couple on x86, and none (if memory serves) on x86-64.

At first, I thought my 64-bit XID patch was to blame, but this is not the case. The error occurs on every branch from PG15 back to PG10
without any patch applied, although it is hard to reproduce.


PROBLEM

After some investigation, I think the problem is in snapbuild.c (commit 272248a0c1b1, see [0]). We allocate the InitialRunningXacts
array in builder->context, but by the time we call SnapBuildPurgeOlderTxn this context may already have been freed. Thus, the
InitialRunningXacts array becomes an array of 2139062143 (i.e. 0x7F7F7F7F) values. This has not caused the buildfarm to fail because of this code:

if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
                                 builder->xmin))
    return;

Since the cluster is initialised with XIDs far below 0x7F7F7F7F, we take the early return here, but the problem still exists.
I've attached a patch based on branch REL_15_STABLE to reproduce the problem on x86-64.
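
To illustrate why the bound check hides the problem on a freshly initialised cluster, here is a minimal standalone sketch of a modulo-2^32 XID comparison. It is not the code from snapbuild.c: the helper names (xid_is_normal, normal_xid_precedes, purge_returns_early) are made up for this example, and the sample xmin values are assumptions chosen only to show both outcomes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* Assumption for this sketch: XIDs below 3 are reserved, normal XIDs start at 3. */
#define FIRST_NORMAL_XID ((TransactionId) 3)

static inline bool
xid_is_normal(TransactionId xid)
{
    return xid >= FIRST_NORMAL_XID;
}

/*
 * Modulo-2^32 comparison: true if a is logically older than b.
 * The cast to int32_t relies on the usual two's-complement conversion.
 */
static inline bool
normal_xid_precedes(TransactionId a, TransactionId b)
{
    assert(xid_is_normal(a) && xid_is_normal(b));
    return (int32_t) (a - b) < 0;
}

/* Hypothetical stand-in for the bound check: true means "return early". */
static inline bool
purge_returns_early(TransactionId first_running, TransactionId xmin)
{
    return !normal_xid_precedes(first_running, xmin);
}
```

With a small xmin (say 750), purge_returns_early(0x7F7F7F7F, 750) is true, so the stale array is never looked at any further. Once xmin is slightly beyond 0x7F7F7F7F, the same call returns false and the code goes on to use the stale contents.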

With my 64-bit XID patch set the problem shows up because I initialise the cluster with an XID far beyond the 32-bit boundary.

Alternatively, I tried using my patch [1] to initialise the cluster with 2139062143 (i.e. 0x7F7F7F7F) as the first transaction ID,
and then added a pg_usleep call, just as in the attached patch:
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -968,6 +968,8 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
        if (NInitialRunningXacts == 0)
                return;

+       pg_usleep(1000000L * 2L);
+
        /* bound check if there is at least one transaction to remove */
        if (!NormalTransactionIdPrecedes(InitialRunningXacts[0],
                                                                         builder->xmin))

I then ran installcheck-force many times for test_decoding/catalog_change_snapshot and got a segfault.


CONCLUSION

In snapbuild.c, the InitialRunningXacts array is allocated in a memory context that may be freed while the array is still referenced.
This causes an assertion failure (at best) and may lead to more serious problems.


P.S.

A simple fix like:
@@ -1377,7 +1379,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
                 * changes. See SnapBuildXidSetCatalogChanges.
                 */
                NInitialRunningXacts = nxacts;
-               InitialRunningXacts = MemoryContextAlloc(builder->context, sz);
+               InitialRunningXacts = MemoryContextAlloc(TopMemoryContext, sz);
                memcpy(InitialRunningXacts, running->xids, sz);
                qsort(InitialRunningXacts, nxacts, sizeof(TransactionId), xidComparator);

seems to solve the described problem, but I'm not familiar with the background of [0] or with why the array is allocated in builder->context.
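
The lifetime issue can be modelled with a toy arena. This is only a sketch: ToyContext, toy_alloc_xids, toy_release and first_xid_after_release are hypothetical names and not PostgreSQL's MemoryContext API. It also assumes a build that overwrites released memory with 0x7F bytes, which matches the 0x7F7F7F7F values described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t TransactionId;

#define TOY_SLOTS 64

/* Hypothetical toy arena standing in for a memory context. */
typedef struct ToyContext
{
    TransactionId slots[TOY_SLOTS];
    size_t        used;
} ToyContext;

/* Hand out n consecutive slots from the arena. */
static inline TransactionId *
toy_alloc_xids(ToyContext *cxt, size_t n)
{
    TransactionId *p;

    assert(n <= TOY_SLOTS - cxt->used);
    p = &cxt->slots[cxt->used];
    cxt->used += n;
    return p;
}

/* Model "context freed": wipe everything with 0x7F and start over. */
static inline void
toy_release(ToyContext *cxt)
{
    memset(cxt->slots, 0x7F, sizeof(cxt->slots));
    cxt->used = 0;
}

/*
 * Copy xids into alloc_cxt, release builder_cxt, then read the saved
 * array through the pointer that was kept around (as a static variable
 * would be). Returns the first element seen after the release.
 */
static inline TransactionId
first_xid_after_release(ToyContext *alloc_cxt, ToyContext *builder_cxt,
                        const TransactionId *xids, size_t n)
{
    TransactionId *saved = toy_alloc_xids(alloc_cxt, n);

    memcpy(saved, xids, n * sizeof(TransactionId));
    toy_release(builder_cxt);
    return saved[0];
}
```

If the array lives in the arena that gets released, the kept pointer reads 0x7F7F7F7F. If it lives in a separate, longer-lived arena (the role TopMemoryContext plays in the simple fix), the original value survives. The sketch says nothing about whether the long-lived allocation is ever freed.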



--
Best regards,
Maxim Orlov.