pgsql: Fix failure at promotion with 2PC transactions and archiving ena

From Michael Paquier
Subject pgsql: Fix failure at promotion with 2PC transactions and archiving ena
Msg-id E1qBPNF-002ddW-0w@gemulon.postgresql.org
List pgsql-committers
Fix failure at promotion with 2PC transactions and archiving enabled

When archiving is enabled, a promotion request would fail with the
following error when some 2PC transaction needs to be recovered from
WAL, preventing the promotion from completing:
FATAL:  requested WAL segment pg_wal/000000010000000000000001 has already been removed
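For readers unfamiliar with the file name in this error, here is a
minimal sketch of how a WAL segment name breaks down into a timeline ID
and a segment number.  The helper parse_wal_segment_name() is
hypothetical (it is not part of PostgreSQL), and the sketch assumes the
default 16MB segment size:

```python
def parse_wal_segment_name(name, wal_segment_size=16 * 1024 * 1024):
    """Split a 24-hex-digit WAL segment file name into (timeline, segment number).

    Hypothetical helper for illustration; assumes the default 16MB segment size.
    """
    if len(name) != 24:
        raise ValueError("expected 24 hexadecimal digits")
    timeline = int(name[0:8], 16)   # first 8 digits: timeline ID
    log_id = int(name[8:16], 16)    # middle 8 digits: high part of the position
    seg_id = int(name[16:24], 16)   # last 8 digits: segment within that part
    segments_per_log_id = 0x100000000 // wal_segment_size
    return timeline, log_id * segments_per_log_id + seg_id


timeline, segno = parse_wal_segment_name("000000010000000000000001")
print(timeline, segno)  # prints "1 1"
```

The segment in the error message therefore belongs to timeline 1, the
old timeline being left behind by the promotion.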

The origin of the problem is that the last partial segment of the old
timeline is renamed before recovering the 2PC data via
RecoverPreparedTransactions() at the end of recovery, causing the FATAL
because the wanted segment now carries a .partial suffix.  This commit
reorders the end-of-recovery actions so that the execution of
recovery_end_command, the cleanup of the old segments of the old
timeline (RemoveNonParentXlogFiles) and the rename of the last partial
segment are done after the 2PC transaction data is recovered with
RecoverPreparedTransactions().  This makes the order of these
end-of-recovery actions more consistent with v15 and newer, with the
exception of the end-of-recovery checkpoint, which still needs to happen
before all the actions reordered here in v13 and v14, contrary to what
v15 and newer versions do.
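The ordering dependency can be illustrated with a toy model.  This is a
sketch only, not PostgreSQL code: the two Python functions are
stand-ins that borrow their names from the commit message, and a
temporary directory plays the role of pg_wal.  It shows that reading
the segment fails once it has been renamed with a .partial suffix, and
succeeds when the read happens first:

```python
import os
import tempfile

SEGMENT = "000000010000000000000001"


def recover_prepared_transactions(wal_dir):
    """Stand-in for reading 2PC data: needs the segment under its original name."""
    with open(os.path.join(wal_dir, SEGMENT), "rb") as f:
        return f.read()


def rename_last_partial_segment(wal_dir):
    """Stand-in for the rename of the old timeline's last partial segment."""
    os.rename(os.path.join(wal_dir, SEGMENT),
              os.path.join(wal_dir, SEGMENT + ".partial"))


def end_of_recovery(steps):
    """Run the steps in order against a fresh toy pg_wal and report the outcome."""
    with tempfile.TemporaryDirectory() as wal_dir:
        with open(os.path.join(wal_dir, SEGMENT), "wb") as f:
            f.write(b"2PC state data")
        try:
            for step in steps:
                step(wal_dir)
        except FileNotFoundError:
            return "failed: segment no longer present under its original name"
        return "ok"


# Order before this commit: rename first, then try to read the 2PC data.
old_order = end_of_recovery([rename_last_partial_segment,
                             recover_prepared_transactions])
# Order after this commit: read the 2PC data first, then rename.
new_order = end_of_recovery([recover_prepared_transactions,
                             rename_last_partial_segment])
print(old_order)
print(new_order)
```

The model leaves out recovery_end_command, RemoveNonParentXlogFiles and
the end-of-recovery checkpoint; it only captures why the rename has to
come after the 2PC data is read.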

v15 and newer versions have "fixed" this problem somewhat accidentally
with 811051c, where the end-of-recovery actions got reordered.  In this
case, the recovery of 2PC transactions happens before the renaming of
the last partial segment of the old timeline.

v13 and v14 are the versions that can easily see this problem because of
the refactoring in 38a95731, where XLogReaderState is reset in
XLogBeginRead() before reading the 2PC transaction data.  v11 and v12
could also see this problem, but may end up reading the 2PC data from
some of the WAL buffers instead.  Perhaps something could be done for
these two branches, but I am not really excited about doing something on
these given the lack of complaints and the fact that v11 is going to be
EOL'd soon (there is always a risk of breaking something).

Note that the TAP test 009_twophase.pl is able to exhibit the issue if
it enables archiving on the primary node, which does not impact the test
coverage as restore_command would remain unused.  This is something that
should be changed on v15 and HEAD as well, so this will be changed in a
separate commit for clarity.

Author: Julian Markwort
Reviewed-by: Kyotaro Horiguchi, Michael Paquier
Discussion: https://postgr.es/m/743b9b45a2d4013bd90b6a5cba8d6faeb717ee34.camel@cybertec.at
Backpatch-through: 13

Branch
------
REL_13_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/896012b88396f45ce67bbc3fd15f72245f5239d1

Modified Files
--------------
src/backend/access/transam/xlog.c | 108 +++++++++++++++++++-------------------
1 file changed, 54 insertions(+), 54 deletions(-)

