Re: Race condition in recovery?

From Dilip Kumar
Subject Re: Race condition in recovery?
Date
Msg-id CAFiTN-spAMc6WsobbphZDDz+QuwNOmWfTeR6d2BX3W=_NMmP9g@mail.gmail.com
In response to Re: Race condition in recovery?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
Responses Re: Race condition in recovery?  (Kyotaro Horiguchi <horikyota.ntt@gmail.com>)
List pgsql-hackers
On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
>
> At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut@gmail.com> wrote in
> Could you please fix the test script so that it causes your issue
> correctly? And/or elaborate a bit more?
>
> The attached first file is the debugging aid logging. The second is
> the test script, to be placed in src/test/recovery/t.

I will look into your test case and see whether we can reproduce the
issue.  But first let me summarise the exact issue.  In
validateRecoveryParameters, if the recovery target is the latest
timeline, we fetch the latest history file and set recoveryTargetTLI
to the latest available timeline (assume it is 2), but we delay
updating expectedTLEs (as per commit
ee994272ca50f70b53074f0febaec97e28f83c4e).  Now, while reading the
checkpoint record, if we don't get the required WAL from the archive
then we try to get it from the primary, and to fetch the checkpoint
from the primary we use "ControlFile->checkPointCopy.ThisTimeLineID";
suppose that is the older timeline 1.  After reading the checkpoint,
we set expectedTLEs based on the timeline from which we got the
checkpoint record.

See the logic below in WaitForWalToBecomeAvailable:
                        if (readFile < 0)
                        {
                            if (!expectedTLEs)
                                expectedTLEs = readTimeLineHistory(receiveTLI);

Now, the first problem is that we are breaking the sanity of
expectedTLEs: by definition it should start with recoveryTargetTLI,
but here it starts with the older TLI.  Then, in rescanLatestTimeLine,
we try to fetch the latest TLI, which is still 2, so this logic
returns without reinitializing expectedTLEs, because it assumes that
if recoveryTargetTLI already points to 2 then expectedTLEs must be
correct and need not be changed.

See the logic below:
rescanLatestTimeLine(void)
{
    ....
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI)
    {
        /* No new timelines found */
        return false;
    }
    ...
    newExpectedTLEs = readTimeLineHistory(newtarget);
    ...
    expectedTLEs = newExpectedTLEs;
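
To make the sequence above easier to follow, here is a small
standalone toy model of it.  This is only a sketch and not PostgreSQL
code: the struct and every toy_* function are hypothetical names made
up for illustration, and expectedTLEs is reduced to a single number
(the newest TLI in the list, 0 meaning "not set yet").

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int TimeLineID;

/* Toy stand-in for the recovery state (hypothetical, not the real structs). */
typedef struct
{
    TimeLineID recoveryTargetTLI;  /* timeline recovery is aiming for */
    TimeLineID expectedTLEsHead;   /* newest TLI in expectedTLEs; 0 = unset */
} ToyRecoveryState;

/* Step 1: pick the latest timeline but leave expectedTLEs unset. */
static void
toy_validate_recovery_parameters(ToyRecoveryState *st, TimeLineID latest)
{
    st->recoveryTargetTLI = latest;
    st->expectedTLEsHead = 0;
}

/* Step 2: lazy initialization from the TLI the WAL was received on. */
static void
toy_wait_for_wal(ToyRecoveryState *st, TimeLineID receiveTLI)
{
    assert(receiveTLI != 0);
    if (st->expectedTLEsHead == 0)
        st->expectedTLEsHead = receiveTLI;
}

/* Step 3: rebuild expectedTLEs only when a newer timeline shows up. */
static bool
toy_rescan_latest_timeline(ToyRecoveryState *st, TimeLineID newestAvailable)
{
    if (newestAvailable == st->recoveryTargetTLI)
        return false;           /* "No new timelines found" */
    st->recoveryTargetTLI = newestAvailable;
    st->expectedTLEsHead = newestAvailable;
    return true;
}

/* The invariant under discussion: expectedTLEs starts at the target TLI. */
static bool
toy_expected_tles_sane(const ToyRecoveryState *st)
{
    return st->expectedTLEsHead == st->recoveryTargetTLI;
}

/* Runs the scenario; true means the stale expectedTLEs survives the rescan. */
static bool
toy_reproduce_stale_expected_tles(void)
{
    ToyRecoveryState st;
    bool        rescanned;

    toy_validate_recovery_parameters(&st, 2);   /* latest timeline is 2 */
    toy_wait_for_wal(&st, 1);                   /* checkpoint came from TLI 1 */
    rescanned = toy_rescan_latest_timeline(&st, 2);
    return !rescanned && !toy_expected_tles_sane(&st);
}
```

In this model the rescan never repairs the list, because the only
thing it compares is recoveryTargetTLI, which was already 2.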


Possible solutions:
1. Find a better way to fix the problem that commit
ee994272ca50f70b53074f0febaec97e28f83c4e addressed, since that commit
is what breaks the sanity of expectedTLEs.
2. Assume we have to live with that commit's fix and must initialize
expectedTLEs with an older timeline in order to validate the
checkpoint in the absence of the tl.history file (as this commit
claims).  Then, as soon as we have read and validated the checkpoint,
fix expectedTLEs by setting it based on the history file of
recoveryTargetTLI.
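
Option 2 could look roughly like the following.  Again this is only a
standalone sketch under assumptions, not a patch: SketchState and the
sketch_* functions are hypothetical, expectedTLEs is reduced to its
newest TLI, and the real code would re-read the history file of
recoveryTargetTLI where the sketch just assigns the number.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int TimeLineID;

/* Hypothetical, simplified recovery state (not the real structs). */
typedef struct
{
    TimeLineID recoveryTargetTLI;  /* timeline recovery is aiming for */
    TimeLineID expectedHead;       /* newest TLI in expectedTLEs; 0 = unset */
} SketchState;

/*
 * Hypothetical hook, called right after the checkpoint record has been
 * read and validated.  Returns true if expectedTLEs had to be repaired.
 */
static bool
sketch_fix_expected_tles_after_checkpoint(SketchState *st)
{
    /* By now expectedTLEs was initialized from some timeline. */
    assert(st->expectedHead != 0);

    if (st->expectedHead == st->recoveryTargetTLI)
        return false;           /* already consistent, nothing to do */

    /* Real code would rebuild the list from recoveryTargetTLI's history. */
    st->expectedHead = st->recoveryTargetTLI;
    return true;
}

/* Helper for experimenting: newest expected TLI after applying the hook. */
static TimeLineID
sketch_head_after_fix(TimeLineID target, TimeLineID checkpointTLI)
{
    SketchState st = {target, checkpointTLI};

    sketch_fix_expected_tles_after_checkpoint(&st);
    return st.expectedHead;
}
```

With target timeline 2 and a checkpoint read from timeline 1, the hook
moves the head of the list back to 2; if both are already 2 it does
nothing.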

Does this explanation make sense?  If not, please let me know which
part is unclear so I can point to the relevant code.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


