Обсуждение: exitArchiveRecovery woes

Поиск
Список
Период
Сортировка

exitArchiveRecovery woes

От
Heikki Linnakangas
Дата:
At the end of archive recovery, we copy the last segment from the old
timeline, to initialize the first segment on the new timeline. For
example, if the timeline switch happens in the middle of WAL segment
000000010000000000000005, the whole 000000010000000000000005 segment is
copied to become 000000020000000000000005. The copying is necessary, so
that the new segment contains valid data up to the switch point.

However, we wouldn't really need to copy the whole segment, copying up
to the switch point would be enough. In fact, copying the whole segment
is a bad idea, because the copied WAL looks valid on the new timeline
too. When we read the WAL at crash recovery, we rely on a number of
things to determine if the next WAL record is valid. Most importantly,
the checksum, and the prev-pointer. The checksum protects any random
data from appearing valid, and the prev-pointer makes sure that a WAL
record copied from another location in the WAL is not mistaken as valid.
The prev-pointer is particularly important when we recycle old WAL
segments as new, because the old segment contains valid WAL records with
checksums and all. When we copy a WAL segment with the same segment
number, the prev pointer doesn't protect us, as there can be WAL records
at the exact same locations in both segments. There is a timeline ID on
the page header, but we could still be mistaken within the page. Also,
we are lenient with the TLI at start of WAL recovery, when we read the
first WAL record after the checkpoint. There are further safeguards,
like the fact that when writing WAL, we always write full blocks. But
the write could still be torn at the OS or disk level, if you crash
after writing the WAL, but before fsyncing it.

This is largely academic, but I was able to craft a test case where WAL
recovery mistakenly starts to replay the WAL copied from the old
timeline, as if it was on the new timeline. Attached is a shell script I
used. It's very sensitive to the lengths of the WAL records, so probably
only works on a similar platform as mine (x86_64 Linux). Running
pitr-test.sh ends with this:

S LOG:  database system was interrupted; last known up at 2014-12-17
15:15:42 EET
S LOG:  database system was not properly shut down; automatic recovery
in progress
S LOG:  redo starts at 0/50A2018
S PANIC:  heap_insert_redo: invalid max offset number
S CONTEXT:  xlog redo Heap/INSERT: off 28
S LOG:  startup process (PID 10640) was terminated by signal 6: Aborted
S LOG:  aborting startup due to startup process failure

That PANIC happens because it tries to apply WAL from different
timeline, and it doesn't work because it missed an earlier change to the
same page it modifies. (If you were unlucky, you could get silent
corruption instead, if the WAL record happens to apply without an error)

A simple way to avoid this is to copy the old WAL segment only up to the
point of the timeline switch, and zero the rest.


Another thing I noticed is that we copy the last old WAL segment on the
new timeline, even if the timeline switch happens at a segment boundary.
In that case, the copied WAL segment is 100% identical to the old
segment; it contains no records belonging to the new timeline. I guess
that's not wrong per se, but it seems pointless and confusing.

Attached is a patch that addresses both of those issues. This doesn't
seem worth the risk to back-patch, but let's fix these in master.


PS. The "if (endTLI != ThisTimeLineID)" test in exitArchiveRecovery was
always true, because we always switch to a new timeline after archive
recovery. I turned that into an Assert.

- Heikki

Вложения

Re: exitArchiveRecovery woes

От
Robert Haas
Дата:
On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> At the end of archive recovery, we copy the last segment from the old
> timeline, to initialize the first segment on the new timeline. For example,
> if the timeline switch happens in the middle of WAL segment
> 000000010000000000000005, the whole 000000010000000000000005 segment is
> copied to become 000000020000000000000005. The copying is necessary, so that
> the new segment contains valid data up to the switch point.
>
> However, we wouldn't really need to copy the whole segment, copying up to
> the switch point would be enough. In fact, copying the whole segment is a
> bad idea, because the copied WAL looks valid on the new timeline too.

Your proposed change makes sense to me, but why do we need the segment
to contain valid data up to the switch point?  It seems like the
switch between timelines should be "crisper": replay WAL on the old
timeline only from the old segment, and from the new timeline only on
the new segment.  Anything else seems like an invitation to unending
corner-case bugs.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: exitArchiveRecovery woes

От
Heikki Linnakangas
Дата:
On 12/18/2014 03:53 PM, Robert Haas wrote:
> On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> At the end of archive recovery, we copy the last segment from the old
>> timeline, to initialize the first segment on the new timeline. For example,
>> if the timeline switch happens in the middle of WAL segment
>> 000000010000000000000005, the whole 000000010000000000000005 segment is
>> copied to become 000000020000000000000005. The copying is necessary, so that
>> the new segment contains valid data up to the switch point.
>>
>> However, we wouldn't really need to copy the whole segment, copying up to
>> the switch point would be enough. In fact, copying the whole segment is a
>> bad idea, because the copied WAL looks valid on the new timeline too.
>
> Your proposed change makes sense to me, but why do we need the segment
> to contain valid data up to the switch point?  It seems like the
> switch between timelines should be "crisper": replay WAL on the old
> timeline only from the old segment, and from the new timeline only on
> the new segment.  Anything else seems like an invitation to unending
> corner-case bugs.

True. That would require some changes to the way archive recovery works, 
though. Currently, when our recovery target timeline is, for example, 5, 
whose parents are 4 and 3, and we're currently on timeline 3, we will 
try to restore each segment first with timeline ID 5, then 4, then 3. 
It's a bit silly, because we know the timeline history and the exact 
points where the timelines changed, so we could just fetch the correct 
one. That would be a good idea, but I'm going to go ahead with just this 
smaller change now.

- Heikki



Re: exitArchiveRecovery woes

От
Robert Haas
Дата:
On Thu, Dec 18, 2014 at 10:49 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 12/18/2014 03:53 PM, Robert Haas wrote:
>> On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>
>>> At the end of archive recovery, we copy the last segment from the old
>>> timeline, to initialize the first segment on the new timeline. For
>>> example,
>>> if the timeline switch happens in the middle of WAL segment
>>> 000000010000000000000005, the whole 000000010000000000000005 segment is
>>> copied to become 000000020000000000000005. The copying is necessary, so
>>> that
>>> the new segment contains valid data up to the switch point.
>>>
>>> However, we wouldn't really need to copy the whole segment, copying up to
>>> the switch point would be enough. In fact, copying the whole segment is a
>>> bad idea, because the copied WAL looks valid on the new timeline too.
>>
>>
>> Your proposed change makes sense to me, but why do we need the segment
>> to contain valid data up to the switch point?  It seems like the
>> switch between timelines should be "crisper": replay WAL on the old
>> timeline only from the old segment, and from the new timeline only on
>> the new segment.  Anything else seems like an invitation to unending
>> corner-case bugs.
>
> True. That would require some changes to the way archive recovery works,
> though. Currently, when our recovery target timeline is, for example, 5,
> whose parents are 4 and 3, and we're currently on timeline 3, we will try to
> restore each segment first with timeline ID 5, then 4, then 3. It's a bit
> silly, because we know the timeline history and the exact points where the
> timelines changed, so we could just fetch the correct one. That would be a
> good idea, but I'm going to go ahead with just this smaller change now.

Yeah, it's a separate issue.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company