Обсуждение: exitArchiveRecovery woes
At the end of archive recovery, we copy the last segment from the old timeline, to initialize the first segment on the new timeline. For example, if the timeline switch happens in the middle of WAL segment 000000010000000000000005, the whole 000000010000000000000005 segment is copied to become 000000020000000000000005. The copying is necessary, so that the new segment contains valid data up to the switch point. However, we wouldn't really need to copy the whole segment, copying up to the switch point would be enough. In fact, copying the whole segment is a bad idea, because the copied WAL looks valid on the new timeline too. When we read the WAL at crash recovery, we rely on a number of things to determine if the next WAL record is valid. Most importantly, the checksum, and the prev-pointer. The checksum protects any random data from appearing valid, and the prev-pointer makes sure that a WAL record copied from another location in the WAL is not mistaken as valid. The prev-pointer is particularly important when we recycle old WAL segments as new, because the old segment contains valid WAL records with checksums and all. When we copy a WAL segment with the same segment number, the prev pointer doesn't protect us, as there can be WAL records at the exact same locations in both segments. There is a timeline ID on the page header, but we could still be mistaken within the page. Also, we are lenient with the TLI at start of WAL recovery, when we read the first WAL record after the checkpoint. There are further safeguards, like the fact that when writing WAL, we always write full blocks. But the write could still be torn at the OS or disk level, if you crash after writing the WAL, but before fsyncing it. This is largely academic, but I was able to craft a test case where WAL recovery mistakenly starts to replay the WAL copied from the old timeline, as if it was on the new timeline. Attached is a shell script I used. It's very sensitive to the lengths of the WAL records, so probably only works on a similar platform as mine (x86_64 Linux). Running pitr-test.sh ends with this: S LOG: database system was interrupted; last known up at 2014-12-17 15:15:42 EET S LOG: database system was not properly shut down; automatic recovery in progress S LOG: redo starts at 0/50A2018 S PANIC: heap_insert_redo: invalid max offset number S CONTEXT: xlog redo Heap/INSERT: off 28 S LOG: startup process (PID 10640) was terminated by signal 6: Aborted S LOG: aborting startup due to startup process failure That PANIC happens because it tries to apply WAL from different timeline, and it doesn't work because it missed an earlier change to the same page it modifies. (If you were unlucky, you could get silent corruption instead, if the WAL record happens to apply without an error) A simple way to avoid this is to copy the old WAL segment only up to the point of the timeline switch, and zero the rest. Another thing I noticed is that we copy the last old WAL segment on the new timeline, even if the timeline switch happens at a segment boundary. In that case, the copied WAL segment is 100% identical to the old segment; it contains no records belonging to the new timeline. I guess that's not wrong per se, but it seems pointless and confusing. Attached is a patch that addresses both of those issues. This doesn't seem worth the risk to back-patch, but let's fix these in master. PS. The "if (endTLI != ThisTimeLineID)" test in exitArchiveRecovery was always true, because we always switch to a new timeline after archive recovery. I turned that into an Assert. - Heikki
Вложения
On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > At the end of archive recovery, we copy the last segment from the old > timeline, to initialize the first segment on the new timeline. For example, > if the timeline switch happens in the middle of WAL segment > 000000010000000000000005, the whole 000000010000000000000005 segment is > copied to become 000000020000000000000005. The copying is necessary, so that > the new segment contains valid data up to the switch point. > > However, we wouldn't really need to copy the whole segment, copying up to > the switch point would be enough. In fact, copying the whole segment is a > bad idea, because the copied WAL looks valid on the new timeline too. Your proposed change makes sense to me, but why do we need the segment to contain valid data up to the switch point? It seems like the switch between timelines should be "crisper": replay WAL on the old timeline only from the old segment, and from the new timeline only on the new segment. Anything else seems like an invitation to unending corner-case bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 12/18/2014 03:53 PM, Robert Haas wrote: > On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas > <hlinnakangas@vmware.com> wrote: >> At the end of archive recovery, we copy the last segment from the old >> timeline, to initialize the first segment on the new timeline. For example, >> if the timeline switch happens in the middle of WAL segment >> 000000010000000000000005, the whole 000000010000000000000005 segment is >> copied to become 000000020000000000000005. The copying is necessary, so that >> the new segment contains valid data up to the switch point. >> >> However, we wouldn't really need to copy the whole segment, copying up to >> the switch point would be enough. In fact, copying the whole segment is a >> bad idea, because the copied WAL looks valid on the new timeline too. > > Your proposed change makes sense to me, but why do we need the segment > to contain valid data up to the switch point? It seems like the > switch between timelines should be "crisper": replay WAL on the old > timeline only from the old segment, and from the new timeline only on > the new segment. Anything else seems like an invitation to unending > corner-case bugs. True. That would require some changes to the way archive recovery works, though. Currently, when our recovery target timeline is, for example, 5, whose parents are 4 and 3, and we're currently on timeline 3, we will try to restore each segment first with timeline ID 5, then 4, then 3. It's a bit silly, because we know the timeline history and the exact points where the timelines changed, so we could just fetch the correct one. That would be a good idea, but I'm going to go ahead with just this smaller change now. - Heikki
On Thu, Dec 18, 2014 at 10:49 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 12/18/2014 03:53 PM, Robert Haas wrote: >> On Wed, Dec 17, 2014 at 8:40 AM, Heikki Linnakangas >> <hlinnakangas@vmware.com> wrote: >>> >>> At the end of archive recovery, we copy the last segment from the old >>> timeline, to initialize the first segment on the new timeline. For >>> example, >>> if the timeline switch happens in the middle of WAL segment >>> 000000010000000000000005, the whole 000000010000000000000005 segment is >>> copied to become 000000020000000000000005. The copying is necessary, so >>> that >>> the new segment contains valid data up to the switch point. >>> >>> However, we wouldn't really need to copy the whole segment, copying up to >>> the switch point would be enough. In fact, copying the whole segment is a >>> bad idea, because the copied WAL looks valid on the new timeline too. >> >> >> Your proposed change makes sense to me, but why do we need the segment >> to contain valid data up to the switch point? It seems like the >> switch between timelines should be "crisper": replay WAL on the old >> timeline only from the old segment, and from the new timeline only on >> the new segment. Anything else seems like an invitation to unending >> corner-case bugs. > > True. That would require some changes to the way archive recovery works, > though. Currently, when our recovery target timeline is, for example, 5, > whose parents are 4 and 3, and we're currently on timeline 3, we will try to > restore each segment first with timeline ID 5, then 4, then 3. It's a bit > silly, because we know the timeline history and the exact points where the > timelines changed, so we could just fetch the correct one. That would be a > good idea, but I'm going to go ahead with just this smaller change now. Yeah, it's a separate issue. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company