Discussion: Failing start-up archive recovery at Standby mode in PG9.2.4
Hi,

I found a problem where start-up archive recovery fails in Standby mode with PG 9.2.4 streaming replication. I tested for the same problem in PG 9.2.3, but it does not occur there... We searched the PostgreSQL git repository for the change that introduced this problem; it may have been introduced by this commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=ee8b95e8c807f31220f31bcff445d1dcb20c28b4

The attached script (replication_test.sh) reproduces this problem. If you use this script, you should set PGPATH and install pgbench. A summary of what the script does:

1. Create the Master and Standby instances.
2. Insert a large number of records into the Master using pgbench, then stop the Master with an immediate shutdown.
3. Start the Master and the Standby.
4. The Standby fails start-up archive recovery in ValidXLOGHeader, xlog.c:4395.
5. The Standby thinks archive recovery is finished. This is wrong!
6. The Standby connects to the Master for streaming replication, but it fails.

> cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restored log file "000000020000000000000013" from archive
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: RestoreArchivedFile, xlog.c:3273
> [Standby] 2013-04-22 01:27:25 EDT DEBUG: 00000: got WAL segment from archive
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: XLogFileReadAnyTLI, xlog.c:2930
> [Standby] 2013-04-22 01:27:25 EDT DEBUG: 00000: compacted fsync request queue from 4096 entries to 2 entries
> [Standby] 2013-04-22 01:27:25 EDT CONTEXT: xlog redo multi-insert (init): rel 1663/12870/24588; blk 41575; 61 tuples
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: CompactCheckpointerRequestQueue, checkpointer.c:1288
> [Standby] 2013-04-22 01:27:25 EDT DEBUG: 00000: checkpoint sync: number=1 file=base/12870/24588 time=822.092 msec
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: mdsync, md.c:1161
> [Standby] 2013-04-22 01:27:25 EDT DEBUG: 00000: checkpoint sync: number=2 file=base/12870/24588_fsm time=0.908 msec
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: mdsync, md.c:1161
> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restartpoint complete: wrote 6 buffers (0.1%); 0 transaction log file(s) added, 3 removed, 4 recycled; write=0.437 s, sync=0.823 s, total=1.274 s; sync files=2, longest=0.822 s, average=0.411 s
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: LogCheckpointEnd, xlog.c:7893
> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: recovery restart point at 0/F0FF8D0
> [Standby] 2013-04-22 01:27:25 EDT DETAIL: last completed transaction was at log time 2013-04-22 01:26:32.439575-04
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: CreateRestartPoint, xlog.c:8601
> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restartpoint starting: xlog
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: LogCheckpointStart, xlog.c:7821
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません

[* The incorrect behavior starts here]

> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:25 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:25 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:25 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:25 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:25 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
> [Master] 2013-04-22 01:27:25 EDT LOCATION: XLogRead, walsender.c:1023
> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: streaming replication successfully connected to primary
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
> [Standby] 2013-04-22 01:27:25 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
>
> [Standby] 2013-04-22 01:27:25 EDT LOCATION: libpqrcv_receive, libpqwalreceiver.c:389
>
> [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:30 EDT DEBUG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:30 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:30 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:30 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:30 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:30 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:30 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
> [Master] 2013-04-22 01:27:30 EDT LOCATION: XLogRead, walsender.c:1023
> [Standby] 2013-04-22 01:27:30 EDT LOG: 00000: streaming replication successfully connected to primary
> [Standby] 2013-04-22 01:27:30 EDT LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
> [Standby] 2013-04-22 01:27:30 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
>
> [Standby] 2013-04-22 01:27:30 EDT LOCATION: libpqrcv_receive, libpqwalreceiver.c:389
> /pg_ctl -D Standby cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-22 01:27:35 EDT DEBUG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> [Standby] 2013-04-22 01:27:35 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> [Master] 2013-04-22 01:27:35 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> [Master] 2013-04-22 01:27:35 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:35 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> [Master] 2013-04-22 01:27:35 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> [Master] 2013-04-22 01:27:35 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed

We tried to fix this problem with my colleagues at my company, but the WAL control and validation code is too difficult and complicated...

The problem occurs at xlog.c:10692:

> 10692    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> 10693        goto next_record_is_invalid;

And the failing check inside ValidXLOGHeader() is this one:
> 4390    if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
> 4391    {
> 4392        ereport(emode_for_corrupt_record(emode, recaddr),
> 4393                (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
> 4394                        hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
> 4395                        readId, readSeg, readOff)));
> 4396        return false;
> 4397    }

We think hdr->xlp_pageaddr.xrecoff has a wrong value. This is a very rare case! We cannot work out why this value is incorrect! Please help us fix this problem!

Best regards
--
NTT OSS Center Mitsumasa KONDO
Attachments
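As an aside, the numbers in the quoted "unexpected pageaddr" message decode cleanly by hand. The following is a small standalone sketch (illustrative code only, not PostgreSQL source; it assumes 9.2's default 16 MB segment size and 9.2's split xlogid/xrecoff page addresses):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint32_t XLogSegSize = 0x1000000;	/* 16 MB, the 9.2 default */
	uint32_t	readSeg = 20;		/* segment number from the log message */
	uint32_t	readOff = 0;		/* offset from the log message */
	uint32_t	expected = readSeg * XLogSegSize + readOff;
	uint32_t	actual = 0x06000000;	/* pageaddr reported in the log */

	printf("expected pageaddr 0/%X, found 0/%X (a page from segment %u)\n",
		   (unsigned) expected, (unsigned) actual,
		   (unsigned) (actual / XLogSegSize));
	return 0;
}

The expected page address for segment 20 is 0/14000000 (which also matches the START_REPLICATION 0/14000000 in the log), but the header contained 0/6000000, i.e. a page belonging to segment 6, which is consistent with reading a stale, recycled segment file.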
Hello,

On 2013-04-24 17:43:39 +0900, KONDO Mitsumasa wrote:
> Hi,
>
> I found a problem where start-up archive recovery fails in Standby mode with PG 9.2.4 streaming replication.
> I tested for the same problem in PG 9.2.3, but it does not occur there...
>
> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restored log file "000000020000000000000013" from archive

I can't read the error message here, but this looks suspicious. The recovery command seems to be returning success although it prints such an error message?

> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> [* The incorrect behavior starts here]
> > [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:25 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:25 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:25 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:25 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:25 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:25 EDT LOCATION: XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: streaming replication successfully connected to primary
> > [Standby] 2013-04-22 01:27:25 EDT LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:25 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
> >
> > [Standby] 2013-04-22 01:27:25 EDT LOCATION: libpqrcv_receive, libpqwalreceiver.c:389
> >
> > [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:30 EDT DEBUG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
> > [Standby] 2013-04-22 01:27:30 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
> > cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
> > cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
> > [Master] 2013-04-22 01:27:30 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
> > [Master] 2013-04-22 01:27:30 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
> > [Master] 2013-04-22 01:27:30 EDT LOCATION: HandleReplicationCommand, walsender.c:449
> > [Master] 2013-04-22 01:27:30 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
> > [Master] 2013-04-22 01:27:30 EDT LOCATION: XLogRead, walsender.c:1023
> > [Standby] 2013-04-22 01:27:30 EDT LOG: 00000: streaming replication successfully connected to primary
> > [Standby] 2013-04-22 01:27:30 EDT LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
> > [Standby] 2013-04-22 01:27:30 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed

And this seems to be the second problem.
You probably need to configure a higher wal_keep_segments on the primary, or you need to fix your restore_command.

> We tried to fix this problem with my colleagues at my company,
> but the WAL control and validation code is too difficult and complicated...
>
> The problem occurs at xlog.c:10692:
> > 10692    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
> > 10693        goto next_record_is_invalid;
>
> And the failing check inside ValidXLOGHeader() is this one:
> > 4390    if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
> > 4391    {
> > 4392        ereport(emode_for_corrupt_record(emode, recaddr),
> > 4393                (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
> > 4394                        hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
> > 4395                        readId, readSeg, readOff)));
> > 4396        return false;
> > 4397    }
>
> We think hdr->xlp_pageaddr.xrecoff has a wrong value. This is a very rare case!
> We cannot work out why this value is incorrect!

I think the test is just noticing that we don't have valid WAL - we are probably reading old preallocated WAL files from before starting the node as a standby. Normally it will get the correct data from the primary after that (you can see the START_REPLICATION commands in the log), but that failed.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hello,

> > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restored log file "000000020000000000000013" from archive
>
> I can't read the error message here, but this looks suspicious.

The error message is "No such file or directory" in Japanese :-). Perhaps restore_command is 'cp ../arc/%f %p'.

> The recovery command seems to be returning success although it prints such an error message?

It seems that, on 9.2.4, RestoreArchivedFile returns the expected xlog filename even if the restore command ends up in failure. And the caller XLogFileOpen successfully gets and returns an fd for that filename if the file has existed since before the restoration attempt. It does not matter when the existing file and the file expected to be restored are identical in content. If the segment includes the timeline increment point, something might happen, I suppose..

regards,
--
Kyotaro Horiguchi
Sorry,

> caller XLogFileOpen successfully gets and returns an fd for that filename

The caller is XLogFileRead in this case.

regards,
--
Kyotaro Horiguchi
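For context, the standby configuration implied by the quoted logs would look roughly like the recovery.conf below. This is an inferred sketch, not the configuration from the attached script; the primary_conninfo value in particular is a placeholder:

# recovery.conf on the standby (PostgreSQL 9.2); inferred, not from the attached script
standby_mode = 'on'
# placeholder connection string; the real host and port are not shown in the thread
primary_conninfo = 'host=localhost port=5432'
# matches the "cp: cannot stat ../arc/..." lines in the quoted output
restore_command = 'cp ../arc/%f %p'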
On 2013-04-24 19:16:12 +0900, Kyotaro HORIGUCHI wrote:
> Hello,
>
> > > cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
> > > [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restored log file "000000020000000000000013" from archive
>
> > I can't read the error message here, but this looks suspicious.
>
> The error message is "No such file or directory" in Japanese :-).
> Perhaps restore_command is 'cp ../arc/%f %p'.
>
> > The recovery command seems to be returning success although it prints such an error message?
>
> It seems that, on 9.2.4, RestoreArchivedFile returns the expected xlog
> filename even if the restore command ends up in failure. And the
> caller XLogFileOpen successfully gets and returns an fd for that filename
> if the file has existed since before the restoration attempt. It does not
> matter when the existing file and the file expected to be restored are
> identical in content.

But that's not what is happening here; afaics the "restored log file ..." message is only printed if the return code is 0. And the return value from RestoreArchivedFile() would be different in the case you describe, which would make XLogFileRead() fail, so it should result in different errors.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Oops,

> But that's not what is happening here; afaics the "restored log file ..."
> message is only printed if the return code is 0.

You're right. 'cp <nonexistent> <somewhere>' exits with status code 1 (or 256?). The quoted log lines simply show that the segment for tli=3 did not exist and that the one for tli=2 was finally obtained. That is quite normal and irrelevant to the trouble mentioned. Sorry for the confusion.

Unfortunately, the attached script didn't reproduce the situation for me. But in the attached log,

> [Standby] 2013-04-22 01:27:25 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
> cp: cannot stat `../arc/000000030000000000000014': No such file or directory.
> [Standby] 2013-04-22 01:27:30 EDT DEBUG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0

it seems that the standby received a negative ack for its request for the 20th segment, and 5 seconds later it tried to get that segment from the archive and somehow thought that it had gotten it, but the header is that of the 6th segment.

regards,
--
Kyotaro Horiguchi
I had a bit of a look at it and came up with a hypothesis.. umm, or a scenario.

==

Just after a restartpoint, too-old xlog files are recycled, but their page headers still have the old content, specifically xlogid and xrecoff. In addition, if the master's LSN is at the head of a new segment file, the file for that segment may not have been created yet, and the WAL send request for that LSN from the standby might fail with "requested WAL segment %s has already been removed", even though the segment is "not yet created"... So the standby receives a nack for the request and looks in the archive, but the file is, properly, not there. However, a file for that segment is certainly found in the pg_xlog directory. So XLogFileRead grabs the file whose content is old and...

> it seems that the standby received a negative ack for its request
> for the 20th segment, and 5 seconds later it tried to get that segment
> from the archive and somehow thought that it had gotten it, but the
> header is that of the 6th segment.

--
Kyotaro Horiguchi
NTT Open Source Software Center
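The "old content in a recycled segment" part of this scenario can be checked directly on disk. Below is a minimal standalone sketch (my own illustration, not PostgreSQL code) that prints the timeline and page address stored in the first page header of a WAL segment. It assumes the 9.2-era XLogPageHeaderData layout (uint16 magic, uint16 info, uint32 tli, then the split xlogid/xrecoff page address) and a build whose byte order matches the server's:

#include <stdio.h>
#include <stdint.h>

/* Mirrors the assumed 9.2 on-disk layout of XLogPageHeaderData. */
typedef struct
{
	uint16_t	xlp_magic;		/* magic value for sanity checks */
	uint16_t	xlp_info;		/* flag bits */
	uint32_t	xlp_tli;		/* timeline of first record on page */
	uint32_t	xlp_xlogid;		/* high half of the page address */
	uint32_t	xlp_xrecoff;	/* low half of the page address */
} WalPageHeaderSketch;

int
main(int argc, char **argv)
{
	WalPageHeaderSketch hdr;
	FILE	   *f;

	if (argc != 2 || (f = fopen(argv[1], "rb")) == NULL)
	{
		fprintf(stderr, "usage: %s <path-to-wal-segment>\n", argv[0]);
		return 1;
	}
	if (fread(&hdr, sizeof(hdr), 1, f) != 1)
	{
		fprintf(stderr, "could not read page header\n");
		fclose(f);
		return 1;
	}
	/* On a recycled-but-not-yet-overwritten segment, the tli/pageaddr
	 * here still describe the segment's previous life, not its name. */
	printf("xlp_tli=%u xlp_pageaddr=%X/%X\n",
		   (unsigned) hdr.xlp_tli,
		   (unsigned) hdr.xlp_xlogid, (unsigned) hdr.xlp_xrecoff);
	fclose(f);
	return 0;
}

Run against pg_xlog/000000030000000000000014 in the failing setup, this would be expected to report pageaddr 0/6000000, matching the "unexpected pageaddr" messages above.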
Hi,

>> I found a problem where start-up archive recovery fails in Standby mode with PG 9.2.4 streaming replication.
>> I tested for the same problem in PG 9.2.3, but it does not occur there...
>
>>> cp: cannot stat `../arc/000000030000000000000013': そのようなファイルやディレクトリはありません
>>> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: restored log file "000000020000000000000013" from archive
>
> I can't read the error message here, but this looks suspicious. The
> recovery command seems to be returning success although it prints such
> an error message?

Sorry.. This suspicious message is written in Japanese; it means "No such file or directory". It appears because the starting point of the timeline 3 WAL is segment 00000003000000000000002A.

>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>> [* The incorrect behavior starts here]
>>> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
>>> [Standby] 2013-04-22 01:27:25 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
>>> [Master] 2013-04-22 01:27:25 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
>>> [Master] 2013-04-22 01:27:25 EDT LOCATION: HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:25 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
>>> [Master] 2013-04-22 01:27:25 EDT LOCATION: HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:25 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
>>> [Master] 2013-04-22 01:27:25 EDT LOCATION: XLogRead, walsender.c:1023
>>> [Standby] 2013-04-22 01:27:25 EDT LOG: 00000: streaming replication successfully connected to primary
>>> [Standby] 2013-04-22 01:27:25 EDT LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
>>> [Standby] 2013-04-22 01:27:25 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
>>>
>>> [Standby] 2013-04-22 01:27:25 EDT LOCATION: libpqrcv_receive, libpqwalreceiver.c:389
>>>
>>> [mitsu-ko@localhost postgresql-9.2.4]$ bincp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> [Standby] 2013-04-22 01:27:30 EDT DEBUG: 00000: unexpected pageaddr 0/6000000 in log file 0, segment 20, offset 0
>>> [Standby] 2013-04-22 01:27:30 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
>>> cp: cannot stat `../arc/000000030000000000000014': そのようなファイルやディレクトリはありません
>>> cp: cannot stat `../arc/00000004.history': そのようなファイルやディレクトリはありません
>>> [Master] 2013-04-22 01:27:30 EDT DEBUG: 00000: received replication command: IDENTIFY_SYSTEM
>>> [Master] 2013-04-22 01:27:30 EDT LOCATION: HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:30 EDT DEBUG: 00000: received replication command: START_REPLICATION 0/14000000
>>> [Master] 2013-04-22 01:27:30 EDT LOCATION: HandleReplicationCommand, walsender.c:449
>>> [Master] 2013-04-22 01:27:30 EDT FATAL: 58P01: requested WAL segment 000000030000000000000014 has already been removed
>>> [Master] 2013-04-22 01:27:30 EDT LOCATION: XLogRead, walsender.c:1023
>>> [Standby] 2013-04-22 01:27:30 EDT LOG: 00000: streaming replication successfully connected to primary
>>> [Standby] 2013-04-22 01:27:30 EDT LOCATION: libpqrcv_connect, libpqwalreceiver.c:171
>>> [Standby] 2013-04-22 01:27:30 EDT FATAL: XX000: could not receive data from WAL stream: FATAL: requested WAL segment 000000030000000000000014 has already been removed
>
> And this seems to be the second problem. You probably need to configure
> a higher wal_keep_segments on the primary, or you need to fix your
> restore_command.

In this situation, I am trying to raise the timelineID via archive recovery, not only to recover WAL files. If archive recovery worked correctly on the Standby, the recovery end point would be 000000020000000000000029 on timelineID 2. Please see the details in pg9.2.4.log. The situation of the archive files is here.

>> We tried to fix this problem with my colleagues at my company,
>> but the WAL control and validation code is too difficult and complicated...
>>
>> The problem occurs at xlog.c:10692:
>>> 10692    if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
>>> 10693        goto next_record_is_invalid;
>>
>> And the failing check inside ValidXLOGHeader() is this one:
>>> 4390    if (!XLByteEQ(hdr->xlp_pageaddr, recaddr))
>>> 4391    {
>>> 4392        ereport(emode_for_corrupt_record(emode, recaddr),
>>> 4393                (errmsg("unexpected pageaddr %X/%X in log file %u, segment %u, offset %u",
>>> 4394                        hdr->xlp_pageaddr.xlogid, hdr->xlp_pageaddr.xrecoff,
>>> 4395                        readId, readSeg, readOff)));
>>> 4396        return false;
>>> 4397    }
>>
>> We think hdr->xlp_pageaddr.xrecoff has a wrong value. This is a very rare case!
>> We cannot work out why this value is incorrect!
>
> I think the test is just noticing that we don't have valid WAL - we are
> probably reading old preallocated WAL files from before starting the
> node as a standby. Normally it will get the correct data from the
> primary after that (you can see the START_REPLICATION commands in the
> log), but that failed.

I hope so! I will continue trying to analyze this problem..

Best regards
--
NTT Open Source Software Center
Mitsumasa KONDO
I also made a similar observation when I could reproduce this. I tried to find out why a restartpoint causes the recycled segment to be named after timeline 3, but have not been able to determine that.

When I looked at the source, I found that XLogFileReadAnyTLI, the function which returns a segment file for reading an XLog page, iterates over a list expectedTLIs, which starts with 3 in such a case (that is, when this error happens). XLogFileReadAnyTLI checks for the segment in both the archive and pg_xlog. So, even if such a segment (that is, one with timeline 3) is not in the archive, it may be in pg_xlog due to recycling, as we have observed. Such a recycled segment may then be returned by XLogFileReadAnyTLI as though it were the next segment to recover from, resulting in the "unexpected pageaddr ..." error.

I could not understand how, in the case where this error happens, the expectedTLIs list has 3 at its head (for example, when XLogFileReadAnyTLI uses it, as we observed), whereas at other times (when this error does not happen) it has 2 at its head until all the segments of timeline 2 have been recovered from. Also, how does a recycled segment get timeline 3 when 2 is expected in this case? Is this the right way to look at the problem and its possible fix?

I have tried going through the source regarding this, but have not been able to determine where this could accidentally happen, partly because I do not understand the recovery process (and its code) very well. Will post if I find anything useful.

regards,

Amit Langote
Hmm. I think that I have caught the tail of the problem..

A script with a small change reproduced the situation for me.

The latest standby uses 3 as its TLI after reading the history file 00000003.history, which it could get from the archive. So the WAL files recycled on this standby have TLI=3. Nevertheless, the LSN of the segment recycled on the standby is on TLI=2 on the master; the standby makes the first request for each segment with that LSN but TLI=3 to the master, because the standby runs with recoveryTargetTLI=3. The master reasonably doesn't have it, and the standby then finds that wrong WAL file in its pg_xlog directory before the second request, with TLI=2, would be made.

In conclusion, the standby should name recycled WAL segments using the same TLI the master uses for that LSN. Or it should never recycle WAL files..

regards,
--
Kyotaro Horiguchi
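The masking effect described here falls straight out of how segment file names are composed. A minimal sketch of the 9.2-style naming (illustrative only; it follows the XLogFileName convention of 8 hex digits each for timeline, log, and segment, but is not PostgreSQL source):

#include <stdio.h>
#include <stdint.h>

/* Compose a WAL segment file name the way 9.2's XLogFileName macro
 * does: 8 hex digits each of timeline, log id, and segment number. */
static void
wal_file_name(char *buf, size_t len, uint32_t tli, uint32_t log, uint32_t seg)
{
	snprintf(buf, len, "%08X%08X%08X",
			 (unsigned) tli, (unsigned) log, (unsigned) seg);
}

int
main(void)
{
	char		name[64];

	/* The same WAL position (log 0, segment 0x14)... */
	wal_file_name(name, sizeof(name), 2, 0, 0x14);
	printf("on the master's timeline:   %s\n", name);

	wal_file_name(name, sizeof(name), 3, 0, 0x14);
	printf("as recycled on the standby: %s\n", name);
	return 0;
}

Because the standby recycles under its recovery target timeline, the name it produces (000000030000000000000014) is exactly the name recovery later looks for on timeline 3, even though the file's content is still old data.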
On 25.04.2013 17:55, Kyotaro HORIGUCHI wrote:
> Hmm. I think that I have caught the tail of the problem..
>
> A script with a small change reproduced the situation for me.

Can you share the modified script, please?

> The latest standby uses 3 as its TLI after reading the history file
> 00000003.history, which it could get from the archive. So the WAL files
> recycled on this standby have TLI=3.
> Nevertheless, the LSN of the segment recycled on the standby is on
> TLI=2 on the master; the standby makes the first request for each
> segment with that LSN but TLI=3 to the master, because the standby
> runs with recoveryTargetTLI=3. The master reasonably doesn't have it, and
> the standby then finds that wrong WAL file in its pg_xlog directory
> before the second request, with TLI=2, would be made.

I'm not sure I understand what the problem is, though. When the standby opens the bogus, recycled file in pg_xlog, it will notice that the header is incorrect, and retry reading the file from the archive.

> In conclusion, the standby should name recycled WAL segments using
> the same TLI the master uses for that LSN. Or it should never recycle WAL
> files..

Perhaps, but it should nevertheless not get confused by recycled segments.

- Heikki
I forgot this one.

> In conclusion, the standby should name recycled WAL segments using the same TLI the master uses for that LSN. Or it should never recycle WAL files.

Or the standby should make the first request with the correct TLI by consulting the timeline history. Or the standby should retry with smaller TLIs after it gets a broken segment.

regards,
--
Kyotaro Horiguchi
> Can you share the modified script, please?

Please find the attached files:

test.sh : the test script. The most significant change is the load; I used simple inserts instead of pgbench. It might need some more adjustment for other environments, as usual with my scripts.

xlog.c.diff : additional log output I thought would be useful for diagnosis.

> I'm not sure I understand what the problem is, though. When the standby
> opens the bogus, recycled file in pg_xlog, it will notice that the header
> is incorrect, and retry reading the file from the archive.

It looks at the corrupted header only once. After that, it keeps retrying without looking there again.

> Perhaps, but it should nevertheless not get confused by recycled segments.

regards, and good night.
--
Kyotaro Horiguchi
Attachments
On 25.04.2013 18:56, Kyotaro HORIGUCHI wrote:
>> Can you share the modified script, please?
>
> Please find the attached files:
> test.sh : the test script. The most significant change is the load;
> I used simple inserts instead of pgbench.
> It might need some more adjustment for other environments,
> as usual with my scripts.
> xlog.c.diff : additional log output I thought would be useful for diagnosis.

Ok, thanks, I see what's going on now. The problem is that once XLogFileRead() finds a file with tli X, it immediately sets curFileTLI = X. XLogFileReadAnyTLI() never tries to read files with tli < curFileTLI. So, if recovery finds a file with the right filename, e.g. 000000030000000000000008, it never tries to open 000000020000000000000008 anymore, even if the contents of 000000030000000000000008 later turn out to be bogus.

One idea to fix this is to not set curFileTLI until the page header of the just-opened file has been verified. Another idea is to change the check in XLogFileReadAnyTLI() that currently forbids curFileTLI from moving backwards. We could allow curFileTLI to move backwards, as long as the tli is >= ThisTimeLineID (ThisTimeLineID is the current timeline we're recovering records from).

Attached is a patch for the 2nd approach. With the patch, the test script works for me. Thoughts?

PS. This wasn't caused by the 9.2.4 change to do crash recovery before entering archive recovery. The test script fails in the same way with 9.2.3 as well.

- Heikki
Attachments
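To make the second approach concrete, here is a self-contained simulation of the candidate-timeline loop (hypothetical code, not the attached patch; the variable names are borrowed from xlog.c for readability):

#include <stdio.h>

typedef unsigned int TimeLineID;

int
main(void)
{
	/* Timelines recovery may read from, newest first, as in expectedTLIs. */
	TimeLineID	expectedTLIs[] = {3, 2, 1};
	TimeLineID	curFileTLI = 3;			/* last opened file was (bogus) TLI 3 */
	TimeLineID	ThisTimeLineID = 2;		/* timeline currently being replayed */
	int			i;

	for (i = 0; i < 3; i++)
	{
		TimeLineID	tli = expectedTLIs[i];

		/* Old rule: stop as soon as tli < curFileTLI; with curFileTLI
		 * stuck at 3, that never retries timeline 2's segment.
		 * Relaxed rule: only refuse to go below the timeline actually
		 * being replayed, so the TLI 2 file can still be tried. */
		if (tli < ThisTimeLineID)
			break;
		printf("try segment on timeline %u\n", tli);
	}
	return 0;
}

With the old rule the loop would try only timeline 3 and give up; with the relaxed rule it also tries timeline 2, where the valid segment lives.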
Thank you for the patch.

The test script finishes successfully with it. And it looks reasonable on a short glance.

On Fri, Apr 26, 2013 at 4:34 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> One idea to fix this is to not set curFileTLI until the page header of the
> just-opened file has been verified. Another idea is to change the check in
> XLogFileReadAnyTLI() that currently forbids curFileTLI from moving
> backwards. We could allow curFileTLI to move backwards, as long as the
> tli is >= ThisTimeLineID (ThisTimeLineID is the current timeline we're
> recovering records from).
>
> Attached is a patch for the 2nd approach. With the patch, the test script
> works for me. Thoughts?

I am a bit uncertain whether it is necessary to move curFileTLI for randomly-read pages. On a short glance, random access also occurs when reading checkpoint-related records. Also, I don't have a clear distinction between lastSegmentTLI and curFileTLI after the patch is applied, although I need to look closer around them to understand.

> PS. This wasn't caused by the 9.2.4 change to do crash recovery before
> entering archive recovery. The test script fails in the same way with 9.2.3
> as well.

--
Kyotaro Horiguchi
How would the code behave after applying this patch if a recycled segment gets renamed using the newest timeline (say 3) while we are still recovering from a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns that recycled segment as the next segment to recover from, we get the error. And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to be recoveryTargetTLI at all times, is that right?), it will return the wrong (recycled) segment in the first iteration itself.

The code for renaming the recycled segment remains unaffected by this patch, right? That is, the code which assigns a timelineID to newly created / recycled segments is not changed. (A snippet from CreateRestartPoint():)

    /*
     * Update ThisTimeLineID to the recovery target timeline, so that
     * we install any recycled segments on the correct timeline.
     */
>>>> ThisTimeLineID = GetRecoveryTargetTLI(); <<<<<

    RemoveOldXlogFiles(_logId, _logSeg, endptr);

    /*
     * Make more log segments if needed. (Do this after recycling old log
     * segments, since that may supply some of the needed files.)
     */
    PreallocXlogFiles(endptr);

So, would we still get similar behavior (wherein wrongly named recycled segments are produced) and cause the error to happen again? Do we need to look here? What do you think?
On 26.04.2013 07:02, Kyotaro HORIGUCHI wrote:
> I am a bit uncertain whether it is necessary to move curFileTLI for
> randomly-read pages. On a short glance, random access also occurs
> when reading checkpoint-related records.

I didn't understand that.

> Also, I don't have a clear distinction between lastSegmentTLI and curFileTLI
> after the patch is applied.

In short, lastSegmentTLI is the TLI on the first page of the last opened segment, while curFileTLI is the TLI in the filename of the last opened segment. Usually they're the same, but in a segment that contains a timeline switch, they differ. For example, if you perform PITR to a point in the middle of segment 000000010000000000000062, you end up with two segments:

000000010000000000000062
000000020000000000000062

The first halves of those files are identical, but the latter contains a timeline-changing checkpoint record in the middle, and from that point on the contents are different. When we open file 000000020000000000000062, lastSegmentTLI is 1, because the first half of that segment contains WAL from timeline 1, but curFileTLI is 2, because that's the TLI in the filename.

- Heikki
What would happen if a recycled segment gets assigned a newer timeline than the one we are currently recovering from? In the reported erroneous behavior, that is what happens, causing XLogFileReadAnyTLI() to return such a bogus segment and hence the error. Since expectedTLIs contains a newer timeline at its head, and if such a wrongly recycled segment (that is, one bearing a newer timeline than curFileTLI) exists in pg_xlog, XLogFileReadAnyTLI() would return the fd of that segment itself, causing the error in question. On the next retry, the same thing would happen and prevent standby startup from proceeding any further.

So, should we also look at the segment-recycling code to see how it names the recycled segments (what timelineID it uses)? For example, in CreateRestartPoint(), where segments are recycled?
On 26.04.2013 07:47, Amit Langote wrote:
> How would the code behave after applying this patch if a recycled segment gets
> renamed using the newest timeline (say 3) while we are still recovering from
> a lower timeline (say 2)? In that case, since XLogFileReadAnyTLI returns
> that recycled segment as the next segment to recover from, we get the error.
> And since XLogFileReadAnyTLI iterates over expectedTLIs (whose head seems to
> be recoveryTargetTLI at all times, is that right?), it will return the
> wrong (recycled) segment in the first iteration itself.

As long as the right segment is present in the archive, that's OK. Even if a recycled segment with a higher TLI is in pg_xlog, with the patch we'll still read the segment with the lower TLI from the archive. But there is a corner case where a recycled segment with a higher TLI masks a segment with a lower TLI in pg_xlog. For example, if you try to recover by copying all the required WAL files directly into pg_xlog, without using restore_command, you could run into problems. So yeah, I think you're right and we need to rethink the recycling.

The first question is, do we have to recycle WAL segments during recovery at all? It's pointless when we're restoring from archive with restore_command; the recycled files will just get replaced with files from the archive. It does help when walreceiver is active, but I wonder how significant it is in practice.

I guess the safest, smallest change is to use a lower TLI when installing the recycled files. So, instead of using the current recovery target timeline, use the ID of the timeline we're currently recovering. That way the recycled segments will never have a higher TLI than other files that recovery will try to replay. See attached patch.

- Heikki
Attachments
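The gist of that "safest, smallest change" can be sketched as a rule for choosing the TLI stamped on recycled segments (a hypothetical illustration of the idea only, not the attached patch; replayTLI stands for whatever variable tracks the timeline currently being replayed):

#include <stdio.h>

typedef unsigned int TimeLineID;

/* Sketch only: choose the TLI for naming recycled segments during recovery. */
static TimeLineID
recycle_tli(TimeLineID recoveryTargetTLI, TimeLineID replayTLI)
{
	(void) recoveryTargetTLI;	/* the old behavior effectively used this */
	/* Proposed: use the timeline currently being replayed, so recycled
	 * files can never carry a higher TLI than segments recovery may
	 * still need to read. */
	return replayTLI;
}

int
main(void)
{
	/* Recovery target TLI 3, currently replaying TLI 2, as in this thread. */
	printf("recycle segments under TLI %u\n", recycle_tli(3, 2));
	return 0;
}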
Hi,

I discovered the cause of the problem. I think that Horiguchi's discovery is a different problem...

The problem is in CreateRestartPoint. In recovery mode, PG should not write WAL, because PG does not know the latest WAL location. But in this problem case, PG (the standby) wrote a WAL file at a RestartPoint during archive recovery. In recovery mode, I think a RestartPoint should write only up to MinRecoveryPoint.

Here is the Standby's pg_xlog directory when the problem occurred:

> [mitsu-ko@localhost postgresql-9.2.4-c]$ ls Standby/pg_xlog/
> 000000020000000000000003  000000020000000000000007  00000002000000000000000B  00000003.history
> 000000020000000000000004  000000020000000000000008  00000002000000000000000C  00000003000000000000000E
> 000000020000000000000005  000000020000000000000009  00000002000000000000000D  00000003000000000000000F
> 000000020000000000000006  00000002000000000000000A  00000002000000000000000E  archive_status

This problem case is here:

> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: attempting to remove WAL segments older than log file 000000030000000000000002
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles, xlog.c:3568
> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: recycled transaction log file "000000010000000000000002"
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles, xlog.c:3607
> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: recycled transaction log file "000000020000000000000002"
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles, xlog.c:3607
> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: restartpoint complete: wrote 9 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=0.601 s, sync=1.178 s, total=1.781 s; sync files=3, longest=1.176 s, average=0.392 s
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointEnd, xlog.c:7893
> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: recovery restart point at 0/90FE448
> [Standby] 2013-04-26 04:26:44 EDT DETAIL: last completed transaction was at log time 2013-04-26 04:25:53.203725-04
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: CreateRestartPoint, xlog.c:8601
> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: restartpoint starting: xlog
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointStart, xlog.c:7821
> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: could not restore file "00000003000000000000000F" from archive: return code 256
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RestoreArchivedFile, xlog.c:3323
> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: unexpected pageaddr 0/2000000 in log file 0, segment 15, offset 0
> [Standby] 2013-04-26 04:26:44 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイルやディレクトリはありません
> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: could not restore file "00000003000000000000000F" from archive: return code 256

During recovery, PG normally looks for a WAL file via archive recovery first. If the proper WAL file is not there (the archive command fails), it next searches the pg_xlog directory. Normally, the proper WAL file does not exist in pg_xlog. But in this case there is a file with the proper name (but with wrong, bogus content) in pg_xlog. So recovery fails and the Standby is broken.

So I fixed CreateRestartPoint at the branching point around MinRecoveryPoint. It seems to fix this problem, but the attached patch is quite plain.

Best regards,
--
NTT Open Source Software Center
Mitsumasa KONDO
Attachments
Let me explain this problem in more detail.
This problem is caused by a RestartPoint creating a bogus WAL file during archive recovery. But I cannot work out why the bogus WAL file is created in CreateRestartPoint(). My attached patch is really plain…
In the problem case, in XLogFileReadAnyTLI(), the first check for the WAL file does not get an fd, because the proper WAL file does not exist in the archive directory.
XLogFileReadAnyTLI()
> if (sources & XLOG_FROM_ARCHIVE)
> {
>     fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_ARCHIVE, true);
>     if (fd != -1)
>     {
>         elog(DEBUG1, "got WAL segment from archive");
>         return fd;
>     }
> }
Next, it searches for the WAL file in pg_xlog. The bogus WAL file is there, and its fd is returned.
XLogFileReadAnyTLI()
> if (sources & XLOG_FROM_PG_XLOG)
> {
>     fd = XLogFileRead(log, seg, emode, tli, XLOG_FROM_PG_XLOG, true);
>     if (fd != -1)
>         return fd;
> }
The returned fd becomes the value of readFile. Of course, readFile is now >= 0, so we break out of the for-loop.
XLogPageRead()
> readFile = XLogFileReadAnyTLI(readId, readSeg, DEBUG2,
>                               sources);
> switched_segment = true;
> if (readFile >= 0)
>     break;
Next, the point where the problem manifests. The bogus WAL file is read, and validation fails.
XLogPageRead()
> if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
> {
>     ereport(emode_for_corrupt_record(emode, *RecPtr),
>             (errcode_for_file_access(),
>              errmsg("could not seek in log file %u, segment %u to offset %u: %m",
>                     readId, readSeg, readOff)));
>     goto next_record_is_invalid;
> }
> if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
> {
>     ereport(emode_for_corrupt_record(emode, *RecPtr),
>             (errcode_for_file_access(),
>              errmsg("could not read from log file %u, segment %u, offset %u: %m",
>                     readId, readSeg, readOff)));
>     goto next_record_is_invalid;
> }
> if (!ValidXLOGHeader((XLogPageHeader) readBuf, emode, false))
>     goto next_record_is_invalid;
I think that Horiguchi's discovery concerns the code after this point.
We must fix CreateRestartPoint() so that it does not create bogus WAL files.
Best regards,
--
Mitsumasa KONDO
On 26.04.2013 11:51, KONDO Mitsumasa wrote:
> Hi,
>
> I discovered the cause of the problem. I think that Horiguchi's discovery is
> a different problem...
> The problem is in CreateRestartPoint. In recovery mode, PG should not write WAL,
> because PG does not know the latest WAL location.
> But in this problem case, PG (the standby) wrote a WAL file at a RestartPoint
> during archive recovery.
> In recovery mode, I think a RestartPoint should write only up to
> MinRecoveryPoint.
>
> Here is the Standby's pg_xlog directory when the problem occurred:
>> [mitsu-ko@localhost postgresql-9.2.4-c]$ ls Standby/pg_xlog/
>> 000000020000000000000003  000000020000000000000007  00000002000000000000000B  00000003.history
>> 000000020000000000000004  000000020000000000000008  00000002000000000000000C  00000003000000000000000E
>> 000000020000000000000005  000000020000000000000009  00000002000000000000000D  00000003000000000000000F
>> 000000020000000000000006  00000002000000000000000A  00000002000000000000000E  archive_status
>
> This problem case is here:
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: attempting to remove WAL segments older than log file 000000030000000000000002
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles, xlog.c:3568
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: recycled transaction log file "000000010000000000000002"
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles, xlog.c:3607
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: recycled transaction log file "000000020000000000000002"
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RemoveOldXlogFiles, xlog.c:3607
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: restartpoint complete: wrote 9 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 2 recycled; write=0.601 s, sync=1.178 s, total=1.781 s; sync files=3, longest=1.176 s, average=0.392 s
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointEnd, xlog.c:7893
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: recovery restart point at 0/90FE448
>> [Standby] 2013-04-26 04:26:44 EDT DETAIL: last completed transaction was at log time 2013-04-26 04:25:53.203725-04
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: CreateRestartPoint, xlog.c:8601
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: restartpoint starting: xlog
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: LogCheckpointStart, xlog.c:7821
>> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイルやディレクトリはありません
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: could not restore file "00000003000000000000000F" from archive: return code 256
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: RestoreArchivedFile, xlog.c:3323
>> [Standby] 2013-04-26 04:26:44 EDT LOG: 00000: unexpected pageaddr 0/2000000 in log file 0, segment 15, offset 0
>> [Standby] 2013-04-26 04:26:44 EDT LOCATION: ValidXLOGHeader, xlog.c:4395
>> cp: cannot stat `../arc/00000003000000000000000F': そのようなファイルやディレクトリはありません
>> [Standby] 2013-04-26 04:26:44 EDT DEBUG: 00000: could not restore file "00000003000000000000000F" from archive: return code 256
>
> During recovery, PG normally looks for a WAL file via archive recovery first.
> If the proper WAL file is not there (the archive command fails), it next
> searches the pg_xlog directory.
> Normally, the proper WAL file does not exist in pg_xlog.
> But in this case there is a file with the proper name (but with wrong, bogus
> content) in pg_xlog.
> So recovery fails and the Standby is broken.
>
> So I fixed CreateRestartPoint at the branching point around
> MinRecoveryPoint.
> It seems to fix this problem, but the attached patch is quite plain.

I didn't understand this. I committed a fix for the issue where recycled WAL files get in the way of recovery, but I'm not sure if you're describing the same issue or something else. But before we dig any deeper into this, can you verify whether you're still seeing any issues on 9.3beta1?

- Heikki
(2013/05/07 22:40), Heikki Linnakangas wrote:
> On 26.04.2013 11:51, KONDO Mitsumasa wrote:
>> So I fixed CreateRestartPoint at the branching point around
>> MinRecoveryPoint.
>> It seems to fix this problem, but the attached patch is quite plain.
>
> I didn't understand this. I committed a fix for the issue where recycled WAL
> files get in the way of recovery, but I'm not sure if you're describing the same
> issue or something else. But before we dig any deeper into this, can you verify
> whether you're still seeing any issues on 9.3beta1?

Yes, my fix was very plain and fixed the wrong point. I have seen your patch; your fix is right. I posted about another problem, namely that promotion and PITR fail. It might be a different problem, but I could not reproduce it... Please wait for more! I will try to analyze and test 9.2.4 with your patch, and 9.3beta1.

Best regards,
--
NTT Open Source Software Center
Mitsumasa KONDO