Обсуждение: Re: [GENERAL] 8.1.4 - problem with PITR - .backup.done / backup.ready version of the same file at the same time.

Поиск
Список
Период
Сортировка
"Rafael Martinez, Guerrero" <r.m.guerrero@usit.uio.no> writes:
> The problem was that 000000010000000800000010.0006D5E8.backup was
> already archived, but under pg_xlog/archive_status/ there were two
> files:
> -------------------------------------------------
> 000000010000000800000010.0006D5E8.backup.done
> 000000010000000800000010.0006D5E8.backup.ready
> -------------------------------------------------

> This situation should not happen, anyone has seen this problem before?

No, it shouldn't.  What I suspect is that XLogArchiveIsDone() got
confused and created a duplicate .ready file.  It basically assumes
that the only way its stat() calls can fail is ENOENT, ie, file not
there ... but I wonder if they failed for some other reason instead.
What sort of platform and filesystem is this on?

Did you happen to make note of the mod times of the two files before
deleting them?
        regards, tom lane


Re: [GENERAL] 8.1.4 - problem with PITR - .backup.done /

От
Rafael Martinez
Дата:
On Tue, 2006-05-30 at 09:45 -0400, Tom Lane wrote:
> "Rafael Martinez, Guerrero" <r.m.guerrero@usit.uio.no> writes:
> > The problem was that 000000010000000800000010.0006D5E8.backup was
> > already archived, but under pg_xlog/archive_status/ there were two
> > files:
> > -------------------------------------------------
> > 000000010000000800000010.0006D5E8.backup.done
> > 000000010000000800000010.0006D5E8.backup.ready
> > -------------------------------------------------
> 
> > This situation should not happen, anyone has seen this problem before?
> 
> No, it shouldn't.  What I suspect is that XLogArchiveIsDone() got
> confused and created a duplicate .ready file.  It basically assumes
> that the only way its stat() calls can fail is ENOENT, ie, file not
> there ... but I wonder if they failed for some other reason instead.
> What sort of platform and filesystem is this on?
> 

This is on a AMD64 Opteron server with RHELAS4 / 2.6.9-34.ELsmp and ext3
filesystem. This is the first time this happens.

I cannot the postgres internals but after a quick look to the source
code ......

XLogArchiveIsDone() has this code in the final of the function:
-------------------------------------------------/* Race condition --- maybe archiver just finished, so recheck */
StatusFilePath(archiveStatusPath, xlog, ".done");       if (stat(archiveStatusPath, &stat_buf) == 0)
returntrue;
 
       /* Retry creation of the .ready file */       XLogArchiveNotify(xlog);       return false;
}
-------------------------------------------------

What happens if we have a race condition and the archiver creates
a .done file between the last check for the .done file and the creation
of the .ready file by XLogArchiveNotify?

> Did you happen to make note of the mod times of the two files before
> deleting them?
> 

No, I did not :( If it happens again, I will do.

regards,
-- 
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/



Rafael Martinez <r.m.guerrero@usit.uio.no> writes:
> What happens if we have a race condition and the archiver creates
> a .done file between the last check for the .done file and the creation
> of the .ready file by XLogArchiveNotify?

That can't happen; the archiver creates the .done file by rename()ing
the previous .ready file, which is (supposed to be) an atomic action.
If the .ready file isn't there, and then after that we see that the
.done file isn't there, then either neither of them are there or the
filesystem is seriously broken.

My thought is that the stat()s on the .done file failed for some obscure
reason, perhaps insufficient kernel resources, even though the file was 
actually there.

If you have postmaster log output for the interval in which this
happened, it would be interesting to look for occurrences of this
warning message from pgarch_archiveDone:
   if (rename(rlogready, rlogdone) < 0)       ereport(WARNING,               (errcode_for_file_access(),
errmsg("couldnot rename file \"%s\" to \"%s\": %m",                       rlogready, rlogdone)));
 

If you find any then we might need a different theory ...
        regards, tom lane


Re: [GENERAL] 8.1.4 - problem with PITR - .backup.done /

От
Rafael Martinez
Дата:
On Tue, 2006-05-30 at 15:38 -0400, Tom Lane wrote:
[.......]
> 
> My thought is that the stat()s on the .done file failed for some obscure
> reason, perhaps insufficient kernel resources, even though the file was 
> actually there.
> 
> If you have postmaster log output for the interval in which this
> happened, it would be interesting to look for occurrences of this
> warning message from pgarch_archiveDone:
> 
>     if (rename(rlogready, rlogdone) < 0)
>         ereport(WARNING,
>                 (errcode_for_file_access(),
>                  errmsg("could not rename file \"%s\" to \"%s\": %m",
>                         rlogready, rlogdone)));
> 
> If you find any then we might need a different theory ...
> 

I do not find any warning message "could not rename file ...". These are
the relevant entries in the log file:

--------------------------------------------------------
[2006-05-29 17:31:55.212 CEST]   12022 LOG:  archived transaction log
file "00000001000000080000000F"

**** PITR_basebackup script started around 17:32 ****

[2006-05-29 17:40:27.735 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000010"
[2006-05-29 17:49:32.075 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000011"
[2006-05-29 17:59:40.575 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000012"
[2006-05-29 18:08:27.229 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000013"
[2006-05-29 18:11:36.434 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000010.0006D5E8.backup"

[2006-05-29 18:11:36.467 CEST]   12022 LOG:  archive command
"archive_wal.sh -P pg_xlog/000000010000000800000010.0006D5E8.backup -F
000000010000000800000010.0006D5E8.backup" failed: return code 256

[2006-05-29 18:11:37.479 CEST]   12022 LOG:  archive command
"archive_wal.sh -P pg_xlog/000000010000000800000010.0006D5E8.backup -F
000000010000000800000010.0006D5E8.backup" failed: return code 256

[2006-05-29 18:11:38.492 CEST]   12022 LOG:  archive command
"archive_wal.sh -P pg_xlog/000000010000000800000010.0006D5E8.backup -F
000000010000000800000010.0006D5E8.backup" failed: return code 256

[2006-05-29 18:11:38.492 CEST]   12022 WARNING:  transaction log file
"000000010000000800000010.0006D5E8.backup" could not be archived: too
many failures

**** PITR_basebackup script finnished 18:12:16 ****
...............................
**** Same error several times until we deleted the .backup.ready file at
18:15 ****

[2006-05-29 18:19:14.546 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000014"
[2006-05-29 18:30:10.939 CEST]   12022 LOG:  archived transaction log
file "000000010000000800000015"
...............................
--------------------------------------------------------

Our PITR_basebackup script does this:

* Checks if Backup label file exists
* Starts Backup process with pg_start_backup()
* Creates a LVM2 Snapshot of data partition
* Mounts the Snapshot partition
* Creates a tar.bz2 file of data
* Umounts Snapshot partition
* Removes Snapshot LV
* Backup last WAL file not yet archived
* Stops Backup process with pg_stop_backup()
* Waits for *.backup file to appear under the archivedir 
* Removes old WAL archived files
* Removes old tar.bz2 data file

To create the tar.bz file and to delete old WAL files can take some
time. The total running time of the PITR_basebackup script was 2412 sec.

If we get the same problem again, I will try to get more information
from the system. As I said in my last e-mail, this has been a one time
problem.

regards,
-- 
Rafael Martinez, <r.m.guerrero@usit.uio.no>
Center for Information Technology Services
University of Oslo, Norway

PGP Public Key: http://folk.uio.no/rafael/