Обсуждение: BUG #13888: pg_dump write error
The following bug has been logged on the website: Bug reference: 13888 Logged by: Vladimir Kunschikov Email address: kunschikov@gmail.com PostgreSQL version: 9.4.5 Operating system: Linux Description: Database backup in heavy loaded environment sometimes fails with the following error: pg_dump: [parallel archiver] could not write to output file: Success Subsequent rerun of the pg_dump solves that problem. More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file. There can be some error checking instead of this macro.
kunschikov@gmail.com wrote: > Database backup in heavy loaded environment sometimes fails with the > following error: > > pg_dump: [parallel archiver] could not write to output file: Success > > Subsequent rerun of the pg_dump solves that problem. > More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in > ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file. > There can be some error checking instead of this macro. Yeah, I noticed this and similar lacks of error checks in pg_dump in code review, which I didn't get around to patching. Care to submit a patch? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > kunschikov@gmail.com wrote: > >> Database backup in heavy loaded environment sometimes fails with the >> following error: >> >> pg_dump: [parallel archiver] could not write to output file: Success >> >> Subsequent rerun of the pg_dump solves that problem. >> More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in >> ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file. >> There can be some error checking instead of this macro. > > Yeah, I noticed this and similar lacks of error checks in pg_dump in > code review, which I didn't get around to patching. Care to submit a > patch? Indeed, with a closer look there are things like tarWrite that can return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we simply check for errno = 0 and generate a more generic error message instead? Or are you willing at replacing all those things with just exit_horribly()? -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: >> Yeah, I noticed this and similar lacks of error checks in pg_dump in >> code review, which I didn't get around to patching. Care to submit a >> patch? > Indeed, with a closer look there are things like tarWrite that can > return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we > simply check for errno = 0 and generate a more generic error message > instead? Or are you willing at replacing all those things with just > exit_horribly()? I do not understand these claims that there isn't an error check there. There surely is. But fwrite() didn't set errno. The real question is why did he get a short write in the first place. We don't make any attempt to support filesystems that require retries, which seems to be what is going on here. Should we? regards, tom lane
Attached small refactoring of the ahwrite() which includes checking of the errno.
It was made with postgres-9.4.5 sources but applicable to 9.4.1 and even to 9.5.0 with small fuss at ExecuteSqlCommandBuf() callOn Mon, Jan 25, 2016 at 7:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
kunschikov@gmail.com wrote:
> Database backup in heavy loaded environment sometimes fails with the
> following error:
>
> pg_dump: [parallel archiver] could not write to output file: Success
>
> Subsequent rerun of the pg_dump solves that problem.
> More elegant and reliable solution: WRITE_ERROR_EXIT macro replacement in
> ahwrite() function at src/bin/pg_dump/pg_backup_archiver.c source file.
> There can be some error checking instead of this macro.
Yeah, I noticed this and similar lacks of error checks in pg_dump in
code review, which I didn't get around to patching. Care to submit a
patch?
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Вложения
On Tue, Jan 26, 2016 at 11:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >>> Yeah, I noticed this and similar lacks of error checks in pg_dump in >>> code review, which I didn't get around to patching. Care to submit a >>> patch? > >> Indeed, with a closer look there are things like tarWrite that can >> return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we >> simply check for errno = 0 and generate a more generic error message >> instead? Or are you willing at replacing all those things with just >> exit_horribly()? > > I do not understand these claims that there isn't an error check there. > There surely is. But fwrite() didn't set errno. Yep, and the error message provided let's the user think that there has been a failure, with a success. I am wondering about something like that: -#define WRITE_ERROR_EXIT \ +#define WRITE_ERROR_EXIT(errno) \ do { \ - exit_horribly(modulename, "could not write to output file: %s\n", \ - strerror(errno)); \ + if (errno != 0) \ + exit_horribly(modulename, "could not write to output file: %s\n", \ + strerror(errno)); \ + else \ + exit_horribly(modulename, "could not write to output file\n"); \ } while (0) A more invasive approach would be to actually replace most of the READ/WRITE_ERROR_EXIT stuff by more appropriate error checks with customized error messages. And I would prefer that, this helps having a deeper understanding of exactly where the error occurred when running pg_dump. > The real question is why did he get a short write in the first place. > We don't make any attempt to support filesystems that require retries, > which seems to be what is going on here. Should we? I would keep the code simple, we don't do that in the backend I recall. -- Michael
Tom Lane wrote: > Michael Paquier <michael.paquier@gmail.com> writes: > > On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera > > <alvherre@2ndquadrant.com> wrote: > >> Yeah, I noticed this and similar lacks of error checks in pg_dump in > >> code review, which I didn't get around to patching. Care to submit a > >> patch? > > > Indeed, with a closer look there are things like tarWrite that can > > return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we > > simply check for errno = 0 and generate a more generic error message > > instead? Or are you willing at replacing all those things with just > > exit_horribly()? > > I do not understand these claims that there isn't an error check there. > There surely is. But fwrite() didn't set errno. Yeah, that's what I was remembering actually: http://www.postgresql.org/message-id/20150608174336.GM133018@postgresql.org > The real question is why did he get a short write in the first place. > We don't make any attempt to support filesystems that require retries, > which seems to be what is going on here. Should we? Sounds likely. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
pg_dump: could not write to output file: Success >We don't make any attempt to support filesystems that require retries Filesystem: VFAT on usb media.Seems like short write was caused not by filesystem but by gzwrite(), since pg_dump was called with `-F d` switch: pg_dump -U *** -d **** -v -j 2 -t "alert_counters" -t "aggr_alert_counters" -t "alerts*" -t "payloads*" -F d -f "/media/usb0/backup_963871436_2016-01-18_09-16-12" I think there was EAGAIN error. We are solving this problem now by suggesting to users rerun of the backup script. I've attached small patch to the ahwrite() function in my previous posting (it is waiting for the moderation approval ). On Tue, Jan 26, 2016 at 5:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: > > On Tue, Jan 26, 2016 at 1:08 AM, Alvaro Herrera > > <alvherre@2ndquadrant.com> wrote: > >> Yeah, I noticed this and similar lacks of error checks in pg_dump in > >> code review, which I didn't get around to patching. Care to submit a > >> patch? > > > Indeed, with a closer look there are things like tarWrite that can > > return 0 and trigger WRITE_ERROR_EXIT with the same thing. Couldn't we > > simply check for errno = 0 and generate a more generic error message > > instead? Or are you willing at replacing all those things with just > > exit_horribly()? > > I do not understand these claims that there isn't an error check there. > There surely is. But fwrite() didn't set errno. > > The real question is why did he get a short write in the first place. > We don't make any attempt to support filesystems that require retries, > which seems to be what is going on here. Should we? > > regards, tom lane >
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Tom Lane wrote: >> I do not understand these claims that there isn't an error check there. >> There surely is. But fwrite() didn't set errno. > Yeah, that's what I was remembering actually: > http://www.postgresql.org/message-id/20150608174336.GM133018@postgresql.org >> The real question is why did he get a short write in the first place. >> We don't make any attempt to support filesystems that require retries, >> which seems to be what is going on here. Should we? > Sounds likely. Actually, now that I think about it while not in an airport, what I think we really need is logic along the lines of errno = 0; nbytes = fwrite(...); if (nbytes != expected && errno == 0) errno = ENOSPC; ie, assume a short write implies out-of-disk-space. I believe that is what we do in most (hopefully all) cases in the backend; see for example UpdateControlFile() in xlog.c. IMO it is not really WRITE_ERROR_EXIT's place to assume that errno == 0 should be replaced by ENOSPC; in particular, that would be incorrect, or at least insufficient, unless there's an explicit initialization of "errno = 0;" before the fwrite call. So I'm inclined to put the responsibility to do all of the above logic on the fwrite call sites, rather than splitting it between callers and WRITE_ERROR_EXIT. regards, tom lane
On February 2, 2016 10:12:54 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Actually, now that I think about it while not in an airport, what I >think we really need is logic along the lines of > > errno = 0; > nbytes = fwrite(...); > if (nbytes != expected && errno == 0) > errno = ENOSPC; > >ie, assume a short write implies out-of-disk-space. I believe that >is what we do in most (hopefully all) cases in the backend; see for >example UpdateControlFile() in xlog.c. There's an exception: when writing WAL we intentionally retry on short writes. IIRC Heikki added that after we found a casewhere large writes returned short, but non zero, and trying again to finish the rest works. I'm nor sure of there aren'tother cases where that should be done, die to large writes. Andres --- Please excuse brevity and formatting - I am writing this on my mobile phone.
Andres Freund <andres@anarazel.de> writes: > On February 2, 2016 10:12:54 PM GMT+01:00, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ie, assume a short write implies out-of-disk-space. I believe that >> is what we do in most (hopefully all) cases in the backend; see for >> example UpdateControlFile() in xlog.c. > There's an exception: when writing WAL we intentionally retry on short writes. IIRC Heikki added that after we found acase where large writes returned short, but non zero, and trying again to finish the rest works. I'm nor sure of there aren'tother cases where that should be done, die to large writes. I could support making pg_dump do that if there's a way to do it when writing through zlib (which is probably the normal case these days). I'm not at all sure that gzwrite() is retryable after a partial write; that would likely have consequences for the internal state of the compressor. In any case, if you get nbytes == 0 and errno == 0, substituting ENOSPC seems like the right thing. regards, tom lane