Обсуждение: Error handling (or lack of it) in RemovePgTempFilesInDir

Поиск
Список
Период
Сортировка

Error handling (or lack of it) in RemovePgTempFilesInDir

От
Tom Lane
Дата:
The recent changes in commit dc6c4c9dc caused Coverity to start
complaining that RemovePgTempFilesInDir calls rmdir() without checking
its return value, as would be good practice.  Now, this wasn't really a
fault of that commit, because the code was already ignoring the result of
unlink(), but it did cause me to wonder why we're ignoring either one.
Even granted that we don't want to throw ERROR there (because this code
runs in the postmaster, and an ERROR would force postmaster exit), our
usual coding practice would be to print a LOG message about the failure.

Tracing back, the decision to ignore errors altogether seems to originate
with commit 2a6f7ac45, which added this code:

                    if (strncmp(temp_de->d_name,
                                PG_TEMP_FILE_PREFIX,
                                strlen(PG_TEMP_FILE_PREFIX)) == 0)
                    {
                        unlink(rm_path);
                    }
                    else
                    {
                        /*
                         * would prefer to use elog here, but it's not
                         * up and running during postmaster startup...
                         */
                        fprintf(stderr,
                                "Unexpected file found in temporary-files directory: %s\n",
                                rm_path);
                    }

I'm not real sure that the bit about "elog not being up" was true even
at the time, and it's definitely wrong now; but perhaps thinking that
it wasn't up explains why we chose not to whine about unlink failure.
Or maybe there was no thought that it could fail at all --- it's surely
odd to complain about "this file shouldn't be here" but not about
"I should be able to remove this file but failed to".

Anyway, I'm inclined to reverse that choice and emit LOG messages
reporting failure of any of the lstat, rmdir, or unlink calls in
RemovePgTempFilesInDir.  In the worst case, say that there are a
bunch of leftover temp files in a directory that someone has made
unwritable, this might cause a fair amount of whining in the postmaster
log --- but that's a situation that needs to be fixed anyway, so
I cannot see how not printing anything is a good idea.

I'm also inclined to convert the ReadDir calls in this function to be
ReadDirExtended(..., LOG), as per Michael's proposal in
https://www.postgresql.org/message-id/CAB7nPqRpOCxjiirHmebEFhXVTK7V5Jvw4bz82p7Oimtsm3TyZA@mail.gmail.com
If we don't want to throw a hard error for failure to remove files,
it seems like throwing an error for failure to read a temp dir isn't
a great choice either.

            regards, tom lane


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Thomas Munro
Дата:
On Tue, Dec 5, 2017 at 4:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Anyway, I'm inclined to reverse that choice and emit LOG messages
> reporting failure of any of the lstat, rmdir, or unlink calls in
> RemovePgTempFilesInDir.  In the worst case, say that there are a
> bunch of leftover temp files in a directory that someone has made
> unwritable, this might cause a fair amount of whining in the postmaster
> log --- but that's a situation that needs to be fixed anyway, so
> I cannot see how not printing anything is a good idea.

Belatedly, +1.  The error hiding seemed a bit odd considering that we
were prepared to log "unexpected file found ...".  I probably should
have questioned that instead of extending it monkey-see-monkey-do.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Michael Paquier
Дата:
On Tue, Dec 5, 2017 at 8:40 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Tue, Dec 5, 2017 at 4:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Anyway, I'm inclined to reverse that choice and emit LOG messages
>> reporting failure of any of the lstat, rmdir, or unlink calls in
>> RemovePgTempFilesInDir.  In the worst case, say that there are a
>> bunch of leftover temp files in a directory that someone has made
>> unwritable, this might cause a fair amount of whining in the postmaster
>> log --- but that's a situation that needs to be fixed anyway, so
>> I cannot see how not printing anything is a good idea.
>
> Belatedly, +1.  The error hiding seemed a bit odd considering that we
> were prepared to log "unexpected file found ...".  I probably should
> have questioned that instead of extending it monkey-see-monkey-do.

Well, I am -1 on this change. The coding before commit 561885d that
you have now pushed (timezone makes me answer later) was way more
conservative and I honestly preferred it as *only* the next postmaster
restart would remove remnant temp files which can cause potentially GB
of data to stay around. Also, if the failure happens at startup, isn't
it going to fail as well for backends afterwards? This would cause
backends to fail abruptly and it is actually easier to debug a
completely stopped instance...
-- 
Michael


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Dec 5, 2017 at 8:40 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Belatedly, +1.  The error hiding seemed a bit odd considering that we
>> were prepared to log "unexpected file found ...".  I probably should
>> have questioned that instead of extending it monkey-see-monkey-do.

> Well, I am -1 on this change. The coding before commit 561885d that
> you have now pushed (timezone makes me answer later) was way more
> conservative and I honestly preferred it as *only* the next postmaster
> restart would remove remnant temp files which can cause potentially GB
> of data to stay around.

Uh ... I'm confused?  That particular change only concerns whether we emit
a log message, not whether the action is attempted or succeeds.

            regards, tom lane


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Michael Paquier
Дата:
On Tue, Dec 5, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Tue, Dec 5, 2017 at 8:40 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> Belatedly, +1.  The error hiding seemed a bit odd considering that we
>>> were prepared to log "unexpected file found ...".  I probably should
>>> have questioned that instead of extending it monkey-see-monkey-do.
>
>> Well, I am -1 on this change. The coding before commit 561885d that
>> you have now pushed (timezone makes me answer later) was way more
>> conservative and I honestly preferred it as *only* the next postmaster
>> restart would remove remnant temp files which can cause potentially GB
>> of data to stay around.
>
> Uh ... I'm confused?  That particular change only concerns whether we emit
> a log message, not whether the action is attempted or succeeds.

From the commit mentioned upthread, this switches one hard failure
when opening pg_tblspc to a LOG report:
@@ -3014,7 +3018,7 @@ RemovePgTempFiles(void)
     */
    spc_dir = AllocateDir("pg_tblspc");

-   while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
+   while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
    {
ReadDir() issues an ERROR when spc_dir is NULL..
-- 
Michael


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Dec 5, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Uh ... I'm confused?  That particular change only concerns whether we emit
>> a log message, not whether the action is attempted or succeeds.

> From the commit mentioned upthread, this switches one hard failure
> when opening pg_tblspc to a LOG report:
> @@ -3014,7 +3018,7 @@ RemovePgTempFiles(void)
>      */
>     spc_dir = AllocateDir("pg_tblspc");

> -   while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
> +   while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>     {

That's not the same commit you just mentioned.  The point with this one is
that RemovePgTempFiles is a noncritical operation: if we fail to remove
temp files, it's still safe to start up, because those temp files won't
cause failures later.  (This is the exact opposite of the situation for
ResetUnloggedRelations's directory scans, which is why I changed that one
in the opposite direction.)

The general theory I'm operating on is that we should endeavor to
let the database start in any situation where that doesn't involve
a data-corruption hazard.  Yeah, it might not be nice if we leave
GB worth of temp files around, but is a postmaster start failure
better?  I don't think so.

            regards, tom lane


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Michael Paquier
Дата:
On Tue, Dec 5, 2017 at 11:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> On Tue, Dec 5, 2017 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Uh ... I'm confused?  That particular change only concerns whether we emit
>>> a log message, not whether the action is attempted or succeeds.
>
>> From the commit mentioned upthread, this switches one hard failure
>> when opening pg_tblspc to a LOG report:
>> @@ -3014,7 +3018,7 @@ RemovePgTempFiles(void)
>>      */
>>     spc_dir = AllocateDir("pg_tblspc");
>
>> -   while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
>> +   while ((spc_de = ReadDirExtended(spc_dir, "pg_tblspc", LOG)) != NULL)
>>     {
>
> That's not the same commit you just mentioned.

561885db05d3296082ce8750805b8ec322cf9aa1 refers to this thread, and I
am reading this diff as part of it :)

> The general theory I'm operating on is that we should endeavor to
> let the database start in any situation where that doesn't involve
> a data-corruption hazard.  Yeah, it might not be nice if we leave
> GB worth of temp files around, but is a postmaster start failure
> better?  I don't think so.

I am getting the feeling that we are going to see people complain
about those files lying around as well... That's as far as my opinion
goes.
-- 
Michael


Re: Error handling (or lack of it) in RemovePgTempFilesInDir

От
Tom Lane
Дата:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Tue, Dec 5, 2017 at 11:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The general theory I'm operating on is that we should endeavor to
>> let the database start in any situation where that doesn't involve
>> a data-corruption hazard.  Yeah, it might not be nice if we leave
>> GB worth of temp files around, but is a postmaster start failure
>> better?  I don't think so.

> I am getting the feeling that we are going to see people complain
> about those files lying around as well... That's as far as my opinion
> goes.

Perhaps, but surely it's inconsistent to consider that opendir()
failure is fatal while failing to unlink individual temp files
found by the directory scan is not.  As an example, a pretty likely
scenario is that somehow a temp directory has been made unwritable
by the postmaster.  The old coding would have let that pass without
even a bleat in the postmaster log; but it results in just the same
amount of disk space leakage as ignoring the temp directory because
we couldn't opendir() it.

            regards, tom lane