Обсуждение: Errands around AllocateDir()
Hi all, On the recent following thread problems around the use of AllocateDir() have been diagnosed with its use in the backend code: https://www.postgresql.org/message-id/20171127093107.1473.70477@wrigleys.postgresql.org I had a close look at all the callers of AllocateDir() and noticed a couple of unwelcome things (Tom noticed some of those in the thread mentioned above, I found others): - Some elog() instead of ereport(), which is bad for the error code and translation. - Some use ereport(LOG), and could take advantage of ReadDirExtended, which could get exposed instead of being contained in fd.c. Those elog(LOG) can be removed when there is no further operation in the routine where the folder read is happening. - perform_base_backup() makes the mistake of not saving errno before CheckXLogRemoved() when AllocateDir returns NULL, which can lead to an incorrect error message. - restoreTwoPhaseData() calls ReadDir() with LWLockAcquire() in-between, which can lead to an incorrect errno used. - Some code paths can simply rely on the error generated by ReadDir(), so some code is useless. - Some code paths use non-generic error messages, like RemoveOldXlogFiles(). Here more consistent error string would I think make sense. Some issues are real bugs, like what's in perform_base_backup() and restoreTwoPhaseData(), and some incorrect error codes. Missing translation of some messages is also wrong IMO. Making error messages more consistent is nice and cosmetic. My monitoring of all those issues is leading me to the attached patch for HEAD. Thoughts? -- Michael
Вложения
Michael Paquier <michael.paquier@gmail.com> writes: > I had a close look at all the callers of AllocateDir() and noticed a > couple of unwelcome things (Tom noticed some of those in the thread > mentioned above, I found others): The only part of this that seems likely to be controversial is the decision to get rid of special-case error messages, eg replace "could not open tablespace directory \"%s\": %m" with the more generic "could not open directory \"%s\": %m" As I said in the previous thread, I don't see anything much wrong with that; but if anyone doesn't like it, speak now. Also, I'm inclined to back-patch the exporting of ReadDirExtended, so that we can rely on it being available if we use it in any back-patched fixes. I wouldn't back-patch anything else here that's not a clear bug fix, but we may need public ReadDirExtended anyway just to do that much (I didn't really check that yet). regards, tom lane
On Mon, Dec 4, 2017 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> I had a close look at all the callers of AllocateDir() and noticed a >> couple of unwelcome things (Tom noticed some of those in the thread >> mentioned above, I found others): > > The only part of this that seems likely to be controversial is the > decision to get rid of special-case error messages, eg replace > "could not open tablespace directory \"%s\": %m" > with the more generic > "could not open directory \"%s\": %m" > > As I said in the previous thread, I don't see anything much wrong > with that; but if anyone doesn't like it, speak now. I find that change an improvement, but not something to back-patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Dec 4, 2017 at 12:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> The only part of this that seems likely to be controversial is the >> decision to get rid of special-case error messages, eg replace >> "could not open tablespace directory \"%s\": %m" >> with the more generic >> "could not open directory \"%s\": %m" >> As I said in the previous thread, I don't see anything much wrong >> with that; but if anyone doesn't like it, speak now. > I find that change an improvement, but not something to back-patch. Yeah, agreed. The only thing I'm concerned about back-patching is the places where a wrong errno might be reported. regards, tom lane
Tom Lane wrote: > Yeah, agreed. The only thing I'm concerned about back-patching is > the places where a wrong errno might be reported. If we're currently reporting "could not open dir: Success" then backpatching such a fix is definitely an improvement. OTOH if currently we have opendir() trying to report a failure, then LWLockRelease replace the errno because something completely unrelated also failed, having the message report exactly the opendir() failure rather than the lwlock failure is surely also an improvement. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Michael Paquier <michael.paquier@gmail.com> writes: > I had a close look at all the callers of AllocateDir() and noticed a > couple of unwelcome things (Tom noticed some of those in the thread > mentioned above, I found others): Pushed with some minor additional fiddling. The most notable thing I changed was that instead of this: > - perform_base_backup() makes the mistake of not saving errno before > CheckXLogRemoved() when AllocateDir returns NULL, which can lead to an > incorrect error message. I modified CheckXLogRemoved() to internally guarantee that it does not change errno. This is because there seemed to be other call sites that were depending on that, not just this one. Anyway, that seemed like a more future-proof fix than relying on callers to deal with it. regards, tom lane
On Tue, Dec 5, 2017 at 7:17 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Michael Paquier <michael.paquier@gmail.com> writes: >> I had a close look at all the callers of AllocateDir() and noticed a >> couple of unwelcome things (Tom noticed some of those in the thread >> mentioned above, I found others): > > Pushed with some minor additional fiddling. Thanks. > The most notable thing I changed was that instead of this: > >> - perform_base_backup() makes the mistake of not saving errno before >> CheckXLogRemoved() when AllocateDir returns NULL, which can lead to an >> incorrect error message. > > I modified CheckXLogRemoved() to internally guarantee that it does not > change errno. This is because there seemed to be other call sites that > were depending on that, not just this one. Anyway, that seemed like a > more future-proof fix than relying on callers to deal with it. Hm, OK. Yes I can see the point behind this way of doing instead, CheckXLogRemoved() is much used in error code paths. -- Michael
On Tue, Dec 5, 2017 at 4:11 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Tom Lane wrote: >> Yeah, agreed. The only thing I'm concerned about back-patching is >> the places where a wrong errno might be reported. > > If we're currently reporting "could not open dir: Success" then > backpatching such a fix is definitely an improvement. OTOH if currently > we have opendir() trying to report a failure, then LWLockRelease replace > the errno because something completely unrelated also failed, having the > message report exactly the opendir() failure rather than the lwlock > failure is surely also an improvement. Note I am +/-0 with exposing ReadDirExtended in back-branches, because there is no use for it yet there. Only fixing the actual bugs with errno is of course fine for me if my initial message was not clear for back-branches. -- Michael
Michael Paquier <michael.paquier@gmail.com> writes: > Note I am +/-0 with exposing ReadDirExtended in back-branches, because > there is no use for it yet there. My thought is basically that we might end up back-patching some change that needs that. Nothing I've done today requires it, but it seems like a very harmless guard against future problems. I am BTW wondering whether any of the other directory scan loops in the backend ought to be converted from ReadDir to ReadDirExtended(LOG). I'm just finishing up making ResetUnloggedRelations() go the other way, because of the argument that failing to reset unlogged relations would be a data corruption hazard. But there may well be other cases where the best thing on balance is to log the problem and press on. With the changes we've made today, it's a very easy fix to flip from one behavior to the other. regards, tom lane