Discussion: Autovacuum breakage from a734fd5d1


Autovacuum breakage from a734fd5d1

From
Tom Lane
Date:
Buildfarm member skink failed a couple days ago:
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-11-25%2017%3A50%3A01

I believe the interesting parts of the log are

2016-11-25 18:29:03.285 UTC [583882e7.2a45:1] LOG:  autovacuum: dropping orphan temp table "(null)"."(null)" in database "regression"
2016-11-25 18:29:03.919 UTC [583882e7.2a45:2] ERROR:  could not open relation with OID 33895
2016-11-25 18:29:03.919 UTC [583882e7.2a45:3] CONTEXT:  dropping of orphan temp table "(null)"."(null)" in database "regression"

TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "/home/andres/build/buildfarm/HEAD/pgsql.build/../pgsql/src/backend/utils/mmgr/mcxt.c", Line: 1001)

2016-11-25 18:29:07.524 UTC [58387ba7.609:3] LOG:  server process (PID 10821) was terminated by signal 6: Aborted
2016-11-25 18:29:07.526 UTC [58387ba7.609:4] LOG:  terminating any other active server processes

10821 = 2a45 hex so the first three log entries are from the process that
later asserted.  The assertion is evidently from a "pfree(NULL)" call.
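(The hex-to-decimal PID match is easy to verify; this trivial helper, which exists only for illustration, just encodes the arithmetic.)

```c
#include <assert.h>

/* The log-line prefix "583882e7.2a45" encodes the backend PID in hex;
 * 0x2a45 is 2*4096 + 10*256 + 4*16 + 5 = 10821 decimal. */
int pid_from_hex(void)
{
    return 0x2a45;
}
```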

Looking into autovacuum.c, it's pretty clear that it was trying to drop
what it thought was an orphan temp table, only the temp table was already
gone, causing get_namespace_name() and get_rel_name() to return NULL.
It's only a platform-specific artifact that the ereport calls printed
"(null)" rather than crashing, and of course the pfree(relname) down at
the bottom of the loop is what caused the assert failure.

Presumably, what happened here is that somebody else already deleted
the "orphan" table before autovacuum could get to it.  Since there's
no previous crash in this run, it's difficult to come to any conclusion
other than that the "orphan" got deleted by its owning backend, meaning
that the test for orphan status is jumping the gun and deciding that
a table is orphaned before its owning session is really gone.  That's
rather disastrous if true.

The other problem here is that the code is supposing that it can access
a temp table's catalog entries without having obtained any lock on it.
Correct coding would involve acquiring the relation lock and then
checking to see if the table is still there.

I am also not feeling terribly comfortable with the potentially long
interval between where we decide that relation OID so-and-so is an
orphaned table and where we summarily remove that relation OID.  This
is probably safe, if you assume that temp tables aren't likely to
have lifespans comparable to the OID counter's wraparound interval,
but I think we would be well advised to recheck that decision after
acquiring lock.
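(A toy model of that lock-then-recheck ordering, with no backend dependencies: `acquire_lock`/`release_lock`/`rel_exists`/`create_rel` are invented stand-ins for `LockRelationOid()` and a syscache existence probe, not the actual autovacuum.c code.)

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RELS 8

static bool exists[MAX_RELS];   /* toy "pg_class": does the OID exist? */
static bool locked[MAX_RELS];   /* toy lock table keyed by OID         */

void create_rel(int oid)   { exists[oid] = true; }
void acquire_lock(int oid) { locked[oid] = true; }
void release_lock(int oid) { locked[oid] = false; }
bool rel_exists(int oid)   { return exists[oid]; }

/* Returns true only if the table was actually dropped.  The decision
 * that the OID is orphaned is rechecked *after* taking the lock, so a
 * concurrent drop by the owning backend is detected, not tripped over. */
bool drop_orphan_if_still_there(int oid)
{
    acquire_lock(oid);          /* 1. lock the OID first              */
    if (!rel_exists(oid))       /* 2. recheck under the lock          */
    {
        release_lock(oid);      /*    someone already dropped it      */
        return false;
    }
    exists[oid] = false;        /* 3. entry verified: safe to drop    */
    release_lock(oid);
    return true;
}
```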

At least some of these behaviors were presumably hazards before, but were
effectively masked by the long delay before we'd attempt to remove an
orphan temp table.  But I am wondering why we've not seen reports of
crashes due to the old "found orphan temp table" log message trying to
dereference null pointers, since that could have occurred immediately
after the table became (ostensibly) orphan.

I don't have proposed fixes for any of this yet, but it's clearly
an area that needs review.
        regards, tom lane



Re: Autovacuum breakage from a734fd5d1

From
Tom Lane
Date:
I wrote:
> Buildfarm member skink failed a couple days ago:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2016-11-25%2017%3A50%3A01

Ah ... I can reproduce this with moderate reliability (one failure every
10 or so iterations of the regression tests) by inserting a delay just
before autovacuum's check of orphan status:

*** a/src/backend/postmaster/autovacuum.c
--- b/src/backend/postmaster/autovacuum.c
*************** do_autovacuum(void)
*** 2046,2051 ****
--- 2046,2053 ----
  		{
  			int			backendID;

+ 			pg_usleep(100000);
+
  			backendID = GetTempNamespaceBackendId(classForm->relnamespace);
  			/* We just ignore it if the owning backend is still active */


I think the sequence of events must be:

1. autovacuum starts its seqscan of pg_class, locking down the snapshot
it's going to use for that.

2. Some backend completes its session and drops some temp table(s).

3. autovacuum's scan arrives at the pg_class entry for one of these
tables.  By now it's committed dead, but it's still visible according
to the seqscan's snapshot, so we make the above test.  Assuming the
owning backend has vacated its sinval slot and no new session has
occupied it, we'll decide the table is orphan and record its OID
for later deletion.

4. The later code that tries to drop the table is able to see that
it's gone by now.  Kaboom.
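(Step 3 is ordinary snapshot visibility: a row deleted by a transaction that committed after the scan's snapshot was taken is "committed dead" yet still visible to that snapshot.  A greatly simplified sketch — real HeapTupleSatisfiesMVCC logic involves commit status, in-progress checks, and more — but it captures the race:)

```c
#include <assert.h>
#include <stdbool.h>

/* Toy MVCC check: xmin is the creating XID, xmax the deleting XID
 * (0 if the row is live), snapshot_xid the XID horizon of the scan's
 * snapshot.  All XIDs are assumed committed, so visibility is just a
 * matter of ordering. */
bool visible_to_snapshot(unsigned xmin, unsigned xmax, unsigned snapshot_xid)
{
    if (xmin >= snapshot_xid)
        return false;           /* created after the snapshot was taken */
    if (xmax != 0 && xmax < snapshot_xid)
        return false;           /* deleted before the snapshot was taken */
    return true;                /* live, or deleted after the snapshot */
}
```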

In existing releases, it would be about impossible for this race condition
to persist long enough that we'd actually try to drop the table.  It's
definitely possible that we'd try to print "found orphan temp table",
but guess what: the back-branch coding here is
                   ereport(LOG,
                           (errmsg("autovacuum: found orphan temp table \"%s\".\"%s\" in database \"%s\"",
                                   get_namespace_name(classForm->relnamespace),
                                   NameStr(classForm->relname),
                                   get_database_name(MyDatabaseId))));

The only part of that that would be at risk of failure is the
get_namespace_name call, and since we don't ordinarily remove the
pg_namespace entries for temp schemas, it's not likely to fail either.
So the race condition does exist in released branches, but it would not
cause an autovacuum crash even if libc is unforgiving about printf'ing a
NULL string.  At most it would cause bogus log entries claiming that
temp tables have been orphaned when they haven't.

I went digging in the buildfarm logs and was able to find one single
instance of a "found orphan temp table" log entry that couldn't be
blamed on a prior backend crash; it's in this report:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=tick&dt=2015-07-29%2003%3A37%3A52

So the problem seems to be confirmed to exist, but be of low probability
and low consequences, in back branches.  I think we only need to fix it in
HEAD.  The lock acquisition and status recheck that I proposed before
should be sufficient.
        regards, tom lane



Re: Autovacuum breakage from a734fd5d1

From
Robert Haas
Date:
On Sun, Nov 27, 2016 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So the problem seems to be confirmed to exist, but be of low probability
> and low consequences, in back branches.  I think we only need to fix it in
> HEAD.  The lock acquisition and status recheck that I proposed before
> should be sufficient.

Thanks for digging into this.  I failed to notice while reviewing that
the way we were printing the message had changed a bit in the new
code, and I just totally overlooked the existing locking hazards.
Oops.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Autovacuum breakage from a734fd5d1

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Thanks for digging into this.  I failed to notice while reviewing that
> the way we were printing the message had changed a bit in the new
> code, and I just totally overlooked the existing locking hazards.
> Oops.

I pushed a patch to deal with this.  I ended up simplifying the previous
commit considerably by getting rid of the commit-multiple-deletions-per-
transaction business.  I do not think that this code will get exercised
enough, either in the field or in testing, to justify taking a risk of
additional bugs from that.

Also, another reason for not liking that is that it seems to increase the
risk of deadlock against an incoming session that's taking over the temp
schema belonging to a crashed session.  The incoming backend would be
deleting the same tables in (probably) a different order, which is an
obvious deadlock hazard.  I think it'd be mostly all right because the
autovac worker should lose deadlock fights, but the incoming session would
see a deadlock_timeout stall if not an error.  So that's another reason to
commit immediately after each deletion.

I tested this by dint of turning RemoveTempRelationsCallback into a no-op,
so that things always acted as though sessions had crashed without
cleaning up their temp tables.  The code survived quite a few regression
test cycles without visible problems, although I had to tweak rules.sql
to ignore temp views.
        regards, tom lane



Re: Autovacuum breakage from a734fd5d1

From
Michael Paquier
Date:
On Mon, Nov 28, 2016 at 10:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 27, 2016 at 5:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So the problem seems to be confirmed to exist, but be of low probability
>> and low consequences, in back branches.  I think we only need to fix it in
>> HEAD.  The lock acquisition and status recheck that I proposed before
>> should be sufficient.
>
> Thanks for digging into this.  I failed to notice while reviewing that
> the way we were printing the message had changed a bit in the new
> code, and I just totally overlooked the existing locking hazards.
> Oops.

Sorry for the late reply.

To reproduce the failure, I inserted a manual pg_usleep just before the
loop over the list of orphan_oids; after manually dropping a couple of
the orphaned temporary tables from another session, I was able to see
the failure. Attached is a proposed patch.
--
Michael

Attachments

Re: Autovacuum breakage from a734fd5d1

From
Michael Paquier
Date:
On Mon, Nov 28, 2016 at 11:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I pushed a patch to deal with this.  I ended up simplifying the previous
> commit considerably by getting rid of the commit-multiple-deletions-per-
> transaction business.  I do not think that this code will get exercised
> enough, either in the field or in testing, to justify taking a risk of
> additional bugs from that.

Oh, well, that was too late then... Thanks for addressing the issue.
-- 
Michael



Re: Autovacuum breakage from a734fd5d1

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> In order to reproduce the failure I have just inserted a manual
> pg_usleep before looping through the list of orphan_oids, and after
> dropping manually from another session a couple of orphaned temporary
> tables I was able to see the failure. Attached is a proposal of patch.

That's not really adequate, because acquiring the lock doesn't prove that
the table still exists; you might just be holding a useless lock against
a now-unused OID.  You really need to do something to verify that the
table's catalog entries are still visible after you have the lock.
        regards, tom lane



Re: Autovacuum breakage from a734fd5d1

From
Michael Paquier
Date:
On Mon, Nov 28, 2016 at 11:46 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Michael Paquier <michael.paquier@gmail.com> writes:
>> In order to reproduce the failure I have just inserted a manual
>> pg_usleep before looping through the list of orphan_oids, and after
>> dropping manually from another session a couple of orphaned temporary
>> tables I was able to see the failure. Attached is a proposal of patch.
>
> That's not really adequate, because acquiring the lock doesn't prove that
> the table still exists; you might just be holding a useless lock against
> a now-unused OID.  You really need to do something to verify that the
> table's catalog entries are still visible after you have the lock.

Self-meh. That's where try_relation_open() would have been better...
It does not matter much now; thanks for the input.
-- 
Michael



Re: Autovacuum breakage from a734fd5d1

From
Robert Haas
Date:
On Sun, Nov 27, 2016 at 9:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I pushed a patch to deal with this.  I ended up simplifying the previous
> commit considerably by getting rid of the commit-multiple-deletions-per-
> transaction business.  I do not think that this code will get exercised
> enough, either in the field or in testing, to justify taking a risk of
> additional bugs from that.

I think you made this considerably more fragile with those changes.
Now, if we fail to drop a temporary table, we won't do any actual
vacuuming, either.  I'd be willing to bet someone will get hosed
because of that who would have been much less hosed with the previous
coding.

Whether or not burning an XID per dropped table is going to hurt
anyone is more arguable.  One would like to think that only an
extraordinarily unlucky person would have many temporary tables to
drop at the very same time that they were also critically close to a
wraparound event.  I wouldn't wager on this one actually biting
anyone.  But I also do not think that the old code was so complex that
we couldn't have found and removed any bugs it might have had fairly
easily, so I don't agree with this change, either.

Also, I think that ripping out obviously-intentional safeguards
without discussion over a holiday weekend is not the most
collaborative approach that you could have devised.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Autovacuum breakage from a734fd5d1

From
Michael Paquier
Date:
On Mon, Nov 28, 2016 at 12:14 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think you made this considerably more fragile with those changes.
> Now, if we fail to drop a temporary table, we won't do any actual
> vacuuming, either.  I'd be willing to bet someone will get hosed
> because of that who would have been much less hosed with the previous
> coding.

[ ... Reading the actual change ...]
Right. This is missing a PG_TRY/CATCH block; the previous patch was
designed to be non-disruptive to the subsequent autovacuum operations.
So HEAD is now far more invasive in the way it does things.

> Whether or not burning an XID per dropped table is going to hurt
> anyone is more arguable.  One would like to think that only an
> extraordinarily unlucky person would have many temporary tables to
> drop at the very same time that they were also critically close to a
> wraparound event.  I wouldn't wager on this one actually biting
> anyone.  But I also do not think that the old code was so complex that
> we couldn't have found and removed any bugs it might have had fairly
> easily, so I don't agree with this change, either.

Don't we need to worry about burning too many transaction XIDs before a
wraparound autovacuum? I am aware that this is really a corner case,
but burning potentially thousands of them to drop that many orphaned
objects does not sound very appealing to me.
-- 
Michael



Re: Autovacuum breakage from a734fd5d1

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think you made this considerably more fragile with those changes.

This code will only ever run at all in corner cases --- cases that
almost by definition will go untested in the standard regression tests.
The problems you suggest it has are corner-squared or corner-cubed cases.
We were rather fortunate to find the bugs in the committed version before
it got into the field, and would be even less likely to find bugs in code
that is meant to run in corner-squared cases.  So, with all due respect,
I think your priorities here are entirely backwards.  I'm much more
worried about whether the code will still work as intended after the next
big backend code refactoring than about whether it's optimal for cases
that will almost never occur in the field.
        regards, tom lane



Re: Autovacuum breakage from a734fd5d1

From
Robert Haas
Date:
On Sun, Nov 27, 2016 at 10:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I think you made this considerably more fragile with those changes.
>
> This code will only ever run at all in corner cases --- cases that
> almost by definition will go untested in the standard regression tests.
> The problems you suggest it has are corner-squared or corner-cubed cases.
> We were rather fortunate to find the bugs in the committed version before
> it got into the field, and would be even less likely to find bugs in code
> that is meant to run in corner-squared cases.  So, with all due respect,
> I think your priorities here are entirely backwards.

I am of the contrary opinion - I think *your* priorities are
backwards.   Describing those as corner-squared or corner-cubed cases
is misleading; it is true that they will arise for very few users, but
they may be catastrophic when they do arise. On the other hand, the
logic you've removed might have been bug-free (in which case removing
it is strictly a loss) or it might have had only bugs without any
particularly serious consequences (in which case you've traded little
problems for big ones).  Even if it does have a serious bug, there's
no reason to suppose it will be impossible to find; why shouldn't the
BF or beta testing or dev testing reveal it, just as it revealed this
bug?

I don't believe we should be so scared of the possibility of a serious
bug that can't be found through any of the ways we normally test that
we aren't willing to fix problems we can readily foresee.  I grant
that there are some situations where fixing a problem might involve
enough risk that we shouldn't attempt it, but this is (or was) pretty
straightforward code patterned after existing logic, and I really see
no reason to believe that anything that was wrong with it couldn't
have been debugged easily enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Autovacuum breakage from a734fd5d1

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I don't believe we should be so scared of the possibility of a serious
> bug that can't be found through any of the ways we normally test that
> we aren't willing to fix problems we can readily foresee.  I grant
> that there are some situations where fixing a problem might involve
> enough risk that we shouldn't attempt it, but this is (or was) pretty
> straightforward code patterned after existing logic, and I really see
> no reason to believe that anything that was wrong with it couldn't
> have been debugged easily enough.

I'm astonished that you think that.  A problem here would be almost
impossible to diagnose/reproduce, I should think, given that it would
require at least two different failures (first a backend not cleaning up
after itself, and then something going wrong in autovac's drop attempt).
If you had reason to believe there was something broken there, you could
certainly hack the system enough to force it through that code sequence,
but that's basically not something that would ever happen in routine
testing.  So my judgment is that the odds of bugs being introduced here
and then making it to the field outweighs the potential benefit over the
long run.  We have enough hard-to-test code already, we do not need to add
more for hypothetical corner cases.

I did think of another argument on your side of this, which is that
if you imagine that there's a persistent failure to drop some temp table,
that would effectively disable autovacuum in that database.  Which
would be bad.  But we could address that at very minimal risk just by
moving the drop loop to after the vacuuming loop.  I note that the case
I'm afraid of, a bug in the error-catching logic, could also lead to
autovacuum becoming entirely disabled if we keep the drops first.
        regards, tom lane



Re: Autovacuum breakage from a734fd5d1

From
Robert Haas
Date:
On Mon, Nov 28, 2016 at 12:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I don't believe we should be so scared of the possibility of a serious
>> bug that can't be found through any of the ways we normally test that
>> we aren't willing to fix problems we can readily foresee.  I grant
>> that there are some situations where fixing a problem might involve
>> enough risk that we shouldn't attempt it, but this is (or was) pretty
>> straightforward code patterned after existing logic, and I really see
>> no reason to believe that anything that was wrong with it couldn't
>> have been debugged easily enough.
>
> I'm astonished that you think that.  A problem here would be almost
> impossible to diagnose/reproduce, I should think, given that it would
> require at least two different failures (first a backend not cleaning up
> after itself, and then something going wrong in autovac's drop attempt).
> If you had reason to believe there was something broken there, you could
> certainly hack the system enough to force it through that code sequence,
> but that's basically not something that would ever happen in routine
> testing.  So my judgment is that the odds of bugs being introduced here
> and then making it to the field outweighs the potential benefit over the
> long run.  We have enough hard-to-test code already, we do not need to add
> more for hypothetical corner cases.

The easiest way to test this would be to just hack the system catalogs
to be invalid in some way.  I think that frying relnatts or
relpersistence would cause a relcache build failure, which is
basically the kind of thing I'm worried about here.  I've seen plenty
of cases where a system was basically working and the user was
basically happy despite some minor catalog corruption ... until that
"minor" catalog corruption broke an entire subsystem.

A good example is relnamespace.  We've still not eliminated all of the
cases where a DROP SCHEMA concurrent with a CREATE <something> can
result in an object that lives in a nonexistent schema.  So you end up
with this orphaned object that nobody really cares about (after all,
they dropped the schema on purpose) but it doesn't really matter
because everything still runs just fine.  And then, as recently
happened with an actual EnterpriseDB customer, somebody tries to run
pg_dump.  As it turns out, pg_dump fails outright in this situation.
And now suddenly the customer is calling support.  The bad catalog
entries themselves aren't really an issue, but when some other system
like pg_dump or autovacuum chokes on them and *fails completely*
instead of *failing only on the problematic objects* it amplifies the
problem from something that affects only an object that nobody really
cares about into a problem that has a major impact on the whole
system.

> I did think of another argument on your side of this, which is that
> if you imagine that there's a persistent failure to drop some temp table,
> that would effectively disable autovacuum in that database.  Which
> would be bad.  But we could address that at very minimal risk just by
> moving the drop loop to after the vacuuming loop.  I note that the case
> I'm afraid of, a bug in the error-catching logic, could also lead to
> autovacuum becoming entirely disabled if we keep the drops first.

I agree that moving the DROP loop after the other loop has some
appeal, but I see a couple of problems.  One is that, by itself, it
doesn't prevent the cascading-failure problem I mentioned above.  If
the system is close to wrapround and the pg_class scan finds the temp
table that is holding back xmin after the table that can't be dropped
because the catalog is corrupt, then you're back in the situation
where a busted table keeps the system from doing the right thing on an
un-busted table.  The second is that dropping a table doesn't call
vac_update_datfrozenxid(); if we drop a table before running vacuum
operations, the results of the drop will be reflected in any
subsequent datfrozenxid update that may occur.  If we drop it
afterwards, it won't.

Perhaps neither of those things are totally horrible but they're not
especially good, either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company