Обсуждение: autovacuum crash due to null pointer

Поиск
Список
Период
Сортировка

autovacuum crash due to null pointer

От
Tom Lane
Дата:
There's a fairly interesting crash here:
http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=jaguar&dt=2008-07-16%2003:00:02
The buildfarm was nice enough to provide a stack trace at the bottom of
the page, which shows clearly that autovac tried to pfree a null
pointer.

What I think happened was that the table that was selected to be
autovacuumed got dropped during the setup steps, leading get_rel_name()
to return NULL at line 2167.  vacuum() itself would have fallen out
silently ...  however, had it errored out, the attempts at error
reporting in the PG_CATCH block would have crashed.

I see that we already noticed and fixed this type of problem in
autovac_report_activity(), but do_autovacuum() didn't get the word.
Is there anyplace else in there with the same issue?  For that matter,
why is autovac_report_activity repeating the lookups already done
at the outer level?

One other point is that the postmaster log just says

TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "mcxt.c", Line: 580)
[487d6715.3a87:2] LOG:  server process (PID 16885) was terminated by signal 6: Aborted

Could we get that to say "autovacuum worker" instead of "server"?
        regards, tom lane


Re: autovacuum crash due to null pointer

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> There's a fairly interesting crash here:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=jaguar&dt=2008-07-16%2003:00:02
> The buildfarm was nice enough to provide a stack trace at the bottom of
> the page, which shows clearly that autovac tried to pfree a null
> pointer.
> 
> What I think happened was that the table that was selected to be
> autovacuumed got dropped during the setup steps, leading get_rel_name()
> to return NULL at line 2167.  vacuum() itself would have fallen out
> silently ...  however, had it errored out, the attempts at error
> reporting in the PG_CATCH block would have crashed.

Ouch.  This was introduced when we added the PG_TRY block to add the
errcontext and resume vacuuming with the next table; the original code
was clean about this problem.

> I see that we already noticed and fixed this type of problem in
> autovac_report_activity(), but do_autovacuum() didn't get the word.
> Is there anyplace else in there with the same issue?

I'll have a more detailed look.

> For that matter, why is autovac_report_activity repeating the lookups
> already done at the outer level?

Ah, good catch -- it's because the outer code was added later.  It
should be easy to pass the names down, I think.

> One other point is that the postmaster log just says
> 
> TRAP: FailedAssertion("!(pointer != ((void *)0))", File: "mcxt.c", Line: 580)
> [487d6715.3a87:2] LOG:  server process (PID 16885) was terminated by signal 6: Aborted
> 
> Could we get that to say "autovacuum worker" instead of "server"?

Hmm, that would require moving code from HandleChildCrash to
CleanupBackend, I think.  (We're creating the process name in
CleanupBackend without scanning the BackendList, which we would need to
figure out if it's a worker.)

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: autovacuum crash due to null pointer

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> There's a fairly interesting crash here:
> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=jaguar&dt=2008-07-16%2003:00:02
> The buildfarm was nice enough to provide a stack trace at the bottom of
> the page, which shows clearly that autovac tried to pfree a null
> pointer.
> 
> What I think happened was that the table that was selected to be
> autovacuumed got dropped during the setup steps, leading get_rel_name()
> to return NULL at line 2167.  vacuum() itself would have fallen out
> silently ...  however, had it errored out, the attempts at error
> reporting in the PG_CATCH block would have crashed.

I'm having a hard time reproducing this problem; I added a pg_usleep()
just before get_rel_name to have the chance to drop the table, but
strangely enough it doesn't return NULL.  It seems that the cache entry
is not getting invalidated.  Maybe there's something else that's needed
to reproduce the crash.

Anyway, I propose this patch in HEAD to close this hole.  For 8.3 I'm
thinking in just rearranging the get_*_name calls and adding the goto.
Thoughts?

I think there's another hole here: what happens if the table exists when
get_rel_name is called, but the schema is dropped between that point and
calling get_namespace_name?  It is pretty unlikely but can it be
discarded completely?  What should we do in that case?

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.


Re: autovacuum crash due to null pointer

От
Alvaro Herrera
Дата:
Alvaro Herrera wrote:

> Anyway, I propose this patch in HEAD to close this hole.  For 8.3 I'm
> thinking in just rearranging the get_*_name calls and adding the goto.
> Thoughts?

Sorry, really attached.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Вложения

Re: autovacuum crash due to null pointer

От
Tom Lane
Дата:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> What I think happened was that the table that was selected to be
>> autovacuumed got dropped during the setup steps, leading get_rel_name()
>> to return NULL at line 2167.  vacuum() itself would have fallen out
>> silently ...  however, had it errored out, the attempts at error
>> reporting in the PG_CATCH block would have crashed.

> I'm having a hard time reproducing this problem; I added a pg_usleep()
> just before get_rel_name to have the chance to drop the table, but
> strangely enough it doesn't return NULL.  It seems that the cache entry
> is not getting invalidated.  Maybe there's something else that's needed
> to reproduce the crash.

I think cache invals would get noticed at points where we had to open
some relation, so you probably need the sleep somewhere earlier than
that.  Or just throw in an AcceptInvalidationMessages() after the sleep.

> I think there's another hole here: what happens if the table exists when
> get_rel_name is called, but the schema is dropped between that point and
> calling get_namespace_name?  It is pretty unlikely but can it be
> discarded completely?  What should we do in that case?

I think you can just skip the table if any of those three calls return a
null.  That probably means the table/schema/whatever got dropped, and
if for some reason there's a transient failure, it'll get vacuumed next
time anyway.
        regards, tom lane


Re: autovacuum crash due to null pointer

От
Alvaro Herrera
Дата:
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > I'm having a hard time reproducing this problem; I added a pg_usleep()
> > just before get_rel_name to have the chance to drop the table, but
> > strangely enough it doesn't return NULL.  It seems that the cache entry
> > is not getting invalidated.  Maybe there's something else that's needed
> > to reproduce the crash.
>
> I think cache invals would get noticed at points where we had to open
> some relation, so you probably need the sleep somewhere earlier than
> that.  Or just throw in an AcceptInvalidationMessages() after the sleep.

AcceptInvalidationMessages did the trick.

Patches for 8.3 and HEAD attached.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Вложения

Re: autovacuum crash due to null pointer

От
Alvaro Herrera
Дата:
Alvaro Herrera wrote:

> Patches for 8.3 and HEAD attached.

And applied.

-- 
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.