Обсуждение: Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

Поиск
Список
Период
Сортировка

Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Andres Freund
Дата:
Hi Tom,

On 2017-09-29 20:26:42 +0000, Tom Lane wrote:
> get_rel_oids used to not take any relation locks at all, but that stopped
> being a good idea with commit 3c3bb9933, which inserted a syscache lookup
> into the function.  A concurrent DROP TABLE could now produce "cache lookup
> failed", which we don't want to have happen in normal operation.  The best
> solution seems to be to transiently take a lock on the relation named by
> the RangeVar (which also makes the result of RangeVarGetRelid a lot less
> spongy).  But we shouldn't hold the lock beyond this function, because we
> don't want VACUUM to lock more than one table at a time.  (That would not
> be a big problem right now, but it will become one after the pending
> feature patch to allow multiple tables to be named in VACUUM.)

I'm not sure how big a problem this is, but I suspect it is
one. Autovacuum used to skip relations when they're locked, and the
vacuum isn't an anti-wraparound one.  But after this change it appears
we'll get stuck behind this new lock, even if VACOPT_NOWAIT is
specified.  That's bad because now relations that are AEL locked
(e.g. because of some longrunning DDL) can block global autovacuum
progress.

I noticed this when reviewing a patch that adds NOWAIT (now renamed to
SKIP LOCKED) as a user visible knob and it getting stuck behind an
AEL. The updated version of the patch
http://archives.postgresql.org/message-id/7327B413-1A57-477F-A6A0-6FD80CB5482D%40amazon.com
has an attempt at fixing the issue, although I've not yet looked at it
in any sort of detail.

I suspect we should break out that part once reviewed, and backpatch to
10?

Greetings,

Andres Freund


Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Andres Freund
Дата:
On 2018-03-05 16:11:52 -0800, Andres Freund wrote:
> Hi Tom,
> 
> On 2017-09-29 20:26:42 +0000, Tom Lane wrote:
> > get_rel_oids used to not take any relation locks at all, but that stopped
> > being a good idea with commit 3c3bb9933, which inserted a syscache lookup
> > into the function.  A concurrent DROP TABLE could now produce "cache lookup
> > failed", which we don't want to have happen in normal operation.  The best
> > solution seems to be to transiently take a lock on the relation named by
> > the RangeVar (which also makes the result of RangeVarGetRelid a lot less
> > spongy).  But we shouldn't hold the lock beyond this function, because we
> > don't want VACUUM to lock more than one table at a time.  (That would not
> > be a big problem right now, but it will become one after the pending
> > feature patch to allow multiple tables to be named in VACUUM.)
> 
> I'm not sure how big a problem this is, but I suspect it is
> one. Autovacuum used to skip relations when they're locked, and the
> vacuum isn't an anti-wraparound one.  But after this change it appears
> we'll get stuck behind this new lock, even if VACOPT_NOWAIT is
> specified.  That's bad because now relations that are AEL locked
> (e.g. because of some longrunning DDL) can block global autovacuum
> progress.

Scratch that, we should be going down the
    /* If caller supplied OID, there's nothing we need do here. */
    if (OidIsValid(vrel->oid))
    {
        oldcontext = MemoryContextSwitchTo(vac_context);
        vacrels = lappend(vacrels, vrel);
        MemoryContextSwitchTo(oldcontext);
    }
branch in expand_vacuum_rel() for autovacuum, so this shouldn't
matter. Sorry for the noise

Greetings,

Andres Freund


Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Andres Freund
Дата:
On 2018-03-05 16:11:52 -0800, Andres Freund wrote:
> Hi Tom,
> 
> On 2017-09-29 20:26:42 +0000, Tom Lane wrote:
> > get_rel_oids used to not take any relation locks at all, but that stopped
> > being a good idea with commit 3c3bb9933, which inserted a syscache lookup
> > into the function.  A concurrent DROP TABLE could now produce "cache lookup
> > failed", which we don't want to have happen in normal operation.  The best
> > solution seems to be to transiently take a lock on the relation named by
> > the RangeVar (which also makes the result of RangeVarGetRelid a lot less
> > spongy).  But we shouldn't hold the lock beyond this function, because we
> > don't want VACUUM to lock more than one table at a time.  (That would not
> > be a big problem right now, but it will become one after the pending
> > feature patch to allow multiple tables to be named in VACUUM.)
> 
> I'm not sure how big a problem this is, but I suspect it is
> one. Autovacuum used to skip relations when they're locked, and the
> vacuum isn't an anti-wraparound one.  But after this change it appears
> we'll get stuck behind this new lock, even if VACOPT_NOWAIT is
> specified.  That's bad because now relations that are AEL locked
> (e.g. because of some longrunning DDL) can block global autovacuum
> progress.

Scratch that, we should be going down the
    /* If caller supplied OID, there's nothing we need do here. */
    if (OidIsValid(vrel->oid))
    {
        oldcontext = MemoryContextSwitchTo(vac_context);
        vacrels = lappend(vacrels, vrel);
        MemoryContextSwitchTo(oldcontext);
    }
branch in expand_vacuum_rel() for autovacuum, so this shouldn't
matter. Sorry for the noise

Greetings,

Andres Freund


Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> On 2017-09-29 20:26:42 +0000, Tom Lane wrote:
>> get_rel_oids used to not take any relation locks at all, but that stopped
>> being a good idea with commit 3c3bb9933, which inserted a syscache lookup
>> into the function.  A concurrent DROP TABLE could now produce "cache lookup
>> failed", which we don't want to have happen in normal operation.  The best
>> solution seems to be to transiently take a lock on the relation named by
>> the RangeVar (which also makes the result of RangeVarGetRelid a lot less
>> spongy).  But we shouldn't hold the lock beyond this function, because we
>> don't want VACUUM to lock more than one table at a time.  (That would not
>> be a big problem right now, but it will become one after the pending
>> feature patch to allow multiple tables to be named in VACUUM.)

> I'm not sure how big a problem this is, but I suspect it is
> one. Autovacuum used to skip relations when they're locked, and the
> vacuum isn't an anti-wraparound one.  But after this change it appears
> we'll get stuck behind this new lock, even if VACOPT_NOWAIT is
> specified.  That's bad because now relations that are AEL locked
> (e.g. because of some longrunning DDL) can block global autovacuum
> progress.

Hm.  Maybe we should take this lock with nowait if the vacuum option
is specified?

Another idea is to revert both this patch and 3c3bb9933, and instead
handle partitioning more like we handle recursion for toast tables, ie
make no decisions until after we do have lock on a relation.  The
really fundamental problem here is that 3c3bb9933 thought it is OK
to do a syscache lookup on a table we have no lock on, which is flat
wrong.  But we don't necessarily have to do it exactly like that.

            regards, tom lane


Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Tom Lane
Дата:
Andres Freund <andres@anarazel.de> writes:
> Scratch that, we should be going down the
>     /* If caller supplied OID, there's nothing we need do here. */
> branch in expand_vacuum_rel() for autovacuum, so this shouldn't
> matter. Sorry for the noise

But you said you'd seen blocking behind AEL with NOWAIT, so there's
still a problem for manual vacuums no?

            regards, tom lane


Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Michael Paquier
Дата:
On Mon, Mar 05, 2018 at 04:34:09PM -0800, Andres Freund wrote:
> Scratch that, we should be going down the
>     /* If caller supplied OID, there's nothing we need do here. */
>     if (OidIsValid(vrel->oid))
>     {
>         oldcontext = MemoryContextSwitchTo(vac_context);
>         vacrels = lappend(vacrels, vrel);
>         MemoryContextSwitchTo(oldcontext);
>     }
> branch in expand_vacuum_rel() for autovacuum, so this shouldn't
> matter. Sorry for the noise

Yes, I don't see a problem.  However I can understand that it is easy to
be confused on those code paths if you are not used to them and this
area has changed quite a bit the last years.  Perhaps we could add an
Assert(IsAutoVacuumWorkerProcess()) to reduce the confusion?
--
Michael

Вложения

Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Michael Paquier
Дата:
On Mon, Mar 05, 2018 at 04:34:09PM -0800, Andres Freund wrote:
> Scratch that, we should be going down the
>     /* If caller supplied OID, there's nothing we need do here. */
>     if (OidIsValid(vrel->oid))
>     {
>         oldcontext = MemoryContextSwitchTo(vac_context);
>         vacrels = lappend(vacrels, vrel);
>         MemoryContextSwitchTo(oldcontext);
>     }
> branch in expand_vacuum_rel() for autovacuum, so this shouldn't
> matter. Sorry for the noise

Yes, I don't see a problem.  However I can understand that it is easy to
be confused on those code paths if you are not used to them and this
area has changed quite a bit the last years.  Perhaps we could add an
Assert(IsAutoVacuumWorkerProcess()) to reduce the confusion?
--
Michael

Вложения

Re: [COMMITTERS] pgsql: Fix inadequate locking during get_rel_oids().

От
Andres Freund
Дата:
On 2018-03-05 19:53:23 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Scratch that, we should be going down the
> >     /* If caller supplied OID, there's nothing we need do here. */
> > branch in expand_vacuum_rel() for autovacuum, so this shouldn't
> > matter. Sorry for the noise
> 
> But you said you'd seen blocking behind AEL with NOWAIT, so there's
> still a problem for manual vacuums no?

Right. But that facility isn't yet exposed to SQL, just in the patch I'm
reviewing. So there's no issue with the code from the commit I'm
replying to.

In [1] I wrote about one idea how to resolve this for the proposed
patch.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20180306005349.b65whmvj7z6hbe2y@alap3.anarazel.de