Обсуждение: Assert while autovacuum was executing

Поиск
Список
Период
Сортировка

Assert while autovacuum was executing

От
Jaime Casanova
Дата:
Hi,

I have been testing 16beta1, last commit
a14e75eb0b6a73821e0d66c0d407372ec8376105
I just let sqlsmith do its magic before trying something else, and
today I found a core with the attached backtrace.

Only information on the log was this:

DETAIL:  Failed process was running: autovacuum: VACUUM
public.array_index_op_test

--
Jaime Casanova
Director de Servicios Profesionales
SYSTEMGUARDS - Consultores de PostgreSQL

Вложения

Re: Assert while autovacuum was executing

От
Peter Geoghegan
Дата:
On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
<jcasanov@systemguards.com.ec> wrote:
> I have been testing 16beta1, last commit
> a14e75eb0b6a73821e0d66c0d407372ec8376105
> I just let sqlsmith do its magic before trying something else, and
> today I found a core with the attached backtrace.

The assertion that fails is the IsPageLockHeld assertion from commit 72e78d831a.

I think that this is kind of an odd assertion. It's also not justified
by any comments. Why invent this rule at all?

To be fair the use of page heavyweight locks in ginInsertCleanup() is
also odd. The only reason why ginInsertCleanup() uses page-level locks
here is to get the benefit of deadlock detection, and to be able to
hold the lock for a relatively long time if that proves necessary
(i.e., interruptibility). There are reasons to doubt that that's a
good design, but either way it seems fundamentally incompatible with
the rule enforced by the assertion.

--
Peter Geoghegan



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
> <jcasanov@systemguards.com.ec> wrote:
> > I have been testing 16beta1, last commit
> > a14e75eb0b6a73821e0d66c0d407372ec8376105
> > I just let sqlsmith do its magic before trying something else, and
> > today I found a core with the attached backtrace.
>
> The assertion that fails is the IsPageLockHeld assertion from commit 72e78d831a.
>

I'll look into this and share my analysis.

--
With Regards,
Amit Kapila.



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
> > <jcasanov@systemguards.com.ec> wrote:
> > > I have been testing 16beta1, last commit
> > > a14e75eb0b6a73821e0d66c0d407372ec8376105
> > > I just let sqlsmith do its magic before trying something else, and
> > > today I found a core with the attached backtrace.
> >
> > The assertion that fails is the IsPageLockHeld assertion from commit 72e78d831a.
> >
>
> I'll look into this and share my analysis.
>

This failure mode appears to be introduced in commit 7d71d3dd08 (in
PG16) where we started to process the config file after acquiring page
lock during autovacuum. The problem here is that after acquiring page
lock (a heavy-weight lock), while processing the config file, we tried
to access the catalog cache which in turn attempts to acquire a lock
on the catalog relation, and that leads to the assertion failure. This
is because of an existing rule that we don't acquire any other
heavyweight lock while holding the page lock except for relation
extension. I think normally we should be careful about the lock
ordering for heavy-weight locks to avoid deadlocks but here there may
not be any existing hazard in acquiring a lock on the catalog table
after acquiring page lock on the gin index's metapage as I am not
aware of a scenario where we can acquire them in reverse order. One
naive idea is to have a parameter like vacuum_config_reload_safe to
allow config reload during autovacuum and make it false for the gin
index cleanup code path.

The reason for the existing rule for page lock and relation extension
locks was to not allow them to participate in group locking which will
allow other parallel operations like a parallel vacuum where multiple
workers can work on the same index, or parallel inserts, parallel
copy, etc. The related commits are 15ef6ff4b9, 72e78d831ab,
85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details (To allow
parallel inserts and parallel copy, we have ensured that relation
extension and page locks don't participate in group locking which
means such locks can conflict among the same group members. This is
required as it is no safer for two related processes to extend the
same relation or perform clean up in gin indexes at a time than for
unrelated processes to do the same....).

--
With Regards,
Amit Kapila.



RE: Assert while autovacuum was executing

От
"Zhijie Hou (Fujitsu)"
Дата:
On Tuesday, June 20, 2023 5:44 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> >
> > On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
> > > <jcasanov@systemguards.com.ec> wrote:
> > > > I have been testing 16beta1, last commit
> > > > a14e75eb0b6a73821e0d66c0d407372ec8376105
> > > > I just let sqlsmith do its magic before trying something else, and
> > > > today I found a core with the attached backtrace.
> > >
> > > The assertion that fails is the IsPageLockHeld assertion from commit
> 72e78d831a.
> > >
> >
> > I'll look into this and share my analysis.
> >
> 
> This failure mode appears to be introduced in commit 7d71d3dd08 (in
> PG16) where we started to process the config file after acquiring page lock
> during autovacuum. The problem here is that after acquiring page lock (a
> heavy-weight lock), while processing the config file, we tried to access the
> catalog cache which in turn attempts to acquire a lock on the catalog relation,
> and that leads to the assertion failure. This is because of an existing rule that we
> don't acquire any other heavyweight lock while holding the page lock except
> for relation extension. I think normally we should be careful about the lock
> ordering for heavy-weight locks to avoid deadlocks but here there may not be
> any existing hazard in acquiring a lock on the catalog table after acquiring page
> lock on the gin index's metapage as I am not aware of a scenario where we can
> acquire them in reverse order. One naive idea is to have a parameter like
> vacuum_config_reload_safe to allow config reload during autovacuum and
> make it false for the gin index cleanup code path.

I also think it would be better to skip reloading config in page lock cases to be consistent
with the rule. And here is the patch which does the same.

I tried to reproduce the assert failure(before applying the patch) using the following steps:

1. I added a sleep before vacuum_delay_point in ginInsertCleanup and LOG("attach this process") before sleeping.
2. And changed few GUC to make autovacuum happen more frequently and then start the server.
-
autovacuum_naptime = 5s
autovacuum_vacuum_threshold = 1
autovacuum_vacuum_insert_threshold = 100
-

3. Then I execute the following sqls:
-
create table gin_test_tbl(i int4[]);
create index gin_test_idx on gin_test_tbl using gin (i)
  with (fastupdate = on, gin_pending_list_limit = 4096);
insert into gin_test_tbl select array[1, 2, g] from generate_series(1, 20000) g;
insert into gin_test_tbl select array[1, 3, g] from generate_series(1, 1000) g;
-
4. After a while, I can see the LOG from autovacuum worker and then use gdb to attach to the autovacuum worker.
5. When the autovacuum worker is blocked, I changed the "default_text_search_config = 'pg_catalog.public'" in configure
fileand reload it.
 
6. Release the autovacuum worker and then I can see the assert failure.

And I can see the assert failure doesn't happen after applying the patch.

> 
> The reason for the existing rule for page lock and relation extension locks was
> to not allow them to participate in group locking which will allow other parallel
> operations like a parallel vacuum where multiple workers can work on the same
> index, or parallel inserts, parallel copy, etc. The related commits are 15ef6ff4b9,
> 72e78d831ab, 85f6b49c2c, and 3ba59ccc89. See 3ba59ccc89 for more details
> (To allow parallel inserts and parallel copy, we have ensured that relation
> extension and page locks don't participate in group locking which means such
> locks can conflict among the same group members. This is required as it is no
> safer for two related processes to extend the same relation or perform clean
> up in gin indexes at a time than for unrelated processes to do the same....).

Best Regards,
Hou zj

Вложения

Re: Assert while autovacuum was executing

От
Andres Freund
Дата:
Hi,

On 2023-06-20 15:14:26 +0530, Amit Kapila wrote:
> On Mon, Jun 19, 2023 at 5:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Jun 18, 2023 at 12:18 AM Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Sat, Jun 17, 2023 at 11:29 AM Jaime Casanova
> > > <jcasanov@systemguards.com.ec> wrote:
> > > > I have been testing 16beta1, last commit
> > > > a14e75eb0b6a73821e0d66c0d407372ec8376105
> > > > I just let sqlsmith do its magic before trying something else, and
> > > > today I found a core with the attached backtrace.
> > >
> > > The assertion that fails is the IsPageLockHeld assertion from commit 72e78d831a.
> > >
> >
> > I'll look into this and share my analysis.
> >
> 
> This failure mode appears to be introduced in commit 7d71d3dd08 (in
> PG16) where we started to process the config file after acquiring page
> lock during autovacuum.

I find it somewhat hard to believe that this is the only way to reach this
issue. You're basically asserting that there's not a single cache lookup
reachable from inside ginInsertCleanup() - which seems unlikely, given the
range of comparators that can exist.

<plays around>

Yep. Doesn't even require enabling debug_discard_caches or reconnecting.


DROP TABLE IF EXISTS tbl_foo;
DROP TYPE IF EXISTS Foo;

CREATE TYPE foo AS ENUM ('a', 'b', 'c');
ALTER TYPE foo ADD VALUE 'ab' BEFORE 'b';
CREATE TABLE tbl_foo (foo foo);
CREATE INDEX tbl_foo_idx ON tbl_foo USING gin (foo) WITH (fastupdate = on);

INSERT INTO tbl_foo(foo) VALUES ('ab'), ('a'), ('b'), ('c');

SELECT gin_clean_pending_list('tbl_foo_idx');


As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
also means 3ba59ccc89 is not right.

Greetings,

Andres Freund



Re: Assert while autovacuum was executing

От
Peter Geoghegan
Дата:
On Tue, Jun 20, 2023 at 10:27 PM Andres Freund <andres@anarazel.de> wrote:
> As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> also means 3ba59ccc89 is not right.

Quite possibly. But I maintain that ginInsertCleanup() is probably
also bogus in a way that's directly relevant.

Did you know that ginInsertCleanup() is the only code that uses
heavyweight page locks these days? Though only on the index metapage!

Isn't this the kind of thing that VACUUM's relation level lock is
supposed to take care of?


--
Peter Geoghegan



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Wed, Jun 21, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
>
> As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> also means 3ba59ccc89 is not right.
>

Indeed. I was thinking of a fix but couldn't find one yet. One idea I
am considering is to allow catalog table locks after page lock but I
think apart from hacky that also won't work because we still need to
remove the check added for page locks in the deadlock code path in
commit 3ba59ccc89 and may need to do something for group locking. Feel
free to share any ideas if you have, I can try to evaluate those in
detail. I think in the worst case we need to remove the changes added
by 72e78d831a and 3ba59ccc89 which won't impact any existing feature
but will add a hurdle in parallelizing other write operations or even
improving the parallelism in vacuum (like allowing multiple workers
for an index).

--
With Regards,
Amit Kapila.



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Wed, Jun 21, 2023 at 11:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Jun 20, 2023 at 10:27 PM Andres Freund <andres@anarazel.de> wrote:
> > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> > also means 3ba59ccc89 is not right.
>
> Quite possibly. But I maintain that ginInsertCleanup() is probably
> also bogus in a way that's directly relevant.
>
> Did you know that ginInsertCleanup() is the only code that uses
> heavyweight page locks these days? Though only on the index metapage!
>
> Isn't this the kind of thing that VACUUM's relation level lock is
> supposed to take care of?
>

Yeah, I also can't see why that shouldn't be sufficient for VACUUM.
Assuming your observation is correct, what do you suggest doing in
this regard?

--
With Regards,
Amit Kapila.



Re: Assert while autovacuum was executing

От
Andres Freund
Дата:
Hi,

On 2023-06-22 10:00:01 +0530, Amit Kapila wrote:
> On Wed, Jun 21, 2023 at 11:53 AM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Tue, Jun 20, 2023 at 10:27 PM Andres Freund <andres@anarazel.de> wrote:
> > > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> > > also means 3ba59ccc89 is not right.
> >
> > Quite possibly. But I maintain that ginInsertCleanup() is probably
> > also bogus in a way that's directly relevant.
> >
> > Did you know that ginInsertCleanup() is the only code that uses
> > heavyweight page locks these days? Though only on the index metapage!
> >
> > Isn't this the kind of thing that VACUUM's relation level lock is
> > supposed to take care of?
> >
> 
> Yeah, I also can't see why that shouldn't be sufficient for VACUUM.

I'd replied on that point to Peter earlier, accidentlly loosing the CC
list. The issue is that ginInsertCleanup() isn't just called from VACUUM, but
also from normal inserts (to reduce the size of the fastupdate list).

You can possibly come up with another scheme, but I think just doing this via
the relation lock might be problematic. Suddenly an insert would, temporarily,
also block operations that don't normally conflict with inserts etc.

Greetings,

Andres Freund



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Thu, Jun 22, 2023 at 9:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 21, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> > also means 3ba59ccc89 is not right.
> >
>
> Indeed. I was thinking of a fix but couldn't find one yet. One idea I
> am considering is to allow catalog table locks after page lock but I
> think apart from hacky that also won't work because we still need to
> remove the check added for page locks in the deadlock code path in
> commit 3ba59ccc89 and may need to do something for group locking.
>

I have further thought about this part and I think even if we remove
the changes in commit 72e78d831a (remove the assertion for page locks
in LockAcquireExtended()) and remove the check added for page locks in
FindLockCycleRecurseMember() via commit 3ba59ccc89, it is still okay
to keep the change related to "allow page lock to conflict among
parallel group members" in LockCheckConflicts(). This is because locks
on catalog tables don't conflict among group members. So, we shouldn't
see a deadlock among parallel group members. Let me try to explain
this thought via an example:

Begin;
Lock pg_enum in Access Exclusive mode;
gin_clean_pending_list() -- assume this function is executed by both
leader and parallel worker; also this requires a lock on pg_enum as
shown by Andres in email [1]

Say the parallel worker acquires page lock first and it will also get
lock on pg_enum because of group locking, so, the leader backend will
wait for page lock for the parallel worker. Eventually, the parallel
worker will release the page lock and the leader backend can get the
lock. So, we should be still okay with parallelism.

OTOH, if the above theory is wrong or people are not convinced, I am
okay with removing all the changes in commits 72e78d831a and
3ba59ccc89.

[1] - https://www.postgresql.org/message-id/20230621052713.wc5377dyslxpckfj%40awork3.anarazel.de

--
With Regards,
Amit Kapila.



Re: Assert while autovacuum was executing

От
Dilip Kumar
Дата:
On Fri, Jun 23, 2023 at 2:04 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Jun 22, 2023 at 9:16 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Jun 21, 2023 at 10:57 AM Andres Freund <andres@anarazel.de> wrote:
> > >
> > > As far as I can tell 72e78d831a as-is is just bogus. Unfortunately that likely
> > > also means 3ba59ccc89 is not right.
> > >
> >
> > Indeed. I was thinking of a fix but couldn't find one yet. One idea I
> > am considering is to allow catalog table locks after page lock but I
> > think apart from hacky that also won't work because we still need to
> > remove the check added for page locks in the deadlock code path in
> > commit 3ba59ccc89 and may need to do something for group locking.
> >
>
> I have further thought about this part and I think even if we remove
> the changes in commit 72e78d831a (remove the assertion for page locks
> in LockAcquireExtended()) and remove the check added for page locks in
> FindLockCycleRecurseMember() via commit 3ba59ccc89, it is still okay
> to keep the change related to "allow page lock to conflict among
> parallel group members" in LockCheckConflicts(). This is because locks
> on catalog tables don't conflict among group members. So, we shouldn't
> see a deadlock among parallel group members. Let me try to explain
> this thought via an example:
>

IMHO, whatsoever the case this check[1], is not wrong at all.  I agree
that we do not have parallel write present in the code so having this
check is not necessary as of now.  But in theory, this check is
correct because this is saying that parallel leader and worker should
conflict on the 'relation extension lock' and the 'page lock' and
that's the fact.  It holds true irrespective of whether it is being
used currently or not.


[1]
/*
* The relation extension or page lock conflict even between the group
* members.
*/
if (LOCK_LOCKTAG(*lock) == LOCKTAG_RELATION_EXTEND ||
(LOCK_LOCKTAG(*lock) == LOCKTAG_PAGE))
{
PROCLOCK_PRINT("LockCheckConflicts: conflicting (group)",
   proclock);
return true;
}


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Assert while autovacuum was executing

От
Andres Freund
Дата:
Hi,

On 2023-06-23 14:04:15 +0530, Amit Kapila wrote:
> OTOH, if the above theory is wrong or people are not convinced, I am
> okay with removing all the changes in commits 72e78d831a and
> 3ba59ccc89.

I am not convinced. And even if I were, coming up with new justifications in a
released version, when the existing testing clearly wasn't enough to find the
current bug, doesn't strike me as wise.

Greetings,

Andres Freund



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Fri, Jun 23, 2023 at 10:07 PM Andres Freund <andres@anarazel.de> wrote:
>
> On 2023-06-23 14:04:15 +0530, Amit Kapila wrote:
> > OTOH, if the above theory is wrong or people are not convinced, I am
> > okay with removing all the changes in commits 72e78d831a and
> > 3ba59ccc89.
>
> I am not convinced. And even if I were, coming up with new justifications in a
> released version, when the existing testing clearly wasn't enough to find the
> current bug, doesn't strike me as wise.
>

Fair enough. If we could have been convinced of this then we can keep
the required change only for HEAD. But anyway let's remove the work
related to both commits (72e78d831a and 3ba59ccc89) for now and then
we can come back to it when we parallelize writes. The attached patch
removes the changes added by both commits with slight tweaking in
comments/readme based on the recent state.

--
With Regards,
Amit Kapila.

Вложения

RE: Assert while autovacuum was executing

От
"Zhijie Hou (Fujitsu)"
Дата:
On Monday, June 26, 2023 12:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Fri, Jun 23, 2023 at 10:07 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2023-06-23 14:04:15 +0530, Amit Kapila wrote:
> > > OTOH, if the above theory is wrong or people are not convinced, I am
> > > okay with removing all the changes in commits 72e78d831a and
> > > 3ba59ccc89.
> >
> > I am not convinced. And even if I were, coming up with new
> > justifications in a released version, when the existing testing
> > clearly wasn't enough to find the current bug, doesn't strike me as wise.
> >
> 
> Fair enough. If we could have been convinced of this then we can keep the
> required change only for HEAD. But anyway let's remove the work related to
> both commits (72e78d831a and 3ba59ccc89) for now and then we can come
> back to it when we parallelize writes. The attached patch removes the changes
> added by both commits with slight tweaking in comments/readme based on
> the recent state.

Thanks for the patch. I have confirmed that the patch to revert page lock
handling applies cleanly on all branches(13~HEAD) and the assert failure and
undetectable deadlock problem are fixed after applying the patch.

Best Regards,
Hou zj

Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Wed, Jun 28, 2023 at 7:26 AM Zhijie Hou (Fujitsu)
<houzj.fnst@fujitsu.com> wrote:
>
> On Monday, June 26, 2023 12:18 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Fair enough. If we could have been convinced of this then we can keep the
> > required change only for HEAD. But anyway let's remove the work related to
> > both commits (72e78d831a and 3ba59ccc89) for now and then we can come
> > back to it when we parallelize writes. The attached patch removes the changes
> > added by both commits with slight tweaking in comments/readme based on
> > the recent state.
>
> Thanks for the patch. I have confirmed that the patch to revert page lock
> handling applies cleanly on all branches(13~HEAD) and the assert failure and
> undetectable deadlock problem are fixed after applying the patch.
>

Thanks for the verification. Unless someone has any further comments
or suggestions, I'll push this next week sometime.

--
With Regards,
Amit Kapila.



Re: Assert while autovacuum was executing

От
Amit Kapila
Дата:
On Fri, Jun 30, 2023 at 8:26 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jun 28, 2023 at 7:26 AM Zhijie Hou (Fujitsu)
> <houzj.fnst@fujitsu.com> wrote:
> >
>
> Thanks for the verification. Unless someone has any further comments
> or suggestions, I'll push this next week sometime.
>

Pushed but forgot to do indent which leads to BF failure[1]. I'll take
care of it.

[1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=koel&dt=2023-07-06%2005%3A19%3A03

--
With Regards,
Amit Kapila.