Обсуждение: cache lookup failed dropping public schema with trgm index

Поиск
Список
Период
Сортировка

cache lookup failed dropping public schema with trgm index

От
Wyatt Alt
Дата:
Hi,

This reproduces on 15.4 and 13.12:

create table foo(t text);
create extension pg_trgm;
create index on foo using gist(t gist_trgm_ops);
drop schema public cascade;

NOTICE:  drop cascades to 2 other objects
DETAIL:  drop cascades to table foo
drop cascades to extension pg_trgm
ERROR:  cache lookup failed for function 1195999
Time: 20.968 ms

Re: cache lookup failed dropping public schema with trgm index

От
Michael Paquier
Дата:
On Mon, Aug 21, 2023 at 11:40:15AM -0700, Wyatt Alt wrote:
> This reproduces on 15.4 and 13.12:
>
> create table foo(t text);
> create extension pg_trgm;
> create index on foo using gist(t gist_trgm_ops);
> drop schema public cascade;

Indeed, reproduced here down to 13.  12 and older versions don't react
like that.  I'll go bisect that..
--
Michael

Вложения

Re: cache lookup failed dropping public schema with trgm index

От
Andres Freund
Дата:
Hi,

On 2023-08-21 11:40:15 -0700, Wyatt Alt wrote:
> This reproduces on 15.4 and 13.12:

Also reproduces on HEAD.


> create table foo(t text);
> create extension pg_trgm;
> create index on foo using gist(t gist_trgm_ops);
> drop schema public cascade;
>
> NOTICE:  drop cascades to 2 other objects
> DETAIL:  drop cascades to table foo
> drop cascades to extension pg_trgm
> ERROR:  cache lookup failed for function 1195999
> Time: 20.968 ms

It also seems to work without even involving a drop schema. Just dropping
pg_trgm with cascade is sufficient.

<several wrong theories>

I think we might primarily dealing with missing invalidations. Dropping
objects is scheduled in an order where we first drop
  schedule deletion of function 10 (text, text) of operator family t.gist_trgm_ops for access method gist:
t.gtrgm_options(internal)at 0
 
  schedule deletion of function t.gtrgm_options(internal) at 1
and then later
  schedule deletion of index t.foo_t at 32


During the index deletion we try to initialize the access method. But haven't
performed sufficient invalidation and still think "function 10"
exists. Calling it then causes the error.

One can "verify" this theory by adding an InvalidateSystemCaches() at the end
of deleteOneObject(). That "fixes" the issue.  This also explains why dropping
pg_trgm in a new session works - we don't have old cache entries that could be
out of date.

Not quite sure where we are dropping the ball with invalidations yet.


However, I suspect there's more wrong than this, albeit perhaps not
problematic in a huge way. While debugging I added a getObjectDescription()
description call to deleteObjectsInList() *before* calling
deleteOneObject(). That fails when dropping pg_trgmp, because we end up
dropping the type gtrgm before trgm_out(). The reason this happens is because
we reach gtrgm_out() via the extension dependency, rather than via the
type. When recursing to gtrgm_out(), we recurse to type gtrgm, recurse to
gtrgm_in(), schedule it for deletion, then recurse to gtrgm_out(), but find
it's in the stack and do *not* schedule it for deletion, before scheduling
gtrgm for deletion. Only then we delete gtrgm_out().

Now, this isn't a real issue in practice (without such a debugging statement,
which likely can't work in some cases), but I strongly suspect that it
indicates a scheduling order issue that's more widespread. Despite, I think,
correct dependencies, we end up with a topologically inconsistent drop
order. There aren't any cycles in the directed dependency graph from what I
can see.

Greetings,

Andres Freund



Re: cache lookup failed dropping public schema with trgm index

От
Michael Paquier
Дата:
On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:
> It also seems to work without even involving a drop schema. Just dropping
> pg_trgm with cascade is sufficient.

FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

> Now, this isn't a real issue in practice (without such a debugging statement,
> which likely can't work in some cases), but I strongly suspect that it
> indicates a scheduling order issue that's more widespread. Despite, I think,
> correct dependencies, we end up with a topologically inconsistent drop
> order. There aren't any cycles in the directed dependency graph from what I
> can see.

Yeah, guess so.  I was first betting on a missing shared inval here.
Now note that for example, this command works:
psql -v ON_ERROR_STOP=1 -c 'create table foo(t text); create extension pg_trgm; create index on foo using gist(t
gist_trgm_ops);createindex on foo using gist(t gist_trgm_ops); drop schema public cascade;'
 
--
Michael

Вложения

Re: cache lookup failed dropping public schema with trgm index

От
Wyatt Alt
Дата:
Thanks for having a look. Note also that switching the first two lines of my example prevents the failure.

On Mon, Aug 21, 2023 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:
> It also seems to work without even involving a drop schema. Just dropping
> pg_trgm with cascade is sufficient.

FWIW, after a bisect I can see that 911e7020 is the origin of the
failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

> Now, this isn't a real issue in practice (without such a debugging statement,
> which likely can't work in some cases), but I strongly suspect that it
> indicates a scheduling order issue that's more widespread. Despite, I think,
> correct dependencies, we end up with a topologically inconsistent drop
> order. There aren't any cycles in the directed dependency graph from what I
> can see.

Yeah, guess so.  I was first betting on a missing shared inval here.
Now note that for example, this command works:
psql -v ON_ERROR_STOP=1 -c 'create table foo(t text); create extension pg_trgm; create index on foo using gist(t gist_trgm_ops);create index on foo using gist(t gist_trgm_ops); drop schema public cascade;'
--
Michael

Re: cache lookup failed dropping public schema with trgm index

От
Tom Lane
Дата:
Michael Paquier <michael@paquier.xyz> writes:
> On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:
>> It also seems to work without even involving a drop schema. Just dropping
>> pg_trgm with cascade is sufficient.

> FWIW, after a bisect I can see that 911e7020 is the origin of the
> failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).

Hmm.  I see that 911e7020 modified pg_trgm's install script with

+ALTER OPERATOR FAMILY gist_trgm_ops USING gist
+ADD FUNCTION 10 (text) gtrgm_options (internal);

I wonder whether that correctly adds a dependency to ensure the
opfamily is dropped before the function.

            regards, tom lane



Re: cache lookup failed dropping public schema with trgm index

От
Michael Paquier
Дата:
On Mon, Aug 21, 2023 at 07:27:19PM -0400, Tom Lane wrote:
> Hmm.  I see that 911e7020 modified pg_trgm's install script with
>
> +ALTER OPERATOR FAMILY gist_trgm_ops USING gist
> +ADD FUNCTION 10 (text) gtrgm_options (internal);
>
> I wonder whether that correctly adds a dependency to ensure the
> opfamily is dropped before the function.

The dependencies look OK seen from here when it comes to the
extension, the indexes or the option function.  See, for instance:
=# SELECT pg_describe_object(classid, objid, objsubid) as obj,
     pg_describe_object(refclassid,refobjid,refobjsubid) as objref,
     deptype FROM pg_depend
   WHERE (classid = 'pg_proc'::regclass AND objid = 'gtrgm_options'::regproc)
      OR (refclassid = 'pg_proc'::regclass AND refobjid = 'gtrgm_options'::regproc);
-[ RECORD 1]------------------------------------------------------------------------------------------------------
obj     | function gtrgm_options(internal)
objref  | schema public
deptype | n
-[ RECORD 2
]------------------------------------------------------------------------------------------------------
obj     | function gtrgm_options(internal)
objref  | extension pg_trgm
deptype | e
-[ RECORD 3]------------------------------------------------------------------------------------------------------
obj     | function 10 (text, text) of operator family gist_trgm_ops for access method gist: gtrgm_options(internal)
objref  | function gtrgm_options(internal)
deptype | a
--
Michael

Вложения

Re: cache lookup failed dropping public schema with trgm index

От
Andres Freund
Дата:
Hi,

On 2023-08-21 19:27:19 -0400, Tom Lane wrote:
> Michael Paquier <michael@paquier.xyz> writes:
> > On Mon, Aug 21, 2023 at 03:36:10PM -0700, Andres Freund wrote:
> >> It also seems to work without even involving a drop schema. Just dropping
> >> pg_trgm with cascade is sufficient.
>
> > FWIW, after a bisect I can see that 911e7020 is the origin of the
> > failure (`git bisect start b5d69b7 9e1c9f9` based on two merge-bases).
>
> Hmm.  I see that 911e7020 modified pg_trgm's install script with
>
> +ALTER OPERATOR FAMILY gist_trgm_ops USING gist
> +ADD FUNCTION 10 (text) gtrgm_options (internal);
>
> I wonder whether that correctly adds a dependency to ensure the
> opfamily is dropped before the function.

It's not quite that, I think. Whether it fails or succeeds depends on the
state of the system caches. The dependencies lead to the equivalent of
  ALTER OPERATOR FAMILY ... USING ... DROP
being performed, before dropping the index.

I think there are at least three levels of problems here:

- I don't think we currently properly force index relcache entries to be
  invalidated when the relevant opfamily changes

- LookupOpclassInfo() doesn't have *any* invalidation support, so even if we
  were to perform invalidation, we'd still return a potentially stale
  entry. The function says:

 * Note there is no provision for flushing the cache.  This is OK at the
 * moment because there is no way to ALTER any interesting properties of an
 * existing opclass --- all you can do is drop it, which will result in
 * a useless but harmless dead entry in the cache.

  But that's not true (amymore?), because it does do pg_amproc lookups, and
  they *can* change.

- Minor: Even if you force LookupOpclassInfo() to perform lookups again, it
  doesn't work, because when an amproc entry doesn't exist anymore, the old
  value in opcentry->supportProcs[i] survives, because nothing resets it.

Greetings,

Andres Freund



Re: cache lookup failed dropping public schema with trgm index

От
Michael Paquier
Дата:
On Tue, Aug 22, 2023 at 11:45:58AM -0700, Andres Freund wrote:
> It's not quite that, I think. Whether it fails or succeeds depends on the
> state of the system caches. The dependencies lead to the equivalent of
>   ALTER OPERATOR FAMILY ... USING ... DROP
> being performed, before dropping the index.
>
> I think there are at least three levels of problems here:
>
> - I don't think we currently properly force index relcache entries to be
>   invalidated when the relevant opfamily changes
>
> - LookupOpclassInfo() doesn't have *any* invalidation support, so even if we
>   were to perform invalidation, we'd still return a potentially stale
>   entry. The function says:
>
>  * Note there is no provision for flushing the cache.  This is OK at the
>  * moment because there is no way to ALTER any interesting properties of an
>  * existing opclass --- all you can do is drop it, which will result in
>  * a useless but harmless dead entry in the cache.
>
>   But that's not true (amymore?), because it does do pg_amproc lookups, and
>   they *can* change.
>
> - Minor: Even if you force LookupOpclassInfo() to perform lookups again, it
>   doesn't work, because when an amproc entry doesn't exist anymore, the old
>   value in opcentry->supportProcs[i] survives, because nothing resets it.

Another thing: we also need to think harder about
RelationReloadIndexInfo() which is the cheap path for cache reload for
the index information.  This is quite different from rd_options,
though, because we are going to need an extra facility to look at the
pg_attribute tuples for the indexes to get access to the new option
options?  We touch that a bit in RelationBuildTupleDesc(), but not in
this cheap reload path.  This is going to require more infrastructure,
at quick glance..
--
Michael

Вложения