At Tue, 25 Jul 2023 13:00:00 +0300, Alexander Lakhin <exclusion@gmail.com> wrote in
> Hi Tom,
>
> 21.07.2023 22:21, Tom Lane wrote:
> > Yes, we certainly want to do that during LockRelationOid. But what
> > seems to be happening here is an inval while we are closing/unlocking
> > the catalog we got the syscache entry from. That is, the expected
> > behavior here is:
> >
> > SearchSysCacheExists:
> >
> > * is entry present-and-valid?
> > No, so...
> >
> > * open and lock relevant catalog (with possible inval)
> >
> > * scan catalog, find desired row, create valid syscache entry
> >
> > * close and unlock catalog
> >
> > * return success
> >
> > SearchSysCache1 (from pg_class_aclmask_ext):
> >
> > * is entry present-and-valid?
> > Yes, so increment its refcount and return it
> >
> > There is no inval in the entry-already-present code path in syscache
> > lookup. So if we are seeing this failure, ISTM it must mean that an
> > inval is happening during "close and unlock catalog", which seems like
> > something that we don't want. But I've not traced exactly how that
> > happens.
>
> Yes, but here we deal with -DCATCACHE_FORCE_RELEASE (added to
> config_env on prion), so the cache entry, that was just found in
> SearchSysCacheExists(), is removed immediately because of
> SearchSysCacheExists() -> ReleaseSysCache(tuple) ->
> ReleaseCatCache(tuple).
>
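To check my understanding of that sequence, here is a toy model of it.
This is not PostgreSQL code: every name below (toy_search, toy_exists,
force_release and so on) is made up for illustration, and invalidation
processing is ignored entirely. It only shows why "exists, then fetch"
depends on the entry surviving the release in between.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in PostgreSQL. */
typedef struct ToyEntry
{
	int			key;
	int			refcount;
	bool		valid;
} ToyEntry;

static bool force_release = false;	/* stands in for -DCATCACHE_FORCE_RELEASE */
static bool row_visible = true;		/* is the row still in the "catalog"? */
static ToyEntry cache_slot;			/* a one-entry "catcache" */

/* Look up a key; on a miss, rebuild the entry from the "catalog". */
static ToyEntry *
toy_search(int key)
{
	if (cache_slot.valid && cache_slot.key == key)
	{
		cache_slot.refcount++;
		return &cache_slot;		/* hit: the catalog is not consulted */
	}
	if (!row_visible)
		return NULL;			/* miss, and the row is gone */
	cache_slot.key = key;
	cache_slot.refcount = 1;
	cache_slot.valid = true;
	return &cache_slot;
}

/* Drop one reference; with force_release the entry is discarded at zero. */
static void
toy_release(ToyEntry *entry)
{
	entry->refcount--;
	if (force_release && entry->refcount == 0)
		entry->valid = false;
}

static bool
toy_exists(int key)
{
	ToyEntry   *entry = toy_search(key);

	if (entry == NULL)
		return false;
	toy_release(entry);
	return true;
}

/*
 * The "if (exists) ... fetch" construct.  concurrent_drop simulates another
 * session removing the row between the two lookups.
 * Returns 1 on success, 0 if the row did not exist, -1 if the second
 * lookup failed although the first one succeeded.
 */
static int
toy_check_then_fetch(int key, bool concurrent_drop)
{
	ToyEntry   *entry;

	if (!toy_exists(key))
		return 0;
	if (concurrent_drop)
		row_visible = false;
	entry = toy_search(key);
	if (entry == NULL)
		return -1;
	toy_release(entry);
	return 1;
}

/* Put the model back into its initial state. */
static void
toy_reset(bool force)
{
	force_release = force;
	row_visible = true;
	cache_slot.valid = false;
	cache_slot.refcount = 0;
}
```

In this model, toy_reset(false) followed by toy_check_then_fetch(42, true)
succeeds because the second lookup is a cache hit, whereas after
toy_reset(true) the same call returns -1 because the entry was discarded
and has to be rebuilt from a catalog that no longer contains the row.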
> So, while the construction "if (SearchSysCacheExists())
> ... SearchSysCache1()" seems robust for normal conditions, it might be
> broken when catcache entries released forcefully.
I agree about the safety of the construct.
> Thus, if the worst consequence of the issue is sporadic test failures
> on prion, then may be fix it in a least invasive way (on level 1).
> 1) test xmlmap fails sporadically due to the catalog changes caused by
> parallel tests activity
> 2) schema_to_xmlschemaX() can fail when parallel workers are used
> 3) has_table_privilegeX() can fail sporadically when executed within a
> parallel worker
Doesn't this imply that the function isn't parallel-safe? The issue
goes away when it and all of its variants are marked as
parallel-restricted. That seems to be a reasonable way to address this
issue.
> 4) SearchSysCacheX(RELOID, ...) can switch to a newer catalog snapshot,
> when repeated in a parallel worker
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center