Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()

From: Peter Geoghegan
Subject: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()
Date:
Msg-id: CAH2-Wzn57T=d7eB90m0wr+AiAXetk-NWA=ntS89R2mOcDimNsQ@mail.gmail.com
In response to: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Noah Misch <noah@leadboat.com>)
Responses: Re: BUG #17257: (auto)vacuum hangs within lazy_scan_prune()  (Peter Geoghegan <pg@bowt.ie>)
List: pgsql-bugs
On Tue, Jan 9, 2024 at 4:44 PM Noah Misch <noah@leadboat.com> wrote:
> On Tue, Jan 09, 2024 at 03:59:19PM -0500, Peter Geoghegan wrote:
> > Did the affected system that you investigated happen to have an
> > atypically high number of databases? The 15.4 system that I saw
> > the problem on had almost 3,000 databases.
>
> No, single-digit database count here.

My suspicion was that this factor might increase the propensity of
calls to GetOldestNonRemovableTransactionId (used to establish
VACUUM's OldestXmin) to not agree with the GlobalVis* based state used
by pruneheap.c, in the way that we need to worry about here (i.e.
inconsistencies that lead to VACUUM getting stuck inside
lazy_scan_prune's loop).

Using gdb I was able to determine that
ComputeXidHorizonsResultLastXmin == RecentXmin at some point long
after the system gets stuck (when I actually looked). So
GlobalVisTestShouldUpdate() doesn't return true at that point. And, I
see that VACUUM's OldestXmin value is between
GlobalVisDataRels.maybe_needed and
GlobalVisDataRels.definitely_needed. The deleted tuple's xmax is
committed according to OldestXmin (i.e. it's a value < OldestXmin),
and is < GlobalVisDataRels.definitely_needed, too. The same tuple xmax
is > GlobalVisDataRels.maybe_needed. As for this tuple's xmin, it was
already frozen by a previous VACUUM operation. The tuple infomask
flags indicate that it's a pretty standard deleted tuple.
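
For illustration, the ordering described above can be modeled in a few
lines of C. This is a simplified sketch, not PostgreSQL source: the
names model_xid, ModelVisState and the model_* functions are
hypothetical, XIDs are treated as plain 64-bit integers with no
wraparound, and the pruning check assumes that the horizons are not
recomputed (which is what GlobalVisTestShouldUpdate() returning false
implies here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical simplification: XIDs compared as plain integers. */
typedef uint64_t model_xid;

/* Hypothetical stand-in for the two GlobalVisDataRels bounds. */
typedef struct ModelVisState
{
    model_xid maybe_needed;      /* XIDs below this are removable */
    model_xid definitely_needed; /* XIDs at or above this are not */
} ModelVisState;

/* VACUUM's view: a committed deleter older than OldestXmin means dead. */
static bool
model_dead_per_oldest_xmin(model_xid xmax, model_xid oldest_xmin)
{
    return xmax < oldest_xmin;
}

/*
 * Pruning's view when the horizons are NOT recomputed: only XIDs below
 * maybe_needed are removable. Anything in the window between the two
 * bounds is conservatively kept.
 */
static bool
model_removable_without_refresh(const ModelVisState *state, model_xid xid)
{
    assert(state->maybe_needed <= state->definitely_needed);
    return xid < state->maybe_needed;
}

/* True when VACUUM sees a dead tuple that pruning declines to remove. */
static bool
model_checks_disagree(const ModelVisState *state, model_xid xmax,
                      model_xid oldest_xmin)
{
    return model_dead_per_oldest_xmin(xmax, oldest_xmin) &&
           !model_removable_without_refresh(state, xmax);
}
```

With made-up numbers that only preserve the reported ordering
(maybe_needed = 100, xmax = 150, OldestXmin = 200, definitely_needed =
300), model_checks_disagree() returns true. As long as neither bound
moves, retrying the prune gives the same answer, which is consistent
with VACUUM never leaving lazy_scan_prune's loop.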

Overall, few of these details seem out of the ordinary, so nothing
here hints at a specific underlying cause.

It looks more like the assumptions that we make about OldestXmin
agreeing with GlobalVis* state just aren't quite robust, in general.
Ideally I'd be able to point to some specific assumption that has been
violated -- and we might yet tie the problem to some specific detail
that I've yet to identify. As I said upthread, I'm concerned that code
in places like procarray.c is rather loose about how the horizons are
recomputed, in a way that doesn't sit well with me.
GlobalVisTestShouldUpdate() thinks that it's okay to use
ComputeXidHorizonsResultLastXmin-based heuristics to decide when to
recompute horizons. It is more or less treated as a matter of weighing
costs against benefits -- not as a potential correctness issue.
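
The kind of heuristic being described can be sketched as follows. This
is a simplified model based on my reading of procarray.c, not a copy
of it: model_xid, ModelVisState, MODEL_INVALID_XID and
model_should_update() are hypothetical names, and the two xmin
arguments stand in for ComputeXidHorizonsResultLastXmin and
RecentXmin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simplification: XIDs compared as plain integers. */
typedef uint64_t model_xid;

#define MODEL_INVALID_XID ((model_xid) 0)

typedef struct ModelVisState
{
    model_xid maybe_needed;
    model_xid definitely_needed;
} ModelVisState;

/*
 * Decide whether recomputing the horizons looks worthwhile.
 * last_computed_xmin models ComputeXidHorizonsResultLastXmin and
 * recent_xmin models RecentXmin.
 */
static bool
model_should_update(const ModelVisState *state,
                    model_xid last_computed_xmin,
                    model_xid recent_xmin)
{
    assert(state != NULL);

    /* Horizons were never computed: always worth doing. */
    if (last_computed_xmin == MODEL_INVALID_XID)
        return true;

    /* No window between the bounds: a refresh is unlikely to help. */
    if (state->maybe_needed >= state->definitely_needed)
        return false;

    /* Otherwise refresh only if a newer snapshot moved xmin. */
    return recent_xmin != last_computed_xmin;
}
```

In this model, equal values for the two xmin arguments make
model_should_update() return false no matter how wide the window
between maybe_needed and definitely_needed is. The decision is purely
a cost/benefit shortcut, so nothing in it forces the bounds to catch
up with an OldestXmin that VACUUM already computed.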

--
Peter Geoghegan


