Обсуждение: snapshot too old issues, first around wraparound and then more.

Поиск

Список

Период

Сортировка

snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 09:40:08

Hi,

Sorry, this mail is somewhat long. But I think it's important that at
least a few committers read it, since I think we're going to have to
make some sort of call about what to do.


I am trying to change the snapshot too old infrastructure so it
cooperates with my snapshot scalability patch. While trying to
understand the code sufficiently, I think I found a fairly serious
issue:

To map the time-based old_snapshot_threshold to an xid that can be used
as a cutoff for heap_page_prune(), we maintain a ringbuffer of
old_snapshot_threshold + 10 entries in
oldSnapshotControlData->xid_by_minute[]. TransactionIdLimitedForOldSnapshots()
uses that to (at least that's the goal) increase the horizon used for
pruning.

The problem is that there's no protection again the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit.

That then has the consequence that we can use an xid that's either from
"from the future" (if bigger than the current xid), or more recent than
appropriate (if it wrapped far enough to be below nextxid, but not yet
older than OldestXmin) as the OldestXmin argument to heap_page_prune().

Which in turn means that we can end up pruning much more recently
removed rows than intended.

While that'd be bad on its own, the big problem is that we won't detect
that case on access, in contrast to the way old_snapshot_threshold is
intended to work. The reason for that is detecting these errors happens
on the basis of timestamps - which obviously do not wrap around.


This made me want to try to reproduce the problem to at least some
degree. But I hit another wall: I can't make head or tails out of the
values in the xid_by_minute[] mapping.


I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
  entry 0 (ring 0): min 7: xid 582921233
  entry 1 (ring 1): min 8: xid 654154155
  entry 2 (ring 2): min 9: xid 661972949
  entry 3 (ring 3): min 10: xid 666899382
  entry 4 (ring 4): min 11: xid 644169619
  entry 5 (ring 5): min 12: xid 644169619
  entry 6 (ring 6): min 13: xid 644169619
  entry 7 (ring 7): min 14: xid 644169619

WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING:  old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
  entry 0 (ring 0): min 7: xid 582921233
  entry 1 (ring 1): min 8: xid 654154155
  entry 2 (ring 2): min 9: xid 661972949
  entry 3 (ring 3): min 10: xid 666899382
  entry 4 (ring 4): min 11: xid 666899382
  entry 5 (ring 5): min 12: xid 644169619
  entry 6 (ring 6): min 13: xid 644169619
  entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids being constantly used. Also, despite the ringbuffer
supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
always add), and the workload having run for 14min, we only have 8
entries.

Then a bit later we get:

WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
  entry 0 (ring 0): min 7: xid 582921233
  entry 1 (ring 1): min 8: xid 654154155
  entry 2 (ring 2): min 9: xid 661972949
  entry 3 (ring 3): min 10: xid 666899382
  entry 4 (ring 4): min 11: xid 666899382
  entry 5 (ring 5): min 12: xid 666899382
  entry 6 (ring 6): min 13: xid 666899382
  entry 7 (ring 7): min 14: xid 666899382

WARNING:  head 420 s: filling 8 buckets starting at 0 for sec 900 with xmin 666899382
WARNING:  old snapshot mapping at "after update" with head ts: 15, current entries: 15 max entries: 15, offset: 1
  entry 0 (ring 1): min 15: xid 654154155
  entry 1 (ring 2): min 16: xid 661972949
  entry 2 (ring 3): min 17: xid 666899382
  entry 3 (ring 4): min 18: xid 666899382
  entry 4 (ring 5): min 19: xid 666899382
  entry 5 (ring 6): min 20: xid 666899382
  entry 6 (ring 7): min 21: xid 666899382
  entry 7 (ring 8): min 22: xid 666899382
  entry 8 (ring 9): min 23: xid 666899382
  entry 9 (ring 10): min 24: xid 666899382
  entry 10 (ring 11): min 25: xid 666899382
  entry 11 (ring 12): min 26: xid 666899382
  entry 12 (ring 13): min 27: xid 666899382
  entry 13 (ring 14): min 28: xid 666899382
  entry 14 (ring 0): min 29: xid 666899382


At a later point we then enter the "Advance is so far that all old data
is junk; start over." branch, and just reset the whole mapping:
  entry 0 (ring 0): min 30: xid 866711525


The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
oldest value in the ring. But we update it unconditionally in the "need
a new bucket, but it might not be the very next one" branch of
MaintainOldSnapshotTimeMapping().

While there's not really a clear-cut comment explaining whether
head_timestamp() is intended to be the oldest or the newest timestamp,
it seems to me that the rest of the code treats it as the "oldest"
timestamp.

TransactionId
TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
                                    Relation relation)
...
        ts = AlignTimestampToMinuteBoundary(ts)
            - (old_snapshot_threshold * USECS_PER_MINUTE);
...
                LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);

                if (oldSnapshotControl->count_used > 0
                    && ts >= oldSnapshotControl->head_timestamp)
                {
                    int         offset;

                    offset = ((ts - oldSnapshotControl->head_timestamp)
                              / USECS_PER_MINUTE);
                    if (offset > oldSnapshotControl->count_used - 1)
                        offset = oldSnapshotControl->count_used - 1;
                    offset = (oldSnapshotControl->head_offset + offset)
                        % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
                    xlimit = oldSnapshotControl->xid_by_minute[offset];

                    if (NormalTransactionIdFollows(xlimit, recentXmin))
                        SetOldSnapshotThresholdTimestamp(ts, xlimit);
                }

                LWLockRelease(OldSnapshotTimeMapLock);

So we wind ts back by old_snapshot_threshold minutes. Then check that
that's still newer than oldSnapshotControl->head_timestamp - which
clearly can't be the case if it were the newest ts. And as far as I can
tell the indexing code also only makes sense if head_timestamp is the
oldest timestamp.


This would mean that most cases the old_snapshot_threshold feature is
active it would cause corruption: We'd not trigger errors on access,
because the timestamp set with SetOldSnapshotThresholdTimestamp() would
not actually match the xids used to limit.  But:

It turns out to be somewhat hard to get
TransactionIdLimitedForOldSnapshots() to actually do something. Because
oldSnapshotControl->head_timestamp is updated much more often than it
should, the ts >= oldSnapshotControl->head_timestamp condition will
often prevent the limiting code from being hit.

But it's not very unlikely either. Due to the update of head_timestamp
to the current timestamp, we'll enter the "existing mapping; advance xid
if possible" branch for up to OLD_SNAPSHOT_TIME_MAP_ENTRIES times. Which
means we can hit it for
/*
 * The structure used to map times to TransactionId values for the "snapshot
 * too old" feature must have a few entries at the tail to hold old values;
 * otherwise the lookup will often fail and the expected early pruning or
 * vacuum will not usually occur.  It is best if this padding is for a number
 * of minutes greater than a thread would normally be stalled, but it's OK if
 * early vacuum opportunities are occasionally missed, so there's no need to
 * use an extreme value or get too fancy.  10 minutes seems plenty.
 */
#define OLD_SNAPSHOT_PADDING_ENTRIES 10
#define OLD_SNAPSHOT_TIME_MAP_ENTRIES (old_snapshot_threshold + OLD_SNAPSHOT_PADDING_ENTRIES)

10 minutes, I think. There's some other ways too, but they're much less
likely.

Note that once the issue has been hit once, future
SetOldSnapshotThresholdTimestamp() calls that don't hit those 10 minutes
will also return a corrupted horizon, because
oldSnapshotControl->threshold_xid will have the wrong value, which then
will be used:
        /*
         * Failsafe protection against vacuuming work of active transaction.
         *
         * This is not an assertion because we avoid the spinlock for
         * performance, leaving open the possibility that xlimit could advance
         * and be more current; but it seems prudent to apply this limit.  It
         * might make pruning a tiny bit less aggressive than it could be, but
         * protects against data loss bugs.
         */
        if (TransactionIdIsNormal(latest_xmin)
            && TransactionIdPrecedes(latest_xmin, xlimit))
            xlimit = latest_xmin;

        if (NormalTransactionIdFollows(xlimit, recentXmin))
            return xlimit;



As far as I can tell, this code has been wrong since the feature has
been committed.  The tests don't show a problem, because none of this
code is reached when old_snapshot_threshold = 0 (which has no real world
use, it's purely for testing).


I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there's a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.


Before figuring out the above, I spent the last several days trying to
make this feature work with my snapshot scalability patch. Trying to
avoid regressing old_snapshot_threshold behaviour / performance. But not
it seems to me that there's no actual working feature that can be
preserved.


I am really tired.


Andres.

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

01 апреля 2020 г., 17:01:07

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
> The problem is that there's no protection again the xids in the
> ringbuffer getting old enough to wrap around. Given that practical uses
> of old_snapshot_threshold are likely to be several hours to several
> days, that's not particularly hard to hit.

Presumably this could be fixed by changing it to use FullTransactionId.

> The problem, as far as I can tell, is that
> oldSnapshotControl->head_timestamp appears to be intended to be the
> oldest value in the ring. But we update it unconditionally in the "need
> a new bucket, but it might not be the very next one" branch of
> MaintainOldSnapshotTimeMapping().

I agree, that doesn't look right. It's correct, I think, for the "if
(advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
"else" case. In the "else" case, it should advance by 1 (wrapping if
needed) each time we take the "if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
the "else" branch for that if statement.

> As far as I can tell, this code has been wrong since the feature has
> been committed.  The tests don't show a problem, because none of this
> code is reached when old_snapshot_threshold = 0 (which has no real world
> use, it's purely for testing).

I'm pretty sure I complained about the fact that only the
old_snapshot_threshold = 0 case was tested at the time this went in,
but I don't think Kevin was too convinced that we needed to do
anything else, and realistically, if he'd tried for a regression test
that ran for 15 minutes, Tom would've gone ballistic.

> I really don't know what to do here. The feature never worked and will
> silently cause wrong query results. Fixing it seems like a pretty large
> task - there's a lot more bugs. But ripping out a feature in stable
> branches is pretty bad too.

I don't know what other bugs there are, but the two you mention above
look fixable. Even if we decide that the feature can't be salvaged, I
would vote against ripping it out in back branches. I would instead
argue for telling people not to use it and ripping it out in master.
However, much as I'm not in love with all of the complexity this
feature adds, I don't see the problems you've reported here as serious
enough to justify ripping it out.

What exactly is the interaction of this patch with your snapshot
scalability work?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 17:01:28

Hi,

On 2020-03-31 23:40:08 -0700, Andres Freund wrote:
> I added some debug output to print the mapping before/after changes by
> MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> to the server start in minutes/seconds to make it easier to interpret).

Now attached.

Greetings,

Andres Freund

Вложения

debug.diff

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 18:09:14

Hi,

On 2020-04-01 10:01:07 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
> > The problem is that there's no protection again the xids in the
> > ringbuffer getting old enough to wrap around. Given that practical uses
> > of old_snapshot_threshold are likely to be several hours to several
> > days, that's not particularly hard to hit.
> 
> Presumably this could be fixed by changing it to use FullTransactionId.

That doesn't exist in all the back branches. Think it'd be easier to add
code to explicitly prune it during MaintainOldSnapshotTimeMapping().

> > The problem, as far as I can tell, is that
> > oldSnapshotControl->head_timestamp appears to be intended to be the
> > oldest value in the ring. But we update it unconditionally in the "need
> > a new bucket, but it might not be the very next one" branch of
> > MaintainOldSnapshotTimeMapping().
> 
> I agree, that doesn't look right. It's correct, I think, for the "if
> (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)" case, but not in the
> "else" case. In the "else" case, it should advance by 1 (wrapping if
> needed) each time we take the "if (oldSnapshotControl->count_used ==
> OLD_SNAPSHOT_TIME_MAP_ENTRIES)" branch, and should remain unchanged in
> the "else" branch for that if statement.

Yea.

> > As far as I can tell, this code has been wrong since the feature has
> > been committed.  The tests don't show a problem, because none of this
> > code is reached when old_snapshot_threshold = 0 (which has no real world
> > use, it's purely for testing).
> 
> I'm pretty sure I complained about the fact that only the
> old_snapshot_threshold = 0 case was tested at the time this went in,
> but I don't think Kevin was too convinced that we needed to do
> anything else, and realistically, if he'd tried for a regression test
> that ran for 15 minutes, Tom would've gone ballistic.

I think it's not just Tom that'd have gone ballistic. I think it's the
reason why, as I think is pretty clear, the feature was *never* actually
tested. The results of whats being removed are not quite random, but
it's not far from it. And there's long stretches of time where it never
removes things.

It's also a completely self-made problem.

There's really no reason at all to have bins of one minute. As it's a
PGC_POSTMASTER GUC, it should just have didided time into bins of
(old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
of a week there's no need to keep 10k bins, and the minimum threshold of
1 minute obviously is problematic.

> > I really don't know what to do here. The feature never worked and will
> > silently cause wrong query results. Fixing it seems like a pretty large
> > task - there's a lot more bugs. But ripping out a feature in stable
> > branches is pretty bad too.
> 
> I don't know what other bugs there are, but the two you mention above
> look fixable.

They probably are fixable. But there's a lot more, I think:

Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
update_ts threshold actually needs to be ts >= update_ts, right now we
don't handle being newer than the newest bin correctly afaict (mitigated
by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
to say, because there's no comments.

The whole lock nesting is very hazardous. Most (all?)
TestForOldSnapshot() calls happen with locks on on buffers held, and can
acquire lwlocks itself. In some older branches we do entire *catalog
searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
basis to detect conflicts seems dangerous to me. Isn't that ignoring
inserts that are already in progress?

> Even if we decide that the feature can't be salvaged, I would vote
> against ripping it out in back branches. I would instead argue for
> telling people not to use it and ripping it out in master.

It currently silently causes wrong query results. There's no
infrastructure to detect / protect against that (*).

I'm sure we can fix individual instances of problems. But I don't know
how one is supposed to verify that the fixes actually work. There's
currently no tests for the actual feature. And manual tests are painful
due to the multi-minute thresholds needed, and it's really hard to
manually verify that only the right rows are removed due to the feature,
and that all necessary errors are thrown. Given e.g. the bugs in my
email upthread, there's periods of several minutes where we'd not see
any row removed and then periods where the wrong ones would be removed,
so the manual tests would have to be repeated numerous times to actually
ensure anything.

If somebody wants to step up to the plate and fix these, it'd perhaps be
more realistic to say that we'll keep the feature. But even if somebody
does, I think it'd require a lot of development in the back branches. On
a feature whose purpose is to eat data that is still required.

I think even if we decide that we do not want to rip the feature out, we
should seriously consider hard disabling it in the backbranches. At
least I don't see how the fixed code is tested enough to be entrusted
with users data.

Do we actually have any evidence of this feature ever beeing used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem). Given that it currently will switch between not
preventing bloat and causing wrong query results, without that being
noticed...

(*) perhaps I just am not understanding the protection however. To me
it's not at all clear what:

        /*
         * Failsafe protection against vacuuming work of active transaction.
         *
         * This is not an assertion because we avoid the spinlock for
         * performance, leaving open the possibility that xlimit could advance
         * and be more current; but it seems prudent to apply this limit.  It
         * might make pruning a tiny bit less aggressive than it could be, but
         * protects against data loss bugs.
         */
        if (TransactionIdIsNormal(latest_xmin)
            && TransactionIdPrecedes(latest_xmin, xlimit))
            xlimit = latest_xmin;

        if (NormalTransactionIdFollows(xlimit, recentXmin))
            return xlimit;

actually provides in the way of a protection.

> However,
> much as I'm not in love with all of the complexity this feature adds,
> I don't see the problems you've reported here as serious enough to
> justify ripping it out.
> 
> What exactly is the interaction of this patch with your snapshot
> scalability work?

Post my work there's no precise RecentOldestXmin anymore (since
accessing the frequently changing xmin of other backends is what causes
a good chunk of the scalability issues). But heap_page_prune_opt() has
to determine what to use as the threshold for being able to prune dead
rows. Without snapshot_too_old we can initially rely on the known
boundaries to determine whether we can prune, and only determine an
"accurate" boundary when encountering a prune xid (or a tuple, but
that's an optimization) that falls in the range where we don't know for
certain we can prune.  But that's not easy to do with the way the
old_snapshot_threshold stuff currently works.

It's not too hard to implement a crude version that just determines an
accurate xmin horizon whenever pruning with old_snapshot_threshold
set. But that seems like gimping performance for old_snapshot_threshold,
which didn't seem nice.

Additionally, the current implementation of snapshot_too_old is pretty
terrible about causing unnecessary conflicts when hot pruning. Even if
there was no need at all for the horizon to be limited to be able to
prune the page, or if there was nothing to prune on the page (note that
the limiting happens before checking if the space on the page even makes
pruning useful), we still cause a conflict for future accesses, because
TransactionIdLimitedForOldSnapshots() will
SetOldSnapshotThresholdTimestamp() to a recent timestamp.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

01 апреля 2020 г., 18:15:14

On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
> I added some debug output to print the mapping before/after changes by
> MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> to the server start in minutes/seconds to make it easier to interpret).
>
> And the output turns out to be something like:
>
> WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
>   entry 0 (ring 0): min 7: xid 582921233
>   entry 1 (ring 1): min 8: xid 654154155
>   entry 2 (ring 2): min 9: xid 661972949
>   entry 3 (ring 3): min 10: xid 666899382
>   entry 4 (ring 4): min 11: xid 644169619
>   entry 5 (ring 5): min 12: xid 644169619
>   entry 6 (ring 6): min 13: xid 644169619
>   entry 7 (ring 7): min 14: xid 644169619
>
> WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382
>
> WARNING:  old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
>   entry 0 (ring 0): min 7: xid 582921233
>   entry 1 (ring 1): min 8: xid 654154155
>   entry 2 (ring 2): min 9: xid 661972949
>   entry 3 (ring 3): min 10: xid 666899382
>   entry 4 (ring 4): min 11: xid 666899382
>   entry 5 (ring 5): min 12: xid 644169619
>   entry 6 (ring 6): min 13: xid 644169619
>   entry 7 (ring 7): min 14: xid 644169619
>
> It's pretty obvious that the xids don't make a ton of sense, I think:
> They're not monotonically ordered. The same values exist multiple times,
> despite xids being constantly used. Also, despite the ringbuffer
> supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
> always add), and the workload having run for 14min, we only have 8
> entries.

The function header comment for MaintainOldSnapshotTimeMapping could
hardly be more vague, as it's little more than a restatement of the
function name. However, it looks to me like the idea is that this
function might get called multiple times for the same or similar
values of whenTaken. I suppose that's the point of this code:

    else if (ts <= (oldSnapshotControl->head_timestamp +
                    ((oldSnapshotControl->count_used - 1)
                     * USECS_PER_MINUTE)))
    {
        /* existing mapping; advance xid if possible */
        int         bucket = (oldSnapshotControl->head_offset
                              + ((ts - oldSnapshotControl->head_timestamp)
                                 / USECS_PER_MINUTE))
        % OLD_SNAPSHOT_TIME_MAP_ENTRIES;

        if (TransactionIdPrecedes(oldSnapshotControl->xid_by_minute[bucket],
xmin))
            oldSnapshotControl->xid_by_minute[bucket] = xmin;
    }

What I interpret this to be doing is saying - if we got a new call to
this function with a rounded-to-the-minute timestamp that we've seen
previously and for which we still have an entry, and if the XID passed
to this function is newer than the one passed by the previous call,
then advance the xid_by_minute[] bucket to the newer value. Now that
begs the question - what does this XID actually represent? The
comments don't seem to answer that question, not even the comments for
OldSnapshotControlData, which say that we should "Keep one xid per
minute for old snapshot error handling." but don't say which XIDs we
should keep or how they'll be used. However, the only call to
MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
that we call this function each time a new snapshot is taken and pass
the current time (modulo some fiddling) and snapshot xmin. Given that,
one would expect that any updates to the map would be tight races,
i.e. a bunch of processes that all took their snapshots right around
the same time would all update the same map entry in quick succession,
with the newest value winning.

And that make the debugging output which I quoted from your message
above really confusing. At this point, the "head timestamp" is 7
minutes after this facility started up. The first we entry we have is
for minute 7, and the last is for minute 14. But the one we're
updating is for minute 11. How the heck can that happen? I might
suspect that you'd stopped a process inside GetSnapshotData() with a
debugger, but that can't explain it either, because GetSnapshotData()
gets the xmin first and only afterwards gets the timestamp - so if
you'd stopped for ~3 minutes it just before the call to
MaintainOldSnapshotTimeMapping(), it would've been updating the map
with an *old* XID. In reality, though, it changed the XID from
644169619 to 666899382, advancing over 22 million XIDs. I don't
understand what's going on there. How is this function getting called
with a 4-minute old value of whenTaken?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 18:39:47

Hi,

On 2020-04-01 11:15:14 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 2:40 AM Andres Freund <andres@anarazel.de> wrote:
> > I added some debug output to print the mapping before/after changes by
> > MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
> > to the server start in minutes/seconds to make it easier to interpret).
> >
> > And the output turns out to be something like:
> >
> > WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
> >   entry 0 (ring 0): min 7: xid 582921233
> >   entry 1 (ring 1): min 8: xid 654154155
> >   entry 2 (ring 2): min 9: xid 661972949
> >   entry 3 (ring 3): min 10: xid 666899382
> >   entry 4 (ring 4): min 11: xid 644169619
> >   entry 5 (ring 5): min 12: xid 644169619
> >   entry 6 (ring 6): min 13: xid 644169619
> >   entry 7 (ring 7): min 14: xid 644169619
> >
> > WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382
> >
> > WARNING:  old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
> >   entry 0 (ring 0): min 7: xid 582921233
> >   entry 1 (ring 1): min 8: xid 654154155
> >   entry 2 (ring 2): min 9: xid 661972949
> >   entry 3 (ring 3): min 10: xid 666899382
> >   entry 4 (ring 4): min 11: xid 666899382
> >   entry 5 (ring 5): min 12: xid 644169619
> >   entry 6 (ring 6): min 13: xid 644169619
> >   entry 7 (ring 7): min 14: xid 644169619
> >
> > It's pretty obvious that the xids don't make a ton of sense, I think:
> > They're not monotonically ordered. The same values exist multiple times,
> > despite xids being constantly used. Also, despite the ringbuffer
> > supposedly having 15 entries (that's snapshot_too_old = 5min + the 10 we
> > always add), and the workload having run for 14min, we only have 8
> > entries.
> 
> The function header comment for MaintainOldSnapshotTimeMapping could
> hardly be more vague, as it's little more than a restatement of the
> function name. However, it looks to me like the idea is that this
> function might get called multiple times for the same or similar
> values of whenTaken. I suppose that's the point of this code:

Right. We enforce whenTaken to be monotonic
(cf. GetSnapshotCurrentTimestamp()), but since
GetSnapshotCurrentTimestamp() reduces the granularity of the timestamp
to one-minute (the AlignTimestampToMinuteBoundary() call), it's
obviously possible to end up in the same bin as a previous


> What I interpret this to be doing is saying - if we got a new call to
> this function with a rounded-to-the-minute timestamp that we've seen
> previously and for which we still have an entry, and if the XID passed
> to this function is newer than the one passed by the previous call,
> then advance the xid_by_minute[] bucket to the newer value. Now that
> begs the question - what does this XID actually represent? The
> comments don't seem to answer that question, not even the comments for
> OldSnapshotControlData, which say that we should "Keep one xid per
> minute for old snapshot error handling." but don't say which XIDs we
> should keep or how they'll be used. However, the only call to
> MaintainOldSnapshotTimeMapping() is in GetSnapshotData(). It appears
> that we call this function each time a new snapshot is taken and pass
> the current time (modulo some fiddling) and snapshot xmin. Given that,
> one would expect that any updates to the map would be tight races,
> i.e. a bunch of processes that all took their snapshots right around
> the same time would all update the same map entry in quick succession,
> with the newest value winning.

Right.


> And that make the debugging output which I quoted from your message
> above really confusing. At this point, the "head timestamp" is 7
> minutes after this facility started up. The first we entry we have is
> for minute 7, and the last is for minute 14. But the one we're
> updating is for minute 11. How the heck can that happen?

If I undestand what your reference correctly, I think that is because,
due to the bug, the "need a new bucket" branch doesn't just extend by
one bucket, it extends it by many in common cases. Basically filling
buckets "into the future".

the advance = ... variable in the branch will not always be 1, even when
we continually call Maintain*.  Here's some debug output showing that
(slightly modified from the patch I previously sent):

WARNING:  old snapshot mapping at "before update" with head ts: 1, current entries: 2 max entries: 15, offset: 0
  entry 0 (ring 0): min 1: xid 1089371384
  entry 1 (ring 1): min 2: xid 1099553206

WARNING:  head 1 min: filling 2 buckets starting at 0 for whenTaken 3 min, with xmin 1109840204

WARNING:  old snapshot mapping at "after update" with head ts: 3, current entries: 4 max entries: 15, offset: 0
  entry 0 (ring 0): min 3: xid 1089371384
  entry 1 (ring 1): min 4: xid 1099553206
  entry 2 (ring 2): min 5: xid 1109840204
  entry 3 (ring 3): min 6: xid 1109840204

Note how the two new buckets have the same xid, and how we're inserting
for "whenTaken 3 min", but we've filled the mapping up to minute 6.


I don't think the calculation of the 'advance' variable is correct as
is, even if we ignore the wrong setting of the head_timestamp variable.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

01 апреля 2020 г., 19:02:18

On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:
> That doesn't exist in all the back branches. Think it'd be easier to add
> code to explicitly prune it during MaintainOldSnapshotTimeMapping().

That's reasonable.

> There's really no reason at all to have bins of one minute. As it's a
> PGC_POSTMASTER GUC, it should just have didided time into bins of
> (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> of a week there's no need to keep 10k bins, and the minimum threshold of
> 1 minute obviously is problematic.

I am very doubtful that this approach would have been adequate. It
would mean that, with old_snapshot_threshold set to a week, the
threshold for declaring a snapshot "old" would jump forward 16.8 hours
at a time. It's hard for me to make a coherent argument right now as
to exactly what problems that would create, but it's not very
granular, and a lot of bloat-related things really benefit from more
granularity. I also don't really see what the problem with keeping a
bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
= ~10k buckets, isn't it? That's not really insane for an in-memory
data structure. I agree that the code that does that maintenance being
buggy is a problem to whatever extent that is the case, but writing
the code to have fewer buckets wouldn't necessarily have made it any
less buggy.

> They probably are fixable. But there's a lot more, I think:
>
> Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
> update_ts threshold actually needs to be ts >= update_ts, right now we
> don't handle being newer than the newest bin correctly afaict (mitigated
> by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
> to say, because there's no comments.

That test and the following one for "if (ts == update_ts)" both make
me nervous too.  If only two of <, >, and = are expected, there should
be an Assert() to that effect, at least. If all three values are
expected then we need an explanation of why we're only checking for
equality.

> The whole lock nesting is very hazardous. Most (all?)
> TestForOldSnapshot() calls happen with locks on on buffers held, and can
> acquire lwlocks itself. In some older branches we do entire *catalog
> searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

The catalog searches are clearly super-bad, but I'm not sure that the
other ones have a deadlock risk or anything. They might, but I think
we'd need some evidence of that.

> GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> basis to detect conflicts seems dangerous to me. Isn't that ignoring
> inserts that are already in progress?

How so?

> It currently silently causes wrong query results. There's no
> infrastructure to detect / protect against that (*).

Sure, and what if you break more stuff ripping it out? Ripping this
volume of code out in a supposedly-stable branch is totally insane
almost no matter how broken the feature is. I also think, and we've
had this disagreement before, that you're far too willing to say
"well, that's wrong so we need to hit it with a nuke." I complained
when you added those error checks to vacuum in back-branches, and
since that release went out people are regularly tripping those checks
and taking prolonged outages for a problem that wasn't making them
unhappy before. I know that in theory those people are better off
because their database was always corrupted and now they know. But for
some of them, those prolonged outages are worse than the problem they
had before. I believe it was irresponsible to decide on behalf of our
entire user base that they were better off with such a behavior change
in a supposedly-stable branch, and I believe the same thing here.

I have no objection to the idea that *if* the feature is hopelessly
broken, it should be removed. But I don't have confidence at this
point that you've established that, and I think ripping out thousands
of lines of codes in the back-branches is terrible. Even
hard-disabling the feature in the back-branches without actually
removing the code is an awfully strong reaction, but it could be
justified if we find out that things are actually super-bad and not
really fixable. Actually removing the code is unnecessary, protects
nobody, and has risk.

> Do we actually have any evidence of this feature ever beeing used? I
> didn't find much evidence for that in the archives (except Thomas
> finding a problem). Given that it currently will switch between not
> preventing bloat and causing wrong query results, without that being
> noticed...

I believe that at least one EnterpriseDB customer used it, and
possibly more than one. I am not sure how extensively, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

01 апреля 2020 г., 20:03:37

On Wed, Apr 1, 2020 at 9:02 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I complained
> when you added those error checks to vacuum in back-branches, and
> since that release went out people are regularly tripping those checks
> and taking prolonged outages for a problem that wasn't making them
> unhappy before. I know that in theory those people are better off
> because their database was always corrupted and now they know. But for
> some of them, those prolonged outages are worse than the problem they
> had before. I believe it was irresponsible to decide on behalf of our
> entire user base that they were better off with such a behavior change
> in a supposedly-stable branch, and I believe the same thing here.

I agreed with that decision, FWIW. Though I don't deny that there is
some merit in what you say. This is the kind of high level
philosophical question where large differences of opinion are quite
normal.

I don't think that it's fair to characterize Andres' actions in that
situation as in any way irresponsible. We had an extremely complicated
data corruption bug that he went to great lengths to fix, following
two other incorrect fixes. He was jet lagged from travelling to India
at the time. He went to huge lengths to make sure that the bug was
correctly squashed.

> Actually removing the code is unnecessary, protects
> nobody, and has risk.

Every possible approach has risk. We are deciding among several
unpleasant and risky alternatives here, no?

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

01 апреля 2020 г., 20:27:56

On Wed, Apr 1, 2020 at 1:03 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I don't think that it's fair to characterize Andres' actions in that
> situation as in any way irresponsible. We had an extremely complicated
> data corruption bug that he went to great lengths to fix, following
> two other incorrect fixes. He was jet lagged from travelling to India
> at the time. He went to huge lengths to make sure that the bug was
> correctly squashed.

I don't mean it as a personal attack on Andres, and I know and am glad
that he worked hard on the problem, but I don't agree that it was the
right decision. Perhaps "irresponsible" is the wrong word, but it's
certainly caused problems for multiple EnterpriseDB customers, and in
my view, those problems weren't necessary. Either a WARNING or an
ERROR would have shown up in the log, but an ERROR terminates VACUUM
for that table and thus basically causes autovacuum to be completely
broken. That is a really big problem. Perhaps you will want to argue,
as Andres did, that the value of having ERROR rather than WARNING in
the log justifies that outcome, but I sure don't agree.

> > Actually removing the code is unnecessary, protects
> > nobody, and has risk.
>
> Every possible approach has risk. We are deciding among several
> unpleasant and risky alternatives here, no?

Sure, but not all levels of risk are equal. Jumping out of a plane
carries some risk of death whether or not you have a parachute, but
that does not mean that we shouldn't worry about whether you have one
or not before you jump.

In this case, I think it is pretty clear that hard-disabling the
feature by always setting old_snapshot_threshold to -1 carries less
risk of breaking unrelated things than removing code that caters to
the feature all over the code base. Perhaps it is not quite as
dramatic as my parachute example, but I think it is pretty clear all
the same that one is a lot more likely to introduce new bugs than the
other. A carefully targeted modification of a few lines of code in 1
file just about has to carry less risk than ~1k lines of code spread
across 40 or so files.

However, I still think that without some more analysis, it's not clear
whether we should go this direction at all. Andres's results suggest
that there are some bugs here, but I think we need more senior hackers
to study the situation before we make a decision about what to do
about them. I certainly haven't had enough time to even fully
understand the problems yet, and nobody else has posted on that topic
at all. I have the highest respect for Andres and his technical
ability, and if he says this stuff has problems, I'm sure it does. Yet
I'm not willing to conclude that because he's tired and frustrated
with this stuff right now, it's unsalvageable. For the benefit of the
whole community, such a claim deserves scrutiny from multiple people.

Is there any chance that you're planning to look into the details?
That would certainly be welcome from my perspective.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 21:01:07

Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> On Wed, Apr 1, 2020 at 11:09 AM Andres Freund <andres@anarazel.de> wrote:
> > There's really no reason at all to have bins of one minute. As it's a
> > PGC_POSTMASTER GUC, it should just have didided time into bins of
> > (old_snapshot_threshold * USEC_PER_SEC) / 100 or such. For a threshold
> > of a week there's no need to keep 10k bins, and the minimum threshold of
> > 1 minute obviously is problematic.
>
> I am very doubtful that this approach would have been adequate. It
> would mean that, with old_snapshot_threshold set to a week, the
> threshold for declaring a snapshot "old" would jump forward 16.8 hours
> at a time. It's hard for me to make a coherent argument right now as
> to exactly what problems that would create, but it's not very
> granular, and a lot of bloat-related things really benefit from more
> granularity. I also don't really see what the problem with keeping a
> bucket per minute in memory is, even for a week. It's only 60 * 24 * 7
> = ~10k buckets, isn't it? That's not really insane for an in-memory
> data structure. I agree that the code that does that maintenance being
> buggy is a problem to whatever extent that is the case, but writing
> the code to have fewer buckets wouldn't necessarily have made it any
> less buggy.

My issue isn't really that it's too many buckets right now, but that it
doesn't scale down to smaller thresholds. I think to be able to develop
this reasonably, it'd need to be able to support thresholds in the
sub-second range. And I don't see how you can have the same binning for
such small thresholds, and for multi-day thresholds - we'd quickly go to
millions of buckets for longer thresholds.

I really think we'd need to support millisecond resolution to make this
properly testable.

> > GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> > basis to detect conflicts seems dangerous to me. Isn't that ignoring
> > inserts that are already in progress?

> How so?

Because it returns the end of the reserved WAL, not how far we've
actually inserted. I.e. there can be in-progress, but not finished,
modifications that will have an LSN < GetXLogInsertRecPtr(). But the
whenTaken timestamp could reflect one that should throw an error for
these in-progress modifications (since the transaction limiting happens
before the WAL logging).

I am not 100%, but I suspect that that could lead to errors not being
thrown that should, because TestForOldSnapshot() will not see these
in-progress modifications as conflicting.

Hm, also, shouldn't
        && PageGetLSN(page) > (snapshot)->lsn)
in TestForOldSnapshot() be an >=?

> > It currently silently causes wrong query results. There's no
> > infrastructure to detect / protect against that (*).
>
> Sure, and what if you break more stuff ripping it out? Ripping this
> volume of code out in a supposedly-stable branch is totally insane
> almost no matter how broken the feature is.

For the backbranches I was just thinking of forcing the GUC to be off
(either by disallowing it to be set to on, or just warning when its set
to true, but not propagating the value).

> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I would be a lot less inclined to go that way if old_snapshot_threshold

a) weren't explicitly about removing still-needed data - in contrast to
  a lot of other features, where the effects of bugs is temporary, here
  it can be much larger.
b) were a previously working feature, but as far as I can tell, it never really did
c) had tests that verify that my fixes actually do the right thing. As
  it stands, I'd not just have to fix the bugs, I'd also have to develop
  a test framework that can test this

While I wish I had been more forceful, and reviewed more of the code to
point out more of the quality issues, I did argue hard against the
feature going in. On account of it being architecturally bad and
impactful. Which I think it has proven to be several times over by
now. And now I'm kind of on the hook to fix it, it seems?

> I also think, and we've had this disagreement before, that you're far
> too willing to say "well, that's wrong so we need to hit it with a
> nuke." I complained when you added those error checks to vacuum in
> back-branches, and since that release went out people are regularly
> tripping those checks and taking prolonged outages for a problem that
> wasn't making them unhappy before. I know that in theory those people
> are better off because their database was always corrupted and now
> they know. But for some of them, those prolonged outages are worse
> than the problem they had before.

I think this is a somewhat revisionist. Sure, the errors were added
after like the 10th data corruption bug around freezing that we didn't
find for a long time, because of the lack of errors being thrown. But
the error checks weren't primarily added to find further bugs, but to
prevent data loss due to the fixed bug. Of which we had field reports.

I'd asked over *weeks* for reviews of the bug fixes. Not a single person
expressed concerns about throwing new errors at that time. First version
of the patches with the errors:
https://postgr.es/m/20171114030341.movhteyakqeqx5pm%40alap3.anarazel.de
I pushed them over a month later
https://postgr.es/m/20171215023059.oeyindn57oeis5um%40alap3.anarazel.de

There also wasn't (and isn't) a way to just report back that we can't
currently freeze the individual page, without doing major surgery. And
even if there were, what are supposed to do other than throw an error?
We need to remove tuples below relfrozenxid, or we corrupt the table.

As I've first asked before when you complained about those errors: What
was the alternative? Just have invisible tuples reappear? Delete them? I
don't think you've ever answered that.

You brought this up as an example for me being over-eager with errors
checks before. But I don't see how that meshes with the history visible
in the thread referenced above.

The more general issue, about throwing errors, is not just about the
people that don't give a hoot about whether their data evolves on its
own (perhaps a good tradeoff for them). Not throwing errors affects
*everyone*. Some people do care about their data. Without errors we
never figure out that we screwed up.  And long-term, even the people
that care much more about availability than data loss, benefit from the
whole system getting more robust.

We've since found numerous further data corrupting bugs because of the
relfrozenxid checks. Some of very long standing vintage. Some in newly
added code.

Yes, hypothetically, I'd argue for introducing the checks solely for the
sake of finding bugs. Even if I were prescient to forsee the number of
issues caused (although I'd add block numbers to the error message from
the get go, knowing that). But I'd definitely not do so in the back
branches.

> I believe it was irresponsible to decide on behalf of our entire user
> base that they were better off with such a behavior change in a
> supposedly-stable branch, and I believe the same thing here.

As I explained above, I don't think that's fair with regard to the
relfrozenxid errors. Setting that aside:

In these discussions you always seem to only argue for the people that
don't care about their data. But, uh, a lot of people do - should we
just silently eat their data? And the long-term quality of the project
gets a lot better by throwing errors, because it actually allows us to
fix them.

As far as I can tell we couldn't even have added the checks to master,
back then, if we follow your logic: A lot of the reports about hitting
the errors were with 11+ (partially due to pg_upgrade, partially because
they detected other bugs).

The likelihood of hurting people by adding checks at a later point would
be a lot lower, if we stopped adding code that ignores errors silently
and hoping for the best. But we keep adding such "lenient" code.

We just found another long-standing cause of data corrupting, that
should have been found earlier if we had errors, or at least warnings,
btw. The locking around vac_udpate_datfrozenxid() has been broken for a
long long time, but the silent 'if (bogus) return' made it very hard to
find.
https://www.postgresql.org/message-id/20200323235036.6pje6usrjjx22zv3%40alap3.anarazel.de

Also, I've recently seen a number of databases beeing eaten because we
just ignore our own WAL logging rules to avoid throwing hard enough
errors (RelationTruncate() WAL logging the truncation outside of a
critical section - oops if you hit it, your primary and replicas/backups
diverge, among many other bad consequences).

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

01 апреля 2020 г., 21:04:43

On Wed, Apr 1, 2020 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Sure, but not all levels of risk are equal. Jumping out of a plane
> carries some risk of death whether or not you have a parachute, but
> that does not mean that we shouldn't worry about whether you have one
> or not before you jump.
>
> In this case, I think it is pretty clear that hard-disabling the
> feature by always setting old_snapshot_threshold to -1 carries less
> risk of breaking unrelated things than removing code that caters to
> the feature all over the code base. Perhaps it is not quite as
> dramatic as my parachute example, but I think it is pretty clear all
> the same that one is a lot more likely to introduce new bugs than the
> other. A carefully targeted modification of a few lines of code in 1
> file just about has to carry less risk than ~1k lines of code spread
> across 40 or so files.

Yeah, that's certainly true. But is that fine point really what
anybody disagrees about? I didn't think that Andres was focussed on
literally ripping it out over just disabling it.

> Is there any chance that you're planning to look into the details?
> That would certainly be welcome from my perspective.

I had a few other things that I was going to work on this week, but
those seems less urgent. I'll take a look into it, and report back
what I find.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 21:18:55

Hi,

On 2020-04-01 11:04:43 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 10:28 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > Is there any chance that you're planning to look into the details?
> > That would certainly be welcome from my perspective.

+1

This definitely needs more eyes. I am not even close to understanding
the code fully.


> I had a few other things that I was going to work on this week, but
> those seems less urgent. I'll take a look into it, and report back
> what I find.

Thanks you!

I attached a slightly evolved version of my debugging patch.


Greetings,

Andres Freund

Вложения

v2-debug.diff

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 21:37:06

Hi,

On 2020-04-01 13:27:56 -0400, Robert Haas wrote:
> Perhaps "irresponsible" is the wrong word, but it's certainly caused
> problems for multiple EnterpriseDB customers, and in my view, those
> problems weren't necessary. Either a WARNING or an ERROR would have
> shown up in the log, but an ERROR terminates VACUUM for that table and
> thus basically causes autovacuum to be completely broken. That is a
> really big problem. Perhaps you will want to argue, as Andres did,
> that the value of having ERROR rather than WARNING in the log
> justifies that outcome, but I sure don't agree.

If that had been a really viable option, I would have done so. At the
very least in the back branches, but quite possibly also in master. Or
if somebody had brought them up as an issue at the time.

What is heap_prepare_freeze_tuple/FreezeMultiXactId supposed to do after
issuing a WARNING in these cases. Without the ERROR, e.g.,
            if (!TransactionIdDidCommit(xid))
                ereport(ERROR,
                        (errcode(ERRCODE_DATA_CORRUPTED),
                         errmsg_internal("uncommitted xmin %u from before xid cutoff %u needs to be frozen",
                                         xid, cutoff_xid)));
would make a deleted tuple visible.


        if (TransactionIdPrecedes(xid, relfrozenxid))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("found xmin %u from before relfrozenxid %u",
                                     xid, relfrozenxid)));
would go on replace xmin of a potentially uncommitted tuple with
relfrozenxid, making it appear visible.


        if (TransactionIdPrecedes(xid, relfrozenxid))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg_internal("found xmax %u from before relfrozenxid %u",
                                     xid, relfrozenxid)));
would replace the xmax indicating a potentially deleted tuple with ?, either
making the tuple become, potentially wrongly, visible/invisible

or
    else if (MultiXactIdPrecedes(multi, relminmxid))
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg_internal("found multixact %u from before relminmxid %u",
                                 multi, relminmxid)));
or ...


Just continuing is easier said than done. Especially with the background
of knowing that several users had hit the bug that allowed all of the
above to be hit, and that advancing relfrozenxid further would make it
worse.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

01 апреля 2020 г., 22:02:28

On Wed, Apr 1, 2020 at 2:37 PM Andres Freund <andres@anarazel.de> wrote:
> Just continuing is easier said than done. Especially with the background
> of knowing that several users had hit the bug that allowed all of the
> above to be hit, and that advancing relfrozenxid further would make it
> worse.

Fair point, but it seems we're arguing over nothing here, or at least
nothing relevant to this thread, because it sounds like if we are
going to disable that you're OK with doing that by just shutting it
off the code rather than trying to remove it all. I had the opposite
impression from your first email.

Sorry to have derailed the thread, and for my poor choice of words.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Kevin Grittner

Дата:

01 апреля 2020 г., 22:10:09

On Wed, Apr 1, 2020 at 10:09 AM Andres Freund <andres@anarazel.de> wrote:

First off, many thanks to Andres for investigating this, and apologies for the bugs. Also thanks to Michael for making sure I saw the thread. I must also apologize that for not being able to track the community lists consistently due to health issues that are exacerbated by stress, and the fact that these lists often push past my current limits. I'll try to help in this as best I can.

Do we actually have any evidence of this feature ever beeing used? I
didn't find much evidence for that in the archives (except Thomas
finding a problem).

This was added because a very large company trying to convert from Oracle had a test that started to show some slowdown on PostgreSQL after 8 hours, serious slowdown by 24 hours, and crashed hard before it could get to 48 hours -- due to lingering WITH HOLD cursors left by ODBC code. They had millions of lines of code that would need to be rewritten without this feature. With this feature (set to 20 minutes, if I recall correctly), their unmodified code ran successfully for at least three months solid without failure or corruption. Last I heard, they were converting a large number of instances from Oracle to PostgreSQL, and those would all fail hard within days of running with this feature removed or disabled.

Also, VMware is using PostgreSQL as an embedded part of many products, and this feature was enabled to deal with similar failures due to ODBC cursors; so the number of instances running 24/7 under high load which have shown a clear benefit from enabling this feature has a lot of zeros.

Perhaps the lack of evidence for usage in the archives indicates a low frequency of real-world failures due to the feature, rather than lack of use? I'm not doubting that Andres found real issues that should be fixed, but perhaps not very many people who are using the feature have more than two billion transactions within the time threshold, and perhaps the other problems are not as big as the problems solved by use of the feature -- at least in some cases.

To save readers who have not yet done the math some effort, at the 20 minute threshold used by the initial user, they would need to have a sustained rate of consumption of transaction IDs of over 66 million per second to experience wraparound problems, and at the longest threshold I have seen it would need to exceed an average of 461,893 TPS for three days solid to hit wraparound. Those aren't impossible rates to hit, but in practice it might not be a frequent occurrence yet on modern hardware with some real-world applications. Hopefully we can find a way to fix this before those rates become common.

I am reviewing the issue and patches now, and hope I can make some useful contribution to the discussion.

Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

01 апреля 2020 г., 22:42:59

Hi,

Nice to have you back for a bit! Even if the circumstances aren't
great...

It's very understandable that the lists are past your limits, I barely
keep up these days. Without any health issues.

On 2020-04-01 14:10:09 -0500, Kevin Grittner wrote:
> Perhaps the lack of evidence for usage in the archives indicates a low
> frequency of real-world failures due to the feature, rather than lack of
> use?  I'm not doubting that Andres found real issues that should be fixed,
> but perhaps not very many people who are using the feature have more than
> two billion transactions within the time threshold, and perhaps the other
> problems are not as big as the problems solved by use of the feature -- at
> least in some cases.

> To save readers who have not yet done the math some effort, at the 20
> minute threshold used by the initial user, they would need to have a
> sustained rate of consumption of transaction IDs of over 66 million per
> second to experience wraparound problems, and at the longest threshold I
> have seen it would need to exceed an average of 461,893 TPS for three days
> solid to hit wraparound.  Those aren't impossible rates to hit, but in
> practice it might not be a frequent occurrence yet on modern hardware with
> some real-world applications.  Hopefully we can find a way to fix this
> before those rates become common.

The wraparound issue on their own wouldn't be that bad - when I found it
I did play around with a few ideas for how to fix it. The most practical
would probably be to have MaintainOldSnapshotTimeMapping() scan all
buckets when a new oldSnapshotControl->oldest_xid is older than
RecentGlobalXmin. There's no benefit in the contents of those buckets
anyway, since we know that we can freeze those independent of
old_snapshot_threshold.

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error). And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
  long-running queries/transactions querying immutable or insert only
  data. Those would not notice that, on other tables, rows are getting
  removed, where access would not trigger the required error.

b) I observe long-ish phases were no cleanup is happening (due to
  oldSnapshotControl->head_timestamp getting updated more often than
  correct). But if old_snapshot_threshold is small enough in relation to
  the time the generated bloat becomes problematic, there will still be
  occasions to actually perform cleanup.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Kevin Grittner

Дата:

01 апреля 2020 г., 23:11:52

On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:

The thing that makes me really worried is that the contents of the time
mapping seem very wrong. I've reproduced query results in a REPEATABLE
READ transaction changing (pruned without triggering an error).

That is a very big problem. On the sort-of bright side (ironic in light of the fact that I'm a big proponent of using serializable transactions), none of the uses that I have personally seen of this feature use anything other than the default READ COMMITTED isolation level. That might help explain the lack of complaints for those using the feature. But yeah, I REALLY want to see a solid fix for that!

And I've
reproduced rows not getting removed for much longer than than they
should, according to old_snapshot_threshold.

I suspect one reason for users not noticing either is that

a) it's plausible that users of the feature would mostly have
long-running queries/transactions querying immutable or insert only
data. Those would not notice that, on other tables, rows are getting
removed, where access would not trigger the required error.

b) I observe long-ish phases were no cleanup is happening (due to
oldSnapshotControl->head_timestamp getting updated more often than
correct). But if old_snapshot_threshold is small enough in relation to
the time the generated bloat becomes problematic, there will still be
occasions to actually perform cleanup.

Keep in mind that the real goal of this feature is not to eagerly _see_ "snapshot too old" errors, but to prevent accidental debilitating bloat due to one misbehaving user connection. This is particularly easy to see (and therefore unnervingly common) for those using ODBC, which in my experience tends to correspond to the largest companies which are using PostgreSQL. In some cases, the snapshot which is preventing removal of the rows will never be used again; removal of the rows will not actually affect the result of any query, but only the size and performance of the database. This is a "soft limit" -- kinda like max_wal_size. Where there was a trade-off between accuracy of the limit and performance, the less accurate way was intentionally chosen. I apologize for not making that more clear in comments.

While occasional "snapshot too old" errors are an inconvenient side effect of achieving the primary goal, it might be of interest to know that the initial (very large corporate) user of this feature had, under Oracle, intentionally used a cursor that would be held open as long as a user chose to leave a list open for scrolling around. They used cursor features for as long as the cursor allowed. This could be left open for days or weeks (or longer?). Their query ordered by a unique index, and tracked the ends of the currently displayed portion of the list so that if they happened to hit the "snapshot too old" error they could deallocate and restart the cursor and reposition before moving forward or back to the newly requested rows. They were not willing to convert to PostgreSQL unless this approach continued to work.

In Summary:

(1) It's not urgent that rows always be removed as soon as possible after the threshold is crossed as long as they don't often linger too awfully far past that limit and allow debilitating bloat.

(2) It _is_ a problem if results inconsistent with the snapshot are returned -- a "snapshot too old" error is necessary.

(3) Obviously, wraparound problems need to be solved.

I hope this is helpful.

Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

01 апреля 2020 г., 23:24:54

On Wed, Apr 1, 2020 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:
> The thing that makes me really worried is that the contents of the time
> mapping seem very wrong. I've reproduced query results in a REPEATABLE
> READ transaction changing (pruned without triggering an error). And I've
> reproduced rows not getting removed for much longer than than they
> should, according to old_snapshot_threshold.

I think it would be a good idea to add a system view that shows the
contents of the mapping. We could make it a contrib module, if you
like, so that it can even be installed on back branches. We'd need to
move the structure definition from snapmgr.c to a header file, but
that doesn't seem like such a big deal.

Maybe that contrib module could even have some functions to simulate
aging without the passage of any real time. Like, say you have a
function or procedure old_snapshot_pretend_time_has_passed(integer),
and it moves oldSnapshotControl->head_timestamp backwards by that
amount. Maybe that would require updating some other fields in
oldSnapshotControl too but it doesn't seem like we'd need to do a
whole lot.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 00:11:11

Hi,

On 2020-04-01 15:11:52 -0500, Kevin Grittner wrote:
> On Wed, Apr 1, 2020 at 2:43 PM Andres Freund <andres@anarazel.de> wrote:
>
> > The thing that makes me really worried is that the contents of the time
> > mapping seem very wrong. I've reproduced query results in a REPEATABLE
> > READ transaction changing (pruned without triggering an error).
>
>
> That is a very big problem.  On the sort-of bright side (ironic in light of
> the fact that I'm a big proponent of using serializable transactions), none
> of the uses that I have personally seen of this feature use anything other
> than the default READ COMMITTED isolation level.  That might help explain
> the lack of complaints for those using the feature.  But yeah, I REALLY
> want to see a solid fix for that!

I don't think it's dependent on RR - it's just a bit easier to verify
that the query results are wrong that way.


> > And I've
> > reproduced rows not getting removed for much longer than than they
> > should, according to old_snapshot_threshold.
> >
> > I suspect one reason for users not noticing either is that
> >
> > a) it's plausible that users of the feature would mostly have
> >   long-running queries/transactions querying immutable or insert only
> >   data. Those would not notice that, on other tables, rows are getting
> >   removed, where access would not trigger the required error.
> >
> > b) I observe long-ish phases were no cleanup is happening (due to
> >   oldSnapshotControl->head_timestamp getting updated more often than
> >   correct). But if old_snapshot_threshold is small enough in relation to
> >   the time the generated bloat becomes problematic, there will still be
> >   occasions to actually perform cleanup.
> >
>
> Keep in mind that the real goal of this feature is not to eagerly _see_
> "snapshot too old" errors, but to prevent accidental debilitating bloat due
> to one misbehaving user connection.

I don't think it's an "intentional" inaccuracy issue leading to
this. The map contents are just wrong, in particular the head_timestamp
most of the time is so new that
TransactionIdLimitedForOldSnapshots(). When filling a new bucket,
MaintainOldSnapshotThreshold() unconditionally updates
oldSnapshotControl->head_timestamp to be the current minute, which means
it'll take old_snapshot_threshold minutes till
TransactionIdLimitedForOldSnapshots() even looks at the mapping again.

As far as I can tell, with a large old_snapshot_threshold, it can take a
very long time to get to a head_timestamp that's old enough for
TransactionIdLimitedForOldSnapshots() to do anything.  Look at this
trace of a pgbench run with old_snapshot_threshold enabled, showing some of
the debugging output added in the patch upthread.

This is with a threshold of 10min, in a freshly started database:
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING:  head 0 min: filling 1 buckets starting at 0 for
whenTaken1 min, with xmin 2068881994
 
> 2020-04-01 13:49:00.000 PDT [1268502][2/43571:2068881994] WARNING:  old snapshot mapping at "after update" with head
ts:1, current entries: 2 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 1: xid 2068447214
>       entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  old snapshot mapping at "before update" with head ts: 1,
currententries: 2 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 1: xid 2068447214
>       entry 1 (ring 1): min 2: xid 2068881994
>
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  head 1 min: updating existing bucket 1 for whenTaken 2
min,with xmin 2069199511
 
> 2020-04-01 13:50:00.000 PDT [1268505][5/122542:0] WARNING:  old snapshot mapping at "after update" with head ts: 1,
currententries: 2 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 1: xid 2068447214
>       entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  old snapshot mapping at "before update" with
headts: 1, current entries: 2 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 1: xid 2068447214
>       entry 1 (ring 1): min 2: xid 2069199511
>
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  head 1 min: filling 2 buckets starting at 0 for
whenTaken3 min, with xmin 2069516499
 
> 2020-04-01 13:51:00.000 PDT [1268502][2/202674:2069516501] WARNING:  old snapshot mapping at "after update" with head
ts:3, current entries: 4 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 3: xid 2068447214
>       entry 1 (ring 1): min 4: xid 2069199511
>       entry 2 (ring 2): min 5: xid 2069516499
>       entry 3 (ring 3): min 6: xid 2069516499
> ...
> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING:  old snapshot mapping at "before update" with
headts: 7, current entries: 8 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 7: xid 2068447214
>       entry 1 (ring 1): min 8: xid 2071112480
>       entry 2 (ring 2): min 9: xid 2071434473
>       entry 3 (ring 3): min 10: xid 2071755177
>       entry 4 (ring 4): min 11: xid 2072075827
>       entry 5 (ring 5): min 12: xid 2072395700
>       entry 6 (ring 6): min 13: xid 2072715464
>       entry 7 (ring 7): min 14: xid 2073035816

Before the mapping change the database had been running for 15
minutes. But the mapping starts only at 7 minutes from start. And then
is updated to

> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING:  head 7 min: filling 8 buckets starting at 0 for
whenTaken15 min, with xmin 2075918093
 
> 2020-04-01 14:03:00.000 PDT [1268504][4/1158832:2075918094] WARNING:  old snapshot mapping at "after update" with
headts: 15, current entries: 16 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 15: xid 2068447214
>       entry 1 (ring 1): min 16: xid 2071112480
>       entry 2 (ring 2): min 17: xid 2071434473
>       entry 3 (ring 3): min 18: xid 2071755177
>       entry 4 (ring 4): min 19: xid 2072075827
>       entry 5 (ring 5): min 20: xid 2072395700
>       entry 6 (ring 6): min 21: xid 2072715464
>       entry 7 (ring 7): min 22: xid 2073035816
>       entry 8 (ring 8): min 23: xid 2075918093
>       entry 9 (ring 9): min 24: xid 2075918093
>       entry 10 (ring 10): min 25: xid 2075918093
>       entry 11 (ring 11): min 26: xid 2075918093
>       entry 12 (ring 12): min 27: xid 2075918093
>       entry 13 (ring 13): min 28: xid 2075918093
>       entry 14 (ring 14): min 29: xid 2075918093
>       entry 15 (ring 15): min 30: xid 2075918093

be considered having started in that moment. And we expand the size of
the mapping by 8 at the same time, filling the new buckets with the same
xid. Despite there being a continuous workload.

After a few more minutes we get:
> 2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING:  head 15 min: updating existing bucket 4 for
whenTaken19 min, with xmin 2077202085
 
> 2020-04-01 14:07:00.000 PDT [1268503][3/1473617:2077202085] WARNING:  old snapshot mapping at "after update" with
headts: 15, current entries: 16 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 15: xid 2068447214
>       entry 1 (ring 1): min 16: xid 2076238895
>       entry 2 (ring 2): min 17: xid 2076559154
>       entry 3 (ring 3): min 18: xid 2076880731
>       entry 4 (ring 4): min 19: xid 2077202085
>       entry 5 (ring 5): min 20: xid 2072395700
>       entry 6 (ring 6): min 21: xid 2072715464
>       entry 7 (ring 7): min 22: xid 2073035816
>       entry 8 (ring 8): min 23: xid 2075918093
>       entry 9 (ring 9): min 24: xid 2075918093
>       entry 10 (ring 10): min 25: xid 2075918093
>       entry 11 (ring 11): min 26: xid 2075918093
>       entry 12 (ring 12): min 27: xid 2075918093
>       entry 13 (ring 13): min 28: xid 2075918093
>       entry 14 (ring 14): min 29: xid 2075918093
>       entry 15 (ring 15): min 30: xid 2075918093
>

Note how the xids are not monotonically ordered. And how IsLimited still
won't be able to make use of the mapping, as the head timestamp is only
4 minutes old (whenTaken == 19 min, head == 15min).

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

02 апреля 2020 г., 01:00:16

On Wed, Apr 1, 2020 at 1:25 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Maybe that contrib module could even have some functions to simulate
> aging without the passage of any real time. Like, say you have a
> function or procedure old_snapshot_pretend_time_has_passed(integer),
> and it moves oldSnapshotControl->head_timestamp backwards by that
> amount. Maybe that would require updating some other fields in
> oldSnapshotControl too but it doesn't seem like we'd need to do a
> whole lot.

I like that idea. I think that I've spotted what may be an independent
bug, but I have to wait around for a minute or two to reproduce it
each time. Makes it hard to get to a minimal test case.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

02 апреля 2020 г., 01:30:39

On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I like that idea. I think that I've spotted what may be an independent
> bug, but I have to wait around for a minute or two to reproduce it
> each time. Makes it hard to get to a minimal test case.

I now have simple steps to reproduce a bug when I start Postgres
master with "--old_snapshot_threshold=1"  (1 minute).

This example shows wrong answers to queries in session 2:

Session 1:

pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
CREATE TABLE
pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
CREATE INDEX
pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
generate_series(1, 500) i;
INSERT 0 500

Session 2 starts, and views the data in a serializable transaction:

pg@regression:5432 [1444124]=# begin isolation level serializable ;
BEGIN
pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
>= 0 order by col limit 14;
┌─────┐
│ col │
├─────┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─────┘
(14 rows)

So far so good. Now session 2 continues:

pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
DELETE 14

Session 1:

(repeats the same "select col from snapshot_bug where col >= 0 order
by col limit 14" query every 100 ms using psql's \watch 0.1)

Session 2:

pg@regression:5432 [1444078]=# vacuum snapshot_bug ;
VACUUM

Before too long, we see the following over in session 2 -- the answer
the query gives changes, even though this is a serializable
transaction:

Wed 01 Apr 2020 03:12:59 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│   1 │
│   2 │
│   3 │
│   4 │
│   5 │
│   6 │
│   7 │
│   8 │
│   9 │
│  10 │
│  11 │
│  12 │
│  13 │
│  14 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─────┘
(14 rows)

Wed 01 Apr 2020 03:13:00 PM PDT (every 0.1s)

┌─────┐
│ col │
├─────┤
│  15 │
│  16 │
│  17 │
│  18 │
│  19 │
│  20 │
│  21 │
│  22 │
│  23 │
│  24 │
│  25 │
│  26 │
│  27 │
│  28 │
└─────┘
(14 rows)

We continue to get this wrong answer for almost another minute (at
least on this occasion). Eventually we get "snapshot too old". Note
that the answer changes when we cross the "minute threshold"

Andres didn't explain anything to me that contributed to finding the
bug (though it could be a known bug, I don't think that it is). It
took me a surprisingly short amount of time to stumble upon this bug
-- I didn't find it because I have good intuitions about how to break
the feature.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 02:09:19

Hi,

On 2020-04-01 14:11:11 -0700, Andres Freund wrote:
> As far as I can tell, with a large old_snapshot_threshold, it can take a
> very long time to get to a head_timestamp that's old enough for
> TransactionIdLimitedForOldSnapshots() to do anything.  Look at this
> trace of a pgbench run with old_snapshot_threshold enabled, showing some of
> the debugging output added in the patch upthread.
>
> This is with a threshold of 10min, in a freshly started database:
> [...]

I took a lot longer till the queries started to be cancelled. The last
mapping update before that was:

> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  old snapshot mapping at "before update" with
headts: 31, current entries: 20 max entries: 20, offset: 12
 
>       entry 0 (ring 12): min 31: xid 2078468128
>       entry 1 (ring 13): min 32: xid 2078642746
>       entry 2 (ring 14): min 33: xid 2078672303
>       entry 3 (ring 15): min 34: xid 2078700941
>       entry 4 (ring 16): min 35: xid 2078728729
>       entry 5 (ring 17): min 36: xid 2078755425
>       entry 6 (ring 18): min 37: xid 2078781089
>       entry 7 (ring 19): min 38: xid 2078805567
>       entry 8 (ring 0): min 39: xid 2078830065
>       entry 9 (ring 1): min 40: xid 2078611853
>       entry 10 (ring 2): min 41: xid 2078611853
>       entry 11 (ring 3): min 42: xid 2078611853
>       entry 12 (ring 4): min 43: xid 2078611853
>       entry 13 (ring 5): min 44: xid 2078611853
>       entry 14 (ring 6): min 45: xid 2078611853
>       entry 15 (ring 7): min 46: xid 2078611853
>       entry 16 (ring 8): min 47: xid 2078611853
>       entry 17 (ring 9): min 48: xid 2078611853
>       entry 18 (ring 10): min 49: xid 2078611853
>       entry 19 (ring 11): min 50: xid 2078611853
>
> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  head 31 min: updating existing bucket 1 for
whenTaken40 min, with xmin 2078853870
 
> 2020-04-01 14:28:00.000 PDT [1268503][3/1894126:2078853871] WARNING:  old snapshot mapping at "after update" with
headts: 31, current entries: 20 max entries: 20, offset: 12
 
>       entry 0  (ring 12): min 31: xid 2078468128
>       entry 1  (ring 13): min 32: xid 2078642746
>       entry 2  (ring 14): min 33: xid 2078672303
>       entry 3  (ring 15): min 34: xid 2078700941
>       entry 4  (ring 16): min 35: xid 2078728729
>       entry 5  (ring 17): min 36: xid 2078755425
>       entry 6  (ring 18): min 37: xid 2078781089
>       entry 7  (ring 19): min 38: xid 2078805567
>       entry 8  (ring 0 ): min 39: xid 2078830065
>       entry 9  (ring 1 ): min 40: xid 2078853870
>       entry 10 (ring 2 ): min 41: xid 2078611853
>       entry 11 (ring 3 ): min 42: xid 2078611853
>       entry 12 (ring 4 ): min 43: xid 2078611853
>       entry 13 (ring 5 ): min 44: xid 2078611853
>       entry 14 (ring 6 ): min 45: xid 2078611853
>       entry 15 (ring 7 ): min 46: xid 2078611853
>       entry 16 (ring 8 ): min 47: xid 2078611853
>       entry 17 (ring 9 ): min 48: xid 2078611853
>       entry 18 (ring 10): min 49: xid 2078611853
>       entry 19 (ring 11): min 50: xid 2078611853


A query ran for fourty minutes during this, without getting cancelled.



A good while later this happens:
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  old snapshot mapping at "before update" with
headts: 82, current entries: 20 max entries: 20, offset: 12
 
>       entry 0 (ring 12): min 82: xid 2080333207
>       entry 1 (ring 13): min 83: xid 2080527298
>       entry 2 (ring 14): min 84: xid 2080566990
>       entry 3 (ring 15): min 85: xid 2080605960
>       entry 4 (ring 16): min 86: xid 2080644554
>       entry 5 (ring 17): min 87: xid 2080682957
>       entry 6 (ring 18): min 88: xid 2080721936
>       entry 7 (ring 19): min 89: xid 2080760947
>       entry 8 (ring 0): min 90: xid 2080799843
>       entry 9 (ring 1): min 91: xid 2080838696
>       entry 10 (ring 2): min 92: xid 2080877550
>       entry 11 (ring 3): min 93: xid 2080915870
>       entry 12 (ring 4): min 94: xid 2080954151
>       entry 13 (ring 5): min 95: xid 2080992556
>       entry 14 (ring 6): min 96: xid 2081030980
>       entry 15 (ring 7): min 97: xid 2081069403
>       entry 16 (ring 8): min 98: xid 2081107811
>       entry 17 (ring 9): min 99: xid 2081146322
>       entry 18 (ring 10): min 100: xid 2081185023
>       entry 19 (ring 11): min 101: xid 2081223632
>
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  head 82 min: filling 20 buckets starting at 12
forwhenTaken 102 min, with xmin 2081262046
 
> 2020-04-01 15:30:00.000 PDT [1268503][3/2518699:2081262046] WARNING:  old snapshot mapping at "after update" with
headts: 102, current entries: 1 max entries: 20, offset: 0
 
>       entry 0 (ring 0): min 102: xid 2081262046

The entire mapping reset, i.e. it'll take another fourty minutes for
cancellations to happen.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 02:59:51

Hi,

On 2020-04-01 15:30:39 -0700, Peter Geoghegan wrote:
> On Wed, Apr 1, 2020 at 3:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I like that idea. I think that I've spotted what may be an independent
> > bug, but I have to wait around for a minute or two to reproduce it
> > each time. Makes it hard to get to a minimal test case.
>
> I now have simple steps to reproduce a bug when I start Postgres
> master with "--old_snapshot_threshold=1"  (1 minute).

Thanks, that's super helpful.


> This example shows wrong answers to queries in session 2:
>
> Session 1:
>
> pg@regression:5432 [1444078]=# create table snapshot_bug (col int4);
> CREATE TABLE
> pg@regression:5432 [1444078]=# create index on snapshot_bug (col );
> CREATE INDEX
> pg@regression:5432 [1444078]=# insert into snapshot_bug select i from
> generate_series(1, 500) i;
> INSERT 0 500
>
> Session 2 starts, and views the data in a serializable transaction:
>
> pg@regression:5432 [1444124]=# begin isolation level serializable ;
> BEGIN
> pg@regression:5432 [1444124]=*# select col from snapshot_bug where col
> >= 0 order by col limit 14;
> ┌─────┐
> │ col │
> ├─────┤
> │   1 │
> │   2 │
> │   3 │
> │   4 │
> │   5 │
> │   6 │
> │   7 │
> │   8 │
> │   9 │
> │  10 │
> │  11 │
> │  12 │
> │  13 │
> │  14 │
> └─────┘
> (14 rows)
>
> So far so good. Now session 2 continues:

> pg@regression:5432 [1444078]=# delete from snapshot_bug where col < 15;
> DELETE 14

I got a bit confused here - you seemed to have switched session 1 and 2
around? Doesn't seem to matter much though, I was able to reproduce this.

This indeed seems a separate bug.

The backtrace to the point where the xmin horizon is affected by
TransactionIdLimitedForOldSnapshots() is:

#0  TransactionIdLimitedForOldSnapshots (recentXmin=2082816071, relation=0x7f52ff3b56f8) at
/home/andres/src/postgresql/src/backend/utils/time/snapmgr.c:1870
#1  0x00005567f4cd1a55 in heap_page_prune_opt (relation=0x7f52ff3b56f8, buffer=175) at
/home/andres/src/postgresql/src/backend/access/heap/pruneheap.c:106
#2  0x00005567f4cc70e2 in heapam_index_fetch_tuple (scan=0x5567f6db3028, tid=0x5567f6db2e40, snapshot=0x5567f6d67d68,
slot=0x5567f6db1b60,
    call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at
/home/andres/src/postgresql/src/backend/access/heap/heapam_handler.c:137
#3  0x00005567f4cdf5e6 in table_index_fetch_tuple (scan=0x5567f6db3028, tid=0x5567f6db2e40, snapshot=0x5567f6d67d68,
slot=0x5567f6db1b60,
    call_again=0x5567f6db2e46, all_dead=0x7ffce13d78de) at
/home/andres/src/postgresql/src/include/access/tableam.h:1020
#4  0x00005567f4ce0767 in index_fetch_heap (scan=0x5567f6db2de0, slot=0x5567f6db1b60) at
/home/andres/src/postgresql/src/backend/access/index/indexam.c:577
#5  0x00005567f4f19191 in IndexOnlyNext (node=0x5567f6db16a0) at
/home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:169
#6  0x00005567f4ef4bc4 in ExecScanFetch (node=0x5567f6db16a0, accessMtd=0x5567f4f18f20 <IndexOnlyNext>,
recheckMtd=0x5567f4f1951c<IndexOnlyRecheck>)
 
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:133
#7  0x00005567f4ef4c39 in ExecScan (node=0x5567f6db16a0, accessMtd=0x5567f4f18f20 <IndexOnlyNext>,
recheckMtd=0x5567f4f1951c<IndexOnlyRecheck>)
 
    at /home/andres/src/postgresql/src/backend/executor/execScan.c:182
#8  0x00005567f4f195d4 in ExecIndexOnlyScan (pstate=0x5567f6db16a0) at
/home/andres/src/postgresql/src/backend/executor/nodeIndexonlyscan.c:317
#9  0x00005567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db16a0) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#10 0x00005567f4f1d694 in ExecProcNode (node=0x5567f6db16a0) at
/home/andres/src/postgresql/src/include/executor/executor.h:245
#11 0x00005567f4f1d7d2 in ExecLimit (pstate=0x5567f6db14b8) at
/home/andres/src/postgresql/src/backend/executor/nodeLimit.c:95
#12 0x00005567f4ef0f71 in ExecProcNodeFirst (node=0x5567f6db14b8) at
/home/andres/src/postgresql/src/backend/executor/execProcnode.c:444
#13 0x00005567f4ee57c3 in ExecProcNode (node=0x5567f6db14b8) at
/home/andres/src/postgresql/src/include/executor/executor.h:245
#14 0x00005567f4ee83dd in ExecutePlan (estate=0x5567f6db1280, planstate=0x5567f6db14b8, use_parallel_mode=false,
operation=CMD_SELECT,sendTuples=true,
 
    numberTuples=0, direction=ForwardScanDirection, dest=0x5567f6db3c78, execute_once=true)
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:1646
#15 0x00005567f4ee5e23 in standard_ExecutorRun (queryDesc=0x5567f6d0c490, direction=ForwardScanDirection, count=0,
execute_once=true)
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:364
#16 0x00005567f4ee5c35 in ExecutorRun (queryDesc=0x5567f6d0c490, direction=ForwardScanDirection, count=0,
execute_once=true)
    at /home/andres/src/postgresql/src/backend/executor/execMain.c:308
#17 0x00005567f510c4de in PortalRunSelect (portal=0x5567f6d49260, forward=true, count=0, dest=0x5567f6db3c78)
    at /home/andres/src/postgresql/src/backend/tcop/pquery.c:912
#18 0x00005567f510c191 in PortalRun (portal=0x5567f6d49260, count=9223372036854775807, isTopLevel=true, run_once=true,
dest=0x5567f6db3c78,
    altdest=0x5567f6db3c78, qc=0x7ffce13d7de0) at /home/andres/src/postgresql/src/backend/tcop/pquery.c:756
#19 0x00005567f5106015 in exec_simple_query (query_string=0x5567f6cdd7a0 "select col from snapshot_bug where col >= 0
orderby col limit 14;")
 
    at /home/andres/src/postgresql/src/backend/tcop/postgres.c:1239

which in my tree is the elog() in the block below:
        if (!same_ts_as_threshold)
        {
            if (ts == update_ts)
            {
                PrintOldSnapshotMapping("non cached limit via update_ts", false);

                xlimit = latest_xmin;
                if (NormalTransactionIdFollows(xlimit, recentXmin))
                {
                    elog(LOG, "increasing threshold from %u to %u (via update_ts)",
                         recentXmin, xlimit);
                    SetOldSnapshotThresholdTimestamp(ts, xlimit);
                }
            }

the mapping at that point is:

2020-04-01 16:14:00.025 PDT [1272381][4/2:0] WARNING:  old snapshot mapping at "non cached limit via update_ts" with
headts: 1, current entries: 2 max entries: 11, offset: 0
 
      entry 0 (ring 0): min 1: xid 2082816067
      entry 1 (ring 1): min 2: xid 2082816071

and the xmin changed is:
2020-04-01 16:14:00.026 PDT [1272381][4/2:0] LOG:  increasing threshold from 2082816071 to 2082816072 (via update_ts)

in the frame of heap_prune_page_opt():
(rr) p snapshot->whenTaken
$5 = 639097973135655
(rr) p snapshot->lsn
$6 = 133951784192
(rr) p MyPgXact->xmin
$7 = 2082816071
(rr) p BufferGetBlockNumber(buffer)
$11 = 0

in the frame for TransactionIdLimitedForOldSnapshots:
(rr) p ts
$8 = 639098040000000
(rr) p latest_xmin
$9 = 2082816072
(rr) p update_ts
$10 = 639098040000000


The primary issue here is that there is no TestForOldSnapshot() in
heap_hot_search_buffer(). Therefore index fetches will never even try to
detect that tuples it needs actually have already been pruned away.


The wrong queries I saw took longer to reproduce, so I've not been able
to debug the precise reasons.


Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 03:17:21

Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

FWIW, with autovacuum=off the query does not get killed until a manual
vacuum, nor if fewer rows are deleted and the table has previously been
vacuumed.

The vacuum in the second session isn't required. There just needs to be
something consuming an xid, so that oldSnapshotControl->latest_xmin is
increased. A single SELECT txid_current(); or such in a separate session
is sufficient.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

02 апреля 2020 г., 03:26:04

On Wed, Apr 1, 2020 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:
> Thanks, that's super helpful.

Glad I could help.

> I got a bit confused here - you seemed to have switched session 1 and 2
> around? Doesn't seem to matter much though, I was able to reproduce this.

Yeah, I switched the session numbers because I was in a hurry. Sorry about that.

As you have already worked out, one session does all the DDL and
initial loading of data, while the other session queries the data
repeatedly in a serializable (or RR) xact. The latter session exhibits
the bug.

> This indeed seems a separate bug.

> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

I suspected that heap_hot_search_buffer() was missing something.

> The wrong queries I saw took longer to reproduce, so I've not been able
> to debug the precise reasons.

How hard would it be to write a debug patch that reduces the quantum
used in places like TransactionIdLimitedForOldSnapshots() to something
much less than the current 1 minute quantum? That made reproducing the
bug *very* tedious.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 03:54:06

Hi,

On 2020-04-01 16:59:51 -0700, Andres Freund wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

bitmap heap scan doesn't have the necessary checks either. In the
non-lossy case it uses heap_hot_search_buffer, for the lossy case it has
an open coded access without the check (that's bitgetpage() before v12,
and heapam_scan_bitmap_next_block() after that).

Nor do sample scans, but that was "at least" introduced later.

As far as I can tell there's not sufficient in-tree explanation of when
code needs to test for an old snapshot. There's just the following
comment above TestForOldSnapshot():
 * Check whether the given snapshot is too old to have safely read the given
 * page from the given table.  If so, throw a "snapshot too old" error.
 *
 * This test generally needs to be performed after every BufferGetPage() call
 * that is executed as part of a scan.  It is not needed for calls made for
 * modifying the page (for example, to position to the right place to insert a
 * new index tuple or for vacuuming).  It may also be omitted where calls to
 * lower-level functions will have already performed the test.

But I don't find "as part of a scan" very informative. I mean, it
was explicitly not called from with the executor back then (for a while
the check was embedded in BufferGetPage()):

static void
bitgetpage(HeapScanDesc scan, TBMIterateResult *tbmres)
...
        Page        dp = BufferGetPage(buffer, NULL, NULL, BGP_NO_SNAPSHOT_TEST);

I am more than a bit dumbfounded here.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

02 апреля 2020 г., 04:04:15

On Wed, Apr 1, 2020 at 5:54 PM Andres Freund <andres@anarazel.de> wrote:
> As far as I can tell there's not sufficient in-tree explanation of when
> code needs to test for an old snapshot. There's just the following
> comment above TestForOldSnapshot():
>  * Check whether the given snapshot is too old to have safely read the given
>  * page from the given table.  If so, throw a "snapshot too old" error.
>  *
>  * This test generally needs to be performed after every BufferGetPage() call
>  * that is executed as part of a scan.  It is not needed for calls made for
>  * modifying the page (for example, to position to the right place to insert a
>  * new index tuple or for vacuuming).  It may also be omitted where calls to
>  * lower-level functions will have already performed the test.
>
> But I don't find "as part of a scan" very informative.

I also find it strange that _bt_search() calls TestForOldSnapshot() on
every level on the tree (actually, it calls _bt_moveright() which
calls it on every level of the tree). At least with reads (see the
comments at the top of _bt_moveright()).

Why do we need to do the test on internal pages? We only ever call
PredicateLockPage() on a leaf nbtree page. Why the inconsistency
between the two similar-seeming cases?

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 04:46:46

Hi,

On 2020-04-01 17:54:06 -0700, Andres Freund wrote:
>  * Check whether the given snapshot is too old to have safely read the given
>  * page from the given table.  If so, throw a "snapshot too old" error.
>  *
>  * This test generally needs to be performed after every BufferGetPage() call
>  * that is executed as part of a scan.  It is not needed for calls made for
>  * modifying the page (for example, to position to the right place to insert a
>  * new index tuple or for vacuuming).  It may also be omitted where calls to
>  * lower-level functions will have already performed the test.

To me this sounds like we'd not need to check for an old snapshot in
heap_delete/update/lock_tuple. And they were explictly not testing for
old snapshots.  But I don't understand why that'd be correct?

In a lot of UPDATE/DELETE queries there's no danger that the target
tuple will be pruned away, because the underlying scan node will hold a
pin. But I don't think that's guaranteed. E.g. if a tidscan is below the
ModifyTable node, it will not hold a pin by the time we heap_update,
because there's no scan holding a pin, and the slot will have been
materialized before updating.

There are number of other ways, I think.

So it's possible to get to heap_update/delete (and probably lock_tuple
as well) with a tid that's already been pruned away. Neither contains a
non-assert check ensuring the tid still is normal.

With assertions we'd fail with an assertion in PageGetItem(). But
without it looks like we'll interpret the page header as a tuple. Which
can't be good.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Kevin Grittner

Дата:

02 апреля 2020 г., 19:05:14

On Wed, Apr 1, 2020 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:

index fetches will never even try to
detect that tuples it needs actually have already been pruned away.

I looked at this flavor of problem today and from what I saw:

(1) This has been a problem all the way back to 9.6.0.

(2) The behavior is correct if the index creation is skipped or if enable_indexscan is turned off in the transaction, confirming Andres' analysis.

(3) Pruning seems to happen as intended; the bug found by Peter seems to be entirely about failing to TestForOldSnapshot() where needed.

Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

Re: snapshot too old issues, first around wraparound and then more.

От

Kevin Grittner

Дата:

02 апреля 2020 г., 19:36:32

On Wed, Apr 1, 2020 at 7:17 PM Andres Freund <andres@anarazel.de> wrote:

FWIW, with autovacuum=off the query does not get killed until a manual
vacuum, nor if fewer rows are deleted and the table has previously been
vacuumed.

The vacuum in the second session isn't required. There just needs to be
something consuming an xid, so that oldSnapshotControl->latest_xmin is
increased. A single SELECT txid_current(); or such in a separate session
is sufficient.

Agreed. I don't see that part as a problem; if no xids are being consumed, it's hard to see how we could be heading into debilitating levels of bloat, so there is no need to perform the early pruning. It would not be worth consuming any cycles to ensure that pruning happens sooner than it does in this case. It's OK for it to happen any time past the moment that the snapshot hits the threshold, but it's also OK for it to wait until a vacuum of the table or until some activity consumes an xid.

Kevin Grittner
VMware vCenter Server
https://www.vmware.com/

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

02 апреля 2020 г., 19:40:15

Hi,

On April 2, 2020 9:36:32 AM PDT, Kevin Grittner <kgrittn@gmail.com> wrote:
>On Wed, Apr 1, 2020 at 7:17 PM Andres Freund <andres@anarazel.de>
>wrote:
>
>> FWIW, with autovacuum=off the query does not get killed until a
>manual
>> vacuum, nor if fewer rows are deleted and the table has previously
>been
>> vacuumed.
>>
>> The vacuum in the second session isn't required. There just needs to
>be
>> something consuming an xid, so that oldSnapshotControl->latest_xmin
>is
>> increased. A single SELECT txid_current(); or such in a separate
>session
>> is sufficient.
>>
>
>Agreed.  I don't see that part as a problem; if no xids are being
>consumed,
>it's hard to see how we could be heading into debilitating levels of
>bloat,
>so there is no need to perform the early pruning.  It would not be
>worth
>consuming any cycles to ensure that pruning happens sooner than it does
>in
>this case.  It's OK for it to happen any time past the moment that the
>snapshot hits the threshold, but it's also OK for it to wait until a
>vacuum
>of the table or until some activity consumes an xid.

The point about txid being sufficient was just about simplifying the reproducer for wrong query results.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

02 апреля 2020 г., 21:28:06

On Tue, Mar 31, 2020 at 11:40 PM Andres Freund <andres@anarazel.de> wrote:
> The problem, as far as I can tell, is that
> oldSnapshotControl->head_timestamp appears to be intended to be the
> oldest value in the ring. But we update it unconditionally in the "need
> a new bucket, but it might not be the very next one" branch of
> MaintainOldSnapshotTimeMapping().
>
> While there's not really a clear-cut comment explaining whether
> head_timestamp() is intended to be the oldest or the newest timestamp,
> it seems to me that the rest of the code treats it as the "oldest"
> timestamp.

At first, I was almost certain that it's supposed to be the oldest
based only on the OldSnapshotControlData struct fields themselves. It
seemed pretty unambiguous:

    int         head_offset;    /* subscript of oldest tracked time */
    TimestampTz head_timestamp; /* time corresponding to head xid */

(Another thing that supports this interpretation is the fact that
there is a separate current_timestamp latest timestamp field in
OldSnapshotControlData.)

But then I took another look at the "We need a new bucket, but it
might not be the very next one" branch. It does indeed seem to
directly contradict the OldSnapshotControlData comments/documentation.
Note just the code itself, either. Even comments from this "new
bucket" branch disagree with the OldSnapshotControlData comments:

                if (oldSnapshotControl->count_used ==
OLD_SNAPSHOT_TIME_MAP_ENTRIES)
                {
                    /* Map full and new value replaces old head. */
                    int         old_head = oldSnapshotControl->head_offset;

                    if (old_head == (OLD_SNAPSHOT_TIME_MAP_ENTRIES - 1))
                        oldSnapshotControl->head_offset = 0;
                    else
                        oldSnapshotControl->head_offset = old_head + 1;
                    oldSnapshotControl->xid_by_minute[old_head] = xmin;
                }

Here, the comment says the map (circular buffer) is full, and that we
must replace the current head with a *new* value/timestamp (the one we
just got in GetSnapshotData()). It looks as if the design of the data
structure changed during the development of the original patch, but
this entire branch was totally overlooked.

In conclusion, I share Andres' concerns here. There are glaring
problems with how we manipulate the data structure that controls the
effective horizon for pruning. Maybe they can be fixed while leaving
the code that manages the OldSnapshotControl circular buffer in
something resembling its current form, but I doubt it. In my opinion,
there is no approach to fixing "snapshot too old" that won't have some
serious downside.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

02 апреля 2020 г., 23:04:12

On Thu, Apr 2, 2020 at 11:28 AM Peter Geoghegan <pg@bowt.ie> wrote:
> In conclusion, I share Andres' concerns here. There are glaring
> problems with how we manipulate the data structure that controls the
> effective horizon for pruning. Maybe they can be fixed while leaving
> the code that manages the OldSnapshotControl circular buffer in
> something resembling its current form, but I doubt it. In my opinion,
> there is no approach to fixing "snapshot too old" that won't have some
> serious downside.

I'll add something that might be constructive: It would probably be a
good idea to introduce a function like syncrep.c's
SyncRepQueueIsOrderedByLSN() function, which is designed to be called
by assertions only. That would both clearly document and actually
verify the circular buffer/OldSnapshotControl data structure's
invariants.

--
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

03 апреля 2020 г., 03:12:35

Hi,

I just spend a good bit more time improving my snapshot patch, so it
could work well with a fixed version of the old_snapshot_threshold
feature. Mostly so there's no unnecessary dependency on the resolution
of the issues in that patch.


When testing my changes, for quite a while, I could not get
src/test/modules/snapshot_too_old/ to trigger a single too-old error.

It turns out, that's because there's not a single tuple removed due to
old_snapshot_threshold in src/test/modules/snapshot_too_old/. The only
reason the current code triggers any such errors is that

a) TransactionIdLimitedForOldSnapshots() is always called in
   heap_page_prune_opt(), even if the "not limited" horizon
   (i.e. RecentGlobalDataXmin) is more than old enough to allow for
   pruning. That includes pages without a pd_prune_xid.

b) TransactionIdLimitedForOldSnapshots(), in the old_snapshot_threshold
   == 0 branch, always calls
    SetOldSnapshotThresholdTimestamp(ts, xlimit)
   even if the horizon has not changed due to snapshot_too_old (xlimit
   is initially set tot the "input" horizon, and only increased if
   between (recentXmin, MyProc->xmin)).


To benefit from the snapshot scalability improvements in my patchset, it
is important to avoid unnecessarily determining the "accurate" xmin
horizon, if it's clear from the "lower boundary" horizon that pruning
can happen. Therefore I changed heap_page_prune_opt() and
heap_page_prune() to only limit when we couldn't prune.

In the course of that I separated getting the horizon from
TransactionIdLimitedForOldSnapshots() and triggering errors when an
already removed tuple would be needed via
TransactionIdLimitedForOldSnapshots().

Because there are no occasions to actually remove tuples in the entire
test, there now were no TransactionIdLimitedForOldSnapshots() calls. And
thus no errors.  My code turns out to actually work.


Thus, if I change the code in master from:
        TransactionId xlimit = recentXmin;
...
        if (old_snapshot_threshold == 0)
        {
            if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
                && TransactionIdFollows(latest_xmin, xlimit))
                xlimit = latest_xmin;

            ts -= 5 * USECS_PER_SEC;
            SetOldSnapshotThresholdTimestamp(ts, xlimit);

            return xlimit;
        }

to
...
        if (old_snapshot_threshold == 0)
        {
            if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
                && TransactionIdFollows(latest_xmin, xlimit))
            {
                xlimit = latest_xmin;
                SetOldSnapshotThresholdTimestamp(ts, xlimit);
            }

            ts -= 5 * USECS_PER_SEC;

            return xlimit;
        }

there's not a single error raised in the existing tests. Not a *single*
tuple removal is caused by old_snapshot_threshold. We just test the
order of SetOldSnapshotThresholdTimestamp() calls. We have code in the
backend to support testing old_snapshot_threshold, but we don't test
anything meaningful in the feature. We basically test a oddly behaving
version version of "transaction_timeout = 5s".  I can't emphasize enough
how baffling I find this.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

03 апреля 2020 г., 03:17:13

Hi,

On 2020-04-01 12:02:18 -0400, Robert Haas wrote:
> I have no objection to the idea that *if* the feature is hopelessly
> broken, it should be removed.

I don't think we have a real choice here at this point, at least for the
back branches.

Just about nothing around old_snapshot_threshold works correctly:

* There are basically no tests (see [1] I jsut sent, and also
  old_snapshot_threshold bypassing a lot of the relevant code).

* We don't detect errors after hot pruning (to allow that is a major
  point of the feature) when access is via any sort of index
  scans. Wrong query results.

* The time->xid mapping is is entirely broken. We don't prevent bloat
  for many multiples of old_snapshot_threshold (if above 1min).

  It's possible, but harder, to have this cause wrong query results.

* In read-mostly workloads it can trigger errors in sessions that are
  much younger than old_snapshot_threshold, if the transactionid is not
  advancing.

  I've not tried to reproduce, but I suspect this can also cause wrong
  query results. Because a later snapshot can have the same xmin as
  older transactions, it sure looks like we can end up with data from an
  older xmin getting removed, but the newer snapshot's whenTaken will
  prevent TestForOldSnapshot_impl() from raising an error.

* I am fairly sure that it can cause crashes (or even data corruption),
  because it assumes that DML never needs to check for old snapshots
  (with no meaningful justificiation). Leading to heap_update/delete to
  assume the page header is a tuple.

* There's obviously also the wraparound issue that made me start this
  thread initially.

Since this is a feature that can result in wrong query results (and
quite possibly crashes / data corruption), I don't think we can just
leave this unfixed.  But given the amount of code / infrastructure
changes required to get this into a working feature, I don't see how we
can unleash those changes onto the stable branches.

There's quite a few issues in here that require not just local bugfixes,
but some design changes too. And it's pretty clear that the feature
didn't go through enough review before getting committed. I see quite
some merit in removing the code in master, and having a potential
reimplementation go through a normal feature integration process.

I don't really know what to do here. Causing problems by neutering a
feature in the back branch *sucks*. While not quite as bad, removing a
feature without a replacement in a major release is pretty harsh
too. But I don't really see any other realistic path forward.

FWIW, I've now worked around the interdependency of s_t_o my snapshot
scalability patch (only took like 10 days). I have manually confirmed it
works with 0/1 minute thresholds.  I can make the tests pass unmodified
if I just add SetOldSnapshotThresholdTimestamp() calls when not
necessary (which obviously makes no sense).  Lead to some decent
improvements around pruning that are independent of s_t_o (with more
possibilities "opened").  But I still think we need to do something
here.

Greetings,

Andres Freund

[1] https://postgr.es/m/20200403001235.e6jfdll3gh2ygbuc%40alap3.anarazel.de

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

03 апреля 2020 г., 04:21:31

On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:
> Since this is a feature that can result in wrong query results (and
> quite possibly crashes / data corruption), I don't think we can just
> leave this unfixed.  But given the amount of code / infrastructure
> changes required to get this into a working feature, I don't see how we
> can unleash those changes onto the stable branches.

I don't think that the feature can be allowed to remain in anything
like its current form. The current design is fundamentally unsound.

> I don't really know what to do here. Causing problems by neutering a
> feature in the back branch *sucks*. While not quite as bad, removing a
> feature without a replacement in a major release is pretty harsh
> too. But I don't really see any other realistic path forward.

I have an idea that might allow us to insulate some users from the
problem caused by a full revert (or disabling the feature) in the
backbranches. I wouldn't usually make such a radical suggestion, but
the current situation is exceptional. Anything that avoids serious
pain for users deserves to be considered.

Kevin said this about the feature very recently:

"""
Keep in mind that the real goal of this feature is not to eagerly
_see_ "snapshot too old" errors, but to prevent accidental
debilitating bloat due to one misbehaving user connection.  This is
particularly easy to see (and therefore unnervingly common) for those
using ODBC, which in my experience tends to correspond to the largest
companies which are using PostgreSQL.  In some cases, the snapshot
which is preventing removal of the rows will never be used again;
removal of the rows will not actually affect the result of any query,
but only the size and performance of the database.  This is a "soft
limit" -- kinda like max_wal_size.  Where there was a trade-off
between accuracy of the limit and performance, the less accurate way
was intentionally chosen.  I apologize for not making that more clear
in comments.
"""

ODBC uses cursors in rather strange ways, often to implement a kind of
ODBC-level cache. See the description of "Use Declare/Fetch" from
https://odbc.postgresql.org/docs/config.html to get some idea of what
this can look like.

I think that it's worth considering whether or not there are a
significant number of "snapshot too old" users that rarely or never
rely on old snapshots used by new queries. Kevin said that this
happens "in some cases", but how many cases? Might it be that many
"snapshot too old" users could get by with a version of the feature
that makes the most conservative possible assumptions, totally giving
up on the idea of differentiating which blocks are truly safe to
access with an "old" snapshot? (In other words, one that assumes that
they're *all* unsafe for an "old" snapshot.)

I'm thinking of a version of "snapshot too old" that amounts to a
statement timeout that gets applied for xmin horizon type purposes in
the conventional way, while only showing an error to the client if and
when they access literally any buffer (though not when the relation is
a system catalog). Is it possible that something along those lines is
appreciably better than nothing to users? If it is, and if we can find
a way to manage the transition, then maybe we could tolerate
supporting this greatly simplified implementation of "snapshot too
old".

I feel slightly silly for even suggesting this. I have to ask. Maybe
nobody noticed a problem with the feature before now (at least in
part) because they didn't truly care about old snapshots anyway. They
just wanted to avoid a significant impact from buggy code that leaks
cursors and things like that. Or, they were happy as long as they
could still access ODBC's "100 rows in a cache" through the cursor.
The docs say that a old_snapshot_threshold setting in the hours is
about the lowest reasonable setting for production use, which seems
rather high to me. It almost seems as if the feature specifically
targets misbehaving applications already.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Amit Kapila

Дата:

03 апреля 2020 г., 12:02:09

On Fri, Apr 3, 2020 at 6:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Since this is a feature that can result in wrong query results (and
> > quite possibly crashes / data corruption), I don't think we can just
> > leave this unfixed.  But given the amount of code / infrastructure
> > changes required to get this into a working feature, I don't see how we
> > can unleash those changes onto the stable branches.
>

As per my initial understanding, the changes required are (a) There
seem to be multiple places where TestForOldSnapshot is missing,  (b)
TestForOldSnapshot itself need to be reviewed carefully to see if it
has problems, (c) Some of the members of OldSnapshotControlData like
head_timestamp and xid_by_minute are not maintained accurately, (d)
handling of wraparound for xids in the in-memory data-structure for
this feature is required, (e) test infrastructure is not good enough
to catch bugs or improve this feature.

Now, this sounds like a quite of work but OTOH, if we see most of the
critical changes required will be in only a few functions like
TransactionIdLimitedForOldSnapshots(),
MaintainOldSnapshotTimeMapping(), TestForOldSnapshot().  I don't deny
the possibility that we might need much more work or we need to come
up with quite a different design to address all these problems but
unless Kevin or someone else doesn't come up with a solution to
address all of these problems, we can't be sure of that.

> I don't think that the feature can be allowed to remain in anything
> like its current form. The current design is fundamentally unsound.
>

Agreed, but OTOH, not giving time to Kevin or others who might be
interested to support this work is also not fair.  I think once
somebody comes up with patches for problems we can decide whether this
feature can be salvaged in back-branches or we need to disable it in a
hard-way.  Now, if Kevin himself is not interested in fixing or nobody
shows up to help, then surely we can take the decision sooner but
giving time for a couple of weeks (or even till we are near for PG13
release) in this case doesn't seem like a bad idea.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

03 апреля 2020 г., 22:03:09

Hi,

On 2020-04-03 14:32:09 +0530, Amit Kapila wrote:
> On Fri, Apr 3, 2020 at 6:52 AM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Thu, Apr 2, 2020 at 5:17 PM Andres Freund <andres@anarazel.de> wrote:
> > > Since this is a feature that can result in wrong query results (and
> > > quite possibly crashes / data corruption), I don't think we can just
> > > leave this unfixed.  But given the amount of code / infrastructure
> > > changes required to get this into a working feature, I don't see how we
> > > can unleash those changes onto the stable branches.
> >
>
> As per my initial understanding, the changes required are (a) There
> seem to be multiple places where TestForOldSnapshot is missing, (b)
> TestForOldSnapshot itself need to be reviewed carefully to see if it
> has problems, (c) Some of the members of OldSnapshotControlData like
> head_timestamp and xid_by_minute are not maintained accurately, (d)
> handling of wraparound for xids in the in-memory data-structure for
> this feature is required, (e) test infrastructure is not good enough
> to catch bugs or improve this feature.

And a bunch more correctness issues. But basically, yes.

When you say "(c) Some of the members of OldSnapshotControlData like
head_timestamp and xid_by_minute are not maintained accurately)" - note
that that's the core state for the whole feature.

With regards to test: "not good enough" is somewhat of an
understatement. Not a *single* tuple is removed in the tests due to
old_snapshot_threshold - and removing tuples is the entire point.

> Now, this sounds like a quite of work but OTOH, if we see most of the
> critical changes required will be in only a few functions like
> TransactionIdLimitedForOldSnapshots(),
> MaintainOldSnapshotTimeMapping(), TestForOldSnapshot().

I don't think that's really the case. Every place reading a buffer needs
to be inspected, and new calls added. They aren't free, and I'm not sure
all of them have the relevant snapshot available. To fix the issue of
spurious errors, we'd likely need changes outside of those, and it'd
quite possibly have performance / bloat implications.

> I don't deny the possibility that we might need much more work or we
> need to come up with quite a different design to address all these
> problems but unless Kevin or someone else doesn't come up with a
> solution to address all of these problems, we can't be sure of that.
>
> > I don't think that the feature can be allowed to remain in anything
> > like its current form. The current design is fundamentally unsound.
> >
>
> Agreed, but OTOH, not giving time to Kevin or others who might be
> interested to support this work is also not fair.  I think once
> somebody comes up with patches for problems we can decide whether this
> feature can be salvaged in back-branches or we need to disable it in a
> hard-way.  Now, if Kevin himself is not interested in fixing or nobody
> shows up to help, then surely we can take the decision sooner but
> giving time for a couple of weeks (or even till we are near for PG13
> release) in this case doesn't seem like a bad idea.

It'd certainly be great if somebody came up with fixes, yes. Even if we
had to disable it in the back branches, that'd allow us to keep the
feature around, at least.

The likelihood of regressions even when the feature is not on does not
seem that low. But you're right, we'll be able to better judge it with
fixes to look at.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Amit Kapila

Дата:

04 апреля 2020 г., 11:57:56

On Sat, Apr 4, 2020 at 12:33 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2020-04-03 14:32:09 +0530, Amit Kapila wrote:
> >
> > Agreed, but OTOH, not giving time to Kevin or others who might be
> > interested to support this work is also not fair.  I think once
> > somebody comes up with patches for problems we can decide whether this
> > feature can be salvaged in back-branches or we need to disable it in a
> > hard-way.  Now, if Kevin himself is not interested in fixing or nobody
> > shows up to help, then surely we can take the decision sooner but
> > giving time for a couple of weeks (or even till we are near for PG13
> > release) in this case doesn't seem like a bad idea.
>
> It'd certainly be great if somebody came up with fixes, yes. Even if we
> had to disable it in the back branches, that'd allow us to keep the
> feature around, at least.
>
> The likelihood of regressions even when the feature is not on does not
> seem that low.
>

Yeah, that is the key point.  IIRC, when this feature got added Kevin
and others spent a lot of effort to ensure that point.

> But you're right, we'll be able to better judge it with
> fixes to look at.
>

I am hoping Kevin would take the lead and then others also can help.
Kevin, please do let us know if you are *not* planning to work on the
issues raised in this thread so that we can think of an alternative?

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

13 апреля 2020 г., 05:58:34

On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I think that it's worth considering whether or not there are a
> significant number of "snapshot too old" users that rarely or never
> rely on old snapshots used by new queries. Kevin said that this
> happens "in some cases", but how many cases? Might it be that many
> "snapshot too old" users could get by with a version of the feature
> that makes the most conservative possible assumptions, totally giving
> up on the idea of differentiating which blocks are truly safe to
> access with an "old" snapshot? (In other words, one that assumes that
> they're *all* unsafe for an "old" snapshot.)
>
> I'm thinking of a version of "snapshot too old" that amounts to a
> statement timeout that gets applied for xmin horizon type purposes in
> the conventional way, while only showing an error to the client if and
> when they access literally any buffer (though not when the relation is
> a system catalog). Is it possible that something along those lines is
> appreciably better than nothing to users? If it is, and if we can find
> a way to manage the transition, then maybe we could tolerate
> supporting this greatly simplified implementation of "snapshot too
> old".

Hi Peter,

Interesting idea.  I'm keen to try prototyping it to see how well it
works out it practice.  Let me know soon if you already have designs
on that and I'll get out of your way, otherwise I'll give it a try and
share what I come up with.

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

13 апреля 2020 г., 08:14:11

Hi,

On 2020-04-13 14:58:34 +1200, Thomas Munro wrote:
> On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think that it's worth considering whether or not there are a
> > significant number of "snapshot too old" users that rarely or never
> > rely on old snapshots used by new queries. Kevin said that this
> > happens "in some cases", but how many cases? Might it be that many
> > "snapshot too old" users could get by with a version of the feature
> > that makes the most conservative possible assumptions, totally giving
> > up on the idea of differentiating which blocks are truly safe to
> > access with an "old" snapshot? (In other words, one that assumes that
> > they're *all* unsafe for an "old" snapshot.)
> >
> > I'm thinking of a version of "snapshot too old" that amounts to a
> > statement timeout that gets applied for xmin horizon type purposes in
> > the conventional way, while only showing an error to the client if and
> > when they access literally any buffer (though not when the relation is
> > a system catalog). Is it possible that something along those lines is
> > appreciably better than nothing to users? If it is, and if we can find
> > a way to manage the transition, then maybe we could tolerate
> > supporting this greatly simplified implementation of "snapshot too
> > old".
> 
> Hi Peter,
> 
> Interesting idea.  I'm keen to try prototyping it to see how well it
> works out it practice.  Let me know soon if you already have designs
> on that and I'll get out of your way, otherwise I'll give it a try and
> share what I come up with.

FWIW, I think the part that is currently harder to fix is the time->xmin
mapping and some related pieces. Second comes the test
infrastructure. Compared to those, adding additional checks for old
snapshots wouldn't be too hard - although I'd argue that the approach of
sprinkling these tests everywhere isn't that scalable...

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

15 апреля 2020 г., 05:21:16

On Mon, Apr 13, 2020 at 2:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > I think that it's worth considering whether or not there are a
> > significant number of "snapshot too old" users that rarely or never
> > rely on old snapshots used by new queries. Kevin said that this
> > happens "in some cases", but how many cases? Might it be that many
> > "snapshot too old" users could get by with a version of the feature
> > that makes the most conservative possible assumptions, totally giving
> > up on the idea of differentiating which blocks are truly safe to
> > access with an "old" snapshot? (In other words, one that assumes that
> > they're *all* unsafe for an "old" snapshot.)
> >
> > I'm thinking of a version of "snapshot too old" that amounts to a
> > statement timeout that gets applied for xmin horizon type purposes in
> > the conventional way, while only showing an error to the client if and
> > when they access literally any buffer (though not when the relation is
> > a system catalog). Is it possible that something along those lines is
> > appreciably better than nothing to users? If it is, and if we can find
> > a way to manage the transition, then maybe we could tolerate
> > supporting this greatly simplified implementation of "snapshot too
> > old".
>
> Interesting idea.  I'm keen to try prototyping it to see how well it
> works out it practice.  Let me know soon if you already have designs
> on that and I'll get out of your way, otherwise I'll give it a try and
> share what I come up with.

Here's a quick and dirty test patch of that idea (or my understanding
of it), just for experiments.  It introduces snapshot->expire_time and
a new timer SNAPSHOT_TIMEOUT to cause the next CHECK_FOR_INTERRUPTS()
to set snapshot->too_old on any active or registered snapshots whose
time has come, and then try to advance MyPgXact->xmin, without
considering the ones marked too old.  That gets rid of the concept of
"early pruning".  You can use just regular pruning, because the
snapshot is no longer holding the regular xmin back.  Then
TestForOldSnapshot() becomes simply if (snapshot->too_old)
ereport(...).

There are certainly some rough edges, missed details and bugs in here,
not least the fact (pointed out to me by Andres in an off-list chat)
that we sometimes use short-lived snapshots without registering them;
we'd have to fix that.  It also does nothing to ensure that
TestForOldSnapshot() is actually called at all the right places, which
is still required for correct results.

If those problems can be fixed, you'd have a situation where
snapshot-too-old is a coarse grained, blunt instrument that
effectively aborts your transaction even if the whole cluster is
read-only.  I am not sure if that's really truly useful to anyone (ie
if these ODBC cursor users would be satisfied; I'm not sure I
understand that use case).

Hmm.  I suppose it must be possible to put the LSN check back: if
(snapshot->too_old && PageGetLSN(page) > snapshot->lsn) ereport(...).
Then the granularity would be the same as today -- block level -- but
the complexity is transferred from the pruning side (has to deal with
xid time map) to the snapshot-owning side (has to deal with timers,
CFI() and make sure all snapshots are registered).  Maybe not a great
deal, and maybe not easier than fixing the existing bugs.

One problem is all the new setitimer() syscalls.  I feel like that
could be improved, as could statement_timeout, by letting existing
timers run rather than repeatedly rescheduling eagerly, so that eg a 1
minute timeout never gets rescheduled more than once per minute.  I
haven't looked into that, but I guess it's no worse than the existing
implement's overheads anyway.

PS in the patch the GUC is interpreted as milliseconds, which is more
fun for testing but it should really be minutes like before.

Вложения

0001-Implement-snapshot_too_old-using-a-timer.patch

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

17 апреля 2020 г., 06:37:12

On Mon, Apr 13, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
> FWIW, I think the part that is currently harder to fix is the time->xmin
> mapping and some related pieces. Second comes the test
> infrastructure. Compared to those, adding additional checks for old
> snapshots wouldn't be too hard - although I'd argue that the approach of
> sprinkling these tests everywhere isn't that scalable...

Just trying out some ideas here...  I suppose the wrapping problem
just requires something along the lines of the attached, but now I'm
wondering how to write decent tests for it.  Using the
pg_clobber_current_snapshot_timestamp() function I mentioned in
Robert's time->xmin thread, it's easy to build up a time map without
resorting to sleeping etc, with something like:

select pg_clobber_current_snapshot_timestamp('3000-01-01 00:00:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:01:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:02:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:03:00Z');
select pg_current_xact_id();
select pg_clobber_current_snapshot_timestamp('3000-01-01 00:04:00Z');

Then of course  frozenXID can be advanced with eg update pg_database
set datallowconn = 't' where datname = 'template0', then vacuumdb
--freeze --all, and checked before and after with Robert's
pg_old_snapshot_time_mapping() SRF to see that it's truncated.  But
it's not really the level of stuff we'd ideally mess with in
pg_regress tests and I don't see any precent, so I guess maybe I'll
need to go and figure out how to write some perl.

Вложения

0001-Truncate-old-snapshot-XIDs-before-truncating-CLOG.patch

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

17 апреля 2020 г., 08:17:18

On Fri, Apr 17, 2020 at 3:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Apr 13, 2020 at 5:14 PM Andres Freund <andres@anarazel.de> wrote:
> > FWIW, I think the part that is currently harder to fix is the time->xmin
> > mapping and some related pieces. Second comes the test
> > infrastructure. Compared to those, adding additional checks for old
> > snapshots wouldn't be too hard - although I'd argue that the approach of
> > sprinkling these tests everywhere isn't that scalable...
>
> Just trying out some ideas here...
> ... so I guess maybe I'll
> need to go and figure out how to write some perl.

Here's a very rough sketch of what I mean.  Patches 0001-0003 are
stolen directly from Robert.  I think 0005's t/001_truncate.pl
demonstrates that the map is purged of old xids as appropriate.  I
suppose this style of testing based on manually advancing the hands of
time should also allow for testing early pruning, but it may be Monday
before I can try that so I'm sharing what I have so far in case it's
useful...  I think this really wants to be in src/test/modules, not
contrib, but I just bolted it on top of what Robert posted.

Вложения

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

17 апреля 2020 г., 15:19:43

On Thu, Apr 16, 2020 at 11:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Then of course  frozenXID can be advanced with eg update pg_database
> set datallowconn = 't' where datname = 'template0', then vacuumdb
> --freeze --all, and checked before and after with Robert's
> pg_old_snapshot_time_mapping() SRF to see that it's truncated.  But
> it's not really the level of stuff we'd ideally mess with in
> pg_regress tests and I don't see any precent, so I guess maybe I'll
> need to go and figure out how to write some perl.

The reason I put it in contrib is because I thought it would possibly
be useful to anyone who is actually using this feature to be able to
look at this information. It's unclear to me that there's any less
reason to provide introspection here than there is for, say, pg_locks.

It's sorta unclear to me why you continued the discussion of this on
this thread rather than the new one I started. Seems like doing it
over there might be clearer.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

17 апреля 2020 г., 23:39:43

On Sat, Apr 18, 2020 at 12:19 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 16, 2020 at 11:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Then of course  frozenXID can be advanced with eg update pg_database
> > set datallowconn = 't' where datname = 'template0', then vacuumdb
> > --freeze --all, and checked before and after with Robert's
> > pg_old_snapshot_time_mapping() SRF to see that it's truncated.  But
> > it's not really the level of stuff we'd ideally mess with in
> > pg_regress tests and I don't see any precent, so I guess maybe I'll
> > need to go and figure out how to write some perl.
>
> The reason I put it in contrib is because I thought it would possibly
> be useful to anyone who is actually using this feature to be able to
> look at this information. It's unclear to me that there's any less
> reason to provide introspection here than there is for, say, pg_locks.

Makes sense.  I was talking more about the
pg_clobber_snapshot_timestamp() function I showed, which is for use by
tests, not end users, since it does weird stuff to internal state.

> It's sorta unclear to me why you continued the discussion of this on
> this thread rather than the new one I started. Seems like doing it
> over there might be clearer.

I understood that you'd forked a new thread to discuss one particular
problem among the many that Andres nailed to the door, namely the xid
map's failure to be monotonic, and here I was responding to other
things from his list, namely the lack of defences against wrap-around
and the lack of testing.  Apparently I misunderstood.  I will move to
the new thread for the next version I post, once I figure out if I can
use pg_clobber_snapshot_timestamp() in a TAP test to check early
vacuum/pruning behaviour.

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

18 апреля 2020 г., 15:34:26

On Fri, Apr 17, 2020 at 4:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I understood that you'd forked a new thread to discuss one particular
> problem among the many that Andres nailed to the door, namely the xid
> map's failure to be monotonic, and here I was responding to other
> things from his list, namely the lack of defences against wrap-around
> and the lack of testing.  Apparently I misunderstood.

Oh, maybe I'm the one who misunderstood...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: snapshot too old issues, first around wraparound and then more.

От

Tom Lane

Дата:

15 июня 2021 г., 19:51:28

Robert Haas <robertmhaas@gmail.com> writes:
> Oh, maybe I'm the one who misunderstood...

So, it's well over a year later, and so far as I can see exactly
nothing has been done about snapshot_too_old's problems.

I never liked that feature to begin with, and I would be very
glad to undertake the task of ripping it out.  If someone thinks
this should not happen, please commit to fixing it ... and not
"eventually".

            regards, tom lane

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

15 июня 2021 г., 20:10:36

Hi,

On 2021-06-15 12:51:28 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > Oh, maybe I'm the one who misunderstood...
> 
> So, it's well over a year later, and so far as I can see exactly
> nothing has been done about snapshot_too_old's problems.
> 
> I never liked that feature to begin with, and I would be very
> glad to undertake the task of ripping it out.  If someone thinks
> this should not happen, please commit to fixing it ... and not
> "eventually".

I still think that's the most reasonable course. I actually like the
feature, but I don't think a better implementation of it would share
much if any of the current infrastructure.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

15 июня 2021 г., 20:10:42

On Tue, Jun 15, 2021 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So, it's well over a year later, and so far as I can see exactly
> nothing has been done about snapshot_too_old's problems.

FWIW I think that the concept itself is basically reasonable. The
implementation is very flawed, though, so it hardly enters into it.

> I never liked that feature to begin with, and I would be very
> glad to undertake the task of ripping it out.  If someone thinks
> this should not happen, please commit to fixing it ... and not
> "eventually".

ISTM that this is currently everybody's responsibility, and therefore
nobody's responsibility. That's probably why the problems haven't been
resolved yet.

I propose that the revert question be explicitly timeboxed. If the
issues haven't been fixed by some date, then "snapshot too old"
automatically gets reverted without further discussion. This gives
qualified hackers the opportunity to save the feature if they feel
strongly about it, and are actually willing to take responsibility for
its ongoing maintenance.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Tom Lane

Дата:

15 июня 2021 г., 21:01:40

Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Jun 15, 2021 at 9:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So, it's well over a year later, and so far as I can see exactly
>> nothing has been done about snapshot_too_old's problems.

> I propose that the revert question be explicitly timeboxed. If the
> issues haven't been fixed by some date, then "snapshot too old"
> automatically gets reverted without further discussion. This gives
> qualified hackers the opportunity to save the feature if they feel
> strongly about it, and are actually willing to take responsibility for
> its ongoing maintenance.

The goal I have in mind is for snapshot_too_old to be fixed or gone
in v15.  I don't feel a need to force the issue sooner than that, so
there's plenty of time for someone to step up, if anyone wishes to.

I imagine that we should just ignore the question of whether anything
can be done for it in the back branches.  Given the problems
identified upthread, fixing it in a non-back-patchable way would be
challenging enough.

            regards, tom lane

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

15 июня 2021 г., 21:20:51

On Tue, Jun 15, 2021 at 11:01 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The goal I have in mind is for snapshot_too_old to be fixed or gone
> in v15.  I don't feel a need to force the issue sooner than that, so
> there's plenty of time for someone to step up, if anyone wishes to.

Seems more than reasonable to me. A year ought to be plenty of time if
the feature truly is salvageable.

What do other people think? Ideally we could commit to that hard
deadline now. To me the important thing is to actually have a real
deadline that forces the issue one way or another. This situation must
not be allowed to drag on forever.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Robert Haas

Дата:

15 июня 2021 г., 22:17:05

On Tue, Jun 15, 2021 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So, it's well over a year later, and so far as I can see exactly
> nothing has been done about snapshot_too_old's problems.

Progress has been pretty limited, but not altogether nonexistent.
55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to
fix, the time->XID mapping, which is one of the main things that
Andres said was broken originally. Also, there are patches on this
thread from Thomas Munro to add some test coverage for that case,
another problem Andres noted in his original email. I guess it
wouldn't be too hard to get something committed there, and I'm willing
to do it if Thomas doesn't want to and if there's any prospect of
salvaging the feature.

But that's not clear to me. I'm not clear how exactly how many
problems we know about and need to fix in order to keep the feature,
and I'm also not clear how deep the hole goes. Like, if we need to get
a certain number of specific bugs fixed, I might be willing to do
that. If we need to commit to a major rewrite of the current
implementation, that's more than I can do. But I guess I don't
understand exactly how bad the current problems are. Reviewing
complaints from Andres from this thread:

> Looking at TransactionIdLimitedForOldSnapshots() I think the ts ==
> update_ts threshold actually needs to be ts >= update_ts, right now we
> don't handle being newer than the newest bin correctly afaict (mitigated
> by autovacuum=on with naptime=1s doing a snapshot more often). It's hard
> to say, because there's no comments.

This seems specific enough to be analyzed and anything that is broken
can be fixed.

> The whole lock nesting is very hazardous. Most (all?)
> TestForOldSnapshot() calls happen with locks on on buffers held, and can
> acquire lwlocks itself. In some older branches we do entire *catalog
> searches* with the buffer lwlock held (for RelationHasUnloggedIndex()).

I think it's unclear whether there are live problems in master in this area.

> GetSnapshotData() using snapshot->lsn = GetXLogInsertRecPtr(); as the
> basis to detect conflicts seems dangerous to me. Isn't that ignoring
> inserts that are already in progress?

Discussion on this point trailed off. Upon rereading, I think Andres
is correct that there's an issue; the snapshot's LSN needs to be set
to a value not older than the last xlog insertion that has been
completed rather than, as now, the last one that is started. I guess
to get that value we would need to do something like
WaitXLogInsertionsToFinish(), or some approximation of it e.g.
GetXLogWriteRecPtr() at the risk of unnecessary snapshot-too-old
errors.

> * In read-mostly workloads it can trigger errors in sessions that are
>  much younger than old_snapshot_threshold, if the transactionid is not
>  advancing.
>
>  I've not tried to reproduce, but I suspect this can also cause wrong
>  query results. Because a later snapshot can have the same xmin as
>  older transactions, it sure looks like we can end up with data from an
>  older xmin getting removed, but the newer snapshot's whenTaken will
>  prevent TestForOldSnapshot_impl() from raising an error.

I haven't really wrapped my head around this one, but it seems
amenable to a localized fix. It basically amounts to a complaint that
GetOldSnapshotThresholdTimestamp() is returning a newer value than it
should. I don't know exactly what's required to make it not do that,
but it doesn't seem intractable.

> * I am fairly sure that it can cause crashes (or even data corruption),
>  because it assumes that DML never needs to check for old snapshots
>  (with no meaningful justificiation). Leading to heap_update/delete to
>  assume the page header is a tuple.

I don't understand the issue here, really. I assume there might be a
wrong word here, because assuming that the page header is a tuple
doesn't sound like a thing that would actually happen. I think one of
the key problems for this feature is figuring out whether you've got
snapshot-too-old checks in all the right places. I think what is being
alleged here is that heap_update() and heap_delete() need them, and
that it's not good enough to rely on the scan that found the tuple to
be updated or deleted having already performed those checks. It is not
clear to me whether that is true, or how it could cause crashes.
Andres may have explained this to me at some point, but if he did I
have unfortunately forgotten.

My general point here is that I would like to know whether we have a
finite number of reasonably localized bugs or a three-ring disaster
that is unrecoverable no matter what we do. Andres seems to think it
is the latter, and I *think* Peter Geoghegan agrees, but I think that
the point might be worth a little more discussion. I'm unclear whether
Tom's dislike for the feature represents hostility to the concept -
with which I would have to disagree - or a judgement on the quality of
the implementation - which might be justified. For the record, and to
Peter's point, I think it's reasonable to set v15 feature freeze as a
drop-dead date for getting this feature into acceptable shape, but I
would like to try to nail down what we think "acceptable" means in
this context.

Thanks,

--
Robert Haas
EDB: http://www.enterprisedb.com

Re: snapshot too old issues, first around wraparound and then more.

От

Tom Lane

Дата:

15 июня 2021 г., 22:49:03

Robert Haas <robertmhaas@gmail.com> writes:
> My general point here is that I would like to know whether we have a
> finite number of reasonably localized bugs or a three-ring disaster
> that is unrecoverable no matter what we do. Andres seems to think it
> is the latter, and I *think* Peter Geoghegan agrees, but I think that
> the point might be worth a little more discussion.

TBH, I am not clear on that either.

> I'm unclear whether
> Tom's dislike for the feature represents hostility to the concept -
> with which I would have to disagree - or a judgement on the quality of
> the implementation - which might be justified.

I think it's a klugy, unprincipled solution to a valid real-world
problem.  I suspect the implementation issues are not unrelated to
the kluginess of the concept.  Thus, I would really like to see us
throw this away and find something better.  I admit I have nothing
to offer about what a better solution to the problem would look like.
But I would really like it to not involve random-seeming query failures.

            regards, tom lane

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 00:12:27

On Tue, Jun 15, 2021 at 12:49 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > My general point here is that I would like to know whether we have a
> > finite number of reasonably localized bugs or a three-ring disaster
> > that is unrecoverable no matter what we do. Andres seems to think it
> > is the latter, and I *think* Peter Geoghegan agrees, but I think that
> > the point might be worth a little more discussion.
>
> TBH, I am not clear on that either.

I don't know for sure which it is, but that in itself isn't actually
what matters to me. The most concerning thing is that I don't really
know how to *assess* the design now. The clear presence of at least
several very severe bugs doesn't necessarily prove anything (it just
*hints* at major design problems).

If I could make a very clear definitive statement on this then I'd
probably have to do ~1/3 of the total required work -- that'd be my
guess. If it was easy to be quite sure here then we wouldn't still be
here 12 months later. In any case I don't think that the feature
deserves to be treated all that differently to something that was
committed much more recently, given what we know. Frankly it took me
about 5 minutes to find a very serious bug in the feature, pretty much
without giving it any thought. That is not a good sign.

> I think it's a klugy, unprincipled solution to a valid real-world
> problem.  I suspect the implementation issues are not unrelated to
> the kluginess of the concept.  Thus, I would really like to see us
> throw this away and find something better.  I admit I have nothing
> to offer about what a better solution to the problem would look like.
> But I would really like it to not involve random-seeming query failures.

I would be very happy to see somebody take this up, because it is
important. The reality is that anybody that undertakes this task
should start with the assumption that they're starting from scratch,
at least until they learn otherwise. So ISTM that it might as well be
true that it needs a total rewrite, even if it turns out to not be
strictly true.

--
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 00:32:11

On Tue, Jun 15, 2021 at 12:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
> My general point here is that I would like to know whether we have a
> finite number of reasonably localized bugs or a three-ring disaster
> that is unrecoverable no matter what we do. Andres seems to think it
> is the latter, and I *think* Peter Geoghegan agrees, but I think that
> the point might be worth a little more discussion. I'm unclear whether
> Tom's dislike for the feature represents hostility to the concept -
> with which I would have to disagree - or a judgement on the quality of
> the implementation - which might be justified. For the record, and to
> Peter's point, I think it's reasonable to set v15 feature freeze as a
> drop-dead date for getting this feature into acceptable shape, but I
> would like to try to nail down what we think "acceptable" means in
> this context.

What I had in mind was this: a committer adopting the feature
themselves. The committer would be morally obligated to maintain the
feature on an ongoing basis, just as if they were the original
committer. This seems like the only sensible way of resolving this
issue once and for all.

If it really is incredibly important that we keep this feature, or one
like it, then I have to imagine that somebody will step forward --
there is still ample opportunity. But if nobody steps forward, I'll be
forced to conclude that perhaps it wasn't quite as important as I
first thought. Anybody can agree that it's important in an abstract
sense -- that's easy. What we need is a committer willing to sign on
the dotted line, which we're no closer to today than we were a year
ago. Actions speak louder than words.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 00:45:50

On Wed, Apr 1, 2020 at 4:59 PM Andres Freund <andres@anarazel.de> wrote:
> The primary issue here is that there is no TestForOldSnapshot() in
> heap_hot_search_buffer(). Therefore index fetches will never even try to
> detect that tuples it needs actually have already been pruned away.

This is still true, right? Nobody fixed this bug after 14 months? Even
though we're talking about returning rows that are not visible to the
xact's snapshot?

--
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Tom Lane

Дата:

16 июня 2021 г., 00:50:33

Peter Geoghegan <pg@bowt.ie> writes:
> What I had in mind was this: a committer adopting the feature
> themselves. The committer would be morally obligated to maintain the
> feature on an ongoing basis, just as if they were the original
> committer. This seems like the only sensible way of resolving this
> issue once and for all.

Yeah, it seems clear that we need somebody to do that, given that
Kevin Grittner has been inactive for awhile.  Even if the known
problems can be resolved by drive-by patches, I think this area
needs an ongoing commitment from someone.

            regards, tom lane

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

16 июня 2021 г., 03:20:46

Hi,

On 2021-06-15 15:17:05 -0400, Robert Haas wrote:
> But that's not clear to me. I'm not clear how exactly how many
> problems we know about and need to fix in order to keep the feature,
> and I'm also not clear how deep the hole goes. Like, if we need to get
> a certain number of specific bugs fixed, I might be willing to do
> that. If we need to commit to a major rewrite of the current
> implementation, that's more than I can do. But I guess I don't
> understand exactly how bad the current problems are. Reviewing
> complaints from Andres from this thread:

One important complaints I think your (useful!) list missed is that there's
missing *read side* checks that demonstrably lead to wrong query results:
https://www.postgresql.org/message-id/CAH2-Wz%3DFQ9rbBKkt1nXvz27kmd4A8i1%2B7dcLTNqpCYibxX83VQ%40mail.gmail.com
and that it's currently very hard to figure out where they need to be, because
there's no real explained model of what needs to be checked and what not.

> > * I am fairly sure that it can cause crashes (or even data corruption),
> >  because it assumes that DML never needs to check for old snapshots
> >  (with no meaningful justificiation). Leading to heap_update/delete to
> >  assume the page header is a tuple.
> 
> I don't understand the issue here, really. I assume there might be a
> wrong word here, because assuming that the page header is a tuple
> doesn't sound like a thing that would actually happen.

I suspect what I was thinking of is that a tuple could get pruned away due to
s_t_o, which would leave a LP_DEAD item around. As heap_update/delete neither
checks s_t_o, nor balks at targetting LP_DEAD items, we'd use the offset from
a the LP_DEAD item. ItemIdSetDead() sets lp_off to 0 - which would mean that
the page header is interpreted as a tuple. Right?

> I think one of the key problems for this feature is figuring out
> whether you've got snapshot-too-old checks in all the right places. I
> think what is being alleged here is that heap_update() and
> heap_delete() need them, and that it's not good enough to rely on the
> scan that found the tuple to be updated or deleted having already
> performed those checks. It is not clear to me whether that is true, or
> how it could cause crashes.  Andres may have explained this to me at
> some point, but if he did I have unfortunately forgotten.

I don't think it is sufficient to rely on the scan. That works only as long as
the page with the to-be-modified tuple is pinned (since that'd prevent pruning
/ vacuuming from working on the page), but I am fairly sure that there are
plans where the target tuple is not pinned from the point it was scanned until
it is modified. In which case it is entirely possible that the u/d target can
be pruned away due to s_t_o between the scan checking s_t_o and the u/d
executing.

> My general point here is that I would like to know whether we have a
> finite number of reasonably localized bugs or a three-ring disaster
> that is unrecoverable no matter what we do. Andres seems to think it
> is the latter

Correct. I think there's numerous architectural issues the way the feature is
implemented right now, and that it'd be a substantial project to address them.

> For the record, and to Peter's point, I think it's reasonable to set
> v15 feature freeze as a drop-dead date for getting this feature into
> acceptable shape, but I would like to try to nail down what we think
> "acceptable" means in this context.

I think the absolute minimum would be to have
- actually working tests
- a halfway thorough code review of the feature
- added documentation explaining where exactly s_t_o tests need to be
- bugfixes obviously

If I were to work on the feature, I cannot imagine being sufficient confident
the feature works as long as the xid->time mapping granularity is a
minute. It's just not possible to write reasonable tests with the granularity
being that high. Or even to do manual tests of it - I'm not that patient. But
I "can accept" if somebody actually doing the work differs on this.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

16 июня 2021 г., 04:32:01

On Wed, Jun 16, 2021 at 7:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Progress has been pretty limited, but not altogether nonexistent.
> 55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 fixed, or at least seemed to
> fix, the time->XID mapping, which is one of the main things that
> Andres said was broken originally. Also, there are patches on this
> thread from Thomas Munro to add some test coverage for that case,
> another problem Andres noted in his original email. I guess it
> wouldn't be too hard to get something committed there, and I'm willing
> to do it if Thomas doesn't want to and if there's any prospect of
> salvaging the feature.

FTR the latest patches are on a different thread[1].  I lost steam on
that stuff because I couldn't find a systematic way to deal with the
lack of checks all over the place, or really understand how the whole
system fits together with confidence.  Those patches to fix an xid
wraparound bug and make the testing work better may be useful and I'll
be happy to rebase them, depending on how this discussion goes, but it
seems a bit like the proverbial deckchairs on the Titanic from what
I'm reading...  I think the technique I showed for exercising some
basic STO mechanisms and scenarios is probably useful, but I currently
have no idea how to prove much of anything about the whole system and
am not personally in a position to dive into that rabbit hole in a
PG15 time scale.

[1] https://www.postgresql.org/message-id/CA%2BhUKGJyw%3DuJ4eL1x%3D%2BvKm16fLaxNPvKUYtnChnRkSKi024u_A%40mail.gmail.com

Re: snapshot too old issues, first around wraparound and then more.

От

Noah Misch

Дата:

16 июня 2021 г., 07:59:45

On Tue, Jun 15, 2021 at 02:32:11PM -0700, Peter Geoghegan wrote:
> What I had in mind was this: a committer adopting the feature
> themselves. The committer would be morally obligated to maintain the
> feature on an ongoing basis, just as if they were the original
> committer. This seems like the only sensible way of resolving this
> issue once and for all.
> 
> If it really is incredibly important that we keep this feature, or one
> like it, then I have to imagine that somebody will step forward --
> there is still ample opportunity. But if nobody steps forward, I'll be
> forced to conclude that perhaps it wasn't quite as important as I
> first thought.

Hackers are rather wise, but the variety of PostgreSQL use is enormous.  We
see that, among other ways, when regression fixes spike in each vN.1.  The
$SUBJECT feature was born in response to a user experience; a lack of hacker
interest doesn't invalidate that user experience.  We face these competing
interests, at least:

1) Some users want the feature kept so their application can use a certain
   pattern of long-running, snapshot-bearing transactions.

2) (a) Some hackers want the feature gone so they can implement changes
   without making those changes cooperate with this feature.  (b) Bugs in this
   feature make such cooperation materially harder.

3) Some users want the feature gone because (2) is slowing the progress of
   features they do want.

4) Some users want the feature kept because they don't use it but will worry
   what else is vulnerable to removal.  PostgreSQL has infrequent history of
   removing released features.  Normally, PostgreSQL lets some bugs languish
   indefinitely, e.g. in
   https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Live_issues

5) Some users want the feature gone because they try it, find a bug, and
   regret trying it or fear trying other features.

A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
essentially.  What other interests are relevant?

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 08:47:45

On Tue, Jun 15, 2021 at 9:59 PM Noah Misch <noah@leadboat.com> wrote:
> Hackers are rather wise, but the variety of PostgreSQL use is enormous.  We
> see that, among other ways, when regression fixes spike in each vN.1.  The
> $SUBJECT feature was born in response to a user experience; a lack of hacker
> interest doesn't invalidate that user experience.

I agree that it would be good to hear from some users about this. If a
less painful workaround is possible at all, then users may be able to
help -- maybe it'll be possible to cut scope.

> We face these competing
> interests, at least:

> 1) Some users want the feature kept so their application can use a certain
>    pattern of long-running, snapshot-bearing transactions.

Undoubtedly true.

> 2) (a) Some hackers want the feature gone so they can implement changes
>    without making those changes cooperate with this feature.  (b) Bugs in this
>    feature make such cooperation materially harder.

Is that really true? Though it was probably true back when this thread
was started last year, things have changed. Andres found a way to work
around the problems he had with snapshot too old, AFAIK.

> A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
> essentially.  What other interests are relevant?

The code simply isn't up to snuff. If the code was in a niche contrib
module then maybe it would be okay to let this slide. But the fact is
that it touches critical parts of the system. This cannot be allowed
to drag on forever. It's as simple as that.

I admit that I think that the most likely outcome is that it gets
reverted. I don't feel great about that. What else can be done about
it that will really help the situation? No qualified person is likely
to have the time to commit to fixing snapshot too old. Isn't that the
real problem here?

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Noah Misch

Дата:

16 июня 2021 г., 09:24:20

On Tue, Jun 15, 2021 at 10:47:45PM -0700, Peter Geoghegan wrote:
> On Tue, Jun 15, 2021 at 9:59 PM Noah Misch <noah@leadboat.com> wrote:
> > Hackers are rather wise, but the variety of PostgreSQL use is enormous.  We
> > see that, among other ways, when regression fixes spike in each vN.1.  The
> > $SUBJECT feature was born in response to a user experience; a lack of hacker
> > interest doesn't invalidate that user experience.
> 
> I agree that it would be good to hear from some users about this. If a
> less painful workaround is possible at all, then users may be able to
> help -- maybe it'll be possible to cut scope.

It would be good.  But if we don't hear from users in 2021 or 2022, that
doesn't invalidate what users already said in 2016.

> > We face these competing
> > interests, at least:
> 
> > 1) Some users want the feature kept so their application can use a certain
> >    pattern of long-running, snapshot-bearing transactions.
> 
> Undoubtedly true.
> 
> > 2) (a) Some hackers want the feature gone so they can implement changes
> >    without making those changes cooperate with this feature.  (b) Bugs in this
> >    feature make such cooperation materially harder.
> 
> Is that really true? Though it was probably true back when this thread
> was started last year, things have changed. Andres found a way to work
> around the problems he had with snapshot too old, AFAIK.

When I say "some hackers", I don't mean that specific people think such
thoughts right now.  I'm saying that the expected cost of future cooperation
with the feature is nonzero, and bugs in the feature raise that cost.  Perhaps
(5) has more weight than (2).  (If (2), (3) and (5) all have little weight,
then PostgreSQL should just keep the feature with its bugs.)

> > A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
> > essentially.  What other interests are relevant?
> 
> The code simply isn't up to snuff. If the code was in a niche contrib
> module then maybe it would be okay to let this slide. But the fact is
> that it touches critical parts of the system. This cannot be allowed
> to drag on forever. It's as simple as that.

Even if we were to stipulate that this feature "isn't up to snuff", purging
PostgreSQL of substandard features may or may not add sufficient value to
compensate for (1) and (4).

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 09:48:47

On Tue, Jun 15, 2021 at 11:24 PM Noah Misch <noah@leadboat.com> wrote:
> When I say "some hackers", I don't mean that specific people think such
> thoughts right now.  I'm saying that the expected cost of future cooperation
> with the feature is nonzero, and bugs in the feature raise that cost.

I see.

> > > A hacker adopting the feature would be aiming to reduce (2)(b) to zero,
> > > essentially.  What other interests are relevant?
> >
> > The code simply isn't up to snuff. If the code was in a niche contrib
> > module then maybe it would be okay to let this slide. But the fact is
> > that it touches critical parts of the system. This cannot be allowed
> > to drag on forever. It's as simple as that.
>
> Even if we were to stipulate that this feature "isn't up to snuff", purging
> PostgreSQL of substandard features may or may not add sufficient value to
> compensate for (1) and (4).

I'm more concerned about 1 (compatibility) than about 4 (perception
that we deprecate things when we shouldn't), FWIW.

It's not that this is a substandard feature in the same way that (say)
contrib/ISN is a substandard feature -- it's not about the quality
level per se. Nor is it the absolute number of bugs. The real issue is
that this is a substandard feature that affects crucial areas of the
system. Strategically important things that we really cannot afford to
break.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Greg Stark

Дата:

16 июня 2021 г., 18:41:42

I think Andres's point earlier is the one that stands out the most for me:

> I still think that's the most reasonable course. I actually like the
> feature, but I don't think a better implementation of it would share
> much if any of the current infrastructure.

That makes me wonder whether ripping the code out early in the v15
cycle wouldn't be a better choice. It would make it easier for someone
to start work on a new implementation.

There is the risk that the code would still be out and no new
implementation would have appeared by the release of v15 but it sounds
like that's people are leaning towards ripping it out at that point
anyways.

Fwiw I too think the basic idea of the feature is actually awesome.
There are tons of use cases where you might have one long-lived
transaction working on a dedicated table (or even a schema) that will
never look at the rapidly mutating tables in another schema and never
trigger the error even though those tables have been vacuumed many
times over during its run-time.

Re: snapshot too old issues, first around wraparound and then more.

От

Tom Lane

Дата:

16 июня 2021 г., 19:00:57

Greg Stark <stark@mit.edu> writes:
> Fwiw I too think the basic idea of the feature is actually awesome.
> There are tons of use cases where you might have one long-lived
> transaction working on a dedicated table (or even a schema) that will
> never look at the rapidly mutating tables in another schema and never
> trigger the error even though those tables have been vacuumed many
> times over during its run-time.

I agree that's a great use-case.  I don't like this implementation though.
I think if you want to set things up like that, you should draw a line
between the tables it's okay for the long transaction to touch and those
it isn't, and then any access to the latter should predictably draw an
error.  I really do not like the idea that it might work anyway, because
then if you accidentally break the rule, you have an application that just
fails randomly ... probably only on the days when the boss wants that
report *now* not later.

            regards, tom lane

Re: snapshot too old issues, first around wraparound and then more.

От

Stephen Frost

Дата:

16 июня 2021 г., 19:11:45

Greetings,

* Greg Stark (stark@mit.edu) wrote:
> I think Andres's point earlier is the one that stands out the most for me:
>
> > I still think that's the most reasonable course. I actually like the
> > feature, but I don't think a better implementation of it would share
> > much if any of the current infrastructure.
>
> That makes me wonder whether ripping the code out early in the v15
> cycle wouldn't be a better choice. It would make it easier for someone
> to start work on a new implementation.
>
> There is the risk that the code would still be out and no new
> implementation would have appeared by the release of v15 but it sounds
> like that's people are leaning towards ripping it out at that point
> anyways.
>
> Fwiw I too think the basic idea of the feature is actually awesome.
> There are tons of use cases where you might have one long-lived
> transaction working on a dedicated table (or even a schema) that will
> never look at the rapidly mutating tables in another schema and never
> trigger the error even though those tables have been vacuumed many
> times over during its run-time.

I've long felt that the appropriate approach to addressing that is to
improve on VACUUM and find a way to do better than just having the
conditional of 'xmax < global min' drive if we can mark a given tuple as
no longer visible to anyone.

Not sure that all of the snapshot-too-old use cases could be solved that
way, nor am I even sure it's actually possible to make VACUUM smarter in
that way without introducing other problems or having to track much more
information than we do today, but it'd sure be nice if we could address
the use-case you outline above while also not introducing query
failures if that transaction does happen to decide to go look at some
other table (naturally, the tuples which are in that rapidly mutating
table that *would* be visible to the long-running transaction would have
to be kept around to make things work, but if it's rapidly mutating then
there's very likely lots of tuples that the long-running transaction
can't see in it, and which nothing else can either, and therefore could
be vacuumed).

Thanks,

Stephen

Вложения

signature.asc

Re: snapshot too old issues, first around wraparound and then more.

От

Tom Lane

Дата:

16 июня 2021 г., 20:04:07

Stephen Frost <sfrost@snowman.net> writes:
> I've long felt that the appropriate approach to addressing that is to
> improve on VACUUM and find a way to do better than just having the
> conditional of 'xmax < global min' drive if we can mark a given tuple as
> no longer visible to anyone.

Yeah, I think this scenario of a few transactions with old snapshots
and the rest with very new ones could be improved greatly if we exposed
more info about backends' snapshot state than just "oldest xmin".  But
that might be expensive to do.

I remember that Heikki was fooling with a patch that reduced snapshots
to LSNs.  If we got that done, it'd be practical to expose complete
info about backends' snapshot state in a lot of cases (i.e., anytime
you had less than N live snapshots).

Of course, there's still the question of how VACUUM could cheaply
apply such info to decide what could be purged.

            regards, tom lane

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 20:44:49

On Wed, Jun 16, 2021 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I remember that Heikki was fooling with a patch that reduced snapshots
> to LSNs.  If we got that done, it'd be practical to expose complete
> info about backends' snapshot state in a lot of cases (i.e., anytime
> you had less than N live snapshots).
>
> Of course, there's still the question of how VACUUM could cheaply
> apply such info to decide what could be purged.

I would think that it wouldn't really matter inside VACUUM -- it would
only really need to be either an opportunistic pruning or an
opportunistic index deletion thing -- probably both. Most of the time
VACUUM doesn't seem to end up doing most of the work of removing
garbage versions. It's mostly useful for "floating garbage", to use
the proper GC memory management term.

It's not just because opportunistic techniques are where the real work
of removing garbage is usually done these days. It's also because
opportunistic techniques are triggered in response to an immediate
problem, like an overfull heap page or an imminent page split that
we'd like to avoid -- they can actually see what's going on at the
local level in a way that doesn't really work inside VACUUM.

This also means that they can cycle through strategies for a page,
starting with the cheapest and coarsest grained cleanup, progressing
to finer grained cleanup. You only really need the finer grained
cleanup when the coarse grained cleanup (simple OldestXmin style
cutoff) fails. And even then you only need to use the slowpath when
you have a pretty good idea that it'll actually be useful -- you at
least know up front that there are a bunch of RECENTLY_DEAD tuples
that very well might be freeable once you use the slow path.

We can leave the floating garbage inside heap pages that hardly ever
get opportunistic pruning behind for VACUUM. We might even find that
an advanced strategy that does clever things in order to cleanup
intermediate versions isn't actually needed all that often (it's
executed perhaps orders of magnitude less frequently than simple
opportunistic pruning is executed) -- even when the clever technique
really helps the workload. Technically opportunistic pruning might be
99%+ effective, even when it doesn't look like it's effective to
users. The costs in this area are often very nonlinear. It can be very
counterintuitive.

--
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

16 июня 2021 г., 21:06:10

Hi,

On 2021-06-15 21:59:45 -0700, Noah Misch wrote:
> Hackers are rather wise, but the variety of PostgreSQL use is enormous.  We
> see that, among other ways, when regression fixes spike in each vN.1.  The
> $SUBJECT feature was born in response to a user experience; a lack of hacker
> interest doesn't invalidate that user experience.  We face these competing
> interests, at least:
> 
> 1) Some users want the feature kept so their application can use a certain
>    pattern of long-running, snapshot-bearing transactions.

This is obviously true. However, given that the feature practically did
not work at all before 55b7e2f4d78d8aa7b4a5eae9a0a810601d03c563 and
still cannot really be described to work (e..g index scans returning
wrong query results), and there have been two complaints about it as far
as I know leads me to believe that it does not have a great many
features.

> 2) (a) Some hackers want the feature gone so they can implement changes
>    without making those changes cooperate with this feature.  (b) Bugs in this
>    feature make such cooperation materially harder.

I think the a) part is a large problem. Primarily because it's so
unclear what one exactly has to do where (no docs/comments explaining
that) and because there's no usable test framework.

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

16 июня 2021 г., 21:27:57

Hi,

On 2021-06-16 13:04:07 -0400, Tom Lane wrote:
> Yeah, I think this scenario of a few transactions with old snapshots
> and the rest with very new ones could be improved greatly if we exposed
> more info about backends' snapshot state than just "oldest xmin".  But
> that might be expensive to do.

I think it'd be pretty doable now. The snapshot scalability changes
separated out information needed to do vacuuming / pruning (i.e. xmin)
from the information needed to build a snapshot (xid, flags, subxids
etc). Because xmin is not frequently accessed from other backends
anymore, it is not important anymore to touch it as rarely as
possible. From the cross-backend POV I think it'd be practically free to
track a backend's xmax now.

It's not quite as obvious that it'd essentially free to track a
backend's xmax across all the snapshots it uses. I think we'd basically
need a second pairingheap in snapmgr.c to track the "most advanced"
xmax? That's *probably* fine, but I'm not 100% - Heikki wrote a faster
heap implementation for snapmgr.c for a reason I assume.

I think the hard part of this would be much more on the pruning / vacuum
side of things. There's two difficulties:

1) Keeping it cheap to determine whether a tuple can be vacuumed,
   particularly while doing on-access pruning. This likely means that
   we'd only assemble the information to do visibility determination for
   rows above the "dead for everybody" horizon when encountering a
   sufficiently old tuple. And then we need a decent datastructure for
   checking whether an xid is in one of the "not needed" xid ranges.

   This seems solvable.

2) Modeling when it is safe to remove row versions. It is easy to remove
   a tuple that was inserted and deleted within one "not needed" xid
   range, but it's far less obvious when it is safe to remove row
   versions where prior/later row versions are outside of such a gap.

   Consider e.g. an update chain where the oldest snapshot can see one
   row version, then there is a chain of rows that could be vacuumed
   except for the old snapshot, and then there's a live version. If the
   old session updates the row version that is visible to it, it needs
   to be able to follow the xid chain.

   This seems hard to solve in general.

   It perhaps is sufficiently effective to remove row version chains
   entirely within one removable xid range. And it'd probably doable to
   also address the case where a chain is larger than one range, as long
   as all the relevant row versions are within one page: We can fix up
   the ctids of older still visible row versions to point to the
   successor of pruned row versions.

   But I have a hard time seeing a realistic approach to removing chains
   that span xid ranges and multiple pages. The locking and efficiency
   issues seem substantial.

Greetings,

Andres

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 21:29:36

On Wed, Jun 16, 2021 at 11:06 AM Andres Freund <andres@anarazel.de> wrote:
> > 2) (a) Some hackers want the feature gone so they can implement changes
> >    without making those changes cooperate with this feature.  (b) Bugs in this
> >    feature make such cooperation materially harder.
>
> I think the a) part is a large problem. Primarily because it's so
> unclear what one exactly has to do where (no docs/comments explaining
> that) and because there's no usable test framework.

Right. This is what I meant yesterday, when talking about design
issues. It's really not about the bugs so much. We probably could go
through them one by one until things stopped being visibly broken,
without going to a huge amount of trouble -- it's not that hard to
paper over these things without anybody noticing. This is clear just
when you look at how long it took anybody to notice the problems we do
have. Whether or not that amounts to "just fixing the bugs" is perhaps
open to interpretation. Either way I would not be comfortable with
even claiming that "fixing the bugs" in this way actually makes the
situation better overall -- it might make it even worse. So in a more
fundamental sense it would actually be really hard to fix these bugs.
I would never have confidence in a fix like that.

I really don't see a way around it -- we have to declare technical
debt bankruptcy here. Whether or not that means reverting the feature
or rewriting it from scratch remains to be seen. That's another
question entirely, and has everything to do with somebody's
willingness to adopt the project and little to do with how any
individual feels about it -- just like with a new feature. It does us
no good to put off the question indefinitely.

--
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Andres Freund

Дата:

16 июня 2021 г., 22:06:37

Hi,

On 2021-06-16 10:44:49 -0700, Peter Geoghegan wrote:
> On Wed, Jun 16, 2021 at 10:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Of course, there's still the question of how VACUUM could cheaply
> > apply such info to decide what could be purged.

> I would think that it wouldn't really matter inside VACUUM -- it would
> only really need to be either an opportunistic pruning or an
> opportunistic index deletion thing -- probably both. Most of the time
> VACUUM doesn't seem to end up doing most of the work of removing
> garbage versions. It's mostly useful for "floating garbage", to use
> the proper GC memory management term.

I don't fully agree with this. For one, there are workloads where VACUUM
removes the bulk of the dead tuples. For another, slowing down VACUUM
can cause a slew of follow-on problems, so being careful to not
introduce new bottlenecks is important. And I don't think just doing
this optimization as part of on-access pruning is reasonable
solution. And it's not like making on-access pruning slower is
unproblematic either.

But as I said nearby, I think the hardest part is figuring out how to
deal with ctid chains, not the efficiency of the xid->visibility lookup
(or the collection of data necessary for that lookup).

Greetings,

Andres Freund

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 22:08:18

On Wed, Jun 16, 2021 at 11:27 AM Andres Freund <andres@anarazel.de> wrote:
> 2) Modeling when it is safe to remove row versions. It is easy to remove
>    a tuple that was inserted and deleted within one "not needed" xid
>    range, but it's far less obvious when it is safe to remove row
>    versions where prior/later row versions are outside of such a gap.
>
>    Consider e.g. an update chain where the oldest snapshot can see one
>    row version, then there is a chain of rows that could be vacuumed
>    except for the old snapshot, and then there's a live version. If the
>    old session updates the row version that is visible to it, it needs
>    to be able to follow the xid chain.
>
>    This seems hard to solve in general.

As I've said to you before, I think that it would make sense to solve
the problem inside heap_index_delete_tuples() first (for index tuple
deletion) -- implement and advanced version for heap pruning later.
That gives users a significant benefit without requiring that you
solve this hard problem with xmin/xmax and update chains.

I don't think that it matters that index AMs still only have LP_DEAD
bits set when tuples are dead to all snapshots including the oldest.
Now that we can batch TIDs within each call to
heap_index_delete_tuples() to pick up "extra" deletable TIDs from the
same heap blocks, we'll often be able to delete a significant number
of extra index tuples whose TIDs are in a "not needed" range. Whereas
today, without the "not needed" range mechanism in place, we just
delete the index tuples that are LP_DEAD-set already, plus maybe a few
others ("extra index tuples" that are not even needed by the oldest
snapshot) -- but that's it.

We might miss our chance to ever delete the nearby index tuples
forever, just because we missed the opportunity once. Recall that the
LP_DEAD bit being set for an index tuple isn't just information about
the index tuple in Postgres 14+ -- it also suggests that the *heap
block* has many more index tuples that we can delete that aren't
LP_DEAD set in the index. And so nbtree will check those extra nearby
TIDs out in passing within heap_index_delete_tuples(). We currently
lose this valuable hint about the heap block forever if we delete the
LP_DEAD-set index tuples, unless we get lucky and somebody sets a few
more index tuples for the same heap blocks before the next time the
leaf page fills up (and heap_index_delete_tuples() must be called).

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Peter Geoghegan

Дата:

16 июня 2021 г., 22:18:20

On Wed, Jun 16, 2021 at 12:06 PM Andres Freund <andres@anarazel.de> wrote:
> > I would think that it wouldn't really matter inside VACUUM -- it would
> > only really need to be either an opportunistic pruning or an
> > opportunistic index deletion thing -- probably both. Most of the time
> > VACUUM doesn't seem to end up doing most of the work of removing
> > garbage versions. It's mostly useful for "floating garbage", to use
> > the proper GC memory management term.
>
> I don't fully agree with this. For one, there are workloads where VACUUM
> removes the bulk of the dead tuples.

It's definitely much more important that VACUUM run often when non-HOT
updates are the norm, and there are lots of them. But that's probably
not going to be helped all that much by this technique anyway.

Mostly I'm just saying I'd start elsewhere and do heapam later. And
probably do VACUUM itself last of all, if that usefully cut scope.

> For another, slowing down VACUUM
> can cause a slew of follow-on problems, so being careful to not
> introduce new bottlenecks is important. And I don't think just doing
> this optimization as part of on-access pruning is reasonable
> solution. And it's not like making on-access pruning slower is
> unproblematic either.

I think that on-access pruning is much more important because it's the
only hope we have of keeping the original heap page intact, in the
sense that there are no non-HOT updates over time, though there may be
many HOT updates. And no LP_DEAD items ever accumulate. It's not so
much about cleaning up bloat as it is about *preserving* the heap
pages in this sense.

If in the long run it's impossible to keep the page intact in this
sense then we will still have most of our current problems. It might
not make that much practical difference if we simply delay the problem
-- we kinda have to prevent it entirely, at least for a given
workload. So I'm mostly concerned about keeping things stable over
time, at the level of individual pages.

> But as I said nearby, I think the hardest part is figuring out how to
> deal with ctid chains, not the efficiency of the xid->visibility lookup
> (or the collection of data necessary for that lookup).

Definitely true.

-- 
Peter Geoghegan

Re: snapshot too old issues, first around wraparound and then more.

От

Stephen Frost

Дата:

17 июня 2021 г., 18:31:30

Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Wed, Jun 16, 2021 at 12:06 PM Andres Freund <andres@anarazel.de> wrote:
> > > I would think that it wouldn't really matter inside VACUUM -- it would
> > > only really need to be either an opportunistic pruning or an
> > > opportunistic index deletion thing -- probably both. Most of the time
> > > VACUUM doesn't seem to end up doing most of the work of removing
> > > garbage versions. It's mostly useful for "floating garbage", to use
> > > the proper GC memory management term.
> >
> > I don't fully agree with this. For one, there are workloads where VACUUM
> > removes the bulk of the dead tuples.
>
> It's definitely much more important that VACUUM run often when non-HOT
> updates are the norm, and there are lots of them. But that's probably
> not going to be helped all that much by this technique anyway.

I don't follow this argument.  Surely there are many, many cases out
there where there's very few HOT updates but lots of non-HOT updates
which create lots of dead rows that can't currently be cleaned up if
there's a long running transaction hanging around.

> Mostly I'm just saying I'd start elsewhere and do heapam later. And
> probably do VACUUM itself last of all, if that usefully cut scope.

Not quite following what 'elsewhere' means here or what it would entail
if it involves cleaning up dead tuples but doesn't involve heapam.  I
can sort of follow the idea of working on the routine page-level cleanup
of tuples rather than VACUUM, except that would seem to require one to
deal with the complexities of ctid chains discussed below and therefore
be a larger and more complicated effort than if one were to tackle
VACUUM and perhaps in the first round cut scope by explicitly ignoring
ctid chains.

> > For another, slowing down VACUUM
> > can cause a slew of follow-on problems, so being careful to not
> > introduce new bottlenecks is important. And I don't think just doing
> > this optimization as part of on-access pruning is reasonable
> > solution. And it's not like making on-access pruning slower is
> > unproblematic either.

I don't know that slowing down VACUUM, which already goes purposefully
slow by default when run out of autovacuum, needs to really be stressed
over, particularly when what we're talking about here are CPU cycles.  I
do think it'd make sense to have a heuristic which decides if we're
going to put in the effort to try to do this kind of pruning.  That is-
if the global Xmin and the current transaction are only a few thousand
apart or something along those lines then don't bother, but if there's
been 100s of thousands of transactions then enable it (perhaps allowing
control over this or allowing users to explicitly ask VACUUM to 'work
harder' or such).

> I think that on-access pruning is much more important because it's the
> only hope we have of keeping the original heap page intact, in the
> sense that there are no non-HOT updates over time, though there may be
> many HOT updates. And no LP_DEAD items ever accumulate. It's not so
> much about cleaning up bloat as it is about *preserving* the heap
> pages in this sense.
>
> If in the long run it's impossible to keep the page intact in this
> sense then we will still have most of our current problems. It might
> not make that much practical difference if we simply delay the problem
> -- we kinda have to prevent it entirely, at least for a given
> workload. So I'm mostly concerned about keeping things stable over
> time, at the level of individual pages.

I do think that's a worthwhile goal, but if we could get some kind of
cleanup happening, that strikes me as better than the nothing that we
have today.  Which side makes sense to tackle first is certainly a
discussion that could be had but I'd go for "do the simple thing first".

> > But as I said nearby, I think the hardest part is figuring out how to
> > deal with ctid chains, not the efficiency of the xid->visibility lookup
> > (or the collection of data necessary for that lookup).
>
> Definitely true.

It strikes me that stressing over ctid chains, while certainly something
to consider, at this point is putting the cart before the horse in this
discussion- there's not much sense in it if we haven't actually got the
data collection piece figured out and working (and hopefully in a manner
that minimizes the overhead from it) and then worked out the logic to
figure out if a given tuple is actually visible to any running
transaction.  As I say above, it seems like it'd be a great win even if
it was initially only able to deal with 'routine'/non-chained cases and
only with VACUUM.

The kind of queue tables that I'm thinking of, at least, are ones like
what PgQ uses: https://github.com/pgq/pgq

Now, that already works around our lacking here by using TRUNCATE and
table rotation, but if we improved here then it'd potentially be able to
be rewritten to use routine DELETE's instead of TRUNCATE.  Even the
UPDATEs which are done to process a batch for a subscriber look to be
non-HOT due to updating indexed fields anyway (in
pgq.next_batch_custom(), it's setting subscription.sub_batch which has a
UNIQUE btree on it).  Looks like there's a HOT UPDATE for the queue
table when a table swap happens, but that UPDATE wouldn't actually be
necessary if we'd fix the issue with just routine INSERT/DELETE leading
to tons of dead tuples that can't be VACUUM'd if a long running
transaction is running, and I doubt that UPDATE was actually
intentionally designed to take advantage of HOT, it just happened to
work that way.

The gist of what I'm trying to get at here is that the use-cases I've
seen, and where people have put in the effort to work around the long
running transaction vs. VACUUM issue by using hacks like table swapping
and TRUNCATE, aren't cases where there's a lot of HOT updating happening
on the tables that are getting bloated due to VACUUM being unable to
clean up tuples.  So, if that's actually the easier thing to tackle,
fantastic, let's do it and then figure out how to improve on it to
handle the more complicated cases later.  (This presumes that it's
actually possible to essentially 'skip' the hard cases and still have a
working implementation, of course).

Thanks,

Stephen

Вложения

signature.asc

Re: snapshot too old issues, first around wraparound and then more.

От

Noah Misch

Дата:

18 июня 2021 г., 06:49:31

On Wed, Jun 16, 2021 at 12:00:57PM -0400, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > I think Andres's point earlier is the one that stands out the most for me:
> > 
> > > I still think that's the most reasonable course. I actually like the
> > > feature, but I don't think a better implementation of it would share
> > > much if any of the current infrastructure.
> > 
> > That makes me wonder whether ripping the code out early in the v15
> > cycle wouldn't be a better choice. It would make it easier for someone
> > to start work on a new implementation.

Deleting the feature early is better than deleting the feature late,
certainly.  (That doesn't tell us about the relative utility of deleting the
feature early versus never deleting the feature.)

> > Fwiw I too think the basic idea of the feature is actually awesome.
> > There are tons of use cases where you might have one long-lived
> > transaction working on a dedicated table (or even a schema) that will
> > never look at the rapidly mutating tables in another schema and never
> > trigger the error even though those tables have been vacuumed many
> > times over during its run-time.
> 
> I agree that's a great use-case.  I don't like this implementation though.
> I think if you want to set things up like that, you should draw a line
> between the tables it's okay for the long transaction to touch and those
> it isn't, and then any access to the latter should predictably draw an
> error.

I agree that would be a useful capability, but it solves a different problem.

> I really do not like the idea that it might work anyway, because
> then if you accidentally break the rule, you have an application that just
> fails randomly ... probably only on the days when the boss wants that
> report *now* not later.

Every site adopting SERIALIZABLE learns that transactions can fail due to
mostly-unrelated concurrent activity.  ERRCODE_SNAPSHOT_TOO_OLD is another
kind of serialization failure, essentially.  Moreover, one can opt for an
old_snapshot_threshold value longer than the runtime of the boss's favorite
report.  Of course, nobody would reject a replacement that has all the
advantages of old_snapshot_threshold and fewer transaction failures.  Once
your feature rewrite starts taking away advantages to achieve fewer
transaction failures, that rewrite gets a lot more speculative.

nm

Re: snapshot too old issues, first around wraparound and then more.

От

Thomas Munro

Дата:

18 июня 2021 г., 07:52:43

On Wed, Apr 15, 2020 at 2:21 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Mon, Apr 13, 2020 at 2:58 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Fri, Apr 3, 2020 at 2:22 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > > I'm thinking of a version of "snapshot too old" that amounts to a
> > > statement timeout that gets applied for xmin horizon type purposes in
> > > the conventional way, while only showing an error to the client if and
> > > when they access literally any buffer (though not when the relation is
> > > a system catalog). Is it possible that something along those lines is
> > > appreciably better than nothing to users? If it is, and if we can find
> > > a way to manage the transition, then maybe we could tolerate
> > > supporting this greatly simplified implementation of "snapshot too
> > > old".

I rebased that patch and fleshed it out just a bit more.  Warning:
experiment grade, incomplet, inkorrect, but I think it demonstrates
the main elements of Peter's idea from last year.

For READ COMMITTED, the user experience is much like a statement
timeout, except that it can keep doing stuff that doesn't read
non-catalog data.  Trivial example: pg_sleep(10) happily completes
with old_snapshot_threshold=5s, as do queries that materialise all
their input data in time, and yet your xmin is zapped.  For REPEATABLE
READ it's obviously tied to your first query, and produces "snapshot
too old" if you repeatedly SELECT from a little table and your time
runs out.

In this version I put the check into the heapam visibility + vismap
checks, instead of in the buffer access code.  The reason is that the
lower level buffer access routines don't have a snapshot, but if you
push the check down to buffer access and just check the "oldest"
snapshot (definition in this patch, not in master), then you lose some
potential granularity with different cursors.  If you try to put it at
a higher level in places that have a snapshot and access a buffer, you
run into the problem of being uncertain that you've covered all the
bases.  But I may be underthinking that.

Quite a few unresolved questions about catalog and toast snapshots and
other special stuff, as well as the question of whether it's actually
useful or the overheads can be made cheap enough.

> Hmm.  I suppose it must be possible to put the LSN check back: if
> (snapshot->too_old && PageGetLSN(page) > snapshot->lsn) ereport(...).
> Then the granularity would be the same as today -- block level -- but
> the complexity is transferred from the pruning side (has to deal with
> xid time map) to the snapshot-owning side (has to deal with timers,
> CFI() and make sure all snapshots are registered).  Maybe not a great
> deal, and maybe not easier than fixing the existing bugs.

It is a shame to lose the current LSN-based logic; it's really rather
clever (except for being broken).

> One problem is all the new setitimer() syscalls.  I feel like that
> could be improved, as could statement_timeout, by letting existing
> timers run rather than repeatedly rescheduling eagerly, so that eg a 1
> minute timeout never gets rescheduled more than once per minute.  I
> haven't looked into that, but I guess it's no worse than the existing
> implement's overheads anyway.

At least that problem was fixed, by commit 09cf1d52.  (Not entirely
sure why I got away with not reenabling the timer between queries, but
I didn't look very hard).

Вложения

v2-0001-Implement-snapshot_too_old-using-a-timer.patch

Re: snapshot too old issues, first around wraparound and then more.

От

Greg Stark

Дата:

23 июня 2021 г., 08:21:00

On Thu, 17 Jun 2021 at 23:49, Noah Misch <noah@leadboat.com> wrote:
>
> On Wed, Jun 16, 2021 at 12:00:57PM -0400, Tom Lane wrote:
> > I agree that's a great use-case.  I don't like this implementation though.
> > I think if you want to set things up like that, you should draw a line
> > between the tables it's okay for the long transaction to touch and those
> > it isn't, and then any access to the latter should predictably draw an
> > error.

> I agree that would be a useful capability, but it solves a different problem.

Yeah, I think this discussion veered off into how to improve vacuum
snapshot tracking. That's an worthwhile endeavour but it doesn't
really address the use case this patch was there to target.

Fundamentally there's no way in SQL for users to give this information
to Postgres. There's nothing in SQL or our API that lets a client
inform Postgres what tables a session is going to access within a
transaction in the future.

What this alternative would look like would be a command that a client
would have to issue at the start of every transaction listing every
table that transaction will be allowed to touch. Any attempt to read
from any other table during the transaction would then get an error.

That sounds like it would be neat but it wouldn't work great with the
general approach in Postgres of having internal functions accessing
relations on demand (think of catalog tables, toast tables, and
pg_proc functions).

The "snapshot too old" approach is much more in line with Postgres's
general approach of giving users a general purpose platform and then
dealing gracefully with the consequences.

-- 
greg

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: snapshot too old issues, first around wraparound and then more.

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения

Вложения