Discussion: Re: pgsql: Compute XID horizon for page level index vacuum on primary.


Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Simon Riggs
Date:
On Wed, 27 Mar 2019 at 00:06, Andres Freund <andres@anarazel.de> wrote:
Compute XID horizon for page level index vacuum on primary.

Previously the xid horizon was only computed during WAL replay.

This commit message was quite confusing. It took me a while to realize this relates to btree index deletes and that what you mean is that we are calculating the latestRemovedXid for index entries. That is related to, but not the same thing as, the horizon itself. Now I understand the "was computed only during WAL replay" part, since it seemed obvious that the xmin horizon was calculated regularly on the master, but as you say the latestRemovedXid was not.

Now that I understand, I'm happy that you've moved this from redo into mainline. And you've optimized it, which is also important (since performance was the original objection and why it was placed in redo). I can see you've removed duplicate code in hash indexes as well, which is good.

The term "xid horizon" was only used once in the code in PG11. That usage appears to be a typo, since in many other places the term "xmin horizon" is used to mean the point at which we can finally mark tuples as dead. Now we have some new, undocumented APIs that use the term "xid horizon" yet still code that refers to "xmin horizon", with neither term being defined. I'm hoping you'll do some later cleanup of that to avoid confusion.

While trying to understand this, I see there is an even better way to optimize this. Since we are removing dead index tuples, we could alter the killed index tuple interface so that it returns the xmax of the tuple being marked as killed, rather than just a boolean to say it is dead. Indexes can then mark the killed tuples with the xmax that killed them rather than just a hint bit. This is possible since the index tuples are dead and cannot be used to follow the htid to the heap, so the htid is redundant and the block number of the tid could be overwritten with the xmax, zeroing the itemid. Each killed item we mark with its xmax means one less heap fetch we need to perform when we delete the page - it's possible we could optimize that away completely by doing this.

Since this point of the code is clearly going to be a performance issue it seems like something we should do now.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Andres Freund
Date:
Hi,

On 2019-03-29 09:37:11 +0000, Simon Riggs wrote:
> This commit message was quite confusing. It took me a while to realize this
> relates to btree index deletes and that what you mean is that we are
> calculating the latestRemovedXid for index entries. That is related to but
> not the same thing as the horizon itself.

Well, it's the page level horizon...


> While trying to understand this, I see there is an even better way to
> optimize this. Since we are removing dead index tuples, we could alter the
> killed index tuple interface so that it returns the xmax of the tuple being
> marked as killed, rather than just a boolean to say it is dead.

Wouldn't that quite possibly result in additional and unnecessary
conflicts? Right now the page level horizon is computed whenever the
page is actually reused, rather than when an item is marked as
deleted. As it stands right now, the computed horizons are commonly very
"old", because of that delay, leading to lower rates of conflicts.


> Indexes can then mark the killed tuples with the xmax that killed them
> rather than just a hint bit. This is possible since the index tuples
> are dead and cannot be used to follow the htid to the heap, so the
> htid is redundant and so the block number of the tid could be
> overwritten with the xmax, zeroing the itemid. Each killed item we
> mark with its xmax means one less heap fetch we need to perform when
> we delete the page - it's possible we optimize that away completely by
> doing this.

That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.

And I'm also doubtful it's worth it because:

> Since this point of the code is clearly going to be a performance issue it
> seems like something we should do now.

I've tried quite a bit to find a workload where this matters, but after
avoiding redundant buffer accesses by sorting, and prefetching I was
unable to do so.  What workload do you see where this would really be
bad? Without the performance optimization I'd found a very minor
regression by trying to force the heap visits to happen in a pretty
random order, but after sorting that went away.  I'm sure it's possible
to find a case on overloaded rotational disks where you'd find a small
regression, but I don't think it'd be particularly bad.

Greetings,

Andres Freund



Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Simon Riggs
Date:
On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
 
On 2019-03-29 09:37:11 +0000, Simon Riggs wrote:
 
> While trying to understand this, I see there is an even better way to
> optimize this. Since we are removing dead index tuples, we could alter the
> killed index tuple interface so that it returns the xmax of the tuple being
> marked as killed, rather than just a boolean to say it is dead.

Wouldn't that quite possibly result in additional and unnecessary
conflicts? Right now the page level horizon is computed whenever the
page is actually reused, rather than when an item is marked as
deleted. As it stands right now, the computed horizons are commonly very
"old", because of that delay, leading to lower rates of conflicts.

I wasn't suggesting we change when the horizon is calculated, so no change there.

The idea was to cache the data for later use, replacing the hint bit with a hint xid.

That won't change the rate of conflicts, up or down - but it does avoid I/O.
 
> Indexes can then mark the killed tuples with the xmax that killed them
> rather than just a hint bit. This is possible since the index tuples
> are dead and cannot be used to follow the htid to the heap, so the
> htid is redundant and so the block number of the tid could be
> overwritten with the xmax, zeroing the itemid. Each killed item we
> mark with its xmax means one less heap fetch we need to perform when
> we delete the page - it's possible we optimize that away completely by
> doing this.

That's far from a trivial feature imo. It seems quite possible that we'd
end up with increased overhead, because the current logic can get away
with only doing hint bit style writes - but would that be true if we
started actually replacing the item pointers? Because I don't see any
guarantee they couldn't cross a page boundary etc? So I think we'd need
to do WAL logging during index searches, which seems prohibitively
expensive.

Don't see that.

I was talking about reusing the first 4 bytes of an index tuple's ItemPointerData,
which is its first field. Index tuples are MAXALIGNed, so I can't see how that would ever cross a page boundary.
 
And I'm also doubtful it's worth it because:

> Since this point of the code is clearly going to be a performance issue it
> seems like something we should do now.

I've tried quite a bit to find a workload where this matters, but after
avoiding redundant buffer accesses by sorting, and prefetching I was
unable to do so.  What workload do you see where this would really be
bad? Without the performance optimization I'd found a very minor
regression by trying to force the heap visits to happen in a pretty
random order, but after sorting that went away.  I'm sure it's possible
to find a case on overloaded rotational disks where you'd find a small
regression, but I don't think it'd be particularly bad.

The code can do literally hundreds of random I/Os with an 8192-byte block size. What happens with 16 or 32kB?

"Small regression" ?

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Andres Freund
Date:
Hi,

On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
> On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
> > That's far from a trivial feature imo. It seems quite possible that we'd
> > end up with increased overhead, because the current logic can get away
> > with only doing hint bit style writes - but would that be true if we
> > started actually replacing the item pointers? Because I don't see any
> > guarantee they couldn't cross a page boundary etc? So I think we'd need
> > to do WAL logging during index searches, which seems prohibitively
> > expensive.
> >
> 
> Don't see that.
> 
> I was talking about reusing the first 4 bytes of an index tuple's
> ItemPointerData,
> which is the first field of an index tuple. Index tuples are MAXALIGNed, so
> I can't see how that would ever cross a page boundary.

They're 8 bytes, and MAXALIGN often is 4 bytes:

struct ItemPointerData {
        BlockIdData                ip_blkid;             /*     0     4 */
        OffsetNumber               ip_posid;             /*     4     2 */

        /* size: 6, cachelines: 1, members: 2 */
        /* last cacheline: 6 bytes */
};

struct IndexTupleData {
        ItemPointerData            t_tid;                /*     0     6 */
        short unsigned int         t_info;               /*     6     2 */

        /* size: 8, cachelines: 1, members: 2 */
        /* last cacheline: 8 bytes */
};

So as a whole they definitely can cross sector boundaries. You might be
able to argue your way out of that by saying that the blkid is going to
be aligned, but that's not that trivial, as t_info isn't guaranteed
that.

But even so, you can't have unlogged changes that you then rely on. Even
if there's no torn page issue. Currently BTP_HAS_GARBAGE and
ItemIdMarkDead() are treated as hints - if we want to guarantee all
these are accurate, I don't quite see how we'd get around WAL logging
those.


> > And I'm also doubtful it's worth it because:
> >
> > > Since this point of the code is clearly going to be a performance issue
> > it
> > > seems like something we should do now.
> >
> > I've tried quite a bit to find a workload where this matters, but after
> > avoiding redundant buffer accesses by sorting, and prefetching I was
> > unable to do so.  What workload do you see where this would really be
> > bad? Without the performance optimization I'd found a very minor
> > regression by trying to force the heap visits to happen in a pretty
> > random order, but after sorting that went away.  I'm sure it's possible
> > to find a case on overloaded rotational disks where you'd find a small
> > regression, but I don't think it'd be particularly bad.

> The code can do literally hundreds of random I/Os with an 8192-byte
> block size. What happens with 16 or 32kB?

It's really hard to construct such cases after the sorting changes, but
obviously not impossible. But to make it actually painful you need a
workload where the implied randomness of accesses isn't already a major
bottleneck - and that's hard.

This has been discussed publicly for a few months...

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Simon Riggs
Date:
On Fri, 29 Mar 2019 at 16:12, Andres Freund <andres@anarazel.de> wrote:
 
On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
> On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
> > That's far from a trivial feature imo. It seems quite possible that we'd
> > end up with increased overhead, because the current logic can get away
> > with only doing hint bit style writes - but would that be true if we
> > started actually replacing the item pointers? Because I don't see any
> > guarantee they couldn't cross a page boundary etc? So I think we'd need
> > to do WAL logging during index searches, which seems prohibitively
> > expensive.
> >
>
> Don't see that.
>
> I was talking about reusing the first 4 bytes of an index tuple's
> ItemPointerData,
> which is the first field of an index tuple. Index tuples are MAXALIGNed, so
> I can't see how that would ever cross a page boundary.

They're 8 bytes, and MAXALIGN often is 4 bytes:

xids are 4 bytes, so we're good. 

If MAXALIGN could ever be 2 bytes, we'd have a problem.

So as a whole they definitely can cross sector boundaries. You might be
able to argue your way out of that by saying that the blkid is going to
be aligned, but that's not that trivial, as t_info isn't guaranteed
that.

But even so, you can't have unlogged changes that you then rely on. Even
if there's no torn page issue. Currently BTP_HAS_GARBAGE and
ItemIdMarkDead() are treated as hints - if we want to guarantee all
these are accurate, I don't quite see how we'd get around WAL logging
those.

You can have unlogged changes that you rely on - that is exactly how hints work.

If the hint is lost, we do the I/O. Worst case it would be the same as what you have now.

I'm talking about saving many I/Os - this doesn't need to provably avoid all I/Os to work; it's an incremental benefit all the way.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Peter Geoghegan
Date:
 On Fri, Mar 29, 2019 at 9:12 AM Andres Freund <andres@anarazel.de> wrote:
> But even so, you can't have unlogged changes that you then rely on. Even
> if there's no torn page issue. Currently BTP_HAS_GARBAGE and
> ItemIdMarkDead() are treated as hints - if we want to guarantee all
> these are accurate, I don't quite see how we'd get around WAL logging
> those.

It might be possible to WAL-log the _bt_check_unique() item killing.
That seems to be much more effective than the similar and better known
kill_prior_tuple optimization in practice. I don't think that that
should be in scope for v12, though. I for one am satisfied with your
explanation.

> > The code can do literally hundreds of random I/Os in an 8192 blocksize.
> > What happens with 16 or 32kB?
>
> It's really hard to construct such cases after the sorting changes, but
> obviously not impossible. But to make it actually painful you need a
> workload where the implied randomness of accesses isn't already a major
> bottleneck - and that's hard.

There is also the fact that in many cases you'll just have accessed
the same buffers from within _bt_check_unique() anyway.

-- 
Peter Geoghegan




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Andres Freund
Date:
On 2019-03-29 16:20:54 +0000, Simon Riggs wrote:
> On Fri, 29 Mar 2019 at 16:12, Andres Freund <andres@anarazel.de> wrote:
> 
> 
> > On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
> > > On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
> > > > That's far from a trivial feature imo. It seems quite possible that
> > we'd
> > > > end up with increased overhead, because the current logic can get away
> > > > with only doing hint bit style writes - but would that be true if we
> > > > started actually replacing the item pointers? Because I don't see any
> > > > guarantee they couldn't cross a page boundary etc? So I think we'd need
> > > > to do WAL logging during index searches, which seems prohibitively
> > > > expensive.
> > > >
> > >
> > > Don't see that.
> > >
> > > I was talking about reusing the first 4 bytes of an index tuple's
> > > ItemPointerData,
> > > which is the first field of an index tuple. Index tuples are MAXALIGNed,
> > so
> > > I can't see how that would ever cross a page boundary.
> >
> > They're 8 bytes, and MAXALIGN often is 4 bytes:
> >
> 
> xids are 4 bytes, so we're good.

I literally went on to explain why that's not sufficient? You can't
*just* replace the block number with an xid. You *also* need to set a
flag denoting that it's an xid and dead now. Which can't fit in the same
4 bytes.  You either have to set it in the IndexTuple's t_info, or in
the ItemIdData's lp_flags. And both can be on different sectors.  If
the flag persists and the xid doesn't, you're going to interpret a block
number as an xid - not great; but even worse, if the xid survives but
the flag doesn't, you're going to access the xid as a block - definitely
not ok, because you're going to return wrong results.

Greetings,

Andres Freund




Re: pgsql: Compute XID horizon for page level index vacuum on primary.

From:
Simon Riggs
Date:
On Fri, 29 Mar 2019 at 16:32, Andres Freund <andres@anarazel.de> wrote:
On 2019-03-29 16:20:54 +0000, Simon Riggs wrote:
> On Fri, 29 Mar 2019 at 16:12, Andres Freund <andres@anarazel.de> wrote:
>
>
> > On 2019-03-29 15:58:14 +0000, Simon Riggs wrote:
> > > On Fri, 29 Mar 2019 at 15:29, Andres Freund <andres@anarazel.de> wrote:
> > > > That's far from a trivial feature imo. It seems quite possible that
> > we'd
> > > > end up with increased overhead, because the current logic can get away
> > > > with only doing hint bit style writes - but would that be true if we
> > > > started actually replacing the item pointers? Because I don't see any
> > > > guarantee they couldn't cross a page boundary etc? So I think we'd need
> > > > to do WAL logging during index searches, which seems prohibitively
> > > > expensive.
> > > >
> > >
> > > Don't see that.
> > >
> > > I was talking about reusing the first 4 bytes of an index tuple's
> > > ItemPointerData,
> > > which is the first field of an index tuple. Index tuples are MAXALIGNed,
> > so
> > > I can't see how that would ever cross a page boundary.
> >
> > They're 8 bytes, and MAXALIGN often is 4 bytes:
> >
>
> xids are 4 bytes, so we're good.

I literally went on to explain why that's not sufficient? You can't
*just* replace the block number with an xid. You *also* need to set a
flag denoting that it's an xid and dead now. Which can't fit in the same
4 bytes.  You either have to set it in the IndexTuple's t_info, or in
the ItemIdData's lp_flags. And both can be in different sectors.  If
the flag persists but the xid doesn't, you're going to interpret a block
number as an xid - not great; but even worse, if the xid survives but
the flag doesn't, you're going to access the xid as a block number -
definitely not ok, because you're going to return wrong results.

Yes, I agree; I was thinking the same thing after my last comment, but was afk. The idea requires an atomic update of at least 4 bytes plus at least 1 bit, and so would require at least 8-byte MAXALIGN to be useful. Your other point, that multiple things would all need to be set accurately for this to work, isn't correct though: the idea was that we would write a hint that avoids later I/O, so the overall idea is still viable.

Anyway, thinking some more, I think the whole idea of generating latestRemovedXid is moot, and there are better ways of doing this in the future that would avoid the performance issue altogether. Clearly not PG12.

The main issue relates to the potential overhead of moving this to the master. I agree it's the right thing to do, but we should have some way of checking that it is not a performance issue in practice.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
