Discussion: Avoiding second heap scan in VACUUM
Tom brought this up during the PGCon developer meeting. After thinking about it, I believe it's actually possible to avoid the second heap scan, especially now that we have HOT. If we can remove the second pass, that would not only speed up vacuum but also eliminate a lot of redundant read and write IO.

Currently the second heap scan is required to remove the dead tuples from the heap. We cannot do this in the first scan because we haven't yet removed the index pointers pointing to them. HOT now prunes and defrags the pages in the first phase itself, and what is left behind is just a bunch of DEAD line pointers. These line pointers are marked UNUSED in the second heap scan. Since we don't repair any line pointer bloat, no additional free space is created in the second pass. So frankly there is not much left to be done in the second phase. Of course, we also update the FSM information at the end of the second pass.

If we want to remove the second pass, what we need is a mechanism to reclaim the DEAD line pointers. But to do this correctly, we must ensure that the DEAD line pointers are reclaimed only after the index entries pointing to them have been removed. Tom's idea was to store the vacuum-xid in the tuple header and check that xid to see whether the vacuum successfully removed the index pointers. Heikki had a brilliant idea to store the xid in the line pointer itself. These ideas are good, but would require xid wraparound handling.

I am thinking of a solution along the following lines to handle DEAD line pointers. Other ideas are welcome too.

1. Before VACUUM starts, it updates the pg_class row of the target table, noting VACUUM_IN_PROGRESS for the target table.

2. It then waits for all the existing transactions to finish, to make sure that everyone can see the change in the pg_class row.

3. It then scans the heap, prunes and defrags the pages. The normal pruning would reclaim all the dead tuples and mark their line pointers as DEAD. Since VACUUM is going to remove the index pointers pointing to these DEAD line pointers, it now marks these DEAD line pointers with an additional flag, say DEAD_RECLAIMED.

4. At the end of the first scan, VACUUM updates FSM information for the heap pages.

5. It then proceeds with the index scan and removes index pointers pointing to the DEAD line pointers collected in the heap scan.

6. Finally, it again updates the pg_class row and clears the VACUUM_IN_PROGRESS flag.

Any other backend, when it invokes page pruning, would check whether a VACUUM is in progress by looking at the VACUUM_IN_PROGRESS flag. Note that if the previous vacuum failed or the database crashed before the vacuum completed, the VACUUM_IN_PROGRESS flag would remain set until the next vacuum successfully completes on the table and resets the flag (VACUUM_NOT_IN_PROGRESS state).

Since vacuum waits for the existing transactions to finish before marking any DEAD line pointers DEAD_RECLAIMED, for a backend which sees VACUUM_NOT_IN_PROGRESS, any DEAD_RECLAIMED line pointer it finds must be left over from a previously, successfully completed vacuum. Since that vacuum must have removed the index pointers pointing to it, the backend can now safely reclaim the line pointer itself. The backend could potentially do this any time it sees a DEAD_RECLAIMED line pointer, but we may restrict it to pruning activity to keep things simple. This operation need not be WAL logged if we appropriately handle DEAD_RECLAIMED line pointers during redo recovery (if one is reused for some other insert/update activity).

I think this scheme guarantees that a backend would always see VACUUM_IN_PROGRESS if a vacuum is currently in progress on the table or the last vacuum failed. There might be situations where a backend sees VACUUM_IN_PROGRESS when in fact no vacuum is in progress and the last vacuum finished successfully, but that has no correctness implication; it would only delay reclaiming DEAD_RECLAIMED line pointers.

Comments ?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
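[The six-step scheme above can be sketched as a toy model. This is purely illustrative Python, not PostgreSQL code; the flag and state names follow the proposal, everything else — the Table class, the crash parameter — is invented for the sketch.]

```python
# Toy simulation of the proposed DEAD_RECLAIMED protocol.  Shows why a
# backend that sees VACUUM_NOT_IN_PROGRESS may safely reclaim a
# DEAD_RECLAIMED pointer, even if an earlier vacuum crashed midway.

DEAD, DEAD_RECLAIMED, UNUSED, NORMAL = "DEAD", "DEAD_RECLAIMED", "UNUSED", "NORMAL"

class Table:
    def __init__(self, line_pointers):
        self.lp = list(line_pointers)       # heap line pointer states
        self.vacuum_in_progress = False     # the pg_class flag of the proposal

def vacuum(table, crash_before_index_scan=False):
    # Steps 1-2: set the flag (the wait for old transactions is elided).
    table.vacuum_in_progress = True
    # Step 3: first heap scan marks DEAD pointers DEAD_RECLAIMED.
    table.lp = [DEAD_RECLAIMED if s == DEAD else s for s in table.lp]
    if crash_before_index_scan:
        return                              # crash: the flag stays set
    # Step 5: index pointers removed (not modeled here).
    # Step 6: clear the flag only after the index scan completed.
    table.vacuum_in_progress = False

def backend_prune(table):
    """Opportunistic pruning by a regular backend."""
    if table.vacuum_in_progress:
        return                              # cannot trust DEAD_RECLAIMED yet
    # Safe: index pointers to DEAD_RECLAIMED slots are provably gone.
    table.lp = [UNUSED if s == DEAD_RECLAIMED else s for s in table.lp]

t = Table([DEAD, NORMAL, DEAD])
vacuum(t, crash_before_index_scan=True)
backend_prune(t)
assert t.lp == [DEAD_RECLAIMED, NORMAL, DEAD_RECLAIMED]  # nothing reclaimed

vacuum(t)                                   # a later vacuum completes
backend_prune(t)
assert t.lp == [UNUSED, NORMAL, UNUSED]     # now safe to reclaim
```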
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

> 1. Before VACUUM starts, it updates the pg_class row of the target
> table, noting that VACUUM_IN_PROGRESS for the target table.

If I understand correctly, nobody would be able to re-use any line pointers while a vacuum is in progress? I find that a bit scary, since for large tables you may actually always be running a vacuum. Perhaps the DSM will fix that, but for heavily updated tables I think you might still be pretty much continuously running vacuum.

On the other hand, it would just result in line pointer bloat. And I think VACUUM could still safely remove old dead line pointers if it noted that the table had a clean vacuum status when it started.

> 2. It then waits for all the existing transactions to finish to make
> sure that everyone can see the change in the pg_class row,

I'm a bit scared of how many "waits for all transactions to finish" we're accumulating. It seemed safe enough when we had only one, but I'm not sure what the consequences are when there are several of them. Are we perhaps creating deadlocks?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
On Wed, 2008-05-28 at 16:56 +0530, Pavan Deolasee wrote:

> 2. It then waits for all the existing transactions to finish to make
> sure that everyone can see the change in the pg_class row

I'm not happy that the VACUUM waits. It might wait a very long time and cause worse overall performance than the impact of the second scan.

Happily, I think we already have a solution to this overall problem elsewhere in the code. When we VACUUM away all the index entries on a page, we don't yet remove it. We only add it to the FSM on the second pass of that page on the *next* VACUUM.

So the idea is to have one pass per VACUUM, but make that one pass do the first pass of *this* VACUUM and the second pass of the *last* VACUUM. We mark the xid of the VACUUM in pg_class as you suggest, but we do it after VACUUM has completed the pass.

In the single pass we mark DEAD line pointers as RECENTLY_DEAD. If the last VACUUM's xid is old enough, we also mark RECENTLY_DEAD pointers as UNUSED during this first pass. If the last xid is not old enough, we do a second pass to remove them.

That has the effect that large tables that are infrequently VACUUMed will need only a single scan. Smaller tables that require almost continual VACUUMing will probably do two scans, but who cares?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
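[A rough model of the one-pass scheme described above, as a hedged sketch: `last_vacuum_xid` is the pg_class xid Simon mentions, assumed to come from a vacuum that committed; the plain integer comparison deliberately ignores xid wraparound, which the thread notes a real implementation would have to handle.]

```python
# One scan per VACUUM: do this vacuum's first pass and, piggybacked on
# it, the previous vacuum's second pass -- when the previous vacuum's
# xid is already older than every running transaction.

DEAD, RECENTLY_DEAD, UNUSED = "DEAD", "RECENTLY_DEAD", "UNUSED"

def single_pass(page, last_vacuum_xid, oldest_running_xid):
    out = []
    for state in page:
        if state == DEAD:
            # This vacuum's work: index pointers are about to be removed.
            out.append(RECENTLY_DEAD)
        elif state == RECENTLY_DEAD and last_vacuum_xid < oldest_running_xid:
            # Last vacuum's leftover: every backend has seen it complete,
            # so its deferred "second pass" happens here, in this scan.
            out.append(UNUSED)
        else:
            out.append(state)
    return out

# Previous vacuum (xid 100) is older than every running xact (oldest 200):
assert single_pass([DEAD, RECENTLY_DEAD], 100, 200) == [RECENTLY_DEAD, UNUSED]
# Previous vacuum too recent: its pointers must wait for yet another pass.
assert single_pass([DEAD, RECENTLY_DEAD], 300, 200) == [RECENTLY_DEAD, RECENTLY_DEAD]
```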
"Simon Riggs" <simon@2ndquadrant.com> writes:

> So the idea is to have one pass per VACUUM, but make that one pass do
> the first pass of *this* VACUUM and the second pass of the *last*
> VACUUM.

I think that's exactly the same as the original suggestion of having HOT pruning do the second pass of the last vacuum. The trick is to know whether the last vacuum committed or not. If it didn't commit, then it's not safe to remove those line pointers yet.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
On Wed, 2008-05-28 at 16:55 -0400, Gregory Stark wrote:

> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > So the idea is to have one pass per VACUUM, but make that one pass do
> > the first pass of *this* VACUUM and the second pass of the *last*
> > VACUUM.
>
> I think that's exactly the same as the original suggestion of having HOT
> pruning do the second pass of the last vacuum. The trick is to know whether
> the last vacuum committed or not. If it didn't commit then it's not safe to
> remove those line pointers yet.

Perhaps, though I'm not suggesting storing extra xids on-block.

I think if we have to wait for a VACUUM to run before marking the line pointers, then we may as well wait for two. Having something wait for a VACUUM and then have it removed by HOT afterwards gives you the worst of both worlds: a long wait for a VACUUM, then more overhead and extra code during HOT pruning.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, May 29, 2008 at 2:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> I'm not happy that the VACUUM waits. It might wait a very long time and
> cause worse overall performance than the impact of the second scan.

Let's not get too paranoid about the wait. It's a minor detail in the whole theory. I would suggest that the benefit of avoiding the second scan would be huge. Remember, it's not just a scan; it also dirties those blocks again, forcing them to be written to disk. Also, if you really have a situation where vacuum needs to wait for very long, then you are already in trouble: the long-running transactions would prevent vacuuming many tuples anyway.

I think we can easily tweak the "wait" so that it doesn't wait indefinitely. If the "wait" times out, vacuum can still proceed, but it cannot mark the DEAD line pointers as DEAD_RECLAIMED. It would then fall back to making a second pass and reclaiming the DEAD line pointers (like it does today).

> So the idea is to have one pass per VACUUM, but make that one pass do
> the first pass of *this* VACUUM and the second pass of the *last*
> VACUUM.
>
> We mark the xid of the VACUUM in pg_class as you suggest, but we do it
> after VACUUM has completed the pass.

The trick is to correctly know whether the last vacuum removed the index pointers or not. There could be several ways to do that, but you need to explain in detail how it would work in cases of vacuum failure and database crash.

> In single pass we mark DEAD line pointers as RECENTLY_DEAD. If the last
> VACUUM xid is old enough we mark RECENTLY_DEAD as UNUSED, as well,
> during this first pass. If last xid is not old enough we do second pass
> to remove them.

Let's not call them RECENTLY_DEAD :-) DEAD is already stricter than that; we need something even stronger. That's why I used DEAD_RECLAIMED, to note that the line pointer is DEAD and the index pointer may have been removed as well.

> That has the effect that large tables that are infrequently VACUUMed
> will need only a single scan. Smaller tables that require almost
> continual VACUUMing will probably do two scans, but who cares?

Yeah, I think we need to target the large-table case. The second pass is obviously much more costly for large tables. I think the timed wait answers your concern.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
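[The timed-wait fallback could look something like this — an illustrative sketch only. `still_running` stands in for a real check against the proc array, and the default timeout reflects the 60-300 s range discussed later in the thread.]

```python
import time

def wait_for_old_transactions(still_running, timeout=300.0, poll=1.0):
    """Bounded wait: return True if all pre-existing transactions
    finished within `timeout` seconds (the single-pass optimization is
    then safe), or False if we timed out and vacuum should fall back to
    today's two-pass behaviour."""
    deadline = time.monotonic() + timeout
    while still_running():
        if time.monotonic() >= deadline:
            return False            # timed out: do the classic second pass
        time.sleep(poll)            # re-check periodically, don't busy-wait
    return True                     # everyone finished: mark DEAD_RECLAIMED

# No old transactions: returns immediately, optimization proceeds.
assert wait_for_old_transactions(lambda: False, timeout=1.0) is True

# A transaction that never finishes: time out and keep the old behaviour.
assert wait_for_old_transactions(lambda: True, timeout=0.05, poll=0.01) is False
```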
On Thu, 2008-05-29 at 09:57 +0530, Pavan Deolasee wrote:

> On Thu, May 29, 2008 at 2:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > I'm not happy that the VACUUM waits. It might wait a very long time and
> > cause worse overall performance than the impact of the second scan.
>
> Let's not get too paranoid about the wait. It's a minor detail in the
> whole theory. I would suggest that the benefit of avoiding the second
> scan would be huge. Remember, it's not just a scan; it also dirties
> those blocks again, forcing them to be written to disk. Also, if you
> really have a situation where vacuum needs to wait for very long, then
> you are already in trouble. The long-running transactions would prevent
> vacuuming many tuples.
>
> I think we can easily tweak the "wait" so that it doesn't wait
> indefinitely. If the "wait" times out, vacuum can still proceed, but
> it cannot mark the DEAD line pointers as DEAD_RECLAIMED. It would then
> have a choice of making a second pass and reclaiming the DEAD line
> pointers (like it does today).

Which is exactly what I suggested. Don't wait, just check.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, May 30, 2008 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> What I still
> don't accept is that an unconstrained wait is justifiable. You've just
> said its a minor detail, but that's not the way I see it. It might be a
> second, but it might be an hour or more.

I am suggesting a timed wait, maybe somewhere between 60 and 300 seconds. That's the maximum VACUUM would get delayed. If existing transactions don't finish within that time, VACUUM just works as it does today. So it certainly can't be much worse than what it is today.

> A non-waiting solution seems like the only way to proceed.

Yeah, but we don't yet have a simple solution that would work in all cases and is not too complicated.

> Is this a non-issue anyway, with DSM?

I thought about that. DSM would certainly reduce the cost of heap scans, but the second pass would still be required, and it would re-dirty all the pages again.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Fri, May 30, 2008 at 1:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Been thinking some more about this. You're right that the second scan
> could re-dirty many pages and is probably something to avoid.

Right. IMHO it would help us a lot.

> The main
> issue I see is that you don't really know how much work will happen in
> the first phase and how much would happen in the second.

With HOT, I see very little work left for the second pass. The dead space is already collected in the first pass; the second pass only cleans up the DEAD line pointers. Also, if we can update the FSM early (at the end of the first pass), we can avoid further irreversible bloat of the heap.

> no problem at all. I'd rather keep it as it is than have sometimes
> better, sometimes worse behaviour.

For large tables, two heap scans along with several additional page writes don't seem to be a cost we can afford, especially in IO-bound applications. IMHO a timed wait is not such a bad thing. Note that this is all about VACUUM, which is a background maintenance activity, and it does no harm to delay it by a few seconds or even minutes. Also, as I said earlier, "waiting" is a minor detail; maybe there is a better way to do things.

Unless there are strong objections, I would like to give it a shot and see if there are any real benefits. We can then fix any regression cases. Let me know if somebody thinks there are certain show-stoppers, or that the benefit of avoiding a second scan on a large table is not worth it. I personally have a strong feeling that it's worth the effort.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Fri, May 30, 2008 at 3:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Perhaps we can start first scan, check xid after we scan each few
> blocks. Once we find the xid is older, then we know the size of the
> second scan can be limited to only those blocks already scanned. So the
> two endpoints of behaviour are we skip the scan completely or we do the
> whole scan, but at least there is a saving in many cases without
> waiting.

Hmm, interesting. I was about to suggest that we use some heuristic, such as the size of the table, to decide whether or not to try the optimization, but what you just suggested makes more sense. So instead of waiting, we start the first scan anyway. If we are lucky, all the old transactions will go away after some time, and then we can start marking the DEAD line pointers as DEAD_RECLAIMED. The second pass then only needs to rescan the initial blocks to remove the DEAD line pointers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
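[This refinement can be modeled roughly as follows — a hypothetical sketch, where block indices stand in for heap pages and the point at which the old transactions are observed gone is supplied as a parameter rather than discovered by a real xid check.]

```python
# Start the scan without waiting; once the old transactions are gone,
# switch to marking DEAD_RECLAIMED.  Only blocks scanned before the
# switch need a (now much smaller) second pass.

DEAD, DEAD_RECLAIMED = "DEAD", "DEAD_RECLAIMED"

def first_scan_with_cutoff(nblocks, old_xacts_gone_at):
    marks = []
    second_pass_blocks = []
    for blkno in range(nblocks):
        if blkno < old_xacts_gone_at:
            marks.append(DEAD)              # too early: must be revisited
            second_pass_blocks.append(blkno)
        else:
            marks.append(DEAD_RECLAIMED)    # safe: no second visit needed
    return marks, second_pass_blocks

# Old transactions finish after 2 of 10 blocks: only those 2 blocks need
# a second visit instead of all 10.
marks, revisit = first_scan_with_cutoff(10, old_xacts_gone_at=2)
assert revisit == [0, 1]
assert marks[2:] == [DEAD_RECLAIMED] * 8
```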
On Fri, 2008-05-30 at 14:50 +0530, Pavan Deolasee wrote:

> On Fri, May 30, 2008 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > What I still
> > don't accept is that an unconstrained wait is justifiable. You've just
> > said its a minor detail, but that's not the way I see it. It might be a
> > second, but it might be an hour or more.
>
> I am suggesting a timed wait, maybe between 60 and 300 seconds.
> That's the maximum VACUUM would get delayed. If existing transactions
> don't finish within that time, VACUUM just works as it does today. So
> it certainly can't be much worse than what it is today.
>
> > A non-waiting solution seems like the only way to proceed.

I understand what you're saying and agree that in (some) cases a small wait is not important. I'm just saying some != all, and the gap covers important cases:

If we have a database with 100 tables (very common) and we add 5 minutes waiting time to each vacuum, then we'll make a complete database VACUUM take ~8 hours longer than it did before. 1000 tables would cause rioting in the streets.

Waiting 5 minutes for a 0.5 second vacuum isn't sensible either, whatever the gain. It's clear Amdahl's Law would not advise us to optimise that (in this way). So if it's a large table and we submitted it with a non-zero vacuum wait, then maybe a wait is an acceptable optimisation.

Perhaps we can start the first scan and check the xid after we scan each few blocks. Once we find the xid is old enough, we know the second scan can be limited to only those blocks already scanned. So the two endpoints of behaviour are that we skip the second scan completely or we do the whole scan, but at least there is a saving in many cases, without waiting.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-05-29 at 09:57 +0530, Pavan Deolasee wrote:

> On Thu, May 29, 2008 at 2:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > I'm not happy that the VACUUM waits. It might wait a very long time and
> > cause worse overall performance than the impact of the second scan.
>
> Let's not get too paranoid about the wait. It's a minor detail in the
> whole theory. I would suggest that the benefit of avoiding the second
> scan would be huge. Remember, it's not just a scan; it also dirties
> those blocks again, forcing them to be written to disk. Also, if you
> really have a situation where vacuum needs to wait for very long, then
> you are already in trouble. The long-running transactions would prevent
> vacuuming many tuples.

Been thinking some more about this. You're right that the second scan could re-dirty many pages and is probably something to avoid.

The main issue I see is that you don't really know how much work will happen in the first phase and how much would happen in the second. Avoiding the second scan might be worth waiting for, or it might not. You really have no idea how many tuples the LRT might affect. There is no "typical" here; it can vary from one extreme to another. Waiting could be very bad, or no problem at all. I'd rather keep it as it is than have sometimes better, sometimes worse behaviour.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-05-30 at 14:22 +0530, Pavan Deolasee wrote:

> For large tables, two heap scans along with several additional page
> writes doesn't seem to be a cost we can afford, especially in IO-bound
> applications. IMHO a timed wait is not such a bad thing. Note that this
> is all about VACUUM, which is a background maintenance activity, and it
> does no harm to delay it by a few seconds or even minutes. Also, as I
> said earlier, "waiting" is a minor detail; maybe there is a better way
> to do things.
>
> Unless there are strong objections, I would like to give it a shot and
> see if there are any real benefits. We can then fix any regression
> cases. Let me know if somebody thinks there are certain show-stoppers,
> or that the benefit of avoiding a second scan on a large table is not
> worth it. I personally have a strong feeling that it's worth the
> effort.

I'm not really clear what you intend to do now. We both agreed that avoiding a second pass is a good thing. What I still don't accept is that an unconstrained wait is justifiable. You've just said it's a minor detail, but that's not the way I see it. It might be a second, but it might be an hour or more.

If I run a VACUUM at 0700, thinking it will finish by 0900 before my database gets busy, it is a very bad thing to find that it only started at 0900 and is now interfering with my business workload.

A non-waiting solution seems like the only way to proceed.

Is this a non-issue anyway, with DSM?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support