Discussion: Avoiding second heap scan in VACUUM
Tom brought this up during the PGCon developer meeting. After thinking about it, I believe it's actually possible to avoid the second heap scan, especially now that we have HOT. If we can remove the second pass, that would not only speed up vacuum but also eliminate a lot of redundant read and write IO.

Currently the second heap scan is required to remove the dead tuples from the heap. We cannot do this in the first scan because we haven't yet removed the index pointers pointing to them. HOT now prunes and defrags the pages in the first phase itself, and what is left behind is just a bunch of DEAD line pointers. These line pointers are marked UNUSED in the second heap scan. Since we don't repair any line pointer bloat, no additional free space is created in the second pass. So frankly there is not much left to be done in the second phase. Of course, we also update the FSM information at the end of the second pass.

If we want to remove the second pass, what we need is a mechanism to reclaim the DEAD line pointers. But to do this correctly, we must ensure that the DEAD line pointers are reclaimed only after the index entries pointing to them have been removed. Tom's idea was to store the vacuum-xid in the tuple header and check that xid to see whether the vacuum successfully removed the index pointers. Heikki had a brilliant idea to store the xid in the line pointer itself. These ideas are good, but would require xid wraparound handling.

I am thinking of a solution along the following lines to handle DEAD line pointers. Other ideas are welcome too.

1. Before VACUUM starts, it updates the pg_class row of the target table, noting VACUUM_IN_PROGRESS for the target table.

2. It then waits for all the existing transactions to finish, to make sure that everyone can see the change in the pg_class row.

3. It then scans the heap, prunes and defrags the pages. The normal pruning would reclaim all the dead tuples and mark their line pointers as DEAD. Since VACUUM is going to remove the index pointers pointing to these DEAD line pointers, it now marks these DEAD line pointers with an additional flag, say DEAD_RECLAIMED.

4. At the end of the first scan, VACUUM updates FSM information for the heap pages.

5. It then proceeds with the index scan and removes index pointers pointing to the DEAD line pointers collected in the heap scan.

6. Finally, it again updates the pg_class row and clears the VACUUM_IN_PROGRESS flag.

Any other backend, when it invokes page pruning, would check whether a VACUUM is in progress by looking at the VACUUM_IN_PROGRESS flag. Note that if the previous vacuum failed or the database crashed before the vacuum completed, the VACUUM_IN_PROGRESS flag would remain set until the next vacuum successfully completes on the table and resets the flag (VACUUM_NOT_IN_PROGRESS state).

Since vacuum waits for the existing transactions to finish before marking any DEAD line pointers DEAD_RECLAIMED, for a backend which sees VACUUM_NOT_IN_PROGRESS, any DEAD_RECLAIMED line pointer it finds must be left over from a previously, successfully completed vacuum. Since that vacuum must have removed the index pointers pointing to it, the backend can now safely reclaim the line pointer itself. The backend could potentially do this any time it sees a DEAD_RECLAIMED line pointer, but we may restrict it to pruning activity to keep things simple. This operation need not be WAL logged if we appropriately handle DEAD_RECLAIMED line pointers during redo recovery (if one is reused for some other insert/update activity).

I think this scheme guarantees that a backend would always see VACUUM_IN_PROGRESS if a vacuum is currently in progress on the table or the last vacuum failed. There might be situations where a backend sees VACUUM_IN_PROGRESS when in fact no vacuum is in progress and the last vacuum finished successfully, but that has no correctness implication; it would only delay reclaiming DEAD_RECLAIMED line pointers.

Comments ?

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
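[The six-step scheme above can be sketched as a toy model. This is purely illustrative Python, not PostgreSQL code; the flag and state names follow the proposal, everything else — the Table class, the crash parameter — is invented for the sketch.]

```python
# Toy simulation of the proposed DEAD_RECLAIMED protocol.  Shows why a
# backend that sees VACUUM_NOT_IN_PROGRESS may safely reclaim a
# DEAD_RECLAIMED pointer, even if an earlier vacuum crashed midway.

DEAD, DEAD_RECLAIMED, UNUSED, NORMAL = "DEAD", "DEAD_RECLAIMED", "UNUSED", "NORMAL"

class Table:
    def __init__(self, line_pointers):
        self.lp = list(line_pointers)       # heap line pointer states
        self.vacuum_in_progress = False     # the pg_class flag of the proposal

def vacuum(table, crash_before_index_scan=False):
    # Steps 1-2: set the flag (the wait for old transactions is elided).
    table.vacuum_in_progress = True
    # Step 3: first heap scan marks DEAD pointers DEAD_RECLAIMED.
    table.lp = [DEAD_RECLAIMED if s == DEAD else s for s in table.lp]
    if crash_before_index_scan:
        return                              # crash: the flag stays set
    # Step 5: index pointers removed (not modeled here).
    # Step 6: clear the flag only after the index scan completed.
    table.vacuum_in_progress = False

def backend_prune(table):
    """Opportunistic pruning by a regular backend."""
    if table.vacuum_in_progress:
        return                              # cannot trust DEAD_RECLAIMED yet
    # Safe: index pointers to DEAD_RECLAIMED slots are provably gone.
    table.lp = [UNUSED if s == DEAD_RECLAIMED else s for s in table.lp]

t = Table([DEAD, NORMAL, DEAD])
vacuum(t, crash_before_index_scan=True)
backend_prune(t)
assert t.lp == [DEAD_RECLAIMED, NORMAL, DEAD_RECLAIMED]  # nothing reclaimed

vacuum(t)                                   # a later vacuum completes
backend_prune(t)
assert t.lp == [UNUSED, NORMAL, UNUSED]     # now safe to reclaim
```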
"Pavan Deolasee" <pavan.deolasee@gmail.com> writes:

> 1. Before VACUUM starts, it updates the pg_class row of the target
> table, noting that VACUUM_IN_PROGRESS for the target table.

If I understand correctly, nobody would be able to re-use any line pointers while a vacuum is in progress? I find that a bit scary, since for large tables you may actually always be running a vacuum. Perhaps the DSM will fix that, but for heavily updated tables I think you might still be pretty much continuously running vacuum.

On the other hand, it would just result in line pointer bloat. And I think VACUUM could still safely remove old dead line pointers if it noted that the table had a clean vacuum status when it started.

> 2. It then waits for all the existing transactions to finish to make
> sure that everyone can see the change in the pg_class row,

I'm a bit scared of how many "waits for all transactions to finish" we're accumulating. It seemed safe enough when we had only one, but I'm not sure what the consequences are when there are several of them. Are we perhaps creating deadlocks?

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
On Wed, 2008-05-28 at 16:56 +0530, Pavan Deolasee wrote:

> 2. It then waits for all the existing transactions to finish to make
> sure that everyone can see the change in the pg_class row

I'm not happy that the VACUUM waits. It might wait a very long time and cause worse overall performance than the impact of the second scan.

Happily, I think we already have a solution to this overall problem elsewhere in the code. When we VACUUM away all the index entries on a page, we don't yet remove it. We only add it to the FSM on the second pass of that page on the *next* VACUUM.

So the idea is to have one pass per VACUUM, but make that one pass do the first pass of *this* VACUUM and the second pass of the *last* VACUUM. We mark the xid of the VACUUM in pg_class as you suggest, but we do it after VACUUM has completed the pass.

In the single pass we mark DEAD line pointers as RECENTLY_DEAD. If the last VACUUM's xid is old enough, we also mark RECENTLY_DEAD pointers as UNUSED during this first pass. If the last xid is not old enough, we do a second pass to remove them.

That has the effect that large tables that are infrequently VACUUMed will need only a single scan. Smaller tables that require almost continual VACUUMing will probably do two scans, but who cares?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
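[A rough model of the one-pass scheme described above, as a hedged sketch: `last_vacuum_xid` is the pg_class xid Simon mentions, assumed to come from a vacuum that committed; the plain integer comparison deliberately ignores xid wraparound, which the thread notes a real implementation would have to handle.]

```python
# One scan per VACUUM: do this vacuum's first pass and, piggybacked on
# it, the previous vacuum's second pass -- when the previous vacuum's
# xid is already older than every running transaction.

DEAD, RECENTLY_DEAD, UNUSED = "DEAD", "RECENTLY_DEAD", "UNUSED"

def single_pass(page, last_vacuum_xid, oldest_running_xid):
    out = []
    for state in page:
        if state == DEAD:
            # This vacuum's work: index pointers are about to be removed.
            out.append(RECENTLY_DEAD)
        elif state == RECENTLY_DEAD and last_vacuum_xid < oldest_running_xid:
            # Last vacuum's leftover: every backend has seen it complete,
            # so its deferred "second pass" happens here, in this scan.
            out.append(UNUSED)
        else:
            out.append(state)
    return out

# Previous vacuum (xid 100) is older than every running xact (oldest 200):
assert single_pass([DEAD, RECENTLY_DEAD], 100, 200) == [RECENTLY_DEAD, UNUSED]
# Previous vacuum too recent: its pointers must wait for yet another pass.
assert single_pass([DEAD, RECENTLY_DEAD], 300, 200) == [RECENTLY_DEAD, RECENTLY_DEAD]
```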
"Simon Riggs" <simon@2ndquadrant.com> writes:

> So the idea is to have one pass per VACUUM, but make that one pass do
> the first pass of *this* VACUUM and the second pass of the *last*
> VACUUM.

I think that's exactly the same as the original suggestion of having HOT pruning do the second pass of the last vacuum. The trick is to know whether the last vacuum committed or not. If it didn't commit, then it's not safe to remove those line pointers yet.

--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's On-Demand Production Tuning
On Wed, 2008-05-28 at 16:55 -0400, Gregory Stark wrote:

> "Simon Riggs" <simon@2ndquadrant.com> writes:
>
> > So the idea is to have one pass per VACUUM, but make that one pass do
> > the first pass of *this* VACUUM and the second pass of the *last*
> > VACUUM.
>
> I think that's exactly the same as the original suggestion of having HOT
> pruning do the second pass of the last vacuum. The trick is to know whether
> the last vacuum committed or not. If it didn't commit then it's not safe to
> remove those line pointers yet.

Perhaps, though I'm not suggesting storing extra xids on-block.

I think if we have to wait for a VACUUM to run before marking the line pointers, then we may as well wait for two. Having something wait for a VACUUM and then have it removed by HOT afterwards gives you the worst of both worlds: a long wait for a VACUUM, then more overhead and extra code during HOT pruning.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, May 29, 2008 at 2:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> I'm not happy that the VACUUM waits. It might wait a very long time and
> cause worse overall performance than the impact of the second scan.

Let's not get too paranoid about the wait. It's a minor detail in the whole theory. I would suggest that the benefit of avoiding the second scan would be huge. Remember, it's not just a scan; it also dirties those blocks again, forcing them to be written to disk. Also, if you really have a situation where vacuum needs to wait for very long, then you are already in trouble: the long-running transactions would prevent vacuuming many tuples anyway.

I think we can easily tweak the "wait" so that it doesn't wait indefinitely. If the "wait" times out, vacuum can still proceed, but it cannot mark the DEAD line pointers as DEAD_RECLAIMED. It would then fall back to making a second pass and reclaiming the DEAD line pointers (like it does today).

> So the idea is to have one pass per VACUUM, but make that one pass do
> the first pass of *this* VACUUM and the second pass of the *last*
> VACUUM.
>
> We mark the xid of the VACUUM in pg_class as you suggest, but we do it
> after VACUUM has completed the pass.

The trick is to correctly know whether the last vacuum removed the index pointers or not. There could be several ways to do that, but you need to explain in detail how it would work in cases of vacuum failure and database crash.

> In single pass we mark DEAD line pointers as RECENTLY_DEAD. If the last
> VACUUM xid is old enough we mark RECENTLY_DEAD as UNUSED, as well,
> during this first pass. If last xid is not old enough we do second pass
> to remove them.

Let's not call them RECENTLY_DEAD :-) DEAD is already stricter than that; we need something even stronger. That's why I used DEAD_RECLAIMED, to note that the line pointer is DEAD and the index pointer may have been removed as well.

> That has the effect that large tables that are infrequently VACUUMed
> will need only a single scan. Smaller tables that require almost
> continual VACUUMing will probably do two scans, but who cares?

Yeah, I think we need to target the large-table case. The second pass is obviously much more costly for large tables. I think the timed wait answers your concern.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
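[The timed-wait fallback could look something like this — an illustrative sketch only. `still_running` stands in for a real check against the proc array, and the default timeout reflects the 60-300 s range discussed later in the thread.]

```python
import time

def wait_for_old_transactions(still_running, timeout=300.0, poll=1.0):
    """Bounded wait: return True if all pre-existing transactions
    finished within `timeout` seconds (the single-pass optimization is
    then safe), or False if we timed out and vacuum should fall back to
    today's two-pass behaviour."""
    deadline = time.monotonic() + timeout
    while still_running():
        if time.monotonic() >= deadline:
            return False            # timed out: do the classic second pass
        time.sleep(poll)            # re-check periodically, don't busy-wait
    return True                     # everyone finished: mark DEAD_RECLAIMED

# No old transactions: returns immediately, optimization proceeds.
assert wait_for_old_transactions(lambda: False, timeout=1.0) is True

# A transaction that never finishes: time out and keep the old behaviour.
assert wait_for_old_transactions(lambda: True, timeout=0.05, poll=0.01) is False
```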
On Thu, 2008-05-29 at 09:57 +0530, Pavan Deolasee wrote:

> On Thu, May 29, 2008 at 2:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > I'm not happy that the VACUUM waits. It might wait a very long time and
> > cause worse overall performance than the impact of the second scan.
>
> Let's not get too paranoid about the wait. It's a minor detail in the
> whole theory. I would suggest that the benefit of avoiding the second
> scan would be huge. Remember, it's not just a scan; it also dirties
> those blocks again, forcing them to be written to disk. Also, if you
> really have a situation where vacuum needs to wait for very long, then
> you are already in trouble. The long-running transactions would prevent
> vacuuming many tuples.
>
> I think we can easily tweak the "wait" so that it doesn't wait
> indefinitely. If the "wait" times out, vacuum can still proceed, but
> it cannot mark the DEAD line pointers as DEAD_RECLAIMED. It would then
> have a choice of making a second pass and reclaiming the DEAD line
> pointers (like it does today).

Which is exactly what I suggested. Don't wait, just check.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, May 30, 2008 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> What I still
> don't accept is that an unconstrained wait is justifiable. You've just
> said its a minor detail, but that's not the way I see it. It might be a
> second, but it might be an hour or more.

I am suggesting a timed wait, maybe somewhere between 60 and 300 seconds. That's the maximum VACUUM would get delayed. If existing transactions don't finish within that time, VACUUM just works as it does today. So it certainly can't be much worse than what it is today.

> A non-waiting solution seems like the only way to proceed.

Yeah, but we don't yet have a simple solution that would work in all cases and is not too complicated.

> Is this a non-issue anyway, with DSM?

I thought about that. DSM would certainly reduce the cost of heap scans, but the second pass would still be required, and it would re-dirty all the pages again.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Fri, May 30, 2008 at 1:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Been thinking some more about this. You're right that the second scan
> could re-dirty many pages and is probably something to avoid.

Right. IMHO it would help us a lot.

> The main
> issue I see is that you don't really know how much work will happen in
> the first phase and how much would happen in the second.

With HOT, I see very little work left for the second pass. The dead space is already collected in the first pass; the second pass only cleans up the DEAD line pointers. Also, if we can update the FSM early (at the end of the first pass), we can avoid further irreversible bloat of the heap.

> no problem at all. I'd rather keep it as it is than have sometimes
> better, sometimes worse behaviour.

For large tables, two heap scans along with several additional page writes don't seem to be a cost we can afford, especially in IO-bound applications. IMHO a timed wait is not such a bad thing. Note that this is all about VACUUM, which is a background maintenance activity, and it does no harm to delay it by a few seconds or even minutes. Also, as I said earlier, "waiting" is a minor detail; maybe there is a better way to do things.

Unless there are strong objections, I would like to give it a shot and see if there are any real benefits. We can then fix any regression cases. Let me know if somebody thinks there are certain show-stoppers, or that the benefit of avoiding a second scan on a large table is not worth it. I personally have a strong feeling that it's worth the effort.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
On Fri, May 30, 2008 at 3:31 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Perhaps we can start first scan, check xid after we scan each few
> blocks. Once we find the xid is older, then we know the size of the
> second scan can be limited to only those blocks already scanned. So the
> two endpoints of behaviour are we skip the scan completely or we do the
> whole scan, but at least there is a saving in many cases without
> waiting.

Hmm, interesting. I was about to suggest that we use some heuristic, such as the size of the table, to decide whether or not to try the optimization, but what you just suggested makes more sense. So instead of waiting, we start the first scan anyway. If we are lucky, all the old transactions will go away after some time, and then we can start marking the DEAD line pointers as DEAD_RECLAIMED. The second pass then only needs to rescan the initial blocks to remove the DEAD line pointers.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com
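[This refinement can be modeled roughly as follows — a hypothetical sketch, where block indices stand in for heap pages and the point at which the old transactions are observed gone is supplied as a parameter rather than discovered by a real xid check.]

```python
# Start the scan without waiting; once the old transactions are gone,
# switch to marking DEAD_RECLAIMED.  Only blocks scanned before the
# switch need a (now much smaller) second pass.

DEAD, DEAD_RECLAIMED = "DEAD", "DEAD_RECLAIMED"

def first_scan_with_cutoff(nblocks, old_xacts_gone_at):
    marks = []
    second_pass_blocks = []
    for blkno in range(nblocks):
        if blkno < old_xacts_gone_at:
            marks.append(DEAD)              # too early: must be revisited
            second_pass_blocks.append(blkno)
        else:
            marks.append(DEAD_RECLAIMED)    # safe: no second visit needed
    return marks, second_pass_blocks

# Old transactions finish after 2 of 10 blocks: only those 2 blocks need
# a second visit instead of all 10.
marks, revisit = first_scan_with_cutoff(10, old_xacts_gone_at=2)
assert revisit == [0, 1]
assert marks[2:] == [DEAD_RECLAIMED] * 8
```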
On Fri, 2008-05-30 at 14:50 +0530, Pavan Deolasee wrote:

> On Fri, May 30, 2008 at 2:41 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > What I still
> > don't accept is that an unconstrained wait is justifiable. You've just
> > said its a minor detail, but that's not the way I see it. It might be a
> > second, but it might be an hour or more.
>
> I am suggesting a timed wait, maybe between 60 and 300 seconds.
> That's the maximum VACUUM would get delayed. If existing transactions
> don't finish within that time, VACUUM just works as it does today. So
> it certainly can't be much worse than what it is today.
>
> > A non-waiting solution seems like the only way to proceed.

I understand what you're saying and agree that in (some) cases a small wait is not important. I'm just saying some != all, and the gap covers important cases:

If we have a database with 100 tables (very common) and we add 5 minutes waiting time to each vacuum, then we'll make a complete database VACUUM take ~8 hours longer than it did before. 1000 tables would cause rioting in the streets.

Waiting 5 minutes for a 0.5 second vacuum isn't sensible either, whatever the gain. It's clear Amdahl's Law would not advise us to optimise that (in this way). So if it's a large table and we submitted it with a non-zero vacuum wait, then maybe a wait is an acceptable optimisation.

Perhaps we can start the first scan and check the xid after we scan each few blocks. Once we find the xid is old enough, we know the second scan can be limited to only those blocks already scanned. So the two endpoints of behaviour are that we skip the second scan completely or we do the whole scan, but at least there is a saving in many cases, without waiting.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Thu, 2008-05-29 at 09:57 +0530, Pavan Deolasee wrote:

> On Thu, May 29, 2008 at 2:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>
> > I'm not happy that the VACUUM waits. It might wait a very long time and
> > cause worse overall performance than the impact of the second scan.
>
> Let's not get too paranoid about the wait. It's a minor detail in the
> whole theory. I would suggest that the benefit of avoiding the second
> scan would be huge. Remember, it's not just a scan; it also dirties
> those blocks again, forcing them to be written to disk. Also, if you
> really have a situation where vacuum needs to wait for very long, then
> you are already in trouble. The long-running transactions would prevent
> vacuuming many tuples.

Been thinking some more about this. You're right that the second scan could re-dirty many pages and is probably something to avoid.

The main issue I see is that you don't really know how much work will happen in the first phase and how much would happen in the second. Avoiding the second scan might be worth waiting for, or it might not. You really have no idea how many tuples the LRT might affect. There is no "typical" here; it can vary from one extreme to another. Waiting could be very bad, or no problem at all. I'd rather keep it as it is than have sometimes better, sometimes worse behaviour.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-05-30 at 14:22 +0530, Pavan Deolasee wrote:

> For large tables, two heap scans along with several additional page
> writes doesn't seem to be a cost we can afford, especially in IO-bound
> applications. IMHO a timed wait is not such a bad thing. Note that this
> is all about VACUUM, which is a background maintenance activity, and it
> does no harm to delay it by a few seconds or even minutes. Also, as I
> said earlier, "waiting" is a minor detail; maybe there is a better way
> to do things.
>
> Unless there are strong objections, I would like to give it a shot and
> see if there are any real benefits. We can then fix any regression
> cases. Let me know if somebody thinks there are certain show-stoppers,
> or that the benefit of avoiding a second scan on a large table is not
> worth it. I personally have a strong feeling that it's worth the
> effort.

I'm not really clear what you intend to do now. We both agreed that avoiding a second pass is a good thing. What I still don't accept is that an unconstrained wait is justifiable. You've just said it's a minor detail, but that's not the way I see it. It might be a second, but it might be an hour or more.

If I run a VACUUM at 0700, thinking it will finish by 0900 before my database gets busy, it is a very bad thing to find that it only started at 0900 and is now interfering with my business workload.

A non-waiting solution seems like the only way to proceed.

Is this a non-issue anyway, with DSM?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support