Обсуждение: VACUUM can finish an interrupted nbtree page split -- is that okay?
_bt_lock_branch_parent() is used by VACUUM during page deletion, and calls _bt_getstackbuf(), which always finishes incomplete page splits for the parent page that it exclusive locks and returns. ISTM that this may be problematic, since it contradicts the general rule that VACUUM isn't supposed to finish incomplete page splits. According to the nbtree README: "It would seem natural to add the missing downlinks in VACUUM, but since inserting a downlink might require splitting a page, it might fail if you run out of disk space. That would be bad during VACUUM - the reason for running VACUUM in the first place might be that you run out of disk space, and now VACUUM won't finish because you're out of disk space. In contrast, an insertion can require enlarging the physical file anyway." I'm inclined to note this as an exception in the nbtree README, and leave it at that. Interrupted internal page splits are probably very rare in practice, so the operational risk of running out of disk space like this is minimal. FWIW, I notice that the logic that appears after the _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it must defend against interrupted splits in at least the grandparent-of-leaf page, and maybe even the parent, so it's probably not unworkable to not finish the split: /* * If the target is the rightmost child of its parent, then we can't * delete, unless it's also the only child. */ if (poffset >= maxoff) { /* It's rightmost child... */ if (poffset == P_FIRSTDATAKEY(opaque)) { /* * It's only child, so safe if parent would itself be removable. * We have to check the parent itself, and then recurse to test * the conditions at the parent's parent. */ if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) || P_INCOMPLETE_SPLIT(opaque)) { _bt_relbuf(rel, pbuf); return false; } Separately, I noticed another minor issue that appears a few lines further down still: /* * Perform the same check on this internal level that * _bt_mark_page_halfdead performed on the leaf level. */ if (_bt_is_page_halfdead(rel, *rightsib)) { elog(DEBUG1, "could not delete page %u because its right sibling %u is half-dead", parent, *rightsib); return false; } I thought that internal pages were never half-dead after Postgres 9.4. If that happens, then the check within _bt_pagedel() will throw an ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't this internal level _bt_is_page_halfdead() check contain a "can't happen" error, or even a simple assertion? -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > _bt_lock_branch_parent() is used by VACUUM during page deletion, and > calls _bt_getstackbuf(), which always finishes incomplete page splits > for the parent page that it exclusive locks and returns. ISTM that > this may be problematic, since it contradicts the general rule that > VACUUM isn't supposed to finish incomplete page splits. According to > the nbtree README: > "It would seem natural to add the missing downlinks in VACUUM, but since > inserting a downlink might require splitting a page, it might fail if you > run out of disk space. That would be bad during VACUUM - the reason for > running VACUUM in the first place might be that you run out of disk space, > and now VACUUM won't finish because you're out of disk space. In contrast, > an insertion can require enlarging the physical file anyway." > I'm inclined to note this as an exception in the nbtree README, and > leave it at that. Interrupted internal page splits are probably very > rare in practice, so the operational risk of running out of disk space > like this is minimal. Also, if your WAL is on the same filesystem as the data, the whole thing is pretty much moot anyway since VACUUM is surely going to add WAL output. I concur with not sweating over this. > FWIW, I notice that the logic that appears after the > _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it > must defend against interrupted splits in at least the > grandparent-of-leaf page, and maybe even the parent, so it's probably > not unworkable to not finish the split: -ETOOMANYNEGATIVES ... can't quite parse your point here? > I thought that internal pages were never half-dead after Postgres 9.4. > If that happens, then the check within _bt_pagedel() will throw an > ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't > this internal level _bt_is_page_halfdead() check contain a "can't > happen" error, or even a simple assertion? I think that code is there to deal with the possibility of finding an old half-dead page. Don't know that it's safe to remove it yet. regards, tom lane
On Fri, Mar 1, 2019 at 4:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > FWIW, I notice that the logic that appears after the > > _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it > > must defend against interrupted splits in at least the > > grandparent-of-leaf page, and maybe even the parent, so it's probably > > not unworkable to not finish the split: > > -ETOOMANYNEGATIVES ... can't quite parse your point here? Sorry. :-) My point was that it actually seems feasible to not do the split, making the quoted paragraph from nbtree README correct as-is. But, since we're happy to continue to finish the occasional interrupted internal page split within VACUUM anyway, that isn't an important point. > I think that code is there to deal with the possibility of finding > an old half-dead page. Don't know that it's safe to remove it yet. I don't know that it is either. My first instinct is to assume that it's fine to remove the code, since, as I said, we're treating internal pages that are half-dead as being ipso facto corrupt -- we'll throw an error before too long anyway. However, "internal + half dead" was a valid state for an nbtree page prior to 9.4, and we make no distinction about that (versioning nbtree indexes to deal with cross-version incompatibilities came only in Postgres v11). Trying to analyze whether or not it's truly safe to just not do this seems very difficult, and I don't think that it's performance critical. This is a problem only because it's distracting and confusing. I favor keeping the test, but having it throw a ERRCODE_INDEX_CORRUPTED error, just like _bt_pagedel() does already. A comment could point out that the test is historical/defensive, and probably isn't actually necessary. What do you think of that idea? -- Peter Geoghegan
On Fri, Mar 1, 2019 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > I favor keeping the test, but having it throw a > ERRCODE_INDEX_CORRUPTED error, just like _bt_pagedel() does already. A > comment could point out that the test is historical/defensive, and > probably isn't actually necessary. What do you think of that idea? Actually, while 9.4 did indeed start treating "internal + half dead" pages as corrupt, it didn't exactly remove the *concept* of a half dead internal page. I think that the cross check (the one referenced by comments above the corresponding leaf/_bt_mark_page_halfdead() call to _bt_is_page_halfdead()) might have problems in the event of an interrupted *multi-level* page deletion. I wonder, is there a subtle bug here that bugfix commit 8da31837803 didn't quite manage to prevent? (This commit added both of the _bt_is_page_halfdead() checks.) (Thinks some more...) Actually, I think that bugfix commit 8da31837803 works despite possible "logically half dead internal pages", because in the event of such an internal page the sibling would actually have to be the *cousin* of the original parent (half dead/leaf page parent), not the "true sibling" (otherwise, cousin's multi-level page deletion should never have taken place). I think that we'll end up doing the right thing with the downlinks in the grandparent page, despite there being an interrupted multi-level deletion in the cousin's subtree. Since cousin *atomically* removed its downlink in our shared *grandparent* (not its parent) at the same time some leaf page was initially marked half-dead, everything works out. Page deletion is painfully complicated. Seems wise to keep the internal page test, out of sheer paranoia, while making it an error as suggested earlier. I will definitely want to think about it some more, though. -- Peter Geoghegan
On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote: > /* > * Perform the same check on this internal level that > * _bt_mark_page_halfdead performed on the leaf level. > */ > if (_bt_is_page_halfdead(rel, *rightsib)) > I thought that internal pages were never half-dead after Postgres 9.4. > If that happens, then the check within _bt_pagedel() will throw an > ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't > this internal level _bt_is_page_halfdead() check contain a "can't > happen" error, or even a simple assertion? I think that we should get rid of this code on HEAD shortly, because it's effectively dead code. You don't have to know anything about B-Trees to see why this must be true: VACUUM is specifically checking if an internal page is half-dead here, even though it's already treating half-dead internal pages as ipso facto corrupt in higher level code (it's the first thing we check in _bt_pagedel()). This is clearly contradictory. If there is a half-dead internal page, then there is no danger of VACUUM not complaining loudly that you need to REINDEX. This has been true since the page deletion overhaul that went into 9.4. Attached patch removes the internal page check, and adds a comment that explains why it's sufficient to check on the leaf level alone. Admittedly, it's much easier to understand why the _bt_is_page_halfdead() internal page check is useless than it is to understand why replacing it with this comment is helpful. My observation that a half-dead leaf page is representative of the subtree whose root the leaf page has stored as its "top parent" is hardly controversial, though -- that's the whole basis of multi-level page deletion. If you *visualize* how multi-level deletion works, and consider its rightmost-in-subtree restriction, then it isn't hard to see why everything works out with just the leaf level right-sibling-is-half-dead check: We can only have two adjacent "skinny" pending-deletion subtrees in cases where the removed check might seem to be helpful -- each page is both the leftmost and the rightmost on its level in its subtree. It's okay to just check if the leaf is half-dead because it "owns" exactly the same range in the keyspace as the internal pages up to and including its top parent, if any, and because it is marked half-dead by the same atomic operation that does initial removal of downlinks in an ancestor page. I'm fine with waiting until we branch-off v12 before pushing the patch, even though it seems low risk. -- Peter Geoghegan
Вложения
On 05/05/2019 01:38, Peter Geoghegan wrote: > On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote: >> /* >> * Perform the same check on this internal level that >> * _bt_mark_page_halfdead performed on the leaf level. >> */ >> if (_bt_is_page_halfdead(rel, *rightsib)) > >> I thought that internal pages were never half-dead after Postgres 9.4. >> If that happens, then the check within _bt_pagedel() will throw an >> ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't >> this internal level _bt_is_page_halfdead() check contain a "can't >> happen" error, or even a simple assertion? > > I think that we should get rid of this code on HEAD shortly, because > it's effectively dead code. You don't have to know anything about > B-Trees to see why this must be true: VACUUM is specifically checking > if an internal page is half-dead here, even though it's already > treating half-dead internal pages as ipso facto corrupt in higher > level code (it's the first thing we check in _bt_pagedel()). This is > clearly contradictory. If there is a half-dead internal page, then > there is no danger of VACUUM not complaining loudly that you need to > REINDEX. This has been true since the page deletion overhaul that went > into 9.4. I don't understand that reasoning. Yes, _bt_pagedel() will complain if it finds a half-dead internal page. But how does that mean that _bt_lock_branch_parent() can't encounter one? - Heikki
On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > I don't understand that reasoning. Yes, _bt_pagedel() will complain if > it finds a half-dead internal page. But how does that mean that > _bt_lock_branch_parent() can't encounter one? I suppose that in theory it could, but only if you allow that any possible state could be found -- it doesn't seem any more likely than any other random illegal state. Even when it happens, we'll get a "failed to re-find parent key" error message when we go a bit further. Isn't that a logical outcome? Actually, maybe we won't get that error, because we're talking about a corrupt index, and all bets are off -- no reason to think that the half-dead internal page would be consistent with other pages in any way. But even then, you'll go on to report it in the usual way, since VACUUM scans nbtree indexes in physical order. -- Peter Geoghegan
On Tue, May 7, 2019 at 9:59 AM Peter Geoghegan <pg@bowt.ie> wrote: > On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I don't understand that reasoning. Yes, _bt_pagedel() will complain if > > it finds a half-dead internal page. But how does that mean that > > _bt_lock_branch_parent() can't encounter one? > > I suppose that in theory it could, but only if you allow that any > possible state could be found -- it doesn't seem any more likely than > any other random illegal state. To be fair, I suppose that the code made more sense when it first went in, because at the time there was a chance that there could be leftover half-dead internal pages. But, that was a long time ago now. I wonder why the code wasn't complaining about corruption loudly, like the top level code, instead of treating half-dead pages as a legitimate reason to not proceed with multi-level page deletion. That would have been overkill, but it would have made much more sense IMV. I would like to proceed with pushing this patch to HEAD in the next few days, since it's clearly removing code that can't be useful. Are there any objections? -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > To be fair, I suppose that the code made more sense when it first went > in, because at the time there was a chance that there could be > leftover half-dead internal pages. But, that was a long time ago now. Is there a good reason to assume there are none left anywhere? regards, tom lane
On Mon, May 13, 2019 at 8:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@bowt.ie> writes: > > To be fair, I suppose that the code made more sense when it first went > > in, because at the time there was a chance that there could be > > leftover half-dead internal pages. But, that was a long time ago now. > > Is there a good reason to assume there are none left anywhere? That is not an assumption that the proposed patch rests upon, though it is true that there are probably going to be virtually no half-dead internal pages that make there way on to a Postgres 12 installation. You'd have to do a CREATE INDEX on Postgres 9.3, and then not VACUUM or REINDEX the index once it was on a 9.4+ installation. I suppose that a 9.3 -> 12 upgrade is the most plausible scenario in which you could actual get a half-dead internal page on a Postgres 12 installation. Even when that happens, the index is already considered corrupt by VACUUM, so the same VACUUM process that could in theory be adversely affected by removing the half-dead internal page check will complain about the page when it gets to it later on -- the user will be told to REINDEX. And even then, we will never actually get to apply the check that I propose to remove, since we're already checking the leaf page sibling of the leaf-level target -- the leaf-level test that was added by efada2b8e92 was clearly necessary. But it was also sufficient (no equivalent internal half-dead right sibling test is needed): a 9.3-era half-dead internal page cannot have more than one child, which must be undergoing deletion as well. If somebody doubted my rationale for why we don't need to do anything more on internal page levels in installations where the user didn't pg_upgrade from a version that's < 9.4, then they'd still have to explain why we haven't heard of any problems in 5 years, and probably offer some alternative fix that considers "logically half-dead internal pages" (i.e. pages that are or will be the top parent in a deletion chain). Because the code that I propose to remove obviously cannot be doing much of anything for indexes built on 9.4+. -- Peter Geoghegan
On Mon, May 13, 2019 at 9:09 PM Peter Geoghegan <pg@bowt.ie> wrote: > Even when that happens, the index is already considered corrupt by > VACUUM, so the same VACUUM process that could in theory be adversely > affected by removing the half-dead internal page check will complain > about the page when it gets to it later on -- the user will be told to > REINDEX. And even then, we will never actually get to apply the check > that I propose to remove, since we're already checking the leaf page > sibling of the leaf-level target -- the leaf-level test that was added > by efada2b8e92 was clearly necessary. But it was also sufficient (no > equivalent internal half-dead right sibling test is needed): a 9.3-era > half-dead internal page cannot have more than one child, which must be > undergoing deletion as well. Actually, now that I look back at how page deletion worked 5+ years ago, I realize that I have this slightly wrong: the leaf level check is not sufficient to figure out if the parent's right sibling is pending deletion (which is represented explicitly as a half-dead internal page prior to 9.4). All the same, I'm going to push ahead with this patch. Bugfix commit efada2b8e92 was always about a bug in 9.4 -- it had nothing to do with 9.3. And, in the unlikely event that there is a problem on a pg_upgrade'd 9.3 -> 12 database that happens to have half-dead internal pages, we'll still get a useful, correct error from VACUUM one way or another. It might be slightly less friendly as error messages about corruption go, but that seems acceptable to me. -- Peter Geoghegan
On Thu, May 16, 2019 at 1:05 PM Peter Geoghegan <pg@bowt.ie> wrote: > Actually, now that I look back at how page deletion worked 5+ years > ago, I realize that I have this slightly wrong: the leaf level check > is not sufficient to figure out if the parent's right sibling is > pending deletion (which is represented explicitly as a half-dead > internal page prior to 9.4). All the same, I'm going to push ahead > with this patch. Bugfix commit efada2b8e92 was always about a bug in > 9.4 -- it had nothing to do with 9.3. I meant bugfix commit 8da31837803 (commit efada2b8e92 was the commit that had the bug in question). -- Peter Geoghegan