Re: [HACKERS] amcheck (B-Tree integrity checking tool)
От | Peter Geoghegan |
---|---|
Тема | Re: [HACKERS] amcheck (B-Tree integrity checking tool) |
Дата | |
Msg-id | CAH2-Wzm+E4NaxQ6hXQ-VfZiy9z7Z4h1E75vMXcQSBAW96aue4A@mail.gmail.com обсуждение исходный текст |
Ответ на | Re: [HACKERS] amcheck (B-Tree integrity checking tool) (Andres Freund <andres@anarazel.de>) |
Ответы |
Re: [HACKERS] amcheck (B-Tree integrity checking tool)
|
Список | pgsql-hackers |
On Thu, Feb 9, 2017 at 12:05 AM, Andres Freund <andres@anarazel.de> wrote: > reindex_index isn't necessarily comparable, because > RangeVarCallbackForReindexIndex() is, on first blush at least, careful > enough about the locking, and reindex_relation only gets the index list > after building the index: > /* > * Find and lock index, and check permissions on table; use callback to > * obtain lock on table first, to avoid deadlock hazard. The lock level > * used here must match the index lock obtained in reindex_index(). > */ > > I think you ought to do something similar. Okay. Looking into it, but see my response to your later remarks below. > I'd really like to have something like relation_check(), index_check() > which calls the correct functions for the relevan index types (and > possibly warns about !nbtree indexes not being checked?). I feel that that's something for an in-core facility that exposes this through DDL. One good reason to not do that just yet is that I'm not completely comfortable with what a uniform interface looks like here. And frankly, I really just want to get this out of the way, to prove the general concept of verification tooling. amcheck v1 is a good start, but much work remains. I think that no one will take the idea completely seriously until it is usable as a core extension. Let's get the ball rolling. >> +#define CHECK_SUPERUSER() { \ >> + if (!superuser()) \ >> + ereport(ERROR, \ >> + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ >> + (errmsg("must be superuser to use verification functions")))); } > > Please make this either a function or just copy - the macro doesn't buy > us anything except weirder looking syntax. It's actually from pageinspect, but okay. >> +#define CORRUPTION ERROR >> +#define CONCERN DEBUG1 >> +#define POSITION DEBUG2 >> +#endif > > Please remove this block and use the normal elevels. Okay. >> +/* >> + * State associated with verifying a B-Tree index >> + */ >> +typedef struct BtreeCheckState >> +{ >> + /* >> + * Unchanging state, established at start of verification: >> + * >> + * rel: B-Tree Index Relation >> + * exclusivelock: ExclusiveLock held on rel; else AccessShareLock >> + * checkstrategy: Buffer access strategy >> + * targetcontext: Target page memory context >> + */ >> + Relation rel; >> + bool exclusivelock; >> + BufferAccessStrategy checkstrategy; >> + MemoryContext targetcontext; > > Can you move the field comments to the individual fields? Separating > them makes it more likely to get out of date (obviously keep the > header). I'd also rename 'exclusivelock' to 'exclusivelylocked' or > such. Okay. >> + /* >> + * Mutable state, for verification of particular page: >> + * >> + * target: Main target page >> + * targetblock: Main target page's block number >> + * targetlsn: Main target page's LSN > > Same here. Okay. >> +PG_FUNCTION_INFO_V1(bt_index_check); >> +PG_FUNCTION_INFO_V1(bt_index_parent_check); > > I think it'd be good to avoid adding a lot of functions for each > additional type of check we can think of. I think adding a named 'bool > check_parents = false' argument or such would be better. Then we can > simply add more and more checks subsequently, and we can have > 'bool all_checks = false' so people can trivially check everything, > instead of having to add new checks in each new function. I strongly disagree. I would agree if the lock strength were the same in both cases, but it isn't. Varying the strength of heavyweight lock taken based on an argument to a function is a massive foot-gun IMV. I do think that that's the logical way to split out functionality. I don't think that there is likely to be a problem with adding stuff in the future. It'll either be something that we can do in passing anyway, in which case we can just add it to the existing checks harmlessly. Or, it will be something like a tool that verifies the heap is consistent with a given index, which is vastly more expensive in terms of CPU and I/O costs, and yet is probably no worse than bt_index_check() in terms of lock strength (AccessShareLock). My thinking here is that users really ought to use bt_index_check() in almost all cases, not bt_index_parent_check(), which is more of a tool for hackers that are developing new B-Tree features. The distinction is blurry, in fact, but the lock strength requirements of bt_index_parent_check() make it vastly less useful. Especially considering that it seems fairly unlikely that it will ever catch a problem that bt_index_check() would not have. bt_index_parent_check() will become a lot more important when we implement suffix truncation in internal pages. (I have a very rough patch that does this, actually -- it passes the regression tests, and verification by amcheck, but isn't anywhere near ready to post.) >> +/* >> + * bt_index_check(index regclass) >> + * >> + * Verify integrity of B-Tree index. >> + * >> + * Only acquires AccessShareLock on index relation. Does not consider >> + * invariants that exist between parent/child pages. >> + */ > I'm *VERY* strongly against releasing locks on user relations early. Why? pageinspect does this. I think it could be really handy for users that want to write SQL queries when using bt_index_parent_check(). It's a bit weird, in the sense that it's a special exception that has to be explained sometimes, but I think it is worth it for something like this. >> + /* >> + * We must lock table before index to avoid deadlocks. However, if the >> + * passed indrelid isn't an index then IndexGetRelation() will fail. >> + * Rather than emitting a not-very-helpful error message, postpone >> + * complaining, expecting that the is-it-an-index test below will fail. >> + */ >> + heapid = IndexGetRelation(indrelid, true); >> + if (OidIsValid(heapid)) >> + heaprel = heap_open(heapid, ShareLock); >> + else >> + heaprel = NULL; >> + >> + /* >> + * Open the target index relations separately (like relation_openrv(), but >> + * with heap relation locked first to prevent deadlocking). In hot standby >> + * mode this will raise an error. >> + */ >> + indrel = index_open(indrelid, ExclusiveLock); > > So yea, this is what I was complaining about re locking / index & table > relation IIRC. Note you're also deadlock prone by violating the rule > about always locking the relation first. This is based on brin_summarize_new_values(). I'm not doing a recheck, though, because I don't actually need the heap relation for anything (I suppose I imagined that that was okay, possibly failing to consider some subtlety beyond "consistency is good"). If I added a recheck for the heap relation in the style of brin_summarize_new_values(), would you consider this item closed, and all heavyweight lock concerns addressed? The comments here are almost a direct lift -- is brin_summarize_new_values() wrong to itself claim to be deadlock-safe?. (BTW, brin_summarize_new_values(), an SQL-callable function, also releases heavyweight locks early.) >> +/* >> + * btree_index_checkable() >> + * >> + * Basic checks about the suitability of a relation for checking as a B-Tree >> + * index. >> + */ >> +static void >> +btree_index_checkable(Relation rel) >> +{ >> + if (rel->rd_rel->relkind != RELKIND_INDEX || >> + rel->rd_rel->relam != BTREE_AM_OID) >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("only nbtree access method indexes are supported"), >> + errdetail("Relation \"%s\" is not a B-Tree index.", >> + RelationGetRelationName(rel)))); > > I think this message could stand some rephrasing. There's a significant > overlap between msg and detail, and they use different phrasing for > btree. Okay. >> + if (!IndexIsValid(rel->rd_index)) >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("cannot check index \"%s\"", >> + RelationGetRelationName(rel)), >> + errdetail("Index is not valid"))); > > I'm perfectly fine with not changing that, but IIUC it'd be fine to > weaken this to IndexIsLive() - right? I believe so, but prefer to be more restrictive in the absence of a good reason not to be. >> + * It is the caller's responsibility to acquire appropriate heavyweight lock on >> + * the index relation, and advise us if extra checks are safe when an >> + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent any >> + * kind of physical modification to the index structure, including >> + * modifications that VACUUM may make. > > It might make sense to be a bit more explicit about the structure bit > (which I assume refers to the fact that modifications to individual > pages are possible (re killitems)?). It's not really about killitems, which I explained won't matter here, but I will add a bit about that. >> + */ >> +static void >> +bt_check_every_level(Relation rel, bool exclusivelock) > > Hm, is every level accurate - this is more like checking every > "reachable" page, right? Every level should have at least one page that is reachable (not half-dead or fully dead), unless perhaps VACUUM marks the last leaf page in the entire index as half-dead and we land on that. I think that that's such a rare case that it shouldn't alter how we name this function at all, which really does do what it says almost always. Note that there are pretty severe restrictions on when vacuum can do page-level deletion of an internal page. This is what leads to index bloat with sparse deletion patterns (one problem I want to address by adding suffix truncation to internal pages at a later date). >> + /* >> + * Certain deletion patterns can result in "skinny" B-Tree indexes, where >> + * the fast root and true root differ. >> + * >> + * Start from the true root, not the fast root, unlike conventional index >> + * scans. This approach is more thorough, and removes the risk of >> + * following a stale fast root from the meta page. >> + */ >> + if (metad->btm_fastroot != metad->btm_root) >> + ereport(CONCERN, >> + (errcode(ERRCODE_DUPLICATE_OBJECT), >> + errmsg("fast root mismatch in index %s", >> + RelationGetRelationName(rel)), >> + errdetail_internal("Fast block %u (level %u) " >> + "differs from true root " >> + "block %u (level %u).", >> + metad->btm_fastroot, metad->btm_fastlevel, >> + metad->btm_root, metad->btm_level))); > > I'd weaken the message to something that sounds less like an actual > problem... - which this is not, right? No, it isn't a real problem. I'll do something about it. >> +/* >> + * bt_check_level_from_leftmost() >> + * >> + * Given a left-most block at some level, move right, verifying each page >> + * individually (with more verification across pages for "exclusivelock" >> + * callers). Caller should pass the true root page as the leftmost initially, >> + * working their way down by passing what is returned for the last call here >> + * until level 0 (leaf page level) was reached. >> + * >> + * Returns state for next call, if any. > > Wouldn't "'sets state for the next call..." be more accurate than > returning? Okay. >> +static BtreeLevel >> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level) >> +{ > >> + if (P_IGNORE(opaque)) >> + { >> + if (P_RIGHTMOST(opaque)) >> + ereport(CORRUPTION, >> + (errcode(ERRCODE_INDEX_CORRUPTED), >> + errmsg("block %u fell off the end of index \"%s\"", >> + current, RelationGetRelationName(state->rel)))); > > I think this could stand a bit more formal language. I disagree, because it's a direct lift from nbtree itself. >> + else >> + ereport(CONCERN, >> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >> + errmsg("block %u of index \"%s\" ignored", >> + current, RelationGetRelationName(state->rel)))); > > missing verb. Also, is this actually worth logging (even at such a low level)? I'm not following -- "ignore" is a verb. It may not be worth logging. I just don't know, and prefer to err on the side of including it. >> + /* >> + * Before beginning any non-trivial examination of level, establish >> + * state to be passed by caller for next call here, if this isn't >> + * last call/leaf level. > > "state to be passed by caller for next call here" is hard to > understand. "Prepare state for next bt_check_level_from_leftmost > invocation for the next level"? Okay. >> + * There should be at least one non-ignorable page per level. >> + */ > > I might have missed something, but you're not checking that, are you? No. As I said just now, I think it might not be true iff every single item (every item at the leaf level) has been deleted and is being vacuumed. If there was even one live item in one leaf page, it would be guaranteed to have a live parent, which would itself be guaranteed to if it wasn't the true root, and so on, until the true root. It's really hard to "lose" a level by having the true root and fast root differ, in general (of course, we can never truly lose a level -- the pages can never be reclaimed entirely from a deleted level's last page). I guess that this fast root stuff was added only for the benefit of cases where there is continuous bulk deletion and insertion. > "mutual" doesn't seem to add much here. Okay. >> + /* Try to detect circular links */ >> + if (current == leftcurrent || current == opaque->btpo_prev) >> + ereport(CORRUPTION, >> + (errcode(ERRCODE_INDEX_CORRUPTED), >> + errmsg("circular link chain found in block %u of index \"%s\"", >> + current, RelationGetRelationName(state->rel)))); > > That's not really reliable though, is it? I've seen corrupted link > chains (leading to endless vacuums and such) a number of times, so maybe > explicitly keeping track of already visited pages within a level would > be worthwhile? It's not reliable -- it's one of the "shot in the dark, but why not?" checks, which are the majority -- no reason to not be defensive. However, it's highly likely that cases like the one you describe would be caught by the cross-page check, or even high-key check, because they're probably explained by an underlying problem with transposed pages, maybe not even 8KiB aligned pages (e.g., buggy firmware controllers have been known to do this with their 4KiB pages, but you wouldn't necessarily notice that easily with an int4 index or similar). I feel relatively confident that we don't need to do better with that one, since it will never make a difference; the only case where that might be wrong that I can think of is where you have an awful lot of duplicates, which is unfortunately not that uncommon with Postgres. Even if it could help, that sounds like something that won't pay for itself. > Hm. I think it'd be useful if we also made sure (or in the caller) that > the page actually is at the correct level. I seem to recall seing a > case where downlinks were "circular" due to corruption, leading to stuck > scans. That's a good idea, because it is, if nothing else, an easy, low-overhead additional check. >> + * Note: This routine is not especially proactive in freeing memory. High >> + * watermark memory consumption is bound to some small fixed multiple of >> + * BLCKSZ, though. Caller should reset the current context between calls here. > > I'd shorten this to "Memory allocated in this routine is expected to be > released by the caller resetting the ->xxx/current context". Okay. >> +static void >> +bt_target_page_check(BtreeCheckState *state) >> +{ >> + OffsetNumber offset; >> + OffsetNumber max; >> + BTPageOpaque topaque; >> + >> + topaque = (BTPageOpaque) PageGetSpecialPointer(state->target); >> + max = PageGetMaxOffsetNumber(state->target); >> + >> + elog(POSITION, "verifying %u items on %s block %u", max, >> + P_ISLEAF(topaque) ? "leaf":"internal", state->targetblock); > > I think it's fine if you don't, but this could theoretically integrate > with the progress stuff added in 9.6 (as in > pgstat_progress_start_command()). That'd allow for a lot more > convenient display of this than thousands of debug messages or nothing. Never enough time. :-) >> + /* >> + * Loop over page items, but don't start from P_HIKEY (don't have iteration >> + * directly considering high key item, if any). > > I don't understand the parenthetical comment as phrased. It means that the high key item, which is physically the first item on the leaf level, and the second item on internal levels. (That have an explicit "minus infinity" downlink item), is skipped in the sense that it doesn't get its own iteration of the loop (note that there is only a high key on non-rightmost-at-level pages -- I like to say that the rightmost pages have an implicit/imaginary "positive infinity" high key, to illustrate the symmetry between high keys and downlinks. Although, we don't say that in any code.) > "which there must be for" sounds weird. > s/prefer to// Okay. > Does that kind of comment look reasonable after pgindent? Actually, > could you just pindent this patch (including typedefs.list additions) > and repair the bad looking damage? I'll check. > Do all these separate psprintfs have much of a point? It seems just as > easy to includ ethem in the actual ereport? I thought it looked very convoluted that way. >> + * next/sibling page. There may not be such an item available from >> + * sibling for various reasons, though (e.g., target is the rightmost >> + * page on level). >> + */ > > Is the comma after e.g. correct? </mega-pedanticism-from-esl-person> This is a matter of style. "e.g.," is generally considered American style. >> + ereport(CORRUPTION, >> + (errcode(ERRCODE_INDEX_CORRUPTED), >> + errmsg("cross page item order invariant violated for index \"%s\"", >> + RelationGetRelationName(state->rel)), >> + errdetail_internal("Last item on page tid=(%u,%u) " >> + "page lsn=%X/%X.", >> + state->targetblock, offset, >> + (uint32) (state->targetlsn >> 32), >> + (uint32) state->targetlsn))); > > I'm am *wondering* (as in not sure), if we could get rid of all the "for > index ..." bits and use an errcontext for that instead, which'd be > included in all messages. Doesn't seem worth it to me. >> + /* >> + * General notes on concurrent page splits and page deletion: >> + * >> + * Routines like _bt_search() don't require *any* page split interlock when >> + * descending the tree, including something very light like a buffer pin. >> + * That's why it's okay that we don't either. This avoidance of any need >> + * to "couple" buffer locks is the raison d' etre of the Lehman & Yao >> + * algorithm, in fact. > > raison d'etre comment seems more like something belonging to > nbtree/README than this, but I don't care much. I think it adds value. This is a learning tool for nbtree. That's a secondary goal, at least. We're *way* too shy at pointing out simple facts like this, that are of very significant consequence. It's only one sentence. > Maybe s/Ensuring/The reasons why they don't happen in ...'? > "on to this right sibling page"? > missing trailing 'the'? I'm really not a native speaker (and thus might > be off the mark), but making these sentences a bit clearer really > wouldn't hurt. Okay. The sentences are very complicated, but that's justified by the complexity of the underlying issue. I rewrote that text several times. > s/N.B./NB/ by precedence ;) > s/We prefer// Okay. >> + if (OFFSET_IS_MINUS_INFINITY(copaque, offset)) >> + continue; > > FWIW, I'd even convert this to an inline function rather than a macro. > Macros are useful when you have to do stuff you can't do otherwise, but > shouldn't be used much otherwise. Okay. >> + if (!invariant_key_less_than_equal_nontarget_offset(state, child, >> + targetkey, offset)) >> + ereport(CORRUPTION, >> + (errcode(ERRCODE_INDEX_CORRUPTED), > > What's your reasoning for differing sometimes using > ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE and sometimes ERRCODE_INDEX_CORRUPTED? The ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE ereports are not errors, and do not indicate corruption. -- Peter Geoghegan
В списке pgsql-hackers по дате отправления:
Предыдущее
От: Tom LaneДата:
Сообщение: Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal