Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Msg-id: CAH2-WzmTHoCsOmSgLg=yyft9LoERtuCKXyG2GZn+28PzonFA_g@mail.gmail.com
In reply to: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan (Matthias van de Meent <boekewurm+postgres@gmail.com>)
List: pgsql-hackers
On Sat, Nov 11, 2023 at 1:08 PM Matthias van de Meent
<boekewurm+postgres@gmail.com> wrote:
> Thanks. Here's my review of the btree-related code:

Attached is v7. The main focus in v7 is making the handling of required
inequality-strategy scan keys more explicit -- there is now an
understanding of inequalities shared by _bt_check_compare (the function
that becomes the guts of _bt_checkkeys) and the new
_bt_advance_array_keys function/state machine.

The big idea for v7 is to generalize how we handle required
equality-strategy scan keys (always required in both scan directions),
extending the same concept to deal with required inequality-strategy
scan keys (only ever required in one direction, which may or may not be
the scan direction). This led to my discovering and fixing a couple of
bugs related to inequality handling. These issues were of the same
general character as many others I've dealt with before now: they
involved subtle confusion about when and how to start another primitive
index scan, leading to the scan reading many more pages than strictly
necessary (potentially many more than master). In other words, cases
where we didn't give up and start another primitive index scan, even
though (with a repro of the issue in front of you) it's obviously not
sensible -- an accidental full index scan. While I'm still not
completely happy with the way that inequalities are handled, things in
this area are much improved in v7.

It should be noted that the patch isn't strictly guaranteed to always
read fewer index pages than master, for a given query plan and index.
This is by design. Though the patch comes close, it's not quite a
certainty. There are known cases where the patch reads the occasional
extra page (relative to what master would have done under the same
circumstances). These are cases where the implementation just cannot
know for sure whether the next/sibling leaf page has key space covered
by any of the scan's array keys (at least not in a way that seems
practical).
The implementation has simple heuristics that infer ("infer" being a
polite word for "make an educated guess about") what will be found on
the next page. Theoretically we could be more conservative about this,
but that seems like a bad idea to me. It's really easy to find cases
where the maximally conservative approach loses by a lot, and really
hard to show cases where it wins at all.

These heuristics are more or less a limited form of the heuristics that
skip scan would need -- a *very* limited form. We're still
conservative. Here's how it works, at a high level: if the scan can
make it all the way to the end of the page without having to start a
new primitive index scan (before reaching the end), and then finds that
"finaltup" itself (which is usually the page high key) advances the
array keys, we speculate: we move on to the sibling page. It's just
about possible that we'll discover (once on the next page) that
finaltup actually advanced the array keys by so much (in one single
advancement step) that the current/new keys cover key space beyond the
sibling page we just arrived at. In that case the sibling page access
will have been wasted (though I prefer to think of it as a cost of
doing business).

I go into a lot of detail on the trade-offs in this area in comments at
the end of the new _bt_checkkeys(), just after it calls
_bt_advance_array_keys(). Hopefully this is reasonably clear. It's
always much easier to understand these things when you've written lots
of test cases, though. So I wouldn't at all be surprised to hear that
my explanation needs more work. I suspect that I'm spending more time
on the topic than it actually warrants, but you have to spend a lot of
time on it yourself to be able to see why that is.

> > +++ b/src/backend/access/nbtree/nbtsearch.c
> > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
> > * set flag to true if all required keys are satisfied and false
> > * otherwise.
> > */
> > - (void) _bt_checkkeys(scan, itup, indnatts, dir,
> > -                      &requiredMatchedByPrecheck, false);
> > + _bt_checkkeys(scan, &pstate, itup, false, false);
> > + requiredMatchedByPrecheck = pstate.continuescan;
> > + pstate.continuescan = true; /* reset */
>
> The comment above the updated section needs to be updated.

Updated.

> > @@ -1625,8 +1633,9 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
> > * set flag to true if all required keys are satisfied and false
> > * otherwise.
> > */
> > - (void) _bt_checkkeys(scan, itup, indnatts, dir,
> > -                      &requiredMatchedByPrecheck, false);
> > + _bt_checkkeys(scan, &pstate, itup, false, false);
>
> This 'false' finaltup argument is surely wrong for the rightmost
> page's rightmost tuple, no?

Not in any practical sense, since finaltup means "the tuple that you
should use to decide whether to go to the next page or not", and a
rightmost page doesn't have a next page.

There are exactly two ways that the top-level scan can end (not to be
confused with the primitive scan), at least in v7. They are:

1. The state machine can exhaust the scan's array keys, ending the
top-level scan.

2. The scan can just run out of pages, without ever running out of
array keys (some array keys can sort higher than any real value from
the index). This is just like how an index scan ends when it lacks any
required scan keys that could terminate the scan, and eventually runs
out of pages to scan (think of an index-only scan that performs a full
scan of the index, feeding into a group aggregate).

Note that it wouldn't be okay if the design relied on _bt_checkkeys
advancing and exhausting the array keys -- we really do need both 1 and
2 to deal with various edge cases. For example, there is no way that
we'll ever be able to call _bt_checkkeys with a completely empty index.
It simply doesn't have any tuples at all.
In fact, it doesn't even have any pages (apart from the metapage), so
clearly we can't expect any calls to _bt_readpage (much less
_bt_checkkeys).

> > +++ b/src/backend/access/nbtree/nbtutils.c
> > @@ -357,6 +431,46 @@ _bt_preprocess_array_keys(IndexScanDesc scan)
> > + /* We could pfree(elem_values) after, but not worth the cycles */
> > + num_elems = _bt_merge_arrays(scan, cur,
> > +                              (indoption[cur->sk_attno - 1] & INDOPTION_DESC) != 0,
> > +                              prev->elem_values, prev->num_elems,
> > +                              elem_values, num_elems);
>
> This code can get hit several times when there are multiple = ANY
> clauses, which may result in repeated leakage of these arrays during
> this scan. I think cleaning up may well be worth the cycles when the
> total size of the arrays is large enough.

They won't leak, because the memory is allocated in the same dedicated
memory context. That said, I added a pfree(). It couldn't hurt.

> > @@ -496,6 +627,48 @@ _bt_sort_array_elements(IndexScanDesc scan, ScanKey skey,
> >                        _bt_compare_array_elements, &cxt);
> > +_bt_merge_arrays(IndexScanDesc scan, ScanKey skey, bool reverse,
> > +                 Datum *elems_orig, int nelems_orig,
> > +                 Datum *elems_next, int nelems_next)
> > [...]
> > + /*
> > +  * Incrementally copy the original array into a temp buffer, skipping over
> > +  * any items that are missing from the "next" array
> > +  */
>
> Given that we only keep the members that both arrays have in common,
> the result array will be a strict subset of the original array. So, I
> don't quite see why we need the temporary buffer here - we can reuse
> the entries of the elems_orig array that we've already compared
> against the elems_next array.

This code path is only hit when the query was written on autopilot,
since it must have contained redundant SAOPs for the same index column
-- a glaring inconsistency. Plus these arrays just aren't very big in
practice (despite my concerns about huge arrays).
Plus there is only one of these array-specific preprocessing steps per
btrescan. So I don't think that it's worth going to too much trouble
here.

> We may want to optimize this further by iterating over only the
> smallest array: With the current code, [1, 2] + [1....1000] is faster
> to merge than [1..1000] + [1000, 1001], because 2 * log(1000) is much
> smaller than 1000 * log(2). In practice this may matter very little,
> though.
> An even better optimized version would do a merge join on the two
> arrays, rather than loop + binary search.

v7 allocates the temp buffer using the size of whichever array is the
smaller of the two, just because it's an easy marginal improvement.

> > @@ -515,6 +688,161 @@ _bt_compare_array_elements(const void *a, const void *b, void *arg)
> > [...]
> > +_bt_binsrch_array_skey(FmgrInfo *orderproc,
>
> Is there a reason for this complex initialization of high/low_elem,
> rather than this easier to understand and more compact
> initialization?:
>
> + low_elem = 0;
> + high_elem = array->num_elems - 1;
> + if (cur_elem_start)
> + {
> +     if (ScanDirectionIsForward(dir))
> +         low_elem = array->cur_elem;
> +     else
> +         high_elem = array->cur_elem;
> + }

I agree that it's better your way. Done that way in v7.

> I think the conditional _bt_parallel_done(scan) feels misplaced here,
> as the comment immediately below indicates the scan is to be
> terminated after that comment. So, either move this _bt_parallel_done
> call outside the function (which by name would imply it is read-only,
> without side effects like this) or move it below the comment
> "terminate the top-level scan".

v7 moves the comment up, so that it's just before the
_bt_parallel_done() call.

> > +_bt_advance_array_keys(IndexScanDesc scan, BTReadPageState *pstate,
> > [...]
> > + * Set up ORDER 3-way comparison function and array state
> > [...]
> > + * Optimization: Skip over non-required scan keys when we know that
>
> These two sections should probably be swapped, as the skip makes the
> setup useless.

Not quite: we need to increment arrayidx for later loop iterations/scan
keys.

> Also, the comment here is wrong; the scan keys that are skipped are
> 'required', not 'non-required'.

Agreed. Fixed.

> > +++ b/src/test/regress/expected/join.out
> > @@ -8620,10 +8620,9 @@ where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 and j2.id1 >= any (array[1,5]);
> >   Merge Cond: (j1.id1 = j2.id1)
> >   Join Filter: (j2.id2 = j1.id2)
> >   -> Index Scan using j1_id1_idx on j1
> > -  -> Index Only Scan using j2_pkey on j2
> > +  -> Index Scan using j2_id1_idx on j2
> >      Index Cond: (id1 >= ANY ('{1,5}'::integer[]))
> > -    Filter: ((id1 % 1000) = 1)
> > -(7 rows)
> > +(6 rows)
>
> I'm a bit surprised that we don't have the `id1 % 1000 = 1` filter
> anymore. The output otherwise matches (quite possibly because the
> other join conditions don't match) and I don't have time to
> investigate the intricacies between IOS vs normal IS, but this feels
> off.

This happens because the new plan uses a completely different index --
one that happens to be a partial index whose predicate exactly matches
the old plan's filter quals. That factor makes the filter quals
unnecessary. That's all this is.

> As for the planner changes, I don't think I'm familiar enough with the
> planner to make any authoritative comments on this. However, it does
> look like you've changed the meaning of 'amsearcharray', and I'm not
> sure it's OK to assume all indexes that support amsearcharray will
> also support this new assumption of ordered retrieval of SAOPs. For
> one, the pgroonga extension [0] does mark amcanorder+amsearcharray.

The changes that I've made to the planner are subtractive. We more or
less go back to how things were just after the initial work on nbtree
amsearcharray support.
That work was (at least tacitly) assumed to have no impact on ordered
scans. Because why should it have? What other type of index clause has
ever affected what seems like a rather unrelated thing (namely the sort
order of the scan)? The oversight was understandable.

The kinds of plans that master cannot produce output for in standard
index order are really silly plans, independent of this issue; it makes
zero sense to allow a non-required array scan key to affect how or when
we skip. The code that I'm removing from the planner is code that quite
obviously assumes nbtree-like behavior. So I'm taking away code like
that, rather than adding new code like that.

That said, I am really surprised that any extension creates an index AM
with amcanorder=true (not to be confused with amcanorderbyop=true,
which is less surprising). That means that it promises the planner that
it behaves just like nbtree. To quote the docs, it must have
"btree-compatible strategy numbers for their [its] equality and
ordering operators". Is that really something that pgroonga even
attempts? And if so, why?

I also find it bizarre that pgroonga's handler-stated capabilities
include "amcanunique=true". So pgroonga is a full text search engine,
but it also supports unique indexes? I find that particularly hard to
believe, and suspect that the way that they set things up in the AM
handler just isn't very well thought out.

--
Peter Geoghegan