Обсуждение: Questions regarding Index AMs and natural ordering
Hi, Over in [0] and [1] there are patches that touch on the topic of 'natual ordering' index retrieval, and [2] also touches on the topic. For those patches, I've been looking at how the planner and executor indicate to index AMs that they expects the output to be ordered, and how this ordering should work. I've mostly found how it works for index_key opr constant, but I've yet to find a good mental model for how the planner handles indexes that can expose the 'intrinsic order' of data, i.e. indexes with `amcanorder=true`, because there is very little (if any) real documentation on what is expected from indexes when it advertises certain features, and how the executor signals to the AM that it wants to make use of those features. For example, btree ignores any ordering scan keys that it is given in btrescan, which seems fine for btree because the ordering of a btree is static (and no other order than that is expected apart from it's reverse order), but this becomes problematic for other indexes that could return ordered data but would prefer not to have to go through the motions of making sure the return order is 100% correct, rather than a k-sorted sequence, or just the matches to the quals (like GIST). Is returning index scan results in (reverse) natural order not optional but required with amcanorder? If it is required, why is the am indicator called 'canorder' instead of 'willorder', 'doesorder' or 'isordered'? Alternatively, if an am should be using the order scan keys from .amrescan and natural order scans also get scan keys, is there some place I find the selection process for ordered index AMs, and how this ordering should be interepreted? There are no good examples available in core code because btree is special-cased, and there are no other in-tree AMs that have facilities where both `amcanorderbyop` and `amcanorder` are set. I did read through indexam.sgml, but that does not give a clear answer on this distinction of 'amcanorder' having required ordered results or not, nor on how to interpret amrescan's orderbys argument. I also looked at planner code where it interacts with amcanorder / amcanorderbyop, but I couldn't wrap my head around its interactions with indexes, either (more specifically, the ordering part of those indexes) due to the complexity of the planner and the many layers that the various concepts are passed through. The README in backend/optimizer didn't answer this question for me, either. Kind regards, Matthias van de Meent Neon (https://neon.tech) [0] https://www.postgresql.org/message-id/flat/EB2AF704-70FC-4D73-A97A-A7884A0381B5%40kleczek.org [1] https://www.postgresql.org/message-id/flat/CAH2-Wz%3DksvN_sjcnD1%2BBt-WtifRA5ok48aDYnq3pkKhxgMQpcw%40mail.gmail.com [2] https://www.postgresql.org/message-id/flat/e70fa091-e338-1598-9de4-6d0ef6b693e2%40enterprisedb.com
On Thu, Nov 23, 2023 at 9:16 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > For example, btree ignores any ordering scan keys that it is given in > btrescan, which seems fine for btree because the ordering of a btree > is static (and no other order than that is expected apart from it's > reverse order), but this becomes problematic for other indexes that > could return ordered data but would prefer not to have to go through > the motions of making sure the return order is 100% correct, rather > than a k-sorted sequence, or just the matches to the quals (like > GIST). Is returning index scan results in (reverse) natural order not > optional but required with amcanorder? If it is required, why is the > am indicator called 'canorder' instead of 'willorder', 'doesorder' or > 'isordered'? I don't know. I have a hard time imagining an index AM that is amcanorder=true that isn't either nbtree, or something very similar (so similar that it seems unlikely that anybody would actually go to the trouble of implementing it from scratch). You didn't mention support for merge joins. That's one of the defining characteristics of an amcanorder=true index AM, since an "ammarkpos/amrestrpos function need only be provided if the access method supports ordered scans". It's hard to imagine how that could work with a loosely ordered index. It seems to imply that the scan must always work with a simple linear order. Cases where the planner uses a merge join often involve an index path with an "interesting sort order" that "enables" the merge join. Perhaps most of the alternative plans (that were almost as cheap as the merge join plan) would have had to scan the same index in the same way anyway, so it ends up making sense to use a merge join. The merge join can get ordered results from the index "at no extra cost". (All of this is implicit, of course -- the actual reason why the planner chose the merge join plan is because it worked out to be the cheapest plan.) My impression is that this structure is fairly well baked in -- which the optimizer/README goes into. That is, the planner likes to think of paths as having an intrinsic order that merge joins can take advantage of -- merge joins tend to win by being "globally optimal" without being "locally optimal". So generating extra index paths that don't have any intrinsic order (but can be ordered using a special kind of index scan) seems like it might be awkward to integrate. I'm far from an expert on the planner, so take anything that I say about it with a grain of salt. > Alternatively, if an am should be using the order scan keys from > .amrescan and natural order scans also get scan keys, is there some > place I find the selection process for ordered index AMs, and how this > ordering should be interepreted? There are no good examples available > in core code because btree is special-cased, and there are no other > in-tree AMs that have facilities where both `amcanorderbyop` and > `amcanorder` are set. The general notion of a data type's sort order comes from its default btree operator class, so the whole idea of a generic sort order is deeply tied to the nbtree AM. That's why we sometimes have btree operator classes for types that you'd never actually want to index (at least not using a btree index). -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Thu, Nov 23, 2023 at 9:16 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: >> Is returning index scan results in (reverse) natural order not >> optional but required with amcanorder? If it is required, why is the >> am indicator called 'canorder' instead of 'willorder', 'doesorder' or >> 'isordered'? > I don't know. I have a hard time imagining an index AM that is > amcanorder=true that isn't either nbtree, or something very similar > (so similar that it seems unlikely that anybody would actually go to > the trouble of implementing it from scratch). Agreed on that, but I don't have that hard a time imagining cases where it might be useful for btree not to guarantee ordered output. IIRC, it currently has to do extra pushups to ensure that behavior in ScalarArrayOp cases. We've not bothered to expand the planner infrastructure to distinguish "could, but doesn't" paths for btree scans, but that's more about it not being a priority than because it wouldn't make sense. If we did put work into that, we'd probably generate multiple indexscan Paths for the same index and same index conditions, some of which are marked with sort ordering PathKeys and some of which aren't (and have a flag that would eventually tell the index AM not to bother with sorting at runtime). > The general notion of a data type's sort order comes from its default > btree operator class, so the whole idea of a generic sort order is > deeply tied to the nbtree AM. Right. There hasn't been a reason to decouple that, just as our notions of hashing are tied to the hash AM. This doesn't entirely foreclose other AMs that handle sorted output, but it constrains them to use btree's opclasses to represent the ordering. regards, tom lane
On Thu, Nov 23, 2023 at 11:15 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Agreed on that, but I don't have that hard a time imagining cases > where it might be useful for btree not to guarantee ordered output. > IIRC, it currently has to do extra pushups to ensure that behavior > in ScalarArrayOp cases. We've not bothered to expand the planner > infrastructure to distinguish "could, but doesn't" paths for btree > scans, but that's more about it not being a priority than because it > wouldn't make sense. I'm glad that you brought that up, because it's an important issue for my ScalarArrayOp patch (which Matthias referenced). My patch simply removes this restriction from the planner -- now ScalarArrayOp clauses aren't treated as a special case when generating path keys. This works in all cases because the patch generalizes nbtree's approach to native ScalarArrayOp execution, allowing index scans to work as one continuous index scan in all cases. As you might recall, the test case that exercises the issue is: SELECT thousand, tenthous FROM tenk1 WHERE thousand < 2 AND tenthous IN (1001,3000) ORDER BY thousand; It doesn't actually make much sense to execute this as two primitive index scans, though. The most efficient approach is to perform a single index descent, while still being able to use a true index qual for "tenthous" (since using a filter qual is so much slower due to the overhead of accessing the heap just to filter out non-matching tuples). That's what the patch does. This would be true even without the "ORDER BY" -- accessing the leaf page once is faster than accessing it twice (same goes for the root). So I see no principled reason why we'd ever really want to start a primitive index scan that wasn't "anchored to an equality constraint". This is reliably faster, while also preserving index sort order, almost as a side-effect. -- Peter Geoghegan
On Thu, 23 Nov 2023 at 19:52, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Nov 23, 2023 at 9:16 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: > > For example, btree ignores any ordering scan keys that it is given in > > btrescan, which seems fine for btree because the ordering of a btree > > is static (and no other order than that is expected apart from it's > > reverse order), but this becomes problematic for other indexes that > > could return ordered data but would prefer not to have to go through > > the motions of making sure the return order is 100% correct, rather > > than a k-sorted sequence, or just the matches to the quals (like > > GIST). Is returning index scan results in (reverse) natural order not > > optional but required with amcanorder? If it is required, why is the > > am indicator called 'canorder' instead of 'willorder', 'doesorder' or > > 'isordered'? > > I don't know. I have a hard time imagining an index AM that is > amcanorder=true that isn't either nbtree, or something very similar > (so similar that it seems unlikely that anybody would actually go to > the trouble of implementing it from scratch). Well, BRIN (with minmax opclasses) could return ordered results if it needs to (see [0]; though that implements it as a distinct plan node). Ordering the tuples correctly takes quite some effort, but is quite likely to use less effort and/or scratch space than a table/bitmap scan + sort, because we won't have to manage all tuples of the table at the same time. However, it woould be extremely expensive if the planner expects this to always return the data in btree order. For GIST with the btree_gist opclasses it is even easier to return ordered results (patch over at [1]), but then still it prefers not to have to make a strict ordering as it adds overhead vs 'normal' index scans. Also, was that a confirmation that amcanorder is a requirement for the AM to return data in index order (unless amrescan's orderbys is not null), or just a comment on the reason for the name of 'amcanorder' being unclear? > You didn't mention support for merge joins. That's one of the defining > characteristics of an amcanorder=true index AM, since an > "ammarkpos/amrestrpos function need only be provided if the access > method supports ordered scans". It's hard to imagine how that could > work with a loosely ordered index. It seems to imply that the scan > must always work with a simple linear order. I probably didn't think of merge join support because 'merge join' is not mentioned as such in the index AM api - I knew of ammarkpos/amrestrpos, but hadn't yet gone into detail what they're used for. > Cases where the planner uses a merge join often involve an index path > with an "interesting sort order" that "enables" the merge join. > Perhaps most of the alternative plans (that were almost as cheap as > the merge join plan) would have had to scan the same index in the same > way anyway, so it ends up making sense to use a merge join. The merge > join can get ordered results from the index "at no extra cost". (All > of this is implicit, of course -- the actual reason why the planner > chose the merge join plan is because it worked out to be the cheapest > plan.) Couldn't the merge join (or scan node) use a tuple store to return to some earlier point in the index scan when a native version of markpos is not easily supported? It would add (potentially very significant) IO overhead, but it would also allow merge joins on ordered paths that currently don't have a natural way of marking their position. > > Alternatively, if an am should be using the order scan keys from > > .amrescan and natural order scans also get scan keys, is there some > > place I find the selection process for ordered index AMs, and how this > > ordering should be interepreted? There are no good examples available > > in core code because btree is special-cased, and there are no other > > in-tree AMs that have facilities where both `amcanorderbyop` and > > `amcanorder` are set. > > The general notion of a data type's sort order comes from its default > btree operator class, so the whole idea of a generic sort order is > deeply tied to the nbtree AM. That's why we sometimes have btree > operator classes for types that you'd never actually want to index (at > least not using a btree index). Yes, the part where btree opclasses determine a type's ordering is clear. But what I'm looking for is "how do I, as an index AM implementation, get the signal that I need to return column-ordered data?" If that signal is "index AM marked amcanorder == index must return ordered data", then that's suboptimal for the index AM writer, but understandable. If amcanorder does not imply always ordered retrieval, then I'd like to know how it is signaled to the AM. But as of yet I've not found a clear and conclusive answer either way. Kind regards, Mattthias van de Meent. [0] https://www.postgresql.org/message-id/flat/e70fa091-e338-1598-9de4-6d0ef6b693e2%40enterprisedb.com [1] https://www.postgresql.org/message-id/flat/EB2AF704-70FC-4D73-A97A-A7884A0381B5%40kleczek.org
On Fri, Nov 24, 2023 at 8:44 AM Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Also, was that a confirmation that amcanorder is a requirement for the > AM to return data in index order (unless amrescan's orderbys is not > null), or just a comment on the reason for the name of 'amcanorder' > being unclear? It was both, I suppose. > > Cases where the planner uses a merge join often involve an index path > > with an "interesting sort order" that "enables" the merge join. > > Perhaps most of the alternative plans (that were almost as cheap as > > the merge join plan) would have had to scan the same index in the same > > way anyway, so it ends up making sense to use a merge join. The merge > > join can get ordered results from the index "at no extra cost". (All > > of this is implicit, of course -- the actual reason why the planner > > chose the merge join plan is because it worked out to be the cheapest > > plan.) > > Couldn't the merge join (or scan node) use a tuple store to return to > some earlier point in the index scan when a native version of markpos > is not easily supported? You can materialize any executor node, allowing it to be accessed in random order, as required by merge joins (in many cases). But any index AM that relies on that clearly isn't an amcanorder=true index AM, which is what you asked about. Whether or not you should actually care about whether your index AM can meet the expectations that the API has for amcanorder=true index AMs is far from clear. In the end the design has to make sense, and integrate into the existing API in some way or other -- but the details are likely to depend on things that nobody thought of just yet. I don't think that it's all that useful to discuss it while everything remains so abstract. > It would add (potentially very significant) > IO overhead, but it would also allow merge joins on ordered paths that > currently don't have a natural way of marking their position. I don't know. Maybe it's possible. It might even be practically achievable, while delivering a compelling benefit to users. This is the kind of thing that I don't tend to speculate about, at least not before I have more concrete information about what is possible through some kind of prototyping process. > Yes, the part where btree opclasses determine a type's ordering is > clear. But what I'm looking for is "how do I, as an index AM > implementation, get the signal that I need to return column-ordered > data?" If that signal is "index AM marked amcanorder == index must > return ordered data", then that's suboptimal for the index AM writer, > but understandable. If amcanorder does not imply always ordered > retrieval, then I'd like to know how it is signaled to the AM. But as > of yet I've not found a clear and conclusive answer either way. I suppose that amcanorder=true cannot mean that, since we have the SAOP path key thing (at least for now). But that wasn't true until bug fix commit 807a40c5, so prior to that point I guess it was tacitly the case that B-Tree scans always returned ordered results. Here we are trying to divine the intentions of an "abstract" API by discussing highly obscure bugs that were either bugs in nbtree, or bugs in what we once expected of nbtree. Seems more than a bit silly to me. I'm not suggesting that the idea of an abstract API doesn't need to be taken seriously at all -- far from it. Just that "case law precedent" can play an important role in how it is interpreted. The fact that some things remain ambiguous isn't necessarily a problem that needs to be solved. If there is a real problem, then IMV it should always start with a concrete complaint about how the API doesn't support certain requirements. In other words, I'm of the opinion that such a complaint should end with the API, as opposed to starting with it. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Fri, Nov 24, 2023 at 8:44 AM Matthias van de Meent > <boekewurm+postgres@gmail.com> wrote: >> Yes, the part where btree opclasses determine a type's ordering is >> clear. But what I'm looking for is "how do I, as an index AM >> implementation, get the signal that I need to return column-ordered >> data?" If that signal is "index AM marked amcanorder == index must >> return ordered data", then that's suboptimal for the index AM writer, >> but understandable. If amcanorder does not imply always ordered >> retrieval, then I'd like to know how it is signaled to the AM. But as >> of yet I've not found a clear and conclusive answer either way. > I suppose that amcanorder=true cannot mean that, since we have the > SAOP path key thing (at least for now). As things stand, amcanorder definitely means that the index always returns ordered data, since the planner will unconditionally apply pathkeys to the indexscan Paths generated for it (see plancat.c's get_relation_info which sets up info->sortopfamily, and build_index_pathkeys which uses that). We could reconsider that definition if there were a reason to, but so far it hasn't seemed interesting. The hack in 807a40c5 is a hack, without a doubt, but that doesn't necessarily mean we should spend time on generalizing it, and even less that we should have done so in 2012. Maybe by now there are non-core index AMs that have cases where it's worth being pickier. We'd have to invent some API that allows the index AM to have a say in what pathkeys get applied. regards, tom lane
On Fri, Nov 24, 2023 at 10:58 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Peter Geoghegan <pg@bowt.ie> writes: > > I suppose that amcanorder=true cannot mean that, since we have the > > SAOP path key thing (at least for now). > > As things stand, amcanorder definitely means that the index always > returns ordered data, since the planner will unconditionally apply > pathkeys to the indexscan Paths generated for it (see plancat.c's > get_relation_info which sets up info->sortopfamily, and > build_index_pathkeys which uses that). We could reconsider that > definition if there were a reason to, but so far it hasn't seemed > interesting. The hack in 807a40c5 is a hack, without a doubt, but > that doesn't necessarily mean we should spend time on generalizing it, > and even less that we should have done so in 2012. That is certainly my preferred interpretation. Not least because I am in the process of removing the hack completely, which has shown very significant benefits for queries with SAOPs that now get to depend on the sort order in some crucial way. > Maybe by now there > are non-core index AMs that have cases where it's worth being pickier. > We'd have to invent some API that allows the index AM to have a say in > what pathkeys get applied. Matthias and I recently discussed a design sketched by James Coleman some years back, which Matthias seemed particularly interested in: https://www.postgresql.org/message-id/flat/CAAaqYe-SsHgXKXPpjn7WCTUnB_RQSxXjpSaJd32aA%3DRquv0AgQ%40mail.gmail.com James' test case benefits from my own patch in the obvious way: it can use SAOP index quals, while still being able to get an ordered scan that terminates early via a LIMIT. But the design he sketched proposes to go much further than that -- it's far more targeted. His design reconstructs a useful sort order by "multiplexing" different parts of the index (for different SAOP constants), merging together multiple streams of ordered tuples into one stream. This means that the index produces tuples in a useful order, sufficient to terminate the scan early -- but it's a sort order that doesn't match the index at all. Obviously, that's a totally novel idea. It's possible that something like that might just make sense -- it's theoretically optimal, at least. My guess is that if it really did happen then the planner would treat it like the special case that it is. It very much reminds me of loose index scan, where the index AM API has to be revised so that high level semantic information can be pushed down into the index AM. If something like that can offer stellar performance, that just isn't achievable in any other way, then I guess it's worth accepting the cost of an uglified index AM API. Whether or not such a feature really can be that compelling likely depends on a myriad of factors that we cannot possibly anticipate ahead of time. There are just too many things in this general area that *might* make sense someday. As we discussed back in 2022, I think that MDAM style "skip scan" (meaning support for skipping around an index given a query with "missing key predicates") shouldn't be a special case in the planner -- only costing needs to know anything about skipping. In general I find that it's most useful to classify all of these techniques according to whether or not they are compatible with the orthodox definition of amcanorder that you described. In other words, to classify them based on whether they involve pushing down high level semantic information to the index AM. -- Peter Geoghegan
On Fri, 24 Nov 2023, 19:58 Tom Lane, <tgl@sss.pgh.pa.us> wrote: > > Peter Geoghegan <pg@bowt.ie> writes: > > On Fri, Nov 24, 2023 at 8:44 AM Matthias van de Meent > > <boekewurm+postgres@gmail.com> wrote: > >> Yes, the part where btree opclasses determine a type's ordering is > >> clear. But what I'm looking for is "how do I, as an index AM > >> implementation, get the signal that I need to return column-ordered > >> data?" If that signal is "index AM marked amcanorder == index must > >> return ordered data", then that's suboptimal for the index AM writer, > >> but understandable. If amcanorder does not imply always ordered > >> retrieval, then I'd like to know how it is signaled to the AM. But as > >> of yet I've not found a clear and conclusive answer either way. > > > I suppose that amcanorder=true cannot mean that, since we have the > > SAOP path key thing (at least for now). > > As things stand, amcanorder definitely means that the index always > returns ordered data, since the planner will unconditionally apply > pathkeys to the indexscan Paths generated for it (see plancat.c's > get_relation_info which sets up info->sortopfamily, and > build_index_pathkeys which uses that). We could reconsider that > definition if there were a reason to, but so far it hasn't seemed > interesting. For GIST there is now a case for improving the support for optionally ordered retrieval, as there is a patch that tries to hack ORDER BY support into GIST. Right now that patch applies (what I consider) a hack by implicitly adding an operator ordering clause for ORDER BY column with the column type's btree ordering operator, but with improved APIs that shouldn't need such a hacky approach. > The hack in 807a40c5 is a hack, without a doubt, but > that doesn't necessarily mean we should spend time on generalizing it, > and even less that we should have done so in 2012. Maybe by now there > are non-core index AMs that have cases where it's worth being pickier. > We'd have to invent some API that allows the index AM to have a say in > what pathkeys get applied. I think that would be quite useful, as it would allow indexes to return ordered results in other orders than the defined key order, and it would allow e.g. BRIN to run its sort for ordered retrieval inside the index scan node (rather than requiring its own sort node type). CC: Tomas, maybe you have some ideas about this as well? What was the reason for moving BRIN-assisted sort into its own node? Was there more to it than "BRIN currently doesn't have amgettuple, and amgettuple can't always be used"? Kind regards, Matthias van de Meent