Discussion: Using Btree to Provide Sorting on Suffix Keys with LIMIT
tl;dr: I'd like to teach btrees to support returning ordered results where the ordering is only a suffix of the index keys and the query includes a scalar array op qual on the prefix key of the index.

Suppose we have the following schema:

CREATE TABLE foos(bar_fk integer, created_at timestamp);
CREATE INDEX index_foos_on_bar_fk_and_created_at ON foos(bar_fk, created_at);

and seed it with the following:

INSERT INTO foos (bar_fk, created_at)
SELECT i % 1000, now() - (random() * '5 years'::interval)
FROM generate_series(1, 500000) t(i);

then execute the following query:

SELECT *
FROM foos
WHERE bar_fk IN (1, 2, 3)
ORDER BY created_at
LIMIT 50;

Currently we get a query plan (I've disabled bitmap scans for this test) like:

Limit  (cost=5197.26..5197.38 rows=50 width=12) (actual time=2.212..2.222 rows=50 loops=1)
  ->  Sort  (cost=5197.26..5201.40 rows=1657 width=12) (actual time=2.211..2.217 rows=50 loops=1)
        Sort Key: created_at
        Sort Method: top-N heapsort  Memory: 27kB
        ->  Index Only Scan using index_foos_on_bar_fk_and_created_at on foos  (cost=0.42..5142.21 rows=1657 width=12) (actual time=0.025..1.736 rows=1500 loops=1)
              Index Cond: (bar_fk = ANY ('{1,2,3}'::integer[]))
              Heap Fetches: 1500
Planning time: 0.137 ms
Execution time: 2.255 ms

Note that the index scan (or bitmap scan) nodes return all 1500 rows matching `bar_fk IN (1,2,3)`. After all rows are returned, that total set is ordered, and finally the LIMIT is applied. While only 50 rows were requested, 30x that were fetched from the heap.

I believe it is possible to use the index `index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN (1,2,3)` qualifier and (at least partially; more on that later) the ordering `ORDER BY created_at` while fetching fewer than all rows matching the qualifier.

Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in (...) order by c` and `a in (...) and b = 1 order by c` (and further similar derivations with increasing numbers of equality quals).
Note: Another (loosely) related problem is that sorting can't currently take advantage of cases where an index provides a partial (prefix of requested pathkeys) ordering.

Proposal 1:

General idea: teach btrees to apply LIMIT internally so that we bound the number of rows fetched and sorted to <number of array op elements> * <limit>.

Pros:
- Presumably simpler to implement.
- Ought to be faster (than master) in virtually all cases (or at least the penalty in the worst case is extremely small).

Cons:
- Doesn't capture all of the theoretical value. For example, for an array of size 100, limit 25, and 100 values per array key, the current code fetches 10,000 values, this proposal would fetch 2,500, and the best case is 25. In this sample scenario we fetch 25% of the original tuples, but we also fetch 100x the theoretical minimum required.
- Index scan node has to learn about limit (this feels pretty dirty).
- Still needs a sort node.

Proposal 2:

General idea: find the "first" (or last, depending on sort order) index tuple for each array op element. Sort those index tuples by the suffix keys. Return tuples from the first of these sorted index tuple locations until we find one that's past the next ordered index tuple. Continue this round-robin approach until we exhaust all tuples.

Pros:
- Best case is significantly improved over proposal 1.
- Always fetches the minimal number of tuples required by the limit.
- When ordered values are not evenly distributed it should be even faster (just continue to pull tuples for the array value with the lowest order value).
- Index scan node remains unaware of limit.
- Doesn't need a sort node.

Cons:
- Presumably more difficult to implement.
- Degenerate cases: when ordered values are evenly distributed we may have to re-search from the top of the tree for each tuple (so the worst case is when few tuples match within each prefix).

---

Questions:
- Do we need a new index access method for this? Or do we shoehorn it into the existing index scan? (Perhaps the answer depends on which strategy is chosen?)
- Similarly, do we want a new scan node type? (This brings up the same kinds of questions as in the skip scan discussion about multiplying node types.)
- Is holding a pin on multiple index pages acceptable? Or do we avoid that at the cost of more frequent searches from the top of the tree?
- If we have to search from the top of the tree, does the "many duplicates" case become degenerate (to put it another way, how do we search to the proper next tuple in a non-unique index)?

Status: I've begun work on proposal 2 as I believe it shows the most potential gain (though it also has higher complexity). I don't have a patch in a state worth showing yet, but I wanted to start the discussion on the design considerations while I continue work on the patch.

- James Coleman
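To make proposal 2 concrete, here's a toy model of the round-robin merge in Python (purely illustrative, not nbtree code: each array element is modeled as an iterator yielding suffix-key values in index order, standing in for one btree descent plus forward scan within that prefix):

```python
import heapq

def merge_limit(streams, limit):
    """Merge already-sorted per-prefix-key streams, stopping at `limit` rows.

    streams: dict mapping each array-qual value to an iterator that yields
    rows in suffix-key order (the order the index provides within that key).
    Returns the first `limit` rows in global suffix-key order, plus a count
    of how many "index tuples" were actually fetched.
    """
    heap = []
    fetched = 0
    for key, it in streams.items():
        first = next(it, None)  # one probe per array element up front
        if first is not None:
            fetched += 1
            heap.append((first, key, it))
    heapq.heapify(heap)

    out = []
    while heap and len(out) < limit:
        val, key, it = heapq.heappop(heap)
        out.append((key, val))
        nxt = next(it, None)  # advance only the stream we just consumed
        if nxt is not None:
            fetched += 1
            heapq.heappush(heap, (nxt, key, it))
    return out, fetched
```

With 3 array elements and LIMIT 50 this touches at most 53 "index tuples", versus the 1500 the current plan reads in the example above; the degenerate case in the cons list corresponds to each `next()` call being a full re-search from the root.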
On Sun, 30 Dec 2018 at 13:00, James Coleman <jtc331@gmail.com> wrote:
> Note that the index scan (or bitmap scan) nodes return all 1500 rows
> matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
> set is ordered, and finally the LIMIT is applied. While only 50 rows
> were requested, 30x that were fetched from the heap.
>
> I believe it is possible to use the index
> `index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
> (1,2,3)` qualifier and (at least partially; more on that later) the
> ordering `ORDER BY created_at` while fetching fewer than all rows
> matching the qualifier.
>
> Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> (...) order by c` and `a in (...) and b = 1 order by c` (and further
> similar derivations with increasing numbers of equality quals).

I don't quite understand the above paragraph, but I assume this would be possible to do with some new index am routine which allowed multiple different values for the initial key. Probably execution for something like this could be handled by having 1 IndexScanDesc per initial key. A scan would have to scan all of those for the first tuple and return the lowest order key. Subsequent scans would fetch the next tuple for the IndexScanDesc used previously, then again return the lowest order tuple. There's some binary heap code and examples in nodeMergeAppend.c about how that could be done fairly efficiently.

The hard part about that would be knowing when to draw the line. If there were 1000s of initial keys then some other method might be better. Probably the planner would have to estimate which method was best. There are also issues if there are multiple prefix keys, as you'd need to scan a cartesian product of the keys, which would likely get out of hand quickly with 2 or more initial keys.

There were discussions and a patch for a planner-level implementation which could likely assist with this in [1]. I'm not sure if this particular case was handled in the patch. I believe it was more intended for queries such as:

SELECT ... FROM t WHERE a = 1 OR b = 2

and could transform this into something more along the lines of:

SELECT ... FROM t WHERE a = 1 UNION SELECT ... FROM t WHERE b = 2

using the table's ctid to uniquify the rows. You could get away with something similar but use UNION ALL instead. You don't need UNION since your "OR" is on the same column, meaning there can be no overlapping rows. Something like:

(SELECT * FROM foos WHERE bar_fk = 1 LIMIT 50)
UNION ALL
(SELECT * FROM foos WHERE bar_fk = 2 LIMIT 50)
UNION ALL
(SELECT * FROM foos WHERE bar_fk = 3 LIMIT 50)
ORDER BY created_at LIMIT 50;

> Note: Another (loosely) related problem is that sorting can't
> currently take advantage of cases where an index provides a partial
> (prefix of requested pathkeys) ordering.

There has been a patch [2] around for about 4 years now that does this. I'm unsure of the current status, other than not yet committed.

[1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
[2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
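A quick back-of-envelope comparison of the row counts involved, using the numbers from the example upthread (3 array keys, 500 matching rows each, LIMIT 50); this is just arithmetic, not measured output:

```python
keys = 3            # elements in bar_fk IN (1, 2, 3)
rows_per_key = 500  # matching rows per array element
limit = 50

# Master: the scan returns every matching row before the sort/limit.
master_fetched = keys * rows_per_key

# UNION ALL rewrite (or proposal 1): each branch stops after LIMIT rows.
union_all_fetched = keys * min(limit, rows_per_key)

# Proposal 2 best case: one probe per key, then exactly LIMIT more rows.
merge_fetched = keys + limit

print(master_fetched, union_all_fetched, merge_fetched)  # 1500 150 53
```

The UNION ALL rewrite already cuts fetched rows by 10x here, which is why it works as a manual stand-in for proposal 1.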
On Sat, Dec 29, 2018 at 9:50 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
>
> On Sun, 30 Dec 2018 at 13:00, James Coleman <jtc331@gmail.com> wrote:
> > Note that the index scan (or bitmap scan) nodes return all 1500 rows
> > matching `bar_fk IN (1,2,3)`. After all rows are returned, that total
> > set is ordered, and finally the LIMIT is applied. While only 50 rows
> > were requested, 30x that were fetched from the heap.
> >
> > I believe it is possible to use the index
> > `index_foos_on_bar_fk_and_created_at` to fulfill both the `bar_fk IN
> > (1,2,3)` qualifier and (at least partially; more on that later) the
> > ordering `ORDER BY created_at` while fetching fewer than all rows
> > matching the qualifier.
> >
> > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > similar derivations with increasing numbers of equality quals).
>
> I don't quite understand the above paragraph, but I assume this
> would be possible to do with some new index am routine which allowed
> multiple different values for the initial key. Probably execution for
> something like this could be handled by having 1 IndexScanDesc per
> initial key. A scan would have to scan all of those for the first
> tuple and return the lowest order key. Subsequent scans would fetch
> the next tuple for the IndexScanDesc used previously, then again
> return the lowest order tuple. There's some binary heap code and
> examples in nodeMergeAppend.c about how that could be done fairly
> efficiently.

Mostly I was pointing out that the simple case (scalar array op qual on the first index column and order by the second index column) isn't the only potentially interesting one; we'd also want to handle, for example, a 3-column index with an equality qual on the first, an array op on the second, and order by the third. And then, as you note later, we could also theoretically do this for multiple array op quals.
Thanks for the pointer to nodeMergeAppend.c; I'll look at that to see if it sparks any ideas.

I'm intrigued by the idea of having multiple IndexScanDescs in the node. My current route had been to include an array of BTScanPos in BTScanOpaqueData and work within the same scan. Do you think that a new index am targeting multiple initial values would be a better route than improving the existing native array handling in nbtree.c/nbtutils.c? It seems to me that fitting it into the existing code gives us the greater potential usefulness, but with more effort to maintain the existing efficiencies there.

It sounds like multiple pinned pages (given your suggestion of multiple IndexScanDescs) for the same index in the same node are acceptable? We should still not be locking multiple pages at once, so I don't think there's risk of deadlock, but I wasn't sure if there were specific expectations about the number of pinned pages in a single relation at a given time.

> The hard part about that would be knowing when to draw the line. If
> there were 1000s of initial keys then some other method might be
> better. Probably the planner would have to estimate which method was
> best. There are also issues if there are multiple prefix keys, as you'd
> need to scan a cartesian product of the keys, which would likely get
> out of hand quickly with 2 or more initial keys.

Agreed. I expect the costing here to report both a higher startup cost (since it has to look at one index tuple per array element up front) and a higher per-tuple cost (since we might have to re-search), but if there are a very large (e.g., millions) number of rows and a small LIMIT then it's hard to imagine this not being the better option even up to a large number of keys (though at some point memory becomes a concern also).

> There were discussions and a patch for a planner-level implementation
> which could likely assist with this in [1]. I'm not sure if this
> particular case was handled in the patch. I believe it was more
> intended for queries such as:
>
> SELECT ... FROM t WHERE a = 1 OR b = 2
>
> and could transform this into something more along the lines of:
>
> SELECT ... FROM t WHERE a = 1 UNION SELECT ... FROM t WHERE b = 2
>
> using the table's ctid to uniquify the rows. You could get away with
> something similar but use UNION ALL instead. You don't need UNION
> since your "OR" is on the same column, meaning there can be no
> overlapping rows.
>
> Something like:
>
> (SELECT * FROM foos WHERE bar_fk = 1 LIMIT 50)
> UNION ALL
> (SELECT * FROM foos WHERE bar_fk = 2 LIMIT 50)
> UNION ALL
> (SELECT * FROM foos WHERE bar_fk = 3 LIMIT 50)
> ORDER BY created_at LIMIT 50;

This sounds effectively like a way to do my first proposal. In theory I think both are valuable and potentially complementary, so I'll read up on that one also.

> > Note: Another (loosely) related problem is that sorting can't
> > currently take advantage of cases where an index provides a partial
> > (prefix of requested pathkeys) ordering.
>
> There has been a patch [2] around for about 4 years now that does
> this. I'm unsure of the current status, other than not yet committed.

Doh, I should have linked to that; I've been following the incremental sort patch for a while (and submitted a test-case review) since it solves some significant problems for us.

- James Coleman
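The costing expectation above (higher startup, higher worst-case per-tuple cost, but a big win when matches vastly outnumber the limit) can be modeled very roughly like this. This is only a sketch; all function names and constants are hypothetical and bear no relation to actual planner cost code:

```python
import math

def merge_scan_cost(n_keys, limit, probe_cost, tuple_cost):
    """Toy cost for the proposal-2 merge scan: one index descent per array
    element up front, and (worst case) one re-search per returned tuple."""
    return n_keys * probe_cost + limit * (tuple_cost + probe_cost)

def sort_plan_cost(n_matching, limit, probe_cost, tuple_cost):
    """Toy cost for the current plan: fetch every matching row, then
    top-N heapsort them (roughly n * log2(limit) comparisons)."""
    return (probe_cost
            + n_matching * tuple_cost
            + n_matching * math.log2(max(2, limit)) * tuple_cost)

# Millions of matching rows and a small LIMIT: the merge scan wins even
# with a fairly large number of array keys.
print(merge_scan_cost(1000, 50, 4.0, 0.01))       # ~4200
print(sort_plan_cost(2_000_000, 50, 4.0, 0.01))   # ~133000
```

The crossover point (where re-searching per tuple outweighs fetching everything) is exactly what the planner would need to estimate, per David's point about drawing the line.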
On Sat, Dec 29, 2018 at 6:50 PM David Rowley <david.rowley@2ndquadrant.com> wrote:
> > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > similar derivations with increasing numbers of equality quals).
>
> I don't quite understand the above paragraph, but I assume this
> would be possible to do with some new index am routine which allowed
> multiple different values for the initial key.

I'm confused about James' use of the term "new index am". I guess he just meant a new support function or something?

> > Note: Another (loosely) related problem is that sorting can't
> > currently take advantage of cases where an index provides a partial
> > (prefix of requested pathkeys) ordering.
>
> There has been a patch [2] around for about 4 years now that does
> this. I'm unsure of the current status, other than not yet committed.
>
> [1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
> [2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com

I can see why you'd mention these two, but I also expected you to mention the skip scan project, since that involves pushing down knowledge about how the index is to be accessed to the index am (at least, I assume that it does), and skipping leading attributes to use the sort order from a suffix attribute. Actually, the partial sort idea that you linked to is more or less a dual of skip scan, at least to my mind (the former *extends* the sort order by adding a suffix tie-breaker, while the latter *skips* a leading attribute to get to an interesting suffix attribute).

The way James constructed his example suggested that there'd be some kind of natural locality that we'd expect to be able to take advantage of at execution time when the new strategy is favorable. I'm not sure if that was intended -- James?
I think it might help James to construct a more obviously realistic/practical motivating example. I'm perfectly willing to believe that this idea would help his real world queries, and having an example that can easily be played with is helpful in other ways. But I'd like to know why this idea is important in detail, since I think that it would help me to place it in the wider landscape of ideas that are like this. Placing it in that wider landscape, and figuring out next steps at a high level, seem to be the problem right now.

--
Peter Geoghegan
On Thu, Jan 10, 2019 at 6:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Dec 29, 2018 at 6:50 PM David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > > Areas of extension: (given index `(a, b, c)`) include `a = 1 and b in
> > > (...) order by c` and `a in (...) and b = 1 order by c` (and further
> > > similar derivations with increasing numbers of equality quals).
> >
> > I don't quite understand the above paragraph, but I assume this
> > would be possible to do with some new index am routine which allowed
> > multiple different values for the initial key.
>
> I'm confused about James' use of the term "new index am". I guess he
> just meant a new support function or something?

Thanks for responding. I was wondering if a new index access method would be the preferable way to do this, such as how the skip scan patch does it (I was taking some ideas from that one).

> > > Note: Another (loosely) related problem is that sorting can't
> > > currently take advantage of cases where an index provides a partial
> > > (prefix of requested pathkeys) ordering.
> >
> > There has been a patch [2] around for about 4 years now that does
> > this. I'm unsure of the current status, other than not yet committed.
> >
> > [1] https://www.postgresql.org/message-id/flat/7f70bd5a-5d16-e05c-f0b4-2fdfc8873489%40BlueTreble.com
> > [2] https://www.postgresql.org/message-id/flat/CAPpHfds1waRZ=NOmueYq0sx1ZSCnt+5QJvizT8ndT2=etZEeAQ@mail.gmail.com
>
> I can see why you'd mention these two, but I also expected you to
> mention the skip scan project, since that involves pushing down
> knowledge about how the index is to be accessed to the index am (at
> least, I assume that it does), and skipping leading attributes to use
> the sort order from a suffix attribute. Actually, the partial sort
> idea that you linked to is more or less a dual of skip scan, at least
> to my mind (the former *extends* the sort order by adding a suffix
> tie-breaker, while the latter *skips* a leading attribute to get to an
> interesting suffix attribute).

Yes, I'd been looking at the skip scan patch but didn't mention it. A lot of my initial email was from my initial brainstorming notes on the topic, and I should have cleaned it up a bit better before sending it.

> The way James constructed his example suggested that there'd be some
> kind of natural locality that we'd expect to be able to take
> advantage of at execution time when the new strategy is favorable. I'm
> not sure if that was intended -- James?

I'm not sure what you mean by "natural locality"; I believe that the artificial data I've constructed is actually somewhat close to the worst case for what I'm proposing. Evenly distributed (this is random, but in this case I think that's close enough) data will realize the smallest possible gains (and be the most likely to represent a regression with few enough rows in each group) because it is the case where we'd have the most overhead of rotating among scan keys.

> I think it might help James to
> construct a more obviously realistic/practical motivating example. I'm
> perfectly willing to believe that this idea would help his real world
> queries, and having an example that can easily be played with is
> helpful in other ways. But I'd like to know why this idea is important
> in detail, since I think that it would help me to place it in the
> wider landscape of ideas that are like this. Placing it in that wider
> landscape, and figuring out next steps at a high level, seem to be the
> problem right now.
I'll attempt to describe a more real world scenario: suppose we have a schema like:

users(id serial primary key)
orders(id serial primary key, user_id integer, created_at timestamp)

And we wanted to find the most recent N orders for a specific group of users (e.g., in a report or search). Your query might look like:

SELECT *
FROM orders
WHERE orders.user_id IN (1, 2, 3)
ORDER BY orders.created_at DESC
LIMIT 25;

Currently an index on orders(user_id, created_at) will be used for this query, but only to satisfy the scalar array op qual. Then all matching orders (say, years worth) will be fetched, a sort node will sort all of those results, and then a limit node will take the top N.

Generalized, the problem is something like "find the top N rows across a group of foreign keys" (though saying foreign keys is probably too specific).

But under the scheme I'm proposing, that same index would be able to provide both the filter and guarantee the ordering as well.

Does that more real-world-ish example help place the usefulness of this? I think this goes beyond increasing the usefulness of indexes by requiring less specific indexes (incremental sort does this): it allows the index to support a kind of query you currently can't (as far as I'm aware) express in a performant way at all (other than a complex recursive CTE, or in some subset of cases a bunch of UNION statements -- one per array entry).

James Coleman
On Fri, Jan 18, 2019 at 12:15 PM James Coleman <jtc331@gmail.com> wrote:
> I'll attempt to describe a more real world scenario: suppose we have a
> schema like:
>
> users(id serial primary key)
> orders(id serial primary key, user_id integer, created_at timestamp)
>
> And we wanted to find the most recent N orders for a specific group of
> users (e.g., in a report or search). Your query might look like:
>
> SELECT *
> FROM orders
> WHERE orders.user_id IN (1, 2, 3)
> ORDER BY orders.created_at DESC
> LIMIT 25;
>
> Currently an index on orders(user_id, created_at) will be used for
> this query, but only to satisfy the scalar array op qual. Then all
> matching orders (say, years worth) will be fetched, a sort node will
> sort all of those results, and then a limit node will take the top N.
>
> Generalized, the problem is something like "find the top N rows across
> a group of foreign keys" (though saying foreign keys is probably too
> specific).
>
> But under the scheme I'm proposing, that same index would be able to
> provide both the filter and guarantee the ordering as well.
>
> Does that more real-world-ish example help place the usefulness of this?

Yes. It didn't make much sense back in 2019, but I understand what you meant now, I think.

The latest version of my ScalarArrayOpExpr patch (v2) can execute queries like this efficiently:

https://postgr.es/m/CAH2-WzkEyBU9UQM-5GWPcB=WEShAUKcJdvgFuqVHuPuO-iYW0Q@mail.gmail.com

Note that your example is similar to the test case from the latest update on the thread: the test case from Benoit Tigeot, which appears here:

https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d?permalink_comment_id=4690491#gistcomment-4690491

You seemed to want to use an index that started with user_id/bar_fk. But I think you'd have to have an index on "created_at DESC, user_id". It could work the other way around with your suggested index, for a query written to match -- "ORDER BY user_id, created_at DESC".
With an index on "created_at DESC, user_id", you'd be able to efficiently execute your limit query. The index scan could only terminate when it found (say) 25 matching tuples, so you might still have to scan quite a few index pages. But you wouldn't have to do heap access to eliminate non-matches (with or without the VM being set) -- you could eliminate all of those non-matches using true SAOP index quals, which don't need to operate on known visible rows.

This is possible with the patch, despite the fact that the user_id column is a low-order column (so this isn't one of the cases where it's useful to "skip"). Avoiding heap hits just to eliminate non-matching rows on user_id is what really matters here, though -- not skipping.

It would be helpful if you could confirm this understanding, though.

Thanks
--
Peter Geoghegan
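The access pattern described above (walk the created_at-ordered index, rejecting non-matching user_id values from the index tuple alone, until the limit is satisfied) can be modeled with a toy sketch; this is an illustration of the idea, with hypothetical names, not the patch itself:

```python
def scan_suffix_ordered(index_rows, wanted_ids, limit):
    """index_rows: (created_at, user_id) tuples in index order, i.e.
    (created_at DESC, user_id).  Non-matches are rejected using only the
    index tuple -- no heap access is needed just to apply the filter.
    Returns the matching rows plus how many index tuples were scanned."""
    out = []
    scanned = 0
    for created_at, user_id in index_rows:
        scanned += 1
        if user_id in wanted_ids:  # SAOP qual applied inside the index scan
            out.append((created_at, user_id))
            if len(out) == limit:  # terminate only once LIMIT is satisfied
                break
    return out, scanned
```

The scan may still read many index pages when matches are sparse, but every rejected tuple costs an index-tuple comparison rather than a heap fetch, which is the point being made about avoiding heap hits.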