Обсуждение: We need index-only scans
We last researched index-only scans, also called covering indexes, in September of 2008, but have made little progress on it since. Many have been waiting for Heikki to implement this but I talked to him and he doesn't have time. I believe it is time for the community to move forward and I would like to assemble a team to work on this feature. We might not be able to implement it for Postgres 9.1, but hopefully we can make some progress on this. I have created a wiki page for this TODO item: http://wiki.postgresql.org/wiki/Index-only_scans I am interested in people improving this wiki page, and in people discussing and coding some of the items needed to implement this feature. I have personally talked to a few people already who seemed receptive. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Wed, Nov 10, 2010 at 4:04 AM, Bruce Momjian <bruce@momjian.us> wrote: > We last researched index-only scans, also called covering indexes, in > September of 2008, but have made little progress on it since. Many have > been waiting for Heikki to implement this but I talked to him and he > doesn't have time. > > I believe it is time for the community to move forward and I would like > to assemble a team to work on this feature. We might not be able to > implement it for Postgres 9.1, but hopefully we can make some progress > on this. Just so everyone is on the same page.... Even once we have index-only scans they won't be anywhere near as useful with Postgres as they are with Oracle and other databases. At least not unless we find a solution for a different problem -- our inability to scan btree indexes sequentially. In Oracle "Fast Full Index" scans are particularly useful for things like unconstrained select count(*). Since the scan can scan through the index sequentially and the index is much smaller than the table it can count all the values fairly quickly even on a very wide table. In Postgres, aside from the visibility issues we have a separate problem. In order to achieve high concurrency we allow splits to occur without locking the index. And the new pages can be found anywhere in the index, even to the left of the existing page. So a sequential scan could miss some data if the page it's on is split and some of the data is moved to be to the left of where our scan is. It's possible this is a non-issue in the future due to large RAM sizes and SSDs. Large amounts of RAM mean perhaps indexes will be in memory much of the time and SSDs might mean that scanning the btree in index order might not really be that bad. -- greg
Excerpts from Greg Stark's message of vie nov 12 10:33:28 -0300 2010: > In Postgres, aside from the visibility issues we have a separate > problem. In order to achieve high concurrency we allow splits to occur > without locking the index. And the new pages can be found anywhere in > the index, even to the left of the existing page. So a sequential scan > could miss some data if the page it's on is split and some of the data > is moved to be to the left of where our scan is. Eh? If a transaction splits a page and put some new tuples to the "left" of another transaction's scan (assuming a forward scan), then necessarily the second transaction cannot see those tuples anyway -- the inserter must have done the split after you got your snapshot. A transaction cannot split a page that other transaction has a scan stopped in, because there are buffer locks involved. Therefore the scan cannot miss any tuples. Saith src/backend/access/nbtree/README: Lehman and Yao don't require read locks, but assume that in-memory copies of tree pages are unshared. Postgres shares in-memory buffers among backends. As a result, we do page-level read locking on btree pages in order to guarantee that no record is modified while we are examining it. This reduces concurrency but guaranteees correct behavior. An advantage is that when trading in a read lock for a write lock, we need not re-read the page after getting the write lock. Since we're also holding a pin on the shared buffer containing the page, we know that buffer still contains the page and is up-to-date. We support the notion of an ordered "scan" of an index as well as insertions, deletions, and simple lookups. A scan in the forward direction is no problem, we just use the right-sibling pointers that L&Y require anyway. (Thus, once we have descended the tree to the correct start point for the scan, the scan looks only at leaf pages and never at higher tree levels.) To support scans in the backward direction, we also store a "left sibling" link much like the "right sibling". (This adds an extra step to the L&Y split algorithm: while holding the write lock on the page being split, we also lock its former right sibling to update that page's left-link. This is safe since no writer of that page can be interested in acquiring a write lock on our page.) A backwards scan has one additional bit of complexity: after following the left-link we must account for the possibility that the left sibling page got split before we could read it. So, we have to move right until we find a page whose right-link matches the page we came from. (Actually, it's even harder than that; see deletion discussion below.) Page read locks are held only for as long as a scan is examining a page. To minimize lock/unlock traffic, an index scan always searches a leaf page to identify all the matching items at once, copying their heap tuple IDs into backend-local storage. The heap tuple IDs are then processed while not holding any page lock within the index. We do continue to hold a pin on the leaf page, to protect against concurrent deletions (see below). In this state the scan is effectively stopped "between" pages, either before or after the page it has pinned. This is safe in the presence of concurrent insertions and even page splits, because items are never moved across pre-existing page boundaries --- so the scan cannot miss any items it should have seen, nor accidentally return the same item twice. The scan must remember the page's right-link at the time it was scanned, since that is the page to move right to; if we move right to the current right-link then we'd re-scan any items moved by a page split. We don't similarly remember the left-link, since it's best to use the most up-to-date left-link when trying to move left (see detailed move-left algorithm below). -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On 12.11.2010 15:51, Alvaro Herrera wrote: > Excerpts from Greg Stark's message of vie nov 12 10:33:28 -0300 2010: > >> In Postgres, aside from the visibility issues we have a separate >> problem. In order to achieve high concurrency we allow splits to occur >> without locking the index. And the new pages can be found anywhere in >> the index, even to the left of the existing page. So a sequential scan >> could miss some data if the page it's on is split and some of the data >> is moved to be to the left of where our scan is. > > Eh? It took me a while to understand what Greg meant as well. You can't scan a B-tree index in *physical order*, You have to first descend to the leftmost leaf, and follow the right pointers from there until you reach the rightmost leaf. That is a lot slower than seqscanning a file in physical order. We solved that for VACUUM, taking advantage of the fact that there can only be one VACUUM on a table at a time. Maybe that mechanism could be generalized to all scans, but it would require some thinking.. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Greg Stark wrote: > On Wed, Nov 10, 2010 at 4:04 AM, Bruce Momjian <bruce@momjian.us> wrote: > > We last researched index-only scans, also called covering indexes, in > > September of 2008, but have made little progress on it since. ?Many have > > been waiting for Heikki to implement this but I talked to him and he > > doesn't have time. > > > > I believe it is time for the community to move forward and I would like > > to assemble a team to work on this feature. ?We might not be able to > > implement it for Postgres 9.1, but hopefully we can make some progress > > on this. > > Just so everyone is on the same page.... Even once we have index-only > scans they won't be anywhere near as useful with Postgres as they are > with Oracle and other databases. At least not unless we find a > solution for a different problem -- our inability to scan btree > indexes sequentially. > > In Oracle "Fast Full Index" scans are particularly useful for things > like unconstrained select count(*). Since the scan can scan through > the index sequentially and the index is much smaller than the table it > can count all the values fairly quickly even on a very wide table. > > In Postgres, aside from the visibility issues we have a separate > problem. In order to achieve high concurrency we allow splits to occur > without locking the index. And the new pages can be found anywhere in > the index, even to the left of the existing page. So a sequential scan > could miss some data if the page it's on is split and some of the data > is moved to be to the left of where our scan is. > > It's possible this is a non-issue in the future due to large RAM sizes > and SSDs. Large amounts of RAM mean perhaps indexes will be in memory > much of the time and SSDs might mean that scanning the btree in index > order might not really be that bad. Agreed. I updated the index-only scans wiki for this: http://wiki.postgresql.org/wiki/Index-only_scanstest speed improvement for scans of the entire index (this involvesrandomI/O) * we can't scan the index in physical order like vacuum does -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On 11/12/2010 09:17 AM, Bruce Momjian wrote: > Greg Stark wrote: >> On Wed, Nov 10, 2010 at 4:04 AM, Bruce Momjian<bruce@momjian.us> wrote: >>> We last researched index-only scans, also called covering indexes, in >>> September of 2008, but have made little progress on it since. ?Many have >>> been waiting for Heikki to implement this but I talked to him and he >>> doesn't have time. >>> >>> I believe it is time for the community to move forward and I would like >>> to assemble a team to work on this feature. ?We might not be able to >>> implement it for Postgres 9.1, but hopefully we can make some progress >>> on this. >> Just so everyone is on the same page.... Even once we have index-only >> scans they won't be anywhere near as useful with Postgres as they are >> with Oracle and other databases. At least not unless we find a >> solution for a different problem -- our inability to scan btree >> indexes sequentially. >> >> In Oracle "Fast Full Index" scans are particularly useful for things >> like unconstrained select count(*). Since the scan can scan through >> the index sequentially and the index is much smaller than the table it >> can count all the values fairly quickly even on a very wide table. >> >> In Postgres, aside from the visibility issues we have a separate >> problem. In order to achieve high concurrency we allow splits to occur >> without locking the index. And the new pages can be found anywhere in >> the index, even to the left of the existing page. So a sequential scan >> could miss some data if the page it's on is split and some of the data >> is moved to be to the left of where our scan is. >> >> It's possible this is a non-issue in the future due to large RAM sizes >> and SSDs. Large amounts of RAM mean perhaps indexes will be in memory >> much of the time and SSDs might mean that scanning the btree in index >> order might not really be that bad. > Agreed. I updated the index-only scans wiki for this: > > http://wiki.postgresql.org/wiki/Index-only_scans > > test speed improvement for scans of the entire index (this involves > random I/O) > * we can't scan the index in physical order like vacuum does For unconstrained select count(*), why does scanning in index order matter? cheers andrew
Excerpts from Heikki Linnakangas's message of vie nov 12 11:01:39 -0300 2010: > It took me a while to understand what Greg meant as well. You can't scan > a B-tree index in *physical order*, You have to first descend to the > leftmost leaf, and follow the right pointers from there until you reach > the rightmost leaf. That is a lot slower than seqscanning a file in > physical order. Oh, that makes more sense. I'm not sure that can be supported sanely (i.e. not locking the whole index) -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Fri, Nov 12, 2010 at 8:33 AM, Greg Stark <gsstark@mit.edu> wrote: > On Wed, Nov 10, 2010 at 4:04 AM, Bruce Momjian <bruce@momjian.us> wrote: >> We last researched index-only scans, also called covering indexes, in >> September of 2008, but have made little progress on it since. Many have >> been waiting for Heikki to implement this but I talked to him and he >> doesn't have time. >> >> I believe it is time for the community to move forward and I would like >> to assemble a team to work on this feature. We might not be able to >> implement it for Postgres 9.1, but hopefully we can make some progress >> on this. > > Just so everyone is on the same page.... Even once we have index-only > scans they won't be anywhere near as useful with Postgres as they are > with Oracle and other databases. At least not unless we find a > solution for a different problem -- our inability to scan btree > indexes sequentially. I have very little doubt that our first attempts to chip away at this problem are going to be a bit rough around the edges. Here's another problem to mull over: a large insert-only table will never be vacuumed; therefore, the visibility map bits will never become set; therefore, the index-only scan optimization won't apply (and the user may not realize it or understand why it's happening). But the journey of a thousand miles begins with the first step. I think we need to focus our first effort on making the visibility map crash-safe. Then we can implement the basic feature, which I would characterize this way: if performing an index-scan, and all the attributes we need are available from the index tuple, then skip the heap fetch when the visibility map bit is set. This requires minimal planner support - just an adjustment of the costing model for index scans; although to do it right I think we're going to need statistics on what fraction of pages in the heap have the visibility map bit set.Then, we can work on refinements, of which I thinkthere will be many, including the one you listed. Another is to bubble up heap fetches in the plan tree - so for example if you eventually need to return some attributes that aren't in the index tuple, you could consider performing some other join based on the index columns and then do the heap fetches for the remaining attributes (and visibility checks) later. I am not confident that we can get even a basic implementation of index-only scans into 9.1 at this point, and we're certainly not going to get all the kinks worked out. So I agree with you that we shouldn't set expectations above the level at which they can be met, but, I'd be happy if we can make a start on it. ...Robert
Greg Stark <gsstark@mit.edu> writes: > Just so everyone is on the same page.... Even once we have index-only > scans they won't be anywhere near as useful with Postgres as they are > with Oracle and other databases. At least not unless we find a > solution for a different problem -- our inability to scan btree > indexes sequentially. True, however I would not be too pessimistic about this. For OLTP like typical web applications, index-only scans are a killer feature for being able to read N rows with 1 I/O (for some small N), when the data no longer fits in the buffer pool, or after cold start. Eg. read all account settings for one user account, or subjects of all messages, etc. A composite index with user_id in the first column allows to fetch all N rows from one (or a few) disk pages with an index-only scan, as opposed to N disk pages. So for this, index-only scans can make a _big_ difference, even without support for Oracle-type index fast-full-scans. - Kristian.
On Wed, Dec 1, 2010 at 3:53 AM, Kristian Nielsen <knielsen@knielsen-hq.org> wrote: > Greg Stark <gsstark@mit.edu> writes: > >> Just so everyone is on the same page.... Even once we have index-only >> scans they won't be anywhere near as useful with Postgres as they are >> with Oracle and other databases. At least not unless we find a >> solution for a different problem -- our inability to scan btree >> indexes sequentially. > > True, however I would not be too pessimistic about this. > > For OLTP like typical web applications, index-only scans are a killer feature > for being able to read N rows with 1 I/O (for some small N), when the data no > longer fits in the buffer pool, or after cold start. > > Eg. read all account settings for one user account, or subjects of all > messages, etc. A composite index with user_id in the first column allows to > fetch all N rows from one (or a few) disk pages with an index-only scan, as > opposed to N disk pages. > > So for this, index-only scans can make a _big_ difference, even without > support for Oracle-type index fast-full-scans. I am not trying start a MySQL vs PostgreSQL thread. I lurk on these lists to learn more about PostgreSQL. I know that PostgreSQL is good at OLTP and complex query processing and that index fast-full scans can make a big difference for large joins, but the workload that I care about is OLTP-only. PostgreSQL will be more efficient on that workload with support for index-only scans. The majority of the load is simple queries -- joins that touch a few rows, short index range scans and index point lookups. With covering indexes and InnoDB the queries do a few disk reads in the worst case. Without covering indexes the queries do extra disk IO in the worst case (buffer pool read miss) and this is much worse for the range scans. I assume that the behavior with covering indexes but without index-only scans is similar to not having index-only scans. I collect 95th percentile response times for my popular queries and they are much improved with the use of covering indexes. -- Mark Callaghan mdcallag@gmail.com