Discussion: Making hash indexes worthwhile
I see that the docs were recently changed from discouraging hash indexes both because there were no known uses in which they were meaningfully better than btree, and because they aren't recoverable; to now just discouraging them because they are not recoverable. Does that mean there are now uses in which hash indexes are provably better than btree if one is willing to overlook the lack of recoverability? If so, what are those situations?

I've played around a bit with hash indexes, and it seems to me that making them generally worthwhile will take (at least) reducing or entirely doing away with the heavyweight locks. I have a few ideas, which are not necessarily independent of each other.

===metablock===

If each bucket knows how many bits of the hash value are significant for the entries in that bucket, then there will be a way to detect races from the metablock to the main bucket block. Rather than chaining locks from the metablock to the bucket, the process can compute the bucket and remember the number of significant bits in that bucket according to the metablock, then release the metablock lock before attempting to acquire the bucket lock. When the bucket lock is obtained, if the number of bits is greater than the number remembered from the metablock, then it can drop the bucket lock and try again. (See the sketch after this message.)

Once you don't need to chain locks from the metablock to the bucket, the lock on the metablock can probably be lowered to just a buffer content lock rather than a heavy lock. Even better, maybe the metablock can be cached in the relcache. Since the changes to the metablock seem to be entirely unidirectional (other than things like index dropping and rebuilding, which I think the normal relcache flushing mechanisms will take care of), it should always be possible to detect when you have a too-stale copy and have to go get the current one.

===get rid of heavy locks===

While there are several potential places for deadlock, it seems like the main one is that the hash index scan code returns control to the executor while holding locks on the hash buckets/pages, so you can get hybrid deadlocks where one side of the blockage is in hash-code space and the other in executor space. The obvious way to fix this would be to read the entire bucket's worth of qualifying tids into local memory and release the bucket lock before returning control to the executor, kind of like btree does with pages. The main problem here is that a logical bucket can have an unbounded number of overflow pages. So if a bucket has too many tids with the sought-for full hash value (more than will fit in work_mem), we would have to be prepared to spill to disk. Is that an acceptable trade-off?

Obviously there are a lot of subtleties to work out with insertions and bucket splits, but if having read-only scans occasionally need to spill to disk is a no-go, then there is no point in trying to work the other things out.

Thanks for any feedback,

Jeff
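A minimal C sketch of the retry protocol described above. The per-bucket significant-bit count (hasho_nbits) and bucket_significant_bits() are hypothetical additions; the other names are, to the best of my knowledge, from the existing hash AM. It assumes rel, hashkey, and a pinned metabuf are in scope.

    Buffer          buf;
    Bucket          bucket;
    BlockNumber     blkno;
    uint32          saved_bits;
    HashPageOpaque  opaque;

    for (;;)
    {
        HashMetaPage metap;

        /* Brief shared content lock on the metapage -- no heavyweight lock */
        LockBuffer(metabuf, BUFFER_LOCK_SHARE);
        metap = HashPageGetMeta(BufferGetPage(metabuf));
        bucket = _hash_hashkey2bucket(hashkey,
                                      metap->hashm_maxbucket,
                                      metap->hashm_highmask,
                                      metap->hashm_lowmask);
        saved_bits = bucket_significant_bits(metap, bucket); /* hypothetical */
        blkno = BUCKET_TO_BLKNO(metap, bucket);
        LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);

        /* Lock the bucket page while holding nothing on the metapage */
        buf = _hash_getbuf(rel, blkno, HASH_READ, LH_BUCKET_PAGE);
        opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

        /*
         * If the bucket split after we released the metapage, its
         * significant-bit count (hypothetical hasho_nbits) has grown past
         * what we remembered; drop the bucket lock and recompute from a
         * fresh metapage.
         */
        if (opaque->hasho_nbits <= saved_bits)
            break;              /* no intervening split; safe to proceed */

        _hash_relbuf(rel, buf);
    }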
Jeff Janes <jeff.janes@gmail.com> writes:
> I see that the docs were recently changed from discouraging hash
> indexes both because there were no known uses in which they were
> meaningfully better than btree, and because they aren't recoverable;
> to now just discouraging them because they are not recoverable. Does
> that mean there are now uses in which hash indexes are provably better
> than btree if one is willing to overlook the lack of recoverability?
> If so, what are those situations?

One reason is that you can index values that are too big for btrees; since hash indexes now store only the hash codes, they don't have a hard length limit on the underlying value. I'm not sure how useful that really is in practice, but it's at least an argument for considering them in certain cases.

> I've played around a bit with hash indexes, and it seems to me that
> making them generally worthwhile will take (at least) reducing or
> entirely doing away with the heavyweight locks.

Concurrency is really the least of the issues for hash indexes. So far it's not clear that they're fast enough even in sequential use ...

			regards, tom lane
On Sun, Oct 4, 2009 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>
>> I've played around a bit with hash indexes, and it seems to me that
>> making them generally worthwhile will take (at least) reducing or
>> entirely doing away with the heavyweight locks.
>
> Concurrency is really the least of the issues for hash indexes. So far
> it's not clear that they're fast enough even in sequential use ...

Do you know why that should be? I've done some work with gprof, and the results are pretty suspect, because the total gprof time adds up to only about 1/3 of the total time the backend spends on CPU (according to "top"), and I don't know where the unaccounted-for time is going. But a good part of the accounted-for time seems to be associated with the locking system, even when there is only one active backend.

Cheers,

Jeff
Jeff Janes <jeff.janes@gmail.com> writes:
> Do you know why that should be? I've done some work with gprof, and
> the results are pretty suspect, because the total gprof time adds up
> to only about 1/3 of the total time the backend spends on CPU
> (according to "top"), and I don't know where the unaccounted-for time
> is going.

Are you sure that gprof is delivering trustworthy numbers at all? I've seen cases where it was consistently mis-timing things:
https://bugzilla.redhat.com/show_bug.cgi?id=151763
Admittedly that was an old Linux version, but ...

			regards, tom lane
On Mon, Oct 5, 2009 at 6:49 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> Do you know why that should be? I've done some work with gprof, and
>> the results are pretty suspect, because the total gprof time adds up
>> to only about 1/3 of the total time the backend spends on CPU
>> (according to "top"), and I don't know where the unaccounted-for time
>> is going.
>
> Are you sure that gprof is delivering trustworthy numbers at all?
> I've seen cases where it was consistently mis-timing things:
> https://bugzilla.redhat.com/show_bug.cgi?id=151763
> Admittedly that was an old Linux version, but ...

That bug goes in the other direction versus what I am seeing, reporting more time than the clock on the wall would allow. I think things are more or less good, because profiling simple programs gives the expected answers, whereas under that bug they wouldn't. I think there are some kinds of system calls which are accounted for as user space by "top", but which are not interruptible and so don't get timed by gprof. But are such events evenly distributed throughout the code or concentrated in certain places? I'll need to experiment with it a bit more.

In any case, I hacked the hash index code to not take heavyweight locks on the meta block or on bucket blocks (lightweight buffer content locks are still taken), and the performance in a nested loop (full table scan outer, hash-index inner scan) went up by over 50%, with only one backend active. And the heavyweight locking code dropped out of the top spots in gprof. So I think I may be on to something.

I want to run it for many more hours to make sure I'm getting good statistics. Then it is a question of whether my other ideas for making it safe to remove the heavyweight locks on a non-read-only system can be implemented.

Jeff
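For context, the heavyweight page lock in the 8.x-era hash AM is taken through a small wrapper, roughly like the following (paraphrased from memory, not a verified copy of the source). Making the code believe the relation is temporary makes USELOCKING() false, so LockPage() is skipped while buffer content locks remain:

    /*
     * Roughly how the hash AM takes its heavyweight page lock (8.x era,
     * paraphrased).  USELOCKING() is false for backend-local (temp)
     * relations, which is the path Jeff's hack forces.
     */
    void
    _hash_getlock(Relation rel, BlockNumber whichlock, int access)
    {
        if (USELOCKING(rel))                /* skipped for temp relations */
            LockPage(rel, whichlock, access);
    }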
On Sun, Oct 4, 2009 at 7:13 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Jeff Janes <jeff.janes@gmail.com> writes:
>> I see that the docs were recently changed from discouraging hash
>> indexes both because there were no known uses in which they were
>> meaningfully better than btree, and because they aren't recoverable;
>> to now just discouraging them because they are not recoverable. Does
>> that mean there are now uses in which hash indexes are provably better
>> than btree if one is willing to overlook the lack of recoverability?
>> If so, what are those situations?
>
> One reason is that you can index values that are too big for btrees;
> since hash indexes now store only the hash codes, they don't have a hard
> length limit on the underlying value. I'm not sure how useful that
> really is in practice, but it's at least an argument for considering
> them in certain cases.

I haven't tested this situation at all yet, hoping to show that there is a substantially less restrictive use case under which they can be made to work well.

>> I've played around a bit with hash indexes, and it seems to me that
>> making them generally worthwhile will take (at least) reducing or
>> entirely doing away with the heavyweight locks.
>
> Concurrency is really the least of the issues for hash indexes. So far
> it's not clear that they're fast enough even in sequential use ...

Hash index performance came up again in another thread, so I wanted to come back to this and report my findings.

I've done testing using select-only statements with either a btree or a hash index on pgbench_accounts.aid, using only one active backend. For all tests, CPU usage was the bottleneck, essentially 100%. I used 600MB shared_buffers and -s 30. I figure that if each table-page fetch is a physical IO, then that will dominate the performance, so the index's nature is irrelevant. So if the hash index has a strength, it will best show when the table fits in memory. (The alternative would be if the table is so large that not even the btree's index pages fit in memory, resulting in IO for those index pages as well as the table's, but I didn't have the patience or an adequate IO subsystem to set up a test for that.)

With pgbench -S (select only, one tuple at a time) the overhead of communication between pgbench and the backend dominated the run time. Moving the select loop into a pl/pgsql routine did not help much. So I made a temp table with 10,000 randomly selected ids (from 1 to 3,000,000) and used that as the outer member to drive a nested loop join against the pgbench_accounts table, doing a custom query of:

    select sum(abalance) from pgbench_accounts, foo where aid=random_id

The HEAD code could do 16-17 tps (160,000 to 170,000 inner-loop tuple fetches per second) using the hash index, and 18 tps with the btree. (But occasionally the rank order would be reversed, out of many alternating runs of reindexing, vacuuming, then doing pgbench -T 600 -f nested_loop.sql.) When I disabled the lmgr page locking of the hash index, by making it think it was a temporary relation, the tps for the hash index jumped to 24 tps.

If I make all the aid values in pgbench_accounts even, and load "foo" with only odd IDs so that no rows are actually found, then the speed of the hash index jumps to about 48 tps. So about half the remaining CPU usage with the hash index pertains to dealing with the rows once they are found, and not with using the index to attempt to find them. (I don't know how much of this time is used to recheck the qual, which btree doesn't need to do.)
So my conclusions are:

1) btree indexes are so good at equality lookup that there may not be much room for hash indexes to improve on them by a substantial amount.

2) Heavyweight locks are called that for a reason; they use a lot of CPU even without contention.

3) The CPU usage of the hash-index code proper is quite small, with more time being spent in heavyweight page locks (specific to hash indexes, but not part of the hash index code itself) and in executor code common to all index methods than in the hash index code.

Based on this, I don't think that changing the bucket size to be smaller than the 8K block size is going to make much difference in hash index performance.

Thinking about how to get rid of heavyweight locks and add WAL, I was wondering if we couldn't get rid of bucket splits altogether, requiring the bucket count to be set immutably at build time based on the estimated eventual max/equilibrium size of the table (see the sketch after this message). Maybe this would render hash indexes somewhat of a niche feature, but probably much less so than they currently are. I think that getting rid of dynamic bucket splits would make it easier to get rid of heavyweight locks and also much easier to implement WAL. WAL-logging the addition of an overflow page seems about the same as a btree node split, while WAL-logging a bucket split, when the bucket potentially has overflow pages in it, seems much, much harder.

Cheers,

Jeff
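A minimal sketch of the no-split idea. This is hypothetical; the existing _hash_hashkey2bucket() instead masks the hash against a power-of-two boundary that grows as buckets split:

    /*
     * Hypothetical bucket mapping for a hash index whose bucket count is
     * fixed at CREATE INDEX time.  With no dynamic splits there is no
     * maxbucket/highmask/lowmask bookkeeping, a bucket's address never
     * changes, and a scan can never race against a split.
     */
    static inline Bucket
    fixed_hashkey2bucket(uint32 hashkey, uint32 nbuckets)
    {
        /* nbuckets is chosen once, from the estimated equilibrium
         * size of the table, and never changes thereafter */
        return hashkey % nbuckets;
    }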
Jeff Janes <jeff.janes@gmail.com> writes:
> So my conclusions are:

> 2) Heavyweight locks are called that for a reason; they use a lot of
> CPU even without contention.

> 3) The CPU usage of the hash-index code proper is quite small, with
> more time being spent in heavyweight page locks (specific to hash
> indexes, but not part of the hash index code itself) and in executor
> code common to all index methods than in the hash index code.

Interesting. My reaction to that would be to try to replace the heavyweight locks with LWLocks. IIRC the reason for using heavyweight locks was fear of deadlocks, but maybe closer analysis and some tweaking would allow us to eliminate that risk.

			regards, tom lane
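A rough sketch of the substitution Tom suggests, using the existing buffer-manager API (an illustration of the direction, not a worked-out patch). The catch is that an LWLock has no deadlock detection, so the scan must not hold it while control returns to the executor, which is exactly why Jeff's copy-out-the-TIDs idea matters:

    /* Today: a heavyweight lock, registered with the lock manager and
     * covered by deadlock detection -- and by its CPU overhead. */
    _hash_getlock(rel, blkno, HASH_SHARE);      /* calls LockPage() */

    /* Suggested direction: rely only on the lightweight buffer content
     * lock.  Cheap, but unsafe to hold across calls back into the
     * executor, so matching TIDs would be copied into backend-local
     * memory before the lock is released. */
    buf = ReadBuffer(rel, blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    /* ... collect matching TIDs from the bucket page ... */
    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    ReleaseBuffer(buf);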