keys_are_unique optimization causes out-of-buffers failure
От | Tom Lane |
---|---|
Тема | keys_are_unique optimization causes out-of-buffers failure |
Дата | |
Msg-id | 15534.1048282597@sss.pgh.pa.us обсуждение исходный текст |
Список | pgsql-hackers |
I've just tracked down the reason for this failure: http://archives.postgresql.org/pgsql-general/2003-03/msg01052.php It seems the reason I didn't see it in my previous test was that I didn't happen to declare the index as unique. A many-way indexscan on unique indexes *will* run 7.3.* and CVS tip out of buffers. The problem is the optimization I added to index_getnext() to skip calling the index access method after the first call, if the index key is known unique. With that optimization in place, index_getnext() can return NULL while still holding a buffer pin on the heap page containing the single interesting tuple. (Furthermore, the underlying AM is still holding a pin on the corresponding index page.) This is no big problem in the normal case --- but when we are doing a multiple-indexscan plan, nodeIndexscan.c will then move on to the next indexscan in the plan. If that acts the same, two more buffers will be pinned when nodeIndexscan.c advances to the third indexscan. Etc. Sooner or later you run out of buffers. This isn't a resource leak in the classic sense, because all these buffers will be released when the plan is closed down; but it's effectively an intraquery leak, and it will cause problems for many-way indexscans. This is the second major bug found in the keys_are_unique hack (the first being that it didn't cope with backwards scanning). At this point I'm a little bit disgusted with the hack, and would be willing to consider ripping it out entirely. But it does offer some performance benefits, and probably we ought to look first to see if it can be saved. A minimal-change solution could be made by disabling the keys_are_unique optimization in the context of a multiple-index scan. This is pretty grotty though, both from a performance standpoint and because it introduces coupling between levels that ought to be independent. A better way to proceed would be to get index_getnext() to release all the resources involved before returning NULL. I believe this is safe even though index_getnext() might later need to back up and re-fetch the heap tuple --- if it returned the tuple once, then the tuple is visible to the current transaction and can't get deleted from under us, even if we release the pin on its page. A bit of ugliness is that the only available way to get the index AM to release its index-level pins is to call index_rescan, which would normally reset index_getnext's own state. So this seems doable but ugly. Can anyone see a better way? Are there (still) holes in this plan? Anyone want to lobby for dropping the keys_are_unique code completely? A related bug has been latent all along: because "mark" positions are remembered per-indexscan, it'd be possible for successive ExecMarkPos calls applied to a multi-indexscan plan node to build up an indefinitely large number of marks and thereby run the system out of buffers. I don't see any way to fix this without adding a new index-AM API call to clear out a remembered mark position. However, this problem cannot actually occur in the current system or any readily-foreseeable version, because ExecMarkPos is only used by merge joins and a multi-indexscan couldn't be a direct child of a merge join (since a multi-indexscan would never be thought to produce any useful sort order). So I'm going to ignore the potential future problem, but I thought I'd mention it for the archives. regards, tom lane
В списке pgsql-hackers по дате отправления: