Discussion: [HACKERS] RFC: Key normalization for nbtree
I've created a new Wiki page that describes a scheme for normalizing
internal page items within B-Tree indexes, and the many optimizations
that this can enable:

https://wiki.postgresql.org/wiki/Key_normalization

Key normalization means creating a representation for internal page
items that we always just memcmp(), regardless of the details of the
underlying datatypes.

My intent in creating this wiki page is to document these techniques
centrally, as well as the problems that they may solve, and to show
how they're all interrelated. It might be that confusion about how one
optimization enables another holds back patch authors.

It might appear excessive to talk about several different techniques
in one place, but that seemed like the best way to me, because there
are subtle dependencies. If most of the optimizations are pursued as a
project all at once (say, key normalization, suffix truncation, and
treating heap TID as a unique-ifier), that may actually be more likely
to succeed than a project to do just one. The techniques don't appear
to be related at first, but they really are.

I'm not planning on working on key normalization or any of the other
techniques as projects myself, but FWIW I have produced minimal
prototypes of a few of the techniques over the past several years,
just to verify my understanding. My theories on this topic seem worth
writing down.

I'm happy to explain or clarify any aspect of what I describe, and to
revise the design based on feedback. It is still very much a work in
progress.

--
Peter Geoghegan
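[The memcmp()-able representation can be illustrated with a toy sketch. This is plain Python, not PostgreSQL code; the encodings and function names are invented for illustration. Typed values are flattened into byte strings whose bytewise order matches the datatype's sort order, so a multi-column key compares with a single byte comparison.]

```python
# Toy sketch of key normalization: encode typed values as byte strings
# whose bytewise (memcmp-style) order matches the datatype's sort order.
# The encodings here are illustrative, not what PostgreSQL would use.

import struct

def normalize_int(v: int) -> bytes:
    # Bias a signed 64-bit value by 2^63 so that two's-complement values
    # compare correctly as unsigned big-endian bytes.
    return struct.pack(">Q", (v + (1 << 63)) & ((1 << 64) - 1))

def normalize_text(s: str) -> bytes:
    # A real implementation would use strxfrm()-style collation keys;
    # raw UTF-8 only yields codepoint order. A 0x00 terminator keeps
    # shorter strings sorting before their own extensions.
    return s.encode("utf-8") + b"\x00"

def normalize_key(*parts) -> bytes:
    # Concatenate per-column normalized images: comparing the whole
    # image is then a single memcmp()-equivalent operation.
    out = b""
    for p in parts:
        out += normalize_int(p) if isinstance(p, int) else normalize_text(p)
    return out

# Sorting by the normalized image matches sorting by the original keys.
keys = [(-5, "b"), (3, "a"), (3, "ab"), (-5, "a")]
assert sorted(keys) == sorted(keys, key=lambda k: normalize_key(*k))
```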
On Mon, Jul 10, 2017 at 3:40 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might appear excessive to talk about several different techniques
> in one place, but that seemed like the best way to me, because there
> are subtle dependencies. If most of the optimizations are pursued as a
> project all at once (say, key normalization, suffix truncation, and
> treating heap TID as a unique-ifier), that may actually be more likely
> to succeed than a project to do just one. The techniques don't appear
> to be related at first, but they really are.

I do have a patch that attacks suffix truncation, heap TID unification
and prefix compression all at once. It's on hiatus ATM, but, as you
say, the implementations are highly correlated, so attacking them all
at once makes a lot of sense. Or, at least, attacking one with the
others in the back of your mind. Key normalization would simplify
prefix compression considerably, for instance.

A missing optimization is that having tid unification allows VACUUM to
implement a different strategy when it needs to clean up only a tiny
fraction of the index. It can do the lookup by key-tid instead of
scanning the whole index, which can be a win if the index is large and
the number of index pointers to kill is small.
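[A toy illustration of why key normalization simplifies prefix compression. The page layout and names below are hypothetical, not nbtree's actual page format: once keys are memcmp()-able byte strings, a page can store its items' longest common prefix once and reduce each item to its suffix, with no datatype-specific logic.]

```python
# Hypothetical page-level prefix compression over normalized keys.
# With items already in sorted order, the page-wide common prefix is
# just the common prefix of the first and last items.

def common_prefix_len(a: bytes, b: bytes) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def compress_page(items: list[bytes]) -> tuple[bytes, list[bytes]]:
    # Store the shared prefix once; each item keeps only its suffix.
    plen = common_prefix_len(items[0], items[-1])
    return items[0][:plen], [it[plen:] for it in items]

def decompress_page(prefix: bytes, suffixes: list[bytes]) -> list[bytes]:
    return [prefix + s for s in suffixes]

items = [b"user_0001", b"user_0042", b"user_0999"]
prefix, suffixes = compress_page(items)
assert prefix == b"user_0"
assert decompress_page(prefix, suffixes) == items
```

Note that bytewise suffixes still compare in the right order relative to one another, which is exactly the property the normalized representation buys you.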
Claudio Freire wrote:
> A missing optimization is that having tid unification allows VACUUM to
> implement a different strategy when it needs to clean up only a tiny
> fraction of the index. It can do the lookup by key-tid instead of
> scanning the whole index, which can be a win if the index is large and
> the number of index pointers to kill is small.

Doing index cleanup by using keys instead of scanning the whole index
would be a *huge* win for many use cases. I don't think that work
needs to be related to either of these patches.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 10, 2017 at 1:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Claudio Freire wrote:
>
>> A missing optimization is that having tid unification allows VACUUM to
>> implement a different strategy when it needs to clean up only a tiny
>> fraction of the index. It can do the lookup by key-tid instead of
>> scanning the whole index, which can be a win if the index is large and
>> the number of index pointers to kill is small.
>
> Doing index cleanup by using keys instead of scanning the whole index
> would be a *huge* win for many use cases. I don't think that work needs
> to be related to either of these patches.

The problem is that it will perform horribly when there are many
duplicates, unless you really can treat heap TID as a part of the key.
Once you do that, you can be almost certain that you won't have to
visit more than one leaf page per index tuple to kill. And, you can
amortize descending the index among a set of duplicate-keyed tuples.
You arrive at the leaf tuple with the lowest sort order, based on
(key, TID), kill that, and then continue an index scan along the leaf
level. VACUUM ticks tuples off against a private "kill list", which is
also ordered by (key, TID). Much less random I/O, and pretty good
guarantees about the worst case.

--
Peter Geoghegan
On Mon, Jul 10, 2017 at 12:53 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I do have a patch that attacks suffix truncation, heap tid unification
> and prefix compression all at once.

That's great! I'll certainly be able to review it.

> It's on a hiatus ATM, but, as you say, the implementations are highly
> correlated so attacking them at once makes a lot of sense. Or, at
> least, attacking one having the other in the back of your mind.
>
> Key normalization would simplify prefix compression considerably, for
> instance.

The stuff that I go into detail about is the stuff I think you'd need
to do up front. Prefix compression would be a "nice to have", but I
think you'd specifically do key normalization, suffix truncation, and
adding the heap TID to the key space all up front.

> A missing optimization is that having tid unification allows VACUUM to
> implement a different strategy when it needs to clean up only a tiny
> fraction of the index. It can do the lookup by key-tid instead of
> scanning the whole index, which can be a win if the index is large and
> the number of index pointers to kill is small.

I do mention that on the Wiki page. I also describe ways that these
techniques have some non-obvious benefits for VACUUM B-Tree page
deletion, which you should get "for free" once you do the 3 things I
mentioned. A lot of the benefits for VACUUM are seen in the worst
case, which is when they're really needed.

--
Peter Geoghegan
On 10 July 2017 at 19:40, Peter Geoghegan <pg@bowt.ie> wrote:
> Key normalization means creating a representation for internal page
> items that we always just memcmp(), regardless of the details of the
> underlying datatypes.

One thing I would like to see is features like this added to the
opclasses (or opfamilies?) using standard PG functions that return
standard PG data types. So if each opclass had a function that took
the data type in question and returned a bytea, then you could
implement that function using a language you felt like (in theory),
test it using standard SQL, and possibly even find other uses for it.

That kind of abstraction would be more promising for the future than
having yet another C API that is used for precisely one purpose and
whose definition is "provide the data needed for this usage".

--
greg
On Mon, Jul 10, 2017 at 4:08 PM, Greg Stark <stark@mit.edu> wrote:
> One thing I would like to see is features like this added to the
> opclasses (or opfamilies?) using standard PG functions that return
> standard PG data types. So if each opclass had a function that took
> the data type in question and returned a bytea then you could
> implement that function using a language you felt like (in theory),
> test it using standard SQL, and possibly even find other uses for it.

That seems like a good goal.

> That kind of abstraction would be more promising for the future than
> having yet another C api that is used for precisely one purpose and
> whose definition is "provide the data needed for this usage".

Perhaps this is obvious, but the advantage of flattening everything
into one universal representation is that it's a very simple API for
opclass authors, one that puts control into the hands of the core
nbtree code, where it belongs. For text, say, you can generate
fictitious minimal separator keys with suffix truncation that really
are as short as possible, down to the level of individual bits. If you
tried to do this with the original text representation, you'd have to
worry about the fact that the truncated result is probably not going
to even be valid UTF-8, and that encoding-aware truncation is needed,
etc. You'd definitely have to "double check" that the truncated key
was greater than the left half and less than the right half, just in
case you didn't end up with a valid separator due to the vagaries of
the collation rules. That's the kind of complexity that scales poorly,
because the complexity cannot be isolated.

--
Peter Geoghegan
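[The point about minimal separators over normalized keys can be sketched as follows. This is illustrative only, and assumes the simple invariant left_max < separator <= right_min rather than nbtree's actual separator rules: over plain byte strings, the shortest valid separator after a page split is just the shortest prefix of the right page's minimum that still compares greater than the left page's maximum, with no "double checking" against collation rules.]

```python
# Toy suffix truncation over normalized (memcmp-able) keys: find the
# shortest separator s with left_max < s <= right_min.

def shortest_separator(left_max: bytes, right_min: bytes) -> bytes:
    assert left_max < right_min
    # Any prefix of right_min satisfies prefix <= right_min, so the
    # first prefix that exceeds left_max is the shortest separator.
    for n in range(len(right_min) + 1):
        cand = right_min[:n]
        if cand > left_max:
            return cand
    return right_min

sep = shortest_separator(b"alligator", b"apple")
assert sep == b"ap"
assert b"alligator" < sep <= b"apple"
```

Note how the separator can be shorter than either bounding key; the same truncation applied to the original text representation would need encoding- and collation-aware validation at every step.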