Обсуждение: I did some testing of GIST/GIN vs BTree indexing…

Поиск
Список
Период
Сортировка

I did some testing of GIST/GIN vs BTree indexing…

От
Guyren Howe
Дата:
Obviously, database benchmarking is basically a silly idea, because every workload, every hardware configuration, every schema are different, with very different consequences.

Still, I was left with wondering when one should choose a BTree vs GIST or GIN (I didn’t even try to look at SP-GIST: any thoughts?).

The documentation on heuristics for when to choose which doesn’t say exactly but suggests that one should just choose a BTree.

But what made me curious about the alternatives is this: you have to use a compound BTree index left-to-right (you can use the first column, or the first and second, or first, second and third…), whereas GIST and GIN both support using arbitrary fields. That right there is mighty compelling for any kind of ad-hoc query scenario. So I thought I’d do some testing about insert and query speed.

I made a table with six integer columns but no indexes, and made copies with a regular BT primary key, and then an index on all the columns, in separate tables using either a BTree, a GIST or a GIN index. The original table I generated 10 million rows of random values in the range 1..1 million.

This was on 9.4RC1 on a recent-model iMac. I used an external 2.5” hard drive so the SSD wasn’t mucking things up. I restarted Postgres (but not the computer) between tests. I tested a different range of values, of the same size, when comparing searches, in order to mostly avoid having the MacOS disk cache helping things out. I also tried running multiple versions of the search tests on different ranges in different orders, just to see if the comparison of search times was roughly consistent, which it was.

The results were… surprising, and don’t seem to accord with what the documentation says.

I was interested in insert speed. The docs suggest that GIN would be appallingly slow, but don’t really talk about GIST vs BTree. When I configured reasonable caches, working mem and such, I got the following times for insert… select * (values in ms; columns are BTree-GIST-GIN):

1,233,570
2,323,639
1,700,752

GIST was slower than GIN (which flat contradicts the documentation), but both are in shouting distance of BTree. When I tuned down the caches, to simulate a machine under load, I got:

20,252,299
3,583,346
10,495,424

So wow, BTree is much slower. This is basically a default setup, with tiny cache and working memory and whatnot.

Searching on the first column is just as interesting. With decent caches, find 100 values (so about a thousand rows):

936
508
2,215

So this result is interesting. GIN is *slowest*, and in particular is slower than GIST by a good margin, whereas the docs suggest GIN should always be faster. GIST is significantly faster than BTree.

Now finding 1,000 values (so 10,000 rows, give or take):

4,148
1,302
3,078

GIST is fastest again, but now GIN is faster than BTree.

Again, but 10,000 values/100,000 rows:

34,767
8,104
8,995

So in a memory-rich environment, GIST appears to be the clear winner. When I turn down the memory values, though, I get a very different result.

100 values/1000 rows:

7,931
7,209
1,034

1000 values/10,000 rows:

81,675
88,191
1,598

I tried that one several times, because I found it hard to believe. The results were consistent. With 10,000 values/100,000 rows, the results are even starker.

Index sizes in pages:

258,578
309,636
581,795

GIN is certainly not the “three times” size suggested in the docs, but perhaps that just hasn’t been updated for the 9.4 improvements. Certainly, there isn’t sufficient difference here to make the BTree advantage compelling in most applications.

Given the futility of database benchmarking in general, I didn’t want to go any further with this. What I was interested in was whether it might be worth switching from BTree to GIST/GIN indexes with regular sorts of data. It appears to be the case that GIST and GIN are often better than BTree in general, and given their much greater flexibility in satisfying queries on different columns, it might even be the case that one should recommend a single GIST or GIN index on the frequently-searched columns of a table in most cases?

I would absolutely *love* to hear what this community has to say about this question: should one consider GIST or GIN indexes on regular old numeric/text columns? When, theoretically? When, in practice (ie does anyone have comparable benchmarks on 9.4?). Other thoughts?

Re: I did some testing of GIST/GIN vs BTree indexing…

От
Bruce Momjian
Дата:
On Wed, Dec  3, 2014 at 01:15:50AM -0800, Guyren Howe wrote:
> GIN is certainly not the “three times” size suggested in the docs, but perhaps
> that just hasn’t been updated for the 9.4 improvements. Certainly, there isn’t
> sufficient difference here to make the BTree advantage compelling in most
> applications.

I am sure the docs need updating for 9.4 --- any suggestions?

> Given the futility of database benchmarking in general, I didn’t want to go any
> further with this. What I was interested in was whether it might be worth
> switching from BTree to GIST/GIN indexes with regular sorts of data. It appears
> to be the case that GIST and GIN are often better than BTree in general, and
> given their much greater flexibility in satisfying queries on different
> columns, it might even be the case that one should recommend a single GIST or
> GIN index on the frequently-searched columns of a table in most cases?

What GiST and GIN "ops" did you use for the testing?  Was it
contrib/btree_gist and contrib/btree_gin?

> I would absolutely *love* to hear what this community has to say about this
> question: should one consider GIST or GIN indexes on regular old numeric/text
> columns? When, theoretically? When, in practice (ie does anyone have comparable
> benchmarks on 9.4?). Other thoughts?

You might want to look at my presentation on indexing:

    http://momjian.us/main/presentations/features.html#indexing

It is my understanding that btree is best for single-match indexes like
unique indexes, or range queries (not range data types), while GIN is
best for indexes with many duplicates.  GiST is more of an indexing
framework and I am unclear where it is best except in cases where is the
only option, like geometry and perhaps range (shared with SP-GiST).
With the 9.4 GIN improvements I am unclear if GiST is ever better for
full text indexing compared to GIN.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: [GENERAL] I did some testing of GIST/GIN vs BTree indexing…

От
Guyren Howe
Дата:
On Dec 6, 2014, at 12:38 , Bruce Momjian <bruce@momjian.us> wrote:
>
> On Wed, Dec  3, 2014 at 01:15:50AM -0800, Guyren Howe wrote:
>> GIN is certainly not the “three times” size suggested in the docs, but perhaps
>> that just hasn’t been updated for the 9.4 improvements. Certainly, there isn’t
>> sufficient difference here to make the BTree advantage compelling in most
>> applications.
>
> I am sure the docs need updating for 9.4 — any suggestions?

I want to get to the point where I can make fairly definitive statements about indexing regular fields with GIST or
GIN.When I do, I’ll be happy to write something for the docs. If folks here can help me get to that point, all to the
betterof all… :-) 

>> Given the futility of database benchmarking in general, I didn’t want to go any
>> further with this. What I was interested in was whether it might be worth
>> switching from BTree to GIST/GIN indexes with regular sorts of data. It appears
>> to be the case that GIST and GIN are often better than BTree in general, and
>> given their much greater flexibility in satisfying queries on different
>> columns, it might even be the case that one should recommend a single GIST or
>> GIN index on the frequently-searched columns of a table in most cases?
>
> What GiST and GIN "ops" did you use for the testing?  Was it
> contrib/btree_gist and contrib/btree_gin?

Sorry; yes. I didn’t realize there was any practical alternative. Is there another option I should test?

> You might want to look at my presentation on indexing:
>
>     http://momjian.us/main/presentations/features.html#indexing
>
> It is my understanding that btree is best for single-match indexes like
> unique indexes, or range queries (not range data types), while GIN is
> best for indexes with many duplicates.  GiST is more of an indexing
> framework and I am unclear where it is best except in cases where is the
> only option, like geometry and perhaps range (shared with SP-GiST).
> With the 9.4 GIN improvements I am unclear if GiST is ever better for
> full text indexing compared to GIN.

Thanks for this. I will look at your presentation.

As I say, if folks can help me work out the definitive answer to all this, I’d love to contribute it to the docs.

My starting point was this: given that GIN (and GIST, maybe, the docs sort-of say “sort of”) can use arbitrary index
fields,rather than left to right, if you’re in a situation of wanting to query arbitrary subsets of some of the fields
ona table, it seems likely that a GIN index might be called for. Is that right? The description I’ve been able to find
(thatit’s a BTree with more sophisticated handling of duplicates) would surely entail otherwise, but this is clearly
whatthe docs say. 



Re: I did some testing of GIST/GIN vs BTree indexing…

От
Bruce Momjian
Дата:
On Wed, Dec 10, 2014 at 05:27:16PM -0800, Guyren Howe wrote:
> >> Given the futility of database benchmarking in general, I didn’t
> >> want to go any further with this. What I was interested in was
> >> whether it might be worth switching from BTree to GIST/GIN indexes
> >> with regular sorts of data. It appears to be the case that GIST and
> >> GIN are often better than BTree in general, and given their much
> >> greater flexibility in satisfying queries on different columns, it
> >> might even be the case that one should recommend a single GIST or
> >> GIN index on the frequently-searched columns of a table in most
> >> cases?
> >
> > What GiST and GIN "ops" did you use for the testing?  Was it
> > contrib/btree_gist and contrib/btree_gin?
>
> Sorry; yes. I didn’t realize there was any practical alternative. Is
> there another option I should test?

Well, GIN and GiST are usually used for data that btree can't index,
e.g. JSONB, 2-dimensional points  I  thought the only win for
contrib/btree_gist and contrib/btree_gin would be for indexes with many
duplicates.

> > You might want to look at my presentation on indexing:
> >
> >     http://momjian.us/main/presentations/features.html#indexing
> >
> > It is my understanding that btree is best for single-match indexes
> > like unique indexes, or range queries (not range data types), while
> > GIN is best for indexes with many duplicates.  GiST is more of an
> > indexing framework and I am unclear where it is best except in cases
> > where is the only option, like geometry and perhaps range (shared
> > with SP-GiST).  With the 9.4 GIN improvements I am unclear if GiST
> > is ever better for full text indexing compared to GIN.
>
> Thanks for this. I will look at your presentation.
>
> As I say, if folks can help me work out the definitive answer to all
> this, I’d love to contribute it to the docs.

Great, thanks.

> My starting point was this: given that GIN (and GIST, maybe, the docs
> sort-of say “sort of”) can use arbitrary index fields, rather
> than left to right, if you’re in a situation of wanting to query
> arbitrary subsets of some of the fields on a table, it seems likely
> that a GIN index might be called for. Is that right? The description
> I’ve been able to find (that it’s a BTree with more sophisticated
> handling of duplicates) would surely entail otherwise, but this is
> clearly what the docs say.

Are you saying when you use a GIN index on a,b,c fields, you can do
lookups on them independently, like 'c'?  I was not aware that works,
but it might.  I know it doesn't work for traditional btree as the index
is hierarchical.  You can look up things like a,c and it will skip over
'b', but doing 'c' alone doesn't make any sense for traditional btree.

It would be interesting if that was true, though, and something we
should more clearly document.  Your testing is very useful here.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +


Re: [GENERAL] I did some testing of GIST/GIN vs BTree indexing…

От
Guyren Howe
Дата:
On Dec 10, 2014, at 19:38 , Bruce Momjian <bruce@momjian.us> wrote:
>
> Are you saying when you use a GIN index on a,b,c fields, you can do
> lookups on them independently, like 'c'?  I was not aware that works,
> but it might.  I know it doesn't work for traditional btree as the index
> is hierarchical.  You can look up things like a,c and it will skip over
> 'b', but doing 'c' alone doesn't make any sense for traditional btree.
>
> It would be interesting if that was true, though, and something we
> should more clearly document.  Your testing is very useful here.

This page:

http://www.postgresql.org/docs/9.4/static/indexes-multicolumn.html

says:

A multicolumn GiST index can be used with query conditions that involve any subset of the index's columns. Conditions
onadditional columns restrict the entries returned by the index, but the condition on the first column is the most
importantone for determining how much of the index needs to be scanned. A GiST index will be relatively ineffective if
itsfirst column has only a few distinct values, even if there are many distinct values in additional columns. 

A multicolumn GIN index can be used with query conditions that involve any subset of the index's columns. Unlike B-tree
orGiST, index search effectiveness is the same regardless of which index column(s) the query conditions use. 



This appears to imply greater (complete?) flexibility in using non-leading columns with GIST and GIN indexes, or am I
misunderstandingsomething? This is the whole reason I’ve started investigating this — particularly given what it says
aboutGIN.