Обсуждение: I'd like to learn a bit more about how indexes work

Поиск
Список
Период
Сортировка

I'd like to learn a bit more about how indexes work

От
Mike Christensen
Дата:
Hi -

I'm trying to increase my general knowledge about how indexes work in
databases.  Though my questions are probably general and implemented
in a similar way across major relational DBs, I'm also curious as to
how they're implemented in Postgres specifically (mainly because I
like PG, and am always interested in knowing if PG does things in some
cool and interesting way).

I know the basics of how binary trees work, so I understand a query such as :

select * from Table where Id = 5;

Provided Id has a btree index on it.  I'm curious as to how indexes
are used with OR and AND clauses.

Something like:

select * from Table where X = 5 or y = 3;

It seems to me both the index of X would be scanned and those rows
would be loaded into memory, and then the index of Y would be scanned
and loaded.  Then, Postgres would have to merge both sets into a set
of unique rows.  Is this pretty much what's going on?  Let's ignore
table stats for now.

Then, something like:

select * from Table where X = 5 AND y = 3;

I would imagine the same thing is going on, only Postgres would find
rows that appear in both sets.  I also imagine Postgres might create a
hash table from the larger set, and then iterate through the smaller
set looking for rows that were in that hash table.

Lastly, If you had a query such as:

select * from Table where X IN (1,2,3,4,5,6,7);

I would imagine Postgres would parse that query as a bunch of OR
clauses.  Does this mean the index for X would be scanned 7 times and
merged into a set of unique results?  Though, obviously if Postgres
estimated this would return the majority of the rows in the table, it
would probably just ignore the index completely.

Thanks!
Mike

Re: I'd like to learn a bit more about how indexes work

От
Dann Corbit
Дата:
If you want to discover how B+Trees or B-Trees work, I suggest a web search.  A database like PostgreSQL is not going
touse an ordinary btree for an index, but they use special trees that have page level structures, such as B-Trees, GiST
trees,etc.    For PostgreSQL the list includes {IIRC} B-tree, Hash, GiST and GIN, though I am not sure it is current.
Ibelieve that there is also a GIS extension to PostgreSQL which probably uses Octrees or Quadtrees, but that is purely
aguess. 
Place this criteria into your favorite search engine, for instance:
"B-Tree" index

You can qualify it with "PostgreSQL" if you like, but I suspect you just want to know how indexes work in general with
differentindex types. 

I suspect that what you really want to eventually understand is:

"How does the optimizer make plans to create efficient queries" which is what is indicated in your questions below.

If that is the case, then I suggest performing search queries with keywords such as:
sql cost based optimizer

-----Original Message-----
From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Mike Christensen
Sent: Tuesday, June 05, 2012 3:25 PM
To: pgsql-general@postgresql.org
Subject: [GENERAL] I'd like to learn a bit more about how indexes work

Hi -

I'm trying to increase my general knowledge about how indexes work in databases.  Though my questions are probably
generaland implemented in a similar way across major relational DBs, I'm also curious as to how they're implemented in
Postgresspecifically (mainly because I like PG, and am always interested in knowing if PG does things in some cool and
interestingway). 

I know the basics of how binary trees work, so I understand a query such as :

select * from Table where Id = 5;

Provided Id has a btree index on it.  I'm curious as to how indexes are used with OR and AND clauses.

Something like:

select * from Table where X = 5 or y = 3;

It seems to me both the index of X would be scanned and those rows would be loaded into memory, and then the index of Y
wouldbe scanned and loaded.  Then, Postgres would have to merge both sets into a set of unique rows.  Is this pretty
muchwhat's going on?  Let's ignore table stats for now. 

Then, something like:

select * from Table where X = 5 AND y = 3;

I would imagine the same thing is going on, only Postgres would find rows that appear in both sets.  I also imagine
Postgresmight create a hash table from the larger set, and then iterate through the smaller set looking for rows that
werein that hash table. 

Lastly, If you had a query such as:

select * from Table where X IN (1,2,3,4,5,6,7);

I would imagine Postgres would parse that query as a bunch of OR clauses.  Does this mean the index for X would be
scanned7 times and merged into a set of unique results?  Though, obviously if Postgres estimated this would return the
majorityof the rows in the table, it would probably just ignore the index completely. 

Thanks!
Mike

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Re: I'd like to learn a bit more about how indexes work

От
Mike Christensen
Дата:
I'm aware of how "B-Trees" work, but I only understand them on the
level of traversing a single tree at a time.  I'm curious how Postgres
combines multiple indexes with OR and AND clauses.

I've done some Google searches, however I can't find anything basic.
Everything I've found assumes you already have knowledge of terms such
as "hash join, aggregate join, etc".

At this point I'm not looking at learning how the optimizer works..

Mike

On Tue, Jun 5, 2012 at 3:36 PM, Dann Corbit <DCorbit@connx.com> wrote:
> If you want to discover how B+Trees or B-Trees work, I suggest a web search.  A database like PostgreSQL is not going
touse an ordinary btree for an index, but they use special trees that have page level structures, such as B-Trees, GiST
trees,etc.    For PostgreSQL the list includes {IIRC} B-tree, Hash, GiST and GIN, though I am not sure it is current.
 Ibelieve that there is also a GIS extension to PostgreSQL which probably uses Octrees or Quadtrees, but that is purely
aguess. 
> Place this criteria into your favorite search engine, for instance:
> "B-Tree" index
>
> You can qualify it with "PostgreSQL" if you like, but I suspect you just want to know how indexes work in general
withdifferent index types. 
>
> I suspect that what you really want to eventually understand is:
>
> "How does the optimizer make plans to create efficient queries" which is what is indicated in your questions below.
>
> If that is the case, then I suggest performing search queries with keywords such as:
> sql cost based optimizer
>
> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Mike Christensen
> Sent: Tuesday, June 05, 2012 3:25 PM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] I'd like to learn a bit more about how indexes work
>
> Hi -
>
> I'm trying to increase my general knowledge about how indexes work in databases.  Though my questions are probably
generaland implemented in a similar way across major relational DBs, I'm also curious as to how they're implemented in
Postgresspecifically (mainly because I like PG, and am always interested in knowing if PG does things in some cool and
interestingway). 
>
> I know the basics of how binary trees work, so I understand a query such as :
>
> select * from Table where Id = 5;
>
> Provided Id has a btree index on it.  I'm curious as to how indexes are used with OR and AND clauses.
>
> Something like:
>
> select * from Table where X = 5 or y = 3;
>
> It seems to me both the index of X would be scanned and those rows would be loaded into memory, and then the index of
Ywould be scanned and loaded.  Then, Postgres would have to merge both sets into a set of unique rows.  Is this pretty
muchwhat's going on?  Let's ignore table stats for now. 
>
> Then, something like:
>
> select * from Table where X = 5 AND y = 3;
>
> I would imagine the same thing is going on, only Postgres would find rows that appear in both sets.  I also imagine
Postgresmight create a hash table from the larger set, and then iterate through the smaller set looking for rows that
werein that hash table. 
>
> Lastly, If you had a query such as:
>
> select * from Table where X IN (1,2,3,4,5,6,7);
>
> I would imagine Postgres would parse that query as a bunch of OR clauses.  Does this mean the index for X would be
scanned7 times and merged into a set of unique results?  Though, obviously if Postgres estimated this would return the
majorityof the rows in the table, it would probably just ignore the index completely. 
>
> Thanks!
> Mike
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Re: I'd like to learn a bit more about how indexes work

От
Dann Corbit
Дата:
-----Original Message-----
From: Mike Christensen [mailto:mike@kitchenpc.com]
Sent: Tuesday, June 05, 2012 4:28 PM
To: Dann Corbit
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] I'd like to learn a bit more about how indexes work

I'm aware of how "B-Trees" work, but I only understand them on the level of traversing a single tree at a time.  I'm
curioushow Postgres combines multiple indexes with OR and AND clauses. 

I've done some Google searches, however I can't find anything basic.
Everything I've found assumes you already have knowledge of terms such as "hash join, aggregate join, etc".

At this point I'm not looking at learning how the optimizer works..
>>
"How the optimizer works" is the answer to your question.
The plan of attack for forming a query is a function of the optimizer.

One possible plan  for " WHERE data = key1 OR data = key2 " is something along the lines of:
SEEK("key1")
While key == key1 accumulate rows into the result set
   GetNextRow()
SEEK("key2")
While key == key2 accumulate rows into the result set
   GetNextRow()

However, if the table is tiny (suppose it is ten rows and fits into memory) then a table scan might be cheaper.

Here at CONNX, I have written a hashed btree search that tends to be cheaper than using a clustered index if there are
noqualifiers on the join. 
For instance
SELECT a.*, b.* from table1 a, table2 b WHERE a.unique_index = b.foreign_key

It will be faster to actually not use the index.  Whereas if there are additional where clause criteria such as:
SELECT a.*, b.* from table1 a, table2 b WHERE a.unique_index = b.foreign_key AND a.unique_index IN (k1, k2, k3, k4,...,
kn,kn+1) 
It will probably be faster to use the index unless the list of items is a substantial proportion of the possible data
values.

The point is that there is not a simple formula that describes how data values are retrieved from the database.  The
methodof collection is decided by the optimizer. 

It isn't always cheaper to use an index.  In fact, sometimes building an index is a complete waste of time.
For instance, suppose that you have a column named 'sex' that can contain the values 'F', 'M', and 'U'
If you built an index on that column it won't help you to find all the males faster than a table scan because the data
isnot specific enough so that the total number of pages of disk that are read would be more with the index than if the
indexwere not used. 
So, I suggest that possibly the articles you do not want to read are the very ones that will answer your questions.

On the other hand, it is not unlikely that I simply do not understand the questions that you are asking.
<<

Re: I'd like to learn a bit more about how indexes work

От
Chris Curvey
Дата:


On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <mike@kitchenpc.com> wrote:
Hi -

I'm trying to increase my general knowledge about how indexes work in
databases.  Though my questions are probably general and implemented
in a similar way across major relational DBs, I'm also curious as to
how they're implemented in Postgres specifically (mainly because I
like PG, and am always interested in knowing if PG does things in some
cool and interesting way).

Quick!   Create some test data!

drop table if exists foobar;
create table foobar
(  a int not null primary key
,  b int null
,  c int null
,  d int null
);
create index b_idx on foobar (b);
create index c_idx on foobar (c);
create index d_idx on foobar (d);

create or replace function generate_test_data() returns void as $$
declare
  i integer;
begin
  for i in 1..100000 loop
     insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
  end loop;
end;
$$ language plpgsql;

select generate_test_data();

vacuum analyze foobar;
 

I know the basics of how binary trees work, so I understand a query such as :

select * from Table where Id = 5;

Provided Id has a btree index on it.

sometimes.  Sometimes not.

explain analyze
select * from foobar
where a = 1234;
                                                     QUERY PLAN                  
---------------------------------------------------------------------------------------------------------------------
 Index Scan using foobar_pkey on foobar  (cost=0.00..8.38 rows=1 width=16) (actual time=0.008..0.008 rows=1 loops=1)
   Index Cond: (a = 1234)
 Total runtime: 0.030 ms
(3 rows)

explain analyze
select *
from foobar
where b = 1;
                                                 QUERY PLAN                      
-------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..1791.00 rows=50270 width=16) (actual time=0.011..13.603 rows=50000 loops=1)
   Filter: (b = 1)
   Rows Removed by Filter: 50000
 Total runtime: 16.005 ms
(4 rows)
 
 I'm curious as to how indexes
are used with OR and AND clauses.

Something like:

select * from Table where X = 5 or y = 3;

It seems to me both the index of X would be scanned and those rows
would be loaded into memory, and then the index of Y would be scanned
and loaded.  Then, Postgres would have to merge both sets into a set
of unique rows.  Is this pretty much what's going on?  Let's ignore
table stats for now.

The right answer is "sometimes".  Here's a query that is solved the way you expect:

explain analyze
select *
from foobar
where c = 1
or d = 1;
                                                          QUERY PLAN             
------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foobar  (cost=226.47..920.65 rows=10202 width=16) (actual time=1.284..3.784 rows=10000 loops=1)
   Recheck Cond: ((c = 1) OR (d = 1))
   ->  BitmapOr  (cost=226.47..226.47 rows=10212 width=0) (actual time=1.174..1.174 rows=0 loops=1)
         ->  Bitmap Index Scan on c_idx  (cost=0.00..216.23 rows=10113 width=0) (actual time=1.157..1.157 rows=10000 loops=1)
               Index Cond: (c = 1)
         ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0) (actual time=0.016..0.016 rows=100 loops=1)
               Index Cond: (d = 1)
 Total runtime: 4.452 ms
(8 rows)

And here is a query that is not solved the way you expect, because the index on B is not useful.

explain analyze
select *
from foobar
where c = 1
or b = 1;
                                                 QUERY PLAN                      
-------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2041.00 rows=55299 width=16) (actual time=0.007..14.922 rows=50000 loops=1)
   Filter: ((c = 1) OR (b = 1))
   Rows Removed by Filter: 50000
 Total runtime: 17.002 ms
(4 rows)

 

Then, something like:

select * from Table where X = 5 AND y = 3;

I would imagine the same thing is going on, only Postgres would find
rows that appear in both sets.  I also imagine Postgres might create a
hash table from the larger set, and then iterate through the smaller
set looking for rows that were in that hash table 

and you would be largely right about that.  Interestingly, on an earlier run of this, I got a BitmapAnd strategy, rather than applying the c=1 filter to the rows after identifying all the d=1 rows.

explain analyze
select *
from foobar
where c = 1
and d = 1;
                                                   QUERY PLAN                    
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foobar  (cost=5.13..256.48 rows=10 width=16) (actual time=0.046..0.150 rows=100 loops=1)
   Recheck Cond: (d = 1)
   Filter: (c = 1)
   ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0) (actual time=0.026..0.026 rows=100 loops=1)
         Index Cond: (d = 1)
 Total runtime: 0.179 ms
(6 rows)

 

Lastly, If you had a query such as:

select * from Table where X IN (1,2,3,4,5,6,7);

I would imagine Postgres would parse that query as a bunch of OR
clauses.  Does this mean the index for X would be scanned 7 times and
merged into a set of unique results?  Though, obviously if Postgres
estimated this would return the majority of the rows in the table, it
would probably just ignore the index completely.

 and you would be right on both counts

explain analyze
select *
from foobar
where c in (1, 2, 3);
                                                       QUERY PLAN                
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on foobar  (cost=609.14..1562.18 rows=29967 width=16) (actual time=3.456..7.589 rows=30000 loops=1)
   Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
   ->  Bitmap Index Scan on c_idx  (cost=0.00..601.64 rows=29967 width=0) (actual time=3.342..3.342 rows=30000 loops=1)
         Index Cond: (c = ANY ('{1,2,3}'::integer[]))
 Total runtime: 9.083 ms
(5 rows)

explain analyze
select *
from foobar
where c in (1, 2, 3, 4, 5, 6);
                                                 QUERY PLAN                      
-------------------------------------------------------------------------------------------------------------
 Seq Scan on foobar  (cost=0.00..2291.00 rows=59723 width=16) (actual time=0.005..18.450 rows=60000 loops=1)
   Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
   Rows Removed by Filter: 40000
 Total runtime: 21.181 ms
(4 rows)




Thanks!
Mike


Re: I'd like to learn a bit more about how indexes work

От
Mike Christensen
Дата:
On Tue, Jun 5, 2012 at 7:20 PM, Chris Curvey <chris@chriscurvey.com> wrote:
>
>
> On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <mike@kitchenpc.com> wrote:
>>
>> Hi -
>>
>> I'm trying to increase my general knowledge about how indexes work in
>> databases.  Though my questions are probably general and implemented
>> in a similar way across major relational DBs, I'm also curious as to
>> how they're implemented in Postgres specifically (mainly because I
>> like PG, and am always interested in knowing if PG does things in some
>> cool and interesting way).
>
>
> Quick!   Create some test data!
>
> drop table if exists foobar;
> create table foobar
> (  a int not null primary key
> ,  b int null
> ,  c int null
> ,  d int null
> );
> create index b_idx on foobar (b);
> create index c_idx on foobar (c);
> create index d_idx on foobar (d);
>
> create or replace function generate_test_data() returns void as $$
> declare
>   i integer;
> begin
>   for i in 1..100000 loop
>      insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
>   end loop;
> end;
> $$ language plpgsql;
>
> select generate_test_data();
>
> vacuum analyze foobar;
>
>>
>>
>> I know the basics of how binary trees work, so I understand a query such
>> as :
>>
>> select * from Table where Id = 5;
>>
>> Provided Id has a btree index on it.
>
>
> sometimes.  Sometimes not.
>
> explain analyze
> select * from foobar
> where a = 1234;
>                                                      QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------
>  Index Scan using foobar_pkey on foobar  (cost=0.00..8.38 rows=1 width=16)
> (actual time=0.008..0.008 rows=1 loops=1)
>    Index Cond: (a = 1234)
>  Total runtime: 0.030 ms
> (3 rows)
>
> explain analyze
> select *
> from foobar
> where b = 1;
>                                                  QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
>  Seq Scan on foobar  (cost=0.00..1791.00 rows=50270 width=16) (actual
> time=0.011..13.603 rows=50000 loops=1)
>    Filter: (b = 1)
>    Rows Removed by Filter: 50000
>  Total runtime: 16.005 ms
> (4 rows)
>
>>
>>  I'm curious as to how indexes
>> are used with OR and AND clauses.
>>
>> Something like:
>>
>> select * from Table where X = 5 or y = 3;
>>
>> It seems to me both the index of X would be scanned and those rows
>> would be loaded into memory, and then the index of Y would be scanned
>> and loaded.  Then, Postgres would have to merge both sets into a set
>> of unique rows.  Is this pretty much what's going on?  Let's ignore
>> table stats for now.
>
>
> The right answer is "sometimes".  Here's a query that is solved the way you
> expect:
>
> explain analyze
> select *
> from foobar
> where c = 1
> or d = 1;
>                                                           QUERY PLAN
>
>
------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foobar  (cost=226.47..920.65 rows=10202 width=16)
> (actual time=1.284..3.784 rows=10000 loops=1)
>    Recheck Cond: ((c = 1) OR (d = 1))
>    ->  BitmapOr  (cost=226.47..226.47 rows=10212 width=0) (actual
> time=1.174..1.174 rows=0 loops=1)
>          ->  Bitmap Index Scan on c_idx  (cost=0.00..216.23 rows=10113
> width=0) (actual time=1.157..1.157 rows=10000 loops=1)
>                Index Cond: (c = 1)
>          ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0)
> (actual time=0.016..0.016 rows=100 loops=1)
>                Index Cond: (d = 1)
>  Total runtime: 4.452 ms
> (8 rows)
>
> And here is a query that is not solved the way you expect, because the index
> on B is not useful.
>
> explain analyze
> select *
> from foobar
> where c = 1
> or b = 1;
>                                                  QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
>  Seq Scan on foobar  (cost=0.00..2041.00 rows=55299 width=16) (actual
> time=0.007..14.922 rows=50000 loops=1)
>    Filter: ((c = 1) OR (b = 1))
>    Rows Removed by Filter: 50000
>  Total runtime: 17.002 ms
> (4 rows)
>
>
>>
>>
>> Then, something like:
>>
>> select * from Table where X = 5 AND y = 3;
>>
>> I would imagine the same thing is going on, only Postgres would find
>> rows that appear in both sets.  I also imagine Postgres might create a
>> hash table from the larger set, and then iterate through the smaller
>> set looking for rows that were in that hash table
>
>
> and you would be largely right about that.  Interestingly, on an earlier run
> of this, I got a BitmapAnd strategy, rather than applying the c=1 filter to
> the rows after identifying all the d=1 rows.
>
> explain analyze
> select *
> from foobar
> where c = 1
> and d = 1;
>                                                    QUERY PLAN
>
> -----------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foobar  (cost=5.13..256.48 rows=10 width=16) (actual
> time=0.046..0.150 rows=100 loops=1)
>    Recheck Cond: (d = 1)
>    Filter: (c = 1)
>    ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0) (actual
> time=0.026..0.026 rows=100 loops=1)
>          Index Cond: (d = 1)
>  Total runtime: 0.179 ms
> (6 rows)
>
>
>>
>>
>> Lastly, If you had a query such as:
>>
>> select * from Table where X IN (1,2,3,4,5,6,7);
>>
>> I would imagine Postgres would parse that query as a bunch of OR
>> clauses.  Does this mean the index for X would be scanned 7 times and
>> merged into a set of unique results?  Though, obviously if Postgres
>> estimated this would return the majority of the rows in the table, it
>> would probably just ignore the index completely.
>
>
>  and you would be right on both counts
>
> explain analyze
> select *
> from foobar
> where c in (1, 2, 3);
>                                                        QUERY PLAN
>
>
------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foobar  (cost=609.14..1562.18 rows=29967 width=16)
> (actual time=3.456..7.589 rows=30000 loops=1)
>    Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
>    ->  Bitmap Index Scan on c_idx  (cost=0.00..601.64 rows=29967 width=0)
> (actual time=3.342..3.342 rows=30000 loops=1)
>          Index Cond: (c = ANY ('{1,2,3}'::integer[]))
>  Total runtime: 9.083 ms
> (5 rows)
>
> explain analyze
> select *
> from foobar
> where c in (1, 2, 3, 4, 5, 6);
>                                                  QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
>  Seq Scan on foobar  (cost=0.00..2291.00 rows=59723 width=16) (actual
> time=0.005..18.450 rows=60000 loops=1)
>    Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
>    Rows Removed by Filter: 40000
>  Total runtime: 21.181 ms
> (4 rows)

Thanks!  One thing that still confuses me is the difference between IN
and OR.  With this query:

explain analyze
select *
from foobar
where d in (500, 750);

It scans the d index only once:

'Bitmap Heap Scan on foobar  (cost=10.03..400.63 rows=196 width=16)
(actual time=0.128..0.489 rows=200 loops=1)'
'  Recheck Cond: (d = ANY ('{500,750}'::integer[]))'
'  ->  Bitmap Index Scan on d_idx  (cost=0.00..9.98 rows=196 width=0)
(actual time=0.086..0.086 rows=200 loops=1)'
'        Index Cond: (d = ANY ('{500,750}'::integer[]))'
'Total runtime: 0.592 ms'

And if I change it to:

explain analyze
select *
from foobar
where d = 500 or d = 750;

Then the d index is scanned twice and it uses a BItmapOr:

'Bitmap Heap Scan on foobar  (cost=10.10..401.18 rows=196 width=16)
(actual time=0.118..0.480 rows=200 loops=1)'
'  Recheck Cond: ((d = 500) OR (d = 750))'
'  ->  BitmapOr  (cost=10.10..10.10 rows=196 width=0) (actual
time=0.079..0.079 rows=0 loops=1)'
'        ->  Bitmap Index Scan on d_idx  (cost=0.00..5.00 rows=98
width=0) (actual time=0.047..0.047 rows=100 loops=1)'
'              Index Cond: (d = 500)'
'        ->  Bitmap Index Scan on d_idx  (cost=0.00..5.00 rows=98
width=0) (actual time=0.030..0.030 rows=100 loops=1)'
'              Index Cond: (d = 750)'
'Total runtime: 0.611 ms'

Shouldn't these two queries be the same?  I'm curious as to why the OR
operator will do two Bitmap Index Scans, then OR then together - while
the IN clause will do a single Index scan that gets all 200 rows at
once.  Probably the answer is just "Because that's the way it was
written, deal."

Anyway, I don't want to be too much of a vampire.  I'll keep reading
stuff online as well.  I found some old Tom Lane posts that are
helpful, but a lot of it is over my head still.

Mike

Re: I'd like to learn a bit more about how indexes work

От
Tom Lane
Дата:
Mike Christensen <mike@kitchenpc.com> writes:
> Thanks!  One thing that still confuses me is the difference between IN
> and OR.  With this query:

> explain analyze
> select *
> from foobar
> where d in (500, 750);

> It scans the d index only once:

> 'Bitmap Heap Scan on foobar  (cost=10.03..400.63 rows=196 width=16)
> (actual time=0.128..0.489 rows=200 loops=1)'
> '  Recheck Cond: (d = ANY ('{500,750}'::integer[]))'
> '  ->  Bitmap Index Scan on d_idx  (cost=0.00..9.98 rows=196 width=0)
> (actual time=0.086..0.086 rows=200 loops=1)'
> '        Index Cond: (d = ANY ('{500,750}'::integer[]))'
> 'Total runtime: 0.592 ms'

Actually, that's two indexscans --- the btree code is aware that it has
to perform an indexscan for each element of the =ANY array, and add all
the resulting bits to the output bitmap.  (The bitmap takes care of
eliminating any duplicate hits, which is why this type of index
condition is only supported for bitmap scans and not plain indexscans.)

So this isn't really any different from the OR case in terms of what
happens while probing the index.  It's marginally more efficient in
that we save an explicit BitmapOr step, but only marginally.

As for why these cases are treated differently, yeah that's historical
to some extent, but it's also true that trying to convert one notation
to the other would be expensive and often fruitless.

            regards, tom lane