Обсуждение: Can the query planner create indexes?

Поиск
Список
Период
Сортировка

Can the query planner create indexes?

От
Dario Beraldi
Дата:
Hello,

This question is just for my curiosity...

When an index is available for a query, the planner decides whether to
use it or not depending on whether it would make the query perform
better, right? However if an index, which does not exist, would make
the query run better the planner is not able (allowed?) to create such
index, use it, and drop it once the query is done. Why is it so?
Why is the query planner not allowed to create indexes, but only
allowed to use or not use what's available?

Am I misunderstanding the way the planner works? Any explanation &
clarification much appreciated!

All the best and Christmas wishes!
Dario

--

Dr. Dario Beraldi
Institute of Evolutionary Biology
University of Edinburgh
West Mains Road
Edinburgh EH9 3JT
Scotland, UK


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Can the query planner create indexes?

От
Gabriele Bartolini
Дата:
 Ciao Dario,

 On Tue, 21 Dec 2010 09:14:36 +0000, Dario Beraldi
 <dario.beraldi@ed.ac.uk> wrote:
> the query run better the planner is not able (allowed?) to create
> such
> index, use it, and drop it once the query is done. Why is it so?

 Because it is not its responsibility. This is the simplest and most
 rational answer.

 I would reply with some questions that hopefully will give you the
 answer. How can you define a 'good' query from a 'bad' query? Consider
 the case when an user launches a wrong query and the planner you propose
 starts creating an index. I believe that would be a mess.

 Then ... how could you project this scenario in a concurrent context
 where multiple users launch queries that 'need' an index?

 I suggest that you look at the documentation for more information.
 Otherwise, I strongly suggest that you read the chapter on the planner
 from Greg's book on High Performance (which you can find from here:
 http://www.postgresql.org/docs/books/)

 Merry Christmas to you too!

 Cheers,
 Gabriele

--
  Gabriele Bartolini - 2ndQuadrant Italia
  PostgreSQL Training, Services and Support
  Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it

Re: Can the query planner create indexes?

От
"Massa, Harald Armin"
Дата:
Hello Dario,

When an index is available for a query, the planner decides whether to use it or not depending on whether it would make the query perform better, right? However if an index, which does not exist, would make the query run better the planner is not able (allowed?) to create such index, use it, and drop it once the query is done. Why is it so?

From my knowledge there is more then one answer:

a) There is a proposal (and, at the time being) also some code on pgfoundry creating "hypothetical indexes"

The idea is: to play with "what-would-be-if-there-would-be-an-index".

With keywords "hypothetical index", "Index advisor" and "virtual index" there is some research googleable

b) creating an index requires to read the data-to-be-indexed. So, to have an index pointing at the interesting rows for your query, the table has to be read ... which would be the perfect time to allready select the interesting rows. And after having the interesting rows: the index is worthless

c) PostgreSQL is in fact doing something quite similiar like "temporary indexes during the query"; only for the selected rows. The words to google for is "hash maps", "bitmap access".

Why is the query planner not allowed to create indexes, but only allowed to use or not use what's available?

as in b): Creating an index is quite expensiv

additionally: having multiple possible plans is also creating a new decision problem: which of the possible plans will lead to the better result; again with some meanings of "better": faster result or less processor usage or less memory usage or less disk accesses. So adding additional indices during planning would worsen this problem; which has to be balanced against possible gains.

So: hypothetical indizes are a good idea, BUT current limitations would most likely force them to be made outside the life query process.

Best wishes,

Harald

--
GHUM GmbH
Harald Armin Massa
Spielberger Straße 49
70435 Stuttgart
0173/9409607

Amtsgericht Stuttgart, HRB 734971
-
persuadere.
et programmare

Re: Can the query planner create indexes?

От
Gabriele Bartolini
Дата:
 Hi Harald,

 On Tue, 21 Dec 2010 11:42:40 +0100, "Massa, Harald Armin"
 <chef@ghum.de> wrote:
> a) There is a proposal (and, at the time being) also some code on
> pgfoundry creating "hypothetical indexes"
>
> http://postgresql.1045698.n5.nabble.com/Hypothetical-Indexes-PostgreSQL-extension-PGCON-2010-td3288085.html
> [1]
> http://archives.postgresql.org/pgsql-hackers/2010-12/msg00087.php [2]

 I totally missed this. Thanks for posting it.

 Cheers,
 Gabriele

--
  Gabriele Bartolini - 2ndQuadrant Italia
  PostgreSQL Training, Services and Support
  Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it

Re: Can the query planner create indexes?

От
Dario Beraldi
Дата:
Ok, thanks a lot to all of you for your answers! (Always impressed by
the prompt feedback you get on this list!)

Quoting Gabriele Bartolini <Gabriele.Bartolini@2ndQuadrant.it>:

> Ciao Dario,
>
> On Tue, 21 Dec 2010 09:14:36 +0000, Dario Beraldi
> <dario.beraldi@ed.ac.uk> wrote:
>> the query run better the planner is not able (allowed?) to create such
>> index, use it, and drop it once the query is done. Why is it so?
>
> Because it is not its responsibility. This is the simplest and most
> rational answer.
>
> I would reply with some questions that hopefully will give you the
> answer. How can you define a 'good' query from a 'bad' query?
> Consider the case when an user launches a wrong query and the
> planner you propose starts creating an index. I believe that would
> be a mess.
>
> Then ... how could you project this scenario in a concurrent context
> where multiple users launch queries that 'need' an index?
>
> I suggest that you look at the documentation for more information.
> Otherwise, I strongly suggest that you read the chapter on the
> planner from Greg's book on High Performance (which you can find
> from here: http://www.postgresql.org/docs/books/)
>
> Merry Christmas to you too!
>
> Cheers,
> Gabriele
>
> --
>  Gabriele Bartolini - 2ndQuadrant Italia
>  PostgreSQL Training, Services and Support
>  Gabriele.Bartolini@2ndQuadrant.it - www.2ndQuadrant.it
>
>



--

Dr. Dario Beraldi
Institute of Evolutionary Biology
University of Edinburgh
West Mains Road
Edinburgh EH9 3JT
Scotland, UK


--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.



Re: Can the query planner create indexes?

От
Jeremy Harris
Дата:
On 2010-12-21 10:42, Massa, Harald Armin wrote:
> b) creating an index requires to read the data-to-be-indexed. So, to have an
> index pointing at the interesting rows for your query, the table has to be
> read ... which would be the perfect time to allready select the interesting
> rows. And after having the interesting rows: the index is worthless

... until another similar query comes along, when suddenly it's a massive win.
Why not auto-create indices for some limited period after database load
(copy?  any large number of inserts from a single connection?), track those
that actually get re-used and remove the rest?   Would this not provide
a better out-of-the-box experience for neophytes?

[...]
> Why is the query planner not allowed to create indexes, but only allowed to
>> use or not use what's available?
>>
>
> as in b): Creating an index is quite expensiv

How much more so than doing that full-table-scan plus sort, which your
query is doing anyway?

Cheers,
     Jeremy



Re: Can the query planner create indexes?

От
tv@fuzzy.cz
Дата:
> On 2010-12-21 10:42, Massa, Harald Armin wrote:
>> b) creating an index requires to read the data-to-be-indexed. So, to
>> have an
>> index pointing at the interesting rows for your query, the table has to
>> be
>> read ... which would be the perfect time to allready select the
>> interesting
>> rows. And after having the interesting rows: the index is worthless
>
> ... until another similar query comes along, when suddenly it's a massive
> win.
> Why not auto-create indices for some limited period after database load
> (copy?  any large number of inserts from a single connection?), track
> those
> that actually get re-used and remove the rest?   Would this not provide
> a better out-of-the-box experience for neophytes?

Say you have a table with several columns (A,B,C), and the query is using
some of them. What indexes would you create? One index on every column? A
multi-column index on all columns? Indexs for each combination of columns?

There really is no automatic way to solve this puzzle using a single
query. Indexing strategy is a very tough design discipline, and it
requires a complex knowledge of the workload. One slow query does not mean
the index should be created - what if that was just an ad-hoc query and
will not be executed ever again?

So although this 'automatic index creation' seems nice, it really does not
work in practice.

I really don't want to work with products that try to be smarter than me
(and I admit I'm pretty dumb from time to time) and act rather randomly
due to this 'artificial intelligence'. I've already been there and I don't
want to repeat this experience.

>
> [...]
>> Why is the query planner not allowed to create indexes, but only allowed
>> to
>>> use or not use what's available?
>>>
>>
>> as in b): Creating an index is quite expensiv
>
> How much more so than doing that full-table-scan plus sort, which your
> query is doing anyway?

A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
maintenance_work_mem) etc. So imagine a few users, building indices on a
big table simultaneously - no, I really don't want to see this on a
production server.

Building an index is just one side of the problem - maintenance of the
indexes is another thing. Each index has an impact on write operations
(INSERT/UPDATE) and may cause that HOT actually does not work (thus
causing unnecessary bloat of the index). And the indexes may need lot of
space on a disk.

But the real show stopper is probably locking. Building an index takes a
write lock on a table, effectively blocking writes. Sure, you can use a
'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
disadvantages of that (see
http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).

cheers
Tomas


Re: Can the query planner create indexes?

От
Grzegorz Jaśkiewicz
Дата:
I don't think planner should do things like creating an index. But it
might hint at doing it in the logs.
There was a discussion around that sort of feature on -hackers not so
long time ago. I don't remember what the conclusion was, but probably
that it just isn't worth wasting planner's cycles when you execute a
query.

Re: Can the query planner create indexes?

От
tv@fuzzy.cz
Дата:
> I don't think planner should do things like creating an index. But it
> might hint at doing it in the logs.
> There was a discussion around that sort of feature on -hackers not so
> long time ago. I don't remember what the conclusion was, but probably
> that it just isn't worth wasting planner's cycles when you execute a
> query.

Yes, that would be a much better solution, definitely. Something like
'watch seq-scan nodes and if the number of matching rows is much smaller
than the total number, put a HINT into log.'

I was thinking about doing something like this for correlated columns
(when the number of matching rows is severely underestimated). But that's
far in the future.

Tomas



Re: Can the query planner create indexes?

От
Jeremy Harris
Дата:
On 2010-12-21 14:26, tv@fuzzy.cz wrote:
>> Why not auto-create indices for some limited period after database load
>> (copy?  any large number of inserts from a single connection?), track
>> those
>> that actually get re-used and remove the rest?   Would this not provide
>> a better out-of-the-box experience for neophytes?
>
> Say you have a table with several columns (A,B,C), and the query is using
> some of them. What indexes would you create? One index on every column? A
> multi-column index on all columns? Indexs for each combination of columns?

One single-column index, on the first index-worthy column appearing.
Keep it simple.    Maybe, while you're doing that full-table-scan. gather
stats on all the indexable columns for later reference, to guide choice of
which column to index later.

> There really is no automatic way to solve this puzzle using a single
> query. Indexing strategy is a very tough design discipline, and it
> requires a complex knowledge of the workload. One slow query does not mean
> the index should be created - what if that was just an ad-hoc query and
> will not be executed ever again?

Then the index you just built gets automatically dropped, as I said above.

> I really don't want to work with products that try to be smarter than me
> (and I admit I'm pretty dumb from time to time) and act rather randomly
> due to this 'artificial intelligence'. I've already been there and I don't
> want to repeat this experience.

Then, since you're not a neophyte, leave the feature turned off.   But don't
deny the possibility of using it to actual neophytes.


>>> as in b): Creating an index is quite expensiv
>>
>> How much more so than doing that full-table-scan plus sort, which your
>> query is doing anyway?
>
> A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
> maintenance_work_mem) etc. So imagine a few users, building indices on a
> big table simultaneously

Pffthht.   One simple trylock, used only by the auto-indexer.

>
> Building an index is just one side of the problem - maintenance of the
> indexes is another thing. Each index has an impact on write operations
> (INSERT/UPDATE) and may cause that HOT actually does not work (thus
> causing unnecessary bloat of the index).

This is worth discussing.    How much help does the DBMS currently give
the DBA in evaluating these tradeoffs?    Could we do better, given an
actual installation and workload?

> And the indexes may need lot of
> space on a disk.

By all means require limits as well as a "don't do that" switch.

>
> But the real show stopper is probably locking. Building an index takes a
> write lock on a table, effectively blocking writes. Sure, you can use a
> 'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
> disadvantages of that (see
> http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).

The only disadvantages I see there are a) the leftover "invalid" index - which feels like a
bug; why is it not auto-dropped?   and b) the second scan, which implies more total work
and a wish to background that portion after completing the query triggering
the auto-index.

Don't forget I suggested doing this only for a limited time after DB creation.  Possibly another
reason to turn it off should be any manual index creation, as that implies that the DBA
knows about indexing.

I see in another thread you suggest merely placing hints in the log.   That's a fine
first step - but I'll then be wanting to auto-parse that log to auto-create....

Cheers,
     Jeremy


Re: Can the query planner create indexes?

От
bricklen
Дата:
On Tue, Dec 21, 2010 at 7:34 AM, Jeremy Harris <jgh@wizmail.org> wrote:
> On 2010-12-21 14:26, tv@fuzzy.cz wrote:
>>>
>>> Why not auto-create indices for some limited period after database load
>>> (copy?  any large number of inserts from a single connection?), track
>>> those
>>> that actually get re-used and remove the rest?   Would this not provide
>>> a better out-of-the-box experience for neophytes?
>>
>> Say you have a table with several columns (A,B,C), and the query is using
>> some of them. What indexes would you create? One index on every column? A
>> multi-column index on all columns? Indexs for each combination of columns?
>
> One single-column index, on the first index-worthy column appearing.
> Keep it simple.    Maybe, while you're doing that full-table-scan. gather
> stats on all the indexable columns for later reference, to guide choice of
> which column to index later.
>
>> There really is no automatic way to solve this puzzle using a single
>> query. Indexing strategy is a very tough design discipline, and it
>> requires a complex knowledge of the workload. One slow query does not mean
>> the index should be created - what if that was just an ad-hoc query and
>> will not be executed ever again?
>
> Then the index you just built gets automatically dropped, as I said above.
>
>> I really don't want to work with products that try to be smarter than me
>> (and I admit I'm pretty dumb from time to time) and act rather randomly
>> due to this 'artificial intelligence'. I've already been there and I don't
>> want to repeat this experience.
>
> Then, since you're not a neophyte, leave the feature turned off.   But don't
> deny the possibility of using it to actual neophytes.
>
>
>>>> as in b): Creating an index is quite expensiv
>>>
>>> How much more so than doing that full-table-scan plus sort, which your
>>> query is doing anyway?
>>
>> A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
>> maintenance_work_mem) etc. So imagine a few users, building indices on a
>> big table simultaneously
>
> Pffthht.   One simple trylock, used only by the auto-indexer.
>
>>
>> Building an index is just one side of the problem - maintenance of the
>> indexes is another thing. Each index has an impact on write operations
>> (INSERT/UPDATE) and may cause that HOT actually does not work (thus
>> causing unnecessary bloat of the index).
>
> This is worth discussing.    How much help does the DBMS currently give
> the DBA in evaluating these tradeoffs?    Could we do better, given an
> actual installation and workload?
>
>> And the indexes may need lot of
>> space on a disk.
>
> By all means require limits as well as a "don't do that" switch.
>
>>
>> But the real show stopper is probably locking. Building an index takes a
>> write lock on a table, effectively blocking writes. Sure, you can use a
>> 'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
>> disadvantages of that (see
>>
>> http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).
>
> The only disadvantages I see there are a) the leftover "invalid" index -
> which feels like a
> bug; why is it not auto-dropped?   and b) the second scan, which implies
> more total work
> and a wish to background that portion after completing the query triggering
> the auto-index.
>
> Don't forget I suggested doing this only for a limited time after DB
> creation.  Possibly another
> reason to turn it off should be any manual index creation, as that implies
> that the DBA
> knows about indexing.
>
> I see in another thread you suggest merely placing hints in the log.
> That's a fine
> first step - but I'll then be wanting to auto-parse that log to
> auto-create....
>
> Cheers,
>    Jeremy
>

This thread offers up some interesting possibilities. Expanding on
what has already been discussed, maybe a contrib module for query
tuning/index suggestions?
Some things that came to mind immediately that the module could do
(feasible or not):
- Look at the EXPLAIN ANALYZE
- Examine the index access methods and table scans + costs/rows
- Which indexes were used?
- What were the blocks/tuples hit & read?
- Look at join conditions and WHERE clause filters
- Data types in the joins (mismatched?)
- Churn rate of the tables, eg. the updates/deletes/inserts. This
might allow suggestion of other index types (eg. gist)

Tool then provides feedback on possibly helpful indexes to test, and
why (hypothetical indexes could be applied here). Possibly provided
suggestions on ways to improve the query, eg. data types don't match
in the join, EXISTS vs IN, etc

Re: Can the query planner create indexes?

От
Tomas Vondra
Дата:
Dne 21.12.2010 16:34, Jeremy Harris napsal(a):
>> There really is no automatic way to solve this puzzle using a single
>> query. Indexing strategy is a very tough design discipline, and it
>> requires a complex knowledge of the workload. One slow query does not
>> mean
>> the index should be created - what if that was just an ad-hoc query and
>> will not be executed ever again?
>
> Then the index you just built gets automatically dropped, as I said above.

I'm a bit confused. Should the indexes be dropped automatically (as you
state here) or kept for the future. Because if they should be dropped,
then it does not make sense to do this magic just for a limited time
after the DB goes live.

>> I really don't want to work with products that try to be smarter than me
>> (and I admit I'm pretty dumb from time to time) and act rather randomly
>> due to this 'artificial intelligence'. I've already been there and I
>> don't
>> want to repeat this experience.
>
> Then, since you're not a neophyte, leave the feature turned off. But
> don't deny the possibility of using it to actual neophytes.

This is not a question of whether I am a neophyte or not, this is a
question of good practices. My experience is that building indexes
should be done when designing the application, and tested at in a test
environment. At production, log slow queries, analyze the log and add
indexes when needed.

This 'neophyte approach' is a really bad idea from my point of view. It
inevitably leads to a database with a zillion of unnecessary indexes,
missing the right ones. And any limits don't prevent this. In the end
you'll get a slow database, and the neophytes would blame the database
although that state is inevitable.

I've been working with several database products over the years, and
AFAIK none of them does this. In most cases "the whole industry is
wrong" is usually a really bad sign (does not hold if you're Steve Jobs).

Most of them do have 'advisors' though - that's a good idea, and I think
sooner or later that will be in PostgreSQL too.

>>>> as in b): Creating an index is quite expensiv
>>>
>>> How much more so than doing that full-table-scan plus sort, which your
>>> query is doing anyway?
>>
>> A lot. It consumes a lot of CPU, it may consume a lot of memory (up to
>> maintenance_work_mem) etc. So imagine a few users, building indices on a
>> big table simultaneously
>
> Pffthht.   One simple trylock, used only by the auto-indexer.

Pffthht? Anyway trylock is not an answer. Think about a different table
for each user.

>> Building an index is just one side of the problem - maintenance of the
>> indexes is another thing. Each index has an impact on write operations
>> (INSERT/UPDATE) and may cause that HOT actually does not work (thus
>> causing unnecessary bloat of the index).
>
> This is worth discussing.    How much help does the DBMS currently give
> the DBA in evaluating these tradeoffs?    Could we do better, given an
> actual installation and workload?

The database gives you explain plans, plus pg_stat_* and pg_statio_*
views. That's a lot of data, although does not provide a complete view
in many cases. And then there's a log_min_duration, which is the primary
weapon in fighting slow queries after going live.

>> And the indexes may need lot of
>> space on a disk.
>
> By all means require limits as well as a "don't do that" switch.

As I said, this is not a good approach from my POV. And it's really
really hard to implement this in a way that those who don't want to use
it don't have to pay the price. Because what portion of users would
really use this feature? 0.0001%?

>> But the real show stopper is probably locking. Building an index takes a
>> write lock on a table, effectively blocking writes. Sure, you can use a
>> 'CREATE INDEX ... CONCURRENTLY' introduced in 8.2, but there are some
>> disadvantages of that (see
>> http://www.postgresql.org/docs/9.0/static/sql-createindex.html#SQL-CREATEINDEX-CONCURRENTLY).
>>
>
> The only disadvantages I see there are a) the leftover "invalid" index -
> which feels like a
> bug; why is it not auto-dropped?   and b) the second scan, which implies
> more total work
> and a wish to background that portion after completing the query triggering
> the auto-index.

Dropping an index automatically is a really bad idea. Actually dropping
indexes in general (no matter if the index was created automatically or
manually) is a very tricky thing.

How do you decide which indexes are really "leftover" from those that
are used only rarely? I've seen application failing spectacularily
because a DBA dropped an index that was not used ... except for a batch
process that runs once a year, to close a fiscal year.

> Don't forget I suggested doing this only for a limited time after DB
> creation.  Possibly another
> reason to turn it off should be any manual index creation, as that
> implies that the DBA
> knows about indexing.
>
> I see in another thread you suggest merely placing hints in the log.
> That's a fine
> first step - but I'll then be wanting to auto-parse that log to
> auto-create....

Well, I don't think that is going to happen. I'm not going to implement
this (I've explained my opinions on this above), and I don't think this
would get into core anyway.

Logging? Probably. Advisors? Maybe, although as a contrib module.
Creating indexes automatically? I doubt that.

regards
Tomas

Re: Can the query planner create indexes?

От
Jeremy Harris
Дата:
On 2010-12-21 18:50, Tomas Vondra wrote:
>> Then the index you just built gets automatically dropped, as I said above.
>
> I'm a bit confused. Should the indexes be dropped automatically (as you
> state here) or kept for the future. Because if they should be dropped,
> then it does not make sense to do this magic just for a limited time
> after the DB goes live.

Here what I said:
"track those that actually get re-used and remove the rest".

Which part is confusing?

- Jeremy

Re: Can the query planner create indexes?

От
Tomas Vondra
Дата:
Dne 21.12.2010 20:03, Jeremy Harris napsal(a):
> On 2010-12-21 18:50, Tomas Vondra wrote:
>>> Then the index you just built gets automatically dropped, as I said
>>> above.
>>
>> I'm a bit confused. Should the indexes be dropped automatically (as you
>> state here) or kept for the future. Because if they should be dropped,
>> then it does not make sense to do this magic just for a limited time
>> after the DB goes live.
>
> Here what I said:
> "track those that actually get re-used and remove the rest".
>
> Which part is confusing?

As I described, identifying which indexes are actually used is a very
tricky task. And it's not difficult to come up with scenarios where this
causes significantly more harm than good.

Basically the time to keep the indices needs to be long enough that the
indexes that are actually used are not dropped (and the resources spent
creating them actually pays off). But on the other side it needs to be
short so that resources are not wasted because of unused indices.

Which are clearly contradictory requirements.

And the 'limits' you've proposed make that even worse, because when the
unnecessary indices get created first and take most of the resources
(e.g. disk space), then the indexes that are actually needed won't be
created because of those limits.

regards
Tomas