Discussion: seq vs index scan in join query
hi all,

we're in the process of optimizing some queries and we've noted a case where the planner prefers a sequential scan instead of using an index, while the index scan is actually much faster. to give you some context: we have two main tables, keywords and results. keywords has approximately 700,000 rows, while results holds approximately one row per keyword per day (roughly 70m at the moment; not all keywords are active on any given day). results is currently partitioned by (creation) time. it's also worth noting that we use SSDs in our servers, and have random_page_cost set to 1.

the problematic query looks like this:

SELECT keywords.strategy_id, results.position, results.created_at FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
AND results.created_at >= '2017-10-25 00:00:00.000000'
AND results.created_at <= '2017-11-10 23:59:59.999999';

as you can see in the query plan [1], a sequential scan is preferred. as we understand it, this happens because the number of rows returned from results is too large. if we reduce this number by either selecting a smaller created_at range, or another account_id with fewer keywords, the planner falls back to an index scan, confirming that the number of rows returned from results has a direct influence on this choice.

on the other hand, if we disable sequential scans (SET enable_seqscan = 0), we see that not only does the query run faster but the cost seems to be lower, as seen in the query plan [2].

in this example the gain isn't much: ~0.5s. but when we add a second join table with additional keyword data, the planner still prefers a sequential scan on a table that has 6m+ rows.
query looks like this:

SELECT keywords.strategy_id, results.position, results.created_at, keyword_data.volume FROM results
JOIN keywords ON results.keyword_id = keywords.id
JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
WHERE results.account_id = 1
AND results.created_at >= '2017-10-25 00:00:00.000000'
AND results.created_at <= '2017-11-19 23:59:59.999999';

in this case the query takes up to 8s; the query plan can be found in [3]. obviously the dataset has to be large for the planner to prefer a sequential scan on a 6m-row table. similarly, reducing the created_at range or using an account_id with fewer keywords makes the planner prefer an index scan, accelerating the query considerably.

currently we're exploring the option of fetching keyword data within a subquery and feeding that into the main query, which works as expected, but also complicates the design a bit.

we'd like to know:

1. why does the planner prefer a sequential scan in these cases?
2. is there a way we can make the planner choose a better strategy using indexes?

thank you for your time.

[1] seq scan plan: https://gist.github.com/emnlvrz/5e53235c82260be011d84cf264e597e7
[2] indexed plan: https://gist.github.com/emnlvrz/8aa85edbdedcdb90d8d4f38863abc134
[3] seq scan additional join plan: https://gist.github.com/emnlvrz/b3f13518f863f829c65f91a514f407d9
Hi,

Could you give us the partition ranges and index definitions for the results table?

Regards
PAscal

--
Sent from: http://www.postgresql-archive.org/PostgreSQL-general-f1843780.html
Hi

On Wed, Nov 29, 2017 at 8:55 AM, Emanuel Alvarez <ema@abductedcow.com.ar> wrote:
> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see that not only does the query run faster but the cost seems
> to be lower, as seen in the query plan [2].

True, the cost of the scan itself is lower, but together with the hash join/nested loop, the total cost of plan [2] is higher.

This is a wild guess but...

-> Index Scan using keywords_pkey on keywords
   Buffers: shared hit=284808 read=4093
vs
-> Seq Scan on keywords
   Buffers: shared read=36075

Looks like the index scan's advantage in this example is a much higher cache hit ratio (despite touching so many more pages) and PostgreSQL is underestimating it.

Have you tuned the effective_cache_size setting? A good starting point is half the total RAM in your machine. It would be interesting to see how high you need to set it for the planner to switch to the index scan plan.

Regards,
Marti Raudsepp
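As a session-level sketch of that experiment (the 16GB value is purely illustrative; note that effective_cache_size is only a planner hint and allocates no memory):

```sql
SHOW effective_cache_size;          -- check the current value first

SET effective_cache_size = '16GB';  -- illustrative; try ~50% of total RAM

EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';

RESET effective_cache_size;
```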
Emanuel Alvarez wrote:
> the problematic query looks like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at FROM results
> JOIN keywords ON results.keyword_id = keywords.id
> WHERE results.account_id = 1
> AND results.created_at >= '2017-10-25 00:00:00.000000'
> AND results.created_at <= '2017-11-10 23:59:59.999999';
>
> as you can see in the query plan [1] a sequential scan is preferred.
> as we understand it, this happens because the number of rows returned
> from results is too large. if we reduce this number by either
> selecting a smaller created_at range, or another account_id with fewer
> keywords, the planner falls back to an index scan, confirming that the
> number of rows returned from results has a direct influence on this
> choice.
>
> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see that not only does the query run faster but the cost seems
> to be lower, as seen in the query plan [2].

The optimizer is right here.

Even though your second execution without sequential scans ran faster, it is worse.

That is because the execution with the sequential scan touched 26492 + 80492 = 106984 blocks, while the second execution touched 311301 + 48510 = 359811 blocks, more than three times as many.

The second execution was just lucky because most of these blocks were already cached, and it had to read only half as many blocks from disk.

If you repeat the execution a couple of times, you should see that the execution using the sequential scans becomes faster.

You can boost performance even more by increasing work_mem so that the hash can be created in memory.

Yours,
Laurenz Albe
On 2017-11-29 18:17:18 +0100, Laurenz Albe wrote:
> That is because the execution with the sequential scan touched
> 26492 + 80492 = 106984 blocks, while the second execution touched
> 311301 + 48510 = 359811 blocks, more than three times as many.

That's not necessarily so. What those count are buffer *accesses*, *not* the number of distinct blocks accessed. You'll very commonly have more buffer accesses in index scans but still fewer total reads, because a lot of those accesses will be re-reads of blocks already touched earlier in the same scan. Just imagine a scan of an index with a leaf page pointing to 100 tuples of the same value - that'd result in at least 100 buffer accesses, but it'd be highly likely that they'll be in cache.

Greetings,

Andres Freund
Andres Freund wrote:
> On 2017-11-29 18:17:18 +0100, Laurenz Albe wrote:
> > That is because the execution with the sequential scan touched
> > 26492 + 80492 = 106984 blocks, while the second execution touched
> > 311301 + 48510 = 359811 blocks, more than three times as many.
>
> That's not necessarily so. What those count are buffer *accesses*,
> *not* the number of distinct blocks accessed. You'll very commonly have
> more buffer accesses in indexscans but still fewer total reads because a
> lot of those accesses will be reads previously done in the same
> scan. Just imagine a scan of an index with a leaf page pointing to 100
> tuples of the same value - that'd result in at least a 100 buffer
> accesses, but it'd be highly likely that they'll be in cache.

Thanks for explaining that.

Yours,
Laurenz Albe
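A toy session that makes the distinction visible (the table t and its contents are made up purely for illustration):

```sql
-- 100 tuples with the same indexed value, all living on a single heap page.
CREATE TEMP TABLE t AS
SELECT 1 AS v, g AS payload FROM generate_series(1, 100) AS g;
CREATE INDEX ON t (v);
ANALYZE t;

SET enable_seqscan = off;  -- force the index scan for the demonstration
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t WHERE v = 1;
RESET enable_seqscan;

-- "Buffers: shared hit=N" counts buffer accesses, not distinct pages:
-- the same heap page is counted once per tuple fetched via the index,
-- so N can far exceed the number of pages actually read from disk.
```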
Thank you all for your responses!

On Wed, Nov 29, 2017 at 7:31 AM, legrand legrand <legrand_legrand@hotmail.com> wrote:
> Could you give us the partitions (ranges values) and indexes definition for
> result table ?

We partition by month; partitions usually start on the 20th of each month (this was the date we first partitioned the table). For the tables in question, the constraints look like this:

"constraint_37" CHECK (created_at >= '2017-10-21 00:00:00'::timestamp without time zone AND created_at < '2017-11-20 00:00:00'::timestamp without time zone)
"constraint_110" CHECK (created_at >= '2017-11-20 00:00:00'::timestamp without time zone AND created_at < '2017-12-20 00:00:00'::timestamp without time zone)

Indexes are:

"results_account_id_created_at_idx" btree (account_id, created_at DESC)
"results_id_idx" btree (id)
"results_keyword_id_created_at_idx" btree (keyword_id, created_at DESC)

On Wed, Nov 29, 2017 at 11:57 AM, Marti Raudsepp <marti@juffo.org> wrote:
> -> Index Scan using keywords_pkey on keywords
>    Buffers: shared hit=284808 read=4093
> vs
> -> Seq Scan on keywords
>    Buffers: shared read=36075
>
> Looks like the index scan's advantage in this example is a much higher
> cache hit ratio (despite touching so many more pages) and PostgreSQL
> is underestimating it.

Interesting, we were completely missing this.

> Have you tuned the effective_cache_size setting? A good starting point
> is half the total RAM in your machine. It would be interesting to see
> how high you need to set it for the planner to switch to the index
> scan plan.

We did this, and although it has an effect, it's rapidly shadowed by the result set size. For example, setting it to 10GB is OK for one month's worth of data, but the planner falls back to a seq scan for two months' data. Although we might be able to live with that for now.

On Wed, Nov 29, 2017 at 2:17 PM, Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> The optimizer is right here.
Never doubted it :)

> Even though your second execution without sequential scans ran faster,
> it is worse.
>
> That is because the execution with the sequential scan touched
> 26492 + 80492 = 106984 blocks, while the second execution touched
> 311301 + 48510 = 359811 blocks, more than three times as many.
>
> The second execution was just lucky because most of these blocks were
> already cached, and it had to read only half as many blocks from disk.
>
> If you repeat the execution a couple of times, you should see that
> the execution using the sequential scans becomes faster.

In a live environment, execution times for sequential scans are always slower, though they do vary. The reason for this is that the partition holding last month's results is accessed frequently, and as such is kept in the cache; while the other two tables, keywords and keyword_data, are accessed sparsely, and mostly through their indexes. This way, the indexes have a much better chance of being cached in memory than other parts of those tables. In other words: the second execution is always lucky. Is this interpretation correct?

Is there any option to deal with this issue, besides adding more RAM? We could, theoretically, partition keyword_data as it's also time-based, but it's not big enough to justify partitioning. It's also not small enough to be doing sequential scans on it all the time.

> You can boost performance even more by increasing work_mem
> so that the hash can be created in memory.

This is interesting, and has a positive effect on our queries. We are currently testing a combination of work_mem and effective_cache_size settings, though I'm afraid refactoring the query will be inevitable.

Thank you!
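As a session-level sketch, the work_mem experiment looks roughly like this (the 256MB value is illustrative, not a recommendation):

```sql
-- Raise work_mem for this session only, so the ~700k-row hash table can be
-- built in memory instead of being split into batches on disk.
SET work_mem = '256MB';  -- illustrative value

EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';

-- In the plan output, "Batches: 1" under the Hash node means the whole
-- hash table fit in memory.

RESET work_mem;
```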
On Tue, Nov 28, 2017 at 10:55 PM, Emanuel Alvarez
wrote:
> hi all,
>
> we're in the process of optimizing some queries and we've noted a case
> where the planner prefers a sequential scan instead of using an index,
> while the index scan is actually much faster. to give you some
> context: we have two main tables, keywords and results. keywords has
> approximately 700.000 rows; while results holds approximately one row
> per keyword per day (roughly 70m at the moment, not all keywords are
> active at any given day). results is currently partitioned by
> (creation) time. it's also worth noting that we use SSDs in our
> servers, and have random_page_cost set to 1.
>
>
> the problematic query looks like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at FROM
> results
> JOIN keywords ON results.keyword_id = keywords.id
> WHERE results.account_id = 1
> AND results.created_at >= '2017-10-25 00:00:00.000000'
> AND results.created_at <= '2017-11-10 23:59:59.999999';
>
>
> as you can see in the query plan [1] a sequential scan is preferred.
>
I would say the preference is not for the seq scan, but rather for the hash
join. If the seq scan couldn't be fed into a hash join, it would not look
very favorable.
I think hash joins are a bit optimistic on how much cpu time they think
they use building the hash table. You can probably get better plans for
this type of query by increasing cpu_tuple_cost to 0.02 or 0.03. That
works because the hash join over the seq scan has to scan 700,000 tuples to
build the hash table, which is then probed only 70,000 times, while the
nested loop index scan just probes the 70,000 rows it needs directly and
ignores the other 90%.
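A session-level sketch of that experiment (0.01 is the default; the query is the simpler one from the thread):

```sql
SHOW cpu_tuple_cost;        -- 0.01 by default

SET cpu_tuple_cost = 0.02;  -- make per-tuple CPU work, like hash builds, look costlier

EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at
FROM results
JOIN keywords ON results.keyword_id = keywords.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-10 23:59:59.999999';

RESET cpu_tuple_cost;
```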
...
>
> on the other hand, if we disable sequential scans (SET enable_seqscan
> = 0), we see that not only does the query run faster but the cost seems to
> be lower, as seen in the query plan [2].
>
The costs for plan 2 don't look lower to me: 196754.90 > 120421.32
>
> in this example the gain it's not much: ~0.5s. but when we add a
> second join table with additional keyword data the planner still
> prefers a sequential scan on a table that has +6m rows. query looks
> like this:
>
> SELECT keywords.strategy_id, results.position, results.created_at,
> keyword_data.volume FROM results
> JOIN keywords ON results.keyword_id = keywords.id
> JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
> WHERE results.account_id = 1
> AND results.created_at >= '2017-10-25 00:00:00.000000'
> AND results.created_at <= '2017-11-19 23:59:59.999999';
>
>
> in this case query takes up to 8s, query plan can be found in [3].
> obviously dataset has to be large to prefer a sequential on a 6m rows
> table. similarly, reducing the created_at range or using an account_id
> with fewer keywords makes the planner prefer index scan, accelerating
> the query considerably.
>
If that is the query you are really concerned about, we would have to see
the faster plan for that query. (Or better yet, keep the created_at range
the same, and set enable_hashjoin to off to get it to switch plans).
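Concretely, something like this sketch (the enable_* switches are debugging aids for comparing plans, not production settings):

```sql
SET enable_hashjoin = off;  -- for plan comparison only

EXPLAIN (ANALYZE, BUFFERS)
SELECT keywords.strategy_id, results.position, results.created_at,
       keyword_data.volume
FROM results
JOIN keywords ON results.keyword_id = keywords.id
JOIN keyword_data ON keywords.keyword_data_id = keyword_data.id
WHERE results.account_id = 1
  AND results.created_at >= '2017-10-25 00:00:00.000000'
  AND results.created_at <= '2017-11-19 23:59:59.999999';

RESET enable_hashjoin;
```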
This looks like a very skewed query. keyword_data has 10 rows for every
row in keywords, yet adding a join to keyword_data doesn't increase the
number of rows returned by the query at all. That is weird, isn't it?
For what it's worth, in my hands on your simpler query it likes to sort the
70,000 qualifying rows from the "results" table, then do a merge join against
the index on keywords. And it truly is the faster option. I have to set
enable_mergejoin=off before I can get either of your plans. Once I do, the
nested loop does seem to be faster than the hash join, but not by the twofold
that you see, and they jump around quite a bit from run to run.
Cheers,
Jeff