Обсуждение: optimizing a query

Поиск
Список
Период
Сортировка

optimizing a query

От
Jonathan Vanasco
Дата:
I have a handful of queries in the following general form that I can't seem to optimize any further (same results on
9.3,9.4, 9.5) 

I'm wondering if anyone might have a suggestion, or if they're done.

The relevant table structure:

    t_a2b
        a_id INT references t_a(id)
        b_id INT references t_b(id)
        col_a

    t_a
        id INT
        col_1 INT
        col_2 BOOL

The selects query the association table (t_a2b) and join in a related table (t_a) for some filtering.

In effort of simplifying the work, I've created indexes on t_a that have all the related columns.

    CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
        CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;

postgres will query test_idx__a first (yay!) but then does a bitmap heap scan on t_a, and uses the raw t_a for the hash
join.  

I don't actually need any information from t_a - it's just there for the filtering, and ideally postgres would just use
theindex. 

I thought this might have been from using a partial index, but the same results happen with a full index.  I just can't
seemto avoid this hash join against the full table. 

anyone have a suggestion?


example query

    SELECT t_a2b.b_id AS b_id,
           count(t_a2b.b_id) AS counted
    FROM t_a2b
    WHERE
          t_a2b.col_a = 1
          AND
          t_a.col_1 = 730
          AND
          t_a.col_2 IS NOT False
    GROUP BY t_a2b.b_id
    ORDER BY     counted DESC,
                 t_a2b.b_id ASC

                                   



Re: optimizing a query

От
Adrian Klaver
Дата:
On 06/21/2016 03:33 PM, Jonathan Vanasco wrote:
> I have a handful of queries in the following general form that I can't seem to optimize any further (same results on
9.3,9.4, 9.5) 
>
> I'm wondering if anyone might have a suggestion, or if they're done.
>
> The relevant table structure:
>
>     t_a2b
>         a_id INT references t_a(id)
>         b_id INT references t_b(id)
>         col_a
>
>     t_a
>         id INT
>         col_1 INT
>         col_2 BOOL
>
> The selects query the association table (t_a2b) and join in a related table (t_a) for some filtering.
>
> In effort of simplifying the work, I've created indexes on t_a that have all the related columns.
>
>     CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
>         CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
>
> postgres will query test_idx__a first (yay!) but then does a bitmap heap scan on t_a, and uses the raw t_a for the
hashjoin. 
>
> I don't actually need any information from t_a - it's just there for the filtering, and ideally postgres would just
usethe index. 
>
> I thought this might have been from using a partial index, but the same results happen with a full index.  I just
can'tseem to avoid this hash join against the full table. 
>
> anyone have a suggestion?
>

The below works without including t_a in the FROM?

>
> example query
>
>     SELECT t_a2b.b_id AS b_id,
>            count(t_a2b.b_id) AS counted
>     FROM t_a2b
>     WHERE
>           t_a2b.col_a = 1
>           AND
>           t_a.col_1 = 730
>           AND
>           t_a.col_2 IS NOT False
>     GROUP BY t_a2b.b_id
>     ORDER BY     counted DESC,
>                  t_a2b.b_id ASC
>
>
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: optimizing a query

От
"David G. Johnston"
Дата:
On Tue, Jun 21, 2016 at 6:44 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 06/21/2016 03:33 PM, Jonathan Vanasco wrote:


In effort of simplifying the work, I've created indexes on t_a that have all the related columns.

        CREATE INDEX test_idx ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;
        CREATE INDEX test_idx__a ON t_a(col_1, id) WHERE col_2 IS NOT FALSE;

​Aside from the name these indexes are identical...​


postgres will query test_idx__a first (yay!) but then does a bitmap heap scan on t_a, and uses the raw t_a for the hash join.

I don't actually need any information from t_a - it's just there for the filtering, and ideally postgres would just use the index.

​This is the description of a semi-join.

WHERE EXISTS (SELECT 1 FROM t_a WHERE t_a.id = ​t_a2b.a_id AND  t_a.col_1 = 730 AND t_a.col_2 IS NOT FALSE)


I thought this might have been from using a partial index, but the same results happen with a full index.  I just can't seem to avoid this hash join against the full table.

anyone have a suggestion?


The below works without including t_a in the FROM?


example query

        SELECT t_a2b.b_id AS b_id,
                   count(t_a2b.b_id) AS counted
        FROM t_a2b
        WHERE
                  t_a2b.col_a = 1
                  AND
                  t_a.col_1 = 730
                  AND
                  t_a.col_2 IS NOT False
        GROUP BY t_a2b.b_id
        ORDER BY        counted DESC,
                                t_a2b.b_id ASC


​These two items combined reduce the desirability of diagnosing this...it doesn't seem like you've faithfully recreated the scenario for us to evaluate.

Your post is also not self-contained and you haven't provided the actual EXPLAINs you are getting.

David J.

Re: optimizing a query

От
Jonathan Vanasco
Дата:

On Jun 21, 2016, at 6:55 PM, David G. Johnston wrote:

​Aside from the name these indexes are identical...​

sorry. tired eyes copy/pasting between windows and trying to 'average' out 40 similar queries.

​These two items combined reduce the desirability of diagnosing this...it doesn't seem like you've faithfully recreated the scenario for us to evaluate.

Your post is also not self-contained and you haven't provided the actual EXPLAINs you are getting.

I played around with some more indexes, creating and disabling them on one specific query
Eventually i found some index formats that didn't pull in the whole table.
They gave approximately the same results as the other selects, with some differences in reporting.  the heap scan on the table was negligible.  the big hit was off the outer hash join.
the formatting in explain made a negligible check look like it was the root issue


CREATE TABLE t_a (id SERIAL PRIMARY KEY,
 col_1 INT NOT NULL,
 col_2 BOOLEAN DEFAULT NULL
 );
CREATE INDEX test_idx__t_a_col1_col2__v1 on t_a (col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v2 on t_a (col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v3 on t_a (id, col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v4 on t_a (id, col_1, col_2) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v5 on t_a (col_1, col_2) WHERE col_2 IS NOT FALSE;

CREATE TABLE t_b (id SERIAL PRIMARY KEY,
 col_1 INT NOT NULL,
 col_2 BOOLEAN DEFAULT NULL
 );
CREATE TABLE t_a2b (a_id INT NOT NULL REFERENCES t_a(id),
   b_id INT NOT NULL REFERENCES t_b(id),
   col_a INT NOT NULL,
   PRIMARY KEY (a_id, b_id)
   );

EXPLAIN ANALYZE
SELECT t_a2b.b_id AS t_a2b_b_id,
  count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a.col_1 = 730
 AND t_a2b.col_a = 1
 AND (t_a.col_2 IS NOT False)
GROUP BY t_a2b.b_id
ORDER BY counted DESC,
t_a2b.b_id ASC
LIMIT 25
OFFSET 0
;


                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=270851.55..270851.62 rows=25 width=4) (actual time=1259.950..1259.953 rows=25 loops=1)
   ->  Sort  (cost=270851.55..270863.43 rows=4750 width=4) (actual time=1259.945..1259.945 rows=25 loops=1)
         Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=270670.01..270717.51 rows=4750 width=4) (actual time=1259.430..1259.769 rows=1231 loops=1)
               Group Key: t_a2b.b_id
               ->  Hash Join  (cost=171148.45..270516.71 rows=30660 width=4) (actual time=107.662..1230.481 rows=124871 loops=1)
                     Hash Cond: (t_a2b.a_id = t_a.id)
                     ->  Seq Scan on t_a2b  (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.011..661.978 rows=2492783 loops=1)
                           Filter: (col_a = 1)
                           Rows Removed by Filter: 2260712
                     ->  Hash  (cost=170446.87..170446.87 rows=56126 width=4) (actual time=107.409..107.409 rows=48909 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 2232kB
                           ->  Bitmap Heap Scan on t_a  (cost=1055.41..170446.87 rows=56126 width=4) (actual time=18.243..94.470 rows=48909 loops=1)
                                 Recheck Cond: ((col_1 = 730) AND (col_2 IS NOT FALSE))
                                 Heap Blocks: exact=43972
                                 ->  Bitmap Index Scan on test_idx__t_a_col1_col2__v2  (cost=0.00..1041.38 rows=56126 width=0) (actual time=8.661..8.661 rows=48909 loops=1)
                                       Index Cond: (col_1 = 730)
 Planning time: 0.796 ms
 Execution time: 1260.092 ms

                                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=208239.59..208239.65 rows=25 width=4) (actual time=1337.739..1337.743 rows=25 loops=1)
   ->  Sort  (cost=208239.59..208251.47 rows=4750 width=4) (actual time=1337.737..1337.739 rows=25 loops=1)
         Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=208058.05..208105.55 rows=4750 width=4) (actual time=1337.183..1337.556 rows=1231 loops=1)
               Group Key: t_a2b.b_id
               ->  Hash Join  (cost=108628.33..207935.37 rows=24537 width=4) (actual time=173.116..1306.910 rows=124871 loops=1)
                     Hash Cond: (t_a2b.a_id = t_a.id)
                     ->  Seq Scan on t_a2b  (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.010..669.616 rows=2492783 loops=1)
                           Filter: (col_a = 1)
                           Rows Removed by Filter: 2260712
                     ->  Hash  (cost=108066.87..108066.87 rows=44917 width=4) (actual time=172.884..172.884 rows=48909 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 2232kB
                           ->  Index Only Scan using test_idx__t_a_col1_col2__v4 on t_a  (cost=0.43..108066.87 rows=44917 width=4) (actual time=0.031..160.088 rows=48909 loops=1)
                                 Index Cond: (col_1 = 730)
                                 Heap Fetches: 2426
 Planning time: 0.769 ms
 Execution time: 1337.861 ms

Re: optimizing a query

От
Erik Gustafson
Дата:
Hi,

don't you want an index on t_a2b.col_a, maybe partial where col_a=1 ?



On Wed, Jun 22, 2016 at 2:07 AM, Jonathan Vanasco <postgres@2xlp.com> wrote:

On Jun 21, 2016, at 6:55 PM, David G. Johnston wrote:

​Aside from the name these indexes are identical...​

sorry. tired eyes copy/pasting between windows and trying to 'average' out 40 similar queries.

​These two items combined reduce the desirability of diagnosing this...it doesn't seem like you've faithfully recreated the scenario for us to evaluate.

Your post is also not self-contained and you haven't provided the actual EXPLAINs you are getting.

I played around with some more indexes, creating and disabling them on one specific query
Eventually i found some index formats that didn't pull in the whole table.
They gave approximately the same results as the other selects, with some differences in reporting.  the heap scan on the table was negligible.  the big hit was off the outer hash join.
the formatting in explain made a negligible check look like it was the root issue


CREATE TABLE t_a (id SERIAL PRIMARY KEY,
 col_1 INT NOT NULL,
 col_2 BOOLEAN DEFAULT NULL
 );
CREATE INDEX test_idx__t_a_col1_col2__v1 on t_a (col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v2 on t_a (col_1, id) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v3 on t_a (id, col_1) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v4 on t_a (id, col_1, col_2) WHERE col_2 IS NOT FALSE;
CREATE INDEX test_idx__t_a_col1_col2__v5 on t_a (col_1, col_2) WHERE col_2 IS NOT FALSE;

CREATE TABLE t_b (id SERIAL PRIMARY KEY,
 col_1 INT NOT NULL,
 col_2 BOOLEAN DEFAULT NULL
 );
CREATE TABLE t_a2b (a_id INT NOT NULL REFERENCES t_a(id),
   b_id INT NOT NULL REFERENCES t_b(id),
   col_a INT NOT NULL,
   PRIMARY KEY (a_id, b_id)
   );

EXPLAIN ANALYZE
SELECT t_a2b.b_id AS t_a2b_b_id,
  count(t_a2b.b_id) AS counted
FROM t_a2b
JOIN t_a ON t_a2b.a_id = t_a.id
WHERE t_a.col_1 = 730
 AND t_a2b.col_a = 1
 AND (t_a.col_2 IS NOT False)
GROUP BY t_a2b.b_id
ORDER BY counted DESC,
t_a2b.b_id ASC
LIMIT 25
OFFSET 0
;


                                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=270851.55..270851.62 rows=25 width=4) (actual time=1259.950..1259.953 rows=25 loops=1)
   ->  Sort  (cost=270851.55..270863.43 rows=4750 width=4) (actual time=1259.945..1259.945 rows=25 loops=1)
         Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=270670.01..270717.51 rows=4750 width=4) (actual time=1259.430..1259.769 rows=1231 loops=1)
               Group Key: t_a2b.b_id
               ->  Hash Join  (cost=171148.45..270516.71 rows=30660 width=4) (actual time=107.662..1230.481 rows=124871 loops=1)
                     Hash Cond: (t_a2b.a_id = t_a.id)
                     ->  Seq Scan on t_a2b  (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.011..661.978 rows=2492783 loops=1)
                           Filter: (col_a = 1)
                           Rows Removed by Filter: 2260712
                     ->  Hash  (cost=170446.87..170446.87 rows=56126 width=4) (actual time=107.409..107.409 rows=48909 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 2232kB
                           ->  Bitmap Heap Scan on t_a  (cost=1055.41..170446.87 rows=56126 width=4) (actual time=18.243..94.470 rows=48909 loops=1)
                                 Recheck Cond: ((col_1 = 730) AND (col_2 IS NOT FALSE))
                                 Heap Blocks: exact=43972
                                 ->  Bitmap Index Scan on test_idx__t_a_col1_col2__v2  (cost=0.00..1041.38 rows=56126 width=0) (actual time=8.661..8.661 rows=48909 loops=1)
                                       Index Cond: (col_1 = 730)
 Planning time: 0.796 ms
 Execution time: 1260.092 ms

                                                                                        QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=208239.59..208239.65 rows=25 width=4) (actual time=1337.739..1337.743 rows=25 loops=1)
   ->  Sort  (cost=208239.59..208251.47 rows=4750 width=4) (actual time=1337.737..1337.739 rows=25 loops=1)
         Sort Key: (count(t_a2b.b_id)) DESC, t_a2b.b_id
         Sort Method: top-N heapsort  Memory: 26kB
         ->  HashAggregate  (cost=208058.05..208105.55 rows=4750 width=4) (actual time=1337.183..1337.556 rows=1231 loops=1)
               Group Key: t_a2b.b_id
               ->  Hash Join  (cost=108628.33..207935.37 rows=24537 width=4) (actual time=173.116..1306.910 rows=124871 loops=1)
                     Hash Cond: (t_a2b.a_id = t_a.id)
                     ->  Seq Scan on t_a2b  (cost=0.00..89741.18 rows=2485464 width=8) (actual time=0.010..669.616 rows=2492783 loops=1)
                           Filter: (col_a = 1)
                           Rows Removed by Filter: 2260712
                     ->  Hash  (cost=108066.87..108066.87 rows=44917 width=4) (actual time=172.884..172.884 rows=48909 loops=1)
                           Buckets: 65536  Batches: 1  Memory Usage: 2232kB
                           ->  Index Only Scan using test_idx__t_a_col1_col2__v4 on t_a  (cost=0.43..108066.87 rows=44917 width=4) (actual time=0.031..160.088 rows=48909 loops=1)
                                 Index Cond: (col_1 = 730)
                                 Heap Fetches: 2426
 Planning time: 0.769 ms
 Execution time: 1337.861 ms


Re: optimizing a query

От
Jonathan Vanasco
Дата:
On Jun 22, 2016, at 4:25 AM, Erik Gustafson wrote:

> don't you want an index on t_a2b.col_a, maybe partial where col_a=1 ?

that table has indexes on all columns.  they're never referenced because the rows are so short.  this was just an
examplequery too, col_a has 200k variations  

After a lot of testing, I think I found a not-bug but possible area-for-improvement in the planner when joining against
atable for filtering (using my production 9.5.2 box) 

I checked a query against multiple possible indexes using the related columns.  only one of indexes was on the table
foreach series of tests, and I analyzed the table after the drop/create of indexes. 


Note 1: The only time an index-only scan is used, is on this form:

        CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;

    Omitting the col_partial from being indexed will trigger a Bitmap Heap Scan on the full table with a recheck
condition:

        CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

    This shouldn't be necessary.  the planner knew that `col_partial` fulfilled the WHERE clause when it used the
index,but scanned the table to check it anyways. 

    On most tables the heap scan was negligible, but on a few larger tables it accounted a 20% increase in execution.

Note 2:

    This is odd, but this index is used by the planner:
        CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

    but this index is never used:
        CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;

    I honestly don't know why the second index would not be used.  The query time doubled without it when run on a
tablewith 6million rows and about 20 columns. 

-------

The indexes I tested on:

    CREATE INDEX idx_fkey_1 ON table_a(fkey_1);
    CREATE INDEX idx_partial_fkey ON table_a(fkey_1) WHERE col_partial IS NOT FALSE;
    CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;
    CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;
    CREATE INDEX idx_partial_fkey_partial ON table_a(fkey_1, col_partial) WHERE col_partial IS NOT FALSE;
    CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;



Re: optimizing a query

От
"David G. Johnston"
Дата:
On Wed, Jun 22, 2016 at 2:23 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:

On Jun 22, 2016, at 4:25 AM, Erik Gustafson wrote:

> don't you want an index on t_a2b.col_a, maybe partial where col_a=1 ?

that table has indexes on all columns.  they're never referenced because the rows are so short.  this was just an example query too, col_a has 200k variations

After a lot of testing, I think I found a not-bug but possible area-for-improvement in the planner when joining against a table for filtering (using my production 9.5.2 box)

I checked a query

What query?  ​A self-contained email would be nice.​

against multiple possible indexes using the related columns.  only one of indexes was on the table for each series of tests, and I analyzed the table after the drop/create of indexes.


Note 1: The only time an index-only scan is used, is on this form:

                CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;

        Omitting the col_partial from being indexed will trigger a Bitmap Heap Scan on the full table with a recheck condition:

                CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

        This shouldn't be necessary.  the planner knew that `col_partial` fulfilled the WHERE clause when it used the index, but scanned the table to check it anyways.

        On most tables the heap scan was negligible, but on a few larger tables it accounted a 20% increase in execution.



Note especially:

​"Visibility information is not stored in index entries, only in heap entries; ..."

The check against the heap isn't for the truthiness of the predicate but the visibility of the row.
 
Note 2:

        This is odd, but this index is used by the planner:
                CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

        but this index is never used:
                CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;

        I honestly don't know why the second index would not be used.  The query time doubled without it when run on a table with 6million rows and about 20 columns.


​This one requires knowledge of the query; but I am not surprised that reversing the order of columns in a b-tree index has an impact.

-------

The indexes I tested on:

        CREATE INDEX idx_fkey_1 ON table_a(fkey_1);
        CREATE INDEX idx_partial_fkey ON table_a(fkey_1) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_fkey_partial ON table_a(fkey_1, col_partial) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;

​All at once?

David J.
 

Re: optimizing a query

От
Jonathan Vanasco
Дата:
On Jun 22, 2016, at 2:38 PM, David G. Johnston wrote:
> What query?  ​A self-contained email would be nice.​

This was the same query as in the previous email in the thread.  I didn't think to repeat it.  I did include it below.

> ​https://www.postgresql.org/docs/9.6/static/indexes-index-only-scans.html
>
> Note especially:
>
> ​"Visibility information is not stored in index entries, only in heap entries; ..."
>
> The check against the heap isn't for the truthiness of the predicate but the visibility of the row.

Thanks for this link.

The table I worked on hasn't had any writes since a server restart, and according to those docs the queries should have
beenoff the visibility map not the heap.   
However the amount of time to search is not in line with expectations for the visibility map.

After reading the last paragraph about some index optimizations in 9.6 that looked related, I installed the RC on an
anothermachine and dumped 2 tables from production to see if I would qualify for any improvements.   

>>> But there's a problem: the WHERE clause refers to success which is not available as a result column of the index.
Nonetheless,an index-only scan is possible because the plan does not need to recheck that part of the WHERE clause at
runtime:all entries found in the index necessarily have success = true so this need not be explicitly checked in the
plan.PostgreSQL versions 9.6 and later will recognize such cases and allow index-only scans to be generated, but older
versionswill not. 

The 9.6 branch planner optimizes for my query and realizes that it doesn't need to check the table:

So while this index is necessary on 9.5:
    CREATE INDEX idx__9_5 ON table_a(column_1, id, column_2) WHERE column_2 IS NOT FALSE;

This index works on 9.6
    CREATE INDEX idx__9_6 ON table_a(column_1, id) WHERE column_2 IS NOT FALSE;

Considering I have several million rows, this has a noticeable effect .

Combined with the various improvements on 9.6, there is a huge difference in query speed:

    9.6 runs the query with the smaller index in an average of 1200ms
    9.5 runs the query with the larger index in an average of 2700ms


> ​This one requires knowledge of the query; but I am not surprised that reversing the order of columns in a b-tree
indexhas an impact. 

I expected this to impact the decision on which index to use when multiple ones are available, or to offer poor
performance-- but not to discount using the index entirely. 


> ​All at once?

No.  I dropped all indexes to test, then for each column combination did:

    CREATE INDEX foo_idx;
        ANALYZE foo ;
    EXPLAIN ANALYZE;
    DROP INDEX foo_idx;

I call Explain Analyze manually once for the plan, then via script 25x to average out execution times and account for
cold-startvs having loaded all the indexes.  I shut down all other user processes on the machine as well. 





Re: optimizing a query

От
"David G. Johnston"
Дата:
On Wed, Jun 22, 2016 at 2:23 PM, Jonathan Vanasco <postgres@2xlp.com> wrote:

Note 2:

        This is odd, but this index is used by the planner:
                CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;

        but this index is never used:
                CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;

        I honestly don't know why the second index would not be used.  The query time doubled without it when run on a table with 6million rows and about 20 columns.

-------

The indexes I tested on:

        CREATE INDEX idx_fkey_1 ON table_a(fkey_1);
        CREATE INDEX idx_partial_fkey ON table_a(fkey_1) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_fkey_id ON table_a(fkey_1, id) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_id_fkey ON table_a(id, fkey_1) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_fkey_partial ON table_a(fkey_1, col_partial) WHERE col_partial IS NOT FALSE;
        CREATE INDEX idx_partial_fkey_id_partial ON table_a(fkey_1, id, col_partial) WHERE col_partial IS NOT FALSE;


​There is no relation named "table_a" anywhere in your past emails.

Your "t_a" relation is (id : int, col1 : int, ​col_2 : bool) not (id : int?, fkey_1 : int?, col_partial : bool), and given that t_a is an end relation it doesn't typically have a foreign key.

1) t_a is going to use the indexes appropriate to its where clauses to select records.  For the (id, fkey_1) index unless there is dependency [*] (id => fkey_1) you have to scan the entire index so know that you've located all records containing a given fkey_1 value when id is unspecified (as it is in your query).

* if even there was a dependency I don't not believe PostgreSQL would be able to use that knowledge during planning.

2) to join t_a2b an index on t_a2b having a_id as the first column would likely be used to quickly locate matching records from [1].  No additional indexes on t_a are going to help here.

You have a good head for this, and maybe I'm missing something obvious here - this is not my strong suit.  The random, though similar, naming of objects in your posts makes it a bit hard to follow.  As you found the partial indexes might or might not be influencing the results.  And note that the missing column influences the use of an index-only scan - if you are going to get one of those anyway its quite possible a given index will be worse than some non-index alternative.

David J.