Обсуждение: Inheritance & Indexes

Поиск
Список
Период
Сортировка

Inheritance & Indexes

От
Alan Williams
Дата:
Hi,

I'm trying to understand an issue we've encountered when using
inheritance. We've observed this behavior with both 7.2 and 7.3.
The crux of the issue is that indexes are not getting used when
one has a join on the parent table.

Say I have two tables:

  hs.exon.2=> \d ga_psr_transcript
         Table "public.ga_psr_transcript"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 id           | integer                | not null
 parent       | integer                |
 seqname      | character varying(100) | not null
 source_type  | smallint               | not null
 feature_type | smallint               | not null
 start        | integer                | not null
 stop         | integer                | not null
 strand       | character(1)           | not null
 annot_name   | character varying(100) | not null
 depth        | integer                | not null

hs.exon.2=> \d ga_psr_exon
                 Table "public.ga_psr_exon"
        Column         |          Type          | Modifiers
-----------------------+------------------------+-----------
 id                    | integer                | not null
 parent                | integer                |
 seqname               | character varying(100) | not null
 source_type           | smallint               | not null
 feature_type          | smallint               | not null
 start                 | integer                | not null
 stop                  | integer                | not null
 strand                | character(1)           | not null
 annot_name            | character varying(100) | not null
 transcript_cluster_id | integer                | not null
 depth                 | integer                | not null

Both of these tables inherit from a base table ga_annot. Each of these
tables are further subclassed by chromosome (these tables hold genome
annotation information with several 10's thousands to several 100's
thousands of rows in each of the chromosome level tables). Tuples are
only loaded into the chromosome level tables. So we have:

                                     /--- ga_psr_transcript_1
                                    /---- ga_psr_transcript_2
           /---- ga_psr_transcript <----- ga_psr_transcript_3
          /                         \---- ...
ga_annot <
          \                          /--- ga_psr_transcript_1
           \                        /---- ga_psr_transcript_2
            \--- ga_psr_exon ----- <----- ga_psr_transcript_3
                                    \---- ...

                                          ^
                                          All Tuples Loaded into here

There is also a foreign key relationship where the parent of
ga_psr_exon_N references id in ga_psr_transcript_N. An example of
"leaf" tables:

 hs.exon.2=> \d ga_psr_transcript_1
        Table "public.ga_psr_transcript_1"
    Column    |          Type          | Modifiers
--------------+------------------------+-----------
 id           | integer                | not null
 parent       | integer                |
 seqname      | character varying(100) | not null
 source_type  | smallint               | not null
 feature_type | smallint               | not null
 start        | integer                | not null
 stop         | integer                | not null
 strand       | character(1)           | not null
 annot_name   | character varying(100) | not null
 depth        | integer                | not null
Indexes: ga_psr_transcript_1_pkey primary key btree (id),
         ga_psr_transcript_1_start_stop btree ("start", stop),
         ga_psr_transcript_1_stop btree (stop)
Check constraints: "aw_psr_transcript_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand =
'.'::bpchar))
Triggers: RI_ConstraintTrigger_1412526244,
          RI_ConstraintTrigger_1412526245

hs.exon.2=> \d ga_psr_exon_1
                Table "public.ga_psr_exon_1"
        Column         |          Type          | Modifiers
-----------------------+------------------------+-----------
 id                    | integer                | not null
 parent                | integer                |
 seqname               | character varying(100) | not null
 source_type           | smallint               | not null
 feature_type          | smallint               | not null
 start                 | integer                | not null
 stop                  | integer                | not null
 strand                | character(1)           | not null
 annot_name            | character varying(100) | not null
 transcript_cluster_id | integer                | not null
 depth                 | integer                | not null
Indexes: ga_psr_exon_1_pkey primary key btree (id),
         ga_psr_exon_1_parent btree (parent),
         ga_psr_exon_1_start_stop btree ("start", stop),
         ga_psr_exon_1_stop btree (stop)
Check constraints: "aw_psr_exon_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand =
'.'::bpchar))
Triggers: RI_ConstraintTrigger_1412526088,
          RI_ConstraintTrigger_1412526089

hs.exon.2=> select count(*) from ga_psr_transcript_1;
 count
-------
 43398
(1 row)

hs.exon.2=> select count(*) from ga_psr_exon_1;
 count
--------
 176908
(1 row)

Now if I do a join on the "leaf" tables everything looks good:

 hs.exon.2=> explain select * from ga_psr_transcript_1 t, ga_psr_exon_1e where e.parent = t.id;
                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=0.00..9087.71 rows=176908 width=98)
   Merge Cond: ("outer".id = "inner".parent)
   ->  Index Scan using ga_psr_transcript_1_pkey on ga_psr_transcript_1 t  (cost=0.00..1066.17 rows=43398 width=47)
   ->  Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e (cost=0.00..5259.52 rows=176908 width=51)
(4 rows)

If I do a join on the parent table, the optimizer refuses to use the
indicies:

hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e where e.parent = t.id;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Merge Join  (cost=1239155.37..70188119.40 rows=5514877218 width=334)
   Merge Cond: ("outer".id = "inner".parent)
   ->  Sort  (cost=243481.37..244816.14 rows=533908 width=165)
         Sort Key: t.id
         ->  Append  (cost=0.00..10980.08 rows=533908 width=165)
               ->  Seq Scan on ga_psr_transcript t  (cost=0.00..20.00rows=1000 width=165)
               ->  Seq Scan on ga_psr_transcript_1 t  (cost=0.00..881.98rows=43398 width=47)
               ->  Seq Scan on ga_psr_transcript_2 t  (cost=0.00..905.44rows=44544 width=47)
               ->  Seq Scan on ga_psr_transcript_3 t  (cost=0.00..694.82rows=34182 width=47)
               ->  Seq Scan on ga_psr_transcript_4 t  (cost=0.00..617.48rows=30348 width=47)
               ->  Seq Scan on ga_psr_transcript_5 t  (cost=0.00..653.92rows=32192 width=47)
               ->  Seq Scan on ga_psr_transcript_6 t  (cost=0.00..653.88rows=32088 width=47)
               ->  Seq Scan on ga_psr_transcript_7 t  (cost=0.00..570.88rows=28088 width=47)
               ->  Seq Scan on ga_psr_transcript_8 t  (cost=0.00..537.06rows=26406 width=47)
               ->  Seq Scan on ga_psr_transcript_9 t  (cost=0.00..460.06rows=22606 width=47)
               ->  Seq Scan on ga_psr_transcript_10 t(cost=0.00..543.20 rows=26020 width=48)
               ->  Seq Scan on ga_psr_transcript_11 t(cost=0.00..542.82 rows=25982 width=48)
               ->  Seq Scan on ga_psr_transcript_12 t(cost=0.00..490.66 rows=23466 width=48)
               ->  Seq Scan on ga_psr_transcript_13 t(cost=0.00..346.64 rows=16564 width=48)
               ->  Seq Scan on ga_psr_transcript_14 t(cost=0.00..345.68 rows=16568 width=47)
               ->  Seq Scan on ga_psr_transcript_15 t(cost=0.00..358.46 rows=17146 width=48)
               ->  Seq Scan on ga_psr_transcript_16 t(cost=0.00..352.76 rows=16876 width=48)
               ->  Seq Scan on ga_psr_transcript_17 t(cost=0.00..375.72 rows=17972 width=48)
               ->  Seq Scan on ga_psr_transcript_18 t(cost=0.00..295.60 rows=14160 width=48)
               ->  Seq Scan on ga_psr_transcript_19 t(cost=0.00..262.24 rows=12524 width=48)
               ->  Seq Scan on ga_psr_transcript_20 t(cost=0.00..268.34 rows=12834 width=47)
               ->  Seq Scan on ga_psr_transcript_21 t(cost=0.00..145.18 rows=6918 width=47)
               ->  Seq Scan on ga_psr_transcript_22 t(cost=0.00..157.54 rows=7554 width=46)
               ->  Seq Scan on ga_psr_transcript_x t  (cost=0.00..431.08rows=21208 width=47)
               ->  Seq Scan on ga_psr_transcript_y t  (cost=0.00..60.40rows=2940 width=47)
               ->  Seq Scan on ga_psr_transcript_un t  (cost=0.00..7.14rows=314 width=55)
               ->  Seq Scan on ga_psr_transcript_m t  (cost=0.00..1.10rows=10 width=47)
   ->  Sort  (cost=995674.00..1000838.64 rows=2065853 width=169)
         Sort Key: e.parent
         ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
               ->  Seq Scan on ga_psr_exon e  (cost=0.00..0.00 rows=1width=169)
               ->  Seq Scan on ga_psr_exon_1 e  (cost=0.00..3691.08rows=176908 width=51)
               ->  Seq Scan on ga_psr_exon_2 e  (cost=0.00..3565.00rows=170800 width=51)
               ->  Seq Scan on ga_psr_exon_3 e  (cost=0.00..2709.88rows=129788 width=51)
               ->  Seq Scan on ga_psr_exon_4 e  (cost=0.00..2206.62rows=105662 width=51)
               ->  Seq Scan on ga_psr_exon_5 e  (cost=0.00..2376.48rows=113848 width=51)
               ->  Seq Scan on ga_psr_exon_6 e  (cost=0.00..2559.56rows=122156 width=51)
               ->  Seq Scan on ga_psr_exon_7 e  (cost=0.00..2352.78rows=112778 width=51)
               ->  Seq Scan on ga_psr_exon_8 e  (cost=0.00..1981.14rows=94914 width=51)
               ->  Seq Scan on ga_psr_exon_9 e  (cost=0.00..1807.76rows=86576 width=51)
               ->  Seq Scan on ga_psr_exon_10 e  (cost=0.00..2144.76rows=100376 width=52)
               ->  Seq Scan on ga_psr_exon_11 e  (cost=0.00..2158.90rows=100990 width=52)
               ->  Seq Scan on ga_psr_exon_12 e  (cost=0.00..2035.66rows=95266 width=52)
               ->  Seq Scan on ga_psr_exon_13 e  (cost=0.00..1212.48rows=56748 width=52)
               ->  Seq Scan on ga_psr_exon_14 e  (cost=0.00..1380.52rows=64652 width=51)
               ->  Seq Scan on ga_psr_exon_15 e  (cost=0.00..1529.88rows=71588 width=52)
               ->  Seq Scan on ga_psr_exon_16 e  (cost=0.00..1575.04rows=73704 width=52)
               ->  Seq Scan on ga_psr_exon_17 e  (cost=0.00..1811.04rows=84704 width=52)
               ->  Seq Scan on ga_psr_exon_18 e  (cost=0.00..1033.48rows=48348 width=52)
               ->  Seq Scan on ga_psr_exon_19 e  (cost=0.00..1375.44rows=64344 width=52)
               ->  Seq Scan on ga_psr_exon_20 e  (cost=0.00..1037.50rows=48550 width=51)
               ->  Seq Scan on ga_psr_exon_21 e  (cost=0.00..537.40rows=25140 width=51)
               ->  Seq Scan on ga_psr_exon_22 e  (cost=0.00..736.96rows=34596 width=50)
               ->  Seq Scan on ga_psr_exon_x e  (cost=0.00..1512.30rows=72430 width=51)
               ->  Seq Scan on ga_psr_exon_y e  (cost=0.00..205.00rows=9800 width=51)
               ->  Seq Scan on ga_psr_exon_un e  (cost=0.00..25.58rows=1158 width=59)
               ->  Seq Scan on ga_psr_exon_m e  (cost=0.00..1.28 rows=28 width=51)
(62 rows)


Same thing even if I'm querying for a specific tuple:

hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
where e.parent = t.id and t.id = 123;
                                                        QUERY PLAN
------------------------------------------------------------------------
---------------------------------------------------
 Hash Join  (cost=99.06..73488.33 rows=320207 width=334)
   Hash Cond: ("outer".parent = "inner".id)
   ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
         ->  Seq Scan on ga_psr_exon e  (cost=0.00..0.00 rows=1width=169)
         ->  Seq Scan on ga_psr_exon_1 e  (cost=0.00..3691.08rows=176908 width=51)
         ->  Seq Scan on ga_psr_exon_2 e  (cost=0.00..3565.00rows=170800 width=51)
         ->  Seq Scan on ga_psr_exon_3 e  (cost=0.00..2709.88rows=129788 width=51)
         ->  Seq Scan on ga_psr_exon_4 e  (cost=0.00..2206.62rows=105662 width=51)
         ->  Seq Scan on ga_psr_exon_5 e  (cost=0.00..2376.48rows=113848 width=51)
         ->  Seq Scan on ga_psr_exon_6 e  (cost=0.00..2559.56rows=122156 width=51)
         ->  Seq Scan on ga_psr_exon_7 e  (cost=0.00..2352.78rows=112778 width=51)
         ->  Seq Scan on ga_psr_exon_8 e  (cost=0.00..1981.14 rows=94914 width=51)
         ->  Seq Scan on ga_psr_exon_9 e  (cost=0.00..1807.76 rows=86576 width=51)
         ->  Seq Scan on ga_psr_exon_10 e  (cost=0.00..2144.76rows=100376 width=52)
         ->  Seq Scan on ga_psr_exon_11 e  (cost=0.00..2158.90rows=100990 width=52)
         ->  Seq Scan on ga_psr_exon_12 e  (cost=0.00..2035.66rows=95266 width=52)
         ->  Seq Scan on ga_psr_exon_13 e  (cost=0.00..1212.48rows=56748 width=52)
         ->  Seq Scan on ga_psr_exon_14 e  (cost=0.00..1380.52rows=64652 width=51)
         ->  Seq Scan on ga_psr_exon_15 e  (cost=0.00..1529.88rows=71588 width=52)
         ->  Seq Scan on ga_psr_exon_16 e  (cost=0.00..1575.04rows=73704 width=52)
         ->  Seq Scan on ga_psr_exon_17 e  (cost=0.00..1811.04rows=84704 width=52)
         ->  Seq Scan on ga_psr_exon_18 e  (cost=0.00..1033.48rows=48348 width=52)
         ->  Seq Scan on ga_psr_exon_19 e  (cost=0.00..1375.44rows=64344 width=52)
         ->  Seq Scan on ga_psr_exon_20 e  (cost=0.00..1037.50rows=48550 width=51)
         ->  Seq Scan on ga_psr_exon_21 e  (cost=0.00..537.40 rows=25140 width=51)
         ->  Seq Scan on ga_psr_exon_22 e  (cost=0.00..736.96 rows=34596 width=50)
         ->  Seq Scan on ga_psr_exon_x e  (cost=0.00..1512.30 rows=72430 width=51)
         ->  Seq Scan on ga_psr_exon_y e  (cost=0.00..205.00 rows=9800 width=51)
         ->  Seq Scan on ga_psr_exon_un e  (cost=0.00..25.58 rows=1158 width=59)
         ->  Seq Scan on ga_psr_exon_m e  (cost=0.00..1.28 rows=28 width=51)
   ->  Hash  (cost=98.98..98.98 rows=31 width=165)
         ->  Append  (cost=0.00..98.98 rows=31 width=165)
               ->  Seq Scan on ga_psr_transcript t  (cost=0.00..22.50 rows=5 width=165)
                     Filter: (id = 123)
               ->  Index Scan using ga_psr_transcript_1_pkey on ga_psr_transcript_1 t  (cost=0.00..3.03 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_2_pkey on ga_psr_transcript_2 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_3_pkey on ga_psr_transcript_3 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_4_pkey on ga_psr_transcript_4 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_5_pkey on ga_psr_transcript_5 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_6_pkey on ga_psr_transcript_6 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_7_pkey on ga_psr_transcript_7 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_8_pkey on ga_psr_transcript_8 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_9_pkey on ga_psr_transcript_9 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_10_pkey on ga_psr_transcript_10 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_11_pkey on ga_psr_transcript_11 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_12_pkey on ga_psr_transcript_12 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_13_pkey on ga_psr_transcript_13 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_14_pkey on ga_psr_transcript_14 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_15_pkey on ga_psr_transcript_15 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_16_pkey on ga_psr_transcript_16 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_17_pkey on ga_psr_transcript_17 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_18_pkey on ga_psr_transcript_18 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_19_pkey on ga_psr_transcript_19 t  (cost=0.00..3.01 rows=1
width=48)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_20_pkey on ga_psr_transcript_20 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_21_pkey on ga_psr_transcript_21 t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_22_pkey on ga_psr_transcript_22 t  (cost=0.00..3.01 rows=1
width=46)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_x_pkey on ga_psr_transcript_x t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_y_pkey on ga_psr_transcript_y t  (cost=0.00..3.01 rows=1
width=47)
                     Index Cond: (id = 123)
               ->  Index Scan using ga_psr_transcript_un_pkey on ga_psr_transcript_un t  (cost=0.00..3.01 rows=1
width=55)
                     Index Cond: (id = 123)
               ->  Seq Scan on ga_psr_transcript_m t  (cost=0.00..1.12 rows=1 width=47)
                     Filter: (id = 123)
(86 rows)

Any comments or suggestions would be greatly appreciated. Thanks.


---------------------------------------------------------
Alan Williams, PhD           Alan_Williams@Affymetrix.com
Bioinformatics Manager
---------------------------------------------------------



Re: Inheritance & Indexes

От
Stephan Szabo
Дата:
On Tue, 24 Jun 2003, Alan Williams wrote:

>  hs.exon.2=> \d ga_psr_transcript_1
>         Table "public.ga_psr_transcript_1"
>     Column    |          Type          | Modifiers
> --------------+------------------------+-----------
>  id           | integer                | not null
>  parent       | integer                |
>  seqname      | character varying(100) | not null
>  source_type  | smallint               | not null
>  feature_type | smallint               | not null
>  start        | integer                | not null
>  stop         | integer                | not null
>  strand       | character(1)           | not null
>  annot_name   | character varying(100) | not null
>  depth        | integer                | not null
> Indexes: ga_psr_transcript_1_pkey primary key btree (id),
>          ga_psr_transcript_1_start_stop btree ("start", stop),
>          ga_psr_transcript_1_stop btree (stop)
> Check constraints: "aw_psr_transcript_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand =
'.'::bpchar))
> Triggers: RI_ConstraintTrigger_1412526244,
>           RI_ConstraintTrigger_1412526245
>
> hs.exon.2=> \d ga_psr_exon_1
>                 Table "public.ga_psr_exon_1"
>         Column         |          Type          | Modifiers
> -----------------------+------------------------+-----------
>  id                    | integer                | not null
>  parent                | integer                |
>  seqname               | character varying(100) | not null
>  source_type           | smallint               | not null
>  feature_type          | smallint               | not null
>  start                 | integer                | not null
>  stop                  | integer                | not null
>  strand                | character(1)           | not null
>  annot_name            | character varying(100) | not null
>  transcript_cluster_id | integer                | not null
>  depth                 | integer                | not null
> Indexes: ga_psr_exon_1_pkey primary key btree (id),
>          ga_psr_exon_1_parent btree (parent),
>          ga_psr_exon_1_start_stop btree ("start", stop),
>          ga_psr_exon_1_stop btree (stop)
> Check constraints: "aw_psr_exon_1_strand" (((strand = '+'::bpchar) OR (strand = '-'::bpchar)) OR (strand =
'.'::bpchar))
> Triggers: RI_ConstraintTrigger_1412526088,
>           RI_ConstraintTrigger_1412526089
>
> hs.exon.2=> select count(*) from ga_psr_transcript_1;
>  count
> -------
>  43398
> (1 row)
>
> hs.exon.2=> select count(*) from ga_psr_exon_1;
>  count
> --------
>  176908
> (1 row)
>
> Now if I do a join on the "leaf" tables everything looks good:
>
>  hs.exon.2=> explain select * from ga_psr_transcript_1 t, ga_psr_exon_1e where e.parent = t.id;
>                                                      QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------
>  Merge Join  (cost=0.00..9087.71 rows=176908 width=98)
>    Merge Cond: ("outer".id = "inner".parent)
>    ->  Index Scan using ga_psr_transcript_1_pkey on ga_psr_transcript_1 t  (cost=0.00..1066.17 rows=43398 width=47)
>    ->  Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e (cost=0.00..5259.52 rows=176908 width=51)
> (4 rows)
>
> If I do a join on the parent table, the optimizer refuses to use the
> indicies:
>
> hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e where e.parent = t.id;

In this case, you can't use a single index scan to get the rows in order
so the part that makes the above a nice plan doesn't really apply.  If
you're getting all the rows and sorting them, index scans are probably a
waste of time unless you have alot of dead space.  If we supported
multi-table indexes, that'd potentially let you get a plan like
the above.

> -----------------------------------------------------------------------------------------------
>  Merge Join  (cost=1239155.37..70188119.40 rows=5514877218 width=334)
>    Merge Cond: ("outer".id = "inner".parent)
>    ->  Sort  (cost=243481.37..244816.14 rows=533908 width=165)
>          Sort Key: t.id
>          ->  Append  (cost=0.00..10980.08 rows=533908 width=165)
[lots of seqscans snipped]
>    ->  Sort  (cost=995674.00..1000838.64 rows=2065853 width=169)
>          Sort Key: e.parent
>          ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
[more seqscans snipped]

> Same thing even if I'm querying for a specific tuple:
>
> hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> where e.parent = t.id and t.id = 123;

ISTM it's willing to use an index scan on at least some of t's subtables.
Does explicitly saying e.parent=123 help?

>                                                         QUERY PLAN
> ------------------------------------------------------------------------
> ---------------------------------------------------
>  Hash Join  (cost=99.06..73488.33 rows=320207 width=334)
>    Hash Cond: ("outer".parent = "inner".id)
>    ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
[lots of seqscans snipped again]
>    ->  Hash  (cost=98.98..98.98 rows=31 width=165)
>          ->  Append  (cost=0.00..98.98 rows=31 width=165)
>                ->  Seq Scan on ga_psr_transcript t  (cost=0.00..22.50 rows=5 width=165)
>                      Filter: (id = 123)
[snipping a bunch of index scans]
>                ->  Seq Scan on ga_psr_transcript_m t  (cost=0.00..1.12 rows=1 width=47)
>                      Filter: (id = 123)
> (86 rows)



Re: Inheritance & Indexes

От
Alan Williams
Дата:
On Tue, 24 Jun 2003, Stephan Szabo wrote:
> >  hs.exon.2=> explain select * from ga_psr_transcript_1 t,
> ga_psr_exon_1e where e.parent = t.id;
> >                                                      QUERY PLAN
> >
> ------------------------------------------------------------------------
> --------------------------------------------
> >  Merge Join  (cost=0.00..9087.71 rows=176908 width=98)
> >    Merge Cond: ("outer".id = "inner".parent)
> >    ->  Index Scan using ga_psr_transcript_1_pkey on
> ga_psr_transcript_1 t  (cost=0.00..1066.17 rows=43398 width=47)
> >    ->  Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e
> (cost=0.00..5259.52 rows=176908 width=51)
> > (4 rows)
> >
> > If I do a join on the parent table, the optimizer refuses to use the
> > indicies:
> >
> > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> where e.parent = t.id;
>
> In this case, you can't use a single index scan to get the rows in order
> so the part that makes the above a nice plan doesn't really apply.  If
> you're getting all the rows and sorting them, index scans are probably a
> waste of time unless you have alot of dead space.  If we supported
> multi-table indexes, that'd potentially let you get a plan like
> the above.

Because of the foreign key constraint, the database engine could do
the above query on each of the child tables and concatenate the
results. This is because there is a notion in our schema of "paired"
inheritance where both the ga_psr_exon ahd ga_psr_transcript tables
are subclassed by choromosome. IE:

   ga_psr_exon_1.parent --> ga_psr_transcript_1.id

Of course the foreign key is the only
indication of this and I can't say that I'm entirely surprised that
the optimizer doesn't catch this.

This constraint unfortunately breaks down when joining on a non-foreign
key, such as range queries (the rows in these tables represent ranges
in a one dimensional space) like:

 explain select * from ga_psr_exon_1 e1, ga_psr_exon_1 e2 where
 e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and
 e1.stop <= e2.stop;

 Nested Loop  (cost=0.00..995313942.65 rows=386375808 width=102)
   ->  Seq Scan on ga_psr_exon_1 e1  (cost=0.00..3691.08 rows=176908 width=51)
   ->  Index Scan using ga_psr_exon_1_start_stop on ga_psr_exon_1 e2  (cost=0.00..5582.47 rows=2184 width=51)
         Index Cond: (("outer"."start" >= e2."start") AND ("outer".stop >= e2."start") AND ("outer"."start" <= e2.stop)
AND("outer".stop <= e2.stop)) 

versus

 explain select * from ga_psr_exon e1, ga_psr_exon e2 where
 e1.start >= e2.start and e1.start<=e2.stop and e1.stop >= e2.start and
 e1.stop <= e2.stop;

which results in a nested loop of seq scans. (It is actually worse
than this as we would really like to query for overlapping ranges, not
containment.)

Currently we use either a perl middleware or UNIONs to explicitly force
these paired table relationships.

>
> >
> ------------------------------------------------------------------------
> -----------------------
> >  Merge Join  (cost=1239155.37..70188119.40 rows=5514877218 width=334)
> >    Merge Cond: ("outer".id = "inner".parent)
> >    ->  Sort  (cost=243481.37..244816.14 rows=533908 width=165)
> >          Sort Key: t.id
> >          ->  Append  (cost=0.00..10980.08 rows=533908 width=165)
> [lots of seqscans snipped]
> >    ->  Sort  (cost=995674.00..1000838.64 rows=2065853 width=169)
> >          Sort Key: e.parent
> >          ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
> [more seqscans snipped]
>
> > Same thing even if I'm querying for a specific tuple:
> >
> > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> > where e.parent = t.id and t.id = 123;
>
> ISTM it's willing to use an index scan on at least some of t's
> subtables.
> Does explicitly saying e.parent=123 help?

Yes, adding e.parent=123 results in the desired result of index scans
into both tables. However, without including this the optimizer still
predicts 31 results from the index scans on ga_psr_transcript* and yet
insists on using a seq scan into each ga_psr_exon* table. It expects
to get 2065853 rows back from the ga_psr_exon* tables when in reality
it is more like 310 rows.

Thanks for the response.

-Alan

FWIW, we subclass by chromosome for performance reasons. We have
tables (at the chromosome level) with upwards of 6 million rows
against which we run a variety of data mining queries.


Re: Inheritance & Indexes

От
Stephan Szabo
Дата:
On Tue, 24 Jun 2003, Alan Williams wrote:

>
> On Tue, 24 Jun 2003, Stephan Szabo wrote:
> > >  hs.exon.2=> explain select * from ga_psr_transcript_1 t,
> > ga_psr_exon_1e where e.parent = t.id;
> > >                                                      QUERY PLAN
> > >
> > ------------------------------------------------------------------------
> > --------------------------------------------
> > >  Merge Join  (cost=0.00..9087.71 rows=176908 width=98)
> > >    Merge Cond: ("outer".id = "inner".parent)
> > >    ->  Index Scan using ga_psr_transcript_1_pkey on
> > ga_psr_transcript_1 t  (cost=0.00..1066.17 rows=43398 width=47)
> > >    ->  Index Scan using ga_psr_exon_1_parent on ga_psr_exon_1 e
> > (cost=0.00..5259.52 rows=176908 width=51)
> > > (4 rows)
> > >
> > > If I do a join on the parent table, the optimizer refuses to use the
> > > indicies:
> > >
> > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> > where e.parent = t.id;
> >
> > In this case, you can't use a single index scan to get the rows in order
> > so the part that makes the above a nice plan doesn't really apply.  If
> > you're getting all the rows and sorting them, index scans are probably a
> > waste of time unless you have alot of dead space.  If we supported
> > multi-table indexes, that'd potentially let you get a plan like
> > the above.
>
> Because of the foreign key constraint, the database engine could do
> the above query on each of the child tables and concatenate the
> results. This is because there is a notion in our schema of "paired"

I don't think it can do exon_1 -> transcript_1 union exon_2 ->
transcript_2 etc from the above unless there's also a guarantee of
uniqueness since if the same id showed up in transcript_1 and transcript_2
you'd have to join them both to a parent in exon_1.  The individual id
primary keys are not sufficient to show that though so you'd have to join
exon_1 -> transcript_1 union exon_1 -> transcript_2 union exon_2 ->
transcript_1... to guarantee the same results I think.

I don't think that we're ever likely to figure out the optimization for
those cases in any case. Multi-table indexes will probably be coming
eventually which will allow a scan over that rather than the append step.

> > ------------------------------------------------------------------------
> > -----------------------
> > >  Merge Join  (cost=1239155.37..70188119.40 rows=5514877218 width=334)
> > >    Merge Cond: ("outer".id = "inner".parent)
> > >    ->  Sort  (cost=243481.37..244816.14 rows=533908 width=165)
> > >          Sort Key: t.id
> > >          ->  Append  (cost=0.00..10980.08 rows=533908 width=165)
> > [lots of seqscans snipped]
> > >    ->  Sort  (cost=995674.00..1000838.64 rows=2065853 width=169)
> > >          Sort Key: e.parent
> > >          ->  Append  (cost=0.00..43563.52 rows=2065853 width=169)
> > [more seqscans snipped]
> >
> > > Same thing even if I'm querying for a specific tuple:
> > >
> > > hs.exon.2=> explain select * from ga_psr_transcript t, ga_psr_exon e
> > > where e.parent = t.id and t.id = 123;
> >
> > ISTM it's willing to use an index scan on at least some of t's
> > subtables.
> > Does explicitly saying e.parent=123 help?
>
> Yes, adding e.parent=123 results in the desired result of index scans
> into both tables. However, without including this the optimizer still
> predicts 31 results from the index scans on ga_psr_transcript* and yet
> insists on using a seq scan into each ga_psr_exon* table. It expects
> to get 2065853 rows back from the ga_psr_exon* tables when in reality
> it is more like 310 rows.

Yeah, it's guessing the number of rows rather poorly.  Without the
implied search condition, the index scan wouldn't help barring a small
estimated number of rows in t making nested loop look good (I assume the
t estimate is way off too, does analyzing the various tables or possibly
raising the analyze buckets for the id column and analyzing get that
estimate to something reasonable?).  I *think* 7.4 may be smarter about
implying these conditions as well.



Re: Inheritance & Indexes

От
Tom Lane
Дата:
Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> I *think* 7.4 may be smarter about
> implying these conditions as well.

Not really.  AFAIR the Append-style plan is the only thing you can get
out of the planner for inheritance trees.  This works well enough for
restriction clauses like "id = constant" (since those get pushed down to
the member tables, much as with UNION ALL), but it just isn't gonna be
efficient for join situations.  And I can't see any realistic way for
the planner to realize that only some pairs of child tables need be
joined.

I think the decision to use an inheritance tree for performance reasons
(as opposed to any logical necessity) was probably not a good one.
It'd be better to store all the data in one big table and use partial
indexes for faster access to subsets.

            regards, tom lane

Re: Inheritance & Indexes

От
Stephan Szabo
Дата:
On Wed, 25 Jun 2003, Tom Lane wrote:

> Stephan Szabo <sszabo@megazone23.bigpanda.com> writes:
> > I *think* 7.4 may be smarter about
> > implying these conditions as well.
>
> Not really.  AFAIR the Append-style plan is the only thing you can get
> out of the planner for inheritance trees.  This works well enough for
> restriction clauses like "id = constant" (since those get pushed down to
> the member tables, much as with UNION ALL), but it just isn't gonna be
> efficient for join situations.  And I can't see any realistic way for
> the planner to realize that only some pairs of child tables need be
> joined.

I was actually thinking of the table1.col=table2.col and table1.col=42
implying table2.col=42 when I wrote the above because he was also
wondering why it wasn't using index scans on the table2 tree. Which now
that I have access to my 7.4 box again, it does appear to.