Обсуждение: Question about double table scans for a table

Поиск

Список

Период

Сортировка

Question about double table scans for a table

От

Ba Jinsheng

Дата:

27 июля 2023 г., 11:17:58

Hi everyone,

Consider the query 11 in the TPC-H benchmark:

select

ps_partkey,

sum(ps_supplycost * ps_availqty) as value

from

PARTSUPP,

SUPPLIER,

NATION

where

ps_suppkey = s_suppkey

and s_nationkey = n_nationkey

and n_name = 'MOZAMBIQUE'

group by

ps_partkey

having

sum(ps_supplycost * ps_availqty) > (

select

sum(ps_supplycost * ps_availqty) * 0.0001000000

from

PARTSUPP,

SUPPLIER,

NATION

where

ps_suppkey = s_suppkey

and s_nationkey = n_nationkey

and n_name = 'MOZAMBIQUE'

)

order by

value desc;

PostgreSQL generates the following query plan:

Sort (cost=1798.52..1799.32 rows=320 width=36)

Sort Key: (sum((partsupp.ps_supplycost * (partsupp.ps_availqty)::numeric))) DESC

InitPlan 1 (returns $0)

-> Aggregate (cost=884.20..884.21 rows=1 width=32)

-> Hash Join (cost=12.40..877.00 rows=960 width=10)

Hash Cond: (partsupp_1.ps_suppkey = supplier_1.s_suppkey)

-> Seq Scan on partsupp partsupp_1 (cost=0.00..765.00 rows=24000 width=14)

-> Hash (cost=12.25..12.25 rows=12 width=4)

-> Hash Join (cost=1.32..12.25 rows=12 width=4)

Hash Cond: (supplier_1.s_nationkey = nation_1.n_nationkey)

-> Seq Scan on supplier supplier_1 (cost=0.00..10.00 rows=300 width=8)

-> Hash (cost=1.31..1.31 rows=1 width=4)

-> Seq Scan on nation nation_1 (cost=0.00..1.31 rows=1 width=4)

Filter: (n_name = 'MOZAMBIQUE'::bpchar)

-> HashAggregate (cost=886.60..901.00 rows=320 width=36)

Group Key: partsupp.ps_partkey

Filter: (sum((partsupp.ps_supplycost * (partsupp.ps_availqty)::numeric)) > $0)

-> Hash Join (cost=12.40..877.00 rows=960 width=14)

Hash Cond: (partsupp.ps_suppkey = supplier.s_suppkey)

-> Seq Scan on partsupp (cost=0.00..765.00 rows=24000 width=18)

-> Hash (cost=12.25..12.25 rows=12 width=4)

-> Hash Join (cost=1.32..12.25 rows=12 width=4)

Hash Cond: (supplier.s_nationkey = nation.n_nationkey)

-> Seq Scan on supplier (cost=0.00..10.00 rows=300 width=8)

-> Hash (cost=1.31..1.31 rows=1 width=4)

-> Seq Scan on nation (cost=0.00..1.31 rows=1 width=4)

Filter: (n_name = 'MOZAMBIQUE'::bpchar)

While TiDB has the following query plan:

Projection_63

└─Sort_64

└─Selection_66

└─HashAgg_67

└─Projection_94

└─HashJoin_71

├─HashJoin_84(Build)

│ ├─TableReader_89(Build)

│ │ └─Selection_88

│ │ └─TableFullScan_87

│ └─TableReader_86(Probe)

│ └─TableFullScan_85

└─TableReader_91(Probe)

└─TableFullScan_90

...

Both query plans include different numbers of table scans, as highlighted in red color. PostgreSQL uses six table scans, while TiDB has only three. I understand that the table scanning operation is expensive and query plans are typically more efficient with fewer table scans. My question is why PostgreSQL uses six table scans to scan each table twice? Is it a more efficient query plan, or does this indicate an optimization that is not performed by PostgreSQL?

Best regards,

Jinsheng Ba

Re: Question about double table scans for a table

От

David Rowley

Дата:

27 июля 2023 г., 23:38:35

On Thu, 27 Jul 2023 at 20:49, Ba Jinsheng <bajinsheng@u.nus.edu> wrote:
> Both query plans include different numbers of table scans, as highlighted in red color. PostgreSQL uses six table
scans,while TiDB has only three. I understand that the table scanning operation is expensive and query plans are
typicallymore efficient with fewer table scans. My question is why PostgreSQL uses six table scans to scan each table
twice?Is it a more efficient query plan, or does this indicate an optimization that is not performed by PostgreSQL? 

The PostgreSQL planner does not do any scan deduplication like this.
You could likely write a query containing a WITH MATERALIZE that runs
the query with the GROUP BY, then reference the CTE in both the main
query and also the HAVING clause. e.g, something like:

explain with cte as materialized (select ps_partkey, sum(ps_supplycost
* ps_availqty) as cost from partsupp ... other joins...) select * from
cte having cost > (select sum(cost) from cte);

How much more efficient that'll be will depend on the number of distinct parts.

David

Re: Question about double table scans for a table

От

jian he

Дата:

28 июля 2023 г., 03:11:58

On Thu, Jul 27, 2023 at 4:49 PM Ba Jinsheng <bajinsheng@u.nus.edu> wrote:
>
> Hi everyone,
>
>
>
> Consider the query 11 in the TPC-H benchmark:
>
> select
>
>     ps_partkey,
>
>     sum(ps_supplycost * ps_availqty) as value
>
> from
>
>     PARTSUPP,
>
>     SUPPLIER,
>
>     NATION
>
> where
>
>     ps_suppkey = s_suppkey
>
>     and s_nationkey = n_nationkey
>
>     and n_name = 'MOZAMBIQUE'
>
> group by
>
>     ps_partkey
>
> having
>
>     sum(ps_supplycost * ps_availqty) > (
>
>         select
>
>             sum(ps_supplycost * ps_availqty) * 0.0001000000
>
>         from
>
>             PARTSUPP,
>
>             SUPPLIER,
>
>             NATION
>
>         where
>
>             ps_suppkey = s_suppkey
>
>             and s_nationkey = n_nationkey
>
>             and n_name = 'MOZAMBIQUE'
>
>     )
>
> order by
>
>     value desc;
>
>

I think you query is equivalent to following:

select
    ps_partkey,
    sum(ps_supplycost * ps_availqty) filter (where ps_supplycost > 0
and ps_availqty > 0 ) as value
from
    PARTSUPP,
    SUPPLIER,
    NATION
where
    ps_suppkey = s_suppkey
    and s_nationkey = n_nationkey
    and n_name = 'MOZAMBIQUE'
group by
    ps_partkey;

maybe you can use inner join like:
select
    ps_partkey,
    sum(ps_supplycost * ps_availqty) filter (where ps_supplycost > 0
and ps_availqty > 0 ) as value
from PARTSUPP join SUPPLIER on (ps_suppkey = s_suppkey)
        join NATION on (s_nationkey = n_nationkey)
where n_name = 'MOZAMBIQUE'
group by
    ps_partkey;

Re: Question about double table scans for a table

От

David Rowley

Дата:

28 июля 2023 г., 07:01:34

On Fri, 28 Jul 2023 at 12:12, jian he <jian.universality@gmail.com> wrote:
> I think you query is equivalent to following:
>
> select
>     ps_partkey,
>     sum(ps_supplycost * ps_availqty) filter (where ps_supplycost > 0
> and ps_availqty > 0 ) as value

The FILTER clause is applied before aggregation.  HAVING is applied
after aggregation. This is not even nearly the same.

(You might have forgotten that numbers can be negative and also you
might have missed the * 0.0001000000.)

The original query seems to want all parts apart from the ones that
are below 1/10000th of the total ps_supplycost * ps_availqty for all
parts.

David

Re: Question about double table scans for a table

От

jian he

Дата:

28 июля 2023 г., 08:34:07

On Fri, Jul 28, 2023 at 12:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 28 Jul 2023 at 12:12, jian he <jian.universality@gmail.com> wrote:
> > I think you query is equivalent to following:
> >
> > select
> >     ps_partkey,
> >     sum(ps_supplycost * ps_availqty) filter (where ps_supplycost > 0
> > and ps_availqty > 0 ) as value
>
> The FILTER clause is applied before aggregation.  HAVING is applied
> after aggregation. This is not even nearly the same.
>
> (You might have forgotten that numbers can be negative and also you
> might have missed the * 0.0001000000.)
>
> The original query seems to want all parts apart from the ones that
> are below 1/10000th of the total ps_supplycost * ps_availqty for all
> parts.
>
> David

Is this equivalent to the original query?

select ps_partkey, value from
(
    select
        ps_partkey,
        sum(ps_supplycost * ps_availqty) as value,
        sum(ps_supplycost * ps_availqty) over(partition by ps_partkey)
* 0.0001000000 as temp
    from
        PARTSUPP,
        SUPPLIER,
        NATION
    where
        ps_suppkey = s_suppkey
        and s_nationkey = n_nationkey
        and n_name = 'MOZAMBIQUE'
    group by
        ps_partkey
) sub1
where value > temp
order by value desc;

Re: Question about double table scans for a table

От

Jeff Janes

Дата:

28 июля 2023 г., 16:26:30

> or does this indicate an optimization that is not performed by PostgreSQL?

Yes, this is an optimization which PostgreSQL doesn't do.

PostgreSQL does not claim to implement every conceivable optimization, so this is not a bug. You are in the wrong forum. If you had working code to implement this, or a serious plan to write some, then pgsql-hackers would be the right place to go. But if you are just pointing out a curiosity, I doubt you would get much traction there. (I could be wrong, maybe there is some easy way to hook this into the same code used by GROUPING SETS which someone would be willing to do just based on your example, but I don't think so asGROUPING SETS produces extra rows, while I think this would need extra columns)

Cheers,

Jeff

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Question about double table scans for a table

Question about double table scans for a table

Re: Question about double table scans for a table

Re: Question about double table scans for a table

Re: Question about double table scans for a table

Re: Question about double table scans for a table

Re: Question about double table scans for a table