Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Tomas Vondra
Date:
Hi,

I ran into a pretty terrible case of a LEFT JOIN estimate, resulting in
an essentially arbitrary underestimate. The query is trivial, nothing
overly complex, and the more I think about it the more I think this is
a fairly fundamental flaw in how we estimate this type of join.

Imagine you have two trivial tables:

  CREATE TABLE large (id INT, a INT);
  INSERT INTO large SELECT i, i FROM generate_series(1,1000000) s(i);

  CREATE TABLE small (id INT, b INT);
  INSERT INTO small SELECT i, i FROM generate_series(1,100) s(i);

The business meaning may be that "large" stores orders and "small" stores
events related to a tiny fraction of the large table (e.g. returns).
And let's do a couple of simple LEFT JOIN queries, adding conditions to them.

Let's start with no condition at all:

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Left Join  (cost=3.25..18179.25 rows=1000000 width=16)
                   (actual time=0.069..550.290 rows=1000000 loops=1)
     Hash Cond: (large.id = small.id)
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                      (actual time=0.010..174.056 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.291 ms
   Execution Time: 663.551 ms
  (8 rows)

So great, this estimate is perfect. Now, let's add an IS NULL condition on
the small table, to find rows without a match (e.g. orders that were not
returned):

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (small.id IS NULL);

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Anti Join  (cost=3.25..27052.36 rows=999900 width=16)
                   (actual time=0.071..544.568 rows=999900 loops=1)
     Hash Cond: (large.id = small.id)
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                     (actual time=0.015..166.019 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.051...
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.260 ms
   Execution Time: 658.379 ms
  (8 rows)

Also very accurate, great! Now let's put a condition on the large table
instead, filtering out some of the rows:

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (large.a IN (1000, 2000, 3000, 4000, 5000));

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Nested Loop Left Join  (cost=0.00..20684.75 rows=5 width=16)
                    (actual time=0.957..127.376 rows=5 loops=1)
     Join Filter: (large.id = small.id)
     Rows Removed by Join Filter: 500
     ->  Seq Scan on large  (cost=0.00..20675.00 rows=5 width=8)
                            (actual time=0.878..127.171 rows=5 loops=1)
           Filter: (a = ANY ('{1000,2000,3000,4000,5000}'::integer[]))
           Rows Removed by Filter: 999995
     ->  Materialize  (cost=0.00..2.50 rows=100 width=8) ...
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.223 ms
   Execution Time: 127.407 ms
  (10 rows)

Also a great estimate! Surely, if we combine both conditions with OR, we'll
get a decent estimate too?

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (small.id IS NULL)
     OR (large.a IN (1000, 2000, 3000, 4000, 5000));

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Left Join  (cost=3.25..18179.88 rows=5 width=16)
                   (actual time=0.073..580.827 rows=999900 loops=1)
     Hash Cond: (large.id = small.id)
     Filter: ((small.id IS NULL) OR
              (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
     Rows Removed by Filter: 100
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                     (actual time=0.015..166.809 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.052...
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.309 ms
   Execution Time: 694.427 ms
  (10 rows)

Well, bummer! This is pretty surprising, because if clause A alone produces
an estimate of 1M rows and clause B alone is estimated as 5 rows, then it's
expected that (A OR B) should be estimated as something >= max(1M, 5). For
users running into this, it has to be really surprising.

It's also quite serious, because with underestimates like this the
planner is likely to pick nestloops for additional joins, and we all
know how that performs for millions of rows ...

So, how does this happen? Well, the simple reason is that joins are
estimated by applying the selectivities of all the clauses to a cartesian
product. So we calculate the product (100 * 1M), and then apply the
selectivities for the join condition and the two single-table clauses.

The problem is that the selectivity for "IS NULL" is estimated using the
table-level statistics. But the LEFT JOIN entirely breaks the idea that
the null_frac has anything to do with NULLs in the join result. The join
result is not a subset of the cartesian product - it also contains the
NULL-extended rows, which don't exist in the product at all. Even if the
small table has no NULLs, the join can have them - that's the whole point
of outer joins.
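
To make the arithmetic concrete, here is a back-of-the-envelope
reconstruction of the rows=5 estimate from the plan above, written as a
tiny standalone C program. The individual formulas (1/ndistinct for the
equality clause, null_frac for the IS NULL test, clamping a left join to
the outer row count) are just my sketch of the usual logic, not the actual
planner code:

  #include <stdio.h>

  int
  main(void)
  {
      double  large_rows = 1000000.0;
      double  small_rows = 100.0;

      /* join clause: roughly 1 / max(ndistinct of the two id columns) */
      double  join_sel = 1.0 / 1000000.0;

      /* small.id has null_frac = 0, so "small.id IS NULL" estimates as ~0 */
      double  isnull_sel = 0.0;

      /* IN list: 5 values out of ~1M distinct values of large.a */
      double  in_sel = 5.0 / 1000000.0;

      /* OR of two clauses: s1 + s2 - s1 * s2 */
      double  or_sel = isnull_sel + in_sel - isnull_sel * in_sel;

      /* join size: cartesian product times join-clause selectivity ... */
      double  join_rows = large_rows * small_rows * join_sel;

      /* ... clamped so a left join keeps at least the outer rows ... */
      if (join_rows < large_rows)
          join_rows = large_rows;

      /* ... and the WHERE clause selectivity applied on top of that */
      printf("estimated rows: %.0f\n", join_rows * or_sel);

      return 0;
  }

Running it prints "estimated rows: 5", matching the plan - the IS NULL arm
contributes exactly nothing, because null_frac of small.id is 0.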

When there's only the IS NULL condition, we actually recognize this as
a special case and treat the join as an antijoin (Hash Anti Join), and that
also fixes the estimate - antijoins do handle this fine. But as soon as
you change the condition a bit and do the IS NULL check on the other
column of the table, it goes wrong too:

  EXPLAIN ANALYZE
  SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
  WHERE (small.b IS NULL);

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Left Join  (cost=3.25..18179.25 rows=1 width=16)
                   (actual time=0.311..3110.298 rows=999900 loops=1)
     Hash Cond: (large.id = small.id)
     Filter: (small.b IS NULL)
     Rows Removed by Filter: 100
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                      (actual time=0.014..1032.497 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.287...
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) ...
   Planning Time: 0.105 ms
   Execution Time: 4083.634 ms
  (10 rows)

I'd bet most users would be rather surprised to learn this subtle change
makes a difference.

I wonder how to improve this, say by adjusting the IS NULL selectivity
when we know to operate on the outer side of the join. We're able to
do this for antijoins, so maybe we could do that here, somehow?

Unfortunately the important things (join type detection, IS NULL clause
estimation) happen pretty far away from each other, but maybe we could
pass the info about the fraction of NULLs introduced by the join down to
nulltestsel, and use it there (essentially by doing null_frac +
join_null_frac). And maybe we could do something like that for NULL tests
on other columns from the outer side ...
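
For example, a minimal sketch of the combination I have in mind, assuming
we had such a "join_null_frac" available (nothing computes or passes it
today), written so the result can't exceed 1.0:

  /*
   * Hypothetical helper: rows injected by the outer join are all NULL for
   * this column, the remaining rows keep the base null_frac from the
   * per-column statistics.
   */
  double
  nullfrac_with_outer_join(double null_frac, double join_null_frac)
  {
      return join_null_frac + (1.0 - join_null_frac) * null_frac;
  }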

The other thing that might be beneficial is calculating boundaries for
the estimate. In this case we're capable of estimating the join size for
the individual conditions, and then we could ensure the final estimate
for the "a OR b" join is >= the max of these cardinalities.

Of course, that might be expensive if we have to redo some of the join
planning/estimates for each individual condition (otherwise we might not
notice the antijoin transformation and things like that).
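
The clamping itself would be trivial - something like the sketch below,
where rows_isnull and rows_in are the per-clause join cardinalities we'd
have to compute first (hypothetical values, and computing them is the
expensive part):

  /* clamp the (A OR B) join estimate from below by the per-clause estimates */
  double
  clamp_or_join_estimate(double or_rows, double rows_isnull, double rows_in)
  {
      double  floor_rows = (rows_isnull > rows_in) ? rows_isnull : rows_in;

      return (or_rows < floor_rows) ? floor_rows : or_rows;
  }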


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
> The problem is that the selectivity for "IS NULL" is estimated using the
> table-level statistics. But the LEFT JOIN entirely breaks the idea that
> the null_frac has anything to do with NULLs in the join result.

Right.

> I wonder how to improve this, say by adjusting the IS NULL selectivity
> when we know to operate on the outer side of the join. We're able to
> do this for antijoins, so maybe we could do that here, somehow?

This mess is part of the long-term plan around the work I've been doing
on outer-join-aware Vars.  We now have infrastructure that can let
the estimator routines see "oh, this Var isn't directly from a scan
of its table, it's been passed through a potentially-nulling outer
join --- and I can see which one".  I don't have more than vague ideas
about what happens next, but that is clearly an essential step on the
road to doing better.

            regards, tom lane



Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Tomas Vondra
Date:
On 6/24/23 02:08, Tom Lane wrote:
> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>> The problem is that the selectivity for "IS NULL" is estimated using the
>> table-level statistics. But the LEFT JOIN entirely breaks the idea that
>> the null_frac has anything to do with NULLs in the join result.
> 
> Right.
> 
>> I wonder how to improve this, say by adjusting the IS NULL selectivity
>> when we know to operate on the outer side of the join. We're able to
>> do this for antijoins, so maybe we could do that here, somehow?
> 
> This mess is part of the long-term plan around the work I've been doing
> on outer-join-aware Vars.  We now have infrastructure that can let
> the estimator routines see "oh, this Var isn't directly from a scan
> of its table, it's been passed through a potentially-nulling outer
> join --- and I can see which one".  I don't have more than vague ideas
> about what happens next, but that is clearly an essential step on the
> road to doing better.
> 

I was wondering if that work on outer-join-aware Vars could help with
this, but I wasn't following it very closely. I agree the ability to
check if the Var could be NULL due to an outer join seems useful, as it
says whether applying raw attribute statistics makes sense or not.

I was thinking about what to do for the case when that's not possible,
i.e. when the Var refers to nullable side of the join. Knowing that this
is happening is clearly not enough - we need to know how many new NULLs
are "injected" into the join result, and "communicate" that to the
estimation routines.

Attached is a very ugly experimental patch doing that, and with it the
estimate changes to this:

                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Left Join  (cost=3.25..18179.88 rows=999900 width=16)
                   (actual time=0.528..596.151 rows=999900 loops=1)
     Hash Cond: (large.id = small.id)
     Filter: ((small.id IS NULL) OR
              (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
     Rows Removed by Filter: 100
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                     (actual time=0.069..176.138 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8)
               (actual time=0.371..0.373 rows=100 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8)
                         (actual time=0.032..0.146 rows=100 loops=1)
   Planning Time: 3.845 ms
   Execution Time: 712.405 ms
  (10 rows)

Seems nice, but. The patch is pretty ugly, I don't claim it works for
other queries or that this is exactly what we should do. It calculates
"unmatched frequency" next to eqjoinsel_inner, stashes that info into
sjinfo and the estimator (nulltestsel) then uses that to adjust the
nullfrac it gets from the statistics.

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how these columns are correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there are a lot of rows without a match, that seems bad.

So maybe we should split the join estimate into two parts, one for each
subset of the join result. One for the rows with a match (and then we
can just do what we do now, with the attribute stats we already have).
And one for the "unmatched part" where we know the values on the outer
side are NULL (and then we can easily "fake" stats with null_frac=1.0).
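
As a sketch, the final clause selectivity would then be a weighted sum of
the two per-subset estimates - all names here are hypothetical, none of
this exists in the planner today:

  /*
   * matched_frac  - fraction of outer rows with a match on the nullable side
   * sel_matched   - clause selectivity estimated with the regular statistics
   * sel_unmatched - clause selectivity estimated against "fake" statistics
   *                 with null_frac = 1.0 for the nullable-side columns
   */
  double
  split_outer_join_selectivity(double matched_frac,
                               double sel_matched,
                               double sel_unmatched)
  {
      return matched_frac * sel_matched +
             (1.0 - matched_frac) * sel_unmatched;
  }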


I really hope what I just wrote makes at least a little bit of sense.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Andrey Lepikhov
Date:
On 24/6/2023 17:23, Tomas Vondra wrote:
> I really hope what I just wrote makes at least a little bit of sense.
Throw in one more example:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;

Here you can see the same kind of underestimation:
Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)

So the eqjoinsel_unmatch_left() function should be modified for the case 
where nd1<nd2.

-- 
regards,
Andrey Lepikhov
Postgres Professional




Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Alena Rybakina
Date:
Hi, all!

On 24.06.2023 14:23, Tomas Vondra wrote:
On 6/24/23 02:08, Tom Lane wrote:
Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
The problem is that the selectivity for "IS NULL" is estimated using the
table-level statistics. But the LEFT JOIN entirely breaks the idea that
the null_frac has anything to do with NULLs in the join result.
Right.

I wonder how to improve this, say by adjusting the IS NULL selectivity
when we know to operate on the outer side of the join. We're able to
do this for antijoins, so maybe we could do that here, somehow?
This mess is part of the long-term plan around the work I've been doing
on outer-join-aware Vars.  We now have infrastructure that can let
the estimator routines see "oh, this Var isn't directly from a scan
of its table, it's been passed through a potentially-nulling outer
join --- and I can see which one".  I don't have more than vague ideas
about what happens next, but that is clearly an essential step on the
road to doing better.

I was wondering if that work on outer-join-aware Vars could help with
this, but I wasn't following it very closely. I agree the ability to
check if the Var could be NULL due to an outer join seems useful, as it
says whether applying raw attribute statistics makes sense or not.

I was thinking about what to do for the case when that's not possible,
i.e. when the Var refers to nullable side of the join. Knowing that this
is happening is clearly not enough - we need to know how many new NULLs
are "injected" into the join result, and "communicate" that to the
estimation routines.

Attached is a very ugly experimental patch doing that, and with it the
estimate changes to this:
                                 QUERY PLAN
  ----------------------------------------------------------------------
   Hash Left Join  (cost=3.25..18179.88 rows=999900 width=16)
                   (actual time=0.528..596.151 rows=999900 loops=1)
     Hash Cond: (large.id = small.id)
     Filter: ((small.id IS NULL) OR
              (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
     Rows Removed by Filter: 100
     ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
                     (actual time=0.069..176.138 rows=1000000 loops=1)
     ->  Hash  (cost=2.00..2.00 rows=100 width=8)
               (actual time=0.371..0.373 rows=100 loops=1)
           Buckets: 1024  Batches: 1  Memory Usage: 12kB
           ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8)
                         (actual time=0.032..0.146 rows=100 loops=1)
   Planning Time: 3.845 ms
   Execution Time: 712.405 ms
  (10 rows)

Seems nice, but. The patch is pretty ugly, I don't claim it works for
other queries or that this is exactly what we should do. It calculates
"unmatched frequency" next to eqjoinsel_inner, stashes that info into
sjinfo and the estimator (nulltestsel) then uses that to adjust the
nullfrac it gets from the statistics.

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.

So maybe we should split the join estimate into two parts, one for each
subset of the join result. One for the rows with a match (and then we
can just do what we do now, with the attribute stats we already have).
And one for the "unmatched part" where we know the values on the outer
side are NULL (and then we can easily "fake" stats with null_frac=1.0).


I really hope what I just wrote makes at least a little bit of sense.


regards

I am also interested in this problem.

I did some refactoring of the source code in the patch and moved the calculation of unmatched_fraction to eqjoinsel_inner.
I added myself as a co-author of this commit, if you don't mind, and I'm going to continue working on it.


On 26.06.2023 12:22, Andrey Lepikhov wrote:
On 24/6/2023 17:23, Tomas Vondra wrote:
I really hope what I just wrote makes at least a little bit of sense.
Throw in one more example:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;

Here you can see the same kind of underestimation:
Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)

So the eqjoinsel_unmatch_left() function should be modified for the case where nd1<nd2.

Unfortunately, this patch could not fix the cardinality estimate for this query; I'll try to look and figure out what is missing here.

postgres=# SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
SELECT 100000
CREATE TABLE
INSERT 0 2
ANALYZE
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.04..1819.07 rows=1 width=14) (actual time=0.143..114.792 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 1
   ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.027..35.278 rows=100000 loops=1)
   ->  Hash  (cost=1.02..1.02 rows=2 width=10) (actual time=0.014..0.017 rows=2 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on r  (cost=0.00..1.02 rows=2 width=10) (actual time=0.005..0.007 rows=2 loops=1)
 Planning Time: 0.900 ms
 Execution Time: 126.180 ms
(10 rows)


As in the previous query, even with the patch applied, the cardinality is estimated poorly here; I would even say that it has become worse:

EXPLAIN ANALYZE
  SELECT * FROM large FULL JOIN small ON (large.id = small.id)
WHERE (large.a IS NULL);

MASTER:

                                                              QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------------------------------------
 Merge Full Join  (cost=127921.69..299941.59 rows=56503 width=16) (actual time=795.092..795.094 rows=0 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 1000000
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.038..0.046 rows=100 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 29kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.013..0.022 rows=100 loops=1)
   ->  Materialize  (cost=127763.19..132763.44 rows=1000050 width=8) (actual time=363.016..649.103 rows=1000000 loops=1)
         ->  Sort  (cost=127763.19..130263.31 rows=1000050 width=8) (actual time=363.012..481.480 rows=1000000 loops=1)
               Sort Key: large.id
               Sort Method: external merge  Disk: 17664kB
               ->  Seq Scan on large  (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.009..111.166 rows=1000000 loops=1)
 Planning Time: 0.124 ms
 Execution Time: 797.139 ms
(15 rows)

With patch:

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Hash Full Join  (cost=3.25..18179.25 rows=999900 width=16) (actual time=261.480..261.482 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 1000000
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.006..92.827 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.032..0.034 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) (actual time=0.008..0.015 rows=100 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 261.529 ms
(10 rows)

In addition, I found a few more queries where the cardinality estimate with the patch has become better:
EXPLAIN ANALYZE
  SELECT * FROM small LEFT JOIN large ON (large.id = small.id)
WHERE (small.b IS NULL);

MASTER:

                                                 QUERY PLAN                                                  
-------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=32.74..18758.45 rows=55003 width=16) (actual time=0.100..0.104 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   ->  Seq Scan on large  (cost=0.00..14425.50 rows=1000050 width=8) (never executed)
   ->  Hash  (cost=32.60..32.60 rows=11 width=8) (actual time=0.089..0.091 rows=0 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 8kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=11 width=8) (actual time=0.088..0.088 rows=0 loops=1)
               Filter: (b IS NULL)
               Rows Removed by Filter: 100
 Planning Time: 0.312 ms
 Execution Time: 0.192 ms
(10 rows)

With patch:

                                                QUERY PLAN                                                 
-----------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=2.01..18177.02 rows=1 width=16) (actual time=0.127..0.132 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8) (never executed)
   ->  Hash  (cost=2.00..2.00 rows=1 width=8) (actual time=0.112..0.114 rows=0 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 8kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=1 width=8) (actual time=0.111..0.111 rows=0 loops=1)
               Filter: (b IS NULL)
               Rows Removed by Filter: 100
 Planning Time: 0.984 ms
 Execution Time: 0.237 ms
(10 rows)

EXPLAIN ANALYZE
  SELECT * FROM large FULL JOIN small ON (large.id = small.id)
WHERE (small.b IS NULL);

MASTER:

                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Merge Full Join  (cost=127921.69..299941.59 rows=56503 width=16) (actual time=339.478..819.232 rows=999900 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: (small.b IS NULL)
   Rows Removed by Filter: 100
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.129..0.136 rows=100 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 29kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.044..0.075 rows=100 loops=1)
   ->  Materialize  (cost=127763.19..132763.44 rows=1000050 width=8) (actual time=339.260..605.444 rows=1000000 loops=1)
         ->  Sort  (cost=127763.19..130263.31 rows=1000050 width=8) (actual time=339.254..449.930 rows=1000000 loops=1)
               Sort Key: large.id
               Sort Method: external merge  Disk: 17664kB
               ->  Seq Scan on large  (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.032..104.484 rows=1000000 loops=1)
 Planning Time: 0.324 ms
 Execution Time: 859.705 ms
(15 rows)

With patch:

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Hash Full Join  (cost=3.25..18179.25 rows=999900 width=16) (actual time=0.162..349.683 rows=999900 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (small.b IS NULL)
   Rows Removed by Filter: 100
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.021..95.972 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.125..0.127 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) (actual time=0.030..0.059 rows=100 loops=1)
 Planning Time: 0.218 ms
 Execution Time: 385.819 ms
(10 rows)

EXPLAIN ANALYZE
  SELECT * FROM large RIGHT JOIN small ON (large.id = small.id)
  WHERE (large.a IS NULL);

MASTER:

                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 Merge Left Join  (cost=127921.69..299941.59 rows=56503 width=16) (actual time=345.403..345.404 rows=0 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 100
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.033..0.039 rows=100 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 29kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.012..0.020 rows=100 loops=1)
   ->  Materialize  (cost=127763.19..132763.44 rows=1000050 width=8) (actual time=345.287..345.315 rows=101 loops=1)
         ->  Sort  (cost=127763.19..130263.31 rows=1000050 width=8) (actual time=345.283..345.295 rows=101 loops=1)
               Sort Key: large.id
               Sort Method: external merge  Disk: 17664kB
               ->  Seq Scan on large  (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.009..104.648 rows=1000000 loops=1)
 Planning Time: 0.098 ms
 Execution Time: 347.807 ms
(15 rows)

With patch:

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Hash Right Join  (cost=3.25..18179.25 rows=100 width=16) (actual time=209.838..209.842 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 100
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.006..91.571 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.034..0.036 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) (actual time=0.008..0.016 rows=100 loops=1)
 Planning Time: 0.168 ms
 Execution Time: 209.883 ms
(10 rows)

Attachments

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Tomas Vondra
Date:

On 6/26/23 20:15, Alena Rybakina wrote:
> Hi, all!
> 
> On 24.06.2023 14:23, Tomas Vondra wrote:
>> On 6/24/23 02:08, Tom Lane wrote:
>>> Tomas Vondra <tomas.vondra@enterprisedb.com> writes:
>>>> The problem is that the selectivity for "IS NULL" is estimated using the
>>>> table-level statistics. But the LEFT JOIN entirely breaks the idea that
>>>> the null_frac has anything to do with NULLs in the join result.
>>> Right.
>>>
>>>> I wonder how to improve this, say by adjusting the IS NULL selectivity
>>>> when we know to operate on the outer side of the join. We're able to
>>>> do this for antijoins, so maybe we could do that here, somehow?
>>> This mess is part of the long-term plan around the work I've been doing
>>> on outer-join-aware Vars.  We now have infrastructure that can let
>>> the estimator routines see "oh, this Var isn't directly from a scan
>>> of its table, it's been passed through a potentially-nulling outer
>>> join --- and I can see which one".  I don't have more than vague ideas
>>> about what happens next, but that is clearly an essential step on the
>>> road to doing better.
>>>
>> I was wondering if that work on outer-join-aware Vars could help with
>> this, but I wasn't following it very closely. I agree the ability to
>> check if the Var could be NULL due to an outer join seems useful, as it
>> says whether applying raw attribute statistics makes sense or not.
>>
>> I was thinking about what to do for the case when that's not possible,
>> i.e. when the Var refers to nullable side of the join. Knowing that this
>> is happening is clearly not enough - we need to know how many new NULLs
>> are "injected" into the join result, and "communicate" that to the
>> estimation routines.
>>
>> Attached is a very ugly experimental patch doing that, and with it the
>> estimate changes to this:
>>
>>                                  QUERY PLAN
>>   ----------------------------------------------------------------------
>>    Hash Left Join  (cost=3.25..18179.88 rows=999900 width=16)
>>                    (actual time=0.528..596.151 rows=999900 loops=1)
>>      Hash Cond: (large.id = small.id)
>>      Filter: ((small.id IS NULL) OR
>>               (large.a = ANY ('{1000,2000,3000,4000,5000}'::integer[])))
>>      Rows Removed by Filter: 100
>>      ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
>>                      (actual time=0.069..176.138 rows=1000000 loops=1)
>>      ->  Hash  (cost=2.00..2.00 rows=100 width=8)
>>                (actual time=0.371..0.373 rows=100 loops=1)
>>            Buckets: 1024  Batches: 1  Memory Usage: 12kB
>>            ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8)
>>                          (actual time=0.032..0.146 rows=100 loops=1)
>>    Planning Time: 3.845 ms
>>    Execution Time: 712.405 ms
>>   (10 rows)
>>
>> Seems nice, but. The patch is pretty ugly, I don't claim it works for
>> other queries or that this is exactly what we should do. It calculates
>> "unmatched frequency" next to eqjoinsel_inner, stashes that info into
>> sjinfo and the estimator (nulltestsel) then uses that to adjust the
>> nullfrac it gets from the statistics.
>>
>> The good thing is this helps even for IS NULL checks on non-join-key
>> columns (where we don't switch to an antijoin), but there's a couple
>> things that I dislike ...
>>
>> 1) It's not restricted to outer joins or anything like that (this is
>> mostly just my laziness / interest in one particular query, but also
>> something the outer-join-aware patch might help with).
>>
>> 2) We probably don't want to pass this kind of information through
>> sjinfo. That was the simplest thing for an experimental patch, but I
>> suspect it's not the only piece of information we may need to pass to
>> the lower levels of estimation code.
>>
>> 3) I kinda doubt we actually want to move this responsibility (to
>> consider fraction of unmatched rows) to the low-level estimation
>> routines (e.g. nulltestsel and various others). AFAICS this just
>> "introduces NULLs" into the relation, so maybe we could "adjust" the
>> attribute statistics (in examine_variable?) by inflating null_frac and
>> modifying the other frequencies in MCV/histogram.
>>
>> 4) But I'm not sure we actually want to do that in these low-level
>> selectivity functions. The outer join essentially produces output with
>> two subsets - one with matches on the outer side, one without them. But
>> the side without matches has NULLs in all columns. In a way, we know
>> exactly how are these columns correlated - if we do the usual estimation
>> (even with the null_frac adjusted), we just throw this information away.
>> And when there's a lot of rows without a match, that seems bad.
>>
>> So maybe we should split the join estimate into two parts, one for each
>> subset of the join result. One for the rows with a match (and then we
>> can just do what we do now, with the attribute stats we already have).
>> And one for the "unmatched part" where we know the values on the outer
>> side are NULL (and then we can easily "fake" stats with null_frac=1.0).
>>
>>
>> I really hope what I just wrote makes at least a little bit of sense.
>>
>>
>> regards
>>
> I am also interested in this problem.
> 
> I did some refactoring of the source code in the patch, moved the
> calculation of unmatched_fraction to eqjoinsel_inner.
> I wrote myself in this commit as a co-author, if you don't mind, and I'm
> going to continue working.
> 

Sure, if you want to take over the patch and continue working on it,
that's perfectly fine with me. I'm happy to cooperate on it, doing
reviews etc.

> 
> On 26.06.2023 12:22, Andrey Lepikhov wrote:
>> On 24/6/2023 17:23, Tomas Vondra wrote:
>>> I really hope what I just wrote makes at least a little bit of sense.
>> Throw in one more example:
>>
>> SELECT i AS id INTO l FROM generate_series(1,100000) i;
>> CREATE TABLE r (id int8, v text);
>> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
>> ANALYZE l,r;
>> EXPLAIN ANALYZE
>> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
>>
>> Here you can see the same kind of underestimation:
>> Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)
>>
>> So the eqjoinsel_unmatch_left() function should be modified for the
>> case where nd1<nd2.
>>
> Unfortunately, this patch could not fix the cardinality calculation in
> this request, I'll try to look and figure out what is missing here.
> 
> postgres=# SELECT i AS id INTO l FROM generate_series(1,100000) i;
> CREATE TABLE r (id int8, v text);
> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
> ANALYZE l,r;
> EXPLAIN ANALYZE
> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
> SELECT 100000
> CREATE TABLE
> INSERT 0 2
> ANALYZE
>                                                   QUERY
> PLAN                                                   
> ---------------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=1.04..1819.07 rows=1 width=14) (actual
> time=0.143..114.792 rows=99999 loops=1)
>    Hash Cond: (l.id = r.id)
>    Filter: (r.v IS NULL)
>    Rows Removed by Filter: 1
>    ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual
> time=0.027..35.278 rows=100000 loops=1)
>    ->  Hash  (cost=1.02..1.02 rows=2 width=10) (actual time=0.014..0.017
> rows=2 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>          ->  Seq Scan on r  (cost=0.00..1.02 rows=2 width=10) (actual
> time=0.005..0.007 rows=2 loops=1)
>  Planning Time: 0.900 ms
>  Execution Time: 126.180 ms
> (10 rows)
> 
> 
> As in the previous query, even with applied the patch, the cardinality
> is calculated poorly here, I would even say that it has become worse:
> 
> EXPLAIN ANALYZE
>   SELECT * FROM large FULL JOIN small ON (large.id = small.id)
> WHERE (large.a IS NULL);
> 
> MASTER:
> 
>                                                               QUERY
> PLAN                                                         
>
-----------------------------------------------------------------------------------------------------------------------------------
>  Merge Full Join  (cost=127921.69..299941.59 rows=56503 width=16)
> (actual time=795.092..795.094 rows=0 loops=1)
>    Merge Cond: (small.id = large.id)
>    Filter: (large.a IS NULL)
>    Rows Removed by Filter: 1000000
>    ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual
> time=0.038..0.046 rows=100 loops=1)
>          Sort Key: small.id
>          Sort Method: quicksort  Memory: 29kB
>          ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8)
> (actual time=0.013..0.022 rows=100 loops=1)
>    ->  Materialize  (cost=127763.19..132763.44 rows=1000050 width=8)
> (actual time=363.016..649.103 rows=1000000 loops=1)
>          ->  Sort  (cost=127763.19..130263.31 rows=1000050 width=8)
> (actual time=363.012..481.480 rows=1000000 loops=1)
>                Sort Key: large.id
>                Sort Method: external merge  Disk: 17664kB
>                ->  Seq Scan on large  (cost=0.00..14425.50 rows=1000050
> width=8) (actual time=0.009..111.166 rows=1000000 loops=1)
>  Planning Time: 0.124 ms
>  Execution Time: 797.139 ms
> (15 rows)
> 
> With patch:
> 
>                                                       QUERY
> PLAN                                                      
>
----------------------------------------------------------------------------------------------------------------------
>  Hash Full Join  (cost=3.25..18179.25 rows=999900 width=16) (actual
> time=261.480..261.482 rows=0 loops=1)
>    Hash Cond: (large.id = small.id)
>    Filter: (large.a IS NULL)
>    Rows Removed by Filter: 1000000
>    ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8)
> (actual time=0.006..92.827 rows=1000000 loops=1)
>    ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual
> time=0.032..0.034 rows=100 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 12kB
>          ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8)
> (actual time=0.008..0.015 rows=100 loops=1)
>  Planning Time: 0.151 ms
>  Execution Time: 261.529 ms
> (10 rows)
>
> 
> In addition, I found a few more queries, where the estimation of
> cardinality with the patch has become better:
> 
> 

Yes, this does not surprise me at all - the patch was very early /
experimental code, and I only really aimed it at that single example
query. So don't hesitate to rethink what/how the patch works.

I do think collecting / constructing a wider set of queries with joins
of different types is going to be an important step in working on this
patch and making sure it doesn't break anything.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Alena Rybakina
Date:

Hi, all!

On 26.06.2023 12:22, Andrey Lepikhov wrote:
On 24/6/2023 17:23, Tomas Vondra wrote:
I really hope what I just wrote makes at least a little bit of sense.
Throw in one more example:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;

Here you can see the same kind of underestimation:
Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)

So the eqjoinsel_unmatch_left() function should be modified for the case where nd1<nd2.


Unfortunately, this patch could not fix the cardinality calculation in this request, I'll try to look and figure out what is missing here.

I tried to fix the cardinality estimate for the query above with this change:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
                 * if we're calculating fraction of NULLs or fraction of unmatched rows.
                 */
                // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-               if (nd1 > nd2)
+               if (nd1 != nd2)
                {
-                       selec /= nd1;
-                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+                       selec /= Max(nd1, nd2);
+                       *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
                }
+               /*if (nd1 > nd2)
+               {
+                       selec /= nd1;
+                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+               }*/
                else
                {
                        selec /= nd2;

and it worked:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR:  relation "l" already exists
ERROR:  relation "r" already exists
INSERT 0 2
ANALYZE
                                                  QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2
   ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
 Planning Time: 0.954 ms
 Execution Time: 92.309 ms
(10 rows)

It looks too simple and I suspect that I might have missed something somewhere, but so far I haven't found any examples of queries where it doesn't work.

I didn't see it breaking anything in the examples from my previous email [1].

1. https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru


Unfortunately, I can't quite understand your idea in point 4 - could you explain it?

The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...

1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).

2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.

3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.

4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.

-- 
Regards,
Alena Rybakina
Postgres Professional
Attachments

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Tomas Vondra
Date:

On 7/6/23 15:51, Alena Rybakina wrote:
> Hi, all!
> 
> On 26.06.2023 12:22, Andrey Lepikhov wrote:
>> On 24/6/2023 17:23, Tomas Vondra wrote:
>>> I really hope what I just wrote makes at least a little bit of sense.
>> Throw in one more example:
>>
>> SELECT i AS id INTO l FROM generate_series(1,100000) i;
>> CREATE TABLE r (id int8, v text);
>> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
>> ANALYZE l,r;
>> EXPLAIN ANALYZE
>> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
>>
>> Here you can see the same kind of underestimation:
>> Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)
>>
>> So the eqjoinsel_unmatch_left() function should be modified for the
>> case where nd1<nd2.
>>
>>
>> Unfortunately, this patch could not fix the cardinality calculation in
>> this request, I'll try to look and figure out what is missing here.
> 
> I tried to fix the cardinality score in the query above by changing:
> 
> diff --git a/src/backend/utils/adt/selfuncs.c
> b/src/backend/utils/adt/selfuncs.c
> index 8e18aa1dd2b..40901836146 100644
> --- a/src/backend/utils/adt/selfuncs.c
> +++ b/src/backend/utils/adt/selfuncs.c
> @@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
>                  * if we're calculating fraction of NULLs or fraction of
> unmatched rows.
>                  */
>                 // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
> -               if (nd1 > nd2)
> +               if (nd1 != nd2)
>                 {
> -                       selec /= nd1;
> -                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
> +                       selec /= Max(nd1, nd2);
> +                       *unmatched_frac = abs(nd1 - nd2) * 1.0 /
> Max(nd1, nd2);
>                 }
> +               /*if (nd1 > nd2)
> +               {
> +                       selec /= nd1;
> +                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
> +               }*/
>                 else
>                 {
>                         selec /= nd2;
> 
> and it worked:
> 
> SELECT i AS id INTO l FROM generate_series(1,100000) i;
> CREATE TABLE r (id int8, v text);
> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
> ANALYZE l,r;
> EXPLAIN ANALYZE
> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
> ERROR:  relation "l" already exists
> ERROR:  relation "r" already exists
> INSERT 0 2
> ANALYZE
>                                                   QUERY
> PLAN                                                   
> ---------------------------------------------------------------------------------------------------------------
>  Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual
> time=0.152..84.184 rows=99999 loops=1)
>    Hash Cond: (l.id = r.id)
>    Filter: (r.v IS NULL)
>    Rows Removed by Filter: 2
>    ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual
> time=0.040..27.635 rows=100000 loops=1)
>    ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022
> rows=4 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 9kB
>          ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual
> time=0.009..0.011 rows=4 loops=1)
>  Planning Time: 0.954 ms
>  Execution Time: 92.309 ms
> (10 rows)
> 
> It looks too simple and I suspect that I might have missed something
> somewhere, but so far I haven't found any examples of queries where it
> doesn't work.
> 
> I didn't see it breaking anything in the examples from my previous
> letter [1].
> 

I think it's correct. Or at least it doesn't break anything my patch
didn't already break. My patch was simply written for one specific
query, so it didn't consider the option that the nd1 and nd2 values
might be in the opposite direction ...

> 1.
> https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru
> 
> 
> Unfortunately, I can't understand your idea from point 4, please explain it?
> 
> The good thing is this helps even for IS NULL checks on non-join-key
> columns (where we don't switch to an antijoin), but there's a couple
> things that I dislike ...
> 
> 1) It's not restricted to outer joins or anything like that (this is
> mostly just my laziness / interest in one particular query, but also
> something the outer-join-aware patch might help with).
> 
> 2) We probably don't want to pass this kind of information through
> sjinfo. That was the simplest thing for an experimental patch, but I
> suspect it's not the only piece of information we may need to pass to
> the lower levels of estimation code.
> 
> 3) I kinda doubt we actually want to move this responsibility (to
> consider fraction of unmatched rows) to the low-level estimation
> routines (e.g. nulltestsel and various others). AFAICS this just
> "introduces NULLs" into the relation, so maybe we could "adjust" the
> attribute statistics (in examine_variable?) by inflating null_frac and
> modifying the other frequencies in MCV/histogram.
> 
> 4) But I'm not sure we actually want to do that in these low-level
> selectivity functions. The outer join essentially produces output with
> two subsets - one with matches on the outer side, one without them. But
> the side without matches has NULLs in all columns. In a way, we know
> exactly how are these columns correlated - if we do the usual estimation
> (even with the null_frac adjusted), we just throw this information away.
> And when there's a lot of rows without a match, that seems bad.
> 

Well, one option would be to modify all selectivity functions to do
something like the patch does for nulltestsel(). That seems a bit
cumbersome because why should those places care about maybe running on
the outer side of a join, or what? For code in extensions this would be
particularly problematic, I think.

So what I was thinking about doing this in a way that'd make this
automatic, without having to modify the selectivity functions.

Option (3) is very simple - examine_variable would simply adjust the
statistics by tweaking the null_frac field, when looking at variables on
the outer side of the join. But it has issues when estimating multiple
conditions.

Imagine t1 has 1M rows, and we want to estimate

  SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
          WHERE ((t2.a=1) AND (t2.b=1))

but only 50% of the t1 rows have a match in t2. Assume each of the t2
conditions matches 100% of the rows in the table. With the correction, this
means 50% selectivity for each condition. And if we combine them the
usual way, it's 0.5 * 0.5 = 0.25.

But we know all the rows in the "matching" part match the condition, so
the correct selectivity should be 0.5.

In a way, this is just another case of estimation issues due to the
assumption of independence.

But (4) was suggesting we could improve this essentially by treating the
join as two distinct sets of rows

 - the inner join result

 - rows without match on the outer side

For the inner part, we would do estimates as now (using the regular
per-column statistics). If we knew the conditions match 100% of the rows,
we'd still get 100% when the conditions are combined.

For the second part of the join we know the outer side is just NULLs in
all columns, and that'd make the estimation much simpler for most
clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
that's it.

And then we'd just combine these two selectivities. If we know the inner
side is 50% and all rows match the conditions, and no rows in the other
50% match, the selectivity is 50%.

inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.5 * 0.0 = 0.5

Now, we still have issues with independence assumption in each of these
parts separately. But that's OK, I think.

I think (4) could be implemented by doing the current estimation for the
 inner part, and by tweaking examine_variable in the "outer" part in a
way similar to (3). Except that it just sets null_frac=1.0 everywhere.



FWIW, I used "AND" in the example for simplicity, but that'd probably be
pushed to the baserel level. There'd need to be OR to keep it at the
join level, but the overall issue is the same, I think.

Also, this entirely ignores extended statistics - I have no idea how we
might tweak those in (3). For (4) we don't need to tweak those at all,
because for inner part we can just apply them as is, and for outer part
it's irrelevant because everything is NULL.


I hope this makes more sense. If not, let me know and I'll try to
explain it better.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Alena Rybakina
Date:
> Well, one option would be to modify all selectivity functions to do
> something like the patch does for nulltestsel(). That seems a bit
> cumbersome because why should those places care about maybe running on
> the outer side of a join, or what? For code in extensions this would be
> particularly problematic, I think.
Agreed. I would say that we can try it if nothing else works out.
> So what I was thinking about doing this in a way that'd make this
> automatic, without having to modify the selectivity functions.
>
> Option (3) is very simple - examine_variable would simply adjust the
> statistics by tweaking the null_frac field, when looking at variables on
> the outer side of the join. But it has issues when estimating multiple
> conditions.
>
> Imagine t1 has 1M rows, and we want to estimate
>
>    SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
>            WHERE ((t2.a=1) AND (t2.b=1))
>
> but only 50% of the t1 rows have a match in t2. Assume each of the t2
> conditions matches 100% of the rows in the table. With the correction, this
> means 50% selectivity for each condition. And if we combine them the
> usual way, it's 0.5 * 0.5 = 0.25.
>
> But we know all the rows in the "matching" part match the condition, so
> the correct selectivity should be 0.5.
>
> In a way, this is just another case of estimation issues due to the
> assumption of independence.
> FWIW, I used "AND" in the example for simplicity, but that'd probably be
> pushed to the baserel level. There'd need to be OR to keep it at the
> join level, but the overall issue is the same, I think.
>
> Also, this entirely ignores extended statistics - I have no idea how we
> might tweak those in (3).

I understood the idea - it is very similar to what is implemented in the 
current patch.

But I don't understand how to do it in the examine_variable function, to 
be honest.

> But (4) was suggesting we could improve this essentially by treating the
> join as two distinct sets of rows
>
>   - the inner join result
>
>   - rows without match on the outer side
>
> For the inner part, we would do estimates as now (using the regular
> per-column statistics). If we knew the conditions match 100% rows, we'd
> still get 100% when the conditions are combined.
>
> For the second part of the join we know the outer side is just NULLs in
> all columns, and that'd make the estimation much simpler for most
> clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
> that's it.
>
> And then we'd just combine these two selectivities. If we know the inner
> side is 50% and all rows match the conditions, and no rows in the other
> 50% match, the selectivity is 50%.
>
> inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.5 * 0.0 = 0.5
>
> Now, we still have issues with independence assumption in each of these
> parts separately. But that's OK, I think.
>
> I think (4) could be implemented by doing the current estimation for the
>   inner part, and by tweaking examine_variable in the "outer" part in a
> way similar to (3). Except that it just sets null_frac=1.0 everywhere.
>
> For (4) we don't need to tweak those at all,
> because for inner part we can just apply them as is, and for outer part
> it's irrelevant because everything is NULL.
I like this idea the most. I'll try to start with this and implement the
patch.
> I hope this makes more sense. If not, let me know and I'll try to
> explain it better.

Thank you for your explanation.

I will write back soon with the results, or if I have any questions.

-- 
Regards,
Alena Rybakina
Postgres Professional




Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Tomas Vondra
Date:

On 7/8/23 10:29, Alena Rybakina wrote:
> 
>> Well, one option would be to modify all selectivity functions to do
>> something like the patch does for nulltestsel(). That seems a bit
>> cumbersome because why should those places care about maybe running on
>> the outer side of a join, or what? For code in extensions this would be
>> particularly problematic, I think.
> Agree. I would say that we can try it if nothing else works out.
>> So what I was thinking about doing this in a way that'd make this
>> automatic, without having to modify the selectivity functions.
>>
>> Option (3) is very simple - examine_variable would simply adjust the
>> statistics by tweaking the null_frac field, when looking at variables on
>> the outer side of the join. But it has issues when estimating multiple
>> conditions.
>>
>> Imagine t1 has 1M rows, and we want to estimate
>>
>>    SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
>>            WHERE ((t2.a=1) AND (t2.b=1))
>>
>> but only 50% of the t1 rows have a match in t2. Assume each of the t2
>> conditions matches 100% of the rows in the table. With the correction, this
>> means 50% selectivity for each condition. And if we combine them the
>> usual way, it's 0.5 * 0.5 = 0.25.
>>
>> But we know all the rows in the "matching" part match the condition, so
>> the correct selectivity should be 0.5.
>>
>> In a way, this is just another case of estimation issues due to the
>> assumption of independence.
>> FWIW, I used "AND" in the example for simplicity, but that'd probably be
>> pushed to the baserel level. There'd need to be OR to keep it at the
>> join level, but the overall issue is the same, I think.
>>
>> Also, this entirely ignores extended statistics - I have no idea how we
>> might tweak those in (3).
> 
> I understood the idea - it is very similar to what is implemented in the
> current patch.
> 
> But I don't understand how to do it in the examine_variable function, to
> be honest.
> 

Well, I don't have a detailed plan either. In principle it shouldn't be
that hard, I think - examine_variable is loading the statistics, so it
could apply the same null_frac correction, just like nulltestsel would
do a bit later.

The main question is how to pass the information to examine_variable. It
doesn't get the SpecialJoinInfo (which is what nulltestsel used), so
we'd need to invent something new ... add a new argument?

>> But (4) was suggesting we could improve this essentially by treating the
>> join as two distinct sets of rows
>>
>>   - the inner join result
>>
>>   - rows without match on the outer side
>>
>> For the inner part, we would do estimates as now (using the regular
>> per-column statistics). If we knew the conditions match 100% rows, we'd
>> still get 100% when the conditions are combined.
>>
>> For the second part of the join we know the outer side is just NULLs in
>> all columns, and that'd make the estimation much simpler for most
>> clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
>> that's it.
>>
>> And then we'd just combine these two selectivities. If we know the inner
>> side is 50% and all rows match the conditions, and no rows in the other
>> 50% match, the selectivity is 50%.
>>
>> inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.5 * 0.0 = 0.5
>>
>> Now, we still have issues with independence assumption in each of these
>> parts separately. But that's OK, I think.
>>
>> I think (4) could be implemented by doing the current estimation for the
>>   inner part, and by tweaking examine_variable in the "outer" part in a
>> way similar to (3). Except that it just sets null_frac=1.0 everywhere.
>>
>> For (4) we don't need to tweak those at all,
>> because for inner part we can just apply them as is, and for outer part
>> it's irrelevant because everything is NULL.
> I like this idea the most. I'll try to start with this and implement the
> patch.

Good to hear.

>> I hope this makes more sense. If not, let me know and I'll try to
>> explain it better.
> 
> Thank you for your explanation.
> 
> I will write back soon with the results, or if I have any questions.
> 

OK

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Alena Rybakina
Date:
> Well, I don't have a detailed plan either. In principle it shouldn't be
> that hard, I think - examine_variable is loading the statistics, so it
> could apply the same null_frac correction, just like nulltestsel would
> do a bit later.
>
> The main question is how to pass the information to examine_variable. It
> doesn't get the SpecialJoinInfo (which is what nulltestsel used), so
> we'd need to invent something new ... add a new argument?

Sorry I didn't answer right away. I could adapt the last version of the patch [2] to the current idea, but so far I have implemented
it only for the case where we store the number of null values in the SpecialJoinInfo variable.

I'm starting to look at other functions (scalararraysel_containment, boolvarsel) and trying to find some bad cases of the current problem
that could be fixed in a similar way in those functions. But I'm not sure about some of the other functions
(mergejoinscansel, estimate_num_groups, estimate_hash_bucket_stats, get_restriction_variable, get_join_variables).

The examine_variable function is also called from them, and even though there is no outer join involved there
(no SpecialJoinInfo variable), I can't think of similar cases where this problem would arise for the same reasons.
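
For example, one kind of clause I want to check is a ScalarArrayOpExpr on the nullable side of the join, like this
(reusing the large/small tables from earlier; purely illustrative, and I haven't measured yet how badly it is estimated):

EXPLAIN ANALYZE
SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
WHERE (large.a = 1) OR (small.b = ANY ('{1,2,3}'::int[]));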


The code passes all regression tests and I found no deterioration in cardinality prediction for queries from [1], except for one:

EXPLAIN ANALYZE
  SELECT * FROM large FULL JOIN small ON (large.id = small.id)
WHERE (large.a IS NULL);

MASTER:

QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Merge Full Join  (cost=127921.69..299941.59 rows=56503 width=16) (actual time=795.092..795.094 rows=0 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 1000000
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8) (actual time=0.038..0.046 rows=100 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 29kB
         ->  Seq Scan on small  (cost=0.00..32.60 rows=2260 width=8) (actual time=0.013..0.022 rows=100 loops=1)
   ->  Materialize  (cost=127763.19..132763.44 rows=1000050 width=8) (actual time=363.016..649.103 rows=1000000 loops=1)
         ->  Sort  (cost=127763.19..130263.31 rows=1000050 width=8) (actual time=363.012..481.480 rows=1000000 loops=1)
               Sort Key: large.id
               Sort Method: external merge  Disk: 17664kB
               ->  Seq Scan on large (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.009..111.166 rows=1000000 loops=1)
 Planning Time: 0.124 ms
 Execution Time: 797.139 ms
(15 rows)

With patch:

QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Full Join  (cost=3.25..18179.25 rows=999900 width=16) (actual time=261.480..261.482 rows=0 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: (large.a IS NULL)
   Rows Removed by Filter: 1000000
   ->  Seq Scan on large  (cost=0.00..14425.00 rows=1000000 width=8) (actual time=0.006..92.827 rows=1000000 loops=1)
   ->  Hash  (cost=2.00..2.00 rows=100 width=8) (actual time=0.032..0.034 rows=100 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         ->  Seq Scan on small  (cost=0.00..2.00 rows=100 width=8) (actual time=0.008..0.015 rows=100 loops=1)
 Planning Time: 0.151 ms
 Execution Time: 261.529 ms
(10 rows)

[1] https://www.mail-archive.com/pgsql-hackers@lists.postgresql.org/msg146044.html

[2] https://www.postgresql.org/message-id/148ff8f1-067b-1409-c754-af6117de9b7d%40yandex.ru


Unfortunately, I found that my previous change leads to a big error in predicting selectivity.
I will investigate this case in more detail and try to find a solution (I wrote the code and query below).

// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
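/*
 * Presumably nd1 and nd2 are the ndistinct estimates for the two join keys,
 * so the share of distinct values present on only one side is used as the
 * fraction of rows without a join match.
 */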
if (nd1 != nd2)
{
        selec /= Max(nd1, nd2);
        *unmatched_frac = fabs(nd1 - nd2) / Max(nd1, nd2);
}
else
{
        selec /= nd2;
        *unmatched_frac = 0.0;
}


postgres=# explain analyze select * from large l1 inner join large l2 on l1.a is null where l1.a <100;

MASTER:

                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1000.00..35058.43 rows=1000000 width=16) (actual time=91.846..93.622 rows=0 loops=1)
   ->  Gather  (cost=1000.00..10633.43 rows=1 width=8) (actual time=91.844..93.619 rows=0 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Parallel Seq Scan on large l1  (cost=0.00..9633.33 rows=1 width=8) (actual time=86.153..86.154 rows=0 loops=3)
               Filter: ((a IS NULL) AND (a < 100))
               Rows Removed by Filter: 333333
   ->  Seq Scan on large l2  (cost=0.00..14425.00 rows=1000000 width=8) (never executed)
 Planning Time: 0.299 ms
 Execution Time: 93.689 ms

(10 rows)


With patch:
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=1000.00..20863771.84 rows=1667083350 width=16) (actual time=290.255..290.323 rows=0 loops=1)
   ->  Seq Scan on large l2  (cost=0.00..14425.50 rows=1000050 width=8) (actual time=0.104..94.037 rows=1000000 loops=1)
   ->  Materialize  (cost=1000.00..10808.63 rows=1667 width=8) (actual time=0.000..0.000 rows=0 loops=1000000)
         ->  Gather  (cost=1000.00..10800.29 rows=1667 width=8) (actual time=79.472..79.539 rows=0 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Parallel Seq Scan on large l1  (cost=0.00..9633.59 rows=695 width=8) (actual time=75.337..75.338 rows=0 loops=3)
                     Filter: ((a IS NULL) AND (a < 100))
                     Rows Removed by Filter: 333333
 Planning Time: 0.721 ms
 Execution Time: 290.425 ms

(11 rows)

As I remember, it could fix this one:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2
   ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
 Planning Time: 0.954 ms
 Execution Time: 92.309 ms
(10 rows)

Do you think it's worth trying to apply the fourth method now?
As far as I remember, here we will face the problem of estimating multiple conditions, and the fourth approach has a chance of avoiding it.

I couldn't come up with such a case. I found only one that I also found interesting, but so far I haven't been able to figure out well enough what influenced the cardinality estimate and, accordingly, the choice of a better plan.

CREATE TABLE large (id INT, a INT);
INSERT INTO large SELECT i, 1 FROM generate_series(1,50000) s(i);
CREATE TABLE small (id INT, b INT);
INSERT INTO small SELECT i, 1 FROM generate_series(1,25000) s(i);
INSERT INTO large SELECT i+50000, i+2 FROM generate_series(1,50000) s(i);
INSERT INTO small SELECT i+25000, i+2 FROM generate_series(1,25000) s(i);

explain analyze SELECT * FROM large LEFT JOIN small ON (large.id = small.id)
          WHERE ((large.a=1) OR (small.b=1));

With patch:

                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1347.00..3911.91 rows=99775 width=16) (actual time=36.864..82.634 rows=50000 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: ((large.a = 1) OR (small.b = 1))
   Rows Removed by Filter: 50000
   ->  Seq Scan on large  (cost=0.00..1440.75 rows=99775 width=8) (actual time=0.034..12.536 rows=100000 loops=1)
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=36.752..36.754 rows=50000 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2466kB
         ->  Seq Scan on small  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.028..13.337 rows=50000 loops=1)
 Planning Time: 2.363 ms
 Execution Time: 84.790 ms
(10 rows)

original:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Right Join  (cost=14400.45..516963.33 rows=250528 width=16) (actual time=85.498..126.799 rows=50000 loops=1)
   Merge Cond: (small.id = large.id)
   Filter: ((large.a = 1) OR (small.b = 1))
   Rows Removed by Filter: 50000
   ->  Sort  (cost=4640.80..4766.23 rows=50172 width=8) (actual time=41.538..44.204 rows=50000 loops=1)
         Sort Key: small.id
         Sort Method: quicksort  Memory: 3710kB
         ->  Seq Scan on small  (cost=0.00..723.72 rows=50172 width=8) (actual time=0.068..15.182 rows=50000 loops=1)
   ->  Sort  (cost=9759.65..10009.95 rows=100118 width=8) (actual time=43.939..53.697 rows=100000 loops=1)
         Sort Key: large.id
         Sort Method: external sort  Disk: 2160kB
         ->  Seq Scan on large  (cost=0.00..1444.18 rows=100118 width=8) (actual time=0.046..10.378 rows=100000 loops=1)
 Planning Time: 0.406 ms
 Execution Time: 130.109 ms
(14 rows)

If you disable merge join with the original code, the plans coincide and the estimation errors are roughly the same in size; it's just
that in one case the cardinality is overestimated and in the other it is underestimated.

                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1347.00..3915.00 rows=75082 width=16) (actual time=43.625..68.901 rows=50000 loops=1)
   Hash Cond: (large.id = small.id)
   Filter: ((large.a = 1) OR (small.b = 1))
   Rows Removed by Filter: 50000
   ->  Seq Scan on large  (cost=0.00..1443.00 rows=100000 width=8) (actual time=0.008..9.895 rows=100000 loops=1)
   ->  Hash  (cost=722.00..722.00 rows=50000 width=8) (actual time=22.546..22.548 rows=50000 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2466kB
         ->  Seq Scan on small  (cost=0.00..722.00 rows=50000 width=8) (actual time=0.006..7.218 rows=50000 loops=1)
 Planning Time: 0.239 ms
 Execution Time: 70.860 ms
(10 rows)

-- 
Regards,
Alena Rybakina
Postgres Professional
Attachments

Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From
Alena Rybakina
Date:
Hi,

I'm still working on it, but unfortunately I haven't had enough time to get
it into a state that could be shown.
Now I am trying to sort out the problems I pointed out in the previous
message.

-- 
Regards,
Alena Rybakina
Postgres Professional