Discussion: row estimate for partial index


row estimate for partial index

From: Harmen
Date:
Hi,

I'm trying to figure out why a certain query doesn't use a partial index.

The context:
The table has about 100M rows with "contacts". Every contact belongs to an
"org", and can be marked "deleted". Since we're generally only interested in
non-deleted contacts we have a (partial) index on the table:

    "contact_organization_id" btree (org_id, id) WHERE deleted IS NULL

for queries such as: "select * from contacts where org_id=123 and deleted is null order by id"

Works well enough. However, we now have an org_id that has > 10% of the rows,
but only a handful of rows where "deleted is null" matches (so the org has a lot
of "deleted" contacts). The planner doesn't like this and falls back to a
full table scan for the above query.



I've added a dedicated index just for that org_id, to see if that helps:

    "org123" btree (id) WHERE deleted IS NULL AND org_id = 123

The planner seems to use it now, however the row estimate is way off:

    my_db=> explain SELECT c.id, COALESCE(c.first_name, '') FROM contacts AS c
            WHERE c.organization_id = 123 AND c.deleted IS NULL ORDER BY id ASC;
                                                 QUERY PLAN                                             
    ────────────────────────────────────────────────────────────────────────────────────────────────────
     Gather Merge  (cost=1761204.72..1883324.98 rows=1046672 width=36)
       Workers Planned: 2
       ->  Sort  (cost=1760204.70..1761513.04 rows=523336 width=36)
             Sort Key: id
             ->  Parallel Bitmap Heap Scan on contact c  (cost=318.13..1696183.14 rows=523336 width=36)
                   Recheck Cond: ((deleted IS NULL) AND (organization_id = 8448))
                   ->  Bitmap Index Scan on org123  (cost=0.00..4.13 rows=1256006 width=0)
    (7 rows)

    Time: 3.337 ms

That "1256006" estimate from the org123 index is wrong. There are 7 contacts in
that org:
               ->  Bitmap Index Scan on org8448  (cost=0.00..4.13 rows=1256006 width=0) (actual time=0.841..0.841 rows=7 loops=1)


Is this expected? Is there a way to improve this?

Thanks!
Harmen



Re: row estimate for partial index

From: Tom Lane
Date:
Harmen <harmen@lijzij.de> writes:
> Works well enough. However, we now have an org_id which has > 10% of the rows,
> but only a handful rows where "deleted is null" matches (so the org has a lot
> of "deleted" contacts). The planner doesn't like this and it falls back to a
> full table scan for the above query.

> I've added a dedicated index just for that org_id, to see if that helps:
>     "org123" btree (id) WHERE deleted IS NULL AND org_id = 123
> The planner seems to use it now, however the row estimate is way off:

Yeah, so that indicates that it isn't producing a good selectivity
estimate for the combination of those two conditions: it will assume
the org_id and deleted columns are independent, which per your statements
they are not.
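For intuition, the independence assumption multiplies the per-condition
selectivities. The numbers below are illustrative only (not taken from any
real catalog), but they show how the estimate blows up when the two
conditions are strongly correlated:

```python
# Under the attribute-independence assumption, the planner multiplies
# the selectivities of the two WHERE conditions. All numbers here are
# made up for illustration; the true matching count in the thread is ~7.
total_rows = 100_000_000      # ~100M contacts
sel_org = 0.10                # org_id = 123 matches > 10% of rows
sel_not_deleted = 0.12        # table-wide fraction with deleted IS NULL

# Independent estimate: around a million rows, vs. a handful in reality.
independent_estimate = total_rows * sel_org * sel_not_deleted
print(f"estimated rows: {independent_estimate:,.0f}")  # estimated rows: 1,200,000
```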

If you are running a reasonably recent PG version you should be able to
fix that by setting up "extended statistics" on that pair of columns:

https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED
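A minimal sketch of what that could look like here (the statistics object
name is arbitrary, and the table/column names are taken from the query
above; a fresh ANALYZE is needed before the planner sees the new stats):

```sql
-- Collect extended statistics on the correlated column pair,
-- then re-ANALYZE so the planner can use them.
CREATE STATISTICS contacts_org_deleted ON org_id, deleted FROM contacts;
ANALYZE contacts;
```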

(I might be wrong, but I think that will help even when one of
the troublesome conditions is a null-check.  If it doesn't, then
we have something to improve there ...)

            regards, tom lane



Re: row estimate for partial index

From: Harmen
Date:
On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> Harmen <harmen@lijzij.de> writes:
> > Works well enough. However, we now have an org_id which has > 10% of the rows,
> > but only a handful rows where "deleted is null" matches (so the org has a lot
> > of "deleted" contacts). The planner doesn't like this and it falls back to a
> > full table scan for the above query.
> 
> > I've added a dedicated index just for that org_id, to see if that helps:
> >     "org123" btree (id) WHERE deleted IS NULL AND org_id = 123
> > The planner seems to use it now, however the row estimate is way off:
> 
> Yeah, so that indicates that it isn't producing a good selectivity
> estimate for the combination of those two conditions: it will assume
> the org_id and deleted columns are independent, which per your statements
> they are not.
> 
> If you are running a reasonably recent PG version you should be able to
> fix that by setting up "extended statistics" on that pair of columns:
> 
> https://www.postgresql.org/docs/current/planner-stats.html#PLANNER-STATS-EXTENDED
> 
> (I might be wrong, but I think that will help even when one of
> the troublesome conditions is a null-check.  If it doesn't, then
> we have something to improve there ...)

Thanks for your explanation, Tom.
I've set up a local test scenario, where I then add a "dependencies" stat, but
that doesn't give a better plan, unfortunately.


This is my test table (I use a boolean field for "deleted" to keep this test
case as simple as possible. In my real case this is a "timestamptz null"
field):

  DROP table if exists contactsbool;
  CREATE table contactsbool (id int not null, org_id int not null, deleted boolean not null, firstname text);
  CREATE index contactsbool_orgs on contactsbool (org_id, id) where not deleted;


Testdata has a very low number of "orgs", and one org has almost only deleted
contacts:

  WITH ids as (select * from generate_series(0, 10000000))
  insert into contactsbool
  select ids.generate_series, mod(ids.generate_series,7), false, 'hello world' from ids;
  UPDATE contactsbool set deleted = true where id > 100 and org_id = 5;
  ANALYZE contactsbool;


Now the new stats:

  CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
  CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;
  ANALYZE contactsbool;


harmen=> explain (analyze) select id, firstname from contactsbool where org_id = 5 and not deleted order by id;
                                                               QUERY PLAN
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 Gather Merge  (cost=181983.91..299104.42 rows=1003820 width=16) (actual time=448.244..454.770 rows=14 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Sort  (cost=180983.88..182238.66 rows=501910 width=16) (actual time=413.761..413.762 rows=5 loops=3)
         Sort Key: id
         Sort Method: quicksort  Memory: 25kB
         Worker 0:  Sort Method: quicksort  Memory: 25kB
         Worker 1:  Sort Method: quicksort  Memory: 25kB
         ->  Parallel Seq Scan on contactsbool  (cost=0.00..124881.86 rows=501910 width=16) (actual time=267.318..413.673 rows=5 loops=3)
               Filter: ((NOT deleted) AND (org_id = 5))
               Rows Removed by Filter: 3333329
 Planning Time: 0.565 ms
 JIT:
   Functions: 12
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 2.444 ms, Inlining 0.000 ms, Optimization 1.163 ms, Emission 13.288 ms, Total 16.895 ms
 Execution Time: 456.498 ms
(17 rows)


The "rows=501910" estimate is not what I expect; I want the plan to use
the contactsbool_orgs index.
(If I hard-delete the "deleted" contacts instead, everything works perfectly
for all orgs.)

Any ideas?
Thanks again,
Harmen



Re: row estimate for partial index

From: Tom Lane
Date:
Harmen <harmen@lijzij.de> writes:
> On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
>> If you are running a reasonably recent PG version you should be able to
>> fix that by setting up "extended statistics" on that pair of columns:

>   CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
>   CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;

1. ndistinct is not the correct stats type for this problem.
(I think dependencies is, but generally speaking, it's not worth
trying to be smarter than the system about which ones you need.
Just create 'em all.)

2. Per the CREATE STATISTICS man page, the order of the columns is
not significant, so you're just doubling the amount of work for
ANALYZE without gaining anything.

I think you will find that

CREATE STATISTICS stats1 ON deleted, org_id FROM contactsbool;

is enough to fix this.  It improved the estimate for me in
v14 and HEAD, anyway.

            regards, tom lane



Re: row estimate for partial index

From: Harmen
Date:
On Mon, Jan 16, 2023 at 09:59:38AM -0500, Tom Lane wrote:
> Harmen <harmen@lijzij.de> writes:
> > On Sat, Jan 14, 2023 at 11:23:07AM -0500, Tom Lane wrote:
> >> If you are running a reasonably recent PG version you should be able to
> >> fix that by setting up "extended statistics" on that pair of columns:
> 
> >   CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contactsbool;
> >   CREATE STATISTICS dist4b (ndistinct) ON org_id, deleted FROM contactsbool;

Hello Tom,

thanks again for your help, it's much appreciated.

> 1. ndistinct is not the correct stats type for this problem.
> (I think dependencies is, but generally speaking, it's not worth
> trying to be smarter than the system about which ones you need.
> Just create 'em all.)

I don't understand the "dependencies" stat here. The documentation has as
examples zip codes with city names, and month/day columns. Those I follow. But
in this case I generally select a single "org_id", which can have an arbitrary
number of "deleted" rows. I don't understand how having the org_id available
gives a better prediction than taking the plain ratio of all "deleted" rows in
the whole table.
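For intuition: the documentation describes the functional-dependency
correction as blending the dependent and independent estimates by the
measured degree of dependency. The sketch below is a simplified model of
that formula, not the planner's actual code:

```python
# Simplified model of PostgreSQL's functional-dependency correction
# for "a AND b" with dependency degree d of a -> b (per the docs on
# multivariate statistics); not the planner's exact implementation.
def dependent_selectivity(sel_a: float, sel_b: float, d: float) -> float:
    # d = 0.0: plain independence, selectivities multiply.
    # d = 1.0: b is fully determined by a, so the b clause adds
    #          no extra filtering beyond the a clause.
    return sel_a * (d + (1.0 - d) * sel_b)

# No dependency measured: same as the independence assumption.
assert dependent_selectivity(0.10, 0.12, 0.0) == 0.10 * 0.12
# Full dependency: only the org_id clause's selectivity counts.
assert dependent_selectivity(0.10, 0.12, 1.0) == 0.10
```

Since almost all rows of the problematic org share deleted = true, a high
measured degree pulls the combined estimate toward the org_id-only estimate,
which is at least in the right direction, though still not the handful of
rows that actually match.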


> 2. Per the CREATE STATISTICS man page, the order of the columns is
> not significant, so you're just doubling the amount of work for
> ANALYZE without gaining anything.

Noted. Are these statistics expensive? Or do you expect them to be basically
noise?


> I think you will find that
> 
> CREATE STATISTICS stats1 ON deleted, org_id FROM contactsbool;
> 
> is enough to fix this.  It improved the estimate for me in
> v14 and HEAD, anyway.

It does indeed for me as well! I'm not 100% sure whether that's just because
the test data is too simple, but I'll keep testing.


Unfortunately, as you already remarked might be the case in your original
reply, I don't get the same good results if I change the "deleted" column from
a "boolean not null" to a "timestamptz null".

Then it's back to (pg15.1):

         ->  Parallel Seq Scan on contacts  (cost=0.00..126284.02 rows=513913 width=16) (actual time=254.677..393.448 rows=5 loops=3)
               Filter: ((deleted IS NULL) AND (org_id = 5))
               Rows Removed by Filter: 3333329


Test table in case that's convenient to have:

  DROP table if exists contacts;
  CREATE table contacts (id int not null, org_id int not null, deleted timestamptz null, firstname text);
  CREATE index contacts_orgs on contacts (org_id, id) where deleted is null;

  WITH ids as (select * from generate_series(0, 10000000))
  insert into contacts
  select ids.generate_series, mod(ids.generate_series,7), null, 'hello world' from ids;
  UPDATE contacts set deleted = now() where id > 100 and org_id = 5;

  CREATE STATISTICS dist4 (ndistinct) ON deleted, org_id FROM contacts;
  CREATE STATISTICS dist5 (dependencies) ON deleted, org_id FROM contacts;
  ANALYZE contacts;

  explain (analyze) select id, firstname from contacts where org_id = 5 and deleted is null order by id;
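One way to see whether the timestamptz case is a statistics-collection
problem or a planner-usage problem is to inspect what ANALYZE actually
stored, via the pg_stats_ext view (column names as documented for that
view):

```sql
-- Inspect the collected extended statistics; an empty "dependencies"
-- value would mean the planner has nothing extra to work with for
-- the (org_id, deleted) pair.
SELECT statistics_name, attnames, kinds, dependencies
FROM pg_stats_ext
WHERE tablename = 'contacts';
```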


Again, this is only a problem when a single org_id takes up such a large
percentage of the table that the planner switches to table scans, which can
only ever happen for a few org_ids. On production I've added an index with the
problematic org_id in the "where" part, which avoids the problem for now
(milliseconds vs minutes), but it's not a great fix.


Thanks!
Harmen