Re: WIP: BRIN multi-range indexes

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: WIP: BRIN multi-range indexes
Дата
Msg-id CAFBsxsHswPBXQvvW87uSR59MkPNhEB5Txe4sFjVBfE9TR4jrGw@mail.gmail.com
обсуждение исходный текст
Ответы Re: WIP: BRIN multi-range indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Re: WIP: BRIN multi-range indexes  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
Список pgsql-hackers
On Thu, Mar 11, 2021 at 12:26 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
> Hi,
>
> Here is an updated version of the patch series.
>
> It fixes the assert failure (just remove the multiplication from it) and
> adds a simple regression test that exercises this.
>
> Based on the discussion so far, I've decided to keep just the new
> signature of the consistent function. That's a bit simpler than having
> to support both 3 and 4 parameters, and it would not deal with the NULL
> changes anyway (mostly harmless code duplication, but still). I've also
> realized the API documentation in SGML needs updating.
>
> At this point, I think 0001-0006 parts are mostly committable.

I think so. I've marked it RFC for this six.

> As for the remaining two parts, the one dealing with correlation may not
> be strictly necessary, but not having it (or something like it) may
> result in not picking the BRIN index in some cases.
>
> But maybe it's not a major problem. I tried the example from [1] but it
> no longer triggers the issue for me - I'm not entirely sure why, but the
> costing changed for some reason. It used to look like this:

> ...

> The index scan cost is about the same, but the heap scan is about half
> the cost. The row estimates are a bit different, perhaps that's related.
> The seqscan cost (169248) and duration (~500ms) is still about the same,
> but so now we still pick the bitmap heap scan.

With only 0001-0006, I get a parallel bitmap scan in both cases:

                                                            QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=6542.42..52779.35 rows=10 width=4) (actual time=3.283..22.308 rows=10 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on t0  (cost=5542.42..51778.35 rows=4 width=4) (actual time=3.434..17.257 rows=3 loops=3)
         Recheck Cond: (a = 10000)
         Rows Removed by Index Recheck: 83165
         Heap Blocks: lossy=421
         ->  Bitmap Index Scan on t0_a_idx  (cost=0.00..5542.42 rows=381682 width=0) (actual time=2.996..2.996 rows=11040 loops=1)
               Index Cond: (a = 10000)
 Planning Time: 0.088 ms
 Execution Time: 22.341 ms
(11 rows)

> Not sure we can rely on
> this, though. Would be quite sad if we had new improved opclasses but
> the planner often decided not to use them.

Yeah, I'm not sure what to do here. It might be okay to leave it out now and study it further as a PG14 open item or PG15 improvement.

> I had an idea of tweaking choose_bitmap_and to consider both the cost
> and selectivity (similarly to how add_path considers statup/total cost),
> and that did indeed resolve this particular case. This is what the 0008
> part does.
>
> But it may also have negative consequence, as demonstrated by the
> reproducer2.sql script. So maybe the logic would need to be more
> complicated. Or maybe there's no reliable solution, considering how
> tricky/unreliable BRIN estimates are.

Ok, so with 0008 in reproducer2, it chooses the more selective path, even though it has a higher total cost:

0001-0007:

                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t2  (cost=29.03..24032.28 rows=1 width=8) (actual time=0.498..1.755 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 7167
   Heap Blocks: lossy=128
   ->  Bitmap Index Scan on idx_2  (cost=0.00..29.03 rows=7163 width=0) (actual time=0.278..0.278 rows=1280 loops=1)
         Index Cond: (a = 1000)
 Planning Time: 0.148 ms
 Execution Time: 1.774 ms
(8 rows)

DROP INDEX
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual time=9.695..9.708 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0) (actual time=9.675..9.675 rows=40 loops=1)
         Index Cond: (a = 1000)
 Planning Time: 0.110 ms
 Execution Time: 9.727 ms
(8 rows)

and with 0008:

                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual time=8.540..8.577 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0) (actual time=8.507..8.508 rows=40 loops=1)
         Index Cond: (a = 1000)
 Planning Time: 0.175 ms
 Execution Time: 8.601 ms
(8 rows)

DROP INDEX
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t2  (cost=656.00..1531.00 rows=1 width=8) (actual time=9.712..9.725 rows=1 loops=1)
   Recheck Cond: (a = 1000)
   Rows Removed by Index Recheck: 223
   Heap Blocks: lossy=4
   ->  Bitmap Index Scan on idx_1  (cost=0.00..656.00 rows=224 width=0) (actual time=9.691..9.692 rows=40 loops=1)
         Index Cond: (a = 1000)
 Planning Time: 0.104 ms
 Execution Time: 9.746 ms
(8 rows)


> That being said, I don't think this is something we need to solve here,
> and it may not actually be an issue at all. For this to happen there
> need to be a terrible index on the same attribute (like the minmax index
> in the example above). But why keeping such index anyway? Dropping it
> would make the issue go away. If we have two indexes that both perform
> reasonably (say, bloom and minmax-multi), the consequences are not that
> bad. so this is interesting, but probably fine.

Yeah, I suspect this is unlikely to be a problem in practice.

--
I've run a similar test based on an earlier example from some months ago (attached).

0001-0006:

At first regular BRIN is faster, and it will choose it if available:

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on iot  (cost=404.79..233352.86 rows=1252175 width=57) (actual time=2.115..346.351 rows=1252800 loops=1)
   Recheck Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
   Rows Removed by Index Recheck: 15936
   Heap Blocks: lossy=15104
   ->  Bitmap Index Scan on cd_multi  (cost=0.00..91.74 rows=1256702 width=0) (actual time=0.972..0.972 rows=151040 loops=1)
         Index Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
 Planning Time: 0.208 ms
 Execution Time: 412.549 ms
(8 rows)

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on iot  (cost=341.99..233335.81 rows=1256871 width=57) (actual time=1.244..170.962 rows=1252800 loops=1)
   Recheck Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
   Rows Removed by Index Recheck: 15936
   Heap Blocks: lossy=15104
   ->  Bitmap Index Scan on cd_single  (cost=0.00..27.78 rows=1267458 width=0) (actual time=0.406..0.406 rows=151040 loops=1)
         Index Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
 Planning Time: 0.197 ms
 Execution Time: 237.146 ms
(8 rows)


After delete, vacuum, and insert, BRIN multi is chosen over seq scan even though the correlation should be somewhat off (I didn't go further and try to find a case where seq scan is wrongly preferred):

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on iot  (cost=428.53..247273.68 rows=1135074 width=57) (actual time=1.798..252.494 rows=1128038 loops=1)
   Recheck Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
   Rows Removed by Index Recheck: 140698
   Heap Blocks: lossy=15104
   ->  Bitmap Index Scan on cd_multi  (cost=0.00..144.76 rows=1598833 width=0) (actual time=0.940..0.941 rows=151040 loops=1)
         Index Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
 Planning Time: 0.152 ms
 Execution Time: 311.999 ms
(8 rows)


Add regular BRIN index, and it will get chosen, making the scan slower:

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on iot  (cost=308.20..246966.38 rows=1118304 width=57) (actual time=5.685..1277.854 rows=1128038 loops=1)
   Recheck Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
   Rows Removed by Index Recheck: 7548826
   Heap Blocks: lossy=103296
   ->  Bitmap Index Scan on cd_single  (cost=0.00..28.62 rows=1551397 width=0) (actual time=4.609..4.609 rows=1032960 loops=1)
         Index Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
 Planning Time: 0.211 ms
 Execution Time: 1338.685 ms
(8 rows)

Apply 0007 -- no difference:

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on iot  (cost=308.20..246966.38 rows=1118304 width=57) (actual time=6.988..1358.113 rows=1128038 loops=1)
   Recheck Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
   Rows Removed by Index Recheck: 7548826
   Heap Blocks: lossy=103296
   ->  Bitmap Index Scan on cd_single  (cost=0.00..28.62 rows=1551397 width=0) (actual time=5.528..5.528 rows=1032960 loops=1)
         Index Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
 Planning Time: 3.534 ms
 Execution Time: 1418.194 ms
(8 rows)

Apply 0008 -- now it chooses minmax-multi:

                                                                          QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on iot  (cost=422.94..245412.66 rows=1118304 width=57) (actual time=2.750..259.850 rows=1128038 loops=1)
   Recheck Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
   Rows Removed by Index Recheck: 140698
   Heap Blocks: lossy=15104
   ->  Bitmap Index Scan on cd_multi  (cost=0.00..143.36 rows=1128092 width=0) (actual time=1.609..1.609 rows=151040 loops=1)
         Index Cond: ((create_dt >= '2020-02-01 00:00:00-04'::timestamp with time zone) AND (create_dt < '2020-03-01 00:00:00-04'::timestamp with time zone))
 Planning Time: 1.421 ms
 Execution Time: 319.891 ms
(8 rows)


So, 0007 doesn't make a difference in this case, but 0008 does.

--
John Naylor
EDB: http://www.enterprisedb.com

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Dean Rasheed
Дата:
Сообщение: Re: PoC/WIP: Extended statistics on expressions
Следующее
От: Tomas Vondra
Дата:
Сообщение: Re: PoC/WIP: Extended statistics on expressions