Re: Memory-Bounded Hash Aggregation

From: Melanie Plageman
Subject: Re: Memory-Bounded Hash Aggregation
Date:
Msg-id: CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
In reply to: Re: Memory-Bounded Hash Aggregation  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Memory-Bounded Hash Aggregation  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List: pgsql-hackers

On Wed, Jan 8, 2020 at 2:38 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> This makes the assumption that all Aggrefs or GroupingFuncs are at the
> top of the TargetEntry. That's not true, e.g.:
>
> select 0+sum(a) from foo group by b;
>
> I think find_aggregated_cols() and find_unaggregated_cols() should be
> merged into one function that scans the targetlist once, and returns two
> Bitmapsets. They're always used together, anyway.


So, I've attached a patch that does what Heikki recommended and gets
both aggregated and unaggregated columns in two different bitmapsets.
I think it works for more cases than the other patch.
I'm not sure it is the ideal interface, but since there aren't many
consumers, it's hard to say.
Also, it needs some formatting/improved naming/etc.
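The single-pass idea could be sketched roughly like this. It's a toy,
self-contained simulation: the Node/NodeTag types and the uint64_t
bitmaps stand in for PostgreSQL's Var, Aggref, and Bitmapset, and
collect_cols is a hypothetical name, not the real planner API. The
point is just that one walk classifies each column by whether it was
seen beneath an aggregate:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for PostgreSQL's Var/Aggref/OpExpr nodes. */
typedef enum { T_Var, T_Aggref, T_OpExpr } NodeTag;

typedef struct Node
{
	NodeTag		tag;
	int			varattno;	/* column number, for T_Var */
	struct Node *args[2];	/* children; unused slots are NULL */
} Node;

/*
 * Single pass over an expression tree: record each Var in one of two
 * bitmaps depending on whether we are currently below an Aggref,
 * mirroring the proposal to merge find_aggregated_cols() and
 * find_unaggregated_cols() into one walker returning two Bitmapsets.
 */
static void
collect_cols(Node *node, int in_agg,
			 uint64_t *aggregated, uint64_t *unaggregated)
{
	if (node == NULL)
		return;
	if (node->tag == T_Var)
	{
		if (in_agg)
			*aggregated |= (1ULL << node->varattno);
		else
			*unaggregated |= (1ULL << node->varattno);
		return;
	}
	if (node->tag == T_Aggref)
		in_agg = 1;				/* everything below is aggregated */
	for (int i = 0; i < 2; i++)
		collect_cols(node->args[i], in_agg, aggregated, unaggregated);
}
```

For an expression shaped like sum(a) + b, a single walk would put a in
the aggregated set and b in the unaggregated set.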

Per Jeff's comment in [1] I started looking into using the scanCols
patch from the thread on extracting scanCols from PlannerInfo [2] to
get the aggregated and unaggregated columns for this patch.

Since we only make one bitmap for scanCols containing all of the
columns that need to be scanned, there is no context about where the
columns came from in the query.
That is, once the bit is set in the bitmapset, we have no way of
knowing whether that column was needed for aggregation or was filtered
out immediately.

We could solve this by creating multiple bitmaps at the time that we
create the scanCols field -- one for aggregated columns, one for
unaggregated columns, and, potentially more if useful to other
consumers.
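One shape such a split might take, sketched with toy types (the
ScanCols struct, ColContext enum, and function names here are
hypothetical illustrations, not the actual scanCols patch), is a small
array of bitmaps keyed by usage context, with the plain union still
available for consumers that only care about which columns to scan:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical: instead of one scanCols bitmap, keep one bitmap per
 * usage context, so a consumer like hashagg spilling can distinguish
 * aggregated from unaggregated columns while scan code still gets the
 * union.  uint64_t stands in for PostgreSQL's Bitmapset.
 */
typedef enum
{
	COL_AGGREGATED,
	COL_UNAGGREGATED,
	NUM_COL_CONTEXTS
} ColContext;

typedef struct ScanCols
{
	uint64_t	cols[NUM_COL_CONTEXTS];
} ScanCols;

/* Record that a column is referenced in the given context. */
static void
mark_col(ScanCols *sc, ColContext ctx, int attno)
{
	sc->cols[ctx] |= (1ULL << attno);
}

/* Union of all contexts: the equivalent of today's single scanCols. */
static uint64_t
all_scan_cols(const ScanCols *sc)
{
	uint64_t	all = 0;

	for (int i = 0; i < NUM_COL_CONTEXTS; i++)
		all |= sc->cols[i];
	return all;
}
```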

The initial problem with this is that we extract scanCols from the
PlannerInfo->simple_rel_array and PlannerInfo->simple_rte_array.
If we wanted more context about where those columns were from in the
query, we would have to either change how we construct the scanCols or
construct them early and add to the bitmap when adding columns to the
simple_rel_array and simple_rte_array (which, I suppose, is the same
thing as changing how we construct scanCols).

This might decentralize the code for the benefit of one consumer.
Also, looping through the simple_rel_array and simple_rte_array a
couple of times per query seems like it would add negligible overhead.
I'm more hesitant to add code that would most likely involve a walker
in the codepath everybody uses, if only agg will leverage the two
distinct bitmapsets.

Overall, I think it seems like a good idea to leverage scanCols for
determining what columns hashagg needs to spill, but I can't think of
a way of doing it that doesn't seem bad. scanCols are currently just
that -- columns that will need to be scanned.

[1] https://www.postgresql.org/message-id/e5566f7def33a9e9fdff337cca32d07155d7b635.camel%40j-davis.com
[2] https://www.postgresql.org/message-id/flat/CAAKRu_Yj%3DQ_ZxiGX%2BpgstNWMbUJApEJX-imvAEwryCk5SLUebg%40mail.gmail.com

--
Melanie Plageman
Attachments
