Discussion: Hash Joins vs. Bloom Filters / take 2


Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra
Date:
Hi,

In 2015/2016 I've been exploring if we could improve hash joins by
leveraging bloom filters [1], and I was reminded about this idea in a
thread about amcheck [2]. I also see that bloom filters were briefly
mentioned in the thread about parallel hash [3].

So I've decided to revive the old patch, rebase it to current master,
and see if we can resolve the issues that killed it in 2016.

The issues are essentially about predicting when the bloom filter can
improve anything, which is more a question of the hash join selectivity
than the bloom filter parameters.

Consider a join on tables with FK on the join condition. That is,
something like this:

    CREATE TABLE dim (id serial primary key, ...);
    CREATE TABLE fact (dim_id int references dim(id), ...);

Then if you do a join

    SELECT * FROM fact JOIN dim ON (id = dim_id);

the bloom filter can't really help, because all the dim_id values are
guaranteed to be in the hash table. So it can only really help with
queries like this:

    SELECT * FROM fact JOIN dim ON (id = dim_id) WHERE (dim.x = 1000);

where the additional conditions on dim make the hash table incomplete
in some sense (i.e. the bloom filter will return false for some outer
rows, saving quite a bit of expensive work - a lookup in a large hash
table, or spilling the tuple to disk).

This is not the same as "false positive rate" which is a feature of the
bloom filter itself, and can be measured by simply counting bits set in
the filter. That is important too, of course, but simple to determine.

The bloom filter "selectivity" is more a feature of the join, i.e. what
fraction of the values looked up in the bloom filter are expected to be
rejected. If many are rejected, the bloom filter helps. If few, the
bloom filter is a waste of time.

After thinking about this a bit more, I think this is pretty much what
we do during join cardinality estimation - we estimate how many of the
rows in "fact" have a matching row in "dim". I do think we can reuse
that to decide if to use a bloom filter or not.

Of course, the decision may need to be more elaborate (and perhaps
formulated as part of costing), but the filter selectivity is the
crucial piece. The attached patch does not do that yet, though.

The attached patch:

1) Only works in the non-parallel case for now. Fixing this should not
be a big deal once the costing/planning gets addressed.


2) Adds details about the bloom filter to EXPLAIN ANALYZE output, with
various important metrics (size of the filter, number of hash functions,
lookups/matches, fill factor).


3) Removes the HLL counter. I came to the conclusion that using HLL to
size the bloom filter makes very little sense, because there are
essentially three cases:

a) no batching (hash table fits into work_mem)

We can easily walk the hash table and compute a "good enough" ndistinct
estimate by counting occupied buckets (or just use the number of tuples
in the hash table, assuming the values are distinct).
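
A rough sketch of that in C (illustrative only -- the struct below is a
made-up stand-in, not the actual hash table layout used by the patch):

    #include <math.h>
    #include <stddef.h>

    /*
     * Toy stand-in for a single-batch hash table: walk the buckets and
     * derive a "good enough" ndistinct estimate from how many of them
     * are occupied.
     */
    typedef struct ToyHashTable
    {
        int     nbuckets;   /* number of buckets */
        void  **buckets;    /* bucket heads, NULL when empty */
        double  ntuples;    /* tuples inserted so far */
    } ToyHashTable;

    static double
    toy_estimate_ndistinct(ToyHashTable *ht)
    {
        int     occupied = 0;

        for (int i = 0; i < ht->nbuckets; i++)
            if (ht->buckets[i] != NULL)
                occupied++;

        /*
         * Hashing n distinct keys into B buckets occupies about
         * B * (1 - (1 - 1/B)^n) of them on average; inverting that
         * gives a rough ndistinct. If every bucket is occupied, fall
         * back to the tuple count (i.e. assume the keys are distinct).
         */
        if (occupied >= ht->nbuckets)
            return ht->ntuples;

        return log(1.0 - (double) occupied / ht->nbuckets) /
               log(1.0 - 1.0 / ht->nbuckets);
    }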

At this point, the patch does not build the bloom filter in single-batch
cases, unless forced to do that by setting force_hashjoin_bloom=true.
More about this later.


b) batching from the beginning

The HLL is useless, because we need to initialize the bloom filter
before actually seeing any values. Instead, the patch simply uses the
expected number of rows (assuming those are distinct).


c) starting in single-batch mode, switching to batches later

We could use HLL to estimate the number of distinct values in the
initial batch (when starting to batch), but it's unclear how that is
useful. This case means our estimates are off, and we don't really know
how many batches there will be. We could use some arbitrary multiple, I
guess.

What I decided to do in the patch instead is to size the bloom filter
to 1/8 of work_mem. That seems like a viable compromise, as it's
unlikely to increase the number of batches.


4) Initially, the patch was aimed at hashjoins with batches, on the
premise that it can save some of the serialize/deserialize costs. But as
mentioned in the discussion, bloom filters can be beneficial even in the
single-batch case, when three conditions are met:

a) join is selective (i.e. some rows don't have match in hash table)

b) hash table > CPU cache

c) bloom filter < CPU cache

We don't have a good way to determine the size of the CPU cache, but I
think we can use some crude arbitrary limits, e.g. require that the
hash table is larger than 8MB and the bloom filter is smaller than 4MB,
or something like that.
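
For illustration, such a crude rule might look roughly like this (the
size thresholds and the match-fraction cutoff are placeholders, not
values from the patch):

    #include <stdbool.h>

    /*
     * Purely illustrative heuristic: build a bloom filter only if the
     * hash table is likely to miss the CPU cache, the filter is likely
     * to fit in it, and enough outer rows are expected to be rejected.
     */
    static bool
    bloom_filter_looks_worthwhile(double hashtable_bytes,
                                  double bloom_bytes,
                                  double est_match_fraction)
    {
        return hashtable_bytes > 8 * 1024 * 1024 &&
               bloom_bytes < 4 * 1024 * 1024 &&
               est_match_fraction < 0.75;
    }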

Opinions?


regards


[1] https://www.postgresql.org/message-id/5670946E.8070705%402ndquadrant.com

[2]
https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

[3]
https://www.postgresql.org/message-id/9b9fd273-18e7-2b07-7aa1-4b00ab59b8d1%402ndquadrant.com

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Hash Joins vs. Bloom Filters / take 2

From: Peter Geoghegan
Date:
On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In 2015/2016 I've been exploring if we could improve hash joins by
> leveraging bloom filters [1], and I was reminded about this idea in a
> thread about amcheck [2]. I also see that bloom filters were briefly
> mentioned in the thread about parallel hash [3].
>
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

Cool.

> The issues are essentially about predicting when the bloom filter can
> improve anything, which is more a question of the hash join selectivity
> than the bloom filter parameters.

Absolutely. A Bloom filter could be considered an opportunistic thing.
The false positive rate that you estimate is going to be based on the
planner's rowcount estimate for the inner side of the join, which is
liable to be wrong anyway. It's important to be resilient to
misestimations, which Bloom filters are good at.

> This is not the same as "false positive rate" which is a feature of the
> bloom filter itself, and can be measured by simply counting bits set in
> the filter. That is important too, of course, but simple to determine.

Perhaps I'm being pedantic, but it's not exactly true that you can
measure the false positive rate by counting the bits. I do agree with
what I think you meant here, though, which is that the precise false
positive rate is not necessarily that important, while the selectivity
of the join is very important.

In general, hash joins work best when the join qual is highly
selective. This is not the same thing as having a small hash table, of
course.

> After thinking about this a bit more, I think this is pretty much what
> we do during join cardinality estimation - we estimate how many of the
> rows in "fact" have a matching row in "dim". I do think we can reuse
> that to decide if to use a bloom filter or not.
>
> Of course, the decision may need to be more elaborate (and perhaps
> formulated as part of costing), but the filter selectivity is the
> crucial piece. The attached patch does not do that yet, though.

I suspect that it could make sense to use a Bloom filter to summarize
the entire inner side of the join all at once, even when there are
multiple batches. I also suspect that this is particularly beneficial
with parallel hash joins, where IPC/synchronization overhead can be a
big problem.

> 4) Initially, the patch was aimed at hashjoins with batches, on the
> premise that it can save some of the serialize/deserialize costs. But as
> mentioned in the discussion, bloom filters can be beneficial even in the
> single-batch case, when three conditions are met:
>
> a) join is selective (i.e. some rows don't have match in hash table)
>
> b) hash table > CPU cache
>
> c) bloom filter < CPU cache

Seems plausible. CPU cache efficiency is also affected by how many
hash functions you end up using -- use too many, and it slows things
down.

> We don't have a good way to determine size of the CPU cache, but I think
> we can use some crude arbitrary limits. E.g. require that hash hash
> table is larger than 8MB and bloom filter is smaller than 4MB, or
> something like that.

FWIW, the logical way to size the Bloom filter is based on the number
of (distinct) values, not based on the projected size of the hash
table. The lossy nature of Bloom filters makes the average size of the
original, hashed elements irrelevant to a sizing calculation that
targets a certain final false positive rate. Making Bloom filter size
any fixed fraction of hash table size seems wrong to me for that
reason alone.

You should try to exploit the fact that a Bloom filter can summarize a
large set reasonably well with a very compact, simple representation.
A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
for cases where Bloom probes will save a lot of work, it probably
doesn't make all that much difference -- the hash join is still much
faster. If you're willing to accept a 10% false positive rate, then
you can summarize a set of 40 million distinct values with only
22.85MB of memory and 3 hash functions. I think that the smallest
possible hash table of 40 million elements would require much more
memory than 22.85MB -- certainly closer to 20x than to 8x. Even that is
pretty conservative. I bet there are plenty of hash joins where the
ratio could be 30x or more. Not to mention cases with multiple batches.
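
As a rough cross-check of those numbers, the standard sizing formulas
can be evaluated directly (illustrative C, not code from any patch):

    #include <math.h>
    #include <stdio.h>

    /*
     * Standard Bloom filter sizing: for n distinct elements and a
     * target false positive rate p, the optimal size is
     * m = -n * ln(p) / (ln 2)^2 bits, with k = (m / n) * ln 2 hash
     * functions.
     */
    int
    main(void)
    {
        double  n = 40e6;   /* distinct elements */
        double  p = 0.10;   /* target false positive rate */
        double  m = -n * log(p) / (log(2) * log(2));    /* bits */
        double  k = (m / n) * log(2);                   /* hash funcs */

        printf("bits per element: %.2f\n", m / n);      /* ~4.79 */
        printf("filter size: %.2f MB\n", m / 8 / 1024 / 1024);  /* ~22.85 */
        printf("hash functions: %.2f\n", k);            /* ~3.3 -> 3 */
        return 0;
    }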

-- 
Peter Geoghegan


Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire
Date:
On Tue, Feb 20, 2018 at 8:06 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.

I've worked a lot with bloom filters, and for large false positive
rates and large sets (multi-million entries), you get bloom filter
sizes of about 10 bits per distinct item.

That's efficient, but it's not magic. It can still happen that the
whole set can't fit in work_mem with an acceptable false positive
rate.


Re: Hash Joins vs. Bloom Filters / take 2

From: Peter Geoghegan
Date:
On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I've worked a lot with bloom filters, and for large false positive
> rates and large sets (multi-million entries), you get bloom filter
> sizes of about 10 bits per distinct item.

It's generally true that you need 9.6 bits per element to get a 1%
false positive rate. 1% could be considered much too low here.

Do we need to eliminate 99% of all hash join probes (that find nothing
to join on) to make this Bloom filter optimization worthwhile?
Personally, I doubt it.

> That's efficient, but it's not magic. It can still happen that the
> whole set can't fit in work_mem with an acceptable false positive
> rate.

A merge join is always going to be the better choice when extremely
memory constrained.

-- 
Peter Geoghegan


Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire
Date:
On Tue, Feb 20, 2018 at 8:23 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Feb 20, 2018 at 3:17 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I've worked a lot with bloom filters, and for large false positive
>> rates and large sets (multi-million entries), you get bloom filter
>> sizes of about 10 bits per distinct item.
>
> It's generally true that you need 9.6 bits per element to get a 1%
> false positive rate. 1% could be considered much too low here.
>
> Do we need to eliminate 99% of all hash join probes (that find nothing
> to join on) to make this Bloom filter optimization worthwhile?
> Personally, I doubt it.

Even for 90% it's about 4.6 bits per element.

What I'm saying is that you can't assume you can fit the whole thing
in work_mem. If it could be said that any set worth a hash join will
fit, you can build a work_mem-sized bloom filter and just accept that
if you exceed its capacity its filtering efficiency will drop
benignly. But IMO that can't be taken for granted, so at some point
you should just give up; the overhead of querying a low-quality BF
won't be worth it.

The HLL is good for that. You can keep adding to the BF until the HLL
tells you you're way past the capacity of a work_mem-sized BF, then
you free that memory and go on without it.

You might also want to consider scalable BFs. The smaller you can keep
the BF, the better chance you'll have of fitting it into the L3 cache
(you can fit quite a lot of entries into a decent-sized L3 cache).


Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra
Date:

On 02/21/2018 12:06 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 1:23 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> In 2015/2016 I've been exploring if we could improve hash joins by
>> leveraging bloom filters [1], and I was reminded about this idea in a
>> thread about amcheck [2]. I also see that bloom filters were briefly
>> mentioned in the thread about parallel hash [3].
>>
>> So I've decided to revive the old patch, rebase it to current master,
>> and see if we can resolve the issues that killed it in 2016.
> 
> Cool.
> 
>> The issues are essentially about predicting when the bloom filter can
>> improve anything, which is more a question of the hash join selectivity
>> than the bloom filter parameters.
> 
> Absolutely. A Bloom filter could be considered an opportunistic thing.
> The false positive rate that you estimate is going to be based on the
> planner's rowcount estimate for the inner side of the join, which is
> liable to be wrong anyway. It's important to be resilient to
> misestimations, which Bloom filters are good at.
> 

It's a bit more difficult than that, because the rowcount is the total
number of rows, which may not be the same as the number of distinct
values in the join key. But it does give an upper bound.

But yeah, that's how the bloom filter is sized in the patch.

>> This is not the same as "false positive rate" which is a feature
>> of the bloom filter itself, and can be measured by simply counting 
>> bits set in the filter. That is important too, of course, but 
>> simple to determine.
> 
> Perhaps I'm being pedantic, but it's not exactly true that you can 
> measure the false positive rate by counting the bits. I do agree with
> what I think you meant here, though, which is that the precise false
> positive rate is not necessarily that important, while the 
> selectivity of the join is very important.
> 

My reasoning is that given a bloom filter with K out of M bits set, the
probability of the bloom filter saying "true" for a random value is

    (K/M)^H

where H is the number of hash functions. I believe that's pretty much
the definition of false positive rate, but I have to admit I haven't
really checked the exact math.

The important takeaway here is that many bits set -> bad.
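
For reference, that estimate as a tiny function (illustrative only;
nset/nbits/nhashes correspond to K/M/H above):

    #include <math.h>

    /*
     * Probability that a random non-member value is accepted: all H
     * probed bits must happen to be set.
     */
    static double
    bloom_false_positive_estimate(double nbits, double nset, int nhashes)
    {
        return pow(nset / nbits, nhashes);
    }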

> In general, hash joins work best when the join qual is highly 
> selective. This is not the same thing as having a small hash table,
> of course.
> 
>> After thinking about this a bit more, I think this is pretty much
>> what we do during join cardinality estimation - we estimate how
>> many of the rows in "fact" have a matching row in "dim". I do think
>> we can reuse that to decide if to use a bloom filter or not.
>>
>> Of course, the decision may need to be more elaborate (and perhaps 
>> formulated as part of costing), but the filter selectivity is the 
>> crucial piece. The attached patch does not do that yet, though.
> 
> I suspect that it could make sense to use a Bloom filter to
> summarize the entire inner side of the join all at once, even when
> there are multiple batches. I also suspect that this is particularly
> beneficial with parallel hash joins, where IPC/synchronization
> overhead can be a big problem.
> 

But that's what the patch does, currently - the filter is built during
the initial pass through the data, and then used for all batches. This
comes from the original idea to throw away rows from the outer relation
(usually the larger one) that have no match in the hash table. Now we
stash them into a batch file, possibly shuffling them between the
batches if we happen to need to add more batches, only to throw them
away much later when we get to that batch.

Actually, now that I think about it - I think the patch should throw
the filter away after the initial pass over the outer relation,
because at that point we've used all the information in the filter.

I'm not sure it would make sense to then build smaller bloom filters for
individual batches, but maybe it would?

>> 4) Initially, the patch was aimed at hashjoins with batches, on the
>> premise that it can save some of the serialize/deserialize costs. But as
>> mentioned in the discussion, bloom filters can be beneficial even in the
>> single-batch case, when three conditions are met:
>>
>> a) join is selective (i.e. some rows don't have match in hash table)
>>
>> b) hash table > CPU cache
>>
>> c) bloom filter < CPU cache
> 
> Seems plausible. CPU cache efficiency is also affected by how many
> hash functions you end up using -- use too many, and it slows things
> down.
> 

Yeah, I admit those are rather crude rules.

>> We don't have a good way to determine size of the CPU cache, but I
>> think we can use some crude arbitrary limits. E.g. require that
>> hash hash table is larger than 8MB and bloom filter is smaller than
>> 4MB, or something like that.
> 
> FWIW, the logical way to size the Bloom filter is based on the number
> of (distinct) values, not based on the projected size of the hash
> table. The lossy nature of Bloom filters makes the average size of the
> original, hashed elements irrelevant to a sizing calculation that
> targets a certain final false positive rate. Making Bloom filter size
> any fixed fraction of hash table size seems wrong to me for that
> reason alone.
> 

Sure, and the patch does exactly that when we actually have a good
estimate (although, it's using expected number of rows of the inner
relation, not ndistinct).

The trouble is that when we start with a single batch and then find out
the estimate was wrong and we need to start batching, all bets are off.
At that point it seems reasonable to just say "Here is X MBs of RAM, do
what you can".

> You should try to exploit the fact that a Bloom filter can summarize a
> large set reasonably well with a very compact, simple representation.
> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
> for cases where Bloom probes will save a lot of work, it probably
> doesn't make all that much difference -- the hash join is still much
> faster. If you're willing to accept a 10% false positive rate, then
> you can summarize a set of 40 million distinct values with only
> 22.85MB of memory and 3 hash functions. I think that the smallest
> possible amount of memory that any hash table of 40 million elements
> would require a much greater amount of memory than 22.85MB --
> certainly closer to 20x than to 8x. Even that is pretty conservative.
> I bet there are plenty of hash joins were the ratio could be 30x or
> more. Not to mention cases with multiple batches.
> 

But the problem is that I don't know what the total size of the hash
table is, because we're building the bloom filter for all the batches at
once. And we don't know how many batches there will be - if we knew
that, we could estimate the number of distinct values and we could use
that to size the filter instead of doing this. (All of this only applies
to the state where we start with a single batch and then find out we
need to start batching, of course.)

In other words, I'm not claiming 1/8 of a hash table is a reasonable
size for bloom filter, but that 1/8 of work_mem seems like a reasonable
compromise for sizing the bloom filter. Perhaps there should be some
upper boundary for very large work_mem values, though.

Regarding false positive rate - I agree it may be more efficient to use
a smaller bloom filter with worse false positive rate, particularly if
the smaller one fits into a CPU cache and the larger one does not.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

From: Peter Geoghegan
Date:
On Tue, Feb 20, 2018 at 3:48 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Do we need to eliminate 99% of all hash join probes (that find nothing
>> to join on) to make this Bloom filter optimization worthwhile?
>> Personally, I doubt it.
>
> Even for 90% it's about 4.6 bits per element.

4.6 bits is vastly less than typical hash join hash table entry sizes,
which are often comprised of several attributes. Even the skinniest
possible hash table elements have HJTUPLE_OVERHEAD() overhead, as well
as MinimalTuple overhead. To say nothing of all the serialization and
synchronization overhead that you could also have.

Whatever precise per-element overhead you come up with for your Bloom
filter, the *ratio* of those two things is what makes using a Bloom
filter potentially very effective. It could end up being something
like 1:100, which is a huge win for a highly selective hash join that
still has a fairly large hash table (that cannot fit in L3
cache/work_mem).

> What I'm saying is that you can't assume you can fit the whole thing
> in work_mem. If it could be said that any set worth a hash join will
> fit, you can build a work_mem-sized bloom filter and just accept that
> if you exceed its capacity its filtering efficiency will drop
> benignly. But IMO that can't be taken for granted, so at some point
> you should just give up, the overhead of querying a low-quality BF
> won't be worth it.

It's probably true that we will need to bail out of using a Bloom
filter, and it is certainly true that the optimization won't always
work out. Still, once your Bloom filter proves useless, chances are
good that the plan isn't a very good one, with or without that extra
optimization. Even if the plan is truly the fastest possible plan, and
even if you avoid wasting extra effort on the Bloom filter, it will
still be very slow.

Join selectivity is what really matters with this optimization. Of
course the optimization might not work out, but you can say that about
almost anything that uses hashing -- that's why safety critical
realtime systems (e.g. avionics systems) often completely avoid using
hash tables. Bloom filters are fairly resilient to somewhat inaccurate
estimates, but nothing will save you from terrible estimates. I think
that it's worth risking a small, predictable regression when the join
turns out to not be selective, in order to get a potentially very
large improvement in performance when the planner estimates join
selectivity reasonably accurately (and it's a selective hash join that
won't simply have a small hash table to begin with).

-- 
Peter Geoghegan


Re: Hash Joins vs. Bloom Filters / take 2

From: Peter Geoghegan
Date:
On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>> I suspect that it could make sense to use a Bloom filter to
>> summarize the entire inner side of the join all at once, even when
>> there are multiple batches. I also suspect that this is particularly
>> beneficial with parallel hash joins, where IPC/synchronization
>> overhead can be a big problem.
>>
>
> But that's what the patch does, currently - the filter is built during
> the initial pass through the data, and then used for all batches.

I misunderstood. I would probably do something like double or triple
the original rows estimate instead, though. The estimate must be at
least slightly inaccurate when we get to this point, but I don't think
that that's a good enough reason to give up on the estimate
completely.

> Actually, now that I think about it - I think the patch should throw
> away the filter away after the initial pass over the outer relation,
> because at that point we've used all the information in the filter.

Makes sense.

> I'm not sure it would make sense to then build smaller bloom filters for
> individual batches, but maybe it would?

I doubt it.

> Yeah, I admit those are rather crude rules.

You have to start somewhere.

> The trouble is that when we start with a single batch and then find out
> the estimate was wrong and we need to start batching, all bets are off.
> At that point it seems reasonable to just say "Here is X MBs of RAM, do
> what you can".

As I said above, I wouldn't say all bets are off when this happens --
not at all. Estimates are likely to often be somewhat wrong. If
they're completely wrong, we can probably swallow the cost of giving
up on a Bloom filter relatively late.

As I said, X should not be a portion of work_mem, because that has
only a weak relationship to what really matters.

>> You should try to exploit the fact that a Bloom filter can summarize a
>> large set reasonably well with a very compact, simple representation.
>> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
>> for cases where Bloom probes will save a lot of work, it probably
>> doesn't make all that much difference -- the hash join is still much
>> faster.

> But the problem is that I don't know what is the total size of the hash
> table, because we're building the bloom filter for all the batches at
> once. And we don't know how many batches will be there - if we knew
> that, we could estimate the number of distinct values and we could use
> that to size the filter instead of doing this. (All of this only applies
> to the state where we start with a single batch and then find out we
> need to start batching, of course.)

I don't think that the number of batches should matter that much to
the costing/sizing/estimation logic, even if it's an interesting thing
to talk about when considering the upsides of a Bloom filter. My sense
is that we should focus on making sure that using a Bloom filter is
going to win, and not worry so much about whether that's going to be a
huge win or a modest win.

Suppose that you thought you were going to have a 10% false positive
rate with a 22.85MB Bloom filter for 40 million elements (my earlier
example). Further suppose that it turns out to be 80 million elements.
This means that you're going to get a false positive rate of 30%. This
could still be a big win, though, so it's not really a bad situation.
With 120 million elements, it's about 45%, but even then it could
still work out, especially because the hash join will be very slow
anyway. You also have to bear in mind that the 40 million estimate is
much more likely to be too high than too low, because you're assuming
distinct key values for the hash table.

You're taking a risk, in a sense, but a lot of things have to go wrong
for you to lose, and even then you can cut your losses before the
extra cost is too high.

Do you have a test query for this patch, that gives us some idea of the upside?

-- 
Peter Geoghegan


Re: Hash Joins vs. Bloom Filters / take 2

From: Thomas Munro
Date:
On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> In 2015/2016 I've been exploring if we could improve hash joins by
> leveraging bloom filters [1], and I was reminded about this idea in a
> thread about amcheck [2]. I also see that bloom filters were briefly
> mentioned in the thread about parallel hash [3].
>
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

Nice!

> Opinions?

I'm definitely following this and interested in helping in some way if
I can.  I have wondered about this subject and discussed it a bit with
Peter Geoghegan off-list.

Some assorted thoughts:

In the old thread, Peter pointed at a curious undergrad student
project from 2008[1] evaluating Bloom filters for hash joins in
PostgreSQL 8.3, inspired by a couple of older papers[2][3].  While
your patch uses a Bloom filter to short-circuit the regular bucket
probe in ExecHashJoinImpl(), these approach push the Bloom filter down
into the outer relation scan.  I suspect you're right about the fixed
sizing being a problem, but the general idea seems pretty interesting
to me and there seems to be no reason you couldn't make the filter
size dynamic as you have it and then share it via a parameter or
something.  But is there any point?

On the one hand, pushing down Bloom filters requires the hash value to
be computed by the lower scan, and then computed again if the tuple
survives the filter and makes it into the Hash Join node (unless there
is some way to attach it to the tuple...).  On the other hand,
throwing away tuples sooner can avoid more work, particularly in the
case of multi-joins.

Based on a hint from Jim Finnerty at PGCon 2017 and also some light
reading of ancient stuff on join pipelining, I've been wondering how
the potential effectiveness of Bloom filters is affected by the fact
that we prefer left-deep nested hash joins and for us build relation =
right/inner relation.  That policy means that we build hash tables for
all the joins up front, which means we could potentially push all
their Bloom filters down to the ultimate outer scan (or as far down as
possible depending on the qual) as I now find described in [2].  So:

    H1
    /\
   H2 u
   /\
  H3 t
  /\
 r  s

You might be able to push filters B1, B2, B3 constructed from H1, H2,
H3 all the way down to the scan of r and then probe them in order of
selectivity (a bit like order_qual_clauses does with any other kind of
qual).  (Because of the empty-outer optimisation's preliminary attempt
to pull a tuple on the left I guess you might need to push down a
placeholder filter first and then later replace it with the real
filter if/when you eventually build it.)

In contrast to that situation, suppose we introduced whole-query
memory budgets as Peter Geoghegan and several others have proposed.
Then we'd suddenly have a potential reason to prefer right-deep plans:

    H1
    /\
   r  H2
      /\
     s  H3
        /\
       t  u

Many other RDBMSs would consider this (modulo some confusion about
hash join polarity: other systems and literature have build = inner =
left, probe = outer = right so they'd draw that the other way around
and call it left deep, somewhat distractingly...) because it only
requires two hash tables to be loaded into memory at a time, a useful
property if you want to stay inside a whole-query memory budget.
That's because the output of each hash join is the input to the hash
table above it, so in theory we only need the one we're building and
the one we're probing at each moment.  Whatever other pros and cons
might exist with this plan shape compared to the left-deep variant, my
point is that a right-deep join could only push each filter down to
the immediate scan of r, s, t, which wouldn't be much of a potential
improvement over the way your current patch does it (ie entirely
inside the hash join, no push down), because it's not able to move the
Bloom filter very far away from the hash join anyway.  At best it
could perhaps move it before some more expensive/less selective other
quals in the scan.

In other words, I wonder if our left-deep plans stand to gain more
from Bloom filter push down.

[1] http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf
[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.73.8069&rep=rep1&type=pdf
[3] https://pdfs.semanticscholar.org/ec99/6093805012b9e0ff06bb2050436091671091.pdf

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra
Date:
On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>> I suspect that it could make sense to use a Bloom filter to 
>>> summarize the entire inner side of the join all at once, even
>>> when there are multiple batches. I also suspect that this is
>>> particularly beneficial with parallel hash joins, where
>>> IPC/synchronization overhead can be a big problem.
>>>
>>
>> But that's what the patch does, currently - the filter is built
>> during the initial pass through the data, and then used for all
>> batches.
> 
> I misunderstood. I would probably do something like double or triple 
> the original rows estimate instead, though. The estimate must be at 
> least slightly inaccurate when we get to this point, but I don't
> think that that's a good enough reason to give up on the estimate 
> completely.
> 

That's a problem only for the multi-batch case, though.

With a single batch we can walk the hash table and count non-empty
buckets, to get a good ndistinct estimate cheaply. And then size the
filter considering both memory requirements (fits into CPU cache) and
false positive rate. There are other things we may need to consider
(memory usage vs. work_mem) but that's a separate issue.

With multiple batches I think we could use the "size the bloom filter
for a fraction of work_mem" which the current patch uses when switching
to multiple batches halfway-through. That pretty much entirely ignores
the estimate and essentially replaces it with a "fictional" estimate.

I think that's a better approach than using some arbitrary multiple of
the estimate. When we have to start batching halfway through, the
estimate is proven to be rather bogus anyway, but we may treat it as a
lower boundary for the bloom filter size.

>> Actually, now that I think about it - I think the patch should
>> throw away the filter away after the initial pass over the outer
>> relation, because at that point we've used all the information in
>> the filter.
> 
> Makes sense.
> 

Actually, the patch already does that - it stops using the filter if
(curbatch != 0). We don't throw it away, though, because it also
includes some additional instrumentation that is shown by EXPLAIN ANALYZE.

>> I'm not sure it would make sense to then build smaller bloom
>> filters for individual batches, but maybe it would?
> 
> I doubt it.
> 

I think it might help if the global bloom filter ended up having high
false positive rate. But only if the per-batch filters fit into CPU
cache (i.e. it's the same reasoning as for single-batch case).

But those "per-batch" filters are rather incompatible with pushing the
filter to scan nodes, I think.

>> Yeah, I admit those are rather crude rules.
> 
> You have to start somewhere.
> 
>> The trouble is that when we start with a single batch and then find
>> out the estimate was wrong and we need to start batching, all bets
>> are off. At that point it seems reasonable to just say "Here is X
>> MBs of RAM, do what you can".
> 
> As I said above, I wouldn't say all bets are off when this happens
> -- not at all. Estimates are likely to often be somewhat wrong. If 
> they're completely wrong, we can probably swallow the cost of giving 
> up on a Bloom filter relatively late.
> 
> As I said, X should not be a portion of work_mem, because that has 
> only a weak relationship to what really matters.
> 

I agree a fixed fraction of work_mem may not be the right thing, but the
goal was to make the bloom filter part of the Hash memory budget, i.e.

    bloom filter + hash table <= work_mem

(which I think we agree should be the case), without increasing the
number of batches too much. For example, if you size the filter ignoring
this, and it ends up being 90% of work_mem, you may need to do the hash
join in 128 batches instead of just 16. Or something like that.

Maybe that would still be a win, though. Firstly, the higher number of
batches may not have a huge impact - in one case we need to serialize
15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
accurate filter allows us to discard much more data from the outer
relation ...


>>> You should try to exploit the fact that a Bloom filter can summarize a
>>> large set reasonably well with a very compact, simple representation.
>>> A false positive rate of 10% sounds a lot worse than 1% or 0.1%, but
>>> for cases where Bloom probes will save a lot of work, it probably
>>> doesn't make all that much difference -- the hash join is still much
>>> faster.
> 
>> But the problem is that I don't know what is the total size of the
>> hash table, because we're building the bloom filter for all the
>> batches at once. And we don't know how many batches will be there -
>> if we knew that, we could estimate the number of distinct values
>> and we could use that to size the filter instead of doing this.
>> (All of this only applies to the state where we start with a single
>> batch and then find out we need to start batching, of course.)
> 
> I don't think that the number of batches should matter that much to 
> the costing/sizing/estimation logic, even if it's an interesting
> thing to talk about when considering the upsides of a Bloom filter.
> My sense is that we should focus on making sure that using a Bloom
> filter is going to win, and not worry so much about whether that's
> going to be a huge win or a modest win.
> 
> Suppose that you thought you were going to have a 10% false positive 
> rate with a 22.85MB Bloom filter for 40 million elements (my earlier 
> example). Further suppose that it turns out to be 80 million
> elements. This means that you're going to get a false positive rate
> of 30%. This could still be a big win, though, so it's not really a
> bad situation. With 120 million elements, it's about 45%, but even
> then it could still work out, especially because the hash join will
> be very slow anyway. You also have to bear in mind that the 40
> million estimate is much more likely to be too high than too low,
> because you're assuming distinct key values for the hash table.
> 
> You're taking a risk, in a sense, but a lot of things have to go
> wrong for you to lose, and even then you can cut your losses before
> the extra cost is too high.
> 
> Do you have a test query for this patch, that gives us some idea of
> the upside?
> 

I have to admit I've been using only some simplistic ad-hoc queries.
There was a more detailed analysis in the old thread, but I'm not sure
how relevant it still is.

So I did some measurements on a simple join, with different work_mem
values and join selectivity.

    select count(*)
      from fact join dim on (dim.a = fact.a and dim.b < :sel)

Attached are results for "small" data set 20M:1M, the full results and
scripts are available here:

     https://github.com/tvondra/hashjoin-bloom-tests

The first list shows a summary of the results, with timings for

  a) master
  b) patched master (with bloom filters disabled)
  c) patched master (with bloom filters used always)

The last two tables compare b/a and c/a. The first one shows that
there's 0-3% overhead when bloom filters are not used (but it might
easily be just noise or differences in the layout of the binary).

The second one is the more interesting one. It shows a couple of things:

a) For tiny hash tables there's about 11% overhead. 1% selectivity means
the hash table has only 10000 entries, which fits into ~483kB. This is
why I think we need a rule that for small hash tables we don't build
bloom filters.

b) For low selectivity (70% or more rows get into the hash table), the
bloom filter is a net loss, costing up to ~11%. This is why we should
consider selectivity of the join, I think.

c) For selectivity between 10% and 70% it's a net win, with speedups
between ~10% (single batch) and ~20% (multi-batch).

Those are relatively modest improvements, I expect more significant
gains on the large data set.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra
Date:
On 02/21/2018 08:17 AM, Thomas Munro wrote:
> On Wed, Feb 21, 2018 at 10:23 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> In 2015/2016 I've been exploring if we could improve hash joins by
>> leveraging bloom filters [1], and I was reminded about this idea in a
>> thread about amcheck [2]. I also see that bloom filters were briefly
>> mentioned in the thread about parallel hash [3].
>>
>> So I've decided to revive the old patch, rebase it to current
>> master, and see if we can resolve the issues that killed it in
>> 2016.
> 
> Nice!
> 
>> Opinions?
> 
> I'm definitely following this and interested in helping in some way 
> if I can. I have wondered about this subject and discussed it a bit 
> with Peter Geoghegan off-list.
> 

Good ;-)

I think one important thing we need to figure out is the costing, or
some other way that would allow us to decide when to build the Bloom
filters (and perhaps whether to prefer a larger and more accurate one,
or a smaller one).

But if you want to look into adding support for parallel hash, or
pushing the bloom filter down to the scans, feel free to do so.


> Some assorted thoughts:
> 
> In the old thread, Peter pointed at a curious undergrad student
> project from 2008[1] evaluating Bloom filters for hash joins in
> PostgreSQL 8.3, inspired by a couple of older papers[2][3].  While
> your patch uses a Bloom filter to short-circuit the regular bucket
> probe in ExecHashJoinImpl(), these approach push the Bloom filter down
> into the outer relation scan.  I suspect you're right about the fixed
> sizing being a problem, but the general idea seems pretty interesting
> to me and there seems to be no reason you couldn't make the filter
> size dynamic as you have it and then share it via a parameter or
> something.  But is there any point?
> 
> On the one hand, pushing down Bloom filters requires the hash value 
> to be computed by the lower scan, and then computed again if the 
> tuple survives the filter and makes it into the Hash Join node 
> (unless there is some way to attach it to the tuple...). On the
> other hand, throwing away tuples sooner can avoid more work,
> particularly in the case of multi-joins.
> 

I do agree it's an interesting idea, and being able to push the filter
down would be great, particularly in case of very selective joins (i.e.
when many outer rows have no match in the hash table). I have no idea
how much infrastructure it would require, though, or how widely it could
be used.

Judging by your thoughts on impact of left-deep vs. right-deep joins
etc. you've already given this far more thought than I did ;-)


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire
Date:
On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>> On Tue, Feb 20, 2018 at 3:54 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>>> I suspect that it could make sense to use a Bloom filter to
>>>> summarize the entire inner side of the join all at once, even
>>>> when there are multiple batches. I also suspect that this is
>>>> particularly beneficial with parallel hash joins, where
>>>> IPC/synchronization overhead can be a big problem.
>>>>
>>>
>>> But that's what the patch does, currently - the filter is built
>>> during the initial pass through the data, and then used for all
>>> batches.
>>
>> I misunderstood. I would probably do something like double or triple
>> the original rows estimate instead, though. The estimate must be at
>> least slightly inaccurate when we get to this point, but I don't
>> think that that's a good enough reason to give up on the estimate
>> completely.
>>
>
> That's a problem only for the multi-batch case, though.
>
> With a single batch we can walk the hash table and count non-empty
> buckets, to get a good ndistinct estimate cheaply. And then size the
> filter considering both memory requirements (fits into CPU cache) and
> false positive rate. There are other things we may need to consider
> (memory usage vs. work_mem) but that's a separate issue.
>
> With multiple batches I think we could use the "size the bloom filter
> for a fraction of work_mem" which the current patch uses when switching
> to multiple batches halfway-through. That pretty much entirely ignores
> the estimate and essentially replaces it with a "fictional" estimate.
>
> I think that's a better approach than using some arbitrary multiple of
> the estimate. When we have to start batching halfway through, the
> estimate is proven to be rather bogus anyway, but we may treat it as a
> lower boundary for the bloom filter size.

...

>> As I said, X should not be a portion of work_mem, because that has
>> only a weak relationship to what really matters.
>>
>
> I agree a fixed fraction of work_mem may not be the right thing, but the
> goal was to make the bloom filter part of the Hash memory budget, i.e.
>
>     bloom filter + hash table <= work_mem
>
> (which I think we agree should be the case), without increasing the
> number of batches too much. For example, if you size the filter ignoring
> this, and it end up being 90% of work_mem, you may need to do the hash
> join in 128 batches instead of just 16. Or something like that.
>
> Maybe that would still be a win, though. Firstly, the higher number of
> batches may not have a huge impact - in one case we need to serialie
> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
> accurate filter allows us to discard much more data from the outer
> relation ...

Let me reiterate, you can avoid both issues with scalable bloom filters[1].

An HLL can be used to estimate set size, the paper makes no mention of
it, probably assuming only distinct items are added to the set.

[1] http://gsd.di.uminho.pt/members/cbm/ps/dbloom.pdf


Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra
Date:

On 02/22/2018 12:44 PM, Claudio Freire wrote:
> On Wed, Feb 21, 2018 at 11:21 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 02/21/2018 02:10 AM, Peter Geoghegan wrote:
>>> ...
>>> I misunderstood. I would probably do something like double or triple
>>> the original rows estimate instead, though. The estimate must be at
>>> least slightly inaccurate when we get to this point, but I don't
>>> think that that's a good enough reason to give up on the estimate
>>> completely.
>>>
>>
>> That's a problem only for the multi-batch case, though.
>>
>> With a single batch we can walk the hash table and count non-empty
>> buckets, to get a good ndistinct estimate cheaply. And then size the
>> filter considering both memory requirements (fits into CPU cache) and
>> false positive rate. There are other things we may need to consider
>> (memory usage vs. work_mem) but that's a separate issue.
>>
>> With multiple batches I think we could use the "size the bloom filter
>> for a fraction of work_mem" which the current patch uses when switching
>> to multiple batches halfway-through. That pretty much entirely ignores
>> the estimate and essentially replaces it with a "fictional" estimate.
>>
>> I think that's a better approach than using some arbitrary multiple of
>> the estimate. When we have to start batching halfway through, the
>> estimate is proven to be rather bogus anyway, but we may treat it as a
>> lower boundary for the bloom filter size.
> 
> ...
> 
>>> As I said, X should not be a portion of work_mem, because that has
>>> only a weak relationship to what really matters.
>>>
>>
>> I agree a fixed fraction of work_mem may not be the right thing, but the
>> goal was to make the bloom filter part of the Hash memory budget, i.e.
>>
>>     bloom filter + hash table <= work_mem
>>
>> (which I think we agree should be the case), without increasing the
>> number of batches too much. For example, if you size the filter ignoring
>> this, and it end up being 90% of work_mem, you may need to do the hash
>> join in 128 batches instead of just 16. Or something like that.
>>
>> Maybe that would still be a win, though. Firstly, the higher number of
>> batches may not have a huge impact - in one case we need to serialie
>> 15/16 and in the other one 127/128. That's 93% vs. 99%. And if the more
>> accurate filter allows us to discard much more data from the outer
>> relation ...
> 
> Let me reiterate, you can avoid both issues with scalable bloom filters[1].
> 

I'm afraid it's not as straight-forward as "Use scalable bloom filters!"

This is not merely a question of unreliable estimates of number of
entries. That could have been solved by scalable bloom filters, which
are essentially a sequence of larger and larger bloom filters, added
when the smaller bloom filter "fills up" (1/2 full).

The problem is twofold:

(a) we need to decide what false positive rate to use (i.e. what is a
reasonable trade-off between filter size and how much work it saves)

(b) we also need to consider work_mem (which I assume we all agree we
must respect)

So for example we can't size the first bloom filter to just perfectly
fit into work_mem, only to add larger bloom filters later (each 2x the
size of the previous one). Not only will that increase the memory usage
to 7x the initial estimate (1x + 2x + 4x after two additions), but it
will also make the bloom filter less
efficient (having to probe larger and larger filters, likely not fitting
into CPU cache).

> An HLL can be used to estimate set size, the paper makes no mention of
> it, probably assuming only distinct items are added to the set.
> 

The problem with HLL is that it's only an estimate of how many distinct
entries you have seen so far - and it can only tell you that after
observing the items. What we need for sizing a bloom filter is an
estimate of the number of distinct values in advance.

In other words, HLL is entirely useless for sizing Bloom Filters.

Furthermore, we could estimate the number of observed distinct values
from the number of 1s in the bloom filter - we essentially ask "How many
items did we observe, if each item sets k random bits and K bits are
set?" HLL does the same thing, but it throws away the ability to answer
which elements are in the set.
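
As a sketch of that idea (illustrative C, not code from the patch), the
expected fill of the filter can be inverted to get the estimate:

    #include <math.h>

    /*
     * With m bits, k hash functions and n distinct insertions, the
     * expected number of set bits is roughly m * (1 - exp(-k*n/m)), so
     * from the observed count of set bits we can estimate
     * n ~= -(m/k) * ln(1 - nset/m).
     */
    static double
    bloom_estimate_ndistinct(double nbits, double nset, int nhashes)
    {
        if (nset >= nbits)
            return INFINITY;    /* filter saturated, estimate useless */

        return -(nbits / nhashes) * log(1.0 - nset / nbits);
    }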


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire
Date:
On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
>
> On 02/22/2018 12:44 PM, Claudio Freire wrote:
>> Let me reiterate, you can avoid both issues with scalable bloom filters[1].
>>
>
> I'm afraid it's not as straight-forward as "Use scalable bloom filters!"
>
> This is not merely a question of unreliable estimates of number of
> entries. That could have been solved by scalable bloom filters, which
> are essentially a sequence of larger and larger bloom filters, added
> when the smaller bloom filter "fills up" (1/2 full).
>
> The problem is twofold:
>
> (a) we need to decide what false positive rate to use (i.e. what is a
> reasonable trade-off between filter size and how much work it saves)
>
> (b) we also need to consider work_mem (which I assume we all agree we
> must respect)
> So for example we can't size the first bloom filter to just perfectly
> fit into work_mem, only to add larger bloom filters later (each 2x the
> size of the previous one). Not only will that increase the memory usage
> to 7x the initial estimate

Aside from (a), (b) is exactly the problem SBFs solve.

You decide how much of work_mem you're willing to devote to BFs, and
then keep creating bigger and bigger BFs until you've allocated that
much.

Then you either keep updating the biggest filter, or give up entirely.

You can use a rather conservative initial bloom filter size for that,
and let scalability expand until you hit the limit.

> but it will also make the bloom filter less
> efficient (having to probe larger and larger filters, likely not fitting
> into CPU cache).

Again, you factor that into account when choosing the limit.

>> An HLL can be used to estimate set size, the paper makes no mention of
>> it, probably assuming only distinct items are added to the set.
>>
>
> The problem with HLL is, it's only an estimate of how many entries you
> saw so far. It only tells you that after observing the items, and it
> only tells you how many items you saw so far. What we need for sizing a
> bloom filter is an estimate of number of distinct values in advance.
>
> In other words, HLL is entirely useless for sizing Bloom Filters.

Normal BFs, yes. But that's exactly what you need for scalable BFs.
You need an estimate of the number of distinct entries you've added to
your current filter, not the total set size.

> Furthermore, we could estimate number of observed distinct values from
> the number of 1s in the bloom filter

That's kinda slow to do per-item. I tried to "count" distinct items by
checking the BF before adding (don't add redundantly), but that's less
precise than a HLL in my experience.


Re: Hash Joins vs. Bloom Filters / take 2

From: Tomas Vondra
Date:
On 02/22/2018 08:33 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 12:45 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 02/22/2018 12:44 PM, Claudio Freire wrote:
>>> ...
>>>
>>> An HLL can be used to estimate set size, the paper makes no
>>> mention of it, probably assuming only distinct items are added to
>>> the set.
>>>
>>
>> The problem with HLL is, it's only an estimate of how many entries
>> you saw so far. It only tells you that after observing the items,
>> and it only tells you how many items you saw so far. What we need
>> for sizing a bloom filter is an estimate of number of distinct
>> values in advance.
>>
>> In other words, HLL is entirely useless for sizing Bloom Filters.
> 
> Normal BFs, yes. But that's exactly what you need for scalable BFs. 
> You need an estimate of the amount of distinct entries you've added
> to your current filter, not the total set size.
> 

No, you don't need that. For the SBF you need to know when the BF gets
full, which can be determined solely based on the number of bits set to
1. Essentially, the BF reaches the target false positive rate when it's
about 50% full.

Which is exactly what the SBF paper says on page 4: ... when filters get
full due to the limit on the fill ratio, a new one is added.

>> Furthermore, we could estimate number of observed distinct values from
>> the number of 1s in the bloom filter
> 
> That's kinda slow to do per-item. I tried to "count" distinct items by
> checking the BF before adding (don't add redundantly), but that's less
> precise than a HLL in my experience.

But you don't need to do that for each item you try to add - you know
that with M bits and K hash functions you can't reach 50% before at
least M/(2*K) additions. And you don't really need to track the number
of bits exactly - if it gets 55% full, it's not going to explode.

So just wait until M/(2*K) additions, see how many bits remain until the
threshold - rinse and repeat. And use some reasonable minimum distance
(say, 5% of the capacity) so as not to do the check too often.
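
A minimal sketch of that deferred check (the names and the 5% minimum
gap are illustrative, not code from the patch):

    #include <stdbool.h>

    typedef struct ToyBloomState
    {
        long    nbits;          /* M */
        int     nhashes;        /* K */
        long    since_check;    /* additions since the last bit count */
        long    next_check;     /* additions to wait before counting */
    } ToyBloomState;

    /* assumed to exist elsewhere: count the bits set to 1 */
    extern long toy_bloom_count_set_bits(ToyBloomState *bf);

    /*
     * Call after each addition; returns true once the filter is about
     * 50% full. Initialize next_check to M/(2*K), since the threshold
     * cannot possibly be crossed before that many additions.
     */
    static bool
    toy_bloom_maybe_full(ToyBloomState *bf)
    {
        long    nset;
        long    min_gap;

        if (++bf->since_check < bf->next_check)
            return false;       /* cannot have reached 50% yet */

        nset = toy_bloom_count_set_bits(bf);
        if (nset >= bf->nbits / 2)
            return true;        /* hit the fill-ratio limit */

        /*
         * Each addition sets at most K new bits, so at least
         * (M/2 - nset) / K more additions are needed; also enforce a
         * minimum gap (~5% of the M/(2*K) capacity) so we don't count
         * bits too often near the threshold.
         */
        min_gap = bf->nbits / (40 * bf->nhashes);
        bf->since_check = 0;
        bf->next_check = (bf->nbits / 2 - nset) / bf->nhashes;
        if (bf->next_check < min_gap)
            bf->next_check = min_gap;

        return false;
    }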

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

From: Claudio Freire
Date:
On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>> That's kinda slow to do per-item. I tried to "count" distinct items by
>> checking the BF before adding (don't add redundantly), but that's less
>> precise than a HLL in my experience.
>
> But you don't need to do that for each item you try to add - you know
> that with M bits and K hash functions you can't reach 50% before at
> least M/(2*K) additions. And you don't really need to track the number
> of bits exactly - if it gets 55% full, it's not going to explode.
>
> So just wait until M/(2*K) additions, see how many bits remain until the
> threshold - rinse and repeat. And use some reasonable minimum distance
> (say, 5% of the capacity), not to do the check too often.

Ok, that works too.

Point is, SBFs help to keep the BF size as small as possible, keep it
below work_mem, and to avoid caring about the total number of distinct
items.

You may want the planner to try and estimate that to figure out
whether it's worth trying BFs or not, but once you decide to try, SBFs
remove any dependency on the accuracy of that estimate.


Re: Hash Joins vs. Bloom Filters / take 2

От
Tomas Vondra
Дата:

On 02/22/2018 09:52 PM, Claudio Freire wrote:
> On Thu, Feb 22, 2018 at 5:11 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> On 02/22/2018 08:33 PM, Claudio Freire wrote:
>>> That's kinda slow to do per-item. I tried to "count" distinct items by
>>> checking the BF before adding (don't add redundantly), but that's less
>>> precise than a HLL in my experience.
>>
>> But you don't need to do that for each item you try to add - you know
>> that with M bits and K hash functions you can't reach 50% before at
>> least M/(2*K) additions. And you don't really need to track the number
>> of bits exactly - if it gets 55% full, it's not going to explode.
>>
>> So just wait until M/(2*K) additions, see how many bits remain until the
>> threshold - rinse and repeat. And use some reasonable minimum distance
>> (say, 5% of the capacity), not to do the check too often.
> 
> Ok, that works too.
> 
> Point is, SBFs help to keep the BF size as small as possible, keep it
> below work_mem, and to avoid caring about the total number of distinct
> items.
> 
> You may want the planner to try and estimate that to figure out 
> whether it's worth trying BFs or not, but once you decide to try,
> SBFs remove any dependency on the accuracy of that estimate.
> 

OK, thanks for reminding me about SBF and for the discussion.

At this point I'll probably focus on the other parts though -
determining selectivity of the join, etc. Which I think is crucial, and
we need to get that right even for accurate estimates. It's good to know
that we have a solution for that part, though.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

От
Peter Geoghegan
Дата:
On Thu, Feb 22, 2018 at 1:14 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> OK, thanks for reminding me about SBF and for the discussion.
>
> At this point I'll probably focus on the other parts though -
> determining selectivity of the join, etc. Which I think is crucial, and
> we need to get that right even for accurate estimates. It's good to know
> that we have a solution for that part, though.

+1

There are probably significant opportunities to improve the Bloom
filter. That isn't that interesting right now, though. Figuring out
how scalable Bloom filters might save hash join from being reliant on
the accuracy of the initial estimate of set cardinality seems
premature at best, since we haven't established how sensitive this
use-case is to misestimations. My sense is that it's actually
naturally very insensitive, but there is no need to spend too much
time on it just yet.

It just makes sense, as a matter of procedure, to focus on the hash
join code, and drill down from there. Personally, I'm tired of talking
about the nitty-gritty details of Bloom filters rather than the actual
problem at hand.

-- 
Peter Geoghegan


Re: Hash Joins vs. Bloom Filters / take 2

От
Andres Freund
Дата:
Hi,

On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
> So I've decided to revive the old patch, rebase it to current master,
> and see if we can resolve the issues that killed it in 2016.

There seems to be some good discussion in the thread. But the patch
arrived just before the last commitfest and certainly isn't a trivial
cleanup patch. Therefore I think it should be moved to the next CF?


Greetings,

Andres Freund


Re: Hash Joins vs. Bloom Filters / take 2

От
Tomas Vondra
Дата:

On 03/01/2018 11:01 PM, Andres Freund wrote:
> Hi,
> 
> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>> So I've decided to revive the old patch, rebase it to current master,
>> and see if we can resolve the issues that killed it in 2016.
> 
> There seems to be some good discussion in the thread. But the patch
> arrived just before the last commitfest and certainly isn't a trivial
> cleanup patch. Therefore I think it should be moved to the next CF?
> 

It isn't a massive invasive patch either, though, so I object to moving
it to 2018-09 right away.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

От
Andres Freund
Дата:

On March 1, 2018 3:22:44 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>
>On 03/01/2018 11:01 PM, Andres Freund wrote:
>> Hi,
>>
>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>>> So I've decided to revive the old patch, rebase it to current
>master,
>>> and see if we can resolve the issues that killed it in 2016.
>>
>> There seems to be some good discussion in the thread. But the patch
>> arrived just before the last commitfest and certainly isn't a trivial
>> cleanup patch. Therefore I think it should be moved to the next CF?
>>
>
>It isn't a massive invasive patch either, though, so I object to moving
>it to 2018-09 right away.

Why do we have rules around not submitting large stuff to the last cf, if we just don't follow through? We're neck deep
in patches that are older. And you've already gotten a fair bit of feedback..


Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: Hash Joins vs. Bloom Filters / take 2

От
Tomas Vondra
Дата:
On 03/02/2018 12:31 AM, Andres Freund wrote:
> 
> 
> On March 1, 2018 3:22:44 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>
>>
>> On 03/01/2018 11:01 PM, Andres Freund wrote:
>>> Hi,
>>>
>>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>>>> So I've decided to revive the old patch, rebase it to current
>> master,
>>>> and see if we can resolve the issues that killed it in 2016.
>>>
>>> There seems to be some good discussion in the thread. But the patch
>>> arrived just before the last commitfest and certainly isn't a trivial
>>> cleanup patch. Therefore I think it should be moved to the next CF?
>>>
>>
>> It isn't a massive invasive patch either, though, so I object to moving
>> it to 2018-09 right away.
> 
> Why do we have rules around not submitting large stuff to the last 
> cf, if we just don't follow through? We're neck deep in patches that
> are older. And you've already gotten a fair bit of feedback..
> 

It was not my intention to break (or even bend) the CF rules, of course.
I haven't considered the patch to be "large stuff", while you do. I see
Peter Geoghegan agrees with your conclusion on another thread, so go
ahead and move it to 2018-09.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

От
David Steele
Дата:
On 3/1/18 6:52 PM, Tomas Vondra wrote:
> On 03/02/2018 12:31 AM, Andres Freund wrote:
>>
>>
>> On March 1, 2018 3:22:44 PM PST, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>
>>>
>>> On 03/01/2018 11:01 PM, Andres Freund wrote:
>>>> Hi,
>>>>
>>>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>>>>> So I've decided to revive the old patch, rebase it to current
>>> master,
>>>>> and see if we can resolve the issues that killed it in 2016.
>>>>
>>>> There seems to be some good discussion in the thread. But the patch
>>>> arrived just before the last commitfest and certainly isn't a trivial
>>>> cleanup patch. Therefore I think it should be moved to the next CF?
>>>>
>>>
>>> It isn't a massive invasive patch either, though, so I object to moving
>>> it to 2018-09 right away.
>>
>> Why do we have rules around not submitting large stuff to the last
>> cf, if we just don't follow through? We're neck deep in patches that
>> are older. And you've already gotten a fair bit of feedback..
>>
> 
> It was not my intention to break (or even bend) the CF rules, of course.
> I haven't considered the patch to be "large stuff", while you do. I see
> Peter Geoghegan agrees with your conclusion on another thread, so go
> ahead and move it to 2018-09.

After reviewing the thread I also agree that this should be pushed to 
2018-09, so I have done so.

I'm very excited by this patch, though.  In general I agree with Peter 
that a higher rate of false positives is acceptable to save memory.  I 
also don't see any reason why this can't be tuned with a parameter. 
Start with a conservative default and allow the user to adjust as desired.

-- 
-David
david@pgmasters.net


Re: Hash Joins vs. Bloom Filters / take 2

От
Patrick Krecker
Дата:
On Thu, Mar 1, 2018 at 4:04 PM, David Steele <david@pgmasters.net> wrote:
> On 3/1/18 6:52 PM, Tomas Vondra wrote:
>>
>> On 03/02/2018 12:31 AM, Andres Freund wrote:
>>>
>>>
>>>
>>> On March 1, 2018 3:22:44 PM PST, Tomas Vondra
>>> <tomas.vondra@2ndquadrant.com> wrote:
>>>>
>>>>
>>>>
>>>> On 03/01/2018 11:01 PM, Andres Freund wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 2018-02-20 22:23:54 +0100, Tomas Vondra wrote:
>>>>>>
>>>>>> So I've decided to revive the old patch, rebase it to current
>>>>
>>>> master,
>>>>>>
>>>>>> and see if we can resolve the issues that killed it in 2016.
>>>>>
>>>>>
>>>>> There seems to be some good discussion in the thread. But the patch
>>>>> arrived just before the last commitfest and certainly isn't a trivial
>>>>> cleanup patch. Therefore I think it should be moved to the next CF?
>>>>>
>>>>
>>>> It isn't a massive invasive patch either, though, so I object to moving
>>>> it to 2018-09 right away.
>>>
>>>
>>> Why do we have rules around not submitting large stuff to the last
>>> cf, if we just don't follow through? We're neck deep in patches that
>>> are older. And you've already gotten a fair bit of feedback..
>>>
>>
>> It was not my intention to break (or even bend) the CF rules, of course.
>> I haven't considered the patch to be "large stuff", while you do. I see
>> Peter Geoghegan agrees with your conclusion on another thread, so go
>> ahead and move it to 2018-09.
>
>
> After reviewing the thread I also agree that this should be pushed to
> 2018-09, so I have done so.
>
> I'm very excited by this patch, though.  In general I agree with Peter that
> a higher rate of false positives is acceptable to save memory.  I also don't
> see any reason why this can't be tuned with a parameter. Start with a
> conservative default and allow the user to adjust as desired.
>
> --
> -David
> david@pgmasters.net
>

Hi All --

I'm curious what has to be done to move this patch along. I looked
through the patch and I noticed that the work for deciding whether to
instantiate the bloom filter in single-batch mode is not complete yet
(or if it's in this change, I can't find it), contrary to this
comment:

+        * When starting in a single-batch mode, we do nothing initially.
+        * If the whole hash table fits into a single batch, we can get
+        * sufficiently accurate ndistinct estimate by simply counting
+        * occupied buckets (thanks to shooting for NTUP_PER_BUCKET=1),
+        * or perhaps we could use something more elaborate (e.g. HLL).
+        * But we only build the bloom filter if the hash table is large
+        * enough to exceed on-CPU caches (e.g. 4MB).

I did some basic benchmarking with two tables and a simple where
clause filtering 90% of rows and it does yield about a 1.2x speedup in
my tests. In the pathological case where two tables are joined on a
pk/fk relationship with no filtering, the penalty seems to be about
1-2%.

Patrick


Re: Hash Joins vs. Bloom Filters / take 2

От
Michael Paquier
Дата:
On Thu, Mar 01, 2018 at 07:04:41PM -0500, David Steele wrote:
> After reviewing the thread I also agree that this should be pushed to
> 2018-09, so I have done so.
>
> I'm very excited by this patch, though.  In general I agree with Peter that
> a higher rate of false positives is acceptable to save memory.  I also don't
> see any reason why this can't be tuned with a parameter. Start with a
> conservative default and allow the user to adjust as desired.

Not much has happened since last March.  The patch has conflicts in
regression tests.  Thomas, you are registered as a reviewer of this
patch.  Are you planning to look at it?

This is moved to next CF, waiting on author per the rotten bits.
--
Michael

Вложения

Re: Hash Joins vs. Bloom Filters / take 2

От
Tomas Vondra
Дата:
On 10/01/2018 09:15 AM, Michael Paquier wrote:
> On Thu, Mar 01, 2018 at 07:04:41PM -0500, David Steele wrote:
>> After reviewing the thread I also agree that this should be pushed to
>> 2018-09, so I have done so.
>>
>> I'm very excited by this patch, though.  In general I agree with Peter that
>> a higher rate of false positives is acceptable to save memory.  I also don't
>> see any reason why this can't be tuned with a parameter. Start with a
>> conservative default and allow the user to adjust as desired.
> 
> Not much has happened since last March.  The patch has conflicts in
> regression tests.  Thomas, you are registered as a reviewer of this
> patch.  Are you planning to look at it?
> 
> This is moved to next CF, waiting on author per the rotten bits.
> --

I don't expect to work on this anytime soon, so maybe mark it as 
returned with feedback instead of moving it to the next CF.

I think the idea to push down the bloom filter to the other side of the 
hash join is what would make it much more interesting than the simple 
approach in this patch - but it's also much more work to make it work.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

От
Jim Finnerty
Дата:
Hi Tomas,  

    I'm very interested in this patch, and particularly in possible
extensions to push the Bloom filter down on the probe side of the join.  I
made a few small edits to the patch to enable it to compile on PG11, and can
send it to you if you're interested.

    It is currently in the list of patches for the current commitfest, but
based on your previous post I'm not sure if you're planning to get back to
this patch just now.  If you plan to resume work on it, I'll sign up as a
reviewer.

thank you,

    /Jim Finnerty



-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Hash Joins vs. Bloom Filters / take 2

От
Thomas Munro
Дата:
On Fri, Nov 2, 2018 at 9:23 AM Jim Finnerty <jfinnert@amazon.com> wrote:
>     I'm very interested in this patch, and particularly in possible
> extensions to push the Bloom filter down on the probe side of the join.  I
> made a few small edits to the patch to enable it to compile on PG11, and can
> send it to you if you're interested.

Hi Jim,

Would you compute the hash for the outer tuples in the scan, and then
again in the Hash Join when probing, or would you want to (somehow)
attach the hash to emitted tuples for later reuse by the higher node?
Someone pointed out to me off-list that a popular RDBMS emanating from
the bicycle capital of the North-West pushes down Bloom filters to
scans, but only when the key is a non-nullable integer; I wonder if
that is because they hash in both places, but consider that OK only
when it's really cheap to do so.  (Along the same lines, if we could
attach extra data to tuples, I wonder if it would make sense to
transmit sort support information to higher nodes, so that (for
example) GatherMerge could use it to avoid full key comparison when
dealing with subplans that already did a sort and computed integers
for fast inequality checks.)

>     It is currently in the list of patches for the current commitfest, but
> based on your previous post I'm not sure if you're planning to get back to
> this patch just now.  If you plan to resume work on it, I'll sign up as a
> reviewer.

I'm also signed up to review.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Hash Joins vs. Bloom Filters / take 2

От
Tomas Vondra
Дата:
On 11/01/2018 10:06 PM, Thomas Munro wrote:
> On Fri, Nov 2, 2018 at 9:23 AM Jim Finnerty <jfinnert@amazon.com> wrote:
>>     I'm very interested in this patch, and particularly in possible
>> extensions to push the Bloom filter down on the probe side of the join.  I
>> made a few small edits to the patch to enable it to compile on PG11, and can
>> send it to you if you're interested.
> 
> Hi Jim,
> 
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?
> Someone pointed out to me off-list that a popular RDBMS emanating from
> the bicycle capital of the North-West pushes down Bloom filters to
> scans, but only when the key is a non-nullable integer; I wonder if
> that is because they hash in both places, but consider that OK only
> when it's really cheap to do so.  (Along the same lines, if we could
> attach extra data to tuples, I wonder if it would make sense to
> transmit sort support information to higher nodes, so that (for
> example) GatherMerge could use it to avoid full key comparison when
> dealing with subplans that already did a sort and computed integers
> for fast inequality checks.)
> 
>>     It is currently in the list of patches for the current commitfest, but
>> based on your previous post I'm not sure if you're planning to get back to
>> this patch just now.  If you plan to resume work on it, I'll sign up as a
>> reviewer.
> 
> I'm also signed up to review.
> 

I haven't really planned to work on this anytime soon, unfortunately,
which is why I proposed to mark it as RwF at the end of the last CF. I
already have a couple other patches there, and (more importantly) I
don't have a very clear idea how to implement the filter pushdown.

That being said I still think it'd be an interesting improvement, and if
someone wants to take over I'm available as a reviewer / tester ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Hash Joins vs. Bloom Filters / take 2

От
Robert Haas
Дата:
On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?

I'm interested in what Jim has to say, but to me it seems like we
should try to find a way to add a tlist entry for the hash value to
avoid recomputing it.  That's likely to require some tricky planner
surgery, but it's probably doable.

What really seems finicky to me about this whole project is the
costing.  In the best case it's a huge win; in the worst case it's a
significant loss; and whether it's a gain or a loss is not easy to
figure out from the information that we have available.  We generally
do not have an accurate count of the number of distinct values we're
likely to see (which is important).

Worse, when you start to consider pushdown, you realize that the cost
of the scan depends on the bloom filter we push down to it.  So
consider something like A IJ B IJ C.  It seems like it could be the
case that once we decide to do the A-B join as a hash join with a
bloom filter, it makes sense to also do the join to C as a hash join
and push down the bloom filter, because we'll be able to combine the
two filters and the extra probes will be basically free.  But if we
weren't already doing the A-B join with a bloom filter, then maybe the
filter wouldn't make sense for C either.

Maybe I'm worrying over nothing here, or the wrong things, but costing
this well enough to avoid regressions really looks hard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Hash Joins vs. Bloom Filters / take 2

От
Jim Finnerty
Дата:
On Thu, Nov 1, 2018 at 5:07 PM Thomas Munro
<[hidden email]> wrote:
> Would you compute the hash for the outer tuples in the scan, and then
> again in the Hash Join when probing, or would you want to (somehow)
> attach the hash to emitted tuples for later reuse by the higher node?

I'm still considering your question (it sounds reasonable, but will add
complexity), but I want to mention a couple of things:  

One benefit of pushing a filter to the Scan operation is the ability to
apply its selectivity before certain other expensive operations.  Some of
these benefits are storage-engine specific, so for example a column store
can greatly reduce row materialization costs by applying the runtime join
filter before row materialization.  A row store that supports batch-mode
operators benefits differently from pushing down the filter than a row store
that does not support batching.  So we should probably think about pushing
runtime filters down to the Scan operator in the broader context of an API
to a pluggable storage engine [1] that may accept one or more runtime
filters.  I have a lot of catching-up to do on the pluggable storage thread
before I can comment on that, but maybe others can offer their opinion.

The runtime join filter might be represented in different ways depending on
the data type and distinct cardinality of the join key(s) of the inner
relation.  As Thomas Munro hinted at, there are optimizations for the
integer key case, and in particular if the range of the integer key is
sufficiently small, then you can avoid hashing entirely and just do the
membership test in a bit array using the key itself.  In that case the
membership test would be exact, and you don't need the hash to be computed
or passed up to the join operator.
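
A minimal sketch of that exact variant in plain C (names and layout are
hypothetical, purely for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdlib.h>

    /*
     * Exact membership filter for integer join keys in a known small range
     * [minkey, maxkey]: one bit per possible key value, no hashing, and
     * therefore no false positives.
     */
    typedef struct RangeBitFilter
    {
        int64_t     minkey;
        int64_t     maxkey;
        uint8_t    *bits;
    } RangeBitFilter;

    static RangeBitFilter *
    range_filter_create(int64_t minkey, int64_t maxkey)
    {
        RangeBitFilter *f = malloc(sizeof(RangeBitFilter));
        size_t          nbytes = (size_t) (maxkey - minkey) / 8 + 1;

        f->minkey = minkey;
        f->maxkey = maxkey;
        f->bits = calloc(nbytes, 1);    /* all bits start at 0 */
        return f;
    }

    /* Called while building the hash table, once per inner-side key
     * (assumes minkey <= key <= maxkey). */
    static void
    range_filter_add(RangeBitFilter *f, int64_t key)
    {
        int64_t     off = key - f->minkey;

        f->bits[off / 8] |= (uint8_t) (1 << (off % 8));
    }

    /* Called per outer tuple, in the join or pushed down into the scan. */
    static bool
    range_filter_contains(const RangeBitFilter *f, int64_t key)
    {
        int64_t     off;

        if (key < f->minkey || key > f->maxkey)
            return false;       /* definitely no match on the inner side */

        off = key - f->minkey;
        return (f->bits[off / 8] & (1 << (off % 8))) != 0;
    }

Since the test is exact, a "false" here means the outer tuple can be skipped
outright, not just "probably absent" as with a regular Bloom filter.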

re: "What really seems finicky to me about this whole project is the
costing.  In the best case it's a huge win; in the worst case it's a
significant loss"

An example of when you'd get pure overhead would be a hash join from the FK
side to the PK side of a RI constraint, with no local predicate on the PK
side.  In that case, every row would pass (assuming RI constraints were
properly enforced), and so the filter wouldn't reject any rows from the FK
side.

For this specific case we could anticipate that creating a runtime join
filter wouldn't be helpful, but a robust way to handle this problem in
general is to collect runtime statistics and have the operator shut itself
off if it's not filtering enough rows to justify its overhead.  For example,
maintain a count of rows processed and rows filtered, and when the number of
rows processed is above some minimum threshold and the selectivity is higher
than some threshold, then disable it either permanently or temporarily. 
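
A minimal sketch of that self-disabling bookkeeping in plain C (the struct,
function names and thresholds are invented for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Runtime statistics for a pushed-down join filter, used to switch the
     * filter off when it is not rejecting enough rows to pay for the
     * per-row check.  The thresholds are illustrative, not tuned values.
     */
    typedef struct JoinFilterStats
    {
        uint64_t    nprocessed;     /* outer rows checked against the filter */
        uint64_t    nrejected;      /* outer rows rejected by the filter */
        bool        enabled;        /* is the filter still being applied? */
    } JoinFilterStats;

    #define FILTER_CHECK_INTERVAL   100000  /* rows between re-evaluations */
    #define FILTER_MIN_REJECT_RATE  0.20    /* require >= 20% rows rejected */

    /* Call once per outer row while the filter is still enabled. */
    static inline void
    join_filter_update_stats(JoinFilterStats *stats, bool rejected)
    {
        stats->nprocessed++;
        if (rejected)
            stats->nrejected++;

        /* Re-evaluate periodically, once enough rows have been seen. */
        if (stats->nprocessed % FILTER_CHECK_INTERVAL == 0)
        {
            double      reject_rate;

            reject_rate = (double) stats->nrejected /
                          (double) stats->nprocessed;
            if (reject_rate < FILTER_MIN_REJECT_RATE)
                stats->enabled = false; /* too few rejections, stop probing */
        }
    }

This variant disables the filter permanently; a temporary variant could skip
the probe for the next N rows and then re-enable it to re-measure.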

Accurate estimation of the benefits of applying the join selectivity during
Scan is going to be storage-engine dependent, but as long as the operation
can turn itself off if it's not filtering enough rows to justify the per-row
overhead on the outer, the risk of the approach can be made very small.

    /Jim


[1] https://www.postgresql-archive.org/Pluggable-storage-tc5916322.html





-----
Jim Finnerty, AWS, Amazon Aurora PostgreSQL
--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Hash Joins vs. Bloom Filters / take 2

От
Dmitry Dolgov
Дата:
On Thu, Nov 1, 2018 at 10:17 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
>
> I haven't really planned to work on this anytime soon, unfortunately,
> which is why I proposed to mark it as RwF at the end of the last CF. I
> already have a couple other patches there, and (more importantly) I
> don't have a very clear idea how to implement the filter pushdown.
>
> That being said I still think it'd be an interesting improvement, and if
> someone wants to take over I'm available as a reviewer / tester ...

Since no one expressed any particular desire to take over the patch, I'm
marking it as "Returned with feedback". Although I really like the discussion
that happened here, maybe it's worth moving part of it to the wiki under the
"possible improvements" flag?