Re: Do we want a hashset type?

Поиск
Список
Период
Сортировка
От Joel Jacobson
Тема Re: Do we want a hashset type?
Дата
Msg-id 64d3fe63-86ff-45c3-956a-17b2645167e2@app.fastmail.com
обсуждение исходный текст
Ответ на Re: Do we want a hashset type?  (jian he <jian.universality@gmail.com>)
Ответы Re: Do we want a hashset type?  ("Joel Jacobson" <joel@compiler.org>)
Список pgsql-hackers
On Fri, Jun 16, 2023, at 13:57, jian he wrote:
> similar to (int[] || int4) and (int4 || int[])
> should we expect ('{1,2}'::int4hashset || 3) == (3 ||  
> '{1,2}'::int4hashset) == (select hashset_add('{1,2}'::int4hashset,3)); 
> *?*

Good idea, makes sense to support it.
Implemented in attached patch.

> CREATE OPERATOR || (
>     leftarg = int4,
>     rightarg = int4hashset,
>     function = int4_add_int4hashset,
>     commutator = ||
> );
> while creating an operator. I am not sure how to specify 
> NEGATOR,RESTRICT,JOIN clause.

I don't think we need those for this operator, might be wrong though.

>
-----------------------------------------------------------------------------------------------------------------------------
> also. I think the following query should return one row only? but now 
> it doesn't.
> select hashset_cmp('{1,2}','{2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2,1}')
> union 
> select hashset_cmp('{1,2}','{1,2}');

Good point.

I realise int4hashset_hash() is broken,
since two int4hashset's that are considered equal,
can by coincidence get different hashes:

SELECT '{1,2}'::int4hashset = '{2,1}'::int4hashset;
 ?column?
----------
 t
(1 row)

SELECT hashset_hash('{1,2}'::int4hashset);
 hashset_hash
--------------
    990882385
(1 row)

SELECT hashset_hash('{2,1}'::int4hashset);
 hashset_hash
--------------
    996377797
(1 row)

Do we have any ideas on how to fix this without sacrificing performance?

We of course want to avoid having to sort the hashsets,
which is the naive solution.

To understand why this is happening, consider this example:

SELECT '{1,2}'::int4hashset;
 int4hashset
-------------
 {1,2}
(1 row)

SELECT '{2,1}'::int4hashset;
 int4hashset
-------------
 {2,1}
(1 row)

If the hash of `1` and `2` modulus the capacity results in the same value,
they will be attempted to be inserted into the same position,
and since the input text is parsed left-to-right, in the first case `1` will win
the first position, and `2` will get a collision and try the next position.

In the second case, the opposite happens.

Since we do modulus capacity, the position depends on the capacity,
which is why the output string can be different for the same input.

SELECTint4hashset() || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=1) || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=2) || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=3) || 1 || 2 || 3;
 {3,2,1}

SELECTint4hashset(capacity:=4) || 1 || 2 || 3;
 {3,1,2}

SELECTint4hashset(capacity:=5) || 1 || 2 || 3;
 {1,2,3}

SELECTint4hashset(capacity:=6) || 1 || 2 || 3;
 {1,3,2}


>
----------------------------------------------------------------------------------------------------------------------
> similar to elem_contained_by_range, range_contains_elem. we should 
> already consider the operator *<@* and @*>? *

That could perhaps be nice.
Apart from possible syntax convenience,
are there any other benefits than just using the function hashset_contains(int4hashset, integer) directly?

/Joel
Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Следующее
От: "Tristan Partin"
Дата:
Сообщение: Re: Support to define custom wait events for extensions