Обсуждение: Fast, stable, portable hash function producing 4-byte or 8-byte values?

Поиск
Список
Период
Сортировка

Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
Erwin Brandstetter
Дата:
I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably produces practically unique, stable 16-byte values. I have usecases where an 8-byte or even 4-byte hash would be good enough to make collisions reasonably unlikely. (I can recheck on the full string) - and expression indexes substantially smaller. I could truncate md5 and cast back and forth, but that seems like a lot of wasted computation. Are there suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:


Postgres 11 added hashtextextended() and friends to generate bigint hashes. In a more recent post from 3 months ago, Tom suggests to use it in user-land - if portability is not needed:

https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us

Is pghashlib by Marko Kreen my best option?


Or the "version-independent hash functions for PostgreSQL" from Peter Eisentraut:


Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which would be required in come cases).

So what are my best options?

Regards
Erwin

Re: Fast, stable, portable hash function producing 4-byte or 8-bytevalues?

От
Laurenz Albe
Дата:
On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
> I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or
later.
> 
> [...]
>
> There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:
> 
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us

Changing a hash function would corrupt hash indexes, wouldn't it?

So I'd expect these functions to be pretty stable:

SELECT amp.amproc
FROM pg_amproc AS amp
   JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
   JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
  AND amp.amprocnum = 1;

Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
Miles Elam
Дата:

In terms of "wasted computation", MD5, SHA1, and the others always compute the full length before they are passed to a UUID, int, or whatever. It's a sunk cost. It's also a minor cost considering many hash algorithms are performed in CPU hardware now. All that's left is the truncation and cast, which you can't avoid easily.


Sure, you could reimplement Java's .hashCode() method by iterating through the characters and processing the character codes:

s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]


I don't see how that would beat the CPU-based hashes though unless you wrote a C-based extension. Maybe it's better just to embrace the user-defined function first and then decide if performance is insufficient for your use cases.


CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5') RETURNS int8 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16), 'hex'))::bit(64)::int8

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;


CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5') RETURNS int4 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8), 'hex'))::bit(32)::int4

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;


SELECT

hash4('something something something'),

hash4('something something something', 'sha1'),

hash8('something something something'),

hash8('something something something', 'sha1');



Cheers,


Miles


On Tue, Dec 10, 2019 at 1:12 PM Erwin Brandstetter <brsaweda@gmail.com> wrote:
I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably produces practically unique, stable 16-byte values. I have usecases where an 8-byte or even 4-byte hash would be good enough to make collisions reasonably unlikely. (I can recheck on the full string) - and expression indexes substantially smaller. I could truncate md5 and cast back and forth, but that seems like a lot of wasted computation. Are there suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- croptographic aspect is not needed (acceptable, but no benefit)

There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:


Postgres 11 added hashtextextended() and friends to generate bigint hashes. In a more recent post from 3 months ago, Tom suggests to use it in user-land - if portability is not needed:

https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us

Is pghashlib by Marko Kreen my best option?


Or the "version-independent hash functions for PostgreSQL" from Peter Eisentraut:


Neither received updates for a couple of years. Unmaintained? Or obsolete?
And neither is available on most hosted services like RDS or Heroku (which would be required in come cases).

So what are my best options?

Regards
Erwin

Re: Fast, stable, portable hash function producing 4-byte or 8-bytevalues?

От
Ron
Дата:
On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
> I am looking for stable hash functions producing 8-byte or 4-byte hashes 
> from long text values in Postgres 10 or later.
>
> There is md5(), the result of which can be cast to uuid. This reliably 
> produces practically unique, stable 16-byte values. I have usecases where 
> an 8-byte or even 4-byte hash would be good enough to make collisions 
> reasonably unlikely. (I can recheck on the full string) - and expression 
> indexes substantially smaller. I could truncate md5 and cast back and 
> forth, but that seems like a lot of wasted computation. Are there 
> suggestions for text hash functions that are
> - fast
> - keep collisions to a minimum
> - stable across major Postgres versions (so expression indexes don't break)
> - croptographic aspect is not needed (acceptable, but no benefit)

What about a CRC32 function?  It's fast, and an SSE4 instruction has been in 
Intel CPUs for about 10 years.

>
> There is an old post from 2012 by Tom Lane suggesting that hashtext() and 
> friends are not for users:
>
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us
>
> Postgres 11 added hashtextextended() and friends to generate bigint 
> hashes. In a more recent post from 3 months ago, Tom suggests to use it in 
> user-land - if portability is not needed:
>
> https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us
>
> Is pghashlib by Marko Kreen my best option?
>
> https://github.com/markokr/pghashlib
>
> Or the "version-independent hash functions for PostgreSQL" from Peter 
> Eisentraut:
>
> https://github.com/petere/pgvihash
>
> Neither received updates for a couple of years. Unmaintained? Or obsolete?
> And neither is available on most hosted services like RDS or Heroku (which 
> would be required in come cases).
>
> So what are my best options?
>
> Regards
> Erwin

-- 
Angular momentum makes the world go 'round.



Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
Erwin Brandstetter
Дата:
Thanks for the suggestion. Seems like a good assumption and I have been using hashtext() in the past. But I am uncertain whether it is the best option.

Guess Tom's warning in https://www.postgresql.org/message-id/9434.1568839177@sss.pgh.pa.us about portability only refers to hashtextextended() and friends not being there in Postgres 10 or older.

But why are none of these functions documented? Does the project still not ...

> want people to rely on them continuing to have exactly the current behavior.

I am not complaining, maybe just nobody did the work. But it's also mentioned in this old thread, that hastext() changed in the past. Is all of that outdated and we are welcome to use those functions for indexing?

Filtering with amprocnum = 2 gets functions producing bigint in Postgres 11 or later.  Not sure about the exact meaning of amprocnum, manual says "Support function number".

Remaining problem either way: no hash function returning bigint for Postgres 10.

Regards
Erwin

On Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
> I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.
>
> [...]
>
> There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:
>
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us

Changing a hash function would corrupt hash indexes, wouldn't it?

So I'd expect these functions to be pretty stable:

SELECT amp.amproc
FROM pg_amproc AS amp
   JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
   JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
  AND amp.amprocnum = 1;

Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
Erwin Brandstetter
Дата:
On Tue, Dec 10, 2019 at 11:34 PM Miles Elam <miles.elam@productops.com> wrote:

In terms of "wasted computation", MD5, SHA1, and the others always compute the full length before they are passed to a UUID, int, or whatever. It's a sunk cost. It's also a minor cost considering many hash algorithms are performed in CPU hardware now. All that's left is the truncation and cast, which you can't avoid easily.


Sure, you could reimplement Java's .hashCode() method by iterating through the characters and processing the character codes:

s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]


I don't see how that would beat the CPU-based hashes though unless you wrote a C-based extension. Maybe it's better just to embrace the user-defined function first and then decide if performance is insufficient for your use cases.


CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5') RETURNS int8 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16), 'hex'))::bit(64)::int8

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;


CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5') RETURNS int4 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8), 'hex'))::bit(32)::int4

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;


SELECT

hash4('something something something'),

hash4('something something something', 'sha1'),

hash8('something something something'),

hash8('something something something', 'sha1');



Cheers,


Miles

 


Thanks for the custom functions! May be useful as fallback.

But I am really looking for standard functions in Postgres first. Those should be faster and more reliable than writing my own.


Regards

Erwin

Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
Tom Lane
Дата:
Erwin Brandstetter <brsaweda@gmail.com> writes:
> Guess Tom's warning in
> https://www.postgresql.org/message-id/9434.1568839177@sss.pgh.pa.us about
> portability only refers to hashtextextended() and friends not being there
> in Postgres 10 or older.

Well, the other portability issue that is worth considering is that
these functions are only intended to give stable results within a
particular installation; their use for hash indexes does not require
the same results across different platforms.  Notably, most of them
give different answers on little-endian and big-endian machines.

That's not necessarily a reason not to use them, but you have to
be careful what you assume about them.

            regards, tom lane



Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
George Neuner
Дата:
On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:

>On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>> I am looking for stable hash functions producing 8-byte or 4-byte hashes 
>> from long text values in Postgres 10 or later.
>>
>> There is md5(), the result of which can be cast to uuid. This reliably 
>> produces practically unique, stable 16-byte values. I have usecases where 
>> an 8-byte or even 4-byte hash would be good enough to make collisions 
>> reasonably unlikely. (I can recheck on the full string) - and expression 
>> indexes substantially smaller. I could truncate md5 and cast back and 
>> forth, but that seems like a lot of wasted computation. Are there 
>> suggestions for text hash functions that are
>> - fast
>> - keep collisions to a minimum
>> - stable across major Postgres versions (so expression indexes don't break)
>> - croptographic aspect is not needed (acceptable, but no benefit)
>
>What about a CRC32 function?  It's fast, and an SSE4 instruction has been in 
>Intel CPUs for about 10 years.

On long text CRC will not be as discriminating as a real cryptohash,
but it may be considerably faster to compute depending on the length
of the text.  Given that the OP can check the actual string in the
event of a collision, it may work well enough.

One way to make it more discriminating is to compute a pair of CRCs
using different polynomials.  Unfortunately the SSE4.2 CRC32
instruction uses a fixed polynomial.  You can compute it twice using
different initial values, but the result won't be as good as actually
using different polynomials.


I used to create 4-byte hashes by concatenating two 16-bit CRCs - one
using the X-modem polynomial and the other using the CCITT polynomial.
Since these gave very different results, it worked quite well for my
use case where all the strings were guaranteed < 16KB.

YMMV,
George




Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
Erik Aronesty
Дата:
You can always tweak fnv for whatever bite-size or bit size you want.   Sometimes I know a little information about my data shape and make a custom fnv that only looks at the first half for the last half of a string, etc.

On Wed, Dec 11, 2019, 1:02 PM Erwin Brandstetter <brsaweda@gmail.com> wrote:
Thanks for the suggestion. Seems like a good assumption and I have been using hashtext() in the past. But I am uncertain whether it is the best option.

Guess Tom's warning in https://www.postgresql.org/message-id/9434.1568839177@sss.pgh.pa.us about portability only refers to hashtextextended() and friends not being there in Postgres 10 or older.

But why are none of these functions documented? Does the project still not ...

> want people to rely on them continuing to have exactly the current behavior.

I am not complaining, maybe just nobody did the work. But it's also mentioned in this old thread, that hastext() changed in the past. Is all of that outdated and we are welcome to use those functions for indexing?

Filtering with amprocnum = 2 gets functions producing bigint in Postgres 11 or later.  Not sure about the exact meaning of amprocnum, manual says "Support function number".

Remaining problem either way: no hash function returning bigint for Postgres 10.

Regards
Erwin

On Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
> I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.
>
> [...]
>
> There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:
>
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us

Changing a hash function would corrupt hash indexes, wouldn't it?

So I'd expect these functions to be pretty stable:

SELECT amp.amproc
FROM pg_amproc AS amp
   JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
   JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
  AND amp.amprocnum = 1;

Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com

Re: Fast, stable, portable hash function producing 4-byte or 8-bytevalues?

От
Ron
Дата:
On 12/15/19 3:59 PM, George Neuner wrote:
> On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com>
> wrote:
>
>> On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>>> I am looking for stable hash functions producing 8-byte or 4-byte hashes
>>> from long text values in Postgres 10 or later.
>>>
>>> There is md5(), the result of which can be cast to uuid. This reliably
>>> produces practically unique, stable 16-byte values. I have usecases where
>>> an 8-byte or even 4-byte hash would be good enough to make collisions
>>> reasonably unlikely. (I can recheck on the full string) - and expression
>>> indexes substantially smaller. I could truncate md5 and cast back and
>>> forth, but that seems like a lot of wasted computation. Are there
>>> suggestions for text hash functions that are
>>> - fast
>>> - keep collisions to a minimum
>>> - stable across major Postgres versions (so expression indexes don't break)
>>> - croptographic aspect is not needed (acceptable, but no benefit)
>> What about a CRC32 function?  It's fast, and an SSE4 instruction has been in
>> Intel CPUs for about 10 years.
> On long text CRC will not be as discriminating as a real cryptohash,

When specifying a 4 byte hash, something must be sacrificed...

-- 
Angular momentum makes the world go 'round.



Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

От
George Neuner
Дата:
On Sun, 15 Dec 2019 20:23:25 -0600, Ron <ronljohnsonjr@gmail.com>
wrote:

>On 12/15/19 3:59 PM, George Neuner wrote:
>
>> On long text CRC will not be as discriminating as a real cryptohash,
>
>When specifying a 4 byte hash, something must be sacrificed...

Obviously.  But the main point is that CRC never was designed to
uniquely fingerprint data - it was designed to detect corruption of
the data, which is a much weaker guarantee than the cryptodigest
hashes.

Despite being the same length as an MD5 hash, a 128-bit CRC still
might not be as discriminating ... it depends greatly on the CRC
polynomial used and on the length of the input.

George