Discussion: Fast, stable, portable hash function producing 4-byte or 8-byte values?
https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us
I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.

There is md5(), the result of which can be cast to uuid. This reliably produces practically unique, stable 16-byte values. I have use cases where an 8-byte or even 4-byte hash would be good enough to make collisions reasonably unlikely (I can recheck on the full string) - and expression indexes substantially smaller. I could truncate md5 and cast back and forth, but that seems like a lot of wasted computation. Are there suggestions for text hash functions that are
- fast
- keep collisions to a minimum
- stable across major Postgres versions (so expression indexes don't break)
- cryptographic aspect is not needed (acceptable, but no benefit)

There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:

https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us

Postgres 11 added hashtextextended() and friends to generate bigint hashes. In a more recent post from 3 months ago, Tom suggests to use it in user-land - if portability is not needed:

https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us

Is pghashlib by Marko Kreen my best option?

https://github.com/markokr/pghashlib

Or the "version-independent hash functions for PostgreSQL" from Peter Eisentraut:

https://github.com/petere/pgvihash

Neither received updates for a couple of years. Unmaintained? Or obsolete? And neither is available on most hosted services like RDS or Heroku (which would be required in some cases).

So what are my best options?

Regards
Erwin
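For context, the "truncate md5 and cast back and forth" idea mentioned above can be written as a short function. This is a sketch added for illustration, not from the original mail; the name half_md5 is made up:

```sql
-- Hypothetical helper (name half_md5 is invented for illustration):
-- the first 16 hex digits of md5() are exactly 64 bits, so the
-- text-to-bit(64) cast matches the required length.
CREATE OR REPLACE FUNCTION half_md5(p_text text) RETURNS bigint AS $$
SELECT ('x' || left(md5(p_text), 16))::bit(64)::bigint
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

-- Usable in an expression index, e.g.:
-- CREATE INDEX ON big_table (half_md5(long_text_column));
```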
On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
> I am looking for stable hash functions producing 8-byte or 4-byte hashes
> from long text values in Postgres 10 or later.
>
> There is md5(), the result of which can be cast to uuid. This reliably
> produces practically unique, stable 16-byte values. I have use cases where
> an 8-byte or even 4-byte hash would be good enough to make collisions
> reasonably unlikely (I can recheck on the full string) - and expression
> indexes substantially smaller. I could truncate md5 and cast back and
> forth, but that seems like a lot of wasted computation. Are there
> suggestions for text hash functions that are
> - fast
> - keep collisions to a minimum
> - stable across major Postgres versions (so expression indexes don't break)
> - cryptographic aspect is not needed (acceptable, but no benefit)

What about a CRC32 function? It's fast, and an SSE4 instruction has been in Intel CPUs for about 10 years.

> There is an old post from 2012 by Tom Lane suggesting that hashtext() and
> friends are not for users:
>
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us
>
> Postgres 11 added hashtextextended() and friends to generate bigint
> hashes. In a more recent post from 3 months ago, Tom suggests to use it in
> user-land - if portability is not needed:
>
> https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us
>
> Is pghashlib by Marko Kreen my best option?
>
> https://github.com/markokr/pghashlib
>
> Or the "version-independent hash functions for PostgreSQL" from Peter
> Eisentraut:
>
> https://github.com/petere/pgvihash
>
> Neither received updates for a couple of years. Unmaintained? Or obsolete?
> And neither is available on most hosted services like RDS or Heroku (which
> would be required in some cases).
>
> So what are my best options?
>
> Regards
> Erwin

--
Angular momentum makes the world go 'round.
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?
On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
> I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.
>
> [...]
>
> There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:
>
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us
Changing a hash function would corrupt hash indexes, wouldn't it?
So I'd expect these functions to be pretty stable:
SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
AND amp.amprocnum = 1;
Or at least there would have to be a fat warning in the release notes
to reindex hash indexes.
Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
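As an aside (my addition, not part of Laurenz's mail): the support functions that query returns can also be called directly. In Postgres 11+, hashtextextended() additionally takes a bigint seed and returns int8, while hashtext() returns int4 on all supported versions. A quick sanity check, assuming default catalog access:

```sql
-- hashtext(text) returns int4; hashtextextended(text, seed bigint)
-- returns int8 (Postgres 11 or later).
SELECT hashtext('some long text value')            AS h4,
       hashtextextended('some long text value', 0) AS h8;
```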
Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?
In terms of "wasted computation": MD5, SHA-1, and the others always hash the full input before the result is cast to a uuid, int, or whatever. It's a sunk cost. It's also a minor cost, considering many hash algorithms are hardware-accelerated on modern CPUs. All that's left is the truncation and cast, which you can't avoid easily.
Sure, you could reimplement Java's .hashCode() method by iterating through the characters and processing the character codes:
s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]
I don't see how that would beat the CPU-based hashes though unless you wrote a C-based extension. Maybe it's better just to embrace the user-defined function first and then decide if performance is insufficient for your use cases.
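A direct PL/pgSQL transcription of that formula might look like this (my sketch, not from the original mail; it matches Java's .hashCode() only for strings of BMP characters representable as single UTF-16 code units, since ascii() is used in place of charAt()):

```sql
CREATE OR REPLACE FUNCTION java_hashcode(p_text text) RETURNS int4 AS $$
DECLARE
  h bigint := 0;
BEGIN
  FOR i IN 1..length(p_text) LOOP
    -- Horner form of s[0]*31^(n-1) + ... + s[n-1], wrapped to 32 bits
    h := (h * 31 + ascii(substr(p_text, i, 1))) % 4294967296;
  END LOOP;
  IF h >= 2147483648 THEN       -- reinterpret as signed 32-bit
    h := h - 4294967296;
  END IF;
  RETURN h::int4;
END;
$$ LANGUAGE plpgsql IMMUTABLE;
```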
CREATE EXTENSION IF NOT EXISTS pgcrypto;
CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5') RETURNS int8 AS $$
-- take exactly 8 bytes (16 hex digits = 64 bits); the text-to-bit(64)
-- cast errors out if the bit string is longer than the target type
SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8), 'hex'))::bit(64)::int8
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5') RETURNS int4 AS $$
-- take exactly 4 bytes (8 hex digits = 32 bits) to match bit(32)
SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 4), 'hex'))::bit(32)::int4
$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;
SELECT
hash4('something something something'),
hash4('something something something', 'sha1'),
hash8('something something something'),
hash8('something something something', 'sha1');
Cheers,
Miles
Thanks for the custom functions! They may be useful as a fallback.
But I am really looking for standard functions in Postgres first. Those should be faster and more reliable than writing my own.
Regards
Erwin
Erwin Brandstetter <brsaweda@gmail.com> writes:
> Guess Tom's warning in
> https://www.postgresql.org/message-id/9434.1568839177@sss.pgh.pa.us about
> portability only refers to hashtextextended() and friends not being there
> in Postgres 10 or older.

Well, the other portability issue that is worth considering is that these functions are only intended to give stable results within a particular installation; their use for hash indexes does not require the same results across different platforms. Notably, most of them give different answers on little-endian and big-endian machines. That's not necessarily a reason not to use them, but you have to be careful what you assume about them.

regards, tom lane
On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com> wrote:
> On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>> I am looking for stable hash functions producing 8-byte or 4-byte hashes
>> from long text values in Postgres 10 or later.
>>
>> [...]
>
> What about a CRC32 function? It's fast, and an SSE4 instruction has been in
> Intel CPUs for about 10 years.

On long text CRC will not be as discriminating as a real cryptohash, but it may be considerably faster to compute depending on the length of the text. Given that the OP can check the actual string in the event of a collision, it may work well enough.

One way to make it more discriminating is to compute a pair of CRCs using different polynomials. Unfortunately the SSE4.2 CRC32 instruction uses a fixed polynomial. You can compute it twice using different initial values, but the result won't be as good as actually using different polynomials.

I used to create 4-byte hashes by concatenating two 16-bit CRCs - one using the X-modem polynomial and the other using the CCITT polynomial. Since these gave very different results, it worked quite well for my use case where all the strings were guaranteed < 16KB.

YMMV,
George
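Stock Postgres exposes no CRC32 function in SQL, but the "same function, different initial values" idea maps loosely onto the seed parameter of hashtextextended() in Postgres 11+. A sketch of my own, not from George's mail, and subject to the same caveat that one function with two seeds is weaker than two independent polynomials:

```sql
-- Two differently seeded 64-bit hashes of the same value; recheck the
-- full string on the rare occasion that both collide at once.
SELECT hashtextextended('some long text value', 0)  AS seed0,
       hashtextextended('some long text value', 42) AS seed42;
```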
Thanks for the suggestion. Seems like a good assumption, and I have been using hashtext() in the past. But I am uncertain whether it is the best option.

Guess Tom's warning in https://www.postgresql.org/message-id/9434.1568839177@sss.pgh.pa.us about portability only refers to hashtextextended() and friends not being there in Postgres 10 or older.

But why are none of these functions documented? Does the project still not ...

> want people to rely on them continuing to have exactly the current behavior.

I am not complaining, maybe just nobody did the work. But it's also mentioned in this old thread that hashtext() changed in the past. Is all of that outdated, and are we welcome to use those functions for indexing?

Filtering with amprocnum = 2 gets functions producing bigint in Postgres 11 or later. Not sure about the exact meaning of amprocnum; the manual just says "Support function number".

Remaining problem either way: no hash function returning bigint for Postgres 10.

Regards
Erwin

On Tue, Dec 10, 2019 at 11:13 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> On Tue, 2019-12-10 at 22:11 +0100, Erwin Brandstetter wrote:
> > I am looking for stable hash functions producing 8-byte or 4-byte hashes from long text values in Postgres 10 or later.
> >
> > [...]
> >
> > There is an old post from 2012 by Tom Lane suggesting that hashtext() and friends are not for users:
> >
> > https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us
>
> Changing a hash function would corrupt hash indexes, wouldn't it?
> So I'd expect these functions to be pretty stable:
>
> SELECT amp.amproc
> FROM pg_amproc AS amp
> JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
> JOIN pg_am ON opf.opfmethod = pg_am.oid
> WHERE pg_am.amname = 'hash'
> AND amp.amprocnum = 1;
>
> Or at least there would have to be a fat warning in the release notes
> to reindex hash indexes.
>
> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
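The amprocnum = 2 filter mentioned above would be this variant of Laurenz's catalog query (my adaptation, not from the original mails):

```sql
-- amprocnum = 2 is the "extended" hash support function slot: bigint
-- result with a seed argument, populated for most types from Postgres 11 on.
SELECT amp.amproc
FROM pg_amproc AS amp
JOIN pg_opfamily AS opf ON amp.amprocfamily = opf.oid
JOIN pg_am ON opf.opfmethod = pg_am.oid
WHERE pg_am.amname = 'hash'
  AND amp.amprocnum = 2;
```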
On 12/15/19 3:59 PM, George Neuner wrote:
> On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr@gmail.com> wrote:
>> On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>>> I am looking for stable hash functions producing 8-byte or 4-byte hashes
>>> from long text values in Postgres 10 or later.
>>>
>>> [...]
>>
>> What about a CRC32 function? It's fast, and an SSE4 instruction has been in
>> Intel CPUs for about 10 years.
>
> On long text CRC will not be as discriminating as a real cryptohash,

When specifying a 4 byte hash, something must be sacrificed...

--
Angular momentum makes the world go 'round.
On Sun, 15 Dec 2019 20:23:25 -0600, Ron <ronljohnsonjr@gmail.com> wrote:
> On 12/15/19 3:59 PM, George Neuner wrote:
>> On long text CRC will not be as discriminating as a real cryptohash,
>
> When specifying a 4 byte hash, something must be sacrificed...

Obviously. But the main point is that CRC never was designed to uniquely fingerprint data - it was designed to detect corruption of the data, which is a much weaker guarantee than the cryptodigest hashes. Despite being the same length as an MD5 hash, a 128-bit CRC still might not be as discriminating ... it depends greatly on the CRC polynomial used and on the length of the input.

George