Обсуждение: How to generate unique hash-type id?
Hello, I need to generate unique id which is not guessable unlike serial(integer) type. I need an id in format like md5 hash of random number. On top of that I need this id to be unique across multiple tables. Anyone had to solve this problem before? Can you post any recipes or best practices please? My questions: 1. Avoiding collisions. If I make an UNIQUE constraint and do generation of id triggered on INSERT. What if collision happens? DO I nee d to check if unique hash already exists and if not- regenerate. This looks too primitive. Is there a readily available function or methodology to do that? 2. Generating global unique id across multiple tables. How to do that? My only idea is to have separate table to keep all hashes and compare for collision against that table. Is there a better way? Maybe by creating some special serial type that is not integer but varchar? 3. what function to use to generate 64-bit random hash without much overhead to CPU? Thanks.
Uuid?
Il giorno 29 gen, 2010 9:20 m., "Joe Kramer" <cckramer@gmail.com> ha scritto:
Hello,
I need to generate unique id which is not guessable unlike
serial(integer) type. I need an id in format like md5 hash of random
number.
On top of that I need this id to be unique across multiple tables.
Anyone had to solve this problem before? Can you post any recipes or
best practices please?
My questions:
1. Avoiding collisions.
If I make an UNIQUE constraint and do generation of id triggered on
INSERT. What if collision happens? DO I nee d to check if unique hash
already exists and if not- regenerate.
This looks too primitive. Is there a readily available function or
methodology to do that?
2. Generating global unique id across multiple tables.
How to do that? My only idea is to have separate table to keep all
hashes and compare for collision against that table.
Is there a better way? Maybe by creating some special serial type that
is not integer but varchar?
3. what function to use to generate 64-bit random hash without much
overhead to CPU?
Thanks.
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi, On Friday 29 January 2010 09.20:33 Joe Kramer wrote: > I need to generate unique id which is not guessable unlike > serial(integer) type. I need an id in format like md5 hash of random > number. > On top of that I need this id to be unique across multiple tables. Have a look at http://www.postgresql.org/docs/8.3/static/uuid-ossp.html The usual approach is that (given a sensible random generator[1]) uuid are assumed to be unique[2]. So you don't need to check because the probability of collisions is so small that for practical purposes you can just ignore it. (If your engineer's mind balks at this, consider that you're trusting this already when you use digital cryptography / signatures, for example https certificates.) I haven't looked at this module myself, but from the experience with generating gpg keys on an appliance: if you need lots of randomness, the geneation of random numbers might be your bottleneck. OTOH, our platform didn't have disks and usually there was no network traffic while your average db server has both, and on many systems there is a hardware random generator, so this might not be an issue. cheers -- vbi [1] like, for example: http://www.dilbert.com/strips/comic/2001-10-25/ [2] you'll want v4 uuids -- Linus has opinions, I have opinions, everybody else has opinions, and the only consistency here is that most of us are wrong most of the time. -- Andrew Morton, OLS 2004
Вложения
On Fri, Jan 29, 2010 at 10:31, Adrian von Bidder <avbidder@fortytwo.ch> wrote: > > Hi, > > On Friday 29 January 2010 09.20:33 Joe Kramer wrote: >> I need to generate unique id which is not guessable unlike >> serial(integer) type. I need an id in format like md5 hash of random >> number. >> On top of that I need this id to be unique across multiple tables. > > Have a look at http://www.postgresql.org/docs/8.3/static/uuid-ossp.html > > The usual approach is that (given a sensible random generator[1]) uuid are > assumed to be unique[2]. So you don't need to check because the probability > of collisions is so small that for practical purposes you can just ignore > it. > > (If your engineer's mind balks at this, consider that you're trusting this > already when you use digital cryptography / signatures, for example https > certificates.) uuid:s are, AFAIK, not cryptographically strong. They are predictable - a lot less predictable than a sequence, but still. If you want secure random numbers, you need to look at pgcrypto - http://www.postgresql.org/docs/current/static/pgcrypto.html -- Magnus Hagander Me: http://www.hagander.net/ Work: http://www.redpill-linpro.com/
We have bunch of servers running the app and rebuilding postgres with support for ossp_uuid on all servers is time consuming. Is there a way of doing it without third party dependency like ossp_uuid? Should I just run md5(random number), will itbe the same ?. According to description it seems that all uuid_generate_v4() does is simply generate md5 of random number. Thanks. On Fri, Jan 29, 2010 at 8:31 PM, Adrian von Bidder <avbidder@fortytwo.ch> wrote: > > Hi, > > On Friday 29 January 2010 09.20:33 Joe Kramer wrote: >> I need to generate unique id which is not guessable unlike >> serial(integer) type. I need an id in format like md5 hash of random >> number. >> On top of that I need this id to be unique across multiple tables. > > Have a look at http://www.postgresql.org/docs/8.3/static/uuid-ossp.html > > The usual approach is that (given a sensible random generator[1]) uuid are > assumed to be unique[2]. So you don't need to check because the probability > of collisions is so small that for practical purposes you can just ignore > it. > > (If your engineer's mind balks at this, consider that you're trusting this > already when you use digital cryptography / signatures, for example https > certificates.) > > I haven't looked at this module myself, but from the experience with > generating gpg keys on an appliance: if you need lots of randomness, the > geneation of random numbers might be your bottleneck. OTOH, our platform > didn't have disks and usually there was no network traffic while your > average db server has both, and on many systems there is a hardware random > generator, so this might not be an issue. > > cheers > -- vbi > > > [1] like, for example: http://www.dilbert.com/strips/comic/2001-10-25/ > [2] you'll want v4 uuids > -- > Linus has opinions, I have opinions, everybody else has opinions, and > the only consistency here is that most of us are wrong most of the time. > -- Andrew Morton, OLS 2004 >
On Friday 29 January 2010 11.21:00 Joe Kramer wrote: > We have bunch of servers running the app and rebuilding postgres with > support for ossp_uuid on all servers is time consuming. > Is there a way of doing it without third party dependency like > ossp_uuid? Should I just run md5(random number), will itbe the same ? If you're building your own: at least use sha1 instead of md5. (Even md5 *should* be safe in the absence of malicious attacks, but md5 is generally not recommended anymore.) Everything depends on the quality of your random numbers. I don't know how much randomness pg's random() delivers, and as I've said I haven't looked what the uuid module does. (To give you an example: if random() only delivers a random 32 bit float value, the 160 bits of SHA-1 will not be used. You'll only use 4 billion different values and you *will* soon get collisions.) If I were to roll my own, I'd just use 256 bit of /dev/random (or, depending on the application, possibly /dev/urandom and take the risk that my values aren't that random.) Since it's random anyway, there's no need to use a hash. (Not sure: can a SQL function read arbitrary binary files or will a C module be necessary?) Speed: just did a quick test on one machine. reading 1kB from /dev/random takes about 1s. (constant 5MB/s disk activity with lots of seeking going on, no hw random device.) So you'd get ca. 32 id values per second. Don't know if that's a lot or not for your application. Magnus: can you elaborate on uuid not being secure? AFAICT v4 uuid are supposed to be essentially a random number formatted in a certain way. cheers -- vbi -- featured product: GNU Privacy Guard - http://gnupg.org
Вложения
On Fri, Jan 29, 2010 at 07:20:33PM +1100, Joe Kramer wrote: > I need to generate unique id which is not guessable unlike > serial(integer) type. I need an id in format like md5 hash of random > number. check this blogpost: http://www.depesz.com/index.php/2007/06/25/random-text-record-identifiers/ > On top of that I need this id to be unique across multiple tables. just add table id to the generated id. for example: id "xxx" in table users, is globally unique (for your database) when you write it: "users:xxx" if, for some weird reason, you don't want to put table name on its own in your key (why?) then just use some dictionary table, that will link those keys with your table. Best regards, depesz -- Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/ jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007
Thanks for the answer, I am unable to use ossp_uuid due to package install and/or server rebuild requirement. So I am trying to roll my own, and digest(quote_literal(random()+random()), 'sha256'), 'hex') doesn't work: I have created this table and inserted 200000 rows (two million). This is more or less now my application looks now. It uses bigserial. And I need to add some unique hash: CREATE TABLE item ( item_id bigserial NOT NULL, title character varying, CONSTRAINT pk PRIMARY KEY (item_id) ) WITH ( OIDS=FALSE ); Now I add the hash column: ALTER TABLE item ADD COLUMN hash1 character varying NOT NULL DEFAULT encode(digest(quote_literal(random()+random()), 'sha256'), 'hex'); ALTER TABLE item ADD UNIQUE (hash1); When I executed this two statements, ALTER TABLE ADD COUMN, ADD UNIQUE, after 20 seconds I got this message: NOTICE: ALTER TABLE / ADD UNIQUE will create implicit index "item_hash1_key" for table "item" ERROR: could not create unique index "item_hash1_key" DETAIL: Table contains duplicated values. ********* Error ********** ERROR: could not create unique index "item_hash1_key" SQL state: 23505 Detail: Table contains duplicated values. So this means random()+random() is not random even within 2,000,000 iterations! If you suggest accessing /dev/urandom directly- I cannot do that because my application runs on mac,windows and linux. It would be maintenance nightmare. Any suggestions? Thanks. On Fri, Jan 29, 2010 at 10:20 PM, Adrian von Bidder <avbidder@fortytwo.ch> wrote: > On Friday 29 January 2010 11.21:00 Joe Kramer wrote: >> We have bunch of servers running the app and rebuilding postgres with >> support for ossp_uuid on all servers is time consuming. >> Is there a way of doing it without third party dependency like >> ossp_uuid? Should I just run md5(random number), will itbe the same ? > > If you're building your own: at least use sha1 instead of md5. > > (Even md5 *should* be safe in the absence of malicious attacks, but md5 is > generally not recommended anymore.) > > Everything depends on the quality of your random numbers. I don't know how > much randomness pg's random() delivers, and as I've said I haven't looked > what the uuid module does. > > (To give you an example: if random() only delivers a random 32 bit float > value, the 160 bits of SHA-1 will not be used. You'll only use 4 billion > different values and you *will* soon get collisions.) > > If I were to roll my own, I'd just use 256 bit of /dev/random (or, depending > on the application, possibly /dev/urandom and take the risk that my values > aren't that random.) Since it's random anyway, there's no need to use a > hash. (Not sure: can a SQL function read arbitrary binary files or will a C > module be necessary?) > > Speed: just did a quick test on one machine. reading 1kB from /dev/random > takes about 1s. (constant 5MB/s disk activity with lots of seeking going > on, no hw random device.) So you'd get ca. 32 id values per second. Don't > know if that's a lot or not for your application. > > Magnus: can you elaborate on uuid not being secure? AFAICT v4 uuid are > supposed to be essentially a random number formatted in a certain way. > > cheers > -- vbi > > > -- > featured product: GNU Privacy Guard - http://gnupg.org >
On Friday 29 January 2010 12.51:20 Joe Kramer wrote: > So this means random()+random() is not random even within 2,000,000 > iterations! > Exactly the issue I wrote about: random() apparently doesn't deliver enough randomness. Even if it did: quote_literal(random() + random()) is ca. 14 to 16 decimals, so this gets you ca. 40 to 50 bits of randomness (again: assuming random() actually can produce it! You'll really have to look into the implementation of random() for all your supported versions of pg and platforms!) So, of the 160 bit that are possible for a sha1 hash, your're only using 25% (and that's 25% of the bits. That means you're only using 1e12 of the 1.4e48 possible values. which is about 0.000000000000000000000000000000000001% if my math isn't totally bogus. (Read up about the birthday paradoxon to get an idea on how likely a collision is depending on how long your random number is.) cheers -- vbi -- featured link: http://www.pool.ntp.org
Вложения
On 2010-01-29, Joe Kramer wrote: > Thanks for the answer, > > I am unable to use ossp_uuid due to package install and/or server > rebuild requirement. > > So I am trying to roll my own, and > digest(quote_literal(random()+random()), 'sha256'), 'hex') doesn't work: > Your input value is a random number, those aren't supposed to be unique within any predictable number of iterations. Moreover I'd suspect, that random() + random() significantly increases the probability to create duplicated values due to the properties of the add-operation. In the simplest example, the first invocation may return random number 3, the second one random number 5 creating an input of 8 for the digest-function. Another outcome of the random number generator may yield 7 and 1 giving the same input. I'd suggest to use some kind of sequence or something constructed from the primary keys. But you may still see hash collisions although the input is different. -- Robert...
On Fri, 29 Jan 2010 13:13:17 +0100 "Wappler, Robert" <rwappler@ophardt.com> wrote: > I'd suggest to use some kind of sequence or something constructed > from the primary keys. But you may still see hash collisions > although the input is different. Concatenate /* ::text */ random() with something like: http://www.webthatworks.it/d1/node/page/pseudo_random_sequences_postgresql -- Ivan Sergio Borgonovo http://www.webthatworks.it
On 29/01/2010 4:20 PM, Joe Kramer wrote: > Hello, > > I need to generate unique id which is not guessable unlike > serial(integer) type. I need an id in format like md5 hash of random > number. > On top of that I need this id to be unique across multiple tables. > > Anyone had to solve this problem before? Can you post any recipes or > best practices please? > > My questions: > > 1. Avoiding collisions. > If I make an UNIQUE constraint and do generation of id triggered on > INSERT. What if collision happens? DO I nee d to check if unique hash > already exists and if not- regenerate. > This looks too primitive. Is there a readily available function or > methodology to do that? > > 2. Generating global unique id across multiple tables. > How to do that? My only idea is to have separate table to keep all > hashes and compare for collision against that table. > Is there a better way? Maybe by creating some special serial type that > is not integer but varchar? > > 3. what function to use to generate 64-bit random hash without much > overhead to CPU? When I ran into something somewhat akin to this I asked the list about a non-repeating pseudo-random mapping function. Daniel Verite (on this list) enlightened me about Feistel networks/cyphers, and even posted a PL/PgSQL implementation! It's documented here: http://wiki.postgresql.org/wiki/Pseudo_encrypt and has been extremely handy. It should fit you needs - just define a sequence that you pull new input values from, and use that same sequence across all tables that need unique values. It's not trivial to assert, across multiple tables, that a value is unique. You could do it with a trigger that checks each table that uses such values (slow-ish but effective) or maintains a side table of values in use. In either case it'd have to be added to every table that used the pseudo_encrypt values. By the way, if you intend to expose these to users you might also want some kind of data entry error checking so that a typo can't accidentally transform id `n' to id `m' with transposition of a single digit or the like. The luhn algorithm provides a good way to do that with a simple check digit - it's not cryptographically strong in that if you know the values are checked with the luhn algorithm it's trivial to re-generate the check digit, but it helps a log against casual scanning of the number space and against accidental user data entry error. Check out: http://wiki.postgresql.org/wiki/Luhn_algorithm -- Craig Ringer
Thanks for that link Depesz! It worked, I've run ALTER TABLE with your function and didn't have collisions. I guess it's more bulletproof because random() is called not once, but for every character therefore reducing possibility of collision by multitude of number of bytes in hash. CREATE OR REPLACE FUNCTION make_random_string(string_length INT4) RETURNS TEXT LANGUAGE 'plpgsql' AS $BODY$ DECLARE possible_chars TEXT = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'; output TEXT = ''; i INT4; pos INT4; BEGIN FOR i IN 1..string_length LOOP pos := 1 + cast( random() * ( length(possible_chars) - 1) as INT4 ); output := output || substr(possible_chars, pos, 1); END LOOP; RETURN output; END; $BODY$; CREATE TABLE item ( item_id bigserial NOT NULL, title character varying, CONSTRAINT pk PRIMARY KEY (item_id) ) WITH ( OIDS=FALSE ); .... LOOP INSERT INTO item( title) VALUES ('title1'); count = count+1; EXIT WHEN count > 10000000; END LOOP; .... ALTER TABLE item ADD COLUMN hash1 character varying NOT NULL DEFAULT make_random_string(64); ALTER TABLE item ADD UNIQUE (hash1); Query returned successfully with no result in 2120670 ms. It worked! No collisions on 10 million records. Now a question. Is it okay to add calculated column this way by specifying DEFAULT. Or I'm better using INSERT trigger? is DEFAULT basically an internal insert trigger? Thanks. On Fri, Jan 29, 2010 at 10:50 PM, hubert depesz lubaczewski <depesz@depesz.com> wrote: > On Fri, Jan 29, 2010 at 07:20:33PM +1100, Joe Kramer wrote: >> I need to generate unique id which is not guessable unlike >> serial(integer) type. I need an id in format like md5 hash of random >> number. > > check this blogpost: > http://www.depesz.com/index.php/2007/06/25/random-text-record-identifiers/ > >> On top of that I need this id to be unique across multiple tables. > > just add table id to the generated id. > > for example: id "xxx" in table users, is globally unique (for your > database) when you write it: "users:xxx" > > if, for some weird reason, you don't want to put table name on its own > in your key (why?) then just use some dictionary table, that will link > those keys with your table. > > Best regards, > > depesz > > -- > Linkedin: http://www.linkedin.com/in/depesz / blog: http://www.depesz.com/ > jid/gtalk: depesz@depesz.com / aim:depeszhdl / skype:depesz_hdl / gg:6749007 >