Обсуждение: UNIQUE constraint on character sequences
Dear list,
I would be pleased if you could share some thoughts with me on the following: say I wish to maintain a table with all distinct character sequences (variable length) showing series with strong similarities. Example:
"abbbabacccdef"
"abbbabaccdcdf"
"abbbabaccdcgf"
...
"qwtrhdffdd"
...
"qwtrhdffdds"
...
"qwtrhdffddsspp"
"qwtrhdffddsspf"
"qwtrhdffddssph"
"qwtrhdffddsspL"
etc.
Think of them as ordered values (array-like), as a set having many values in common, in the same "elements" (that is: positions; in my application, each position has some particular meaning -- the sequence represents a set of particular settings)
CREATE TABLE textseqs(txtseq TEXT)
What would be an efficient approach in enforcing a UNIQUE constraint?
I was thinking of using hashbuckets in a b-tree:
CREATE UNIQUE INDEX ON textseqs USING BTREE( hashtext(txtseq), txtseq )
This index would "cache" hashes for each row. Upon inserting of a new row, traversing the index involves the comparison of two single integers for each node in the b-tree, until the actual hash value (if it exists) was reached. Then, only within that bucket (the hashes won't be unique), the more expensive string comparing is required; involving a sequencial comparison of (potentially maaaany) characters. Yet, within that bucket, the character series may be expected to show stronger differences than a plain sorted list of all values in the table would have, indexed by a b-tree index. Wouldn't traversing such a plain (non-composite, single column, on: "txtseq") b-tree index involve a sequencial comparison of (potentially many) characters *at EACH NODE* of the tree ? Or am I mistaken that each node is filled with actual values from the txtseq column?
Thank you for your input!
Cheers,
Rob
Could you index the reverse of the string so the unique part appears first? On May 14, 2011, at 11:20, InterRob <rob.marjot@gmail.com> wrote: > Dear list, > > I would be pleased if you could share some thoughts with me on the following: say I wish to maintain a table with all distinctcharacter sequences (variable length) showing series with strong similarities. Example: > "abbbabacccdef" > "abbbabaccdcdf" > "abbbabaccdcgf" > ... > "qwtrhdffdd" > ... > "qwtrhdffdds" > ... > "qwtrhdffddsspp" > "qwtrhdffddsspf" > "qwtrhdffddssph" > "qwtrhdffddsspL" > etc. > > Think of them as ordered values (array-like), as a set having many values in common, in the same "elements" (that is: positions;in my application, each position has some particular meaning -- the sequence represents a set of particular settings) > > CREATE TABLE textseqs(txtseq TEXT) > > What would be an efficient approach in enforcing a UNIQUE constraint? > > I was thinking of using hashbuckets in a b-tree: > CREATE UNIQUE INDEX ON textseqs USING BTREE( hashtext(txtseq), txtseq ) > > This index would "cache" hashes for each row. Upon inserting of a new row, traversing the index involves the comparisonof two single integers for each node in the b-tree, until the actual hash value (if it exists) was reached. Then,only within that bucket (the hashes won't be unique), the more expensive string comparing is required; involving a sequencialcomparison of (potentially maaaany) characters. Yet, within that bucket, the character series may be expected toshow stronger differences than a plain sorted list of all values in the table would have, indexed by a b-tree index. Wouldn'ttraversing such a plain (non-composite, single column, on: "txtseq") b-tree index involve a sequencial comparisonof (potentially many) characters *at EACH NODE* of the tree ? Or am I mistaken that each node is filled with actualvalues from the txtseq column? > > Thank you for your input! > > Cheers, > Rob