Обсуждение: [SQL, regex, words] how to match word boundaries using regex's?

Поиск
Список
Период
Сортировка

[SQL, regex, words] how to match word boundaries using regex's?

От
Tony.Curtis@vcpc.univie.ac.at
Дата:
Hi,

I have been searching all the mailing lists, the on-line
docs, and the source! but cannot find an answer/solution to
this:

I have a varchar() field containing descriptions of things
(doesn't matter what they are).

I want to do a regex match limited to words.

I tried this:

   where ... ~ '\Wword\W';
   where ... ~ '\W*word\W*';
   where ... ~ '\b\(word\)\b';

and other things with LIKE but no joy.

How can I do this?

thanks
tony

Re: [SQL] [SQL, regex, words] how to match word boundaries using regex's?

От
"Gene Selkov Jr."
Дата:
> I want to do a regex match limited to words.
>
> I tried this:
>
>   where ... ~ '\Wword\W';
>   where ... ~ '\W*word\W*';
>   where ... ~ '\b\(word\)\b';
>
> and other things with LIKE but no joy.

Based on the comments in the source, regexp stuff used in postgres is something like this:
http://tiger8.com/us/regexp-manpage.html

I guess there are no backslash macros is POSIX expressions. No joy. By the way, I am wondering what determined the
choiceof the regexp machine for postgres? Is it performance-related? Is it possible to have the same stuff as in perl? 

As to your question, how about a poor man's Altavista like this:

Split the text into words before loading into a special index table. Words are numbered sequentially, so you can search
for"phrases":  

Table    = word
+----------------------------------+----------------------------------+-------+
|              Field               |              Type                | Length|
+----------------------------------+----------------------------------+-------+
| rec                              | char()                           |    12 |
| seq                              | int4                             |     4 |
| word                             | text                             |   var |
+----------------------------------+----------------------------------+-------+

SELECT DISTINCT w1.rec
FROM word w1, word w2
WHERE
      w1.word ~ '^a$'
  AND w2.word ~ '^phrase$'
  AND w1.rec = w2.rec
  AND w2.seq - w1.seq = 1; -- Distance between the words

This way, you can control what represents the concept of a 'word' by an external program (perl script, etc.)

Certainly, this method will show suboptimal performance with extra large tables and more than three or four words in a
seachphrase. But it is possible to optimise by delegating set operations (joins) and position arithmetic to the client.
Itworks very well for my ~500k tables and the most common queries. 

--Gene

Re: [SQL] [SQL, regex, words] how to match word boundaries using regex's?

От
Tony.Curtis@vcpc.univie.ac.at
Дата:
Re: [SQL] [SQL, regex, words] how to match word boundaries
using regex's? , Gene <selkovjr@mcs.anl.gov> said:

Gene> I guess there are no backslash macros is POSIX
Gene> expressions. No joy. By the way, I am wondering what

No joy, Ren :-(

Gene> determined the choice of the regexp machine for
Gene> postgres? Is it performance-related? Is it possible to
Gene> have the same stuff as in perl?

That would be really nice.  I was just considering how easy
it would be to pull out the current regex backend directory
and put perl-type (or GNU) ones there instead.  Or is that
not the way to do it?

Could I write a new backend with overloaded/new operators to
do this?

Gene> Split the text into words before loading into a
Gene> special index table. Words are numbered sequentially,
Gene> so you can search for "phrases":

I had thought about doing this because the description
string will probably require interpretation at some level.

But I was looking for a straight-forward
inefficiency-be-damned-it's-only-a-demo-for-now solution.

tony