Обсуждение: Fuzzy string matching of product names

Поиск
Список
Период
Сортировка

Fuzzy string matching of product names

От
Peter Geoghegan
Дата:
Hello,

At the moment, users of my application, which runs on 8.4.3, may
search for products in a way that is implemented roughly like this:

SELECT * FROM products WHERE description ILIKE '%%usr_string%%';

This works reasonably well. However, I thought it would be a nice
touch to give my users leeway to spell product names incorrectly when
searching, or to not have to remember if a product is entered as "coca
cola", "CocaCola" or "Coca-cola". At the moment, they don't have to
worry about case sensitivity because I use ILIKE - I'd like to
preserve that. I'd also like to not have it weigh against them heavily
when they don't search for a specific product, but just a common
substring. For example, if they search for "coca-cola", there may be a
number of different coca-cola products: "CocaCola 330ml can",
"Coca-Cola 2 litre bottle", but no actual plain "cocacola". That ought
to not matter too much - all cocacola products should be returned.

This isn't important enough for me to be willing to add a big
dependency to my application. I'd really prefer to limit myself to the
contrib modules. pg_trgm and fuzzystrmatch look very promising, but
it's not obvious how I can use either to achieve what I want.
Postgres's built-in regex support may have a role to play too.

I can live with it not being indexable, because typically there are
only tens of thousands of products in a production system.

Could someone suggest an approach that is reasonably simple and
reasonably generic ?

Thanks,
Peter Geoghegan

Re: Fuzzy string matching of product names

От
Bill Moran
Дата:
In response to Peter Geoghegan <peter.geoghegan86@gmail.com>:

> Hello,
>
> At the moment, users of my application, which runs on 8.4.3, may
> search for products in a way that is implemented roughly like this:
>
> SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
>
> This works reasonably well. However, I thought it would be a nice
> touch to give my users leeway to spell product names incorrectly when
> searching, or to not have to remember if a product is entered as "coca
> cola", "CocaCola" or "Coca-cola". At the moment, they don't have to
> worry about case sensitivity because I use ILIKE - I'd like to
> preserve that. I'd also like to not have it weigh against them heavily
> when they don't search for a specific product, but just a common
> substring. For example, if they search for "coca-cola", there may be a
> number of different coca-cola products: "CocaCola 330ml can",
> "Coca-Cola 2 litre bottle", but no actual plain "cocacola". That ought
> to not matter too much - all cocacola products should be returned.
>
> This isn't important enough for me to be willing to add a big
> dependency to my application. I'd really prefer to limit myself to the
> contrib modules. pg_trgm and fuzzystrmatch look very promising, but
> it's not obvious how I can use either to achieve what I want.
> Postgres's built-in regex support may have a role to play too.
>
> I can live with it not being indexable, because typically there are
> only tens of thousands of products in a production system.
>
> Could someone suggest an approach that is reasonably simple and
> reasonably generic ?

http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Fuzzy string matching of product names

От
Brian Modra
Дата:
On 05/04/2010, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote:
> Hello,
>
> At the moment, users of my application, which runs on 8.4.3, may
> search for products in a way that is implemented roughly like this:
>
> SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
>
> This works reasonably well. However, I thought it would be a nice
> touch to give my users leeway to spell product names incorrectly when
> searching, or to not have to remember if a product is entered as "coca
> cola", "CocaCola" or "Coca-cola". At the moment, they don't have to
> worry about case sensitivity because I use ILIKE - I'd like to
> preserve that. I'd also like to not have it weigh against them heavily
> when they don't search for a specific product, but just a common
> substring. For example, if they search for "coca-cola", there may be a
> number of different coca-cola products: "CocaCola 330ml can",
> "Coca-Cola 2 litre bottle", but no actual plain "cocacola". That ought
> to not matter too much - all cocacola products should be returned.
>
> This isn't important enough for me to be willing to add a big
> dependency to my application. I'd really prefer to limit myself to the
> contrib modules. pg_trgm and fuzzystrmatch look very promising, but
> it's not obvious how I can use either to achieve what I want.
> Postgres's built-in regex support may have a role to play too.
>
> I can live with it not being indexable, because typically there are
> only tens of thousands of products in a production system.
>
> Could someone suggest an approach that is reasonably simple and
> reasonably generic ?

What I do is to create another column that has a simplified version of
the string in it.
(I created a function to simplify strings, and when the source column
is changed or inserted, I also update the "simplified" column.
Then when searching, I use the same function to "simplify" the search
string and use "=" to test against the "simplified" column.

E.g.
if the table has a column called "name" that you want to search, you
create a name_simplified column, and fill it as so:
update your_table set name_simplified=yourSimplifyFunction(name);

Then to search:
select * from your_table where simplified_name =
yourSimplifyFunction('Coca-Cola');

This is really fast, because the match is using the index rather than
a sequential scan.

>
> Thanks,
> Peter Geoghegan
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


--
Brian Modra   Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/

Re: Fuzzy string matching of product names

От
George Silva
Дата:
The above is true. For geocoding the same idea is used: the metaphone function is used against street names, and searched to a simples column, filled with the results of the metaphone function. It works quite well.

George

On Mon, Apr 5, 2010 at 4:23 PM, Brian Modra <brian@zwartberg.com> wrote:
On 05/04/2010, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote:
> Hello,
>
> At the moment, users of my application, which runs on 8.4.3, may
> search for products in a way that is implemented roughly like this:
>
> SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
>
> This works reasonably well. However, I thought it would be a nice
> touch to give my users leeway to spell product names incorrectly when
> searching, or to not have to remember if a product is entered as "coca
> cola", "CocaCola" or "Coca-cola". At the moment, they don't have to
> worry about case sensitivity because I use ILIKE - I'd like to
> preserve that. I'd also like to not have it weigh against them heavily
> when they don't search for a specific product, but just a common
> substring. For example, if they search for "coca-cola", there may be a
> number of different coca-cola products: "CocaCola 330ml can",
> "Coca-Cola 2 litre bottle", but no actual plain "cocacola". That ought
> to not matter too much - all cocacola products should be returned.
>
> This isn't important enough for me to be willing to add a big
> dependency to my application. I'd really prefer to limit myself to the
> contrib modules. pg_trgm and fuzzystrmatch look very promising, but
> it's not obvious how I can use either to achieve what I want.
> Postgres's built-in regex support may have a role to play too.
>
> I can live with it not being indexable, because typically there are
> only tens of thousands of products in a production system.
>
> Could someone suggest an approach that is reasonably simple and
> reasonably generic ?

What I do is to create another column that has a simplified version of
the string in it.
(I created a function to simplify strings, and when the source column
is changed or inserted, I also update the "simplified" column.
Then when searching, I use the same function to "simplify" the search
string and use "=" to test against the "simplified" column.

E.g.
if the table has a column called "name" that you want to search, you
create a name_simplified column, and fill it as so:
update your_table set name_simplified=yourSimplifyFunction(name);

Then to search:
select * from your_table where simplified_name =
yourSimplifyFunction('Coca-Cola');

This is really fast, because the match is using the index rather than
a sequential scan.

>
> Thanks,
> Peter Geoghegan
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general
>


--
Brian Modra   Land line: +27 23 5411 462
Mobile: +27 79 69 77 082
5 Jan Louw Str, Prince Albert, 6930
Postal: P.O. Box 2, Prince Albert 6930
South Africa
http://www.zwartberg.com/

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



--
George R. C. Silva

Desenvolvimento em GIS
http://blog.geoprocessamento.net

Re: Fuzzy string matching of product names

От
Peter Geoghegan
Дата:
> http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html
>
> --
> Bill Moran
> http://www.potentialtech.com
> http://people.collaborativefusion.com/~wmoran/

Fuzzystrmatch is generally used to compare two single words for how
similar they sound. How can that actually be applied to get the
functionality that I've described?

Regards,
Peter Geoghegan

Re: Fuzzy string matching of product names

От
Bill Moran
Дата:
In response to Peter Geoghegan <peter.geoghegan86@gmail.com>:

> > http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html
> >
> > --
> > Bill Moran
> > http://www.potentialtech.com
> > http://people.collaborativefusion.com/~wmoran/
>
> Fuzzystrmatch is generally used to compare two single words for how
> similar they sound. How can that actually be applied to get the
> functionality that I've described?

Well, it really depends on your particular situation and what you
want to support.  You could break the name down into individual
words and generate metaphones, then use like to match on metaphone:

'The Candlestick Corporation, Limited' -> 'TE CDSK CPRN LMTD'

Searching for "candlestick" -> WHERE metaphone column like '%CDSK%'

Or you could create an array column that has all the metaphones in
it and use an ANY() or ALL() match to find ones that apply.

Exactly how you implement depends on how far you want to go.  Do you
want to support OR matches, AND matches, or both?  Can the words be
out of order?

You could also use Levenshtein as a percentage function to find matches,
even on long strings with multiple words.  Since Levenshtein gives you
the number of alterations between two strings, using that as a percentage
of the total string length gives you a pretty good gauge of how close
they are overall, and would allow you to set a threshold, or possibly even
list results by relevance.

--
Bill Moran
http://www.potentialtech.com
http://people.collaborativefusion.com/~wmoran/

Re: Fuzzy string matching of product names

От
Leif Biberg Kristensen
Дата:
On Monday 5. April 2010 22.00.41 Peter Geoghegan wrote:
> similar they sound. How can that actually be applied to get the
> functionality that I've described?

I've got a similar problem in my 18th century research, when clerks usually
took pride in being able to spell a name in any number of ways. I've landed on
a solution where I'm sending search strings to SIMILAR TO. I usually get far
too many hits, but it's much easier to browse through 100 hits than the entire
dataset which is approaching 60,000 records.

Optimizing the search strings is based upon a lot of experience.

It would probably be better to add a column with normalized names, but the
amount of work involved with that is staggering. I eventually associate most
of the records to «persons» with normalized names, but the search process can
sometimes be very frustrating, and it would really help with some kind of
fuzzy search.

Just in case anyone should suggest it: Both Soundex and Metaphone are useless
for Norwegian 18th century names.

regards,
--
Leif Biberg Kristensen
http://solumslekt.org/

Re: Fuzzy string matching of product names

От
Peter Geoghegan
Дата:
> I've got a similar problem in my 18th century research, when clerks usually
> took pride in being able to spell a name in any number of ways. I've landed on
> a solution where I'm sending search strings to SIMILAR TO. I usually get far
> too many hits, but it's much easier to browse through 100 hits than the entire
> dataset which is approaching 60,000 records.
>
> Optimizing the search strings is based upon a lot of experience.

That sounds like an interesting problem - mine sounds mundane in comparison.

I now seem to be getting reasonable results with pg_trgm coupled with
ILIKE. I ORDER BY description ILIKE '%%usr_str%%' DESC,
similarity(description, 'usr_str') DESC , prioritising products that
actually contain the user specified string. I have tweaked the "is
greater than similarity" of my queries, to include a bit of rubbish to
be on the safe side, but not too much. I've done this with what I
believe to be a representative dataset. You can create a GiST or GIN
index on text fields for these queries, which is nice.

Have you tried using pg_trgm with your dataset? You might have more
success. It isn't biased towards a particular natural language. Also,
I suggest you avoid SQL regular expressions (which are supported with
SIMILAR TO), and use the ~ operator instead, which gives you the more
powerful POSIX regular expressions, unless portability is a major
concern.

Regards,
Peter Geoghegan

Re: Fuzzy string matching of product names

От
Andy Colson
Дата:
On 4/5/2010 3:00 PM, Peter Geoghegan wrote:
>> http://www.postgresql.org/docs/8.4/static/fuzzystrmatch.html
>>
>> --
>> Bill Moran
>> http://www.potentialtech.com
>> http://people.collaborativefusion.com/~wmoran/
>
> Fuzzystrmatch is generally used to compare two single words for how
> similar they sound. How can that actually be applied to get the
> functionality that I've described?
>
> Regards,
> Peter Geoghegan
>


You could use regexp_split_to_table to split words, then soundx each
one, and find matching...  assuming you have a table like:

andy=# select * from tmpx;
  id |      cname
----+-----------------
   1 | andy corp
   2 | cola corp
   3 | pepsi cola corp
   4 | bob inc
(4 rows)


(I dont have the soundx stuff installed, so I'll uppercase instead)

andy=# select id from  (select id, upper(regexp_split_to_table(cname,
E'\\s+')) as sndx from  tmpx) as foo where sndx = 'CORP';
  id
----
   1
   2
   3
(3 rows)


and actually, you probably want "select distinct id from..."

-Andy

Re: Fuzzy string matching of product names

От
Dimitri Fontaine
Дата:
Leif Biberg Kristensen <leif@solumslekt.org> writes:
> On Monday 5. April 2010 22.00.41 Peter Geoghegan wrote:
>> similar they sound. How can that actually be applied to get the
>> functionality that I've described?
>
> I've got a similar problem in my 18th century research, when clerks usually
> took pride in being able to spell a name in any number of ways. I've landed on
> a solution where I'm sending search strings to SIMILAR TO. I usually get far
> too many hits, but it's much easier to browse through 100 hits than the entire
> dataset which is approaching 60,000 records.

In both your cases I'd play with trigram search. The idea is dead simple
and it performs really well. It's the poor man's Full Text Search, but
for catalog look ups it's exactly what you want I suppose:

  http://www.postgresql.org/docs/8.3/static/pgtrgm.html

Regards,
--
dim

Re: Fuzzy string matching of product names

От
Bruce Momjian
Дата:
George Silva wrote:
> The above is true. For geocoding the same idea is used: the metaphone
> function is used against street names, and searched to a simples column,
> filled with the results of the metaphone function. It works quite well.

I would think an expression index would be better than a separate
column.

---------------------------------------------------------------------------


>
> George
>
> On Mon, Apr 5, 2010 at 4:23 PM, Brian Modra <brian@zwartberg.com> wrote:
>
> > On 05/04/2010, Peter Geoghegan <peter.geoghegan86@gmail.com> wrote:
> > > Hello,
> > >
> > > At the moment, users of my application, which runs on 8.4.3, may
> > > search for products in a way that is implemented roughly like this:
> > >
> > > SELECT * FROM products WHERE description ILIKE '%%usr_string%%';
> > >
> > > This works reasonably well. However, I thought it would be a nice
> > > touch to give my users leeway to spell product names incorrectly when
> > > searching, or to not have to remember if a product is entered as "coca
> > > cola", "CocaCola" or "Coca-cola". At the moment, they don't have to
> > > worry about case sensitivity because I use ILIKE - I'd like to
> > > preserve that. I'd also like to not have it weigh against them heavily
> > > when they don't search for a specific product, but just a common
> > > substring. For example, if they search for "coca-cola", there may be a
> > > number of different coca-cola products: "CocaCola 330ml can",
> > > "Coca-Cola 2 litre bottle", but no actual plain "cocacola". That ought
> > > to not matter too much - all cocacola products should be returned.
> > >
> > > This isn't important enough for me to be willing to add a big
> > > dependency to my application. I'd really prefer to limit myself to the
> > > contrib modules. pg_trgm and fuzzystrmatch look very promising, but
> > > it's not obvious how I can use either to achieve what I want.
> > > Postgres's built-in regex support may have a role to play too.
> > >
> > > I can live with it not being indexable, because typically there are
> > > only tens of thousands of products in a production system.
> > >
> > > Could someone suggest an approach that is reasonably simple and
> > > reasonably generic ?
> >
> > What I do is to create another column that has a simplified version of
> > the string in it.
> > (I created a function to simplify strings, and when the source column
> > is changed or inserted, I also update the "simplified" column.
> > Then when searching, I use the same function to "simplify" the search
> > string and use "=" to test against the "simplified" column.
> >
> > E.g.
> > if the table has a column called "name" that you want to search, you
> > create a name_simplified column, and fill it as so:
> > update your_table set name_simplified=yourSimplifyFunction(name);
> >
> > Then to search:
> > select * from your_table where simplified_name =
> > yourSimplifyFunction('Coca-Cola');
> >
> > This is really fast, because the match is using the index rather than
> > a sequential scan.
> >
> > >
> > > Thanks,
> > > Peter Geoghegan
> > >
> > > --
> > > Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> > > To make changes to your subscription:
> > > http://www.postgresql.org/mailpref/pgsql-general
> > >
> >
> >
> > --
> > Brian Modra   Land line: +27 23 5411 462
> > Mobile: +27 79 69 77 082
> > 5 Jan Louw Str, Prince Albert, 6930
> > Postal: P.O. Box 2, Prince Albert 6930
> > South Africa
> > http://www.zwartberg.com/
> >
> > --
> > Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> > To make changes to your subscription:
> > http://www.postgresql.org/mailpref/pgsql-general
> >
>
>
>
> --
> George R. C. Silva
>
> Desenvolvimento em GIS
> http://blog.geoprocessamento.net

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com