Обсуждение: "Fuzzy" Matches on Nicknames

Поиск
Список
Период
Сортировка

"Fuzzy" Matches on Nicknames

От
Michael Sheaver
Дата:
Greetings,

I have two tables that are populated using large datasets from disparate external systems, and I am trying to match
recordsby customer name between these two tables. I do not have any authoritative key, such as customerID or
nationalID,by which I can match them up, and I have found many cases where the same customer has different first names
inthe two datasets. A sampling of the differences is as follows: 

Michael <=> Mike
Tom <=> Thomas
Liz <=> Elizabeth
Margaret <=> Maggie

How can I build a query in PostgreSQL (v. 9.6) that will find possible matches like these on nicknames? My initial
guessis that I would have to either find or build some sort of intermediary table that contains associated names like
thoseabove. Sometimes though, there will be more than matching pairs, like: 

Jim <=> James <=> Jimmy <=> Jimmie
Bill <=> Will <=> Willie <=> William

and so forth.

Has anyone used or developed PostgreSQL queries that will find matches like these? I am running all my database
queries.on my local laptops (Win7 and macOS), so performance or uptime is no issue here. I am curious to see how others
inthis community have creatively solved this common problem. 

One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might work here, but honestly I am clueless as to how to
setthis up or use it in queries successfully. 

Thanks,
Michael (aka Mike, aka Mikey)

Re: "Fuzzy" Matches on Nicknames

От
rob stone
Дата:
Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:
> Greetings,
>
> I have two tables that are populated using large datasets from
> disparate external systems, and I am trying to match records by
> customer name between these two tables. I do not have any
> authoritative key, such as customerID or nationalID, by which I can
> match them up, and I have found many cases where the same customer
> has different first names in the two datasets. A sampling of the
> differences is as follows:
>
> Michael <=> Mike
> Tom <=> Thomas
> Liz <=> Elizabeth
> Margaret <=> Maggie
>
> How can I build a query in PostgreSQL (v. 9.6) that will find
> possible matches like these on nicknames? My initial guess is that I
> would have to either find or build some sort of intermediary table
> that contains associated names like those above. Sometimes though,
> there will be more than matching pairs, like:
>
> Jim <=> James <=> Jimmy <=> Jimmie
> Bill <=> Will <=> Willie <=> William
>
> and so forth.
>
> Has anyone used or developed PostgreSQL queries that will find
> matches like these? I am running all my database queries. on my local
> laptops (Win7 and macOS), so performance or uptime is no issue here.
> I am curious to see how others in this community have creatively
> solved this common problem.
>
> One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
> work here, but honestly I am clueless as to how to set this up or use
> it in queries successfully.
>
> Thanks,
> Michael (aka Mike, aka Mikey)
>

Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.

HTH.
Rob


Re: "Fuzzy" Matches on Nicknames

От
Merlin Moncure
Дата:
On Tue, Nov 29, 2016 at 6:56 PM, rob stone <floriparob@gmail.com> wrote:
> Hello Michael,
> On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:
>> Greetings,
>>
>> I have two tables that are populated using large datasets from
>> disparate external systems, and I am trying to match records by
>> customer name between these two tables. I do not have any
>> authoritative key, such as customerID or nationalID, by which I can
>> match them up, and I have found many cases where the same customer
>> has different first names in the two datasets. A sampling of the
>> differences is as follows:
>>
>> Michael <=> Mike
>> Tom <=> Thomas
>> Liz <=> Elizabeth
>> Margaret <=> Maggie
>>
>> How can I build a query in PostgreSQL (v. 9.6) that will find
>> possible matches like these on nicknames? My initial guess is that I
>> would have to either find or build some sort of intermediary table
>> that contains associated names like those above. Sometimes though,
>> there will be more than matching pairs, like:
>>
>> Jim <=> James <=> Jimmy <=> Jimmie
>> Bill <=> Will <=> Willie <=> William
>>
>> and so forth.
>>
>> Has anyone used or developed PostgreSQL queries that will find
>> matches like these? I am running all my database queries. on my local
>> laptops (Win7 and macOS), so performance or uptime is no issue here.
>> I am curious to see how others in this community have creatively
>> solved this common problem.
>>
>> One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
>> work here, but honestly I am clueless as to how to set this up or use
>> it in queries successfully.
>>
>> Thanks,
>> Michael (aka Mike, aka Mikey)
>>
>
> Check out chapter F15 in the doco.
> Try the double metaphone.
> I worked on something similar many years ago cleaning up input created
> by data entry clerks from hand written speeding tickets, so as to match
> with "trusted" data held in a database.
> As the volume of input was small in comparison with the number of
> licensed drivers, we could iterate over and over again trying to match
> it up.

Also check out pg_trgm extension.  It's better for addresses than
names, but might be something to look at depending on how things turn
up with the data.

merlin