Обсуждение: Merge rows based on Levenshtein distance

Поиск
Список
Период
Сортировка

Merge rows based on Levenshtein distance

От
mongoose
Дата:
I am new to PostgreSQL and I have the following table:

Name, City
"Alex", "Washington"
"Aleex1", "Washington"
"Bob", "NYC"
"Booob", "NYC"

I want to "merge" similar rows based on levenshtein distance between names
so that I have the following table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

How could I do that on PostgreSQL? Is there an SQL command for this?
Thnsls




--
View this message in context: http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
David G Johnston
Дата:
mongoose wrote
> I am new to PostgreSQL and I have the following table:
>
> Name, City
> "Alex", "Washington"
> "Aleex1", "Washington"
> "Bob", "NYC"
> "Booob", "NYC"
>
> I want to "merge" similar rows based on levenshtein distance between names
> so that I have the following table:
>
> id, Name, City
> 1,"Alex", "Washington"
> 1,"Aleex1", "Washington"
> 2,"Bob", "NYC"
> 2,"Booob", "NYC"
>
> How could I do that on PostgreSQL? Is there an SQL command for this?
> Thnsls

So you have a table of N names and you want to evaluate (N-1)^2 pairs and
then use the output of the levenshtein calculation to group them together.

SELECT
l_names.name_value,
r_names.name_value, leven[...](l_names.name_value, r_names.name_value) AS
pair_group
FROM table_of_names AS l_names
CROSS JOIN table_of_names AS r_names
WHERE l_names.name_value <> r_names.name_value
;

Feel free to add "group by city" or "WHERE substring(l_names.name_value, 0,
1) = substring(r_names.name_value, 0, 1)" since it seems you need more than
just a name-distance to generate the desired groups.  You'd likely want to
add the same "substring" call to the SELECT-list and "GROUP BY" clauses...

David J.




--
View this message in context:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5828847.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
mongoose
Дата:
David,

Thank you for your prompt reply. I believe your answer helped a lot but it
seems I was not clear enough on my description. Basically I want a counter
(id) to show if two or more names are similar (i.e. levenshtein distance
less than 3) So in the previous example:

From this table:

Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"

to get this table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

So basically the id is a counter that starts from "1" and increments only
when there is a different name. Please notice that the table has its names
in a completely random order.





--
View this message in context:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829030.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
David G Johnston
Дата:
On Tuesday, December 2, 2014, mongoose [via PostgreSQL] <[hidden email]> wrote:
David,

Thank you for your prompt reply. I believe your answer helped a lot but it seems I was not clear enough on my description. Basically I want a counter (id) to show if two or more names are similar (i.e. levenshtein distance less than 3) So in the previous example:

From this table:

Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"

to get this table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

So basically the id is a counter that starts from "1" and increments only when there is a different name. Please notice that the table has its names in a completely random order.


Write and combine a few subqueries that use window functions (namely lag and row_number) to identify groups, label them, and assign rows to each group (using a between condition on a join)

Pondering some (not tested) if you identify the boundary records in a subquery you can assign them a value of 1 while all others take on null.  In the outer query you should be able to assign groups by simply applying the sum function over the entire result such that at each boundary value the presence of the 1 will increment the sum while the null rows will use the sum value from the prior row.

David J.



View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Merge rows based on Levenshtein distance

От
pinker
Дата:
There is nice extension in postgres:  fuzzystrmatch
<http://www.postgresql.org/docs/9.3/static/fuzzystrmatch.html>   I have used
to calculate the distance. From documetation:

SELECT levenshtein_less_equal('extensive', 'exhaustive',2);

You can use it then with your group by query.



--
View this message in context:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829111.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
mongoose
Дата:
David,

Thanks for the useful feedback. Since I am not an experienced developer it
is too complicated for me to come up with the queries. Besides I wonder if
this is going to be efficient to do this processing on PostgreSQL.



--
View this message in context:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829132.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
David G Johnston
Дата:
I play with it when I get a chance but you should at least try to code something.

Dave


On Wed, Dec 3, 2014 at 11:08 AM, mongoose [via PostgreSQL] <[hidden email]> wrote:
David,

Thanks for the useful feedback. Since I am not an experienced developer it is too complicated for me to come up with the queries. Besides I wonder if this is going to be efficient to do this processing on PostgreSQL.


If you reply to this email, your message will be added to the discussion below:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829132.html
To unsubscribe from Merge rows based on Levenshtein distance, click here.
NAML



View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Merge rows based on Levenshtein distance

От
David G Johnston
Дата:
On Wed, Dec 3, 2014 at 9:14 AM, pinker [via PostgreSQL] <[hidden email]> wrote:
There is nice extension in postgres: fuzzystrmatch I have used to calculate the distance. From documetation:

SELECT levenshtein_less_equal('extensive', 'exhaustive',2);

You can use it then with your group by query.


​Something like this - replace the substring(...) comparison with legenshtein_less_equal(...)​ or whatever comparison you find applicable.

In the case below new groups are started whenever the first letter of the value changes.

The first group would be NULL so I add a COALESCE() call to make it 0 - subsequent groups start with 1 and increment properly.

WITH src (val) AS (
VALUES ('A1'::varchar),('A2'),('B1'),('B2'),('B3'),('C1'),('D1')
)
, grp AS (
SELECT val
, CASE WHEN 
            substring(val,1,1) <> substring(lag(val) OVER (ORDER BY val),1,1) 
       THEN 1 
       ELSE NULL 
       END AS changed 
, ROW_NUMBER() OVER (ORDER BY val) AS val_idx
FROM src
)
SELECT val, COALESCE(sum(changed) OVER (ORDER BY val_idx), 0) AS group_id
FROM grp
​;

David J.​
 


View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

Re: Merge rows based on Levenshtein distance

От
Michael Nolan
Дата:
I don't think you've defined your problem very clearly. 

Suppose you have 1000 names in your database.  Are you planning to compare each name to the other 999 names to see which is closest?  What if two names are equally close to a third name but not to each other, how do you decide which is better? 
--
Mike Nolan

Re: Merge rows based on Levenshtein distance

От
mongoose
Дата:
Thanks for the help. I will give your code a try. Btw I know how to solve
this in a different language but unfortunately I am a very rookie with
databases.



--
View this message in context:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829145.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
mongoose
Дата:
Hi Mike,

I was planning to do something like David suggested: I would sort the rows
based on name and then I would  use a window (i.e. 100 rows) to compare each
individual name to the previous 100. All I want to do is create groups of
similar rows based on some criteria.



--
View this message in context:
http://postgresql.nabble.com/Merge-rows-based-on-Levenshtein-distance-tp5828841p5829146.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Merge rows based on Levenshtein distance

От
Michael Nolan
Дата:
Have you considered using a soundex function to sort names into similarity groups?  In my experience it works fairly well with Western European names, not quite as well with names from other parts of the world.  It also doesn't deal well with many nicknames (Mike instead of Michael, etc.)

--
Mike Nolan