Re: Merge rows based on Levenshtein distance

Поиск
Список
Период
Сортировка
От David G Johnston
Тема Re: Merge rows based on Levenshtein distance
Дата
Msg-id CAKFQuwbhyimtQR9NY28RtfLvpOhgsSRe0H_mT=MqTZtz3k9bnA@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Merge rows based on Levenshtein distance  (mongoose <akarargyris@gmail.com>)
Ответы Re: Merge rows based on Levenshtein distance  (mongoose <akarargyris@gmail.com>)
Список pgsql-general
On Tuesday, December 2, 2014, mongoose [via PostgreSQL] <[hidden email]> wrote:
David,

Thank you for your prompt reply. I believe your answer helped a lot but it seems I was not clear enough on my description. Basically I want a counter (id) to show if two or more names are similar (i.e. levenshtein distance less than 3) So in the previous example:

From this table:

Name, City
"Booob", "NYC"
"Alex", "Washington"
"Alexj2", "Washington"
"Bob", "NYC"
"Aleex1", "Washington"

to get this table:

id, Name, City
1,"Alex", "Washington"
1,"Aleex1", "Washington"
1,"Alexj2", "Washington"
2,"Bob", "NYC"
2,"Booob", "NYC"

So basically the id is a counter that starts from "1" and increments only when there is a different name. Please notice that the table has its names in a completely random order.


Write and combine a few subqueries that use window functions (namely lag and row_number) to identify groups, label them, and assign rows to each group (using a between condition on a join)

Pondering some (not tested) if you identify the boundary records in a subquery you can assign them a value of 1 while all others take on null.  In the outer query you should be able to assign groups by simply applying the sum function over the entire result such that at each boundary value the presence of the 1 will increment the sum while the null rows will use the sum value from the prior row.

David J.



View this message in context: Re: Merge rows based on Levenshtein distance
Sent from the PostgreSQL - general mailing list archive at Nabble.com.

В списке pgsql-general по дате отправления:

Предыдущее
От: David G Johnston
Дата:
Сообщение: Re: Is "WITH () UPDATE" Thread Safe ?
Следующее
От: Tim Schäfer
Дата:
Сообщение: Re: Auto vacuum not running -- Could not bind socket for statistics collector