Обсуждение: how to group by similarity ?

Поиск
Список
Период
Сортировка

how to group by similarity ?

От
Andreas
Дата:
Hi,

I'm trying to get an idea about pg_trgrm.
I created a GIST index on a text column in a table.
Now I can filter the table with similarity().

How would I group the table so that it shows groups that have similarity
() > x ?

Lets say the table looks like this:

id,  txt
1,   aa1
2,   bb1
3,   cc1
4,   bb2
5,   bb3
6,   aa2
...

How would a select look like that shows:

id,  txt,      group_id
1,   aa1,   1,
6,   aa2,   1,
2,   bb1,   2,
4,   bb2,   2,
5,   bb3,   2,
3,   cc1,   3




Re: how to group by similarity ?

От
hari.fuchs@gmail.com
Дата:
Andreas <maps.on@gmx.net> writes:

> How would I group the table so that it shows groups that have
> similarity () > x ?
>
> Lets say the table looks like this:
>
> id,  txt
> 1,   aa1
> 2,   bb1
> 3,   cc1
> 4,   bb2
> 5,   bb3
> 6,   aa2
> ...
>
> How would a select look like that shows:
>
> id,  txt,      group_id
> 1,   aa1,   1,
> 6,   aa2,   1,
> 2,   bb1,   2,
> 4,   bb2,   2,
> 5,   bb3,   2,
> 3,   cc1,   3

I could only come up with this convoluted query:

WITH grp (t1, id, t2) AS (
  SELECT t1.txt, t1.id, t2.txt
  FROM tbl t1
  LEFT JOIN tbl t2 ON t2.txt > t1.txt
  WHERE t2.txt IS NULL OR similarity(t1.txt, t2.txt) > 0
)
SELECT t1, min(id)
FROM (
    SELECT t1, id
    FROM grp
  UNION ALL
    SELECT t2, id
    FROM grp
    WHERE t2 IS NOT NULL
  ) dummy
GROUP BY t1
ORDER BY t1

Re: how to group by similarity ?

От
hari.fuchs@gmail.com
Дата:
Andreas <maps.on@gmx.net> writes:

> How would I group the table so that it shows groups that have
> similarity () > x ?
>
> Lets say the table looks like this:
>
> id,  txt
> 1,   aa1
> 2,   bb1
> 3,   cc1
> 4,   bb2
> 5,   bb3
> 6,   aa2
> ...
>
> How would a select look like that shows:
>
> id,  txt,      group_id
> 1,   aa1,   1,
> 6,   aa2,   1,
> 2,   bb1,   2,
> 4,   bb2,   2,
> 5,   bb3,   2,
> 3,   cc1,   3

The following query will do that, but it's convoluted:

WITH grp (t1, id, t2) AS (
  SELECT t1.txt, t1.id, t2.txt
  FROM tbt t1
  LEFT JOIN tbt t2 ON t2.txt > t1.txt
  WHERE t2.txt IS NULL OR similarity(t1.txt, t2.txt) > 0
)
SELECT t1, min(id)
FROM (
    SELECT t1, id
    FROM grp
  UNION ALL
    SELECT t2, id
    FROM grp
    WHERE t2 IS NOT NULL
  ) dummy
GROUP BY t1
ORDER BY t1

Re: how to group by similarity ?

От
Tim Uckun
Дата:
On Wed, Apr 25, 2012 at 8:34 PM,  <hari.fuchs@gmail.com> wrote:
> Andreas <maps.on@gmx.net> writes:
>
>> How would I group the table so that it shows groups that have
>> similarity () > x ?
>>
>> Lets say the table looks like this:
>>
>> id,  txt
>> 1,   aa1
>> 2,   bb1
>> 3,   cc1
>> 4,   bb2
>> 5,   bb3
>> 6,   aa2
>> ...
>>
>> How would a select look like that shows:
>>
>> id,  txt,      group_id
>> 1,   aa1,   1,
>> 6,   aa2,   1,
>> 2,   bb1,   2,
>> 4,   bb2,   2,
>> 5,   bb3,   2,
>> 3,   cc1,   3
>


Hey guys. I have a similar problem and I tried a couple of ways to
solve this including the window function described in the answer to
the original poster in this thread.

The problem I am having is that even with a trigam index and a table
with only 80,000  records the query takes forever to run. In both
cases I ended the query manually and have no idea how long it would
actually take to run. I have included the two queries below and am
hoping somebody can give me a pointer on how to accomplish with a
query that runs.


Query 1

WITH grp (t1, id, t2) AS (
  SELECT t1.raw_data, t1.id, t2.raw_data
  FROM schema.a t1
  LEFT JOIN schema.a t2 ON t2.raw_data > t1.raw_data
  WHERE t2.raw_data IS NULL OR similarity(t1.raw_data, t2.raw_data) > .75
)
SELECT t1, min(id)
FROM (
    SELECT t1, id
    FROM grp
  UNION ALL
    SELECT t2, id
    FROM grp
    WHERE t2 IS NOT NULL
  ) dummy
GROUP BY t1
ORDER BY t1


query 2

select similarity(a.raw_data,b.raw_data),*
from schema.a a, schema.a b where similarity(a.raw_data,b.raw_data) >
.75 and a.id != b.id