Discussion: [GENERAL] Full Text Search combined with Fuzzy
Hello,

AFAIK there is no built-in way to combine full text search and fuzzy matching (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html). For example, searching for a phrase that contains typos.

First, I don't know whether PostgreSQL's competitors (Lucene-based engines, ...) are able to do so. Second, is such a feature on the roadmap? Third, I wonder whether it is a good idea to use the PostgreSQL synonym feature for this purpose (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html). I mean, building up a synonym dictionary containing typos, e.g.:

postgres pgsql
postgresql pgsql
postgrez pgsql
postgre pgsql
gogle googl
gooogle googl

There are multiple ways to build such a dictionary. But my question is about the implementation of dictionaries in PostgreSQL: is PostgreSQL supposed to take advantage of billion-entry dictionaries?

Thanks in advance for your answers,
On Sun, Feb 26, 2017 at 3:52 PM, Nicolas Paris <niparisco@gmail.com> wrote:
> Hello,
>
> AFAIK there is no built-in way to combine full text search and fuzzy matching
> (https://www.postgresql.org/docs/current/static/fuzzystrmatch.html).
> For example, searching for a phrase that contains typos.
>
> First, I don't know whether PostgreSQL's competitors (Lucene-based engines)
> are able to do so.
Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used for this.
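For context on why pg_trgm fits here: it compares strings by their sets of three-character substrings (trigrams), so a typo only perturbs a few trigrams and the strings still score as similar. A minimal sketch of the building blocks (exact show_trgm() output formatting may vary by version):

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- show_trgm() exposes the trigram set pg_trgm works with; each word is
-- lowercased and padded with blanks before the 3-character windows are taken
SELECT show_trgm('word');

-- similarity() is the count of shared trigrams divided by the count of all
-- distinct trigrams of both strings, so a one-letter typo still scores high
SELECT similarity('postgrez', 'postgres');

-- the % operator is true when similarity() exceeds
-- pg_trgm.similarity_threshold (default 0.3) and can use a trigram index
SELECT 'postgrez' % 'postgres';
```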
> Second, is such a feature on the roadmap?
>
> Third, I wonder whether it is a good idea to use the PostgreSQL synonym
> feature for this purpose
> (https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html).
> I mean, building up a synonym dictionary containing typos, e.g.:
>
> postgres pgsql
> postgresql pgsql
> postgrez pgsql
> postgre pgsql
> gogle googl
> gooogle googl
>
> There are multiple ways to build such a dictionary. But my question is about
> the implementation of dictionaries in PostgreSQL: is PostgreSQL supposed to
> take advantage of billion-entry dictionaries?
A dictionary is just a program, so it is up to the developer to write an efficient program that deals with billions of entries. As for the synonym dictionary specifically, it is not intended to work with a lot of entries. BTW, have a look at the contrib/dict_xsyn dictionary, which is more flexible than synonym.
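For reference, a minimal sketch of wiring up dict_xsyn for typo mapping. The dictionary name, rules-file name, and file contents below are hypothetical; the template and option names are from the dict_xsyn documentation (the rules file lives under $SHAREDIR/tsearch_data/):

```sql
CREATE EXTENSION dict_xsyn;

-- Hypothetical rules file typos.rules, one group per line,
-- canonical word first, typo variants after it:
--   pgsql postgres postgresql postgrez postgre
--   googl gogle gooogle
CREATE TEXT SEARCH DICTIONARY typo_xsyn (
    TEMPLATE = xsyn_template,
    RULES = 'typos',          -- reads typos.rules
    MATCHSYNONYMS = true      -- match the typo variants themselves
);

-- Quick check: a typo should now normalize to the canonical form
SELECT ts_lexize('typo_xsyn', 'postgrez');
```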
> Thanks in advance for your answers,
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
On 27 Feb 2017 at 10:32, Oleg Bartunov wrote:
> Usually, https://www.postgresql.org/docs/current/static/pgtrgm.html is used
> for this.

Hi Oleg,

Thanks. I thought pg_trgm was not able to index my long texts because of the limitation of 8191 bytes per index row for btree. Then I found out it is possible to use pg_trgm over a GIN/GiST index.

My final use case is phrase mining in texts. I want my application to return texts that contain approximately the user entry.

E.g. the user searches for "Hello Word"; a text containing "blah blah blah hello world blah blah blah" would be returned.

Test:

postgres=# CREATE TABLE test_trgm (texts text);
CREATE TABLE
postgres=# CREATE INDEX ON test_trgm USING GIN (texts gin_trgm_ops);
CREATE INDEX
postgres=# SET enable_seqscan = OFF;
SET
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello world blah blah blah');
INSERT 0 1
postgres=# INSERT INTO test_trgm VALUES ('blah blah blah hello word blah blah blah');
INSERT 0 1
postgres=# SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                   texts                   | similarity
-------------------------------------------+------------
 blah blah blah hello world blah blah blah |   0.473684
 blah blah blah hello word blah blah blah  |     0.6875
(2 rows)

postgres=# EXPLAIN SELECT texts, similarity(texts, 'hello word') FROM test_trgm WHERE texts % 'hello word';
                                     QUERY PLAN
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on test_trgm  (cost=52.01..56.03 rows=1 width=32)
   Recheck Cond: (texts % 'hello word'::text)
   ->  Bitmap Index Scan on test_trgm_texts_idx  (cost=0.00..52.01 rows=1 width=0)
         Index Cond: (texts % 'hello word'::text)
(4 rows)

Conclusion: if I said 0.4 is my threshold, would this methodology meet my requirements?

Thanks for the help!
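On the 0.4 threshold question above: the cut-off that the % operator applies is configurable. A sketch, assuming PostgreSQL 9.6 (the GUC name is per the 9.6 pg_trgm docs; set_limit() is the older interface):

```sql
-- % returns true when similarity() exceeds this setting (default 0.3)
SET pg_trgm.similarity_threshold = 0.4;

-- equivalent pre-9.6 interface
SELECT set_limit(0.4);

-- the indexed % search then applies the 0.4 cut-off
SELECT texts, similarity(texts, 'hello word')
FROM test_trgm
WHERE texts % 'hello word';
```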
On 03.03.2017 15:49, Nicolas Paris wrote:
> Conclusion: if I said 0.4 is my threshold, would this methodology meet
> my requirements?

Hello,

If you use PostgreSQL 9.6, then word_similarity() can help you [1]. For example:

postgres=# SELECT texts, word_similarity('hello word', texts) FROM test_trgm WHERE 'hello word' <% texts;
                   texts                   | word_similarity
-------------------------------------------+-----------------
 blah blah blah hello world blah blah blah |        0.818182
 blah blah blah hello word blah blah blah  |               1
(2 rows)

1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
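Note that the <% operator has its own cut-off, separate from the one % uses. A sketch of tuning it to the 0.4 threshold discussed earlier (GUC name per the 9.6 pg_trgm docs):

```sql
-- <% is true when word_similarity() exceeds this setting (default 0.6)
SET pg_trgm.word_similarity_threshold = 0.4;

SELECT texts, word_similarity('hello word', texts)
FROM test_trgm
WHERE 'hello word' <% texts;
```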
On 3 Mar 2017 at 14:08, Artur Zakirov wrote:
> If you use PostgreSQL 9.6, then word_similarity() can help you [1].
>
> 1. https://www.postgresql.org/docs/9.6/static/pgtrgm.html

Nice! I do have the 9.6 version.

Would this kind of index be able to handle more than 20M large texts? The recheck condition looks resource consuming.

Is the full text index + phrase search + synonym dictionary the only other alternative to deal with typo-phrase mining?

Is there any possibility in the future of adding typo handling to the full text search roadmap?

Thanks,
On 03.03.2017 16:17, Nicolas Paris wrote:
> Nice! I do have the 9.6 version.
>
> Would this kind of index be able to handle more than 20M large texts? The
> recheck condition looks resource consuming.

You are right. I think pg_trgm will not be good for such large texts, unfortunately.

> Is the full text index + phrase search + synonym dictionary the only
> other alternative to deal with typo-phrase mining?

I suppose there are no other options now. Though prefix search may help you [1].

> Is there any possibility in the future of adding typo handling to the full
> text search roadmap?

As far as I know, there are no plans in the near future to add similarity full text search.

1. https://www.postgresql.org/docs/current/static/textsearch-dictionaries.html

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
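A quick sketch of the prefix search mentioned above: tsquery supports a :* modifier that matches any lexeme starting with the given prefix, which catches truncated forms (though not arbitrary misspellings):

```sql
-- 'postgr:*' matches postgres, postgresql, postgrez, ...
SELECT to_tsvector('english', 'I use PostgreSQL every day')
       @@ to_tsquery('english', 'postgr:*');  -- true
```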