Обсуждение: BUG #13690: Full Text Search with spanish dictionary cannot find some words

Поиск
Список
Период
Сортировка

BUG #13690: Full Text Search with spanish dictionary cannot find some words

От
vtamara@pasosdeJesus.org
Дата:
The following bug has been logged on the website:

Bug reference:      13690
Logged by:          Vladimir Támara Patiño
Email address:      vtamara@pasosdeJesus.org
PostgreSQL version: 9.4.5
Operating system:   adJ/OpenBSD
Description:

Consider the following table with one register:

CREATE TABLE cat (
    nombre character varying(500)
);

INSERT INTO cat (nombre) VALUES ('Politica Social');

The following search in english succeeds  (returns 1):

SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('english', nombre) @@ to_tsquery('english',
'politi:*'
);


But fails using the spanish dictionary (returns 0):

SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
'politi:*'
);

In all the cases I tried, the search with the english dictionary succeeded.

However using the spanish dictionary the search will succed for example with
'poli', 'polit', 'politica' but will fail with 'politi' and 'politic'

I'm using the dictionary of stop words included with PostgreSQL 9.4.5.

I also tested in a PostgreSQL 9.1.12 running on a Linux Ubuntu 15.04, and I
observed exactly the same error using spanish dictionary (can find 'polit'
and 'politica' but cannot find 'politi' neither 'politic').


Here is the SQL I'm using for testing:

CREATE TABLE cat (
    nombre character varying(500)
);

INSERT INTO cat (nombre) VALUES ('Politica Social');

SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
'polit:*');
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('english', nombre) @@ to_tsquery('english',
'polit:*');
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
'politi:*'); -- Error gives 0 should be 1
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('english', nombre) @@ to_tsquery('english',
'politi:*');
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
'politic:*'); -- Error gives 0 should be 1
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('english', nombre) @@ to_tsquery('english',
'politic:*');
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
'politica:*');
SELECT  COUNT(*) FROM cat
        WHERE to_tsvector('english', nombre) @@ to_tsquery('english',
'politica:*');

Re: BUG #13690: Full Text Search with spanish dictionary cannot find some words

От
Tom Lane
Дата:
vtamara@pasosdeJesus.org writes:
> The following search in english succeeds  (returns 1):

> SELECT  COUNT(*) FROM cat
>         WHERE to_tsvector('english', nombre) @@ to_tsquery('english',
> 'politi:*'
> );

> But fails using the spanish dictionary (returns 0):

> SELECT  COUNT(*) FROM cat
>         WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
> 'politi:*'
> );

This is because you didn't adjust the wildcard search pattern for the
different stemming rules used in Spanish.  Look at the to_tsvector and
to_tsquery results:

regression=# SELECT to_tsvector('english', nombre) , to_tsquery('english','politi:*') from cat;
       to_tsvector       | to_tsquery
-------------------------+------------
 'politica':1 'social':2 | 'politi':*
(1 row)

regression=# SELECT to_tsvector('spanish', nombre) , to_tsquery('spanish','politi:*') from cat;
     to_tsvector      | to_tsquery
----------------------+------------
 'polit':1 'social':2 | 'politi':*
(1 row)

I don't know enough Spanish to follow the reasoning for stemming
"politica" as "polit" rather than something else; but I do see that
"politi" is not reduced to "polit", which is fairly reasonable since
that's not a word.  "politi:*" will match anything whose stemmed
version starts with "politi", but that's too long ...

            regards, tom lane

Re: BUG #13690: Full Text Search with spanish dictionary cannot find some words

От
Artur Zakirov
Дата:
20.10.2015 19:21, Tom Lane ïèøåò:
> This is because you didn't adjust the wildcard search pattern for the
> different stemming rules used in Spanish.  Look at the to_tsvector and
> to_tsquery results:
>
> regression=# SELECT to_tsvector('english', nombre) , to_tsquery('english','politi:*') from cat;
>         to_tsvector       | to_tsquery
> -------------------------+------------
>   'politica':1 'social':2 | 'politi':*
> (1 row)
>
> regression=# SELECT to_tsvector('spanish', nombre) , to_tsquery('spanish','politi:*') from cat;
>       to_tsvector      | to_tsquery
> ----------------------+------------
>   'polit':1 'social':2 | 'politi':*
> (1 row)
>
> I don't know enough Spanish to follow the reasoning for stemming
> "politica" as "polit" rather than something else; but I do see that
> "politi" is not reduced to "polit", which is fairly reasonable since
> that's not a word.  "politi:*" will match anything whose stemmed
> version starts with "politi", but that's too long ...

Tom is right. You cannot change stemming rules used in Spanish. But on
the other hand you can create the synonym dictionary.

First, you need create file $SHAREDIR/tsearch_data/spanish.syn with the
following entry:

politi polit

Then execute the following script:

CREATE TEXT SEARCH DICTIONARY spanish_synonym (
     TEMPLATE = synonym,
     SYNONYMS = spanish
);
ALTER TEXT SEARCH CONFIGURATION spanish
     ALTER MAPPING FOR asciiword
     WITH spanish_synonym, spanish_stem;

After that the following query:

SELECT COUNT(*) FROM cat
        WHERE to_tsvector('spanish', nombre) @@ to_tsquery('spanish',
'politi:*');

will return 1.

You can read about synonym dictionary in the documentation:
http://www.postgresql.org/docs/devel/static/textsearch-dictionaries.html

--
Artur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company