Thread: Hunspell as filtering dictionary

Hunspell as filtering dictionary

From
Bibi Mansione
Date:
Hi,
I am trying to create a tsvector from a French text. Here are the operations that seem logical to perform, in this order:

1. remove stopwords
2. use hunspell to find word roots
3. unaccent

I first tried:

CREATE TEXT SEARCH CONFIGURATION fr_conf (copy='simple');

ALTER TEXT SEARCH CONFIGURATION fr_conf
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH unaccent, french_hunspell;


select * from to_tsvector('fr_conf', E'Pour découvrir et rencontrer l\'aventure.');

-- 'aventure':5 'aventurer':5 'rencontrer':3


But the verb "découvrir" is missing :(
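
One way to see why a token disappears is to ask each dictionary directly. The sketch below assumes the `unaccent` extension is installed and that a dictionary named `french_hunspell` exists, as in the statements above. `unaccent` is a filtering dictionary, so it hands its output to the next dictionary in the list; if the hunspell files only contain the accented spelling, the unaccented form is probably unknown to hunspell and the token is dropped because no later dictionary claims it.

```sql
-- Which dictionary, if any, claims each token under fr_conf?
SELECT alias, token, dictionary, lexemes
FROM ts_debug('fr_conf', 'Pour découvrir et rencontrer l''aventure.');

-- What unaccent passes on to the next dictionary in the list.
SELECT ts_lexize('unaccent', 'découvrir');

-- Does hunspell recognise the accented and the unaccented spelling?
-- A NULL result means "unknown word"; an empty array means "stop word".
SELECT ts_lexize('french_hunspell', 'découvrir') AS accented,
       ts_lexize('french_hunspell', 'decouvrir') AS unaccented;
```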


If I try with french_hunspell only, I get it, but with the accent:


select * from to_tsvector('french_hunspell', E'Pour découvrir et rencontrer l\'aventure.');

-- 'aventure':6 'aventurer':6 'découvrir':2 'rencontrer':4


I also tried:

CREATE TEXT SEARCH CONFIGURATION fr_conf2 (copy='simple');

ALTER TEXT SEARCH CONFIGURATION fr_conf2
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH french_hunspell, unaccent;


select * from to_tsvector('fr_conf2', E'Pour découvrir et rencontrer l\'aventure.');

-- 'aventure':5 'aventurer':5 'rencontrer':3


But I guess unaccent is never called.

I believe this is because french_hunspell is not a filtering dictionary, but I might be wrong. So is there a way to get this result from any FTS configuration (existing or :

-- 'aventure':6 'aventurer':6 'decouvrir':2 'rencontrer':4
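
A possible workaround, offered only as a sketch and not something proposed in this thread, is to let hunspell work on the accented text and strip the accents afterwards, by round-tripping the tsvector through its text form. It assumes the `unaccent` extension and the `french_hunspell` configuration used above. The query side has to be treated the same way, otherwise accented search terms will not match.

```sql
-- Index side: stem with hunspell first, then remove accents from the lexemes.
SELECT unaccent(
           to_tsvector('french_hunspell',
                       'Pour découvrir et rencontrer l''aventure.')::text
       )::tsvector;

-- Query side: apply the same two steps to the search terms.
SELECT unaccent(
           to_tsquery('french_hunspell', 'découvrir')::text
       )::tsquery;
```

Note that `unaccent()` is declared STABLE rather than IMMUTABLE, so using this expression in an index would need an immutable wrapper function.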


Thanks,

Bertrand

Re: Hunspell as filtering dictionary

From
Hugh Ranalli
Date:
On Tue, 5 Nov 2019 at 09:42, Bibi Mansione <golgote@gmail.com> wrote:
Hi,
I am trying to create a tsvector from a French text. Here are the operations that seem logical to perform, in this order:

1. remove stopwords
2. use hunspell to find word roots
3. unaccent

I can't speak to French, but we use a similar configuration in English, with unaccent first, then hunspell. We found that there were words that hunspell didn't recognise, but instead pulled apart (for example, "contract" became "con" and "tract"), so I wonder if something similar is happening with "découvrir." To solve this, we put a custom dictionary with these terms in front of hunspell. Unaccent definitely has to be called first. We also modified hunspell with a custom stopwords file, to eliminate select other terms, such as profanities:

    -- We use a custom stopwords file, to filter out other terms, such as profanities
    ALTER TEXT SEARCH DICTIONARY
        hunspell_en_ca (
            Stopwords = our_custom_stopwords
            );

    -- Adding english_stem allows us to recognize words which hunspell
    -- doesn't, particularly acronyms such as CGA 
    ALTER TEXT SEARCH CONFIGURATION    
        our_configuration   
    ALTER MAPPING FOR
        asciiword, asciihword, hword_asciipart,
        word, hword, hword_part
    WITH
        unaccent, our_custom_dictionary, hunspell_en_ca, english_stem
        ;

There was definitely a fair bit of trial and error to determine the correct order and configuration.
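
The message does not say how `our_custom_dictionary` was built. One plausible way, shown here purely as an assumption, is a dictionary based on the built-in `synonym` template that maps each protected word to itself. Because a synonym dictionary is not a filtering dictionary, a word it recognises stops there and never reaches hunspell. The file name `our_custom_words` is hypothetical.

```sql
-- Hypothetical file $SHAREDIR/tsearch_data/our_custom_words.syn,
-- one "word replacement" pair per line, for example:
--   contract contract
CREATE TEXT SEARCH DICTIONARY our_custom_dictionary (
    TEMPLATE = synonym,
    SYNONYMS = our_custom_words
);

-- A word listed in the file should come back unchanged;
-- any other word returns NULL and falls through to the next dictionary.
SELECT ts_lexize('our_custom_dictionary', 'contract');
```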

Re: Hunspell as filtering dictionary

From
Bibi Mansione
Date:
Thanks. The problem is that the hunspell dictionary doesn't work with unaccent so it is actually totally useless for languages with accents. If one has to rely on stemming for words with accents, it is just a partial solution and it is not the right solution.

Besides, the results returned by the hunspell implementation in PostgreSQL are incorrect. As you mentioned, it shouldn't return "con" and "tract" for "contract". I also noticed many other weird results with other French words. There might be a bug in the code.

I ended up using ts_debug() with a simple stopword file in my own tokenizer, written in pllua, which calls libhunspell directly through LuaJIT and FFI. I also wrote my own unaccent in Lua using the unaccent extension's rules. Indexing French text is now twice as fast and gives much better results. The tokenizer produces a tsvector: words returned by the libhunspell stem() function get the lower weight D and keep the same position as the original word.
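
The Lua code is not shown in the thread. As a rough approximation of the idea in plain SQL, the sketch below numbers the word tokens reported by ts_debug(), keeps the unaccented original word with weight A, and adds each hunspell stem with weight D at the same position. The choice of weight A for the originals, the use of ts_lexize() instead of libhunspell, and the omission of stop-word removal are all assumptions made for illustration.

```sql
WITH tokens AS (
    -- Number the word tokens in document order.
    SELECT row_number() OVER (ORDER BY t.ord) AS pos,
           lower(t.token) AS word
    FROM ts_debug('simple', 'Pour découvrir et rencontrer l''aventure.')
         WITH ORDINALITY
         AS t(alias, description, token, dictionaries, dictionary, lexemes, ord)
    WHERE t.alias IN ('asciiword', 'word')
),
entries AS (
    -- The original word, unaccented, with weight A (assumed).
    SELECT pos, unaccent(word) AS lexeme, 'A' AS weight
    FROM tokens
    UNION
    -- Each stem known to hunspell, unaccented, with weight D at the same position.
    SELECT tk.pos, unaccent(stem), 'D'
    FROM tokens AS tk
    CROSS JOIN LATERAL unnest(ts_lexize('french_hunspell', tk.word)) AS stem
    WHERE unaccent(stem) <> unaccent(tk.word)
)
-- Assemble the textual tsvector form ('lexeme':posWEIGHT ...) and cast it.
SELECT string_agg(quote_literal(lexeme) || ':' || pos || weight, ' ')::tsvector
FROM entries;
```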

My conclusion is that hunspell in PostgreSQL is useless, for me at least, because it should be a filtering dictionary and it produces strange results that pollute the original text.

I also think that the current implementation of TEXT SEARCH configuration is not usable for serious purposes. It is too limited. Solr configuration, while more complex, does a much better job. 




On Wed, 6 Nov 2019 at 16:50, Hugh Ranalli <hugh@whtc.ca> wrote: