Discussion: Hunspell as filtering dictionary
CREATE TEXT SEARCH CONFIGURATION fr_conf (copy='simple');
ALTER TEXT SEARCH CONFIGURATION fr_conf
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH unaccent, french_hunspell;
select * from to_tsvector('fr_conf', E'Pour découvrir et rencontrer l\'aventure.');
-- 'aventure':5 'aventurer':5 'rencontrer':3
But the verb "découvrir" is missing :(
If I try with french_hunspell only, I get it, but with the accent:
select * from to_tsvector('french_hunspell', E'Pour découvrir et rencontrer l\'aventure.');
-- 'aventure':6 'aventurer':6 'découvrir':2 'rencontrer':4
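One way to check where "découvrir" gets lost is to ask each dictionary directly with ts_lexize and then trace the whole configuration with ts_debug. This is only a diagnostic sketch: it assumes the unaccent and french_hunspell dictionaries are the ones used in the statements above, and it shows no output because the result depends on the installed dictionary files. A NULL from ts_lexize means the dictionary does not recognize the token; an empty array means the token is treated as a stop word.

```sql
-- Ask each dictionary what it returns for the token on its own.
SELECT ts_lexize('unaccent', 'découvrir');         -- the filtering step
SELECT ts_lexize('french_hunspell', 'découvrir');  -- accented form
SELECT ts_lexize('french_hunspell', 'decouvrir');  -- form hunspell receives after unaccent

-- Trace, token by token, which dictionary in fr_conf produced the lexemes.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('fr_conf', 'Pour découvrir et rencontrer l''aventure.');
```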
CREATE TEXT SEARCH CONFIGURATION fr_conf2 (copy='simple');
ALTER TEXT SEARCH CONFIGURATION fr_conf2
ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH french_hunspell, unaccent;
select * from to_tsvector('fr_conf2', E'Pour découvrir et rencontrer l\'aventure.');
-- 'aventure':5 'aventurer':5 'rencontrer':3
-- 'aventure':6 'aventurer':6 'decouvrir':2 'rencontrer':4
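A possible explanation, offered as a guess rather than a confirmed diagnosis: unaccent is a filtering dictionary, so in fr_conf french_hunspell receives "decouvrir" without its accent. If the hunspell files only list the accented spelling, no dictionary in the list recognizes the token and PostgreSQL discards it. The PostgreSQL documentation also notes that a filtering dictionary is of no use at the end of the list, which is where fr_conf2 puts it. A sketch of one workaround, assuming that explanation holds, appends a catch-all dictionary such as the built-in simple; the name fr_conf3 is made up for this example.

```sql
-- Hypothetical configuration: same mapping as fr_conf, plus a fallback.
CREATE TEXT SEARCH CONFIGURATION fr_conf3 (copy='simple');
ALTER TEXT SEARCH CONFIGURATION fr_conf3
  ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                    word, hword, hword_part
  -- simple accepts any token, so words hunspell rejects are kept as-is
  WITH unaccent, french_hunspell, simple;

select * from to_tsvector('fr_conf3', E'Pour découvrir et rencontrer l\'aventure.');
```

With this fallback an unrecognized word is indexed in its unaccented, lowercased form rather than reduced to a root. Replacing simple with the built-in french_stem would stem such words instead.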
Hi,

I am trying to create a ts_vector from a French text. Here are the operations that seem logical to perform in that order:
1. remove stopwords
2. use hunspell to find word roots
3. unaccent
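The thread does not show how french_hunspell was defined. For readers following along, here is a sketch of how steps 1 and 2 are commonly combined in a single dictionary using the ispell template. The file names fr (that is, fr.dict and fr.affix in $SHAREDIR/tsearch_data) are assumptions; french refers to the french.stop file that ships with PostgreSQL.

```sql
-- Assumed definition; the poster's actual dictionary may differ.
CREATE TEXT SEARCH DICTIONARY french_hunspell (
    TEMPLATE  = ispell,
    DictFile  = fr,      -- assumed: fr.dict converted from the hunspell .dic file
    AffFile   = fr,      -- assumed: fr.affix converted from the hunspell .aff file
    StopWords = french   -- step 1: stop words are dropped by this dictionary
);
```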
On Tue, 5 Nov 2019 at 09:42, Bibi Mansione <golgote@gmail.com> wrote:

> Hi,
> I am trying to create a ts_vector from a French text. Here are the operations that seem logical to perform in that order:
> 1. remove stopwords
> 2. use hunspell to find word roots
> 3. unaccent

I can't speak to French, but we use a similar configuration in English, with unaccent first, then hunspell. We found that there were words that hunspell didn't recognise, but instead pulled apart (for example, "contract" became "con" and "tract"), so I wonder if something similar is happening with "découvrir." To solve this, we put a custom dictionary with these terms in front of hunspell. Unaccent definitely has to be called first. We also modified hunspell with a custom stopwords file, to eliminate select other terms, such as profanities:

-- We use a custom stopwords file, to filter out other terms, such as profanities
ALTER TEXT SEARCH DICTIONARY
hunspell_en_ca (
Stopwords = our_custom_stopwords
);
-- Adding english_stem allows us to recognize words which hunspell
-- doesn't, particularly acronyms such as CGA
ALTER TEXT SEARCH CONFIGURATION
our_configuration
ALTER MAPPING FOR
asciiword, asciihword, hword_asciipart,
word, hword, hword_part
WITH
unaccent, our_custom_dictionary, hunspell_en_ca, english_stem
;

There was definitely a fair bit of trial and error to determine the correct order and configuration.
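The reply does not say how our_custom_dictionary was built. One plausible way to protect specific terms before they reach hunspell is a dictionary based on the built-in synonym template, sketched below. The file name our_custom_terms (that is, our_custom_terms.syn in $SHAREDIR/tsearch_data) is an assumption, and each line of that file would map a term to itself, for example "contract contract".

```sql
-- Assumed definition; the template actually used in the reply is not stated.
CREATE TEXT SEARCH DICTIONARY our_custom_dictionary (
    TEMPLATE = synonym,
    SYNONYMS = our_custom_terms  -- assumed file: our_custom_terms.syn
);

-- Check that a listed term is recognized before hunspell ever sees it.
SELECT ts_lexize('our_custom_dictionary', 'contract');
```

Because a non-filtering dictionary that recognizes a token ends the lookup for that token, any term listed in the file never reaches hunspell_en_ca and so cannot be split.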