Hi,
I'm trying to get a German ispell dictionary with compound-word
support to work with PostgreSQL 8.3.7. I tried the following three
dictionaries:
- http://ftp.services.openoffice.org/pub/OpenOffice.org/contrib/dictionaries/de_DE_frami.zip
(for OpenOffice 2),
- http://extensions.services.openoffice.org/project/dict-de_DE_frami
(for OpenOffice 3) and
- http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2/dicts/ispell/ispell-german-compound.tar.gz.
Each file was converted to UTF-8 via iconv. I created the
dictionary with the following command:
CREATE TEXT SEARCH DICTIONARY german_ispell (
    Template = ispell,
    DictFile = de_de_frami,
    AffFile = de_de_frami,
    StopWords = german
);
Then I test it with:
SELECT ts_lexize('german_ispell', 'haustür');
which should yield 'haus' and 'tür'. With the first two
dictionaries ts_lexize returns nothing at all, so compound words
don't seem to work with them.
The third one works and returns 'haus' and 'tür', but only if I
remove all lines containing umlauts from de_de_frami.affix. If I
leave those lines in, I get a syntax error during parsing:
ERROR: syntax error
CONTEXT: line 224 of configuration file
"/usr/local/share/postgresql/tsearch_data/de_de_frami.affix": " ABE
> -ABE,äBIN
"
The problem seems to be that PostgreSQL runs on OpenBSD, which
supports no locale other than C. The affix file contains umlauts
and is encoded in UTF-8, as required by PostgreSQL, but parsing
probably fails in the function parse_affentry() in spell.c,
specifically in t_isalpha(), which it calls.
In t_isalpha() there is:

if (clen == 1 || lc_ctype_is_c())
    return isalpha(TOUCHAR(ptr));
which fails for the umlauts in the affix file. Is there any reason
to check for an lc_ctype of C here? The affix file is in UTF-8, and
each line is converted to the encoding used by the database. Why is
there a check for the C locale?
Or am I completely wrong, and this is not the reason the parsing of
the affix file fails?
Thanks for your help.
Christof