Discussion: tsearch2: enable non-ascii stop words with C locale
Hi,

Currently tsearch2 does not accept non-ascii stop words if the locale is
C. The included patch should fix the problem. The patch is against
PostgreSQL 8.2.3.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

*** wordparser/parser.c~	2007-01-16 00:16:11.000000000 +0900
--- wordparser/parser.c	2007-02-10 18:04:59.000000000 +0900
***************
*** 246,251 ****
--- 246,266 ----
  static int
  p_islatin(TParser * prs)
  {
+ 	if (prs->usewide)
+ 	{
+ 		if (lc_ctype_is_c())
+ 		{
+ 			unsigned int c = *(unsigned int *) (prs->wstr + prs->state->poschar);
+ 
+ 			/*
+ 			 * any non-ascii symbol with multibyte encoding
+ 			 * with C-locale is a latin character
+ 			 */
+ 			if (c > 0x7f)
+ 				return 1;
+ 		}
+ 	}
+ 
  	return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
  }
> Currently tsearch2 does not accept non ascii stop words if locale is
> C. Included patches should fix the problem. Patches against PostgreSQL
> 8.2.3.

I'm not sure the patch's description is correct.

First, the p_islatin() function is used only in the word/lexeme parser,
not in the stop-word code.

Second, p_islatin() is used for catching lexemes such as URLs or HTML
entities, so it is important that it identifies real latin characters.
And it works correctly: it calls p_isalpha() (already patched for your
case), then it calls p_isascii(), which should be correct for any
encoding with the C locale.

Third (and last):

contrib_regression=# show server_encoding;
 server_encoding
-----------------
 UTF8

contrib_regression=# show lc_ctype;
 lc_ctype
----------
 C

contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
 lexize
--------
 {}

Russian characters in UTF8 take two bytes.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
> > Currently tsearch2 does not accept non ascii stop words if locale is
> > C. Included patches should fix the problem. Patches against PostgreSQL
> > 8.2.3.
>
> I'm not sure about correctness of patch's description.
>
> First, p_islatin() function is used only in words/lexemes parser, not stop-word
> code.

I know. My guess is that the parser does not read the stop-word file, at
least with the default configuration.

> Second, p_islatin() function is used for catching lexemes like URL or HTML
> entities, so, it's important to define real latin characters. And it works
> right: it calls p_isalpha (already patched for your case), then it calls
> p_isascii which should be correct for any encodings with C-locale.

The original p_islatin() is defined as follows:

static int
p_islatin(TParser * prs)
{
	return (p_isalpha(prs) && p_isascii(prs)) ? 1 : 0;
}

So if a character is not ASCII, it returns 0 even if p_isalpha() returns
1. Is this what you expect?

> Third (and last):
> contrib_regression=# show server_encoding;
>  server_encoding
> -----------------
>  UTF8
> contrib_regression=# show lc_ctype;
>  lc_ctype
> ----------
>  C
> contrib_regression=# select lexize('ru_stem_utf8', RUSSIAN_STOP_WORD);
>  lexize
> --------
>  {}
>
> Russian characters with UTF8 take two bytes.

In our case, we added JAPANESE_STOP_WORD to english.stop, then ran:

select to_tsvector(JAPANESE_STOP_WORD)

which returns words even when they are in JAPANESE_STOP_WORD. With the
patch applied, the problem was solved.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
> I know. My guess is the parser does not read the stop word file at
> least with default configuration.

The parser should not read the stop-word file: that is the
dictionaries' job.

> So if a character is not ASCII, it returns 0 even if p_isalpha returns
> 1. Is this what you expect?

No. p_islatin() should return true only for latin characters, not for
national ones.

> In our case, we added JAPANESE_STOP_WORD into english.stop then:
> select to_tsvector(JAPANESE_STOP_WORD)
> which returns words even they are in JAPANESE_STOP_WORD.
> And with the patches the problem was solved.

Please show your lexeme/dictionary configuration. I suspect that you
have the en_stem dictionary enabled for the lword lexeme type. A better
way is to use the 'simple' dictionary (it supports stop words the same
way en_stem does) and set it for the nlword, word, part_hword,
nlpart_hword, hword and nlhword lexeme types. Note: leave en_stem
unchanged for any latin words.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/
> > I know. My guess is the parser does not read the stop word file at
> > least with default configuration.
>
> Parser should not read stopword file: its deal for dictionaries.

I'll come up with more detailed info explaining why the stop-word file
is not read.

> > So if a character is not ASCII, it returns 0 even if p_isalpha returns
> > 1. Is this what you expect?
>
> No, p_islatin should return true only for latin characters, not for national ones.

Please give a precise definition of "latin" in the C locale. Are you
saying that single-byte characters in the range 0-0x7f are "latin"? If
so, that seems to be exactly the same as ASCII.
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> > In our case, we added JAPANESE_STOP_WORD into english.stop then:
> > select to_tsvector(JAPANESE_STOP_WORD)
> > which returns words even they are in JAPANESE_STOP_WORD.
> > And with the patches the problem was solved.
>
> Pls, show your configuration for lexemes/dictionaries. I suspect that you have
> en_stem dictionary on for lword lexemes type. Better way is to use 'simple'
> distionary (it's support stopword the same way as en_stem does) and set it for
> nlword, word, part_hword, nlpart_hword, hword, nlhword lexeme's types. Note,
> leave unchanged en_stem for any latin word.
>
> --
> Teodor Sigaev                                   E-mail: teodor@sigaev.ru
> Precise definition for "latin" in C locale please. Are you saying that
> single byte encoding with range 0-7f? is "latin"? If so, it seems they
> are exacty same as ASCII.

p_islatin() returns true for ASCII alphabetic characters.
--
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                   WWW: http://www.sigaev.ru/