Re: integrated tsearch doesn't work with non utf8 database

From: Heikki Linnakangas
Subject: Re: integrated tsearch doesn't work with non utf8 database
Date:
Msg-id: 46E1C15E.5060805@enterprisedb.com
In reply to: integrated tsearch doesn't work with non utf8 database  ("Pavel Stehule" <pavel.stehule@gmail.com>)
Responses: Re: integrated tsearch doesn't work with non utf8 database  (Oleg Bartunov <oleg@sai.msu.su>)
           Re: integrated tsearch doesn't work with non utf8 database  (Teodor Sigaev <teodor@sigaev.ru>)
List: pgsql-hackers
Pavel Stehule wrote:
> postgres=# select ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
> ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
> CONTEXT:  SQL function "ts_debug" statement 1

I can reproduce that. In fact, you don't need the custom config or
dictionary at all:

postgres=# CREATE DATABASE latin2 encoding='latin2';
CREATE DATABASE
postgres=# \c latin2
You are now connected to database "latin2".
latin2=#  select ts_debug('simple','foo');
ERROR:  character 0xc3a5 of encoding "UTF8" has no equivalent in "LATIN2"
CONTEXT:  SQL function "ts_debug" statement 1

It fails trying to lexize the string using the danish snowball stemmer,
because the danish stopword file contains character 'å' which doesn't
have an equivalent in LATIN2.
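
The byte sequence in the error message is exactly that character: 0xc3 0xa5 is the UTF-8 encoding of 'å'. A quick sketch (Python here, purely illustrative, nothing to do with the server code) showing why the conversion has to fail:

```python
# 'å' (U+00E5) is the character behind the error message: its UTF-8
# encoding is the two-byte sequence 0xc3 0xa5.
ch = "\u00e5"  # 'å'
print(ch.encode("utf-8"))  # b'\xc3\xa5'

# ISO 8859-2 (Latin-2) covers Central European languages and simply has
# no code point for 'å', so any attempt to convert it must error out:
try:
    ch.encode("iso8859-2")
except UnicodeEncodeError:
    print("no equivalent in LATIN2")
```

The server hits the same wall when it reads the UTF-8 stopword file and tries to recode it into the database encoding.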

Now what the heck is it doing with the danish stemmer, you might ask.
ts_debug is implemented as a SQL function; EXPLAINing the complex SELECT
behind it, I get this plan:

latin2=# \i foo.sql
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=2.80..1134.45 rows=80 width=100)
   Hash Cond: (parse.tokid = tt.tokid)
   InitPlan
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
     ->  Seq Scan on pg_ts_config  (cost=0.00..1.20 rows=1 width=4)
           Filter: (oid = 3748::oid)
   ->  Function Scan on ts_parse parse  (cost=0.00..12.50 rows=1000 width=36)
   ->  Hash  (cost=0.20..0.20 rows=16 width=68)
         ->  Function Scan on ts_token_type tt  (cost=0.00..0.20 rows=16 width=68)
   SubPlan
     ->  Limit  (cost=7.33..7.36 rows=1 width=36)
           ->  Subquery Scan dl  (cost=7.33..7.36 rows=1 width=36)
                 ->  Sort  (cost=7.33..7.34 rows=1 width=8)
                       Sort Key: m.mapseqno
                       ->  Seq Scan on pg_ts_config_map m  (cost=0.00..7.32 rows=1 width=8)
                             Filter: ((ts_lexize(mapdict, $1) IS NOT NULL) AND (mapcfg = 3765::oid) AND (maptokentype = $0))
     ->  Sort  (cost=6.57..6.57 rows=1 width=8)
           Sort Key: m.mapseqno
           ->  Seq Scan on pg_ts_config_map m  (cost=0.00..6.56 rows=1 width=8)
                 Filter: ((mapcfg = 3765::oid) AND (maptokentype = $0))
(21 rows)

Note the Seq Scan on pg_ts_config_map, with a filter on ts_lexize(mapdict,
$1). That means it will call ts_lexize on every dictionary in the catalog,
which in turn tries to load every dictionary. Loading the danish_stem
dictionary fails in a latin2 database because of the problem with the
stopword file.

We could rewrite ts_debug as a C-function, so that it doesn't try to
access any unnecessary dictionaries. It seems wrong to install
dictionaries in databases where they won't work in the first place, but
I don't see an easy fix for that. Any comments or better ideas?

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com


In pgsql-hackers by date:

Previous
From: Josh Berkus
Date:
Message: Re: A Silly Idea for Vertically-Oriented Databases
Next
From: Robert Treat
Date:
Message: equivilant to contrib "tsearch" trigger function in 8.3