Обсуждение: Tsearch2 and Unicode?

Поиск
Список
Период
Сортировка

Tsearch2 and Unicode?

От
Dawid Kuroczko
Дата:
I'm trying to use tsearch2 with database which is in 'UNICODE' encoding.
It works fine for English text, but as I intend to search Polish texts I did:

insert into pg_ts_cfg('default_polish', 'default', 'pl_PL.UTF-8');
(and I updated other pg_ts_* tables as written in manual).

However, Polish-specific chars are being eaten alive, it seems.
I.e. doing select to_tsvector('default_polish', body) from messages;
results in list of words but with national chars stripped...

I wonder, am I doing something wrong, or just tsearch2 doesn't grok
Unicode, despite the locales setting?  This also is a good question
regarding ispell_dict and its feelings regarding Unicode, but that's
another story.

Assuming Unicode unsupported means I should perhaps... oh, convert
the data to iso8859 prior feeding it to_tsvector()...  interesting idea,
but so far I have failed to actually do it.  Maybe store the data as
'bytea' and add a column with encoding information (assuming I don't
want to recreate whole database with new encoding, and that I want
to use unicode for some columns (so I don't have to keep encoding
with every text everywhere...).

And while we are at it, how do you feel -- an extra column with tsvector
and its index -- would it be OK to keep it away from my data (so I can
safely get rid of them if need be)?
[ I intend to keep index of around 2 000 000 records, few KBs of
text each ]...

  Regards,
      Dawid Kuroczko

Re: Tsearch2 and Unicode?

От
Oleg Bartunov
Дата:
Dawid,

unfortunately, tsearch2 doesn't support unicode yet.
If you keep tsvector separately from data than you'll need one more join.

     Oleg
On Wed, 17 Nov 2004, Dawid Kuroczko wrote:

> I'm trying to use tsearch2 with database which is in 'UNICODE' encoding.
> It works fine for English text, but as I intend to search Polish texts I did:
>
> insert into pg_ts_cfg('default_polish', 'default', 'pl_PL.UTF-8');
> (and I updated other pg_ts_* tables as written in manual).
>
> However, Polish-specific chars are being eaten alive, it seems.
> I.e. doing select to_tsvector('default_polish', body) from messages;
> results in list of words but with national chars stripped...
>
> I wonder, am I doing something wrong, or just tsearch2 doesn't grok
> Unicode, despite the locales setting?  This also is a good question
> regarding ispell_dict and its feelings regarding Unicode, but that's
> another story.
>
> Assuming Unicode unsupported means I should perhaps... oh, convert
> the data to iso8859 prior feeding it to_tsvector()...  interesting idea,
> but so far I have failed to actually do it.  Maybe store the data as
> 'bytea' and add a column with encoding information (assuming I don't
> want to recreate whole database with new encoding, and that I want
> to use unicode for some columns (so I don't have to keep encoding
> with every text everywhere...).
>
> And while we are at it, how do you feel -- an extra column with tsvector
> and its index -- would it be OK to keep it away from my data (so I can
> safely get rid of them if need be)?
> [ I intend to keep index of around 2 000 000 records, few KBs of
> text each ]...
>
>  Regards,
>      Dawid Kuroczko
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>               http://www.postgresql.org/docs/faqs/FAQ.html
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83

Re: Tsearch2 and Unicode?

От
"Markus Wollny"
Дата:
Hi!

Hi!

Oleg, what exactly do you mean by "tsearch2 doesn't support unicode yet"?

It does seem to work fine in my database, it seems:

./pg_controldata [mycluster] gives me
pg_control version number:            72
[...]
LC_COLLATE:                           de_DE.UTF-8
LC_CTYPE:                             de_DE.UTF-8

community_unicode=# SELECT pg_encoding_to_char(encoding) AS encoding FROM pg_database WHERE
datname='community_unicode';
 encoding
----------
 UNICODE
(1 row)

community_unicode=# select to_tsvector('default_german', 'Ich fände, daß das Fehlen von Umlauten ein Ärgernis wäre.');
                           to_tsvector
------------------------------------------------------------------
 'daß':3 'wäre':10 'fehlen':5 'fände':2 'umlauten':7 'Ärgernis':9
(1 row)

community_unicode=# SELECT  message_id
community_unicode-#  , rank(idxfti, to_tsquery('default_german', 'Könige|Söldner'),0) as rank
community_unicode-#  FROM ct_com_board_message
community_unicode-#  WHERE idxfti @@ to_tsquery('default_german', 'Könige|Söldner')
community_unicode-#  order by rank desc
community_unicode-#  limit 10;
 message_id |   rank
------------+----------
    3191632 | 0.686189
    2803233 | 0.686189
    2935325 | 0.686189
    2882337 | 0.686189
    2842006 | 0.686189
    2854329 | 0.686189
    2841962 | 0.686189
    2999851 | 0.651322
    2869839 | 0.651322
    2999799 |  0.61258
(10 rows)

These results look alright to me, so I cannot reproduce this phenomenon of disappearing special characters in a
unicode-database.Dawid, are you sure, you INITDB'd your cluster to the correct locale-settings? 

Kind regards

   Markus

> -----Ursprüngliche Nachricht-----
> Von: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] Im Auftrag von
> Oleg Bartunov
> Gesendet: Mittwoch, 17. November 2004 17:32
> An: Dawid Kuroczko
> Cc: Pgsql General
> Betreff: Re: [GENERAL] Tsearch2 and Unicode?
>
> Dawid,
>
> unfortunately, tsearch2 doesn't support unicode yet.
> If you keep tsvector separately from data than you'll need
> one more join.
>
>      Oleg
>

Re: Tsearch2 and Unicode?

От
Dawid Kuroczko
Дата:
On Thu, 18 Nov 2004 11:08:38 +0100, Markus Wollny
<markus.wollny@computec.de> wrote:
> Oleg, what exactly do you mean by "tsearch2 doesn't support unicode yet"?
> It does seem to work fine in my database, it seems:
> ./pg_controldata [mycluster] gives me
> pg_control version number:            72
> [...]
> LC_COLLATE:                           de_DE.UTF-8
> LC_CTYPE:                             de_DE.UTF-8

Correct me if I am wrong, but I think that UTF-8 is almost identical
to ISO-8859-1 in binary form to ISO-8859-1.  I mean, UTF-8 is
ISO-8859-1 plus multibyte characters from other charsets.

If I am correct, there is no difference for Tsearch2 between UTF-8
and ISO-8859-1, so German locales work fine.  As for ISO-8859-2
and similar this will not be the case since there are multibyte
characters.

Oh, and a side-question -- is there any facility to "strip" charsets
like ISO-8859-2 down to ASCII (not SQL_ASCII, I mean ASCII, only
latin characters without accents, etc.)?  I think it would be useful
for me, since I intend to search through mail-like messages and
many people are lazy (or accustomed to...) and don't use special
chars but only their ASCII non-accented counterparts...

   Regards,
      Dawid

Re: Tsearch2 and Unicode?

От
dom@happygiraffe.net (Dominic Mitchell)
Дата:
On Thu, Nov 18, 2004 at 12:17:01PM +0100, Dawid Kuroczko wrote:
> Correct me if I am wrong, but I think that UTF-8 is almost identical
> to ISO-8859-1 in binary form to ISO-8859-1.  I mean, UTF-8 is
> ISO-8859-1 plus multibyte characters from other charsets.

No, UTF-8 and ISO-8859-1 are different from characters 128 and up.  Even
though code point 160 represents "non breaking space" in both, they are
represented differently in terms of bytes.  One is 0xA0 and the other is
0xC2 0xA0.

-Dom

Re: Tsearch2 and Unicode?

От
"Markus Wollny"
Дата:
Hi!

I dug through my list-archives - I actually used to have the very same problem that you described: special chars being
swallowedby tsearch2-functions. The source of the problem was that I had INITDB'ed my cluster with DE@euro as locale,
whereasmy databases used Unicode encoding. This does not work correctly. I had to dump, initdb to the correct
UTF-8-locale(de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly. You may find the original discussion
here:http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php 
If you wish to find out which locale was used during INITDB for your cluster, you may use the pg_controldata program
that'ssupplied with PostgreSQL. 

Kind regards

   Markus



> -----Ursprüngliche Nachricht-----
> Von: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] Im Auftrag von
> Dawid Kuroczko
> Gesendet: Mittwoch, 17. November 2004 17:17
> An: Pgsql General
> Betreff: [GENERAL] Tsearch2 and Unicode?
>
> I'm trying to use tsearch2 with database which is in
> 'UNICODE' encoding.
> It works fine for English text, but as I intend to search
> Polish texts I did:
>
> insert into pg_ts_cfg('default_polish', 'default',
> 'pl_PL.UTF-8'); (and I updated other pg_ts_* tables as
> written in manual).
>
> However, Polish-specific chars are being eaten alive, it seems.
> I.e. doing select to_tsvector('default_polish', body) from
> messages; results in list of words but with national chars stripped...
>
> I wonder, am I doing something wrong, or just tsearch2
> doesn't grok Unicode, despite the locales setting?  This also
> is a good question regarding ispell_dict and its feelings
> regarding Unicode, but that's another story.
>
> Assuming Unicode unsupported means I should perhaps... oh,
> convert the data to iso8859 prior feeding it to_tsvector()...
>  interesting idea, but so far I have failed to actually do
> it.  Maybe store the data as 'bytea' and add a column with
> encoding information (assuming I don't want to recreate whole
> database with new encoding, and that I want to use unicode
> for some columns (so I don't have to keep encoding with every
> text everywhere...).
>
> And while we are at it, how do you feel -- an extra column
> with tsvector and its index -- would it be OK to keep it away
> from my data (so I can safely get rid of them if need be)?
> [ I intend to keep index of around 2 000 000 records, few KBs
> of text each ]...
>
>   Regards,
>       Dawid Kuroczko
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faqs/FAQ.html
>

Re: Tsearch2 and Unicode?

От
Oleg Bartunov
Дата:
This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

---559023410-491009931-1101131295=:24069
Content-Type: TEXT/PLAIN; charset=X-UNKNOWN; format=flowed
Content-Transfer-Encoding: 8BIT

Markus,

it'd be nice if you (or somebody) wrtite a note about unicode, so it
could be added to tsearch2 documentation. It will help people and save
time and hair :)


     Oleg
On Mon, 22 Nov 2004, Markus Wollny wrote:

> Hi!
>
> I dug through my list-archives - I actually used to have the very same problem that you described: special chars
beingswallowed by tsearch2-functions. The source of the problem was that I had INITDB'ed my cluster with DE@euro as
locale,whereas my databases used Unicode encoding. This does not work correctly. I had to dump, initdb to the correct
UTF-8-locale(de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly. You may find the original discussion
here:http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php 
> If you wish to find out which locale was used during INITDB for your cluster, you may use the pg_controldata program
that'ssupplied with PostgreSQL. 
>
> Kind regards
>
>   Markus
>
>
>
>> -----Urspr�ngliche Nachricht-----
>> Von: pgsql-general-owner@postgresql.org
>> [mailto:pgsql-general-owner@postgresql.org] Im Auftrag von
>> Dawid Kuroczko
>> Gesendet: Mittwoch, 17. November 2004 17:17
>> An: Pgsql General
>> Betreff: [GENERAL] Tsearch2 and Unicode?
>>
>> I'm trying to use tsearch2 with database which is in
>> 'UNICODE' encoding.
>> It works fine for English text, but as I intend to search
>> Polish texts I did:
>>
>> insert into pg_ts_cfg('default_polish', 'default',
>> 'pl_PL.UTF-8'); (and I updated other pg_ts_* tables as
>> written in manual).
>>
>> However, Polish-specific chars are being eaten alive, it seems.
>> I.e. doing select to_tsvector('default_polish', body) from
>> messages; results in list of words but with national chars stripped...
>>
>> I wonder, am I doing something wrong, or just tsearch2
>> doesn't grok Unicode, despite the locales setting?  This also
>> is a good question regarding ispell_dict and its feelings
>> regarding Unicode, but that's another story.
>>
>> Assuming Unicode unsupported means I should perhaps... oh,
>> convert the data to iso8859 prior feeding it to_tsvector()...
>>  interesting idea, but so far I have failed to actually do
>> it.  Maybe store the data as 'bytea' and add a column with
>> encoding information (assuming I don't want to recreate whole
>> database with new encoding, and that I want to use unicode
>> for some columns (so I don't have to keep encoding with every
>> text everywhere...).
>>
>> And while we are at it, how do you feel -- an extra column
>> with tsvector and its index -- would it be OK to keep it away
>> from my data (so I can safely get rid of them if need be)?
>> [ I intend to keep index of around 2 000 000 records, few KBs
>> of text each ]...
>>
>>   Regards,
>>       Dawid Kuroczko
>>
>> ---------------------------(end of
>> broadcast)---------------------------
>> TIP 5: Have you checked our extensive FAQ?
>>
>>                http://www.postgresql.org/docs/faqs/FAQ.html
>>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: if posting/reading through Usenet, please send an appropriate
>      subscribe-nomail command to majordomo@postgresql.org so that your
>      message can get through to the mailing list cleanly
>

     Regards,
         Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
---559023410-491009931-1101131295=:24069--