Discussion: Tsearch2 and Unicode?
I'm trying to use tsearch2 with a database in 'UNICODE' encoding. It works fine for English text, but as I intend to search Polish texts I did:

    insert into pg_ts_cfg values ('default_polish', 'default', 'pl_PL.UTF-8');

(and I updated the other pg_ts_* tables as described in the manual).

However, Polish-specific characters seem to be eaten alive. I.e. doing

    select to_tsvector('default_polish', body) from messages;

results in a list of words, but with the national characters stripped...

I wonder: am I doing something wrong, or does tsearch2 just not grok Unicode, despite the locale setting? The same question applies to ispell_dict and its feelings about Unicode, but that's another story.

If Unicode is unsupported, perhaps I should... oh, convert the data to iso8859 before feeding it to to_tsvector()... an interesting idea, but so far I have failed to actually do it. Maybe store the data as 'bytea' and add a column with encoding information (assuming I don't want to recreate the whole database with a new encoding, and that I want to use Unicode for some columns, so I don't have to keep the encoding with every text everywhere...).

And while we are at it, what do you think of keeping an extra column with the tsvector and its index away from my data (so I can safely get rid of them if need be)? [I intend to keep an index of around 2,000,000 records, a few KB of text each]...

Regards,
Dawid Kuroczko
Dawid,

unfortunately, tsearch2 doesn't support Unicode yet. If you keep the tsvector separately from the data, then you'll need one more join.

Oleg

On Wed, 17 Nov 2004, Dawid Kuroczko wrote:
> I'm trying to use tsearch2 with database which is in 'UNICODE' encoding.
> It works fine for English text, but as I intend to search Polish texts I did:
> [...]

Regards,
Oleg
_____________________________________________________________
Oleg Bartunov, sci.researcher, hostmaster of AstroNet,
Sternberg Astronomical Institute, Moscow University (Russia)
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(095)939-16-83, +007(095)939-23-83
Hi!

Oleg, what exactly do you mean by "tsearch2 doesn't support unicode yet"? It does seem to work fine in my database. ./pg_controldata [mycluster] gives me:

    pg_control version number: 72
    [...]
    LC_COLLATE: de_DE.UTF-8
    LC_CTYPE: de_DE.UTF-8

    community_unicode=# SELECT pg_encoding_to_char(encoding) AS encoding
    community_unicode-# FROM pg_database WHERE datname='community_unicode';
     encoding
    ----------
     UNICODE
    (1 row)

    community_unicode=# select to_tsvector('default_german', 'Ich fände, daß das Fehlen von Umlauten ein Ärgernis wäre.');
                               to_tsvector
    ------------------------------------------------------------------
     'daß':3 'wäre':10 'fehlen':5 'fände':2 'umlauten':7 'Ärgernis':9
    (1 row)

    community_unicode=# SELECT message_id
    community_unicode-# , rank(idxfti, to_tsquery('default_german', 'Könige|Söldner'),0) as rank
    community_unicode-# FROM ct_com_board_message
    community_unicode-# WHERE idxfti @@ to_tsquery('default_german', 'Könige|Söldner')
    community_unicode-# order by rank desc
    community_unicode-# limit 10;
     message_id |   rank
    ------------+----------
        3191632 | 0.686189
        2803233 | 0.686189
        2935325 | 0.686189
        2882337 | 0.686189
        2842006 | 0.686189
        2854329 | 0.686189
        2841962 | 0.686189
        2999851 | 0.651322
        2869839 | 0.651322
        2999799 | 0.61258
    (10 rows)

These results look alright to me, so I cannot reproduce this phenomenon of disappearing special characters in a Unicode database. Dawid, are you sure you initdb'd your cluster with the correct locale settings?

Kind regards

Markus

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On behalf of Oleg Bartunov
> Sent: Wednesday, 17 November 2004 17:32
> To: Dawid Kuroczko
> Cc: Pgsql General
> Subject: Re: [GENERAL] Tsearch2 and Unicode?
>
> Dawid,
>
> unfortunately, tsearch2 doesn't support unicode yet.
> [...]
On Thu, 18 Nov 2004 11:08:38 +0100, Markus Wollny <markus.wollny@computec.de> wrote:
> Oleg, what exactly do you mean by "tsearch2 doesn't support unicode yet"?
> It does seem to work fine in my database. ./pg_controldata [mycluster] gives me:
> pg_control version number: 72
> [...]
> LC_COLLATE: de_DE.UTF-8
> LC_CTYPE: de_DE.UTF-8

Correct me if I am wrong, but I think that UTF-8 is almost identical to ISO-8859-1 in binary form. I mean, UTF-8 is ISO-8859-1 plus multibyte characters from other charsets. If I am correct, there is no difference for tsearch2 between UTF-8 and ISO-8859-1, so the German locales work fine. For ISO-8859-2 and similar charsets this will not be the case, since their special characters become multibyte sequences in UTF-8.

Oh, and a side question: is there any facility to "strip" charsets like ISO-8859-2 down to ASCII (not SQL_ASCII; I mean plain ASCII, only Latin characters without accents, etc.)? I think it would be useful for me, since I intend to search through mail-like messages, and many people are lazy (or accustomed to it) and don't use the special characters, only their non-accented ASCII counterparts...

Regards,
Dawid
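[Editor's note: as far as I know nothing like this shipped with tsearch2 at the time (PostgreSQL's unaccent extension came years later). Outside the database, the usual trick for the accent-stripping Dawid asks about is Unicode decomposition: split each accented letter into a base letter plus combining marks, then drop the marks. A minimal Python sketch, with the caveat that stroked letters such as the Polish 'ł' have no decomposition and survive untouched:]

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits e.g. 'ó' into 'o' + U+0301 COMBINING ACUTE ACCENT;
    # dropping the combining marks leaves only the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Kraków"))  # -> Krakow
print(strip_accents("łódź"))    # -> łodz ('ł' is a stroked letter, not
                                #    base + accent, so it stays as-is)
```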
On Thu, Nov 18, 2004 at 12:17:01PM +0100, Dawid Kuroczko wrote: > Correct me if I am wrong, but I think that UTF-8 is almost identical > to ISO-8859-1 in binary form to ISO-8859-1. I mean, UTF-8 is > ISO-8859-1 plus multibyte characters from other charsets. No, UTF-8 and ISO-8859-1 are different from characters 128 and up. Even though code point 160 represents "non breaking space" in both, they are represented differently in terms of bytes. One is 0xA0 and the other is 0xC2 0xA0. -Dom
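[Editor's note: Dominic's point is easy to verify: the two charsets share code points below 256, but not byte representations. A quick Python check (the encoding names are as Python spells them: "latin-1" for ISO-8859-1, "iso8859-2" for ISO-8859-2):]

```python
nbsp = "\u00a0"  # NO-BREAK SPACE: code point 160 in both charsets
print(nbsp.encode("latin-1"))   # b'\xa0'      - one byte in ISO-8859-1
print(nbsp.encode("utf-8"))     # b'\xc2\xa0'  - two bytes in UTF-8

# The same holds for ISO-8859-2 letters such as the Polish 'ł':
print("ł".encode("iso8859-2"))  # b'\xb3'      - one byte
print("ł".encode("utf-8"))      # b'\xc5\x82'  - two bytes
```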
Hi!

I dug through my list archives - I actually used to have the very same problem that you described: special chars being swallowed by the tsearch2 functions. The source of the problem was that I had initdb'ed my cluster with DE@euro as the locale, whereas my databases used Unicode encoding. This does not work correctly. I had to dump, initdb with the correct UTF-8 locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly. You may find the original discussion here: http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php

If you wish to find out which locale was used during initdb for your cluster, you may use the pg_controldata program that's supplied with PostgreSQL.

Kind regards

Markus

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org
> [mailto:pgsql-general-owner@postgresql.org] On behalf of Dawid Kuroczko
> Sent: Wednesday, 17 November 2004 17:17
> To: Pgsql General
> Subject: [GENERAL] Tsearch2 and Unicode?
>
> I'm trying to use tsearch2 with database which is in 'UNICODE' encoding.
> It works fine for English text, but as I intend to search Polish texts I did:
> [...]
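[Editor's note: the failure mode Markus describes can be mimicked in a few lines. A tokenizer that classifies raw bytes with ASCII-only "letter" rules - roughly what the C library's isalpha() does under a non-UTF-8 locale such as "C" or DE@euro - fails every byte of a multibyte UTF-8 character and silently drops it. This is an illustration of the effect, not tsearch2's actual parser:]

```python
import re

def ascii_only_tokens(text: str) -> list[str]:
    # Work on the raw UTF-8 bytes and accept only ASCII letters as
    # word characters; every byte of a multibyte character (0x80-0xFF)
    # fails the [A-Za-z] test and acts as a word separator.
    data = text.encode("utf-8")
    return [w.decode("ascii") for w in re.findall(rb"[A-Za-z]+", data)]

print(ascii_only_tokens("zażółć gęślą jaźń"))
# Only the ASCII fragments of each word survive - the national
# characters are "eaten alive", exactly as in Dawid's report.
```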
Markus,

it'd be nice if you (or somebody) wrote a note about Unicode, so it could be added to the tsearch2 documentation. It would help people and save time and hair :)

Oleg

On Mon, 22 Nov 2004, Markus Wollny wrote:
> Hi!
>
> I dug through my list archives - I actually used to have the very same problem that you described: special chars being swallowed by the tsearch2 functions. The source of the problem was that I had initdb'ed my cluster with DE@euro as the locale, whereas my databases used Unicode encoding. This does not work correctly. I had to dump, initdb with the correct UTF-8 locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly.
> [...]

Regards,
Oleg