Discussion: Extending range of to_tsvector et al
When using to_tsvector, a number of newer Unicode characters and PUA characters are not included. How do I add the characters which I desire to be found?

Regards,
John

--
View this message in context: http://postgresql.1045698.n5.nabble.com/Extending-range-of-to-tsvector-et-al-tp5726041.html
Sent from the PostgreSQL - hackers mailing list archive at Nabble.com.
On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:
> When using to_tsvector, a number of newer Unicode characters and PUA
> characters are not included. How do I add the characters which I desire
> to be found?

I've just started digging into this code a bit, but from what I've found, src/backend/tsearch/wparser_def.c defines much of the parser functionality, and in the area of Unicode includes a number of comments like:

 * with multibyte encoding and C-locale isw* function may fail or give
 * wrong result.
 * multibyte encoding and C-locale often are used for Asian languages.
 * any non-ascii symbol with multibyte encoding with C-locale is an
 * alpha character

... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if WCSTOMBS and TOWLOWER are available) to complicate testing scenarios :)

Also note that src/test/regress/sql/tsearch.sql and regress/sql/tsdicts.sql currently focus on English, ASCII-only data.

Perhaps this is a good opportunity for you to describe what your environment looks like (OS, PostgreSQL version, encoding and locale settings for the database) and show some sample to_tsquery() @@ to_tsvector() queries that don't behave the way you think they should behave - and we could start building some test cases as a first step?

--
Dan Scott
Laurentian University
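[Editorial note: the locale sensitivity of the isw* functions that those wparser_def.c comments describe can be reproduced outside PostgreSQL. A minimal sketch, assuming a glibc-style C library on Linux (the LC_CTYPE constant 0 and the locale names are platform assumptions), calling iswalpha() through Python's ctypes:]

```python
import ctypes
import ctypes.util

# Load the platform C library (assumes a Unix-like system with glibc).
libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.setlocale.restype = ctypes.c_char_p

LC_CTYPE = 0  # glibc's value for the LC_CTYPE category; platform assumption

# In the C locale, iswalpha() recognizes only ASCII letters, so the
# CJK Extension B ideograph U+2662D is reported as non-alphabetic.
libc.setlocale(LC_CTYPE, b"C")
print(libc.iswalpha(ord("A")) != 0)   # True
print(libc.iswalpha(0x2662D) != 0)    # False under glibc's C locale

# Under a UTF-8 locale the answer comes from the platform's locale
# data, which may lag behind the Unicode standard (hence characters
# added in newer Unicode versions being treated as non-letters).
if libc.setlocale(LC_CTYPE, b"en_US.UTF-8"):  # may not be installed
    print(libc.iswalpha(0x2662D) != 0)
```

This mirrors what the tsearch parser sees: the same code point can be a letter or not depending on which locale data backs the isw* calls.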
Dear Dan,

Thank you for your reply.

The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on a utf8 locale.

A short 5-line dictionary file is sufficient to test:

raeuz
我们
𦘭𥎵
𪽖𫖂


line 1 "raeuz": a Zhuang word written using English letters; shows up under ts_vector OK
line 2 "我们": an everyday Chinese word; shows up under ts_vector OK
line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters found in Unicode 3.1, which came in about the year 2000; shows up under ts_vector OK
line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters found in Unicode 5.2, which came in about the year 2009; does not show up under ts_vector
line 5 "": a Zhuang word written using rather old Chinese characters found in the PUA area of the font Sawndip.ttf; does not show up under ts_vector (the font can be downloaded from http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

The last two words, even though included in a dictionary, do not get accepted by ts_vector.

Regards,
John

On Mon, Oct 1, 2012 at 11:04 AM, Dan Scott <denials@gmail.com> wrote:
> On Sun, Sep 30, 2012 at 1:56 PM, johnkn63 <john.knightley@gmail.com> wrote:
>> When using to_tsvector, a number of newer Unicode characters and PUA
>> characters are not included. How do I add the characters which I desire
>> to be found?
>
> I've just started digging into this code a bit, but from what I've
> found, src/backend/tsearch/wparser_def.c defines much of the parser
> functionality, and in the area of Unicode includes a number of
> comments like:
>
>  * with multibyte encoding and C-locale isw* function may fail or give
>  * wrong result.
>  * multibyte encoding and C-locale often are used for Asian languages.
>  * any non-ascii symbol with multibyte encoding with C-locale is an
>  * alpha character
>
> ... in concert with ifdefs around WIDE_UPPER_LOWER (in effect if
> WCSTOMBS and TOWLOWER are available) to complicate testing scenarios :)
>
> Also note that src/test/regress/sql/tsearch.sql and
> regress/sql/tsdicts.sql currently focus on English, ASCII-only data.
>
> Perhaps this is a good opportunity for you to describe what your
> environment looks like (OS, PostgreSQL version, encoding and locale
> settings for the database) and show some sample to_tsquery() @@
> to_tsvector() queries that don't behave the way you think they should
> behave - and we could start building some test cases as a first step?
>
> --
> Dan Scott
> Laurentian University
Hi John:

On Sun, Sep 30, 2012 at 11:45 PM, john knightley <john.knightley@gmail.com> wrote:
> Dear Dan,
>
> Thank you for your reply.
>
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
> a utf8 locale.
>
> A short 5-line dictionary file is sufficient to test:
>
> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
>
>
> line 1 "raeuz": a Zhuang word written using English letters; shows up
> under ts_vector OK
> line 2 "我们": an everyday Chinese word; shows up under ts_vector OK
> line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
> found in Unicode 3.1, which came in about the year 2000; shows up
> under ts_vector OK
> line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
> found in Unicode 5.2, which came in about the year 2009; does not show
> up under ts_vector
> line 5 "": a Zhuang word written using rather old Chinese characters
> found in the PUA area of the font Sawndip.ttf; does not show up under
> ts_vector (the font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> The last two words, even though included in a dictionary, do not get
> accepted by ts_vector.

Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here; the latter seems to work using the default text search configuration (albeit with one crucial note: I created the database with the "lc_ctype=C lc_collate=C" options):

WORKING:

createdb --template=template0 --lc-ctype=C --lc-collate=C foobar

foobar=# select ts_debug('');
                           ts_debug
----------------------------------------------------------------
 (word,"Word, all letters",,{english_stem},english_stem,{})
(1 row)

NOT WORKING AS EXPECTED:

foobaz=# SHOW LC_CTYPE;
  lc_ctype
-------------
 en_US.UTF-8
(1 row)

foobaz=# select ts_debug('');
            ts_debug
---------------------------------
 (blank,"Space symbols",,{},,)
(1 row)

So... perhaps LC_CTYPE=C is a possible workaround for you?
john knightley <john.knightley@gmail.com> writes:
> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
> a utf8 locale.
>
> A short 5-line dictionary file is sufficient to test:
>
> raeuz
> 我们
> 𦘭𥎵
> 𪽖𫖂
>
> line 1 "raeuz": a Zhuang word written using English letters; shows up
> under ts_vector OK
> line 2 "我们": an everyday Chinese word; shows up under ts_vector OK
> line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
> found in Unicode 3.1, which came in about the year 2000; shows up
> under ts_vector OK
> line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
> found in Unicode 5.2, which came in about the year 2009; does not show
> up under ts_vector
> line 5 "": a Zhuang word written using rather old Chinese characters
> found in the PUA area of the font Sawndip.ttf; does not show up under
> ts_vector (the font can be downloaded from
> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)

AFAIK there is nothing in Postgres itself that would distinguish, say, 𦘭 from 𪽖.  I think this must be down to your platform's locale definition: it probably thinks that the former is a letter and the latter is not.  You'd have to gripe to the locale maintainers to get that fixed.

			regards, tom lane
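[Editorial note: Tom's diagnosis can be cross-checked against a current Unicode database. Python's unicodedata module (used here purely as a stand-in; PostgreSQL itself consults the platform locale, not this module) assigns both ideographs the same letter category:]

```python
import unicodedata

# U+2662D (CJK Extension B, added in Unicode 3.1) and
# U+2B582 (CJK Extension C, added in Unicode 5.2).
for cp in (0x2662D, 0x2B582):
    print(f"U+{cp:X}: {unicodedata.category(chr(cp))}")
# Both print category Lo ("Letter, other") with an up-to-date Unicode
# database; a locale built from older Unicode data may still classify
# the Unicode 5.2 character as a non-letter, which is what the tsearch
# parser then sees.
```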
On Mon, Oct 1, 2012 at 12:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> john knightley <john.knightley@gmail.com> writes:
>> The OS I am using is Ubuntu 12.04, with PostgreSQL 9.1.5 installed on
>> a utf8 locale.
>>
>> A short 5-line dictionary file is sufficient to test:
>>
>> raeuz
>> 我们
>> 𦘭𥎵
>> 𪽖𫖂
>>
>> line 1 "raeuz": a Zhuang word written using English letters; shows up
>> under ts_vector OK
>> line 2 "我们": an everyday Chinese word; shows up under ts_vector OK
>> line 3 "𦘭𥎵": a Zhuang word written using rather old Chinese characters
>> found in Unicode 3.1, which came in about the year 2000; shows up
>> under ts_vector OK
>> line 4 "𪽖𫖂": a Zhuang word written using rather old Chinese characters
>> found in Unicode 5.2, which came in about the year 2009; does not show
>> up under ts_vector
>> line 5 "": a Zhuang word written using rather old Chinese characters
>> found in the PUA area of the font Sawndip.ttf; does not show up under
>> ts_vector (the font can be downloaded from
>> http://gdzhdb.l10n-support.com/sawndip-fonts/Sawndip.ttf)
>
> AFAIK there is nothing in Postgres itself that would distinguish, say,
> 𦘭 from 𪽖.  I think this must be down to
> your platform's locale definition: it probably thinks that the former is
> a letter and the latter is not.  You'd have to gripe to the locale
> maintainers to get that fixed.
>
>			regards, tom lane

PostgreSQL in general does not usually distinguish, but full text search does:

select ts_debug('𦘭 from 𪽖');

gives the result:

                             ts_debug
-------------------------------------------------------------------
 (word,"Word, all letters",𦘭,{english_stem},english_stem,{𦘭})
 (blank,"Space symbols"," ",{},,)
 (asciiword,"Word, all ASCII",from,{english_stem},english_stem,{})
 (blank,"Space symbols"," 𪽖",{},,)
(4 rows)

Somewhere there is a dictionary, or library, that is based on Unicode 4.0, which includes "𦘭" (U+2662D) but not "𫖂" (U+2B582), which is Unicode 5.2.

Also, PUA characters are dropped in the same way by the full text search, which is what Google does but which I do not wish to do.

Regards,
John
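[Editorial note: the PUA case is worth separating from the Unicode 5.2 case. Private Use code points carry no standard character properties, so even a fully current Unicode database classifies them as Co ("Other, private use") rather than as letters. A quick check, again using Python's unicodedata as a stand-in (the specific code points are illustrative, not taken from Sawndip.ttf):]

```python
import unicodedata

# U+E000 (start of the BMP Private Use Area) and U+F0000 (start of
# Supplementary Private Use Area-A) - the kinds of code points that
# fonts such as Sawndip.ttf map custom glyphs onto.
for cp in (0xE000, 0xF0000):
    print(f"U+{cp:X}: {unicodedata.category(chr(cp))}")  # Co for both
# Because PUA code points are not letters by the standard, locale-based
# word classification has no grounds to treat them as letters; making
# them searchable would need customized locale data or a custom parser.
```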
On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:
> Hi John:
>
> [snip]
>
> Hmm. Fedora 17 x86-64 w/ PostgreSQL 9.1.5 here; the latter seems to
> work using the default text search configuration (albeit with one
> crucial note: I created the database with the "lc_ctype=C
> lc_collate=C" options):
>
> WORKING:
>
> createdb --template=template0 --lc-ctype=C --lc-collate=C foobar
>
> foobar=# select ts_debug('');
>                            ts_debug
> ----------------------------------------------------------------
>  (word,"Word, all letters",,{english_stem},english_stem,{})
> (1 row)
>
> NOT WORKING AS EXPECTED:
>
> foobaz=# SHOW LC_CTYPE;
>   lc_ctype
> -------------
>  en_US.UTF-8
> (1 row)
>
> foobaz=# select ts_debug('');
>             ts_debug
> ---------------------------------
>  (blank,"Space symbols",,{},,)
> (1 row)
>
> So... perhaps LC_CTYPE=C is a possible workaround for you?

LC_CTYPE=C would not be a workaround - this database needs to be in utf8; the full text search is to be used for a MediaWiki.

Is this a bug that is being worked on?

Regards,
John
john knightley <john.knightley@gmail.com> writes:
> On Mon, Oct 1, 2012 at 11:58 AM, Dan Scott <denials@gmail.com> wrote:
>> So... perhaps LC_CTYPE=C is a possible workaround for you?

> LC_CTYPE=C would not be a workaround - this database needs to be in
> utf8; the full text search is to be used for a MediaWiki.

You're confusing locale and encoding.  They are different things.

> Is this a bug that is being worked on?

No.  As I already tried to explain to you, this behavior is not determined by Postgres; it's determined by the platform's locale support.  You need to complain to your OS vendor.

			regards, tom lane