Обсуждение: LC_CTYPE and matching accented chars

Поиск
Список
Период
Сортировка

LC_CTYPE and matching accented chars

От
Martin Langhoff
Дата:
Trying to understand how I can get Pg 8.2 to match 'martín' when I
search for 'martin', and going through the documentation, specially

  http://www.postgresql.org/docs/8.2/static/locale.html
  http://www.postgresql.org/docs/8.1/static/functions-matching.html

Here is a transcript of my commandline - I suspect I am quite close, but
something is missing. Hints _really_ appreciated!

      # do I have a Spanish locale on this box?
      locale -a
      # no I don't - let's make one
      sudo locale-gen es_AR.utf8
      # check that I have it
      locale -a
      ...
      sudo pg_createcluster --locale=es_AR.utf8 8.2 arg
      sudo pg_ctlcluster 8.2 arg start
      sudo -u postgres createdb --cluster 8.2/arg testing
      sudo -u postgres psql --cluster 8.2/arg testing
      # in psql...
      =# create table test (value text);
      CREATE TABLE
      =#  insert into test values ('martín');
      INSERT 0 1

      # this is apparently the right way to
      # select base character based on the "equivalence class"
      # as defined in the LC_CTYPE
      =# select * from test where value ~ 'mart[=i=]n';
       value
      -------
      (0 rows)
      # grumble - double check
      =# SHOW LC_CTYPE;
       lc_ctype
      ------------
      es_AR.utf8

Any hints in general?

 - How can I read the LC_CTYPE definition? IOWs locale-gen has created a
binary blob in  /usr/lib/locale/es_AR.utf8/LC_CTYPE but I sure there is
a human-readable definition somewhere...

 - Is that the correct syntax?

 - Pg bug? I doubt many people are counting on this behaviour... as it
seems to be widely announced to be brittle and slow.

cheers,


m
--
-----------------------------------------------------------------------
Martin @ Catalyst .Net .NZ  Ltd, PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/           PHYS: Level 2, 150-154 Willis St
OFFICE: +64(4)916-7224  UK: 0845 868 5733 ext 7224  MOB: +64(21)364-017
      Make things as simple as possible, but no simpler - Einstein
-----------------------------------------------------------------------

Re: LC_CTYPE and matching accented chars

От
Alvaro Herrera
Дата:
Martin Langhoff wrote:
> Trying to understand how I can get Pg 8.2 to match 'martín' when I
> search for 'martin', and going through the documentation, specially
>
>   http://www.postgresql.org/docs/8.2/static/locale.html
>   http://www.postgresql.org/docs/8.1/static/functions-matching.html
>
> Here is a transcript of my commandline - I suspect I am quite close, but
> something is missing. Hints _really_ appreciated!

>       =# create table test (value text);
>       CREATE TABLE
>       =#  insert into test values ('martín');
>       INSERT 0 1
>
>       # this is apparently the right way to
>       # select base character based on the "equivalence class"
>       # as defined in the LC_CTYPE
>       =# select * from test where value ~ 'mart[=i=]n';

I think it would be much easier if you did something like

select * from test where lower(to_ascii(value)) = lower(to_ascii('martín'));

When to_ascii doesn't work (for example because it doesn't work in UTF8)
you may want to use convert() to recode the text to latin1 or latin9.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: LC_CTYPE and matching accented chars

От
Martin Langhoff
Дата:
Alvaro Herrera wrote:
> Martin Langhoff wrote:
>>       # this is apparently the right way to
>>       # select base character based on the "equivalence class"
>>       # as defined in the LC_CTYPE
>>       =# select * from test where value ~ 'mart[=i=]n';
>
> I think it would be much easier if you did something like
>
> select * from test where lower(to_ascii(value)) = lower(to_ascii('martín'));
>
> When to_ascii doesn't work (for example because it doesn't work in UTF8)
> you may want to use convert() to recode the text to latin1 or latin9.

Well, with the example above to_ascii doesn't work.

  select to_ascii(value) from test ;
  ERROR:  encoding conversion from UTF8 to ASCII not supported

And neither does convert

  select convert(value using utf8_to_ascii) from test ;
  ERROR:  character 0xc3 of encoding "MULE_INTERNAL" has no equivalent
   in "SQL_ASCII"

  select convert('martín' using utf8_to_ascii) from test ;
  ERROR:  character 0xc3 of encoding "MULE_INTERNAL" has no equivalent
   in "SQL_ASCII"

Any other alternatives?

BTW, this is on Pg8.2 as seen on Ubuntu Feisty PPC.

  $ dpkg -l postgresql-8.2
  ii  postgresql-8.2                8.2.4-0ubuntu0.7.04

cheers,


martín
--
-----------------------------------------------------------------------
Martin @ Catalyst .Net .NZ  Ltd, PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/           PHYS: Level 2, 150-154 Willis St
OFFICE: +64(4)916-7224  UK: 0845 868 5733 ext 7224  MOB: +64(21)364-017
      Make things as simple as possible, but no simpler - Einstein
-----------------------------------------------------------------------


Re: LC_CTYPE and matching accented chars

От
Michael Fuhr
Дата:
On Wed, Jun 27, 2007 at 09:28:24AM +1200, Martin Langhoff wrote:
> Alvaro Herrera wrote:
> > I think it would be much easier if you did something like
> >
> > select * from test where lower(to_ascii(value)) = lower(to_ascii('martín'));
> >
> > When to_ascii doesn't work (for example because it doesn't work in UTF8)
> > you may want to use convert() to recode the text to latin1 or latin9.
>
> Well, with the example above to_ascii doesn't work.
>
>   select to_ascii(value) from test ;
>   ERROR:  encoding conversion from UTF8 to ASCII not supported
>
> And neither does convert
>
>   select convert(value using utf8_to_ascii) from test ;
>   ERROR:  character 0xc3 of encoding "MULE_INTERNAL" has no equivalent
>    in "SQL_ASCII"

As Alvaro suggested, try converting to latin1 or latin9 and then
calling to_ascii:

select 'martin' = to_ascii(convert('martín', 'latin1'), 'latin1');
 ?column?
----------
 t
(1 row)

For other possibilities search the list archives for examples of
"unaccent" functions that normalize text to NFD (Unicode Normalization
Form D) and remove nonspacing marks.  Here's a message with a couple
of PL/Perl functions:

http://archives.postgresql.org/pgsql-general/2007-01/msg00702.php

--
Michael Fuhr

Re: LC_CTYPE and matching accented chars

От
Martin Langhoff
Дата:
Alvaro Herrera wrote:
> Martin Langhoff wrote:
>>       # this is apparently the right way to
>>       # select base character based on the "equivalence class"
>>       # as defined in the LC_CTYPE
>>       =# select * from test where value ~ 'mart[=i=]n';
>
> I think it would be much easier if you did something like
>
> select * from test where lower(to_ascii(value)) = lower(to_ascii('martín'));
>
> When to_ascii doesn't work (for example because it doesn't work in UTF8)
> you may want to use convert() to recode the text to latin1 or latin9.

Well, with the example above to_ascii doesn't work.

  select to_ascii(value) from test ;
  ERROR:  encoding conversion from UTF8 to ASCII not supported

And neither does convert

  select convert(value using utf8_to_ascii) from test ;
  ERROR:  character 0xc3 of encoding "MULE_INTERNAL" has no equivalent
   in "SQL_ASCII"

  select convert('martín' using utf8_to_ascii) from test ;
  ERROR:  character 0xc3 of encoding "MULE_INTERNAL" has no equivalent
   in "SQL_ASCII"

Any other alternatives?

BTW, this is on Pg8.2 as seen on Ubuntu Feisty PPC.

  $ dpkg -l postgresql-8.2
  ii  postgresql-8.2                8.2.4-0ubuntu0.7.04

cheers,


martín
--
-----------------------------------------------------------------------
Martin @ Catalyst .Net .NZ  Ltd, PO Box 11-053, Manners St,  Wellington
WEB: http://catalyst.net.nz/           PHYS: Level 2, 150-154 Willis St
OFFICE: +64(4)916-7224  UK: 0845 868 5733 ext 7224  MOB: +64(21)364-017
      Make things as simple as possible, but no simpler - Einstein
-----------------------------------------------------------------------