Thread: Bug #659: lower()/upper() bug on ->multibyte<- DB

Bug #659: lower()/upper() bug on ->multibyte<- DB

From
pgsql-bugs@postgresql.org
Date:
Michael Enke (michael.enke@wincor-nixdorf.com) reports a bug with a severity of 2
The lower the number the more severe it is.

Short Description
lower()/upper() bug on ->multibyte<- DB

Long Description
OS: Linux Kernel 2.4.4, PostgreSQL version 7.2.1
lower() and upper() don't work as expected for multibyte
databases. They work fine for single-byte encodings.
The behaviour can be reproduced as follows:
at initdb: LC_CTYPE was set to de_DE
createdb -E UTF-8 name
export PGCLIENTENCODING=LATIN1
psql -U name
--------------------------------------------------
=> select lower('Ä');  -- german umlaut A, capital
ERROR: Could not convert UTF-8 to ISO8859-1
-- I expected to see: ä german umlaut a, lower case
--------------------------------------------------
=> select lower('ä');  -- german umlaut a, lower case
ERROR: Could not convert UTF-8 to ISO8859-1
-- I expected to see: ä german umlaut a, lower case
--------------------------------------------------
=> select upper('ä');  -- it doesn't translate
ä
-- I expected to see: Ä
--------------------------------------------------
=> select upper('Ä');  -- this works fine
Ä
--------------------------------------------------

The same happens to Ö and Ü (O umlaut, U umlaut)

If you want to reproduce this and don't have ä/Ä on your keyboard,
you can create a table with one column of type varchar(1) (on a multibyte DB).
Create a file with the following input:
ae is \u00e4
AE is \u00c4
Then, from Java, use the command:
native2ascii -reverse -utf8 <this-file> <new-file>
In <new-file> you will see:
in the first line, 2 bytes: A (with tilde on top) and the Euro symbol,
in the second line, 2 bytes: A (with tilde on top) and a dotted box.
Unset PGCLIENTENCODING and call psql:
insert into table values('<copy and paste first two bytes>');
insert into table values('<copy and paste second two bytes>');
export PGCLIENTENCODING=LATIN1
psql: select * from table; will show you the a-umlaut and A-umlaut.

Sample Code


No file was uploaded with this report

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
Tatsuo Ishii
Date:
> Short Description
> lower()/upper() bug on ->multibyte<- DB
>
> Long Description
> OS: Linux Kernel 2.4.4, PostgreSQL version 7.2.1
> lower() and upper() doesn't work like expected for multibyte
> databases. It is working fine for one-byte encoding.
> The behaviour can be reproduced as follows:
> at initdb: LC_CTYPE was set to de_DE
> createdb -E UTF-8 name
> export PGCLIENTENCODING=LATIN1
> psql -U name
> --------------------------------------------------
> => select lower('Ä');  -- german umlaut A, capital
> ERROR: Could not convert UTF-8 to ISO8859-1
> -- I expected to see: ä german umlaut a, lower case

This is not a bug but expected behavior. Locale support expects the
input string to be encoded in ISO-8859-1 (because you set the locale to
de_DE), while you supply UTF-8. Try an explicit encoding conversion
function:

select lower(convert('D'), 'LATIN1');

Note that '\304' must be an actual German capital A-umlaut character,
not an octal-escaped notation.
--
Tatsuo Ishii

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
Tatsuo Ishii
Date:
> > This is not a bug but an expected behavior. Locale support expects an
> > input string is encoded in ISO-8859-1 (because you set locale to
> > de_DE) while you supply UTF-8.
>
> What is the difference between an insert of string and a call to a function with a string argument?

You input "select lower('X')" encoded as ISO-8859-1, and it is sent
to the backend. The backend converts it to UTF-8. Then lower() is
called with a UTF-8 input string. lower() calls tolower(), which
expects the input to be ISO-8859-1 since you set the locale to de_DE.
This is the source of the problem.

> > select lower(convert('D'), 'LATIN1');
>
> I tried: select lower(convert('X'), 'LATIN1'); -- X is german umlaut A, capital
> but the result was the same:
> ERROR: Could not convert UTF-8 to ISO8859-1

Oops. That should be:

select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');

It looks ugly, but works.
--
Tatsuo Ishii

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
"Enke, Michael"
Date:
Hello,

> This is not a bug but an expected behavior. Locale support expects an
> input string is encoded in ISO-8859-1 (because you set locale to
> de_DE) while you supply UTF-8.

What is the difference between an insert of string and a call to a function with a string argument?
Insert works well, and so does output; only the functions lower(), upper() and initcap() cause problems.
This is also ok: select a from a where a = 'X'; -- X is german umlaut a, lowercase / german umlaut A, capital

> Try an explicit encoding converion function:
>
> select lower(convert('D'), 'LATIN1');

I tried: select lower(convert('X'), 'LATIN1'); -- X is german umlaut A, capital
but the result was the same:
ERROR: Could not convert UTF-8 to ISO8859-1

I then compiled postgres without locale support. I created a DB with -E UTF-8.
I created a table and inserted the UTF-8 char "0x00C4" (german umlaut A, capital).
I called "select lower(a) from a;"
Now, without locale support, I didn't get the error, but I also didn't get
the right result. The right result would be the UTF-8 char "0x00E4" (german umlaut a, lower case),
independent of the locale!

Regards,
Michael Enke

Tatsuo Ishii wrote:
>
> > Short Description
> > lower()/upper() bug on ->multibyte<- DB
> >
> > Long Description
> > OS: Linux Kernel 2.4.4, PostgreSQL version 7.2.1
> > lower() and upper() doesn't work like expected for multibyte
> > databases. It is working fine for one-byte encoding.
> > The behaviour can be reproduced as follows:
> > at initdb: LC_CTYPE was set to de_DE
> > createdb -E UTF-8 name
> > export PGCLIENTENCODING=LATIN1
> > psql -U name
> > --------------------------------------------------
> > => select lower('Ä');  -- german umlaut A, capital
> > ERROR: Could not convert UTF-8 to ISO8859-1
> > -- I expected to see: ä german umlaut a, lower case
>
> This is not a bug but an expected behavior. Locale support expects an
> input string is encoded in ISO-8859-1 (because you set locale to
> de_DE) while you supply UTF-8. Try an explicit encoding converion
> function:
>
> select lower(convert('D'), 'LATIN1');
>
> Note that '\304' must be an actual german umlaut A, capital character,
> not an octal espcaped notion.
> --
> Tatsuo Ishii

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> > What is the difference between an insert of string and a call to a function with a string argument?
>
> You input "select lower('X')" as ISO-8859-1 encoded, then it is sent
> to the backend. The backend convert it to UTF-8. Then lower() is
> called with an UTF-8 string input. lower() calls tolower() which
> expects the input being ISO-8859-1 since you set locale to de_DE.
> This is the source of the problem.

Excuse me, this does not seem to be the source of the problem.
If I call select lower(table_col) from table;
then I also don't get back the lower case character but the original case if it is a multibyte char.
There I have no input from the client to the backend.
I have now also removed everything below the data directory, exported LC_CTYPE as de_DE.utf8, and run initdb.
With pg_controldata I see LC_CTYPE is de_DE.utf8.
Now I no longer get the ERROR: cannot convert UTF-8 to ISO8859-1, but the translation doesn't work:
MB chars are not translated, I get back the original case.
BTW: mbsrtowcs(), wctrans(), towctrans(), wcsrtombs() do the job with de_DE.utf8.

> Oops. That should be:
>
> select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');
> It looks ugly, but works.

Sorry, it doesn't work. Same result here: I get back the case I put in at X, not the lower case.

Regards,
Michael

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
Tatsuo Ishii
Date:
> > You input "select lower('X')" as ISO-8859-1 encoded, then it is sent
> > to the backend. The backend convert it to UTF-8. Then lower() is
> > called with an UTF-8 string input. lower() calls tolower() which
> > expects the input being ISO-8859-1 since you set locale to de_DE.
> > This is the source of the problem.
>
> Excuse me, this seems not the be the source of the problem.
> If I call select lower(table_col) from table;
> then I also don't get back the lower case character but the original case if it is a multibyte char.

This doesn't work for the same reason as above. The backend extracts
table_col from the table, which is encoded in UTF-8, while lower()
expects ISO-8859-1. Try:

select convert(lower(convert(table_col, 'LATIN1')),'LATIN1','UNICODE')
from your_table;

> I did now also remove all below data directory, exported LC_CTYPE to de_DE.utf8, made an initdb.
> With pg_controldata I see LC_CTYPE is de_DE.utf8
> Now I no longer get the ERROR: cannot convert UTF-8 to ISO8859-1, but the translation doesn't work:
> MB chars are not translated, I get back the original case.

I don't think using de_DE.utf8 helps. The locale support just calls
tolower(), which is not able to handle multibyte chars.

> > Oops. That should be:
> >
> > select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');
> > It looks ugly, but works.
>
> Sorry, it doesn't work. The same here, I get back the case I put in at X, not the lower case.

Are you sure you are using the de_DE locale (not de_DE.utf8)?
Included are sample scripts that work for me using the de_DE locale.
Here is also my pg_controldata output.

$ pg_controldata
pg_control version number:            71
Catalog version number:               200201121
Database state:                       IN_PRODUCTION
pg_control last modified:             Thu May  9 08:37:20 2002
Current log file id:                  0
Next log file segment:                1
Latest checkpoint location:           0/18C860
Prior checkpoint location:            0/1503A0
Latest checkpoint's REDO location:    0/172054
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's StartUpID:        8
Latest checkpoint's NextXID:          217
Latest checkpoint's NextOID:          24748
Time of latest checkpoint:            Thu May  9 08:37:17 2002
Database block size:                  8192
Blocks per segment of large relation: 131072
LC_COLLATE:                           de_DE
LC_CTYPE:                             de_DE
--
Tatsuo Ishii

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
Tatsuo Ishii
Date:
> Are you sure to use de_DE locale (not de_DE.utf8)?
> Included are sample scripts being work with me using de_DE locale.
> Here is also my pg_controldata output.

Sorry, forgot to include the execution results:
--
Tatsuo Ishii

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> I don't think using de_DE.utf8 helps. The locale support just calls
> tolower(), which is not be able to handle multibyte chars.
>
> > > Oops. That should be:
> > >
> > > select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');
> > > It looks ugly, but works.
> >
> > Sorry, it doesn't work. The same here, I get back the case I put in at X, not the lower case.
>
> Are you sure to use de_DE locale (not de_DE.utf8)?
> Included are sample scripts being work with me using de_DE locale.

OK, this is working now (I can't reproduce why it didn't the first time).
Is it planned to implement it so that I can write lower()/upper() for multibyte
according to the SQL standard (without convert)?
I could do it if you tell me where the final tolower()/toupper() happens
(but not before the middle of June).

Regards,
Michael

Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
Tatsuo Ishii
Date:
[Cc:ed to hackers]

(trying select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');)

> Ok, this is working now (I cann't reproduce why not at the first time).

Good.

> Is it planned to implement it so that I can write lower()/ upper() for multibyte
> according to SQL standard (without convert)?

SQL standard? The SQL standard says nothing about locale. So making
lower() (and others) "locale aware" is a quite different matter from the SQL
standard's point of view. Of course this does not mean "locale
support" should not be a part of PostgreSQL's implementation of
SQL. However, we should be aware of the limitations of "locale support"
(as well as multibyte support). They are just a stopgap until CREATE
CHARACTER SET etc. is implemented, IMO.

> I could do it if you tell me where the final tolower()/toupper() happens.
> (but not before middle of June).

As a short-term solution, hiding convert() from users might
be a good idea (what I mean here is a kind of automatic execution of
convert()). The hardest part is that we have no idea how to find the
relationship between a particular locale and its encoding. For example,
you know that for the de_DE locale the LATIN1 encoding is appropriate,
but PostgreSQL does not.
--
Tatsuo Ishii


Re: Bug #659: lower()/upper() bug on ->multibyte<- DB

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
>
> [Cc:ed to hackers]
>
> (trying select convert(lower(convert('X', 'LATIN1')),'LATIN1','UNICODE');)
>
> > Ok, this is working now (I cann't reproduce why not at the first time).
>
> Good.
>
> > Is it planned to implement it so that I can write lower()/ upper() for multibyte
> > according to SQL standard (without convert)?
>
> SQL standard? The SQL standard says nothing about locale. So making
> lower() (and others) "locale aware" is far different from the SQL
> standard of point of view. Of course this does not mean "locale
> support" is should not be a part of PostgreSQL's implementation of
> SQL. However, we should be aware the limitation of "locale support"
> (as well as multibyte support). They are just the stopgap util CREATE
> CHARACTER SET etc. is implemnted IMO.
>
> > I could do it if you tell me where the final tolower()/toupper() happens.
> > (but not before middle of June).
>
> For the short term solution making convert() hiding from users might
> be a good idea (what I mean here is kind of auto execution of
> convert()). The hardest part is there's no idea how we could find a
> relationship bewteen particular locale and the encoding. For example,
> you know that for de_DE locale using LATIN1 encoding is appropreate,
> but PostgreSQL does not.

I think it is really not hard to do this for UTF-8. I don't have to know the
relation between the locale and the encoding. Look at this:
We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
at server startup. For nearly every locale (de_DE, ja_JP, ...) there also
exists a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...), at least with the current Linux glibc.
We don't need to know more than this. If we call
setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
then glibc takes care of all the conversions. I attach a small demo program
which sets the locale ja_JP.utf8 and is able to translate german umlaut A (upper) to
german umlaut a (lower).
What I don't know (I have to ask a glibc developer) is:
Why do dozens of *.utf8 locales exist, and what is the difference
between all the /usr/lib/locale/*.utf8/LC_CTYPE files?
But for all existing *.utf8 locales, the conversion of german umlauts works properly.

Regards,
Michael

PS: I'm not in my office for the next 3 weeks and therefore not able to read my mails.

#include <stdio.h>
#include <stdlib.h>                            // exit()
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#define LEN 5

int main(void) {
  char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
  const char *readInByteP = readInByte;        // help pointer, advanced by mbsrtowcs()
  wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
  const wchar_t *writeOutWCP = writeOutWC;     // help pointer, advanced by wcsrtombs()
  wctrans_t wctransDesc;                       // holds the descriptor for conversion
  size_t i, ret;
  const char myLocale[] = "ja_JP.utf8";
  char *localeSet;

  readInByte[0] = (char)0xc3; readInByte[1] = (char)0x84;  // german umlaut A (upper) in UTF-8
  readInByte[2] = (char)0xc3; readInByte[3] = (char)0xa4;  // german umlaut a (lower) in UTF-8
  readInByte[4] = 0;

  // print out the input
  printf("german umlaut A (upper) UTF-8: %hhx %hhx\n", readInByte[0], readInByte[1]);
  printf("german umlaut a (lower) UTF-8: %hhx %hhx\n", readInByte[2], readInByte[3]);

  // setlocale() is not required to set errno, so report the failure directly
  if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) { fprintf(stderr, "setlocale(%s) failed\n", myLocale); exit(1); }
  else printf("locale set: %s\n", localeSet);
  ret = mbsrtowcs(readInWC, &readInByteP, LEN, NULL); // convert bytes to wide chars
  if(ret == (size_t)-1) { perror("mbsrtowcs"); exit(1); }
  printf("number of wide chars: %zu\n", ret);
  wctransDesc = wctrans("tolower");            // get descriptor for wc operation
  if(wctransDesc == 0) { fprintf(stderr, "wctrans(\"tolower\") failed\n"); exit(1); }

  // make the transformation according to descriptor
  i=0; while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0') i++;

  ret = wcsrtombs(writeOutByte, &writeOutWCP, LEN, NULL); // convert wide chars to bytes
  if(ret == (size_t)-1) { perror("wcsrtombs"); exit(1); }
  printf("number of bytes: %zu\n", ret);

  // print out the result
  printf("german umlaut A tolower(): %hhx %hhx\n", writeOutByte[0], writeOutByte[1]);
  printf("german umlaut a tolower(): %hhx %hhx\n", writeOutByte[2], writeOutByte[3]);

  return 0;
}

Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> I think it is really not hard to do this for UTF-8. I don't have to know the
> relation between the locale and the encoding. Look at this:
> We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
> at server startup. For nearly every locale (de_DE, ja_JP, ...) there exists
> also a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...) at least for the actual Linux glibc.

My Linux box does not have *.utf8 locales at all. Probably not so many
platforms have them up to now, I guess.

> We don't need to know more than this. If we call
> setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
> then glibc is aware of doing all the conversions. I attach a small demo program
> which set the locale ja_JP.utf8 and is able to translate german umlaut A (upper) to
> german umlaut a (lower).

Interesting idea, but the problem is that we have to decide on exactly
one locale before initdb. In my understanding, users willing to use
Unicode (UTF-8) tend to use multiple languages. This is natural since
Unicode claims it can handle several languages. For example, a user
might want to have a table like this in a UTF-8 database:

create table t1(
       english text,    -- English message
       germany text,    -- Germany message
       japanese text    -- Japanese message
);

If you have set the locale to, say, de_DE, then:

select lower(japanese) from t1;

would be executed in the de_DE.utf8 locale, and I doubt it would produce any
meaningful results for Japanese.
--
Tatsuo Ishii


Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> > My Linux box does not have *.utf8 locales at all. Probably not so many
> > platforms have them up to now, I guess.
> 
> What linux do you use ?

A kind of variant of RH6.2.

> At least newer Redhat Linuxen have them and I suspect that all newer
> glibc's are capable of using them.

I guess many RH6.2 or RH6.2 based are still surviving...

> > If you have set the local to, say de_DE, then:
> > 
> > select lower(japanese) from t1;
> >
> > would be executed in de_DE.utf8 locale, and I doubt it produces any
> > meaningfull results for Japanese.
> 
> IIRC it may, as I think that it will include full UTF8 upper/lower
> tables, at least on Linux.
> 
> For example en_US will produce right upper/lower results for Estonian,
> though collation is off and some chars are missing if using iso-8859-1.

Are you sure that, say, the de_DE.utf8 locale produces meaningful results
for any other language? If so, why are there so many *.utf8 locales?

> btw, does Japanese language have distinct upper and lower case letters ?

There are "full width alphabets" in Japanese. Those include not only
ASCII letters but also some European characters.
--
Tatsuo Ishii


Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Jean-Michel POURE
Date:
On Tuesday 14 May 2002 03:29, Tatsuo Ishii wrote:
> For example, user
> might want to have a table like this in a UTF-8 database:
>
> create table t1(
>        english text,    -- English message
>        germany text,    -- Germany message
>        japanese text    -- Japanese message
> );

Or just
CREATE table t1(
       text_locale varchar,
       text_content text
);
which is my case.
Just my 2 cents.
/Jean-Michel POURE


Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Hannu Krosing
Date:
On Tue, 2002-05-14 at 03:29, Tatsuo Ishii wrote:
> > I think it is really not hard to do this for UTF-8. I don't have to know the
> > relation between the locale and the encoding. Look at this:
> > We can use the LC_CTYPE from pg_controldata or alternatively the LC_CTYPE
> > at server startup. For nearly every locale (de_DE, ja_JP, ...) there exists
> > also a locale *.utf8 (de_DE.utf8, ja_JP.utf8, ...) at least for the actual Linux glibc.
> 
> My Linux box does not have *.utf8 locales at all. Probably not so many
> platforms have them up to now, I guess.

Which Linux do you use?

At least newer Redhat Linuxen have them and I suspect that all newer
glibc's are capable of using them.

> 
> > We don't need to know more than this. If we call
> > setlocale(LC_CTYPE, <value of LC_CTYPE extended with .utf8 if not already given>)
> > then glibc is aware of doing all the conversions. I attach a small demo program
> > which set the locale ja_JP.utf8 and is able to translate german umlaut A (upper) to
> > german umlaut a (lower).
> 
> Interesting idea, but the problem is we have to decide to use exactly
> one locale before initdb. In my understanding, users willing to use
> Unicode (UTF-8) tend to use multiple languages. This is natural since
> Unicode claims it can handle several languages. For example, user
> might want to have a table like this in a UTF-8 database:
> 
> create table t1(
>        english text,    -- English message
>        germany text,    -- Germany message
>        japanese text    -- Japanese message
> );
> 
> If you have set the local to, say de_DE, then:
> 
> select lower(japanese) from t1;
>
> would be executed in de_DE.utf8 locale, and I doubt it produces any
> meaningfull results for Japanese.

IIRC it may, as I think that it will include full UTF8 upper/lower
tables, at least on Linux.

For example en_US will produce right upper/lower results for Estonian,
though collation is off and some chars are missing if using iso-8859-1.

btw, does Japanese language have distinct upper and lower case letters ?

--------------
Hannu




Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> > Are you sure that say, de_DE.utf8 locale produce meaningful results
> > for any other languages?
> 
> there are often subtle differences, but upper() and lower() are much
> more likely to produce right results than collation order or date/money
> formats.
> 
> in fact seem to be only 10 distinct LC_CTYPE files for ~110 locales with
> most european-originated languages having the same and only 
> tr_TR, zh_??, fr_??,da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having
> their own.

I see. So the remaining problem would be how to detect the existence
of *.utf8 locales at configure time.

> > If so, why are there so many *.utf8 locales?
> 
> As I understand it, a locale should cover all locale-specific issues
>  
> > > btw, does Japanese language have distinct upper and lower case letters ?
> > 
> > There are "full width alphabets" in Japanese. Thoes include not only
> > ASCII letters but also some European characters.
> 
> Are these ASCII and European characters uppercased in some
> Japanese-specific way ?

Probably not, but I'm not sure since my Linux box does not have *.utf8
locales.
--
Tatsuo Ishii


Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Hannu Krosing
Date:
On Tue, 2002-05-14 at 09:52, Tatsuo Ishii wrote:
> 
> Are you sure that say, de_DE.utf8 locale produce meaningful results
> for any other languages?

There are often subtle differences, but upper() and lower() are much
more likely to produce right results than collation order or date/money
formats.

In fact there seem to be only 10 distinct LC_CTYPE files for ~110 locales, with
most European-originated languages sharing the same one and only
tr_TR, zh_??, fr_??, da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having
their own.

> If so, why are there so many *.utf8 locales?

As I understand it, a locale should cover all locale-specific issues.

> > btw, does Japanese language have distinct upper and lower case letters ?
> 
> There are "full width alphabets" in Japanese. Thoes include not only
> ASCII letters but also some European characters.

Are these ASCII and European characters uppercased in some
Japanese-specific way ?

--------------
Hannu



Re: [HACKERS] Bug #659: lower()/upper() bug on

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
> 
> > > Are you sure that say, de_DE.utf8 locale produce meaningful results
> > > for any other languages?
> >
> > there are often subtle differences, but upper() and lower() are much
> > more likely to produce right results than collation order or date/money
> > formats.
> >
> > in fact seem to be only 10 distinct LC_CTYPE files for ~110 locales with
> > most european-originated languages having the same and only
> > tr_TR, zh_??, fr_??,da_DK, de_??, ro_RO, sr_YU, ja_JP and ko_KR having
> > their own.
> 
> I see. So the remaining problem would be how to detect the existence
> of *.utf8 collation at the configure time.
> 
> > > If so, why are there so many *.utf8 locales?
> >
> > As I understand it, a locale should cover all locale-specific issues
> >
> > > > btw, does Japanese language have distinct upper and lower case letters ?
> > >
> > > There are "full width alphabets" in Japanese. Thoes include not only
> > > ASCII letters but also some European characters.
> >
> > Are these ASCII and European characters uppercased in some
> > Japanese-specific way ?
> 
> Probably not, but I'm not sure since my Linux box does not have *.utf8
> locales.

Could you give me the UTF-8 bytecode for one japanese upper case char and
for the same char the lower case?
I will check in de_DE locale if this translations works.

Michael


Re: [HACKERS] Bug #659: lower()/upper() bug on

From
Tatsuo Ishii
Date:
> > > > There are "full width alphabets" in Japanese. Thoes include not only
> > > > ASCII letters but also some European characters.
> > >
> > > Are these ASCII and European characters uppercased in some
> > > Japanese-specific way ?
> > 
> > Probably not, but I'm not sure since my Linux box does not have *.utf8
> > locales.
> 
> Could you give me the UTF-8 bytecode for one japanese upper case char and
> for the same char the lower case?
> I will check in de_DE locale if this translations works.

OK, here is the data you requested. The first three bytes (0xefbca1)
represent full-width capital "A"; the remaining three bytes (0xefbd81)
represent full-width lower case "a".

Re: [HACKERS] Bug #659: lower()/upper() bug on

From
"Enke, Michael"
Date:
Tatsuo Ishii wrote:
>
> > > > > There are "full width alphabets" in Japanese. Thoes include not only
> > > > > ASCII letters but also some European characters.
> > > >
> > > > Are these ASCII and European characters uppercased in some
> > > > Japanese-specific way ?
> > >
> > > Probably not, but I'm not sure since my Linux box does not have *.utf8
> > > locales.
> >
> > Could you give me the UTF-8 bytecode for one japanese upper case char and
> > for the same char the lower case?
> > I will check in de_DE locale if this translations works.
>
> Ok, here is the data you requested. The first three bytes (0xefbca1)
> represents full-width capital "A", the rest three bytes (0xefbd81)
> represents full-width lower case "a".

Thank you for the data; it works in both ja_JP.utf8 and de_DE.utf8.
I am sending you my test program as an attachment.

Regards,
Michael

#include <stdio.h>
#include <stdlib.h>                            // exit()
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#define LEN 7

int main(void) {
  char readInByte[LEN], writeOutByte[LEN];     // holds the character bytes
  const char *readInByteP = readInByte;        // help pointer, advanced by mbsrtowcs()
  wchar_t readInWC[LEN], writeOutWC[LEN];      // holds the wide characters
  const wchar_t *writeOutWCP = writeOutWC;     // help pointer, advanced by wcsrtombs()
  wctrans_t wctransDesc;                       // holds the descriptor for conversion
  size_t i, ret;
  //const char myLocale[] = "ja_JP.utf8";
  const char myLocale[] = "de_DE.utf8";
  char *localeSet;

  readInByte[0] = (char)0xef; readInByte[1] = (char)0xbc; readInByte[2] = (char)0xa1; // full-width A (upper) in UTF-8
  readInByte[3] = (char)0xef; readInByte[4] = (char)0xbd; readInByte[5] = (char)0x81; // full-width a (lower) in UTF-8
  readInByte[6] = 0;

  // print out the input
  printf("full-width A (upper) UTF-8: %hhx %hhx %hhx\n", readInByte[0], readInByte[1], readInByte[2]);
  printf("full-width a (lower) UTF-8: %hhx %hhx %hhx\n", readInByte[3], readInByte[4], readInByte[5]);

  // setlocale() is not required to set errno, so report the failure directly
  if((localeSet = setlocale(LC_CTYPE, myLocale)) == NULL) { fprintf(stderr, "setlocale(%s) failed\n", myLocale); exit(1); }
  else printf("locale set: %s\n", localeSet);
  ret = mbsrtowcs(readInWC, &readInByteP, LEN, NULL); // convert bytes to wide chars
  if(ret == (size_t)-1) { perror("mbsrtowcs"); exit(1); }
  printf("number of wide chars: %zu\n", ret);
  wctransDesc = wctrans("tolower");            // get descriptor for wc operation
  if(wctransDesc == 0) { fprintf(stderr, "wctrans(\"tolower\") failed\n"); exit(1); }

  // make the transformation according to descriptor
  i=0; while((writeOutWC[i] = towctrans(readInWC[i], wctransDesc)) != L'\0') i++;

  ret = wcsrtombs(writeOutByte, &writeOutWCP, LEN, NULL); // convert wide chars to bytes
  if(ret == (size_t)-1) { perror("wcsrtombs"); exit(1); }
  printf("number of bytes: %zu\n", ret);

  // print out the result
  printf("full-width A tolower():     %hhx %hhx %hhx\n", writeOutByte[0], writeOutByte[1], writeOutByte[2]);
  printf("full-width a tolower():     %hhx %hhx %hhx\n", writeOutByte[3], writeOutByte[4], writeOutByte[5]);

  return 0;
}