Thread: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

From
Roland Glenn McIntosh
Date:
Okay.  I have NO IDEA why this works.  If someone could enlighten me as to the math involved I'd appreciate it.  First,
a little background:

The Euro symbol is unicode value 0x20AC.  UTF-8 encoding is a way of representing most unicode characters in two bytes,
and most latin characters in one byte.

The only way I have found to insert a euro symbol into the database from the command line psql client is this:
    INSERT INTO mytable VALUES('\342\202\254');

I don't know why this works.  In hex, those octal values are:
    E2 82 AC

I don't know why my "20" byte turned into two bytes of E2 and 82.  Furthermore, I was under the impression that a UTF-8
encoding of the Euro sign only took two bytes.  Corroborating this assumption, upon dumping that table with pg_dump and
examining the resultant file in a hex editor, I see this in that character position: AC 20

Furthermore, according to the psql online documentation and man page:
"Anything contained in single quotes is furthermore subject to C-like substitutions for \n (new line), \t (tab),
\digits, \0digits, and \0xdigits (the character with the given decimal, octal, or hexadecimal code)."

Those digits *should* be interpreted as decimal digits, but they aren't.  The man page for psql is either incorrect, or
the implementation is buggy.

It's worth noting that the field I'm inserting into is an SQL_ASCII field, and I'm reading my UTF-8 string out of it
like this, via JDBC:
    String value = new String( resultset.getBytes(1), "UTF-8");

Can anyone help me make sense of this mumbo jumbo?
-Roland


Re: SOLUTION: Insert a Euro symbol as UTF-8 from a latin1 charset.

From
Joel Rees
Date:
> Okay.  I have NO IDEA why this works.  If someone could enlighten me as
> to the math involved I'd appreciate it.  First, a little background:
>
> The Euro symbol is unicode value 0x20AC.

That's most significant byte first, in UTF-16.

> UTF-8 encoding is a way of
> representing most unicode characters in two bytes,

err, no.

> and most latin
> characters in one byte.

UTF-8 is a way to transform the standard code points of a large, fixed
width encoding to a variable width encoding. For Unicode, the number of
octets (Octet is explicitly 8 bits, since byte is not necessarily 8 bits)
can vary between one and four. See RFC 2279:

    http://www.ietf.org/rfc/rfc2279.txt
    http://www.unicode.org/book/preview/ch02.pdf

Those two pages should clear your question up.

> The only way I have found to insert a euro symbol into the database
> from the command line psql client is this:
>
>     INSERT INTO mytable VALUES('\342\202\254');
>
> I don't know why this works.  In hex, those octal values are:
>     E2 82 AC

That's UTF-8.
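
This can be checked with a few lines of Python 3 (an assumption here; the thread itself only uses psql and JDBC). The sketch turns the three octal escapes from the INSERT into octets and decodes them as UTF-8; the variable names are illustrative only.

```python
# The three octal escapes used in the INSERT statement.
octal_escapes = ["342", "202", "254"]

# Read each escape as a base-8 number to get the raw octets.
octets = bytes(int(digits, 8) for digits in octal_escapes)

# Show the octets in hex, then decode the sequence as UTF-8.
hex_form = " ".join(f"{b:02X}" for b in octets)
char = octets.decode("utf-8")

print(hex_form)        # E2 82 AC
print(hex(ord(char)))  # 0x20ac, the Euro sign's code point
```

So the octal digits and the hex values describe the same three octets, and those octets decode to the single code point 0x20AC.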

> I don't know why my "20" byte turned into two bytes of E2 and 82.

In the conversion from UTF-16 to UTF-8. (You converted it, right?)

> Furthermore, I was under the impression that a UTF-8 encoding of the
> Euro sign only took two bytes.

Let's see. From a table on the RFC page, I note that 0x00 to 0x7f fit in
one octet, 0x0080 to 0x07ff fit in two, and 0x0800 to 0xffff fit in
three. 0x20ac is definitely greater than 0x07ff, so it needs three octets.
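
As for the math: the following is a minimal sketch in Python 3 of the bit-packing those ranges imply, assuming the usual UTF-8 layout (0xxxxxxx for one octet, 110xxxxx 10xxxxxx for two, 1110xxxx 10xxxxxx 10xxxxxx for three). The function name `utf8_encode` is made up for illustration, and the sketch deliberately stops at 0xffff and does not reject surrogates.

```python
def utf8_encode(code_point: int) -> bytes:
    """Hand-encode one code point (0x0000 to 0xFFFF only) as UTF-8."""
    if code_point <= 0x7F:
        # One octet: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("this sketch only covers code points up to 0xFFFF")


# 0x20AC is 0010 000010 101100 in binary, split 4 + 6 + 6 bits.
euro = utf8_encode(0x20AC)
print(" ".join(f"{b:02X}" for b in euro))  # E2 82 AC
```

The top four bits (0010) land in E2, the middle six (000010) in 82, and the low six (101100) in AC, which is why the "20" does not survive as a recognizable byte.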

> Corroborating this assumption, upon
> dumping that table with pg_dump and examining the resultant file in
> a hex editor, I see this in that character position: AC 20

That's called looking at a UTF-16 character as if it were an integer on
a byte-backwards, I mean, least-significant byte first CPU. (See "Byte
Order Mark" in the unicode.org link I pasted in above.)
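
To illustrate the byte-order point, here is a small Python 3 sketch (an assumption; nothing in the thread uses Python) that encodes the Euro sign as UTF-16 in both byte orders. It only shows how the two orders differ; it does not establish which encoding the dump file actually used.

```python
euro = "\u20ac"  # the Euro sign, code point 0x20AC

# Most significant byte first (big-endian).
big_endian = euro.encode("utf-16-be")
# Least significant byte first (little-endian).
little_endian = euro.encode("utf-16-le")

print(" ".join(f"{b:02X}" for b in big_endian))     # 20 AC
print(" ".join(f"{b:02X}" for b in little_endian))  # AC 20
```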

> Furthermore, according to the psql online documentation and man page:
> "Anything contained in single quotes is furthermore subject to C-like
> substitutions for \n (new line), \t (tab), \digits, \0digits, and
> \0xdigits (the character with the given decimal, octal, or hexadecimal
> code)."

I'm having trouble finding that page. Do you have a URL for it? (But,
then again, I might not be the person to ask this question. It seems
like I saw something about it somewhere, maybe it's in the list archives
or something.)

> Those digits *should* be interpreted as decimal digits, but they aren't.
> The man page for psql is either incorrect, or the implementation is
> buggy.
>
> It's worth noting that the field I'm inserting into is an SQL_ASCII
> field, and I'm reading my UTF-8 string out of it like this, via JDBC:
>
>     String value = new String( resultset.getBytes(1), "UTF-8");
>
> Can anyone help me make sense of this mumbo jumbo?

HTH

--
Joel Rees <joel@alpsgiken.gr.jp>