Обсуждение: inserts bypass encoding conversion

Поиск
Список
Период
Сортировка

inserts bypass encoding conversion

От
"James Pang (chaolpan)"
Дата:

Hi,

   With client_encoding=UTF8 and server_encoding=LATIN1, looks like insert into value with chr(codepoint) bypass encoding conversion , is it expected ?  test as below ,

 

jamet=# delete from testutf8;

DELETE 1

jamet=# show client_encoding;

client_encoding

-----------------

UTF8

(1 row)

 

jamet=# show server_encoding;

server_encoding

-----------------

LATIN1

(1 row)

 

jamet=# \d testutf8

                     Table "public.testutf8"

Column |          Type          | Collation | Nullable | Default

--------+------------------------+-----------+----------+---------

test   | character varying(128) |           |          |

 

jamet=# insert into testutf8 values('…');

ERROR:  character with byte sequence 0xe2 0x80 0xa6 in encoding "UTF8" has no equivalent in encoding "LATIN1"           <<< here it’s expected to see encoding conversion error

jamet=# insert into testutf8 values(chr(226)||chr(128)||chr(166));                                                                                                 <<< here, looks like using chr(codepoint) works, it bypass encoding_conversion ?

INSERT 0 1

jamet=# set client_encoding='LATIN1';

SET

jamet=# show client_encoding;

client_encoding

-----------------

LATIN1

(1 row)

 

jamet=# show server_encoding;

server_encoding

-----------------

LATIN1

(1 row)

 

jamet=# select * from testutf8;

test

------

(1 row)

 

jamet=# insert into testutf8 values('…');                                                                               <<< here, with client and server same LATIN1, no any encoding conversion , and the data got inserted.

INSERT 0 1

jamet=# select * from testutf8;

test

------

(2 rows)

 

jamet=# select encode(test::bytea,'hex') from testutf8;                                                    <<< both show same value

 encode

--------

e280a6

e280a6

(2 rows)

Re: inserts bypass encoding conversion

От
Tom Lane
Дата:
"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>    With client_encoding=UTF8 and server_encoding=LATIN1, looks like insert into value with chr(codepoint) bypass
encodingconversion , is it expected ?  test as below , 

The chr() function is a server-side operation that has nothing to do
with the client encoding.

            regards, tom lane



RE: inserts bypass encoding conversion

От
"James Pang (chaolpan)"
Дата:
In this case, the real value stored in database is UTF8 byte sequence instead of LATIN1 encoding text, right?   When
Irun "select * from table" with client_encoding=LATIN1, automatically encoding conversion from UTF8 byte sequence to
text? 

Thanks,

James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Wednesday, August 16, 2023 10:03 PM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-admin@lists.postgresql.org
Subject: Re: inserts bypass encoding conversion

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>    With client_encoding=UTF8 and server_encoding=LATIN1, looks like
> insert into value with chr(codepoint) bypass encoding conversion , is
> it expected ?  test as below ,

The chr() function is a server-side operation that has nothing to do with the client encoding.

            regards, tom lane



Re: inserts bypass encoding conversion

От
Tom Lane
Дата:
"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>   In this case, the real value stored in database is UTF8 byte sequence
>   instead of LATIN1 encoding text, right?

Not if you have server_encoding = LATIN1, as you stated earlier.
In that case, the data in the database is in LATIN1, and chr()
interprets its argument as a LATIN1 code value --- which happens
to look enough like a Unicode code point to be possibly confusing,
until you try to use code points that aren't within LATIN1.

            regards, tom lane



RE: inserts bypass encoding conversion

От
"James Pang (chaolpan)"
Дата:
So,  insert into values(chr(226)||chr(128)||chr(166)) actually got stored in database with LATIN1 with single byte
sequence,but when query select * from testutf8, it got converted to UTF8 three byte sequence first ?  

jamet=# select chr(226)||chr(128)||chr(166);
 ?column?
----------
 ...
(1 row)

jamet=# select * from testutf8;
                                      test
--------------------------------------------------------------------------------
 ...

jamet=# select encode(test::bytea,'hex') from testutf8;
                                                                            encode


-------------------------------------------------------------------------------------------------------------------------------------------------------------
-
 e280a6

Thanks,

James

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Thursday, August 17, 2023 9:33 AM
To: James Pang (chaolpan) <chaolpan@cisco.com>
Cc: pgsql-admin@lists.postgresql.org
Subject: Re: inserts bypass encoding conversion

"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
>   In this case, the real value stored in database is UTF8 byte sequence
>   instead of LATIN1 encoding text, right?

Not if you have server_encoding = LATIN1, as you stated earlier.
In that case, the data in the database is in LATIN1, and chr() interprets its argument as a LATIN1 code value --- which
happensto look enough like a Unicode code point to be possibly confusing, until you try to use code points that aren't
withinLATIN1. 

            regards, tom lane



Re: inserts bypass encoding conversion

От
Tom Lane
Дата:
"James Pang (chaolpan)" <chaolpan@cisco.com> writes:
> So,  insert into values(chr(226)||chr(128)||chr(166)) actually got stored in database with LATIN1 with single byte
sequence,but when query select * from testutf8, it got converted to UTF8 three byte sequence first ?  

There are no LATIN1 characters that have longer than 2-byte UTF8
representations, so no.

I think your fundamental misunderstanding is supposing that this:

    chr(226)||chr(128)||chr(166)

produces something equivalent to the UTF8 sequence 0xe2 0x80 0xa6.
It will not, no matter which server encoding you are dealing with.
It will produce something that is three separate characters
according to the server encoding.  In LATIN1, that could well be
the byte sequence 0xe2 0x80 0xa6, but *that byte sequence does not
mean the same thing that it would mean in UTF8 encoding*.

You also seem not to grasp the fact that an encoding conversion
will happen between your client and the server if client_encoding
is different from server_encoding.  Because of that, the output of
a SELECT command doesn't prove much of anything here.

            regards, tom lane