Discussion: Encoding conversion: Use replacement character instead of failing query with "ERROR: character with byte sequence 0xd0 0xad in encoding "UTF8" has no equivalent in encoding "LATIN1"?

Hello list

I'm slowly converting all of our databases from Postgres 9 and LATIN1 to
Postgres 12 and UTF8, one by one.

I was wondering if there is a solution for this issue: if a database
that is still latin1 has a postgres_fdw foreign table onto a converted
utf8 database, and somehow a character that is not in latin1 has already
gotten in there, a query might fail with e.g.

PG9=# select * from fdw_table_test where hostname ~* 'moscow-ix-02';
ERROR:  character with byte sequence 0xd0 0xad in encoding "UTF8" has no
equivalent in encoding "LATIN1"
-- (0xd0 0xad being CYRILLIC CAPITAL LETTER E: Э)

Can I somehow influence the client:UTF8->server:LATIN1 character set
conversion so that instead of failing, it inserts an invalid codepoint
character, the utf8 hex bytes as string, drops the character or
something like that?
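(For reference, the same conversion failure can be reproduced without the
fdw at all. In a UTF8 database, forcing a conversion of a character that
has no LATIN1 equivalent raises the identical error:

```sql
-- Run in a database with server_encoding = UTF8.
-- U+042D has no representation in LATIN1, so the conversion fails:
SELECT convert_to('Э', 'LATIN1');
-- ERROR:  character with byte sequence 0xd0 0xad in encoding "UTF8"
--         has no equivalent in encoding "LATIN1"
```

so anything that goes through the UTF8->LATIN1 conversion path hits the
same check.)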

I do agree that the default behavior is correct and in general failing
is a lot better than mutilating or losing data in the conversion.
However in this specific case we don't care all that much about the bits
in possibly foreign scripts, and just having some ? in the string would
be easier to handle than one Cyrillic character in a single row failing
a large import job completely.

Is there any way I can do this on the conversion level*?

Cheers
Christian


* I get that I could somehow write views to look at the content first
and filter out characters that won't work in LATIN1 before going through
the fdw, but I don't quite know all the tables and columns where this
can become an issue. But if you have a copy/paste ready solution for
that I'll take it as well of course :)
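(For what it's worth, a view-based workaround on the UTF8 side might look
something like the sketch below. The table and column names are taken from
the example query above; the surrogate "?" and the exact regex range are
choices, not anything built in. It relies on PostgreSQL's ARE
character-entry escapes, where \uhhhh matches a codepoint, so everything
above U+00FF, i.e. outside LATIN1, gets replaced before the fdw ever sees
it:

```sql
-- Sketch of a "LATIN1-safe" view on the UTF8 server. Point the
-- foreign table at this view instead of at table_test directly.
-- Any character above U+00FF is replaced with '?' so the
-- UTF8->LATIN1 conversion on the fdw path can no longer fail.
CREATE VIEW table_test_latin1_safe AS
SELECT regexp_replace(hostname, '[^\u0001-\u00ff]', '?', 'g') AS hostname
FROM table_test;
```

The obvious downside, as noted, is that you'd have to do this for every
affected table and column.)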



Christian Ramseyer <rc@networkz.ch> writes:
> Can I somehow influence the client:UTF8->server:LATIN1 character set
> conversion so that instead of failing, it inserts an invalid codepoint
> character, the utf8 hex bytes as string, drops the character or
> something like that?

There's nothing built-in for that, but it seems like it wouldn't be
hard to modify the code if you wanted a quick hack to do this.

In general, the system nominally supports multiple conversion functions
per encoding pair, so you could imagine having an alternate conversion
that doesn't throw errors.  Problem is that it's quite difficult to get
the system to actually *use* a non-default conversion for anything really
significant, like say client I/O.  I don't know that anyone's thought
hard about how to improve that.

            regards, tom lane




On 06.05.20 02:00, Tom Lane wrote:
> Christian Ramseyer <rc@networkz.ch> writes:
>> Can I somehow influence the client:UTF8->server:LATIN1 character set
>> conversion so that instead of failing, it inserts an invalid codepoint
>> character, the utf8 hex bytes as string, drops the character or
>> something like that?
> 
> There's nothing built-in for that, but it seems like it wouldn't be
> hard to modify the code if you wanted a quick hack to do this.
> 
> In general, the system nominally supports multiple conversion functions
> per encoding pair, so you could imagine having an alternate conversion
> that doesn't throw errors.  Problem is that it's quite difficult to get
> the system to actually *use* a non-default conversion for anything really
> significant, like say client I/O.  I don't know that anyone's thought
> hard about how to improve that.
> 

Thanks Tom, that's basically how I suspected it was, but I wanted to
make sure I'm not overlooking an easy workaround with a simple
"create conversion" or similar.

I really appreciate the quick answers from highly qualified people I'm
getting on here, without exception. If only "enterprise" product
support worked like that :)


Cheers
Christian


-- 
Christian Ramseyer, netnea ag
Network Management. Security. OpenSource.
https://www.netnea.com