Обсуждение: Encoding conversion: Use replacement character instead of failingquery with "ERROR: character with byte sequence 0xd0 0xad in encoding "UTF8"has no equivalent in encoding LATIN1" ?
Hello list I'm slowly converting all of our databases from Postgres 9 and LATIN1 to Postgres 12 and UTF8, one by one. I was wondering if there is a solution for this issue: if a database that is still latin1 has a postgres_fdw foreign table onto a converted utf8 database, and somehow a character that is not in latin1 has already gotten in there, a query might fail with e.g. PG9=# select * from fdw_table_test where hostname ~* 'moscow-ix-02'; ERROR: character with byte sequence 0xd0 0xad in encoding "UTF8" has no equivalent in encoding "LATIN1" -- (0xd0 0xad being CYRILLIC CAPITAL LETTER E: Э) Can I somehow influence the client:UTF8->server:LATIN1 character set conversion so that instead of failing, it inserts an invalid codepoint character, the utf8 hex bytes as string, drops the character or something like that? I do agree that the default behavior is correct and in general failing is a lot better than mutilating or losing data in the conversion. However in this specific case we don't care all that much about the bits in possibly foreign scripts, and just having some ? in the string would be easier to handle than one Cyrillic character in a single row failing a large import job completely. Is there any way I can do this on the conversion level* ? Cheers Christian * I get that I could somehow write views too look at the content first and filter out characters that won't work in LATIN1 before going through the fdw, but I don't quite know all the tables and columns where this can become an issue. But if you have a copy/paste ready solution for that I'll take it as well of course :)
Christian Ramseyer <rc@networkz.ch> writes: > Can I somehow influence the client:UTF8->server:LATIN1 character set > conversion so that instead of failing, it inserts an invalid codepoint > character, the utf8 hex bytes as string, drops the character or > something like that? There's nothing built-in for that, but it seems like it wouldn't be hard to modify the code if you wanted a quick hack to do this. In general, the system nominally supports multiple conversion functions per encoding pair, so you could imagine having an alternate conversion that doesn't throw errors. Problem is that it's quite difficult to get the system to actually *use* a non-default conversion for anything really significant, like say client I/O. I don't know that anyone's thought hard about how to improve that. regards, tom lane
On 06.05.20 02:00, Tom Lane wrote: > Christian Ramseyer <rc@networkz.ch> writes: >> Can I somehow influence the client:UTF8->server:LATIN1 character set >> conversion so that instead of failing, it inserts an invalid codepoint >> character, the utf8 hex bytes as string, drops the character or >> something like that? > > There's nothing built-in for that, but it seems like it wouldn't be > hard to modify the code if you wanted a quick hack to do this. > > In general, the system nominally supports multiple conversion functions > per encoding pair, so you could imagine having an alternate conversion > that doesn't throw errors. Problem is that it's quite difficult to get > the system to actually *use* a non-default conversion for anything really > significant, like say client I/O. I don't know that anyone's thought > hard about how to improve that. > Thanks Tom, that's basically like I suspected how it was, but I wanted to make sure I'm not overlooking an easy workaround with a simple "create conversion" or similar. I really appreciate the quick answers from highly qualified people I'm getting on here, without exceptions. If only "enterprise" product support worked like that :) Cheers Christian -- Christian Ramseyer, netnea ag Network Management. Security. OpenSource. https://www.netnea.com