Обсуждение: Reinterpreting BYTEA as TEXT, converting BYTEA to TEXT
Hello, I need to perform "conversions" (transcoding) between BYTEA and TEXT columns in a UTF-8 database. I searched for existing solutions and was unable to find one for 8.x or 9.x, so I cam up with something I'd like to validate with the more enlightened members of this list... Case 1: reinterpreting: (Working on a 8.4.3 backend.) In a UTF8 database I have a BYTEA value which contains a perfectly valid UTF8 string. I want to simply *reinterpret* it as TEXT (in the same way pg_convert_from internally reinterprets the BYTEA return value from pg_convert as TEXT), backend-side, no transcoding/encoding/decoding should take place. The solution I came up with goes something like this: CREATE DOMAIN my_varlena AS bytea; CREATE CAST (my_varlena AS text) WITHOUT FUNCTION; ... SELECT bytea_col::my_varlena::text FROM tbl; -- bypass the bytea to varchar/text conversion which actually calls encode() ... DROP DOMAIN my_varlena CASCADE; Is there anything blatantly wrong with this approach that I have missed, or is there a more straightforward way, or anything to be improved? (Again, I need a backend-side solution, not a client-side one -- e.g. copying huge amounts of data from a BYTEA column to a TEXT column in some other table.) Case 2: converting: (Working on a 8.4.3 backend.) In a UTF8 database I have a BYTEA value which contains a perfectly valid e.g. LATIN1 string. Building on top of the above (again, 100% backend-side), is there anything blatantly wrong with: CREATE DOMAIN my_varlena AS bytea; CREATE CAST (my_varlena AS text) WITHOUT FUNCTION; ... SELECT convert(bytea_col::bytea, 'LATIN1', 'UTF8')::my_varlena::text FROM tbl; ... DROP DOMAIN my_varlena CASCADE; Thank you! PS: Incidentally, if bytea_col contains any of the invalid LATIN1 sequences 0x7f, 0x80, 0x81, 0x82 etc., pg_convert (v8.4.3) will (leniently) convert them to the invalid UTF-8 BYTEA sequences 0x7f, 0xc2 0x80, 0xc2 0x81, 0xc2 0x82 etc. :)
On Wed, Mar 9, 2011 at 4:09 PM, Vlad Romascanu <vromascanu@accurev.com> wrote: > Hello, > > I need to perform "conversions" (transcoding) between BYTEA and TEXT > columns in a UTF-8 database. I searched for existing solutions and > was unable to find one for 8.x or 9.x, so I cam up with something I'd > like to validate with the more enlightened members of this list... > > Case 1: reinterpreting: > > (Working on a 8.4.3 backend.) > > In a UTF8 database I have a BYTEA value which contains a perfectly > valid UTF8 string. I want to simply *reinterpret* it as TEXT (in the > same way pg_convert_from internally reinterprets the BYTEA return > value from pg_convert as TEXT), backend-side, no > transcoding/encoding/decoding should take place. > > The solution I came up with goes something like this: > > CREATE DOMAIN my_varlena AS bytea; > CREATE CAST (my_varlena AS text) WITHOUT FUNCTION; > ... > SELECT bytea_col::my_varlena::text FROM tbl; -- bypass the bytea to > varchar/text conversion which actually calls encode() > ... > DROP DOMAIN my_varlena CASCADE; > > Is there anything blatantly wrong with this approach that I have > missed, or is there a more straightforward way, or anything to be > improved? (Again, I need a backend-side solution, not a client-side > one -- e.g. copying huge amounts of data from a BYTEA column to a TEXT > column in some other table.) I think convert_from is a little more direct: convert_from(string bytea, src_encoding name) http://www.postgresql.org/docs/8.4/interactive/functions-string.html merlin
Aha! :) Why haven't I used convert_from() -- my ultimate abominable goal is to do an in-place migration of an SQL_ASCII database (LC_TYPE/COLLATION="C") to UTF8 (LC_TYPE/COLLATION="C"), where the string data in the SQL_ASCII database is in some encoding, say LATIN1, and where the bulk of the data is ASCII-only (i.e. don't want to "pg_dump -E sqlasciidb | pg_restore -d utf8db" when I can potentially get away with a fraction of the I/O and disk space requirements.) So the plan was to: UPDATE tbl SET str=convert(str::bytea, 'LATIN1', 'UTF8')::my_varlena::text WHERE str::bytea<>convert(str::bytea, 'LATIN1', 'UTF8'); UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8') WHERE datname='sqlasciidb'; -- close all connections/backends connected to sqlasciidb -- reopen connection -- reindex Of course, I could theoretically UPDATE pg_database first, restart the backends, and then use convert_from(), but even though LC_TYPE="C" I sort of preferred the strings to be UTF8-consistent prior to restarting a backend on the UTF8'ed database. So I guess the ultimate question therefore is: what is wrong, if anything, with the above hack? :) Thanks! V. On Wed, Mar 9, 2011 at 5:23 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > I think convert_from is a little more direct: > convert_from(string bytea, src_encoding name)