Обсуждение: Reinterpreting BYTEA as TEXT, converting BYTEA to TEXT

Поиск
Список
Период
Сортировка

Reinterpreting BYTEA as TEXT, converting BYTEA to TEXT

От
Vlad Romascanu
Дата:
Hello,

I need to perform "conversions" (transcoding) between BYTEA and TEXT
columns in a UTF-8 database.  I searched for existing solutions and
was unable to find one for 8.x or 9.x, so I cam up with something I'd
like to validate with the more enlightened members of this list...

Case 1: reinterpreting:

(Working on a 8.4.3 backend.)

In a UTF8 database I have a BYTEA value which contains a perfectly
valid UTF8 string.  I want to simply *reinterpret* it as TEXT (in the
same way pg_convert_from internally reinterprets the BYTEA return
value from pg_convert as TEXT), backend-side, no
transcoding/encoding/decoding should take place.

The solution I came up with goes something like this:

  CREATE DOMAIN my_varlena AS bytea;
  CREATE CAST (my_varlena AS text) WITHOUT FUNCTION;
  ...
  SELECT bytea_col::my_varlena::text FROM tbl; -- bypass the bytea to
varchar/text conversion which actually calls encode()
  ...
  DROP DOMAIN my_varlena CASCADE;

Is there anything blatantly wrong with this approach that I have
missed, or is there a more straightforward way, or anything to be
improved?  (Again, I need a backend-side solution, not a client-side
one -- e.g. copying huge amounts of data from a BYTEA column to a TEXT
column in some other table.)



Case 2: converting:

(Working on a 8.4.3 backend.)

In a UTF8 database I have a BYTEA value which contains a perfectly
valid e.g. LATIN1 string.  Building on top of the above (again, 100%
backend-side), is there anything blatantly wrong with:

  CREATE DOMAIN my_varlena AS bytea;
  CREATE CAST (my_varlena AS text) WITHOUT FUNCTION;
  ...
  SELECT convert(bytea_col::bytea, 'LATIN1', 'UTF8')::my_varlena::text FROM tbl;
  ...
  DROP DOMAIN my_varlena CASCADE;


Thank you!

PS: Incidentally, if bytea_col contains any of the invalid LATIN1
sequences 0x7f, 0x80, 0x81, 0x82 etc., pg_convert (v8.4.3) will
(leniently) convert them to the invalid UTF-8 BYTEA sequences 0x7f,
0xc2 0x80, 0xc2 0x81, 0xc2 0x82 etc. :)

Re: Reinterpreting BYTEA as TEXT, converting BYTEA to TEXT

От
Merlin Moncure
Дата:
On Wed, Mar 9, 2011 at 4:09 PM, Vlad Romascanu <vromascanu@accurev.com> wrote:
> Hello,
>
> I need to perform "conversions" (transcoding) between BYTEA and TEXT
> columns in a UTF-8 database.  I searched for existing solutions and
> was unable to find one for 8.x or 9.x, so I cam up with something I'd
> like to validate with the more enlightened members of this list...
>
> Case 1: reinterpreting:
>
> (Working on a 8.4.3 backend.)
>
> In a UTF8 database I have a BYTEA value which contains a perfectly
> valid UTF8 string.  I want to simply *reinterpret* it as TEXT (in the
> same way pg_convert_from internally reinterprets the BYTEA return
> value from pg_convert as TEXT), backend-side, no
> transcoding/encoding/decoding should take place.
>
> The solution I came up with goes something like this:
>
>  CREATE DOMAIN my_varlena AS bytea;
>  CREATE CAST (my_varlena AS text) WITHOUT FUNCTION;
>  ...
>  SELECT bytea_col::my_varlena::text FROM tbl; -- bypass the bytea to
> varchar/text conversion which actually calls encode()
>  ...
>  DROP DOMAIN my_varlena CASCADE;
>
> Is there anything blatantly wrong with this approach that I have
> missed, or is there a more straightforward way, or anything to be
> improved?  (Again, I need a backend-side solution, not a client-side
> one -- e.g. copying huge amounts of data from a BYTEA column to a TEXT
> column in some other table.)

I think convert_from is a little more direct:
convert_from(string bytea, src_encoding name)
http://www.postgresql.org/docs/8.4/interactive/functions-string.html

merlin

Re: Reinterpreting BYTEA as TEXT, converting BYTEA to TEXT

От
Vlad Romascanu
Дата:
Aha! :)  Why haven't I used convert_from() -- my ultimate abominable
goal is to do an in-place migration of an SQL_ASCII database
(LC_TYPE/COLLATION="C") to UTF8 (LC_TYPE/COLLATION="C"), where the
string data in the SQL_ASCII database is in some encoding, say LATIN1,
and where the bulk of the data is ASCII-only (i.e. don't want to
"pg_dump -E sqlasciidb | pg_restore -d utf8db" when I can potentially
get away with a fraction of the I/O and disk space requirements.)  So
the plan was to:

UPDATE tbl SET str=convert(str::bytea, 'LATIN1',
'UTF8')::my_varlena::text WHERE str::bytea<>convert(str::bytea,
'LATIN1', 'UTF8');
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8') WHERE
datname='sqlasciidb';
-- close all connections/backends connected to sqlasciidb
-- reopen connection
-- reindex

Of course, I could theoretically UPDATE pg_database first, restart the
backends, and then use convert_from(), but even though LC_TYPE="C" I
sort of preferred the strings to be UTF8-consistent prior to
restarting a backend on the UTF8'ed database.

So I guess the ultimate question therefore is: what is wrong, if
anything, with the above hack? :)

Thanks!
V.

On Wed, Mar 9, 2011 at 5:23 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I think convert_from is a little more direct:
> convert_from(string bytea, src_encoding name)