Обсуждение: encode, lower and 0x8a

Поиск
Список
Период
Сортировка

encode, lower and 0x8a

От
"Michael Artz"
Дата:
Perhaps my understanding of the 'encode' function is incorrect, but I
was under the impression that I could do something like:

SELECT lower(encode(bytes, 'escape')) FROM mytable;

as it sounded like (from the manual) that 'encode' would return valid
ASCII, with all the non-ascii bytes hex escaped.  When I have the byte
0x8a, however, I get the error:

ERROR:  invalid byte sequence for encoding "UTF8": 0x8a

I have the sneaking suspicion that I am missing something, so please
correct me if I am wrong.  If I am wrong, is there a better way to
lowercase all the ascii characters in a bytea string?

Here is a simple way to recreate this:

CREATE TABLE mytable (bytes BYTEA);
INSERT INTO mytable VALUES (E'212');
SELECT lower(encode(bytes, 'escape')) FROM mytable;

Thanks
-Mike

Re: encode, lower and 0x8a

От
Michael Fuhr
Дата:
On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote:
> Perhaps my understanding of the 'encode' function is incorrect, but I
> was under the impression that I could do something like:
>
> SELECT lower(encode(bytes, 'escape')) FROM mytable;
>
> as it sounded like (from the manual) that 'encode' would return valid
> ASCII, with all the non-ascii bytes hex escaped.

The documentation for encode() does give that impression: "Encode
binary string to ASCII-only representation.  Supported types are:
base64, hex, escape."  However, the source code for esc_encode()
in src/backend/utils/adt/encode.c says and does otherwise:

 * Only two characters are escaped:
 * \0 (null) and \\ (backslash)

> When I have the byte 0x8a, however, I get the error:
>
> ERROR:  invalid byte sequence for encoding "UTF8": 0x8a

Since encode() returns text and doesn't escape non-ASCII characters,
all of the original binary data will be treated as though it's text
in the database's encoding.  If the data contains byte sequences
that aren't valid in that encoding then you get the above error.

> I have the sneaking suspicion that I am missing something, so please
> correct me if I am wrong.  If I am wrong, is there a better way to
> lowercase all the ascii characters in a bytea string?

What are you trying to do?  What is the binary data and why are you
treating it (or part of it) as though it's text?  Do you want the
end result to be text with escape sequences or do you want to convert
it back to bytea?

Something like this might work:

SELECT lower(textin(byteaout(bytes))) FROM mytable;

To turn the result back into bytea:

SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;

--
Michael Fuhr

Re: encode, lower and 0x8a

От
"Michael Artz"
Дата:
On 1/27/07, Michael Fuhr <mike@fuhr.org> wrote:
> On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote:
> > Perhaps my understanding of the 'encode' function is incorrect, but I
> > was under the impression that I could do something like:
> >
> > SELECT lower(encode(bytes, 'escape')) FROM mytable;
> >
> > as it sounded like (from the manual) that 'encode' would return valid
> > ASCII, with all the non-ascii bytes hex escaped.
...snip...
> > I have the sneaking suspicion that I am missing something, so please
> > correct me if I am wrong.  If I am wrong, is there a better way to
> > lowercase all the ascii characters in a bytea string?
>
> What are you trying to do?  What is the binary data and why are you
> treating it (or part of it) as though it's text?  Do you want the
> end result to be text with escape sequences or do you want to convert
> it back to bytea?

The data are binary strings culled from network traffic.  Most of the
"binary strings" are just ascii strings, however not all of them are,
and some of the ascii strings have binary characters embedded in them.
 In this case, I was displaying the string to a user via a web
browser, in which case I wanted the output to be escaped.  The reason
for the 'lower()' was to enable the case-insensitive sort, so actually
the offending line of the program would look something more like:

SELECT encode(bytes, 'escape') FROM mytables ORDER BY
lower(encode(bytes, 'escape'))


> Something like this might work:
>
> SELECT lower(textin(byteaout(bytes))) FROM mytable;
>
> To turn the result back into bytea:
>
> SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;

That seems to work correctly, however I missed the functions textin'
and 'byteaout' in the docs ... are they documented somewhere?

Thanks,
-Mike

Re: encode, lower and 0x8a

От
Michael Fuhr
Дата:
On Mon, Jan 29, 2007 at 12:52:33PM -0500, Michael Artz wrote:
> On 1/27/07, Michael Fuhr <mike@fuhr.org> wrote:
> >SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;
>
> That seems to work correctly, however I missed the functions textin'
> and 'byteaout' in the docs ... are they documented somewhere?

Those are the types' input/output functions.  They're not specifically
documented but "User-Defined Types" in the "Extending SQL" chapter
discusses the concept; see also CREATE TYPE.

http://www.postgresql.org/docs/8.2/interactive/xtypes.html
http://www.postgresql.org/docs/8.2/interactive/sql-createtype.html

Using the input/output functions is a bit hackish since, as you
discovered, their exact behavior isn't documented.

--
Michael Fuhr

Re: encode, lower and 0x8a

От
Bruce Momjian
Дата:
I have updated the encode() documentation to not mention "ASCII", and to
be more specific about what 'escape' does.  Backpatched to 8.2.X.

---------------------------------------------------------------------------

Michael Fuhr wrote:
> On Thu, Jan 25, 2007 at 02:28:38PM -0500, Michael Artz wrote:
> > Perhaps my understanding of the 'encode' function is incorrect, but I
> > was under the impression that I could do something like:
> >
> > SELECT lower(encode(bytes, 'escape')) FROM mytable;
> >
> > as it sounded like (from the manual) that 'encode' would return valid
> > ASCII, with all the non-ascii bytes hex escaped.
>
> The documentation for encode() does give that impression: "Encode
> binary string to ASCII-only representation.  Supported types are:
> base64, hex, escape."  However, the source code for esc_encode()
> in src/backend/utils/adt/encode.c says and does otherwise:
>
>  * Only two characters are escaped:
>  * \0 (null) and \\ (backslash)
>
> > When I have the byte 0x8a, however, I get the error:
> >
> > ERROR:  invalid byte sequence for encoding "UTF8": 0x8a
>
> Since encode() returns text and doesn't escape non-ASCII characters,
> all of the original binary data will be treated as though it's text
> in the database's encoding.  If the data contains byte sequences
> that aren't valid in that encoding then you get the above error.
>
> > I have the sneaking suspicion that I am missing something, so please
> > correct me if I am wrong.  If I am wrong, is there a better way to
> > lowercase all the ascii characters in a bytea string?
>
> What are you trying to do?  What is the binary data and why are you
> treating it (or part of it) as though it's text?  Do you want the
> end result to be text with escape sequences or do you want to convert
> it back to bytea?
>
> Something like this might work:
>
> SELECT lower(textin(byteaout(bytes))) FROM mytable;
>
> To turn the result back into bytea:
>
> SELECT decode(lower(textin(byteaout(bytes))), 'escape') FROM mytable;
>
> --
> Michael Fuhr
>
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faq

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Index: doc/src/sgml/func.sgml
===================================================================
RCS file: /cvsroot/pgsql/doc/src/sgml/func.sgml,v
retrieving revision 1.368
diff -c -c -r1.368 func.sgml
*** doc/src/sgml/func.sgml    20 Feb 2007 18:18:05 -0000    1.368
--- doc/src/sgml/func.sgml    20 Feb 2007 19:48:37 -0000
***************
*** 1356,1363 ****
         </entry>
         <entry><type>text</type></entry>
         <entry>
!         Encode binary data to <acronym>ASCII</acronym>-only representation.  Supported
          types are: <literal>base64</>, <literal>hex</>, <literal>escape</>.
         </entry>
         <entry><literal>encode( E'123\\000\\001', 'base64')</literal></entry>
         <entry><literal>MTIzAAE=</literal></entry>
--- 1356,1365 ----
         </entry>
         <entry><type>text</type></entry>
         <entry>
!         Encode binary data to different representation.  Supported
          types are: <literal>base64</>, <literal>hex</>, <literal>escape</>.
+         <literal>Escape</> merely outputs null bytes as <literal>\000</> and
+         doubles backslashes.
         </entry>
         <entry><literal>encode( E'123\\000\\001', 'base64')</literal></entry>
         <entry><literal>MTIzAAE=</literal></entry>