Обсуждение: How to convert HEX to ASCII?

Поиск
Список
Период
Сортировка

How to convert HEX to ASCII?

От
Torsten Zuehlsdorff
Дата:
Hello,

after several attempts I have finally succeeded in developing a
urlencode() function to encode text correctly like defined in RFC 1738.

Now i have a big problem: how to decode the text?

Example:
# SELECT urlencode('Hellö World!');
        urlencode
-----------------------
  Hell%C3%B6%20World%21

Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?

Thanks for your help and greetings from Germany,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.

Re: How to convert HEX to ASCII?

От
Damien Churchill
Дата:
On 2 December 2011 13:18, Torsten Zuehlsdorff <foo@meisterderspiele.de> wrote:
> Hello,
>
> after several attempts I have finally succeeded in developing a urlencode()
> function to encode text correctly like defined in RFC 1738.
>
> Now i have a big problem: how to decode the text?
>
> Example:
> # SELECT urlencode('Hellö World!');
>       urlencode
> -----------------------
>  Hell%C3%B6%20World%21
>
> Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
>

I've extracted the unquote method [0] from urllib in the python stdlib
that decodes urlencoded strings. Hopefully be some use!

[0] http://pastie.org/2954968

Re: How to convert HEX to ASCII?

От
Torsten Zuehlsdorff
Дата:
Damien Churchill schrieb:

>> after several attempts I have finally succeeded in developing a urlencode()
>> function to encode text correctly like defined in RFC 1738.
>>
>> Now i have a big problem: how to decode the text?
>>
>> Example:
>> # SELECT urlencode('Hellö World!');
>>       urlencode
>> -----------------------
>>  Hell%C3%B6%20World%21
>>
>> Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
>
> I've extracted the unquote method [0] from urllib in the python stdlib
> that decodes urlencoded strings. Hopefully be some use!

Not directly, but it gives me some helpful hints. For example i'm now
able to decode some basic characters, for example:

# SELECT chr(x'21'::int);
  chr
-----
  !
(1 row)

But i clearly have a missunderstanding of other chars, like umlauts or
utf-8 chars. This, for example, should return a 'ö':

# SELECT chr(x'C3B6'::int);
  chr
-----
  쎶
(1 row)

Also i'm not sure how to figure out, when to decode '%C3' and when to
decode '%C3%B6'.

Thanks for your help,
Torsten
--
http://www.dddbl.de - ein Datenbank-Layer, der die Arbeit mit 8
verschiedenen Datenbanksystemen abstrahiert,
Queries von Applikationen trennt und automatisch die Query-Ergebnisse
auswerten kann.

Re: How to convert HEX to ASCII?

От
Merlin Moncure
Дата:
On Fri, Dec 2, 2011 at 8:16 AM, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:
> Damien Churchill schrieb:
>
>
>>> after several attempts I have finally succeeded in developing a
>>> urlencode()
>>> function to encode text correctly like defined in RFC 1738.
>>>
>>> Now i have a big problem: how to decode the text?
>>>
>>> Example:
>>> # SELECT urlencode('Hellö World!');
>>>      urlencode
>>> -----------------------
>>>  Hell%C3%B6%20World%21
>>>
>>> Does anybody know a way to convert '%21' back to '!' and '%C3%B6' to 'ö'?
>>
>>
>> I've extracted the unquote method [0] from urllib in the python stdlib
>> that decodes urlencoded strings. Hopefully be some use!
>
>
> Not directly, but it gives me some helpful hints. For example i'm now able
> to decode some basic characters, for example:
>
> # SELECT chr(x'21'::int);
>  chr
> -----
>  !
> (1 row)
>
> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
> chars. This, for example, should return a 'ö':
>
> # SELECT chr(x'C3B6'::int);
>  chr
> -----
>  쎶
> (1 row)
>
> Also i'm not sure how to figure out, when to decode '%C3' and when to decode
> '%C3%B6'.
>
> Thanks for your help,

You're welcome.  get ready for some seriously abusive sql:

create or replace function unencode(text) returns text as
$$
with q as
(
  select (regexp_matches($1, '(%..|.)', 'g'))[1] as v
)
select string_agg(case when length(v) = 3 then chr(replace(v, '%',
'x')::bit(8)::int) else v end, '') from q;
$$ language sql immutable;

set client_encoding to latin1;
SET

postgres=# select unencode('Hell%C3%B6%20World%21');
   unencode
---------------
 Hellö World!
(1 row)

Time: 1.908 ms

(maybe this isn't really an immutable function, but oh well).

merlin

Re: How to convert HEX to ASCII?

От
Marti Raudsepp
Дата:
On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
<foo@meisterderspiele.de> wrote:
> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
> chars. This, for example, should return a 'ö':
>
> # SELECT chr(x'C3B6'::int);
>  chr
> -----
>  쎶
> (1 row)

That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
and actually decodes to the codepoint 00F6.

There is a fundamental problem that a decoded URL may actually be a
binary string -- it might not have a textual representation at all.
But if text is what you want, RFC3986 strongly suggests using UTF-8
for encoding text strings in URLs, and that works almost always in the
real world.

So the *right* way is to first convert the URL to a binary "bytea"
type by fixing all the % escapes, then convert that to UTF-8 encoding
to handle multibyte characters.

What I came up with is far from elegant because PostgreSQL lacks
convenient functions for bytea manipulation (no bytea_agg, etc).
Stealing a little from Merlin, this is what it looks like:

CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
LANGUAGE plpgsql IMMUTABLE STRICT AS $$
DECLARE
  bin bytea = '';
  byte text;
BEGIN
  FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
    IF length(byte) = 3 THEN
      bin = bin || decode(substring(byte, 2, 2), 'hex');
    ELSE
      bin = bin || byte::bytea;
    END IF;
  END LOOP;
  RETURN convert_from(bin, 'utf8');
END
$$;

db=# select url_decode('Hell%C3%B6%20World%21');
  url_decode
--------------
 Hellö World!

db=# select url_decode('%EC%8E%B6');
 url_decode
------------
 쎶

This will break for binary-encoded data in URLs, though.
db=# select url_decode('%fa%fa%fa');
ERROR:  invalid byte sequence for encoding "UTF8": 0xfa
CONTEXT:  PL/pgSQL function "url_decode" line 13 at RETURN

----

On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote:
> set client_encoding to latin1;

> postgres=# select unencode('Hell%C3%B6%20World%21');
>   unencode
> ---------------
>  Hellö World!
> (1 row)

Sorry, but AFAICT this makes a mess of encodings and only works by
pure luck. The server thinks it's sending the client LATIN1 text, but
it's actually UTF8-encoded and the last decoding step is done by your
terminal.

Regards,
Marti

Re: How to convert HEX to ASCII?

От
Merlin Moncure
Дата:
On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
> Sorry, but AFAICT this makes a mess of encodings and only works by
> pure luck. The server thinks it's sending the client LATIN1 text, but
> it's actually UTF8-encoded and the last decoding step is done by your
> terminal.

yup -- your're right -- what a coincidence!  I still prefer the 1
liner sql variant vs plpgsql loop though.  nicely done.

merlin

Re: How to convert HEX to ASCII?

От
Pavel Stehule
Дата:
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
> On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
>> Sorry, but AFAICT this makes a mess of encodings and only works by
>> pure luck. The server thinks it's sending the client LATIN1 text, but
>> it's actually UTF8-encoded and the last decoding step is done by your
>> terminal.
>
> yup -- your're right -- what a coincidence!  I still prefer the 1
> liner sql variant vs plpgsql loop though.  nicely done.

so bytea_agg - one param aggregate has sense

it's very easy to implement it

Pavel

>
> merlin
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general

Re: How to convert HEX to ASCII?

От
Merlin Moncure
Дата:
On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> 2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
>> On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
>>> Sorry, but AFAICT this makes a mess of encodings and only works by
>>> pure luck. The server thinks it's sending the client LATIN1 text, but
>>> it's actually UTF8-encoded and the last decoding step is done by your
>>> terminal.
>>
>> yup -- your're right -- what a coincidence!  I still prefer the 1
>> liner sql variant vs plpgsql loop though.  nicely done.
>
> so bytea_agg - one param aggregate has sense
>
> it's very easy to implement it

yup:

create aggregate bytea_agg (bytea)
(
  sfunc=byteacat,
  stype=bytea
);

merlin

Re: How to convert HEX to ASCII?

От
Pavel Stehule
Дата:
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
> On Fri, Dec 2, 2011 at 11:15 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> 2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
>>> On Fri, Dec 2, 2011 at 10:42 AM, Marti Raudsepp <marti@juffo.org> wrote:
>>>> Sorry, but AFAICT this makes a mess of encodings and only works by
>>>> pure luck. The server thinks it's sending the client LATIN1 text, but
>>>> it's actually UTF8-encoded and the last decoding step is done by your
>>>> terminal.
>>>
>>> yup -- your're right -- what a coincidence!  I still prefer the 1
>>> liner sql variant vs plpgsql loop though.  nicely done.
>>
>> so bytea_agg - one param aggregate has sense
>>
>> it's very easy to implement it
>
> yup:
>
> create aggregate bytea_agg (bytea)
> (
>  sfunc=byteacat,
>  stype=bytea
> );

this is workaround :)

without a memory preallocating it has same speed like cycle in plpgsql.

Regards

Pavel

>
> merlin

Re: How to convert HEX to ASCII?

От
Merlin Moncure
Дата:
On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>> so bytea_agg - one param aggregate has sense
>>>
>>> it's very easy to implement it
>>
>> yup:
>>
>> create aggregate bytea_agg (bytea)
>> (
>>  sfunc=byteacat,
>>  stype=bytea
>> );
>
> this is workaround :)
>
> without a memory preallocating it has same speed like cycle in plpgsql.

sure, but I prefer to code against the workaround because it's cleaner
and it makes things easier to port over when such a feature makes it
into core.  also, one liner sql has better chance of inlining as a
general rule.

merlin

Re: How to convert HEX to ASCII?

От
Pavel Stehule
Дата:
2011/12/2 Merlin Moncure <mmoncure@gmail.com>:
> On Fri, Dec 2, 2011 at 2:01 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>>> so bytea_agg - one param aggregate has sense
>>>>
>>>> it's very easy to implement it
>>>
>>> yup:
>>>
>>> create aggregate bytea_agg (bytea)
>>> (
>>>  sfunc=byteacat,
>>>  stype=bytea
>>> );
>>
>> this is workaround :)
>>
>> without a memory preallocating it has same speed like cycle in plpgsql.
>
> sure, but I prefer to code against the workaround because it's cleaner
> and it makes things easier to port over when such a feature makes it
> into core.  also, one liner sql has better chance of inlining as a
> general rule.

ook

Pavel

>
> merlin

Re: How to convert HEX to ASCII?

От
Torsten Zühlsdorff
Дата:
Marti Raudsepp schrieb:
> On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
> <foo@meisterderspiele.de> wrote:
>> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
>> chars. This, for example, should return a 'ö':
>>
>> # SELECT chr(x'C3B6'::int);
>>  chr
>> -----
>>  쎶
>> (1 row)
>
> That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
> and actually decodes to the codepoint 00F6.
>
> There is a fundamental problem that a decoded URL may actually be a
> binary string -- it might not have a textual representation at all.
> But if text is what you want, RFC3986 strongly suggests using UTF-8
> for encoding text strings in URLs, and that works almost always in the
> real world.

Text is what i want. :) I've created a highly specialiced CMS, which
handle a bunch of big sites (in meaning of a great numbers of users and
content). It has a build-in traffic-analyze and with this function it
creates a real time analyze of the keywords, a user used to find the
sites in search engines. This is very needful if you try to do SEO for
websites with more than 20.000 unique content-pages. :)

> CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
> LANGUAGE plpgsql IMMUTABLE STRICT AS $$
> DECLARE
>   bin bytea = '';
>   byte text;
> BEGIN
>   FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
>     IF length(byte) = 3 THEN
>       bin = bin || decode(substring(byte, 2, 2), 'hex');
>     ELSE
>       bin = bin || byte::bytea;
>     END IF;
>   END LOOP;
>   RETURN convert_from(bin, 'utf8');
> END
> $$;

Hey, this function looks similar to my encoding function :) Thank you
very munch!

> This will break for binary-encoded data in URLs, though.

Thats no problem, i just have text.

Big thanks to all of you,
Torsten

Re: How to convert HEX to ASCII?

От
Pavel Stehule
Дата:
Hello all

just note

9.1 will have a bytea_agg aggregate

regards

Pavel Stehule

2011/12/2 Marti Raudsepp <marti@juffo.org>:
> On Fri, Dec 2, 2011 at 16:16, Torsten Zuehlsdorff
> <foo@meisterderspiele.de> wrote:
>> But i clearly have a missunderstanding of other chars, like umlauts or utf-8
>> chars. This, for example, should return a 'ö':
>>
>> # SELECT chr(x'C3B6'::int);
>>  chr
>> -----
>>  쎶
>> (1 row)
>
> That gives you the Unicode codepoint C3B6, but %C3%B6 is UTF-8-encoded
> and actually decodes to the codepoint 00F6.
>
> There is a fundamental problem that a decoded URL may actually be a
> binary string -- it might not have a textual representation at all.
> But if text is what you want, RFC3986 strongly suggests using UTF-8
> for encoding text strings in URLs, and that works almost always in the
> real world.
>
> So the *right* way is to first convert the URL to a binary "bytea"
> type by fixing all the % escapes, then convert that to UTF-8 encoding
> to handle multibyte characters.
>
> What I came up with is far from elegant because PostgreSQL lacks
> convenient functions for bytea manipulation (no bytea_agg, etc).
> Stealing a little from Merlin, this is what it looks like:
>
> CREATE OR REPLACE FUNCTION url_decode(input text) RETURNS text
> LANGUAGE plpgsql IMMUTABLE STRICT AS $$
> DECLARE
>  bin bytea = '';
>  byte text;
> BEGIN
>  FOR byte IN (select (regexp_matches(input, '(%..|.)', 'g'))[1]) LOOP
>    IF length(byte) = 3 THEN
>      bin = bin || decode(substring(byte, 2, 2), 'hex');
>    ELSE
>      bin = bin || byte::bytea;
>    END IF;
>  END LOOP;
>  RETURN convert_from(bin, 'utf8');
> END
> $$;
>
> db=# select url_decode('Hell%C3%B6%20World%21');
>  url_decode
> --------------
>  Hellö World!
>
> db=# select url_decode('%EC%8E%B6');
>  url_decode
> ------------
>  쎶
>
> This will break for binary-encoded data in URLs, though.
> db=# select url_decode('%fa%fa%fa');
> ERROR:  invalid byte sequence for encoding "UTF8": 0xfa
> CONTEXT:  PL/pgSQL function "url_decode" line 13 at RETURN
>
> ----
>
> On Fri, Dec 2, 2011 at 17:46, Merlin Moncure <mmoncure@gmail.com> wrote:
>> set client_encoding to latin1;
>
>> postgres=# select unencode('Hell%C3%B6%20World%21');
>>   unencode
>> ---------------
>>  Hellö World!
>> (1 row)
>
> Sorry, but AFAICT this makes a mess of encodings and only works by
> pure luck. The server thinks it's sending the client LATIN1 text, but
> it's actually UTF8-encoded and the last decoding step is done by your
> terminal.
>
> Regards,
> Marti
>
> --
> Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-general