Обсуждение: Need magic that checks if a collumn contains values that CAN NOT be converted to Latin1

Поиск
Список
Период
Сортировка

Need magic that checks if a collumn contains values that CAN NOT be converted to Latin1

От
Andreas
Дата:
Hi,

I've got an 8.4.3 Unicode DB that accidentally holds a few records with
characters that can't be converted to Latin1 or 9 for output to CSV.

I'd just need a way to check if a collumn contains values that CAN NOT
be converted from Utf8 to Latin1 to select all those affected records.

I tried:
Select convert_to (my_column::text, 'LATIN1') from my_table;

It raises an error that says translated:
ERROR:  character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«
Regrettably it doesn't explain where it found this sign.

Select '\xe28093'
complains that this weren't a valid UTF8 code at all.
So how was it accepted and stored in the first place?

When I know which record has faulty content I can correct it.


regards
Andreas   :)

Re: Need magic that checks if a collumn contains values that CAN NOT be converted to Latin1

От
Craig Ringer
Дата:
On 29/04/10 11:02, Andreas wrote:
> Hi,
>
> I've got an 8.4.3 Unicode DB that accidentally holds a few records with
> characters that can't be converted to Latin1 or 9 for output to CSV.
>
> I'd just need a way to check if a collumn contains values that CAN NOT
> be converted from Utf8 to Latin1 to select all those affected records.
>
> I tried:
> Select convert_to (my_column::text, 'LATIN1') from my_table;
>
> It raises an error that says translated:
> ERROR: character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«
> Regrettably it doesn't explain where it found this sign.

I'd use a PL/PgSQL procedure to step through the result of a FOR IN
SELECT, running each test in an exception handling block.

> Select '\xe28093'
> complains that this weren't a valid UTF8 code at all.
> So how was it accepted and stored in the first place?

Because that's the escape E'\xe2" followed by the literal characters
"8093".  "\xe2" followed by those characters isn't legal utf-8.


I'm pretty sure the error actually refers to:

   select E'\xe2\x80\x93';

which is the character "–" (U+2013 EN DASH).


Yes, it'd be nice if PostgreSQL's error message was in syntax that
PostgreSQL understood, not pseudo-C-style hex literal form.

--
Craig Ringer

Re: Need magic that checks if a collumn contains values that CAN NOT be converted to Latin1

От
Magnus Hagander
Дата:
On Thu, Apr 29, 2010 at 05:02, Andreas <maps.on@gmx.net> wrote:
> Hi,
>
> I've got an 8.4.3 Unicode DB that accidentally holds a few records with
> characters that can't be converted to Latin1 or 9 for output to CSV.
>
> I'd just need a way to check if a collumn contains values that CAN NOT be
> converted from Utf8 to Latin1 to select all those affected records.
>
> I tried:
> Select convert_to (my_column::text, 'LATIN1') from my_table;
>
> It raises an error that says translated:
> ERROR:  character 0xe28093 in encoding »UTF8« has no equivalent in »LATIN1«
> Regrettably it doesn't explain where it found this sign.
>
> Select '\xe28093'
> complains that this weren't a valid UTF8 code at all.
> So how was it accepted and stored in the first place?
>
> When I know which record has faulty content I can correct it.


Wrap your check in a simple function:

CREATE OR REPLACE FUNCTION public.is_valid_encoding(vtext text, encname text)
 RETURNS boolean
 LANGUAGE plpgsql
AS $$
BEGIN
   BEGIN
      PERFORM convert_to(vtext, encname);
   EXCEPTION WHEN untranslatable_character THEN
      RETURN 'f';
   END;
   RETURN 't';
END;
$$



And execute
SELECT * FROM my_table WHERE NOT is_valid_encoding(my_column, 'LATIN1')



--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/