Обсуждение: unaccent extension missing some accents

Поиск
Список
Период
Сортировка

unaccent extension missing some accents

От
J Smith
Дата:
G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"à    a"
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Ġ    G"
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Š    S"
===

I've dug around through the unaccent code a bit and I've noticed that
the sscanf it does when reading the file is producing some odd output.


Re: unaccent extension missing some accents

От
J Smith
Дата:
Gah! Accidentally hit Send. Let me finish that last message before
sending this time!


G'day list.

I've been messing around with the unaccent extension and I've noticed
that some of the characters listed in the unaccent.rules file aren't
actually being unaccented on my system.

Here are the system details and whatnot.

- OSX 10.7.2

- the server is compiled via macports. Tried using both gcc and llvm
4.2.1 compilers that come with the latest version of XCode.

- the same symptoms show up in both 9.0.5 and 9.1.1. I've also tried
building manually from the latest REL9_1_STABLE branch from git to
make sure macports wasn't the problem, but I'm getting the same
results with both compilers.

When I first do a CREATE EXTENSION for unaccent, I'm seeing the
following warnings in the log file:

===
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 8 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"à      a
       "
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 57 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Ġ      G
       "
WARNING:  duplicate TO argument, use first one
CONTEXT:  line 144 of configuration file
"/usr/local/postgresql91-local/share/tsearch_data/unaccent.rules":
"Š      S
       "
===

I've dug around through the unaccent.c code a bit and I've noticed
that the sscanf it does when reading the file is producing some odd
output. I've tried with a minimal example using the same sort of
sscanf code reading from the same unaccent.rules file, but the minimal
example doesn't produce the same output.

I put some elog debugging lines into unaccent.c and found that sscanf
sometimes reads the scanned line by finding only one byte for the for
the source character rather than the two required for the complete
UTF-8 code point. It appears that the following characters are causing
the problem, along with the code points and such:

'Å' => 'A' | c3,85 => 41
'à' => 'a' | c3,a0 => 61
'ą' => 'a' | c4,85 => 61
'Ġ' => 'G' | c4,a0 => 47
'Ņ' => 'N' | c5,85 => 4e
'Š' => 'S' | c5,a0 => 53

In each case, one byte was being read in the source string rather than
two, leading to the "duplicate TO" warnings above. This later leads to
the characters that produced the warning being ignored when unaccent
is called and left in the output.

I haven't been able to reproduce in a smaller example, and haven't
been able to reproduce on a CentOS server, so at this point I'm at a
loss as to the problem.

Anybody got any ideas?

Cheers


Re: unaccent extension missing some accents

От
Florian Pflug
Дата:
On Nov6, 2011, at 18:43 , J Smith wrote:
> I put some elog debugging lines into unaccent.c and found that sscanf
> sometimes reads the scanned line by finding only one byte for the for
> the source character rather than the two required for the complete
> UTF-8 code point. It appears that the following characters are causing
> the problem, along with the code points and such:
>
> 'Å' => 'A' | c3,85 => 41
> 'à' => 'a' | c3,a0 => 61
> 'ą' => 'a' | c4,85 => 61
> 'Ġ' => 'G' | c4,a0 => 47
> 'Ņ' => 'N' | c5,85 => 4e
> 'Š' => 'S' | c5,a0 => 53
>
> In each case, one byte was being read in the source string rather than
> two, leading to the "duplicate TO" warnings above. This later leads to
> the characters that produced the warning being ignored when unaccent
> is called and left in the output.

What's the locale of the database you're seeing this in, and which charset
does it use?

I think scanf() uses isspace() and friends, and last time I looked the
locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
decides that 0xA0 is whitespace.

> I haven't been able to reproduce in a smaller example, and haven't
> been able to reproduce on a CentOS server, so at this point I'm at a
> loss as to the problem.

Have you tried to set the same locale as postgres (using setlocale()) in
your tests?

best regards,
Florian Pflug



Re: unaccent extension missing some accents

От
J Smith
Дата:
On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug <fgp@phlo.org> wrote:
>
> What's the locale of the database you're seeing this in, and which charset
> does it use?
>
> I think scanf() uses isspace() and friends, and last time I looked the
> locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
> decides that 0xA0 is whitespace.
>

Ah, that does it: the locale I was using in the test code was just
plain ol' C locale, while in the database it was en_CA.UTF-8. Changing
the locale in my test code produced the wonky results. I should have
figured it was a locale problem. Sure enough, in a UTF-8 locale, it
believes that both 0xa0 and 0x85 are spaces. Pretty wonky behaviour
indeed.

Apparently this is a known OSX issue that has its roots in and older
version of FreeBSD's libc I guess, eh? I've found various bug reports
that allude to the problem and they all seem to point that way.

I've attached a patch against master for unaccent.c that uses swscanf
along with char2wchar and wchar2char instead of sscanf directly to
initialize the unaccent extension and it appears to fix the problem in
both the master and 9.1 branches.

I haven't added any tests in the expected output file 'cause I'm not
exactly sure what I should be testing against, but I could take a
crack at that, too, if the patch looks reasonable and is usable.

Cheers.

Вложения

Re: unaccent extension missing some accents

От
Tom Lane
Дата:
J Smith <dark.panda+lists@gmail.com> writes:
> I've attached a patch against master for unaccent.c that uses swscanf
> along with char2wchar and wchar2char instead of sscanf directly to
> initialize the unaccent extension and it appears to fix the problem in
> both the master and 9.1 branches.

swscanf doesn't seem like an acceptable approach: it's a function that
is relied on nowhere else in PG, so it adds new portability risks of its
own.  It doesn't exist on some platforms that we support (like the one
I'm typing this message on) and there's no real good reason to assume
that it's not broken in its own ways on others.

If you really want to pursue this, I'd suggest parsing the line
manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
be very many more lines than what you've got here.

However, the bigger picture is that OS X's UTF8 locales are broken
through-and-through, and most of their other problems are not feasible
to work around.  So basically you can't use them for anything
interesting, and it's not clear that it's worth putting any time into
solving individual problems.  In the particular case here, the issue
presumably is that sscanf is relying on isspace() ... but we rely on
isspace() directly, in quite a lot of places, so how much is it going
to fix to dodge it right here?
        regards, tom lane


Re: unaccent extension missing some accents

От
J Smith
Дата:
On 2011-11-06, at 7:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> swscanf doesn't seem like an acceptable approach: it's a function that
> is relied on nowhere else in PG, so it adds new portability risks of its
> own.  It doesn't exist on some platforms that we support (like the one
> I'm typing this message on) and there's no real good reason to assume
> that it's not broken in its own ways on others.
>
> If you really want to pursue this, I'd suggest parsing the line
> manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
> be very many more lines than what you've got here.
>
> However, the bigger picture is that OS X's UTF8 locales are broken
> through-and-through, and most of their other problems are not feasible
> to work around.  So basically you can't use them for anything
> interesting, and it's not clear that it's worth putting any time into
> solving individual problems.  In the particular case here, the issue
> presumably is that sscanf is relying on isspace() ... but we rely on
> isspace() directly, in quite a lot of places, so how much is it going
> to fix to dodge it right here?
>
>            regards, tom lane

There are some fixes for isspace and friend that I've seen python
using so perhaps in those cases a similar fix could be applied. For
instance, maybe something like the code around line 674 here:

http://svn.python.org/view/python/trunk/Include/pyport.h?revision=81029&view=markup

Perhaps that would be suitable on OSX at least in the case of isspace
et al. As far as I can tell scanf doesn't seem to use isspace on my
system so that would only be a partial fix for this an whatever other
situations isspace is used in. (on a mobile now so I can't check a the
moment.)

This isn't really a huge deal for me but I'll try to get a chance to
write up a little parser anyways just for kicks.

Cheers


Re: unaccent extension missing some accents

От
J Smith
Дата:
Alright, I wrote up another patch that uses strchr to parse out the
lines of the unaccent.rules file, foregoing sscanf completely.
Hopefully this looks a bit better than using swscanf.

As for the other problems with isspace and such on OSX, it might be
worth looking at the python portability fixes. I played briefly with
the isspace and friends macros they have and they looked okay, but I
certainly can't speak for how well they'd work for the rest of the
PostgreSQL code base.

Cheers.

Вложения

Re: unaccent extension missing some accents

От
Tom Lane
Дата:
J Smith <dark.panda+lists@gmail.com> writes:
> Alright, I wrote up another patch that uses strchr to parse out the
> lines of the unaccent.rules file, foregoing sscanf completely.
> Hopefully this looks a bit better than using swscanf.

I looked at this a bit and realized that sscanf is actually doing a
couple of critical things for us, which are lost in translation when
doing it like this:

1. It ignores whitespace other than the dividing tab.  If we don't
continue to do that, we'll likely break existing config files.

2. It ensures that src and trg each consist of at least one (nonblank)
character.  placeChar() is critically dependent on the assumption that
src is not empty.

However, after looking around a bit at the other tsearch config-file-
reading functions, I noted that they all use t_isspace() to identify
whitespace ... and that function in fact should be okay on OS X,
because it uses iswspace in multibyte encodings.

So it's fairly simple to improve this code to reject whitespace that
way.  I don't like the existing code anyway because of its potential
vulnerability to buffer overrun.  I'll fix it up and commit.

> As for the other problems with isspace and such on OSX, it might be
> worth looking at the python portability fixes.

If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
work), I might be tempted to try to do it that way, but I still fail
to see the point.  After reviewing the code I feel that unaccent needs
to be fixed because it's not consistent with the other tsearch config
file parsers, and not so much because it works or doesn't work on any
specific platform.
        regards, tom lane


Re: unaccent extension missing some accents

От
Florian Pflug
Дата:
On Nov7, 2011, at 17:46 , J Smith wrote:
> On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
>> work), I might be tempted to try to do it that way, but I still fail
>> to see the point.  After reviewing the code I feel that unaccent needs
>> to be fixed because it's not consistent with the other tsearch config
>> file parsers, and not so much because it works or doesn't work on any
>> specific platform.
> 
> Yeah, I never knew there was such a problem with OSX and UTF8 before
> running into it here but it's good to know.

Various issues with OSX and UTF-8 locales seems to come up quite often, yet
we're not really in a position to do anything about them.

Thus, I think we should warn about these issues and save people the trouble
of finding out about this the hard way. The only question is, what would be
a good place for such a warning?

best regards,
Florian Pflug



Re: unaccent extension missing some accents

От
J Smith
Дата:
On Mon, Nov 7, 2011 at 11:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked at this a bit and realized that sscanf is actually doing a
> couple of critical things for us, which are lost in translation when
> doing it like this:
>
> 1. It ignores whitespace other than the dividing tab.  If we don't
> continue to do that, we'll likely break existing config files.
>
> 2. It ensures that src and trg each consist of at least one (nonblank)
> character.  placeChar() is critically dependent on the assumption that
> src is not empty.
>
> However, after looking around a bit at the other tsearch config-file-
> reading functions, I noted that they all use t_isspace() to identify
> whitespace ... and that function in fact should be okay on OS X,
> because it uses iswspace in multibyte encodings.
>
> So it's fairly simple to improve this code to reject whitespace that
> way.  I don't like the existing code anyway because of its potential
> vulnerability to buffer overrun.  I'll fix it up and commit.
>
>> As for the other problems with isspace and such on OSX, it might be
>> worth looking at the python portability fixes.
>
> If OS X's UTF8 locales weren't so thoroughly broken (eg sorting does not
> work), I might be tempted to try to do it that way, but I still fail
> to see the point.  After reviewing the code I feel that unaccent needs
> to be fixed because it's not consistent with the other tsearch config
> file parsers, and not so much because it works or doesn't work on any
> specific platform.
>

Yeah, I never knew there was such a problem with OSX and UTF8 before
running into it here but it's good to know. When I noticed the
unnaccent extension in more recent PostgreSQL versions, I figured it
would perform better than our current plperl-based accent stripping
function (which it surely does) and just noticed the results on my
machine were a little off, but our linux-based servers were fine and
dandy and yadda yadda yadda.

Anyways, lemme know if there's anything else I could help with or
could test and whatnot. Cheers.


Re: unaccent extension missing some accents

От
Tom Lane
Дата:
J Smith <dark.panda+lists@gmail.com> writes:
> Anyways, lemme know if there's anything else I could help with or
> could test and whatnot. Cheers.

If you have time to check that the patch I just committed fixes your
problem, it'd be worth doing.  I did not test it on OS X ...
        regards, tom lane


Re: unaccent extension missing some accents

От
J Smith
Дата:
On Mon, Nov 7, 2011 at 11:59 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> If you have time to check that the patch I just committed fixes your
> problem, it'd be worth doing.  I did not test it on OS X ...

Looks good to me, thanks.

Would it even really be worth it to look into any of the other locale
issues on OSX, given that PostgreSQL is now included in their default
installs starting with 10.7, or would this really be more of a case of
hoping Apple some day fixes the issue upstream? It doesn't seem like
that's going to happen any time soon, mind you, as apparently this is
a rather long-standing issue in their libc, but who knows. I know OSX
isn't exactly the most popular database server OS out there, but it
seems to be becoming more popular for developers these days at the
very least.

Anyways, cheers, and thanks again.


Re: unaccent extension missing some accents

От
Tom Lane
Дата:
J Smith <dark.panda+lists@gmail.com> writes:
> Would it even really be worth it to look into any of the other locale
> issues on OSX, given that PostgreSQL is now included in their default
> installs starting with 10.7, or would this really be more of a case of
> hoping Apple some day fixes the issue upstream?

To my mind, the killer issue is the incorrect sorting --- there's
basically no way we can work around that.  If Apple were to fix that,
we might be able to nibble at the margins of the other stuff.
        regards, tom lane


Re: unaccent extension missing some accents

От
J Smith
Дата:
On Mon, Nov 7, 2011 at 11:53 AM, Florian Pflug <fgp@phlo.org> wrote:
>
> Various issues with OSX and UTF-8 locales seems to come up quite often, yet
> we're not really in a position to do anything about them.
>
> Thus, I think we should warn about these issues and save people the trouble
> of finding out about this the hard way. The only question is, what would be
> a good place for such a warning?
>

Hmm, I suppose one place could be on initdb if initializing with a
UTF-8 locale, although that only really helps with fixed settings like
LC_COLLATE and LC_CTYPE and the defaults for the user-configurable
settings, right? For the configurable settings I guess logging as
warnings on server start up or when they're changed via SET. But then
there's all of the other places, like when using COLLATE and using
locale-aware functions and so on and so forth. It would be a lot to
cover, I'll bet.

I guess maybe initdb, start up and SET warnings and a note in the docs
would hopefully suffice?


Re: unaccent extension missing some accents

От
Bruce Momjian
Дата:
Tom Lane wrote:
> J Smith <dark.panda+lists@gmail.com> writes:
> > I've attached a patch against master for unaccent.c that uses swscanf
> > along with char2wchar and wchar2char instead of sscanf directly to
> > initialize the unaccent extension and it appears to fix the problem in
> > both the master and 9.1 branches.
> 
> swscanf doesn't seem like an acceptable approach: it's a function that
> is relied on nowhere else in PG, so it adds new portability risks of its
> own.  It doesn't exist on some platforms that we support (like the one
> I'm typing this message on) and there's no real good reason to assume
> that it's not broken in its own ways on others.
> 
> If you really want to pursue this, I'd suggest parsing the line
> manually, perhaps via strchr searches for \t and \n.  It likely wouldn't
> be very many more lines than what you've got here.
> 
> However, the bigger picture is that OS X's UTF8 locales are broken
> through-and-through, and most of their other problems are not feasible
> to work around.  So basically you can't use them for anything
> interesting, and it's not clear that it's worth putting any time into
> solving individual problems.  In the particular case here, the issue
> presumably is that sscanf is relying on isspace() ... but we rely on
> isspace() directly, in quite a lot of places, so how much is it going
> to fix to dodge it right here?

If Apple's low-level code came from FreeBSD and NetBSD, how did they get
so broken?

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + It's impossible for everything to be true. +


Re: unaccent extension missing some accents

От
Tom Lane
Дата:
Bruce Momjian <bruce@momjian.us> writes:
> Tom Lane wrote:
>> However, the bigger picture is that OS X's UTF8 locales are broken
>> through-and-through, and most of their other problems are not feasible
>> to work around.

> If Apple's low-level code came from FreeBSD and NetBSD, how did they get
> so broken?

AFAIK, they're broken in the BSDen too, or at least were when Apple
branched off from whichever BSD they started from (which was years ago).
There may be a better solution available upstream by now, but it doesn't
appear to me that Apple has any interest in fixing this area.
        regards, tom lane