Обсуждение: BUG #15805: Problem with lower function for greek sigma (Σ) letter
The following bug has been logged on the website: Bug reference: 15805 Logged by: Sergey kuznetsov Email address: iksss.88@gmail.com PostgreSQL version: 9.6.10 Operating system: linux Description: I see unexpected behaviour of lower function for greek sigma letter Σ. According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter is in final word position, it should be ς in lowercase, and not σ. But PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this lower string from another part of a system (java code, for example) Postgre will not return this row cause it differs from java-generated one. Thanks, Sergey Kuznetsov
> On 15 May 2019, at 09:57, PG Bug reporting form <noreply@postgresql.org> wrote: > > The following bug has been logged on the website: > > Bug reference: 15805 > Logged by: Sergey kuznetsov > Email address: iksss.88@gmail.com > PostgreSQL version: 9.6.10 > Operating system: linux > Description: > > I see unexpected behaviour of lower function for greek sigma letter Σ. > According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter > is in final word position, it should be ς in lowercase, and not σ. But > PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = > "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this > lower string from another part of a system (java code, for example) Postgre > will not return this row cause it differs from java-generated one. This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and googling there is only a single case of word-final lowercasing which is this sigma. The attached patch takes a stab at fixing this. cheers ./daniel
Вложения
On 2019-May-15, Daniel Gustafsson wrote: > > I see unexpected behaviour of lower function for greek sigma letter Σ. > > According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter > > is in final word position, it should be ς in lowercase, and not σ. But > > PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = > > "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this > > lower string from another part of a system (java code, for example) Postgre > > will not return this row cause it differs from java-generated one. > > This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and > googling there is only a single case of word-final lowercasing which is this > sigma. The attached patch takes a stab at fixing this. Ummm ... isn't this a counterexample? https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2019-May-15, Daniel Gustafsson wrote: >> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and >> googling there is only a single case of word-final lowercasing which is this >> sigma. The attached patch takes a stab at fixing this. > Ummm ... isn't this a counterexample? > https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html I do not think the patch as given is acceptable in any case: 1. assumes without any evidence whatsoever that the system's wide-character representation is Unicode code points; 2. assumes without checking that the locale is one that would allow this conversion (counterexample: C locale); 3. unreasonable hard-coded assumption about what the "not a word character" condition is. It's possible that 1 and 2 could be finessed by checking both that the original character is Σ and the new one is σ (in Unicode). We'd still theoretically be taking a risk of the wrong substitution if the wchar representation is not Unicode, but the odds seem fairly small. As for point 3, why aren't you using iswalpha() on the next character? regards, tom lane
PG Bug reporting form wrote: > lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = > "δημοτεσ φαιστου" instead of "δημοτες φαιστου" With PostgreSQL version 10 or newer, you could use an ICU locale. lower() would produce the expected result: psql (11.3 (Debian 11.3-1.pgdg90+1)) => select lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ' collate "und-x-icu"); lower ----------------- δημοτες φαιστου (1 row) This case looks comparable to the case of the german ß (sharp s), which should be upcased into 'SS', but the locales backed by libc don't do that: => select upper(''Ich muß'); upper --------- ICH MUß For that exemple as well, an ICU locale produces a correct result with regard to linguistic rules: => select upper('Ich muß' collate "und-x-icu"); upper ---------- ICH MUSS The libc library provides an API with character-by-character case conversions (tolower/toupper), which is too limited to deal with the above cases, and PostgreSQL basically just uses this API. Best regards, -- Daniel Vérité PostgreSQL-powered mailer: http://www.manitou-mail.org Twitter: @DanielVerite
"Daniel Verite" <daniel@manitou-mail.org> writes: > PG Bug reporting form wrote: >> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = >> "δημοτεσ φαιστου" instead of "δημοτες φαιστου" > With PostgreSQL version 10 or newer, you could use an ICU > locale. lower() would produce the expected result: Oh, if using ICU already fixes this, I think we might as well just say that you have to use ICU if you want the right behavior for such cases. > The libc library provides an API with character-by-character > case conversions (tolower/toupper), which is too limited > to deal with the above cases, and PostgreSQL basically > just uses this API. Right. regards, tom lane
> On 15 May 2019, at 14:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > On 2019-May-15, Daniel Gustafsson wrote: > >>> I see unexpected behaviour of lower function for greek sigma letter Σ. >>> According to wikipedia (https://en.wikipedia.org/wiki/Sigma) if this letter >>> is in final word position, it should be ς in lowercase, and not σ. But >>> PotgreSQL lower function returns σ, for example lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = >>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου". So if I try to pass this >>> lower string from another part of a system (java code, for example) Postgre >>> will not return this row cause it differs from java-generated one. >> >> This is indeed a bug, and a rare occurrence since AFAICT from ISO 30112 and >> googling there is only a single case of word-final lowercasing which is this >> sigma. The attached patch takes a stab at fixing this. > > Ummm ... isn't this a counterexample? > https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html Hebrew doesn’t have case, so it doesn’t apply in this case. cheers ./daniel
On 2019-May-15, Daniel Gustafsson wrote: > > On 15 May 2019, at 14:20, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Ummm ... isn't this a counterexample? > > https://hebrew4christians.com/Grammar/Unit_One/Final_Forms/final_forms.html > > Hebrew doesn’t have case, so it doesn’t apply in this case. Yeah, I realized that afterwards. Sorry for the noise. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 15 May 2019, at 17:46, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > "Daniel Verite" <daniel@manitou-mail.org> writes: >> PG Bug reporting form wrote: >>> lower('ΔΗΜΟΤΕΣ ΦΑΙΣΤΟΥ') = >>> "δημοτεσ φαιστου" instead of "δημοτες φαιστου" > >> With PostgreSQL version 10 or newer, you could use an ICU >> locale. lower() would produce the expected result: > > Oh, if using ICU already fixes this, I think we might as well just > say that you have to use ICU if you want the right behavior for such > cases. Seems reasonable. Maybe it warrants a mention in the docs on the string function page since it may suprise users? cheers ./daniel