Discussion: Windows default locale vs initdb
Hi,

Moving this topic into its own thread from the one about collation versions, because it concerns pre-existing problems, and that thread is long.

Currently initdb sets up template databases with old-style Windows locale names reported by the OS, and they seem to have caused us quite a few problems over the years:

db29620d "Work around Windows locale name with non-ASCII character."
aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

... and probably more, and also various threads about, for example, "German_German.1252" vs "German_Switzerland.1252" which seem to get confused or badly canonicalised or rejected somewhere in the mix.

I hadn't focused on any of that before, being a non-Windows-user, but the entire contents of win32setlocale.c supports the theory that Windows' manual meant what it said when it said[1]:

"We do not recommend this form for locale strings embedded in code or serialized to storage, because these strings are more likely to be changed by an operating system update than the locale name form."

I suppose that was the only form available at the time the code was written, so there was no choice. The question we asked ourselves multiple times in the other thread was how we're supposed to get to the modern BCP 47 form when creating the template databases. It looks like one possibility, since Vista, is to call GetUserDefaultLocaleName()[2], which doesn't appear to have been discussed before on this list. That doesn't allow you to ask for the default for each individual category, but I don't know if that is even a concept for Windows user settings. It may be that some of the other nearby functions give a better answer for some reason. But one thing is clear from a test that someone kindly ran for me: it reports standardised strings like "en-NZ", not strings like "English_New Zealand.1252".

No patch, but I wondered if any Windows hackers have any feedback on relative sanity of trying to fix all these problems this way.

[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
[2] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename
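For readers following along, here is a minimal sketch of the idea in the message above. It is not taken from any patch in this thread. The helper names `looks_like_bcp47()` and `get_default_locale_name()` are hypothetical; only `GetUserDefaultLocaleName()`, `WideCharToMultiByte()` and `LOCALE_NAME_MAX_LENGTH` are real Windows API names. The check is deliberately loose (ASCII letters, digits and hyphens only) and is not a full BCP 47 validator.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
#endif

/*
 * Hypothetical helper: a rough test for a BCP 47 style name such as
 * "en-NZ".  Old-style names like "English_New Zealand.1252" contain
 * '_', ' ' or '.', and some contain non-ASCII bytes, so they fail.
 */
static bool
looks_like_bcp47(const char *name)
{
    if (name == NULL || *name == '\0')
        return false;
    for (const char *p = name; *p != '\0'; p++)
    {
        unsigned char c = (unsigned char) *p;

        if (c > 127 || (!isalnum(c) && c != '-'))
            return false;
    }
    return true;
}

#ifdef _WIN32
/*
 * Hypothetical helper: ask Windows for the user's default locale name
 * and convert it from UTF-16 to UTF-8.  Returns false on failure.
 */
static bool
get_default_locale_name(char *buf, size_t buflen)
{
    wchar_t     wname[LOCALE_NAME_MAX_LENGTH];

    if (GetUserDefaultLocaleName(wname, LOCALE_NAME_MAX_LENGTH) == 0)
        return false;
    return WideCharToMultiByte(CP_UTF8, 0, wname, -1,
                               buf, (int) buflen, NULL, NULL) != 0;
}
#endif
```

The Windows-only part is guarded so that the classification helper can be compiled and tried on any platform.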
On Mon, Apr 19, 2021 at 7:43, Thomas Munro <thomas.munro@gmail.com> wrote:
> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.

Last weekend I talked with a user about an interesting (and messy) issue. They needed to create a new database with a Czech collation on Azure SAS. There was no entry in pg_collation for the Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8', and that worked.

Regards
Pavel
On Mon, Apr 19, 2021 at 4:53 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Last weekend I talked with one user about one interesting (and messing) issue. They needed to create a new database with Czech collation on Azure SAS. There was not any entry in pg_collation for Czech language. The reply from Microsoft support was to use CREATE DATABASE xxx TEMPLATE 'template0' ENCODING 'utf8' LOCALE 'cs_CZ.UTF8' and it was working.

My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

cheers
andrew
On Mon, Apr 19, 2021 at 12:52, Andrew Dunstan <andrew@dunslane.net> wrote:
> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.

I had different information, but something was still wrong, because there were no Czech locales in pg_collation.
On Mon, Apr 19, 2021 at 11:52 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> My understanding from Microsoft staff at conferences is that Azure's PostgreSQL SAS runs on Linux, not Windows.
This is from a regular Azure Database for PostgreSQL single server:
postgres=> select version();
version
------------------------------------------------------------
PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
(1 row)
And this is from the new Flexible Server preview:
postgres=> select version();
version
-----------------------------------------------------------------------------------------------------------------
PostgreSQL 12.6 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)
So I guess it's a case of "it depends".
On 4/19/21 10:26 AM, Dave Page wrote:
> This is from a regular Azure Database for PostgreSQL single server:
> [...]
> And this is from the new Flexible Server preview:
> [...]
> So I guess it's a case of "it depends".

Good to know. A year or two back at more than one conference I tried to enlist some of these folks in helping us with Windows PostgreSQL and their reply was that they knew nothing about it because they were on Linux :-) I guess things change over time.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
On 19.04.21 07:42, Thomas Munro wrote:
> It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2], which doesn't appear to have been
> discussed before on this list. That doesn't allow you to ask for the
> default for each individual category, but I don't know if that is even
> a concept for Windows user settings.

pg_newlocale_from_collation() doesn't support collcollate != collctype on Windows anyway, so that wouldn't be an issue.
On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
> Currently initdb sets up template databases with old-style Windows
> locale names reported by the OS, and they seem to have caused us quite
> a few problems over the years:
>
> db29620d "Work around Windows locale name with non-ASCII character."
> aa1d2fc5 "Another attempt at fixing Windows Norwegian locale."
> db477b69 "Deal with yet another issue related to "Norwegian (Bokmål)"..."
> 9f12a3b9 "Tolerate version lookup failure for old style Windows locale..."

> I suppose that was the only form available at the time the code was
> written, so there was no choice.

Right.

> The question we asked ourselves
> multiple times in the other thread was how we're supposed to get to
> the modern BCP 47 form when creating the template databases. It looks
> like one possibility, since Vista, is to call
> GetUserDefaultLocaleName()[2]

> No patch, but I wondered if any Windows hackers have any feedback on
> relative sanity of trying to fix all these problems this way.

Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server 2003 R2, this is a good time to let that support end.
On Sun, May 16, 2021 at 6:29 AM Noah Misch <noah@leadboat.com> wrote:
> On Mon, Apr 19, 2021 at 05:42:51PM +1200, Thomas Munro wrote:
>> The question we asked ourselves
>> multiple times in the other thread was how we're supposed to get to
>> the modern BCP 47 form when creating the template databases. It looks
>> like one possibility, since Vista, is to call
>> GetUserDefaultLocaleName()[2]
>
>> No patch, but I wondered if any Windows hackers have any feedback on
>> relative sanity of trying to fix all these problems this way.
>
> Sounds reasonable. If PostgreSQL v15 would otherwise run on Windows Server
> 2003 R2, this is a good time to let that support end.

The value returned by GetUserDefaultLocaleName() is a system-configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb, but not for a backend in most cases.

You can get the POSIX-ish locale name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1], although it would still work for legacy locales. Please find attached a POC patch showing this approach.
Regards,
Juan José Santamaría Flecha
On Wed, Dec 15, 2021 at 11:32 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> The value returned by GetUserDefaultLocaleName() is a system configured parameter, independent of what you set with setlocale(). It might be reasonable for initdb but not for a backend in most cases.

Agreed. Only for initdb, and only if you didn't specify a locale name on the command line.

> You can get the locale POSIX-ish name using GetLocaleInfoEx(), but this is no longer recommended, because using LCIDs is no longer recommended [1]. Although, this would work for legacy locales. Please find attached a POC patch showing this approach.

Now that museum-grade Windows has been defenestrated, we are free to call GetUserDefaultLocaleName(). Here's a patch.

One thing you did in your patch that I disagree with, I think, was to convert a BCP 47 name to a POSIX name early, that is, s/-/_/. I think we should use the locale name exactly as Windows (really, under the covers, ICU) spells it. There is only one place in the tree today that really wants a POSIX locale name, and that's LC_MESSAGES, accessed by GNU gettext, not Windows. We already had code to cope with that.

I think we should also convert to POSIX format when making the collname in your pg_import_system_collations() proposal, so that COLLATE "en_US" works (= a SQL identifier), but that's another thread[1]. I don't think we should do it in collcollate or datcollate, which is a string for the OS to interpret.

With my garbage collector hat on, I would like to rip out all of the support for traditional locale names, eventually. Deleting kludgy code is easy and fun -- 0002 is a first swing at that -- but there remains an important unanswered question. How should someone pg_upgrade a "English_Canada.1252" cluster if we now reject that name? We'd need to do a conversion to "en-CA", or somehow tell the user to. Hmmmm.

[1] https://www.postgresql.org/message-id/flat/CAC%2BAXB0WFjJGL1n33bRv8wsnV-3PZD0A7kkjJ2KjPH0dOWqQdg%40mail.gmail.com
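To make the s/-/_/ conversion discussed above concrete, here is a small illustrative sketch. The function name `bcp47_to_posix_name()` is hypothetical and this is not code from either patch; it only shows the character substitution that would turn "en-US" into the identifier-friendly "en_US" for a collname, while the original BCP 47 string would be kept for collcollate/datcollate.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy a BCP 47 style name into "out", replacing
 * each '-' with '_'.  The result is always NUL-terminated and is
 * truncated if it does not fit in "outlen" bytes.
 */
static void
bcp47_to_posix_name(const char *bcp47, char *out, size_t outlen)
{
    size_t      i;

    if (outlen == 0)
        return;
    for (i = 0; bcp47[i] != '\0' && i + 1 < outlen; i++)
        out[i] = (bcp47[i] == '-') ? '_' : bcp47[i];
    out[i] = '\0';
}
```

Names with no hyphen (for example "en") pass through unchanged.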
On Tue, Jul 19, 2022 at 10:58 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a patch.

I added this to the next commitfest, and cfbot promptly told me about some warnings I needed to fix. That'll teach me to post a patch tested with "ci-os-only: windows".

Looking more closely at some error messages that report GetLastError() where I'd mixed up %d and %lu, I see also that I didn't quite follow existing conventions for wording when reporting Windows error numbers, so I fixed that too.

In the "startcreate" step on CI you can see that it says:

The database cluster will be initialized with locale "en-US".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

As for whether "accordingly" still applies, by the logic of win32_langinfo()... Windows still considers WIN1252 to be the default ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not sure what to make of that. The goal here was to give Windows users good defaults, but WIN1252 is probably not what most people actually want. Hmph.
On Tue, Jul 19, 2022 at 12:59 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Now that museum-grade Windows has been defenestrated, we are free to
> call GetUserDefaultLocaleName(). Here's a patch.

This LGTM.

> I think we should also convert to POSIX format when making the
> collname in your pg_import_system_collations() proposal, so that
> COLLATE "en_US" works (= a SQL identifier), but that's another
> thread[1]. I don't think we should do it in collcollate or
> datcollate, which is a string for the OS to interpret.

That thread has been split [1], but that is how the current version behaves.

> With my garbage collector hat on, I would like to rip out all of the
> support for traditional locale names, eventually. Deleting kludgy
> code is easy and fun -- 0002 is a first swing at that -- but there
> remains an important unanswered question. How should someone
> pg_upgrade a "English_Canada.1252" cluster if we now reject that name?
> We'd need to do a conversion to "en-CA", or somehow tell the user to.
> Hmmmm.

Is there a safe way to do that in pg_upgrade, or would we be forcing users to pg_dump into the new cluster?
[1] https://www.postgresql.org/message-id/flat/0050ec23-34d9-2765-9015-98c04f0e18ac%40postgrespro.ru
Regards,
Juan José Santamaría Flecha
On Tue, Jul 19, 2022 at 4:47 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> As for whether "accordingly" still applies, by the logic of
> win32_langinfo()... Windows still considers WIN1252 to be the default
> ANSI code page for "en-US", though it'd work with UTF-8 too. I'm not
> sure what to make of that. The goal here was to give Windows users
> good defaults, but WIN1252 is probably not what most people actually
> want. Hmph.
Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
[1] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170
Regards,
Juan José Santamaría Flecha
On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.

I'm still confused about what that means. Suppose we decided to insist by adding a ".UTF-8" suffix to the name, as that page says we can now that we're on Windows 10+, when building the default locale name (see experimental 0002 patch, attached). It initially seemed to have the right effect:

The database cluster will be initialized with locale "en-US.UTF-8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:

SELECT 'i'::citext = 'İ'::citext AS t;
 t
---
- t
+ f
(1 row)

About the pg_upgrade problem, maybe it's OK ... existing old format names should continue to work, but we can still remove the weird code that does locale name tweaking, right? pg_upgraded databases should contain fixed names (ie that were fixed by old initdb so should continue to work), and new clusters will get BCP 47 names.

I don't really know, I was just playing with rough ideas by sending patches to CI here...

[1] https://cirrus-ci.com/task/6423238052937728
On Wed, Jul 20, 2022 at 1:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Jul 20, 2022 at 10:27 PM Juan José Santamaría Flecha
> <juanjo.santamaria@gmail.com> wrote:
>> Still, WIN1252 is not the wrong answer for what we are asking. Even if you enable UTF-8 support [1], the system will use the current default Windows ANSI code page (ACP) for the locale and UTF-8 for the code page.
>
> I'm still confused about what that means. Suppose we decided to
> insist by adding a ".UTF-8" suffix to the name, as that page says we
> can now that we're on Windows 10+, when building the default locale
> name (see experimental 0002 patch, attached). It initially seemed to
> have the right effect:
>
> The database cluster will be initialized with locale "en-US.UTF-8".
> The default database encoding has accordingly been set to "UTF8".
> The default text search configuration will be set to "english".
Let me try to explain this using the "Beta: Use Unicode UTF-8 for worldwide language support" option [1].
- Currently, on a system with the language settings of "English_United States" and that option disabled, executing initdb gives you:

The database cluster will be initialized with locale "English_United States.1252".
The default database encoding has accordingly been set to "WIN1252".
The default text search configuration will be set to "english".

And as a test for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
ERROR: character with byte sequence 0xc5 0x9f in encoding "UTF8" has no equivalent in encoding "WIN1252"

We get this error even if the database encoding is UTF8, and it is caused by the tr_tr locales being encoded in WIN1254. We can discuss this in another thread, and I can propose a patch.

- If we enable the UTF-8 support option, then the same test goes as:

The database cluster will be initialized with locale "English_United States.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

And for psql:

SET lc_time='tr_tr.utf8';
SET
SELECT to_char('2000-2-01'::date, 'tmmonth');
 to_char
---------
 şubat
(1 row)

In this case the Windows locales are actually UTF8 encoded.

TL;DR: What I want to show through this example is that the Windows ACP is not modified by setlocale(); it can only be changed through the Windows registry, and only in recent releases.
> But then the Turkish i test in contrib/citext/sql/citext_utf8.sql failed[1]:
>
> SELECT 'i'::citext = 'İ'::citext AS t;
>  t
> ---
> - t
> + f
> (1 row)
This is the current state of affairs:

- Windows:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | İ

- Linux:

SELECT U&'\0131' latin_small_dotless,U&'\0069' latin_small
,U&'\0049' latin_capital, lower(U&'\0049')
,U&'\0130' latin_capital_dotted, lower(U&'\0130');
 latin_small_dotless | latin_small | latin_capital | lower | latin_capital_dotted | lower
---------------------+-------------+---------------+-------+----------------------+-------
 ı                   | i           | I             | i     | İ                    | i

The lower() value of latin_capital_dotted (U+0130) is not the same on the two platforms: Windows returns it unchanged, while Linux returns "i".
Regards,
Juan José Santamaría Flecha
On Fri, Jul 22, 2022 at 11:59 PM Juan José Santamaría Flecha <juanjo.santamaria@gmail.com> wrote:
> TL;DR; What I want to show through this example is that Windows ACP is not modified by setlocale(), it can only be done through the Windows registry and only in recent releases.

Thanks, that was helpful, and so was that SO link.

So it sounds like I should forget about the v3-0002 patch, but the v3-0001 and v3-0003 patches might have a future. And it sounds like we might need to investigate maybe defending ourselves against the ACP being different than what we expect (ie not matching the database encoding)? Did I understand correctly that you're looking into that?
On Fri, Jul 29, 2022 at 3:33 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> So it sounds like I should forget about the v3-0002 patch, but the
> v3-0001 and v3-0003 patches might have a future. And it sounds like
> we might need to investigate maybe defending ourselves against the ACP
> being different than what we expect (ie not matching the database
> encoding)? Did I understand correctly that you're looking into that?

I'm going to withdraw this entry. The sooner we get something like 0001 into a release, the sooner the world will be rid of PostgreSQL clusters initialised with the bad old locale names that the manual very clearly tells you not to use for databases... but I don't understand this ACP/registry vs database encoding stuff and how it relates to the use of BCP47 locale names, which puts me off changing anything until we do.
Another country has changed its name, and a Windows OS update has again broken every PostgreSQL cluster in that whole country[1] (or at least those that had accepted initdb's default choice of locale, probably most). Let's get to the bottom of this, because otherwise it is simply going to keep happening, causing administrative pain for a lot of people.

Here is a rebase of the basic patch I proposed last time, and a re-statement of what we know:

1. initdb chooses a default locale using a technique that gives you an unstable ("Czech Republic"->"Czechia", "Turkey"->"Türkiye"), non-ASCII ("Norwegian (Bokmål)") string that we are warned we should not store anywhere. We store it, and then later it is not recognised. Instead we should select an IETF BCP 47 locale name, based on stable ISO country and language codes, like "en-US", "tr-TR" etc. Here is the patch to teach initdb to use that, unchanged from v3 except that I tweaked the docs a bit.

2. In Windows 10+ it is now also possible to put ".UTF-8" on the end of locale names. I couldn't figure out whether we should do that, and what effect it has on ctypes -- apparently not the effect I expected (see upthread). Was our UTF-8 support on Windows already broken, and this new ".UTF-8" thing is just a new way to reach that brokenness? Is it OK to continue to choose the "legacy" single byte encodings by default on that OS, and consider that a separate topic for separate research?

3. It is not clear to me how we should deal with pg_upgrade. Eventually we want all of the old-school names to fade away, and pg_upgrade would need to be part of that. Perhaps there is some API that can be used to translate to the new canonical forms without us having to maintain translation tables and other messiness in our tree.

4. Eventually we should probably ban non-ASCII characters from entering the relevant catalogues (they are shared, so their encoding is undefined except that they must be a superset of ASCII), and delete all the old win32setlocale.c kludges, after we reach a point where everyone should be using exclusively BCP 47.

[1] https://www.postgresql.org/message-id/flat/18196-b10f93dfbde3d7db%40postgresql.org
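As an illustration of point 4 above, a check that rejects non-ASCII locale names could look roughly like the sketch below. The function name `locale_name_is_ascii()` is hypothetical and nothing here is taken from the patches in this thread; it only shows the byte-level test being described.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: return true only if every byte of "name" is
 * 7-bit ASCII.  A name such as "Norwegian (Bokmål)_Norway.1252" fails
 * in any encoding, because 'å' needs a byte above 127.
 */
static bool
locale_name_is_ascii(const char *name)
{
    for (; *name != '\0'; name++)
    {
        if ((unsigned char) *name > 127)
            return false;
    }
    return true;
}
```

BCP 47 names such as "tr-TR" pass this test by construction, which is part of the appeal of storing only that form.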
I clicked "Trigger" to get a Mingw test run of this, and it failed[1]. I see why: our function win32_langinfo() believes that it shouldn't call GetLocaleInfoEx() on non-MSVC compilers, so we see 'initdb: error: could not find suitable encoding for locale "en-US"'. I think it has fallback code that parses the ".1252" or whatever on the end of the name, but "en-US" hasn't got one.

I don't know the first thing about Mingw but it looks like a declaration for that function arrived 6 years ago[2], and deleting the "#if defined(_MSC_VER)" fixes the problem and the tests pass[3]. As far as I know, we don't support any Mingw but the very latest: it's not a target with real users who have version requirements, it's just a developer [in]convenience, so if it passes on CI and whatever MSYS version "fairywren" runs in the build farm right now, that should be enough.

I could just do that in this patch, but I suppose that also means that someone needs to go through pg_locale.c and other places that test _MSC_VER not because they actually care about the compiler but because they want to detect some crusty old Mingw version, and see what else can be deleted as a result, possibly including a lot of fallback code. It feels like a separate cleanup for a separate patch.

[1] https://cirrus-ci.com/task/5301814774464512
[2] https://github.com/mirror/mingw-w64/blame/eff726c461e09f35eeaed125a3570fa5f807f02b/mingw-w64-tools/widl/include/winnls.h#L931
[3] https://cirrus-ci.com/task/6558569718349824
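The failure mode described above can be illustrated with a small sketch of a suffix-parsing fallback. This is not PostgreSQL's actual win32_langinfo() code; the function name `codepage_from_locale_name()` is hypothetical, and the sketch only assumes that the fallback looks for a numeric code page after the last '.' in the name.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the numeric code page found after the
 * last '.' in a locale name, or -1 if there is no purely numeric
 * suffix.  "English_United States.1252" yields 1252, while a BCP 47
 * name such as "en-US" has no suffix and yields -1.
 */
static int
codepage_from_locale_name(const char *name)
{
    const char *dot = strrchr(name, '.');
    int         codepage = 0;

    if (dot == NULL || dot[1] == '\0')
        return -1;
    for (const char *p = dot + 1; *p != '\0'; p++)
    {
        if (!isdigit((unsigned char) *p))
            return -1;
        codepage = codepage * 10 + (*p - '0');
        if (codepage > 65535)
            return -1;          /* not a plausible code page number */
    }
    return codepage;
}
```

With a fallback of this shape, "en-US" gives no answer at all, which matches the "could not find suitable encoding" error quoted above.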
Here is a thought that occurs to me, as I follow along with Jeff Davis's evolving proposals for built-in collations and ctypes: What would stop us from dropping support for the libc (sic) provider on Windows? That may sound radical and likely to cause extra work for people on upgrade, but how does that compare to the pain of keeping this barely maintained code in the tree? Suppose the idea in this thread goes ahead and we get people to transition to the modern locale names: there is non-zero transitional/upgrade pain there too. How delicious it would be to just nuke the whole thing from orbit, and keep only cross-platform code that is maintained with enthusiasm by active hackers. That's probably a little extreme, but it's the direction my thoughts start to go in when confronting the realisation that it's up to us [Unix hackers making drive-by changes], no one is coming to help us [from the Windows user community]. I've even heard others talk about dropping Windows completely, due to the maintenance imbalance. This would be somewhat more fine grained. (One could use a similar argument to drop non-NTFS filesystems and turn on POSIX-mode file links, to end that other locus of struggle.)
Ertan Küçükoglu offered to try to review and test this, so here's a rebase.

Some notes:

* it turned out that the Turkish i/I test problem I mentioned earlier in this thread[1] was just always broken on Windows, we just didn't ever test with UTF-8 before Meson took over; it's skipped now, see commit cff4e5a3[2]

* it seems that you can't actually put encodings like .1252 on the end (.UTF-8 must be a special case); I don't know if we should look into a better UTF-8 mode for modern Windows, but that'd be a separate project

* this patch only benefits people who run initdb.exe without explicitly specifying a locale; probably a good number of real systems in the wild actually use EDB's graphical installer which initialises a cluster and has its own way of choosing the locale, as discussed in Ertan's thread[3]

[1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
[2] https://github.com/postgres/postgres/commit/cff4e5a3
[3] https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com
Attachments
Hi,
I am a complete noob about PostgreSQL development.
I don't know about the PostgreSQL CI system.
I will be needing some help as to how to do the tests.
I have access to different Windows OSes (v10, Server 2022 mainly).
These systems can be set to English or Turkish locales if needed.
I can also add new Windows versions if needed.
I do not know how to use patch files. I am also not sure what tests I should do.
Do I need to set up a Windows build system for PostgreSQL CI?
Will I download some files (EXE, etc) ready for testing? Copy them over an existing installation for testing?
Thanks for your help.
Regards,
Ertan
Hello Thomas,
Can you please list some of the use cases for the patch? Other than Turkish, does this patch have an impact on other locales too?
Regards,
Zaid
On Mon, Jul 22, 2024 at 8:38 PM Zaid Shabbir <zaidshabbir@gmail.com> wrote:
> Can you please list down some of the use cases for the patch ? Other than Turkish, does this patch have an impact on other locales too ?

Hi Zaid,

Yes, initdb.exe would use BCP47 codes by default for all languages. Who knows which country will change its name next? From a quick search of other recent cases: Czech Republic -> Czechia, Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others that we have older records of in the mailing list that seemed to change in some minor technical way: Macau, Hong Kong, Norwegian etc.

The Windows manual says:

"We do not recommend this form for locale strings embedded in code or serialized to storage, because these strings are more likely to be changed by an operating system update than the locale name form."

It's pretty bad for our users when it happens and the Windows locale name changes: a database cluster that suddenly can't start, and even after you've figured out why and adjusted the references in postgresql.conf, you still can't connect.

There is also the problem that some of the old full names have non-ASCII characters (Türkiye, São Tomé and Príncipe, Curaçao, Côte d'Ivoire, Åland), which is bad at least in theory because we use the string at times and in places where it is not clear what encoding the name itself has.

I don't use Windows myself, I've just been watching this train wreck replaying in a loop for long enough. Clearly it's going to take some time to wean the user community off the unstable names, and it struck me that the default is probably the main source of them in new clusters, hence this patch.
On Mon, Jul 22, 2024 at 8:04 PM Ertan Küçükoglu <ertan.kucukoglu@gmail.com> wrote:
> I am a complete noob about PostgreSQL development.
> I don't know about the PostgreSQL CI system.
> I will be needing some help as to how to do the tests.
> I have access to different Windows OSes (v10, Server 2022 mainly).
> These systems can be set to English or Turkish locales if needed.
> I can also add new Windows versions if needed.
> I do not know how to use patch files. I am also not sure what tests I should do.
> Do I need to set up a Windows build system for PostgreSQL CI?
> Will I download some files (EXE, etc) ready for testing? Copy them over an existing installation for testing?

Sorry, I didn't mean to put you on the spot :-) Yeah you'd need to install a compiler, various libraries and tools to be able to build from source with a patch. Unfortunately I'm not the best person to explain how to do that on Windows as I don't use it. Honestly it might be a bit too much new stuff to figure out at once just to test this small patch.

What I'd be hoping for is confirmation that there are no weird unintended consequences or problems I'm not seeing since I'm writing blind patches based on documentation only, but it's probably too much to ask to figure out the whole development environment and then go on an open ended expedition looking for unknown problems.
Thomas Munro <thomas.munro@gmail.com> wrote on Mon, 22 Jul 2024 at 14:00:
> Sorry, I didn't mean to put you on the spot :-) Yeah you'd need to
> install a compiler, various libraries and tools to be able to build
> from source with a patch. Unfortunately I'm not the best person to
> explain how to do that on Windows as I don't use it. Honestly it
> might be a bit too much new stuff to figure out at once just to test
> this small patch. What I'd be hoping for is confirmation that there
> are no weird unintended consequences or problems I'm not seeing since
> I'm writing blind patches based on documentation only, but it's
> probably too much to ask to figure out the whole development
> environment and then go on an open ended expedition looking for
> unknown problems.
I already installed Visual Studio 2022 with C++ support as suggested in https://www.postgresql.org/docs/current/install-windows-full.html
I cloned codes in the system.
But, I cannot find any "src/tools/msvc" directory. It is missing.
Document states I need everything in there
"The tools for building using Visual C++ or Platform SDK are in the src\tools\msvc directory."
It seems I will need help setting up the build environment.
On 2024-07-21 Su 10:51 PM, Thomas Munro wrote:
> Ertan Küçükoglu offered to try to review and test this, so here's a rebase.
>
> Some notes:
>
> * it turned out that the Turkish i/I test problem I mentioned earlier
> in this thread[1] was just always broken on Windows, we just didn't
> ever test with UTF-8 before Meson took over; it's skipped now, see
> commit cff4e5a3[2]
>
> * it seems that you can't actually put encodings like .1252 on the end
> (.UTF-8 must be a special case); I don't know if we should look into a
> better UTF-8 mode for modern Windows, but that'd be a separate project
>
> * this patch only benefits people who run initdb.exe without
> explicitly specifying a locale; probably a good number of real systems
> in the wild actually use EDB's graphical installer which initialises a
> cluster and has its own way of choosing the locale, as discussed in
> Ertan's thread[3]
>
> [1] https://www.postgresql.org/message-id/flat/CA%2BhUKGJZskvCh%3DQm75UkHrY6c1QZUuC92Po9rponj1BbLmcMEA%40mail.gmail.com#3a00c08214a4285d2f3c4297b0ac2be2
> [2] https://github.com/postgres/postgres/commit/cff4e5a3
> [3] https://www.postgresql.org/message-id/flat/CAH2i4ydECHZPxEBB7gtRG3vROv7a0d3tqAFXzcJWQ9hRsc1znQ%40mail.gmail.com

I have an environment I can use for testing. But what exactly am I testing? :-) Install a few "problem" language/region settings, switch the system and ensure initdb runs ok?

Other than Turkish, which locales should I install?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> wrote on Mon, 22 Jul 2024 at 16:44:
> I have an environment I can use for testing. But what exactly am I
> testing? :-) Install a few "problem" language/region settings, switch
> the system and ensure initdb runs ok?
> Other than Turkish, which locales should I install?
Thomas earlier listed a few:
"From a quick search of other recent cases: Czech Republic -> Czechia, Swaziland -> Eswatini, Cape Verde -> Cabo Verde, and more, plus others
that we have older records of in the mailing list that seemed to
change in some minor technical way: Macau, Hong Kong, Norwegian etc."
I am not sure whether all of them need testing, though.
Thanks & Regards,
Ertan
On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I have an environment I can use for testing. But what exactly am I
> testing? :-) Install a few "problem" language/region settings, switch
> the system and ensure initdb runs ok?

I just want to know about any weird unexpected consequences of using BCP47 locale names, before we change the default in v18. The only concrete thing I found so far was that MinGW didn't like it, but I provided a fix for that.

It'd still be possible to initialise a new cluster with the old style names if you really want to, but you'd have to pass it in explicitly; I was wondering if that could be necessary in some pg_upgrade scenario but I guess not, it just clobbers template0's pg_database row with values from the source database, and recreates everything else so I think it should be fine (?).

I am a little uneasy about the new names not having .encoding but there doesn't seem to be an issue with that (such locales exist on Unix too), and the OS still knows which encoding they use in that case.
On Tue, Jul 23, 2024 at 11:19 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Tue, Jul 23, 2024 at 1:44 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> > I have an environment I can use for testing. But what exactly am I
> > testing? :-) Install a few "problem" language/region settings, switch
> > the system and ensure initdb runs ok?

I thought a bit more about what to do with the messy .UTF-8 situation on Windows, and I think I might see a way forward that harmonises the code and behaviour with Unix, and deletes a lot of special case code. But it's only theories + CI so far.

0001, 0002: As before, teach initdb.exe to choose eg "en-US" by default.

0003: Force people to choose locales that match the database encoding, as we do on Unix. That is, forbid contradictory combinations like --locale="English_United States.1252" --encoding=UTF8, which are currently allowed (and the world is full of such database clusters because that is how the EDB installer GUI makes them). The only allowed combinations for American English should now be: --locale="en-US" --encoding="WIN1252", and --locale="en-US.UTF-8" --encoding="UTF8". You can still use the old names if you like, by explicitly writing --locale="English_United States.1252", but the encoding then has to be WIN1252. It's crazy to mix them up, let's ban that. Obviously there is a pg_upgrade case to worry about there. We'd have to "fix" the now illegal combinations, and I don't know exactly how yet.

0004: Rip out the code that does extra wchar_t conversions for collations. If I've understood correctly, we don't need them: if you have a .UTF-8 locale then your encoding is UTF-8 and should be able to use strcoll_l() directly. Right?

0005: Something similar was being done for strftime(). And we might as well use strftime_l() instead while we're here (part of general movement to use _l functions and stop splattering setlocale() all over the place, for the multithreaded future).

These patches pass on CI. Do they give the expected results when used on a real Windows system?

There are a few more places where we do wchar_t conversions that could probably be stripped out too, if my assumptions are correct, and we could dig further if the basic idea can be validated and people think this is going in a good direction.
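The rule proposed in 0003 can be sketched as a small consistency check on the locale name's suffix. This is not the patch's code: the names check_locale_encoding() and LocaleEncCheck are hypothetical, the mapping of a numeric suffix NNNN to an encoding called "WINNNNN" is a simplifying assumption that only fits code pages like 1252, and a name without a suffix is left for the OS to resolve, as described above for "en-US".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum
{
	LOCALE_ENC_MATCH,			/* suffix agrees with the encoding */
	LOCALE_ENC_MISMATCH,		/* contradictory, e.g. ".1252" with UTF8 */
	LOCALE_ENC_ASK_OS			/* no usable suffix: only the OS knows */
} LocaleEncCheck;

/*
 * Hypothetical check: compare the suffix of a locale name with the
 * requested database encoding name.
 */
static LocaleEncCheck
check_locale_encoding(const char *locale, const char *encoding)
{
	const char *suffix;
	char		expected[32];

	assert(locale != NULL && encoding != NULL);

	suffix = strrchr(locale, '.');
	if (suffix == NULL || suffix[1] == '\0')
		return LOCALE_ENC_ASK_OS;
	suffix++;

	if (strcmp(suffix, "UTF-8") == 0)
		return strcmp(encoding, "UTF8") == 0 ?
			LOCALE_ENC_MATCH : LOCALE_ENC_MISMATCH;

	/* Assumption: an all-digit suffix NNNN means encoding "WINNNNN". */
	if (strspn(suffix, "0123456789") == strlen(suffix))
	{
		snprintf(expected, sizeof(expected), "WIN%s", suffix);
		return strcmp(encoding, expected) == 0 ?
			LOCALE_ENC_MATCH : LOCALE_ENC_MISMATCH;
	}

	return LOCALE_ENC_ASK_OS;
}
```

Under these assumptions, "English_United States.1252" with UTF8 is reported as a mismatch, "en-US.UTF-8" with UTF8 as a match, and plain "en-US" defers to whatever code page the OS reports for that locale.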
Attachments
> I already installed Visual Studio 2022 with C++ support as suggested in https://www.postgresql.org/docs/current/install-windows-full.html
> I cloned codes in the system.
> But, I cannot find any "src/tools/msvc" directory. It is missing.
> Document states I need everything in there
> "The tools for building using Visual C++ or Platform SDK are in the src\tools\msvc directory."
> It seems I will need help setting up the build environment.
I am willing to be a tester for Windows given I could get help setting up the build environment.
It also feels documentation needs some update as I failed to find necessary files.
Thanks & Regards,
Ertan
On 2024-08-08 Th 4:08 AM, Ertan Küçükoglu wrote:
> I already installed Visual Studio 2022 with C++ support as suggested in https://www.postgresql.org/docs/current/install-windows-full.html
> I cloned codes in the system.
> But, I cannot find any "src/tools/msvc" directory. It is missing.
> Document states I need everything in there
> "The tools for building using Visual C++ or Platform SDK are in the src\tools\msvc directory."
> It seems I will need help setting up the build environment.
> I am willing to be a tester for Windows given I could get help setting up the build environment.
> It also feels documentation needs some update as I failed to find necessary files.
If you're trying to build the master branch those documents no longer apply. You will need to build using meson, as documented here: <https://www.postgresql.org/docs/17/install-meson.html>
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com