Обсуждение: Re: BUG #6510: A simple prompt is displayed using wrong charset
Hello, Please let me know if I can do something to get the bug fix (https://commitfest.postgresql.org/action/patch_view?id=902) committed. I would like to fix other bugs related to postgres localization, but I am not sure yet how to do it. Thanks in advance, Alexander 18.10.2012 19:46, Alvaro Herrera wrote: > Noah Misch escribió: > >> Following an off-list ack from Alexander, here is that version. No functional >> differences from Alexander's latest version, and I have verified that it still >> fixes the original test case. I'm marking this Ready for Committer. > This seems good to me, but I'm not comfortable committing Windows stuff. > Andrew, Magnus, are you able to handle this? > > >
Alexander Law <exclusion@gmail.com> writes: > Please let me know if I can do something to get the bug fix > (https://commitfest.postgresql.org/action/patch_view?id=902) committed. It's waiting on some Windows-savvy committer to pick it up, IMO. (FWIW, I have no objection to the patch as given, but I am unqualified to evaluate how sane it is on Windows.) regards, tom lane
Tom Lane escribió: > Alexander Law <exclusion@gmail.com> writes: > > Please let me know if I can do something to get the bug fix > > (https://commitfest.postgresql.org/action/patch_view?id=902) committed. > > It's waiting on some Windows-savvy committer to pick it up, IMO. > (FWIW, I have no objection to the patch as given, but I am unqualified > to evaluate how sane it is on Windows.) I think part of the problem is that we no longer have many Windows-savvy committers willing to spend time on Windows-specific patches; there are, of course, people with enough knowledge, but they don't always necessarily have much interest. Note that I added Magnus and Andrew to this thread because they are known to have contributed Windows-specific improvements, but they have yet to followup on this thread *at all*. I remember back when Windows support was added, one of the arguments in its favor was "it's going to attract lots of new developers". Yeah, right. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 01/23/2013 01:08 PM, Alvaro Herrera wrote: > Tom Lane escribió: >> Alexander Law <exclusion@gmail.com> writes: >>> Please let me know if I can do something to get the bug fix >>> (https://commitfest.postgresql.org/action/patch_view?id=902) committed. >> It's waiting on some Windows-savvy committer to pick it up, IMO. >> (FWIW, I have no objection to the patch as given, but I am unqualified >> to evaluate how sane it is on Windows.) > I think part of the problem is that we no longer have many Windows-savvy > committers willing to spend time on Windows-specific patches; there are, > of course, people with enough knowledge, but they don't always > necessarily have much interest. Note that I added Magnus and Andrew to > this thread because they are known to have contributed Windows-specific > improvements, but they have yet to followup on this thread *at all*. > > I remember back when Windows support was added, one of the arguments in > its favor was "it's going to attract lots of new developers". Yeah, right. I'm about to take a look at this one. Note BTW that Craig Ringer has recently done some excellent work on Windows, and there are several other active non-committers (e.g. Noah) who comment on Windows too. IIRC I never expected us to get a huge influx of developers from the Windows world. What I did expect was a lot of new users, and I think we have on fact got that. cheers andrew
On 01/24/2013 01:34 AM, Tom Lane wrote: > Alexander Law <exclusion@gmail.com> writes: >> Please let me know if I can do something to get the bug fix >> (https://commitfest.postgresql.org/action/patch_view?id=902) committed. > It's waiting on some Windows-savvy committer to pick it up, IMO. I'm no committer, but I can work with Windows and know text encoding issues fairly well. I'll take a look, though I can't do it immediately as I have some other priority work. Should be able to in the next 24h. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
<div class="moz-cite-prefix">On 01/24/2013 01:06 AM, Alexander Law wrote:<br /></div><blockquote cite="mid:51001879.3000207@gmail.com"type="cite">Hello, <br /> Please let me know if I can do something to get the bug fix(<a class="moz-txt-link-freetext" href="https://commitfest.postgresql.org/action/patch_view?id=902">https://commitfest.postgresql.org/action/patch_view?id=902</a>) committed.<br /> I would like to fix other bugs related to postgres localization, but I am not sure yet how to do it.<br/></blockquote><br /> For anyone looking for the history, the 1st post on this topic is here:<br /><br /><a href="http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org">http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org</a><br /><preclass="moz-signature" cols="72">-- Craig Ringer <a class="moz-txt-link-freetext" href="http://www.2ndQuadrant.com/">http://www.2ndQuadrant.com/</a>PostgreSQLDevelopment, 24x7 Support, Training & Services</pre>
On Thu, Jan 24, 2013 at 03:45:36PM +0800, Craig Ringer wrote: > On 01/24/2013 01:34 AM, Tom Lane wrote: > > Alexander Law <exclusion@gmail.com> writes: > >> Please let me know if I can do something to get the bug fix > >> (https://commitfest.postgresql.org/action/patch_view?id=902) committed. > > It's waiting on some Windows-savvy committer to pick it up, IMO. > I'm no committer, but I can work with Windows and know text encoding > issues fairly well. I'll take a look, though I can't do it immediately > as I have some other priority work. Should be able to in the next 24h. Feel free, but it's already Ready for Committer. Also, Andrew has said he plans to pick it up. If in doubt, I think this patch is relatively solid in terms of non-committer review.
On 01/24/2013 07:57 PM, Noah Misch wrote: > On Thu, Jan 24, 2013 at 03:45:36PM +0800, Craig Ringer wrote: >> On 01/24/2013 01:34 AM, Tom Lane wrote: >>> Alexander Law <exclusion@gmail.com> writes: >>>> Please let me know if I can do something to get the bug fix >>>> (https://commitfest.postgresql.org/action/patch_view?id=902) committed. >>> It's waiting on some Windows-savvy committer to pick it up, IMO. >> I'm no committer, but I can work with Windows and know text encoding >> issues fairly well. I'll take a look, though I can't do it immediately >> as I have some other priority work. Should be able to in the next 24h. > Feel free, but it's already Ready for Committer. Also, Andrew has said he > plans to pick it up. If in doubt, I think this patch is relatively solid in > terms of non-committer review. I'm happy with that; I'll look elsewhere. Thanks. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 01/24/2013 03:42 AM, Craig Ringer wrote: > On 01/24/2013 01:06 AM, Alexander Law wrote: >> Hello, >> Please let me know if I can do something to get the bug fix >> (https://commitfest.postgresql.org/action/patch_view?id=902) committed. >> I would like to fix other bugs related to postgres localization, but >> I am not sure yet how to do it. > > For anyone looking for the history, the 1st post on this topic is here: > > http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org Yeah. I'm happy enough with this patch. ISTM it's really a bug and should be backpatched, no? cheers andrew
On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote: > On 01/24/2013 03:42 AM, Craig Ringer wrote: >> On 01/24/2013 01:06 AM, Alexander Law wrote: >>> Hello, >>> Please let me know if I can do something to get the bug fix >>> (https://commitfest.postgresql.org/action/patch_view?id=902) >>> committed. >>> I would like to fix other bugs related to postgres localization, but >>> I am not sure yet how to do it. >> >> For anyone looking for the history, the 1st post on this topic is here: >> >> http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org > > > Yeah. > > I'm happy enough with this patch. ISTM it's really a bug and should be > backpatched, no? It is a bug, yes. I'm neutral on whether to backpatch.
On 01/24/2013 11:19 AM, Noah Misch wrote: > On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote: >> On 01/24/2013 03:42 AM, Craig Ringer wrote: >>> On 01/24/2013 01:06 AM, Alexander Law wrote: >>>> Hello, >>>> Please let me know if I can do something to get the bug fix >>>> (https://commitfest.postgresql.org/action/patch_view?id=902) >>>> committed. >>>> I would like to fix other bugs related to postgres localization, but >>>> I am not sure yet how to do it. >>> For anyone looking for the history, the 1st post on this topic is here: >>> >>> http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org >> >> Yeah. >> >> I'm happy enough with this patch. ISTM it's really a bug and should be >> backpatched, no? > It is a bug, yes. I'm neutral on whether to backpatch. > Well, that's what I did. :-) cheers andrew
On 1/24/13 4:04 PM, Andrew Dunstan wrote: > > On 01/24/2013 11:19 AM, Noah Misch wrote: >> On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote: >>> On 01/24/2013 03:42 AM, Craig Ringer wrote: >>>> On 01/24/2013 01:06 AM, Alexander Law wrote: >>>>> Hello, >>>>> Please let me know if I can do something to get the bug fix >>>>> (https://commitfest.postgresql.org/action/patch_view?id=902) >>>>> committed. >>>>> I would like to fix other bugs related to postgres localization, but >>>>> I am not sure yet how to do it. >>>> For anyone looking for the history, the 1st post on this topic is here: >>>> >>>> http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org >>>> >>> >>> Yeah. >>> >>> I'm happy enough with this patch. ISTM it's really a bug and should be >>> backpatched, no? >> It is a bug, yes. I'm neutral on whether to backpatch. >> > > Well, that's what I did. :-) The 9.0 and 9.1 branches are now failing to build.
On 01/25/2013 10:26 AM, Peter Eisentraut wrote: > On 1/24/13 4:04 PM, Andrew Dunstan wrote: >> On 01/24/2013 11:19 AM, Noah Misch wrote: >>> On Thu, Jan 24, 2013 at 08:50:36AM -0500, Andrew Dunstan wrote: >>>> On 01/24/2013 03:42 AM, Craig Ringer wrote: >>>>> On 01/24/2013 01:06 AM, Alexander Law wrote: >>>>>> Hello, >>>>>> Please let me know if I can do something to get the bug fix >>>>>> (https://commitfest.postgresql.org/action/patch_view?id=902) >>>>>> committed. >>>>>> I would like to fix other bugs related to postgres localization, but >>>>>> I am not sure yet how to do it. >>>>> For anyone looking for the history, the 1st post on this topic is here: >>>>> >>>>> http://www.postgresql.org/message-id/E1S3twb-0004OY-4g@wrigleys.postgresql.org >>>>> >>>> Yeah. >>>> >>>> I'm happy enough with this patch. ISTM it's really a bug and should be >>>> backpatched, no? >>> It is a bug, yes. I'm neutral on whether to backpatch. >>> >> Well, that's what I did. :-) > The 9.0 and 9.1 branches are now failing to build. Yeah, sorry, working on it. cheers andrew
Hello, Thanks for fixing bug #6510! Please look at the following l10n bug: http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com and the proposed patch. Best regards, Alexander
Вложения
Alexander Law <exclusion@gmail.com> writes: > Please look at the following l10n bug: > http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com > and the proposed patch. That patch looks entirely unsafe to me. Neither of those functions should be expected to be able to run when none of our standard infrastructure (palloc, elog) is up yet. Possibly it would be safe to do this somewhere around where we do GUC initialization. regards, tom lane
On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote: > Alexander Law <exclusion@gmail.com> writes: > > Please look at the following l10n bug: > > http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com > > and the proposed patch. > > That patch looks entirely unsafe to me. Neither of those functions > should be expected to be able to run when none of our standard > infrastructure (palloc, elog) is up yet. > > Possibly it would be safe to do this somewhere around where we do > GUC initialization. Even then, I wouldn't be surprised to find problematic consequences beyond error display. What if all the databases are EUC_JP, the platform encoding is KOI8, and some postgresql.conf settings contain EUC_JP characters? Does the postmaster not rely on its use of SQL_ASCII to allow those values? I would look at fixing this by making the error output machinery smarter in this area before changing the postmaster's notion of server_encoding.
30.01.2013 05:51, Noah Misch wrote: > On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote: >> Alexander Law <exclusion@gmail.com> writes: >>> Please look at the following l10n bug: >>> http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com >>> and the proposed patch. >> That patch looks entirely unsafe to me. Neither of those functions >> should be expected to be able to run when none of our standard >> infrastructure (palloc, elog) is up yet. >> >> Possibly it would be safe to do this somewhere around where we do >> GUC initialization. Looking at elog.c:write_console, and boostrap.c:AuxiliaryProcessMain, mcxt.c:MemoryContextInit I would place this call (SetDatabaseEncoding(GetPlatformEncoding())) at MemoryContextInit. (The branch of conversion pgwin32_toUTF16 is not executed until CurrentMemoryContext is not null) But I see some calls to ereport before MemoryContextInit. Is it ok or MemoryContext initialization should be done before? For example, main.c:main -> pgwin32_signal_initialize -> ereport And there is another issue with elog.c:write_stderr if (pgwin32_is_service) then the process writes message to the windows eventlog (write_eventlog), trying to convert in to UTF16. But it doesn't check MemoryContext before the call to pgwin32_toUTF16 (as write_console does) and we can get a crash in the following way: main.c:check_root -> if (pgwin32_is_admin()) write_stderr -> if (pgwin32_is_service()) write_eventlog -> if (if (GetDatabaseEncoding() != GetPlatformEncoding() ) pgwin32_toUTF16 -> crash So placing SetDatabaseEncoding(GetPlatformEncoding()) before the check_root can be a solution for the issue. > Even then, I wouldn't be surprised to find problematic consequences beyond > error display. What if all the databases are EUC_JP, the platform encoding is > KOI8, and some postgresql.conf settings contain EUC_JP characters? Does the > postmaster not rely on its use of SQL_ASCII to allow those values? > > I would look at fixing this by making the error output machinery smarter in > this area before changing the postmaster's notion of server_encoding. Maybe I still miss something but I thought that postinit.c/CheckMyDatabase will switch encoding of a messages by pg_bind_textdomain_codeset to EUC_JP so there will be no issues with it. But until then KOI8 should be used. Regarding postgresql.conf, as it has no explicit encoding specification, it should be interpreted as having the platform encoding. So in your example it should contain KOI8, not EUC_JP characters. Thanks, Alexander
On Wed, Jan 30, 2013 at 10:00:01AM +0400, Alexander Law wrote: > 30.01.2013 05:51, Noah Misch wrote: >> On Tue, Jan 29, 2013 at 09:54:04AM -0500, Tom Lane wrote: >>> Alexander Law <exclusion@gmail.com> writes: >>>> Please look at the following l10n bug: >>>> http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com >>>> and the proposed patch. >> Even then, I wouldn't be surprised to find problematic consequences beyond >> error display. What if all the databases are EUC_JP, the platform encoding is >> KOI8, and some postgresql.conf settings contain EUC_JP characters? Does the >> postmaster not rely on its use of SQL_ASCII to allow those values? >> >> I would look at fixing this by making the error output machinery smarter in >> this area before changing the postmaster's notion of server_encoding. With your proposed change, the problem will resurface in an actual SQL_ASCII database. At the problem's root is write_console()'s assumption that messages are in the database encoding. pg_bind_textdomain_codeset() tries to make that so, but it only works for encodings with a pg_enc2gettext_tbl entry. That excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to behave differently in such cases. > Maybe I still miss something but I thought that > postinit.c/CheckMyDatabase will switch encoding of a messages by > pg_bind_textdomain_codeset to EUC_JP so there will be no issues with it. > But until then KOI8 should be used. > Regarding postgresql.conf, as it has no explicit encoding specification, > it should be interpreted as having the platform encoding. So in your > example it should contain KOI8, not EUC_JP characters. Following some actual testing, I see that we treat postgresql.conf values as byte sequences; any reinterpretation as encoded text happens later. Hence, contrary to my earlier suspicion, your patch does not make that situation worse. The present situation is bad; among other things, current_setting() is a vector for injecting invalid text data. But unconditionally validating postgresql.conf values in the platform encoding would not be an improvement. Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish to put KOI8R strings in a GUC, say search_path. That's possible today; if we required that postgresql.conf conform to the platform encoding and no other, it would become impossible. This area warrants improvement, but doing so will entail careful design. Thanks, nm
Noah Misch <noah@leadboat.com> writes: > Following some actual testing, I see that we treat postgresql.conf values as > byte sequences; any reinterpretation as encoded text happens later. Hence, > contrary to my earlier suspicion, your patch does not make that situation > worse. The present situation is bad; among other things, current_setting() is > a vector for injecting invalid text data. But unconditionally validating > postgresql.conf values in the platform encoding would not be an improvement. > Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish > to put KOI8R strings in a GUC, say search_path. That's possible today; if we > required that postgresql.conf conform to the platform encoding and no other, > it would become impossible. This area warrants improvement, but doing so will > entail careful design. The key problem, ISTM, is that it's not at all clear what encoding to expect the incoming data to be in. I'm concerned about trying to fix that by assuming it's in some "platform encoding" --- for one thing, while that might be a well-defined concept on Windows, I don't believe it is anywhere else. If we knew that postgresql.conf was stored in, say, UTF8, then it would probably be possible to perform encoding conversion to get string variables into the database encoding. Perhaps we should allow some magic syntax to tell us the encoding of a config file? file_encoding = 'utf8' # must precede any non-ASCII in the file There would still be a lot of practical problems to solve, like what to do if we fail to convert some string into the database encoding. But at least the problems would be somewhat well-defined. While we're thinking about this, it'd be nice to fix our handling (or rather lack of handling) of encoding considerations for database names, user names, and passwords. I could imagine adding some sort of encoding marker to connection request packets, which could fix the don't-know- the-encoding problem as far as incoming data is concerned. But how shall we deal with storing the strings in shared catalogs, which have to be readable from multiple databases possibly of different encodings? regards, tom lane
On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote: > Noah Misch <noah@leadboat.com> writes: > > Following some actual testing, I see that we treat postgresql.conf values as > > byte sequences; any reinterpretation as encoded text happens later. Hence, > > contrary to my earlier suspicion, your patch does not make that situation > > worse. The present situation is bad; among other things, current_setting() is > > a vector for injecting invalid text data. But unconditionally validating > > postgresql.conf values in the platform encoding would not be an improvement. > > Suppose you have a UTF-8 platform encoding and KOI8R databases. You may wish > > to put KOI8R strings in a GUC, say search_path. That's possible today; if we > > required that postgresql.conf conform to the platform encoding and no other, > > it would become impossible. This area warrants improvement, but doing so will > > entail careful design. > > The key problem, ISTM, is that it's not at all clear what encoding to > expect the incoming data to be in. I'm concerned about trying to fix > that by assuming it's in some "platform encoding" --- for one thing, > while that might be a well-defined concept on Windows, I don't believe > it is anywhere else. GetPlatformEncoding() imposes a sufficiently-portable definition. I just don't think that definition leads to a value that can be presumed desirable and adequate for postgresql.conf. > If we knew that postgresql.conf was stored in, say, UTF8, then it would > probably be possible to perform encoding conversion to get string > variables into the database encoding. Perhaps we should allow some > magic syntax to tell us the encoding of a config file? > > file_encoding = 'utf8' # must precede any non-ASCII in the file > > There would still be a lot of practical problems to solve, like what to > do if we fail to convert some string into the database encoding. But at > least the problems would be somewhat well-defined. Agreed. That's a promising direction. > While we're thinking about this, it'd be nice to fix our handling (or > rather lack of handling) of encoding considerations for database names, > user names, and passwords. I could imagine adding some sort of encoding > marker to connection request packets, which could fix the don't-know- > the-encoding problem as far as incoming data is concerned. That deserves a TODO entry under Wire Protocol Changes to avoid losing it. > But how > shall we deal with storing the strings in shared catalogs, which have to > be readable from multiple databases possibly of different encodings? I suppose we would pick an encoding sufficient for all values we intend to support (UTF8? MULE_INTERNAL?), then store the data in that encoding using either bytea or a new type, say "omnitext". Thanks, nm
Noah Misch <noah@leadboat.com> writes: > On Sun, Feb 10, 2013 at 06:47:30PM -0500, Tom Lane wrote: >> The key problem, ISTM, is that it's not at all clear what encoding to >> expect the incoming data to be in. I'm concerned about trying to fix >> that by assuming it's in some "platform encoding" --- for one thing, >> while that might be a well-defined concept on Windows, I don't believe >> it is anywhere else. > GetPlatformEncoding() imposes a sufficiently-portable definition. I just > don't think that definition leads to a value that can be presumed desirable > and adequate for postgresql.conf. Nah, GetPlatformEncoding is utterly useless for this purpose --- restart the server with some other environment, and bad things will happen. regards, tom lane
On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If we knew that postgresql.conf was stored in, say, UTF8, then it would > probably be possible to perform encoding conversion to get string > variables into the database encoding. Perhaps we should allow some > magic syntax to tell us the encoding of a config file? > > file_encoding = 'utf8' # must precede any non-ASCII in the file If we're going to do that we might as well use the Emacs standard -*-coding: latin-1;-*- But that said I'm not sure saying the whole file is in an encoding is the right approach. Paths are actually binary strings. any encoding is purely for display purposes anyways. Log line formats could be treated similarly if we choose. Hostnames would need to be in a particular encoding if we're to generate punycode I think but I think that particular encoding would have to be UTF8. Converting from some other encoding would be error prone and unnecessarily complex. What parts of postgresql.conf are actually encoded strings that need to be (and can be) manipulated as encoded strings? -- greg
On Tue, Feb 12, 2013 at 03:22:17AM +0000, Greg Stark wrote: > But that said I'm not sure saying the whole file is in an encoding is > the right approach. Paths are actually binary strings. any encoding is > purely for display purposes anyways. For Unix, yes. On Windows, they're ultimately UTF16 strings; some system APIs accept paths in the Windows ANSI code page and convert to UTF16 internally. Nonetheless, good point. > What parts of postgresql.conf are actually encoded strings that need > to be (and can be) manipulated as encoded strings? Mainly the ones that refer to arbitrary database objects. At least these: default_tablespace default_text_search_config search_path temp_tablespaces
<div class="moz-cite-prefix">Hello, </div><blockquote cite="mid:20130210210259.GA7401@tornado.leadboat.com" type="cite"><blockquotetype="cite"><blockquote type="cite"><blockquote type="cite"><pre wrap="">Alexander Law <a class="moz-txt-link-rfc2396E"href="mailto:exclusion@gmail.com"><exclusion@gmail.com></a> writes: </pre><blockquote type="cite"><pre wrap="">Please look at the following l10n bug: <a class="moz-txt-link-freetext" href="http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com">http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com</a> and the proposed patch. </pre></blockquote></blockquote></blockquote></blockquote><pre wrap=""> With your proposed change, the problem will resurface in an actual SQL_ASCII database. At the problem's root is write_console()'s assumption that messages are in the database encoding. pg_bind_textdomain_codeset() tries to make that so, but it only works for encodings with a pg_enc2gettext_tbl entry. That excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to behave differently in such cases.</pre></blockquote> Thank you for the notice. So it seems that "DatabaseEncoding" variablealone can't present a database encoding (for communication with a client) and current process messages encoding (forlogging messages) at once. There should be another variable, something like CurrentProcessEncoding, that will be setto OS encoding at start and can be changed to encoding of a connected database (if bind_textdomain_codeset succeeded).<br/><br /><blockquote cite="mid:20130214024020.GA28577@tornado.leadboat.com" type="cite"><div class="moz-text-plain"graphical-quote="true" lang="x-western" style="font-family: -moz-fixed; font-size: 12px;" wrap="true"><prewrap="">On Tue, Feb 12, 2013 at 03:22:17AM +0000, Greg Stark wrote: </pre><blockquote style="color: #000000;" type="cite"><pre wrap=""><span class="moz-txt-citetags">> </span>But that saidI'm not sure saying the whole file is in an encoding is <span class="moz-txt-citetags">> </span>the right approach. Paths are actually binary strings. any encoding is <span class="moz-txt-citetags">> </span>purely for display purposes anyways. </pre></blockquote><pre wrap="">For Unix, yes. On Windows, they're ultimately UTF16 strings; some system APIs accept paths in the Windows ANSI code page and convert to UTF16 internally. Nonetheless, good point. </pre></div></blockquote> Yes, and if postresql.conf not going to be UTF16 encoded, it seems natural to use ANSI code pageon Windows to write such paths in it.<br /> So the paths should be written in OS encoding, which is accepted by OS functions,such as fopen. (This is what we have now.)<br /> And it seems too complicated to have different encodings in onefile. Or maybe path parameters should be separated from the others, for which OS encoding is undesirable.<br /><blockquotecite="mid:CAM-w4HOpZfoaXwS8Tb114XaXS=3psLBFqaWb0ZZFG2keS+CbdQ@mail.gmail.com" type="cite"><blockquote type="cite"><prewrap="">If we knew that postgresql.conf was stored in, say, UTF8, then it would probably be possible to perform encoding conversion to get string variables into the database encoding. Perhaps we should allow some magic syntax to tell us the encoding of a config file? file_encoding = 'utf8' # must precede any non-ASCII in the file </pre></blockquote><pre wrap="">If we're going to do that we might as well use the Emacs standard -*-coding: latin-1;-*- </pre></blockquote> Explicit encoding specification such as these (or even <?xml version="1.0" encoding="utf-8"?>)can be useful but what encoding to assume without it? For XML (without BOM) it's UTF-8, for emacs itdepends on it's language environment.<br /> If postgresql.conf doesn't have to be portable (as XML), then IMO OS encodingis the right choice for it.<br /><br /><br /> Best regards,<br /> Alexander<br />
On 2/11/13 10:22 PM, Greg Stark wrote: > On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If we knew that postgresql.conf was stored in, say, UTF8, then it would >> probably be possible to perform encoding conversion to get string >> variables into the database encoding. Perhaps we should allow some >> magic syntax to tell us the encoding of a config file? >> >> file_encoding = 'utf8' # must precede any non-ASCII in the file > > If we're going to do that we might as well use the Emacs standard > -*-coding: latin-1;-*- Yes, or more generally perhaps what Python does: http://docs.python.org/2.7/reference/lexical_analysis.html#encoding-declarations (In Python 2, the default is ASCII, in Python 3, the default is UTF8.) > But that said I'm not sure saying the whole file is in an encoding is > the right approach. Paths are actually binary strings. any encoding is > purely for display purposes anyways. Log line formats could be treated > similarly if we choose. Well, at some point someone is going to open a configuration file in an editor, and the editor will usually only be table to deal with one encoding per file. So we need to make that work, even if technically, different considerations apply to different settings.
On 02/15/2013 12:45 AM, Peter Eisentraut wrote: > On 2/11/13 10:22 PM, Greg Stark wrote: >> On Sun, Feb 10, 2013 at 11:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> If we knew that postgresql.conf was stored in, say, UTF8, then it would >>> probably be possible to perform encoding conversion to get string >>> variables into the database encoding. Perhaps we should allow some >>> magic syntax to tell us the encoding of a config file? >>> >>> file_encoding = 'utf8' # must precede any non-ASCII in the file >> If we're going to do that we might as well use the Emacs standard >> -*-coding: latin-1;-*- > Yes, or more generally perhaps what Python does: > http://docs.python.org/2.7/reference/lexical_analysis.html#encoding-declarations > > (In Python 2, the default is ASCII, in Python 3, the default is UTF8.) Not that Python also respects a BOM in a UTF-8 file, treating the BOM as flagging the file as being UTF-8. "In addition, if the first bytes of the file are the UTF-8 byte-order mark ('\xef\xbb\xbf'), the declared file encoding is UTF-8." IMO we should do the same. If there's no explicit encoding declaration, treat it as UTF-8 if there's a BOM and as the platform's local character encoding otherwise. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Feb 14, 2013 at 11:11:13AM +0400, Alexander Law wrote: > Hello, >>>>> Alexander Law <exclusion@gmail.com> writes: >>>>>> Please look at the following l10n bug: >>>>>> http://www.postgresql.org/message-id/502A26F1.6010109@gmail.com >>>>>> and the proposed patch. >> With your proposed change, the problem will resurface in an actual SQL_ASCII >> database. At the problem's root is write_console()'s assumption that messages >> are in the database encoding. pg_bind_textdomain_codeset() tries to make that >> so, but it only works for encodings with a pg_enc2gettext_tbl entry. That >> excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to >> behave differently in such cases. > Thank you for the notice. So it seems that "DatabaseEncoding" variable > alone can't present a database encoding (for communication with a > client) and current process messages encoding (for logging messages) at > once. There should be another variable, something like > CurrentProcessEncoding, that will be set to OS encoding at start and can > be changed to encoding of a connected database (if > bind_textdomain_codeset succeeded). I'd call it MessageEncoding unless it corresponds with similar rigor to a broader concept. >> On Tue, Feb 12, 2013 at 03:22:17AM +0000, Greg Stark wrote: >>> >But that said I'm not sure saying the whole file is in an encoding is >>> >the right approach. Paths are actually binary strings. any encoding is >>> >purely for display purposes anyways. >> For Unix, yes. On Windows, they're ultimately UTF16 strings; some system APIs >> accept paths in the Windows ANSI code page and convert to UTF16 internally. >> Nonetheless, good point. > Yes, and if postresql.conf not going to be UTF16 encoded, it seems > natural to use ANSI code page on Windows to write such paths in it. > So the paths should be written in OS encoding, which is accepted by OS > functions, such as fopen. (This is what we have now.) To the contrary, we would do better to use _wfopen() after converting from the encoding at hand to UTF16. We should have the goal of removing our dependence on the Windows ANSI code page, not tightening our bonds thereto. As long as PostgreSQL uses fopen() on Windows, it will remain possible to create a file that PostgreSQL cannot open. Making the full transition is probably a big job, and we don't need to get there in one patch. Try, however, to avoid patch designs that increase the distance left to cover. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Hello, 15.02.2013 02:59, Noah Misch wrote: >>> With your proposed change, the problem will resurface in an actual SQL_ASCII >>> database. At the problem's root is write_console()'s assumption that messages >>> are in the database encoding. pg_bind_textdomain_codeset() tries to make that >>> so, but it only works for encodings with a pg_enc2gettext_tbl entry. That >>> excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to >>> behave differently in such cases. >> Thank you for the notice. So it seems that "DatabaseEncoding" variable >> alone can't present a database encoding (for communication with a >> client) and current process messages encoding (for logging messages) at >> once. There should be another variable, something like >> CurrentProcessEncoding, that will be set to OS encoding at start and can >> be changed to encoding of a connected database (if >> bind_textdomain_codeset succeeded). > I'd call it MessageEncoding unless it corresponds with similar rigor to a > broader concept. Please look at the next version of the patch. Thanks, Alexander
Вложения
On Wed, Feb 20, 2013 at 04:00:00PM +0400, Alexander Law wrote: > 15.02.2013 02:59, Noah Misch wrote: >>>> With your proposed change, the problem will resurface in an actual SQL_ASCII >>>> database. At the problem's root is write_console()'s assumption that messages >>>> are in the database encoding. pg_bind_textdomain_codeset() tries to make that >>>> so, but it only works for encodings with a pg_enc2gettext_tbl entry. That >>>> excludes SQL_ASCII, MULE_INTERNAL, and others. write_console() needs to >>>> behave differently in such cases. >>> Thank you for the notice. So it seems that "DatabaseEncoding" variable >>> alone can't present a database encoding (for communication with a >>> client) and current process messages encoding (for logging messages) at >>> once. There should be another variable, something like >>> CurrentProcessEncoding, that will be set to OS encoding at start and can >>> be changed to encoding of a connected database (if >>> bind_textdomain_codeset succeeded). >> I'd call it MessageEncoding unless it corresponds with similar rigor to a >> broader concept. > Please look at the next version of the patch. I studied this patch. I like the APIs it adds, but the implementation details required attention. Your patch assumes the message encoding will be a backend encoding, but this need not be so; locale "Japanese (Japan)" uses CP932 aka SJIS. I've also added the non-backend encodings to pg_enc2gettext_tbl; I'd welcome sanity checks from folks who know those encodings well. write_console() still garbled the payload in all cases where it used write() to a console instead of WriteConsoleW(). You can observe this by configuring Windows for Russian, running initdb with default locale settings, and running "select 1/0" in template1. See comments for the technical details; I fixed this by removing the logic to preemptively skip WriteConsoleW() for encoding-related reasons. (Over in write_eventlog(), I am suspicious of the benefits we get from avoiding ReportEventW() where possible. I have not removed that, though.) > --- a/src/backend/main/main.c > +++ b/src/backend/main/main.c > @@ -100,6 +100,8 @@ main(int argc, char *argv[]) > > set_pglocale_pgservice(argv[0], PG_TEXTDOMAIN("postgres")); > > + SetMessageEncoding(GetPlatformEncoding()); SetMessageEncoding() and GetPlatformEncoding() can both elog()/ereport(), but this is too early in the life of a PostgreSQL process for that. The goal of this line of code is to capture gettext's default encoding, which is platform-specific. On non-Windows platforms, it's the encoding implied by LC_CTYPE. On Windows, it's the Windows ANSI code page. GetPlatformEncoding() always gives the encoding implied by LC_CTYPE, which is therefore not correct on Windows. You can observe broken output by setting your locale (in control panel "Region and Language" -> Formats -> Format) to something unrelated to your Windows ANSI code page (Region and Language -> Administrative -> Change system locale...). Let's make PostgreSQL independent of gettext's Windows-specific default by *always* calling bind_textdomain_codeset() on Windows. In the postmaster and other non-database-attached processes, as well as in SQL_ASCII databases, we would pass the encoding implied by LC_CTYPE. Besides removing a discrepancy in PostgreSQL behavior between Windows and non-Windows platforms, this has the great practical advantage of reducing PostgreSQL's dependence on the deprecated Windows ANSI code page, which can only be changed on a system-wide basis, by an administrator, after a reboot. For message creation purposes, the encoding implied by LC_CTYPE on Windows is somewhat arbitrary. Microsoft has deprecated them all in favor of UTF16, and some locales (e.g. "Hindi (India)") have no legacy code page. I can see adding a postmaster_encoding GUC to be used in place of consulting LC_CTYPE. I haven't done that now; I just mention it for future reference. On a !ENABLE_NLS build, messages are ASCII with database-encoding strings sometimes interpolated. Therefore, such builds should keep the message encoding equal to the database encoding. The MessageEncoding was not maintained accurately on non-Windows systems. For example, connecting to an lc_ctype=LATIN1, encoding=LATIN1 database does not require a bind_textdomain_codeset() call on such systems, so we did not update the message encoding. This was academic for the time being, because MessageEncoding is only consulted on Windows. The attached revision fixes all above points. Would you look it over? The area was painfully light on comments, so I added some. I renamed pgwin32_toUTF16(), which ceases to be a good name now that it converts from message encoding, not database encoding. GetPlatformEncoding() became unused, so I removed it. (If we have cause reintroduce the exact same concept later, GetTTYEncoding() would name it more accurately.) What should we do for the back branches, if anything? Fixes for garbled display on consoles and event logs are fair to back-patch, but users might be accustomed to the present situation for SQL_ASCII databases. Given the low incidence of complaints and the workaround of using logging_collector, I am inclined to put the whole thing in master only. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com
Вложения
Hello Noah, Thanks for your work, your patch is definitely better. I agree that this approach much more generic. 23.06.2013 20:53, Noah Misch wrote: > The attached revision fixes all above points. Would you look it over? The > area was painfully light on comments, so I added some. I renamed > pgwin32_toUTF16(), which ceases to be a good name now that it converts from > message encoding, not database encoding. GetPlatformEncoding() became unused, > so I removed it. (If we have cause reintroduce the exact same concept later, > GetTTYEncoding() would name it more accurately.) Yes, the patch works for me. I have just a little question about pgwin32_message_to_UTF16. Do we need to convert SQL_ASCII through UTF8 or should SQL_ASCII be mapped to 20127 (US-ASCII (7-bit))? > What should we do for the back branches, if anything? Fixes for garbled > display on consoles and event logs are fair to back-patch, but users might be > accustomed to the present situation for SQL_ASCII databases. Given the low > incidence of complaints and the workaround of using logging_collector, I am > inclined to put the whole thing in master only. I thought that the change could be a first step to the PosgreSQL log encoding normalization. Today the log may contain messages with different encodings (we had a long discussion a year ago: http://www.postgresql.org/message-id/5007C399.6000405@gmail.com) Now the new function GetMessageEncoding allows to convert all the messages consistently. If the future log encoding fix will be considered as important enough to be backported, then this patch should be backported too. Thanks again! Best regards, Alexander
On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote: > 23.06.2013 20:53, Noah Misch wrote: >> The attached revision fixes all above points. Would you look it over? The >> area was painfully light on comments, so I added some. I renamed >> pgwin32_toUTF16(), which ceases to be a good name now that it converts from >> message encoding, not database encoding. GetPlatformEncoding() became unused, >> so I removed it. (If we have cause reintroduce the exact same concept later, >> GetTTYEncoding() would name it more accurately.) > Yes, the patch works for me. I have just a little question about > pgwin32_message_to_UTF16. Do we need to convert SQL_ASCII through UTF8 > or should SQL_ASCII be mapped to 20127 (US-ASCII (7-bit))? Good question. SQL_ASCII is an awkward database encoding; it actually indicates ignorance about the significance of byte sequences in text. As a result, nothing we do here is likely to be delightful. I caused an error containing some LATIN2 bytes (SELECT * FROM "śmiać się") in an SQL_ASCII / LC_CTYPE=C database, and the result was decent, considering: the event log shows a question-mark-in-rhombus symbol for each of the non-ASCII bytes. Using CP20127 turned it into "6miaf sij". I don't have a great idea for improving on the existing hack. >> What should we do for the back branches, if anything? Fixes for garbled >> display on consoles and event logs are fair to back-patch, but users might be >> accustomed to the present situation for SQL_ASCII databases. Given the low >> incidence of complaints and the workaround of using logging_collector, I am >> inclined to put the whole thing in master only. > I thought that the change could be a first step to the PosgreSQL log > encoding normalization. Today the log may contain messages with > different encodings (we had a long discussion a year ago: > http://www.postgresql.org/message-id/5007C399.6000405@gmail.com) > Now the new function GetMessageEncoding allows to convert all the > messages consistently. If the future log encoding fix will be considered > as important enough to be backported, then this patch should be > backported too. I doubt that would prove to be back-patch material. -- Noah Misch EnterpriseDB http://www.enterprisedb.com
On Mon, Jun 24, 2013 at 04:00:00PM +0400, Alexander Law wrote: > Thanks for your work, your patch is definitely better. I agree that this > approach much more generic. Committed. -- Noah Misch EnterpriseDB http://www.enterprisedb.com