Discussion: Patch: add conversion from pg_wchar to multibyte
Hackers,
The attached patch adds conversion from a pg_wchar string to a multibyte string.
This functionality is needed for my patch on index support for regular expression search: http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php .
Analyzing the conversion from multibyte to pg_wchar, I found the following types of conversion:
1) Trivial conversion for single-byte encodings. It just zero-extends each byte.
2) Conversion from UTF-8 to Unicode code points.
3) Conversions from the euc* encodings. They write the bytes of a character into the pg_wchar in inverse order, starting from the lowest byte (this explanation assumes a little-endian system).
4) Conversion from the mule encoding. This conversion is unclear to me and also seems to be lossy.
It was easy to write the inverse conversion for types 1-3. I've changed conversion 4 to behave like conversion 3. I'm not sure my change is OK, because I didn't understand the original conversion.
With best regards,
Alexander Korotkov.
Attachments
Hi Alexander,

Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three instances. (The patches compiled fine, and make check was without problem.)

-- 3 instances:
HEAD                 port 6542
trgm_regex           port 6547  HEAD + trgm-regexp patch (22 Nov 2011) [1]
trgm_regex_wchar2mb  port 6549  HEAD + trgm-regexp + wchar2mb patch (23 Apr 2012) [2]

[1] http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
[2] http://archives.postgresql.org/pgsql-hackers/2012-04/msg01095.php

-- table sizes:
azjunk4  10^4 rows     1 MB
azjunk5  10^5 rows    11 MB
azjunk6  10^6 rows   112 MB
azjunk7  10^7 rows  1116 MB

for table creation/structure, see:
[3] http://archives.postgresql.org/pgsql-hackers/2012-01/msg01094.php

Results for the three instances with 4 repetitions per instance are attached. Although the regexes I chose are somewhat arbitrary, it does show some of the good, the bad and the ugly of the patch(es). (Also: I've limited the tests to a range of 'workable' regexps, i.e. avoiding unbounded regexps.)

hth (and thanks, great work!),

Erik Rijkers
Attachments
On Sun, Apr 29, 2012 at 8:12 AM, Erik Rijkers <er@xs4all.nl> wrote:
> Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
> instances. (the patches compiled fine, and make check was without problem).

These test results seem to be more about the pg_trgm changes than the patch actually on this thread, unless I'm missing something. But the executive summary seems to be that pg_trgm might need to be a bit smarter about costing the trigram-based search, because when the number of trigrams is really big, using the index is counterproductive. Hopefully that's not too hard to fix; the basic approach seems quite promising.

(I haven't actually looked at the patch on this thread yet to understand how it fits in; the above comments are about the pg_trgm regex stuff.)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> Hopefully that's not too hard to fix; the basic approach seems
> quite promising.

After playing with trigram searches for name searches against copies of a production database with appropriate indexing, our shop has chosen it as the new way to do name searches here. It's really nice.

My biggest complaint is related to setting the threshold for the % operator. It seems to me that there should be a GUC to control the default, and that there should be a way to set the threshold for each % operator in a query (if there is more than one). The function names which must be used on the connection before running the queries don't give any clue that they are related to trigrams: show_limit() and set_limit() are nearly useless for conveying the semantics of what they do.

Even with those issues, trigram similarity searching is IMO one of the top five coolest things about PostgreSQL and should be promoted heavily.

-Kevin
Hi Erik
On Sun, Apr 29, 2012 at 4:12 PM, Erik Rijkers <er@xs4all.nl> wrote:
> Perhaps I'm too early with these tests, but FWIW I reran my earlier test program against three
> instances. (the patches compiled fine, and make check was without problem).
>
> -- 3 instances:
> HEAD                 port 6542
> trgm_regex           port 6547  HEAD + trgm-regexp patch (22 Nov 2011) [1]
> trgm_regex_wchar2mb  port 6549  HEAD + trgm-regexp + wchar2mb patch (23 Apr 2012) [2]
Actually, the wchar2mb patch doesn't affect the behaviour of trgm-regexp. It provides a correct way to do some of the encoding-conversion work that the last published version of trgm-regexp does internally. So "HEAD + trgm-regexp patch" and "HEAD + trgm-regexp + wchar2mb patch" should behave similarly.
> [1] http://archives.postgresql.org/pgsql-hackers/2011-11/msg01297.php
> [2] http://archives.postgresql.org/pgsql-hackers/2012-04/msg01095.php
>
> -- table sizes:
> azjunk4  10^4 rows     1 MB
> azjunk5  10^5 rows    11 MB
> azjunk6  10^6 rows   112 MB
> azjunk7  10^7 rows  1116 MB
>
> for table creation/structure, see:
> [3] http://archives.postgresql.org/pgsql-hackers/2012-01/msg01094.php
>
> Results for three instances with 4 repetitions per instance are attached.
> Although the regexes I chose are somewhat arbitrary, it does show some of the good, the bad and
> the ugly of the patch(es). (Also: I've limited the tests to a range of 'workable' regexps, i.e.
> avoiding unbounded regexps)
Thank you for testing!
Such synthetic tests are very valuable for finding corner cases of the patch, bugs, etc.
But it would also be nice to run some tests on real-life datasets with real-life regexps, in order to see the real benefit of this indexing approach and to compare it with other approaches. Maybe you or somebody else could obtain such datasets?
Also, I did some optimizations in the algorithm. The automaton analysis stage should become less CPU- and memory-consuming. I'll publish a new version soon.
------
With best regards,
Alexander Korotkov.
On Mon, Apr 30, 2012 at 10:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> These test results seem to be more about the pg_trgm changes than the
> patch actually on this thread, unless I'm missing something. But the
> executive summary seems to be that pg_trgm might need to be a bit
> smarter about costing the trigram-based search, because when the
> number of trigrams is really big, using the index is
> counterproductive. Hopefully that's not too hard to fix; the basic
> approach seems quite promising.

Right. When the number of trigrams is big, it is slow to scan the posting lists of all of them. The solution in this case is to exclude the most frequent trigrams from the index scan. But that requires some kind of statistics on trigram frequencies, which we don't have. We could estimate frequencies using some hard-coded assumptions about natural languages, or we could exclude arbitrary trigrams, but I don't like either of these ideas. This problem is also relevant for LIKE/ILIKE searches using trigram indexes.

Something similar could occur in tsearch when we search for "frequent_term & rare_term". In some situations (depending on term frequencies) it's better to exclude frequent_term from the index scan and do a recheck. We have the relevant statistics to make such a decision, but it doesn't seem feasible to get them through the current GIN interface.

Probably you have some comments on the idea of conversion from pg_wchar to multibyte? Is it acceptable at all?

------
With best regards,
Alexander Korotkov.
On Tue, May 1, 2012 at 1:48 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> My biggest complaint is related to setting the threshold for the
> % operator. It seems to me that there should be a GUC to control the
> default, and that there should be a way to set the threshold for
> each % operator in a query (if there is more than one). The
> function names which must be used on the connection before running
> the queries don't give any clue that they are related to trigrams:
> show_limit() and set_limit() are nearly useless for conveying the
> semantics of what they do.

I think this problem can be avoided by introducing a composite type representing a trigram similarity query. It could consist of a query text and a similarity threshold. This type would serve a purpose similar to tsquery or query_int. It would make queries heavier, but it would allow using a distinct similarity threshold for each operator in the same query.

------
With best regards,
Alexander Korotkov.
On Tue, May 1, 2012 at 6:02 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Right. When number of trigrams is big, it is slow to scan posting list of
> all of them. The solution is this case is to exclude most frequent trigrams
> from index scan. But, it require some kind of statistics of trigrams
> frequencies which we don't have. We could estimate frequencies using some
> hard-coded assumptions about natural languages. Or we could exclude
> arbitrary trigrams. But I don't like both these ideas. This problem is also
> relevant for LIKE/ILIKE search using trigram indexes.

I was thinking you could perhaps do it just based on the *number* of trigrams, not necessarily their frequency.

> Probably you have some comments on idea of conversion from pg_wchar to
> multibyte? Is it acceptable at all?

Well, I'm not an expert on encodings, but it seems like a logical extension of what we're doing right now, so I don't really see why not. I'm confused by the diff hunks in pg_mule2wchar_with_len, though. Presumably either the old code is right (in which case, don't change it) or the new code is right (in which case, there's a bug fix needed here that ought to be discussed and committed separately from the rest of the patch). Maybe I am missing something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 2, 2012 at 4:50 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I was thinking you could perhaps do it just based on the *number* of
> trigrams, not necessarily their frequency.

Imagine we have two queries:
1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';

The first query requires reading the posting lists of trigrams "abc" and "bcd".
The second query requires reading the posting lists of trigrams "abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
We could decide to use an index scan for the first query and a sequential scan for the second query, because the number of posting lists to read is high. But that is unreasonable, because the second query is actually narrower than the first one: we can use the same index scan for it, and the recheck will remove all false positives. When the number of trigrams is high, we can just exclude some of them from the index scan. That would be better than simply deciding to do a sequential scan. But the question is which trigrams to exclude. Ideally we would keep the rarest trigrams, to make the index scan cheaper.

> Well, I'm not an expert on encodings, but it seems like a logical
> extension of what we're doing right now, so I don't really see why
> not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
> though. Presumably either the old code is right (in which case, don't
> change it) or the new code is right (in which case, there's a bug fix
> needed here that ought to be discussed and committed separately from
> the rest of the patch). Maybe I am missing something.

Unfortunately, I didn't understand the original logic of pg_mule2wchar_with_len. I just made a proposal about how it could be. I hope somebody more familiar with this code will clarify the situation.

------
With best regards,
Alexander Korotkov.
On Wed, May 2, 2012 at 9:35 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> I was thinking you could perhaps do it just based on the *number* of
>> trigrams, not necessarily their frequency.
>
> Imagine we've two queries:
> 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
> 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
>
> The first query require reading posting lists of trigrams "abc" and "bcd".
> The second query require reading posting lists of trigrams "abc", "bcd",
> "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
> We could decide to use index scan for first query and sequential scan for
> second query because number of posting list to read is high. But it is
> unreasonable because actually second query is narrower than the first one.
> We can use same index scan for it, recheck will remove all false positives.
> When number of trigrams is high we can just exclude some of them from index
> scan. It would be better than just decide to do sequential scan. But the
> question is what trigrams to exclude? Ideally we would leave most rare
> trigrams to make index scan cheaper.

True. I guess I was thinking more of the case where you've got abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|.... There's probably some point at which it becomes silly to think about using the index.

>>> Probably you have some comments on idea of conversion from pg_wchar to
>>> multibyte? Is it acceptable at all?
>>
>> Well, I'm not an expert on encodings, but it seems like a logical
>> extension of what we're doing right now, so I don't really see why
>> not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
>> though. Presumably either the old code is right (in which case, don't
>> change it) or the new code is right (in which case, there's a bug fix
>> needed here that ought to be discussed and committed separately from
>> the rest of the patch). Maybe I am missing something.
>
> Unfortunately I didn't understand original logic of pg_mule2wchar_with_len.
> I just did proposal about how it could be. I hope somebody more familiar
> with this code would clarify this situation.

Well, do you think the current code is buggy, or not?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 2, 2012 at 5:48 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Imagine we've two queries:
>> 1) SELECT * FROM tbl WHERE col LIKE '%abcd%';
>> 2) SELECT * FROM tbl WHERE col LIKE '%abcdefghijk%';
>>
>> The first query require reading posting lists of trigrams "abc" and "bcd".
>> The second query require reading posting lists of trigrams "abc", "bcd",
>> "cde", "def", "efg", "fgh", "ghi", "hij" and "ijk".
>> We could decide to use index scan for first query and sequential scan for
>> second query because number of posting list to read is high. But it is
>> unreasonable because actually second query is narrower than the first one.
>> We can use same index scan for it, recheck will remove all false positives.
>> When number of trigrams is high we can just exclude some of them from index
>> scan. It would be better than just decide to do sequential scan. But the
>> question is what trigrams to exclude? Ideally we would leave most rare
>> trigrams to make index scan cheaper.
>
> True. I guess I was thinking more of the case where you've got
> abc|def|ghi|jkl|mno|pqr|stu|vwx|yza|.... There's probably some point
> at which it becomes silly to think about using the index.

Yes, such situations are also possible.

>>> Well, I'm not an expert on encodings, but it seems like a logical
>>> extension of what we're doing right now, so I don't really see why
>>> not. I'm confused by the diff hunks in pg_mule2wchar_with_len,
>>> though. Presumably either the old code is right (in which case, don't
>>> change it) or the new code is right (in which case, there's a bug fix
>>> needed here that ought to be discussed and committed separately from
>>> the rest of the patch). Maybe I am missing something.
>>
>> Unfortunately I didn't understand original logic of pg_mule2wchar_with_len.
>> I just did proposal about how it could be. I hope somebody more familiar
>> with this code would clarify this situation.
>
> Well, do you think the current code is buggy, or not?

Probably, but I'm not sure. The conversion seems lossy to me unless I'm missing something about the mule encoding.

------
With best regards,
Alexander Korotkov.
Hello, Ishii-san!
We talked at PGCon about my questions on the mule to wchar conversion. My questions about the pg_mule2wchar_with_len function are the following. In these parts of code:

	else if (IS_LCPRV1(*from) && len >= 3)
	{
		from++;
		*to = *from++ << 16;
		*to |= *from++;
		len -= 3;
	}

and

	else if (IS_LCPRV2(*from) && len >= 4)
	{
		from++;
		*to = *from++ << 16;
		*to |= *from++ << 8;
		*to |= *from++;
		len -= 4;
	}

we skip the first character of the original string. Are we able to restore it back from the pg_wchar?

Also, in this part of code we're shifting the first byte by 16 bits:

	if (IS_LC1(*from) && len >= 2)
	{
		*to = *from++ << 16;
		*to |= *from++;
		len -= 2;
	}
	else if (IS_LCPRV1(*from) && len >= 3)
	{
		from++;
		*to = *from++ << 16;
		*to |= *from++;
		len -= 3;
	}

Why don't we shift it by 8 bits?

You can see my patch in this thread, where I propose purely mechanical changes in this function which make the inverse conversion possible.

------
With best regards,
Alexander Korotkov.
Hi Alexander,

It was good seeing you in Ottawa!

> We've talked on PGCon that I've questions about mule to wchar
> conversion. My questions about pg_mule2wchar_with_len function are
> following. In these parts of code:
>
> else if (IS_LCPRV1(*from) && len >= 3)
> {
> 	from++;
> 	*to = *from++ << 16;
> 	*to |= *from++;
> 	len -= 3;
> }
>
> and
>
> else if (IS_LCPRV2(*from) && len >= 4)
> {
> 	from++;
> 	*to = *from++ << 16;
> 	*to |= *from++ << 8;
> 	*to |= *from++;
> 	len -= 4;
> }
>
> we skip first character of original string. Are we able to restore it back
> from pg_wchar?

I think it's possible. The first characters are defined like this:

#define IS_LCPRV1(c)	((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
#define IS_LCPRV2(c)	((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)

It seems IS_LCPRV1 is not used in any of the PostgreSQL supported encodings at this point, which means there's zero chance that existing databases include LCPRV1. So you could safely ignore it.

For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5) in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c, and it is fixed to 0x9d. So you can always restore the value to 0x9d.

> Also in this part of code we're shifting first byte by 16 bits:
>
> if (IS_LC1(*from) && len >= 2)
> {
> 	*to = *from++ << 16;
> 	*to |= *from++;
> 	len -= 2;
> }
> else if (IS_LCPRV1(*from) && len >= 3)
> {
> 	from++;
> 	*to = *from++ << 16;
> 	*to |= *from++;
> 	len -= 3;
> }
>
> Why don't we shift it by 8 bits?

Because we want the first byte of the LC1 case to be placed in the second byte of the wchar, i.e.:

0th byte: always 0
1st byte: leading byte (the first byte of the multibyte)
2nd byte: always 0
3rd byte: the second byte of the multibyte

Note that we always assume that the 1st byte (called the "leading byte", LB in short) represents the id of the character set (from 0x81 to 0xff) in MULE INTERNAL encoding. For the mapping between LB and charsets, see pg_wchar.h.

> You can see my patch in this thread where I propose purely mechanical
> changes in this function which make inverse conversion possible.
On Tue, May 22, 2012 at 11:50 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
> I think it's possible. The first characters are defined like this:
>
> #define IS_LCPRV1(c)	((unsigned char)(c) == 0x9a || (unsigned char)(c) == 0x9b)
> #define IS_LCPRV2(c)	((unsigned char)(c) == 0x9c || (unsigned char)(c) == 0x9d)
>
> It seems IS_LCPRV1 is not used in any of PostgreSQL supported
> encodings at this point, that means there's 0 chance which existing
> databases include LCPRV1. So you could safely ignore it.
>
> For IS_LCPRV2, it is only used for Chinese encodings (EUC_TW and BIG5)
> in backend/utils/mb/conversion_procs/euc_tw_and_big5/euc_tw_and_big5.c
> and it is fixed to 0x9d. So you can always restore the value to 0x9d.
>
>> Also in this part of code we're shifting first byte by 16 bits:
>>
>> if (IS_LC1(*from) && len >= 2)
>> {
>> 	*to = *from++ << 16;
>> 	*to |= *from++;
>> 	len -= 2;
>> }
>> else if (IS_LCPRV1(*from) && len >= 3)
>> {
>> 	from++;
>> 	*to = *from++ << 16;
>> 	*to |= *from++;
>> 	len -= 3;
>> }
>>
>> Why don't we shift it by 8 bits?
>
> Because we want the first byte of LC1 case to be placed in the second
> byte of wchar. i.e.
>
> 0th byte: always 0
> 1th byte: leading byte (the first byte of the multibyte)
> 2th byte: always 0
> 3th byte: the second byte of the multibyte
>
> Note that we always assume that the 1th byte (called "leading byte":
> LB in short) represents the id of the character set (from 0x81 to
> 0xff) in MULE INTERNAL encoding. For the mapping between LB and
> charsets, see pg_wchar.h.

Thanks for your comments. They clarify a lot.

But I still don't see how we can distinguish IS_LCPRV2 from IS_LC2. Isn't it possible for them to produce the same pg_wchar?

------
With best regards,
Alexander Korotkov.
> Thanks for your comments. They clarify a lot.
> But I still don't realize how can we distinguish IS_LCPRV2 and IS_LC2?
> Isn't it possible for them to produce same pg_wchar?

If LB is in the 0x90 - 0x99 range, then they are LC2.
If LB is in the 0xf0 - 0xff range, then they are LCPRV2.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Tue, May 22, 2012 at 3:27 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Thanks for your comments. They clarify a lot.
>> But I still don't realize how can we distinguish IS_LCPRV2 and IS_LC2?
>> Isn't it possible for them to produce same pg_wchar?
>
> If LB is in 0x90 - 0x99 range, then they are LC2.
> If LB is in 0xf0 - 0xff range, then they are LCPRV2.

Thanks. I rewrote the inverse conversion from pg_wchar to mule. A new version of the patch is attached.

------
With best regards,
Alexander Korotkov.
Attachments
> Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
> patch is attached.

[forgot to cc: the list]

I looked into your patch, especially:

	pg_wchar2euc_with_len(const pg_wchar *from, unsigned char *to, int len)

I think there's a small room to enhance the function:

	if (*from >> 24)
	{
		*to++ = *from >> 24;
		*to++ = (*from >> 16) & 0xFF;
		*to++ = (*from >> 8) & 0xFF;
		*to++ = *from & 0xFF;
		cnt += 4;
	}

Since the function walks through this for every single wchar, something like:

	if ((c = *from >> 24))
	{
		*to++ = c;
		*to++ = (*from >> 16) & 0xFF;
		*to++ = (*from >> 8) & 0xFF;
		*to++ = *from & 0xFF;
		cnt += 4;
	}

will save a few cycles (I'm not sure whether the optimizer produces similar code anyway, though).

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Thu, May 24, 2012 at 12:04 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> Thanks. I rewrote inverse conversion from pg_wchar to mule. New version of
> patch is attached.

Review:

It looks to me like pg_wchar2utf_with_len will not work, because unicode_to_utf8 returns its second argument unmodified - not, as your code seems to assume, the byte following what was already written.

MULE also looks problematic. The code that you've written isn't symmetric with the opposite conversion, unlike what you did in all other cases, and I don't understand why. I'm also somewhat baffled by the reverse conversion: it treats a multi-byte sequence beginning with a byte for which IS_LCPRV1(x) returns true as invalid if there are less than 3 bytes available, but it only reads two; similarly, for IS_LCPRV2(x), it demands 4 bytes but converts only 3.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Jun 27, 2012 at 11:35 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> It looks to me like pg_wchar2utf_with_len will not work, because
> unicode_to_utf8 returns its second argument unmodified - not, as your
> code seems to assume, the byte following what was already written.

Fixed.

> MULE also looks problematic. The code that you've written isn't
> symmetric with the opposite conversion, unlike what you did in all
> other cases, and I don't understand why. I'm also somewhat baffled by
> the reverse conversion: it treats a multi-byte sequence beginning with
> a byte for which IS_LCPRV1(x) returns true as invalid if there are
> less than 3 bytes available, but it only reads two; similarly, for
> IS_LCPRV2(x), it demands 4 bytes but converts only 3.

Should we keep the existing pg_wchar representation for the MULE encoding? Probably we can modify it as in version 0.1 of the patch, in order to make it more transparent.

------
With best regards,
Alexander Korotkov.

Attachments
On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> MULE also looks problematic. The code that you've written isn't
>> symmetric with the opposite conversion, unlike what you did in all
>> other cases, and I don't understand why. I'm also somewhat baffled by
>> the reverse conversion: it treats a multi-byte sequence beginning with
>> a byte for which IS_LCPRV1(x) returns true as invalid if there are
>> less than 3 bytes available, but it only reads two; similarly, for
>> IS_LCPRV2(x), it demands 4 bytes but converts only 3.
>
> Should we save existing pg_wchar representation for MULE encoding? Probably,
> we can modify it like in 0.1 version of patch in order to make it more
> transparent.

Changing the encoding would break pg_upgrade, so -1 from me on that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 2, 2012 at 8:12 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Should we save existing pg_wchar representation for MULE encoding? Probably,
>> we can modify it like in 0.1 version of patch in order to make it more
>> transparent.
>
> Changing the encoding would break pg_upgrade, so -1 from me on that.

I didn't realize that we store pg_wchar on disk anywhere. I thought it was only an in-memory representation. Where do we store pg_wchar on disk?

------
With best regards,
Alexander Korotkov.
On Mon, Jul 2, 2012 at 4:33 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> I didn't realize that we store pg_wchar on disk somewhere. I thought it is
> only in-memory representation. Where do we store pg_wchar on disk?

OK, now I'm confused. I was thinking (incorrectly) that you were talking about changing the multibyte encoding, which of course is saved on disk all over the place. Changing the wchar encoding is a different kettle of fish, and I have no idea what that would or would not break. But I don't see why we'd want to do such a thing. We just need to make the MB->WCHAR and WCHAR->MB transformations mirror images of each other; why is that hard?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jul 3, 2012 at 12:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, now I'm confused. I was thinking (incorrectly) that you were
> talking about changing the multibyte encoding, which of course is
> saved on disk all over the place. Changing the wchar encoding is a
> different kettle of fish, and I have no idea what that would or would
> not break. But I don't see why we'd want to do such a thing. We just
> need to make the MB->WCHAR and WCHAR->MB transformations mirror images
> of each other; why is that hard?

So, I provided such a transformation in versions 0.3 and 0.4, based on the explanation from Tatsuo Ishii. The problem is that both conversions are nontrivial, and it's not evident that they are mirror images (seeing that they are requires some additional assumptions about the encodings, which are not evident from the transformations themselves). I thought you mentioned that problem two messages back.
------
With best regards,
Alexander Korotkov.
On Mon, Jul 2, 2012 at 4:46 PM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> So, I provided such a transformation in versions 0.3 and 0.4, based on
> the explanation from Tatsuo Ishii. The problem is that both conversions
> are nontrivial, and it's not evident that they are mirror images (seeing
> that they are requires some additional assumptions about the encodings,
> not evident from the transformations themselves). I thought you
> mentioned that problem two messages back.

Yeah, I did. I think I may be a bit confused here, so let me try to understand this a bit better. It seems like pg_mule2wchar_with_len uses the following algorithm:

- If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored with shifts of 16 and 0.
- If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes, skipping the first one and storing the remaining two with shifts of 16 and 0.
- If the first character IS_LC2 (0x90-0x99), decode three bytes, stored with shifts of 16, 8, and 0.
- If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes, skipping the first one and storing the remaining three with shifts of 16, 8, and 0.

In the reverse transformation implemented by pg_wchar2mule_with_len, if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or 3 bytes, respectively, exactly as I would expect. ASCII decoding is also as I would expect. The case I don't understand is what happens when the leading byte of the multibyte character was IS_LCPRV1 or IS_LCPRV2. In that case, we ought to decode three bytes if it was IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we always decode 4 bytes. That implies that the IS_LCPRV1() case in pg_mule2wchar_with_len is dead code, and that any 4-byte characters are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the comment there is driving at, but it's not too clear to me.

Am I close?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
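[Editor's note: Robert's four bullet points can be sketched as a small standalone packing function. This is a hypothetical simplified version for illustration only; the real pg_mule2wchar_with_len (in src/backend/utils/mb/ in the PostgreSQL tree) also handles ASCII, length limits, and validation. The IS_LC*/IS_LCPRV* macro definitions below just restate the byte ranges from the bullets.]

```c
#include <stdint.h>

/* Byte ranges as summarized in the bullets above (not the real macros). */
#define IS_LC1(c)    ((uint8_t) (c) >= 0x81 && (uint8_t) (c) <= 0x8d)
#define IS_LC2(c)    ((uint8_t) (c) >= 0x90 && (uint8_t) (c) <= 0x99)
#define IS_LCPRV1(c) ((uint8_t) (c) == 0x9a || (uint8_t) (c) == 0x9b)
#define IS_LCPRV2(c) ((uint8_t) (c) == 0x9c || (uint8_t) (c) == 0x9d)

/*
 * Pack one non-ASCII mule multibyte character into a wchar, following the
 * shift layout Robert describes.  Returns the number of source bytes
 * consumed, or 0 if the leading byte matches none of the four cases.
 */
static int
mule_char_to_wchar(const uint8_t *from, uint32_t *to)
{
    if (IS_LC1(from[0]))        /* 2 bytes, shifts 16 and 0 */
    {
        *to = ((uint32_t) from[0] << 16) | from[1];
        return 2;
    }
    if (IS_LCPRV1(from[0]))     /* 3 bytes, first skipped, shifts 16 and 0 */
    {
        *to = ((uint32_t) from[1] << 16) | from[2];
        return 3;
    }
    if (IS_LC2(from[0]))        /* 3 bytes, shifts 16, 8, 0 */
    {
        *to = ((uint32_t) from[0] << 16) | ((uint32_t) from[1] << 8) | from[2];
        return 3;
    }
    if (IS_LCPRV2(from[0]))     /* 4 bytes, first skipped, shifts 16, 8, 0 */
    {
        *to = ((uint32_t) from[1] << 16) | ((uint32_t) from[2] << 8) | from[3];
        return 4;
    }
    return 0;
}
```

For example, an LC2 sequence {0x92, 0xb0, 0xa1} packs to 0x92b0a1 (3 bytes consumed), and an LCPRV2 sequence {0x9d, 0xf6, 0xb0, 0xa1} packs to 0xf6b0a1 (4 bytes consumed, leading 0x9d dropped) — which is exactly why the reverse direction needs extra information to restore the leading byte.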
> Yeah, I did. I think I may be a bit confused here, so let me try to
> understand this a bit better. It seems like pg_mule2wchar_with_len
> uses the following algorithm:
>
> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
> with shifts of 16 and 0.
> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
> skipping the first one and storing the remaining two with shifts of 16
> and 0.
> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
> stored with shifts of 16, 8, and 0.
> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
> skipping the first one and storing the remaining three with shifts of
> 16, 8, and 0.

Correct.

> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
> also as I would expect. The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2. In that case, we ought to decode three bytes if it was
> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
> always decode 4 bytes. That implies that the IS_LCPRV1() case in
> pg_mule2wchar_with_len is dead code,

Yes, dead code unless we want to support the following encodings in the future (from include/mb/pg_wchar.h):

#define LC_SISHENG             0xa0 /* Chinese SiSheng characters for
                                     * PinYin/ZhuYin (not supported) */
#define LC_IPA                 0xa1 /* IPA (International Phonetic Association)
                                     * (not supported) */
#define LC_VISCII_LOWER        0xa2 /* Vietnamese VISCII1.1 lower-case (not
                                     * supported) */
#define LC_VISCII_UPPER        0xa3 /* Vietnamese VISCII1.1 upper-case (not
                                     * supported) */
#define LC_ARABIC_DIGIT        0xa4 /* Arabic digit (not supported) */
#define LC_ARABIC_1_COLUMN     0xa5 /* Arabic 1-column (not supported) */
#define LC_ASCII_RIGHT_TO_LEFT 0xa6 /* ASCII (left half of ISO8859-1) with
                                     * right-to-left direction (not
                                     * supported) */
#define LC_LAO                 0xa7 /* Lao characters (ISO10646 0E80..0EDF)
                                     * (not supported) */
#define LC_ARABIC_2_COLUMN     0xa8 /* Arabic 2-column (not supported) */

> and that any 4 byte characters
> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
> comment there is driving at, but it's not too clear to me.

Yes, that's because we only support EUC_TW and BIG5, which use IS_LCPRV2 in the mule internal encoding, as stated in the comment.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Robert Haas <robertmhaas@gmail.com> writes:
> In the reverse transformation implemented by pg_wchar2mule_with_len,
> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
> also as I would expect. The case I don't understand is what happens
> when the leading byte of the multibyte character was IS_LCPRV1 or
> IS_LCPRV2.

Some inspection of pg_wchar.h suggests that the IS_LCPRV1 and IS_LCPRV2 cases are unused: the file doesn't define any encoding labels that match the byte values they accept, nor do the comments suggest that Emacs has any such labels either. If true, it would not be much of a stretch to believe that any code claiming to support these cases could be broken.

regards, tom lane
I wrote:
> Some inspection of pg_wchar.h suggests that the IS_LCPRV1 and IS_LCPRV2
> cases are unused: the file doesn't define any encoding labels that match
> the byte values they accept, nor do the comments suggest that Emacs has
> any such labels either.

Scratch that --- I was misled by the fond illusion that our code wouldn't use magic hex literals for encoding labels. Stuff like this:

    /* 0x9d means LCPRV2 */
    if (c1 == LC_CNS11643_1 || c1 == LC_CNS11643_2 || c1 == 0x9d)

seems to me to be well below the minimum acceptable quality standards for Postgres code.

Having said that, grepping src/backend/utils/mb/conversion_procs/ reveals no sign that 0x9a, 0x9b, or 0x9c are used anywhere with the meanings that the IS_LCPRV1 and IS_LCPRV2 macros assign to them. Furthermore, AFAICS the 0x9d case is only used in euc_tw_and_big5/, with the following byte being one of the LC_CNS11643_[3-7] constants. Given that these constants are treading on encoding ID namespace that Emacs upstream might someday decide to assign, I think we'd be well advised to *not* start installing any code that thinks that 9a-9c mean something.

regards, tom lane
On Mon, Jul 2, 2012 at 7:33 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> Yeah, I did. I think I may be a bit confused here, so let me try to
>> understand this a bit better. It seems like pg_mule2wchar_with_len
>> uses the following algorithm:
>>
>> - If the first character IS_LC1 (0x81-0x8d), decode two bytes, stored
>> with shifts of 16 and 0.
>> - If the first character IS_LCPRV1 (0x9a-0x9b), decode three bytes,
>> skipping the first one and storing the remaining two with shifts of 16
>> and 0.
>> - If the first character IS_LC2 (0x90-0x99), decode three bytes,
>> stored with shifts of 16, 8, and 0.
>> - If the first character IS_LCPRV2 (0x9c-0x9d), decode four bytes,
>> skipping the first one and storing the remaining three with shifts of
>> 16, 8, and 0.
>
> Correct.
>
>> In the reverse transformation implemented by pg_wchar2mule_with_len,
>> if the byte stored with shift 16 IS_LC1 or IS_LC2, then we decode 2 or
>> 3 bytes, respectively, exactly as I would expect. ASCII decoding is
>> also as I would expect. The case I don't understand is what happens
>> when the leading byte of the multibyte character was IS_LCPRV1 or
>> IS_LCPRV2. In that case, we ought to decode three bytes if it was
>> IS_LCPRV1 and four bytes if it was IS_LCPRV2, but actually it seems we
>> always decode 4 bytes. That implies that the IS_LCPRV1() case in
>> pg_mule2wchar_with_len is dead code,
>
> Yes, dead code unless we want to support the following encodings in the
> future (from include/mb/pg_wchar.h; LC_SISHENG through LC_ARABIC_2_COLUMN,
> 0xa0-0xa8, all marked "not supported").
>
>> and that any 4 byte characters
>> are always of the form 0x9d 0xf? 0x?? 0x??; maybe that's what the
>> comment there is driving at, but it's not too clear to me.
>
> Yes, that's because we only support EUC_TW and BIG5, which use
> IS_LCPRV2 in the mule internal encoding, as stated in the comment.

OK. So, in that case, I suggest that if the leading byte is non-zero, we emit 0x9d followed by the three available bytes, instead of first testing whether the first byte is >= 0xf0. That test seems to serve no purpose but to confuse the issue.

I further suggest that we improve the comments on the mule functions for both wchar->mb and mb->wchar to make all this more clear.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> OK. So, in that case, I suggest that if the leading byte is non-zero,
> we emit 0x9d followed by the three available bytes, instead of first
> testing whether the first byte is >= 0xf0. That test seems to serve
> no purpose but to confuse the issue.

Probably the code should look like this (see below comment):

	else if (lb >= 0xf0 && lb <= 0xfe)
	{
		if (lb <= 0xf4)
			*to++ = 0x9c;
		else
			*to++ = 0x9d;
		*to++ = lb;
		*to++ = (*from >> 8) & 0xff;
		*to++ = *from & 0xff;
		cnt += 4;

> I further suggest that we improve the comments on the mule functions
> for both wchar->mb and mb->wchar to make all this more clear.

I have added comments about the mule internal encoding by refreshing my memory and from an old document found on the web (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string). Please take a look.

BTW, it seems the conversion between multibyte and wchar can round-trip in the case where the leading character is LCPRV2: if the second byte of the wchar (out of the 4 bytes of the wchar; the first byte is always 0x00) is in the range 0xf0 to 0xf4, then the first byte of the multibyte form must be 0x9c. If the second byte of the wchar is in the range 0xf5 to 0xfe, then the first byte of the multibyte form must be 0x9d.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index d456309..1148eb5 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -37,6 +37,31 @@ typedef unsigned int pg_wchar;
 #define ISSJISTAIL(c) (((c) >= 0x40 && (c) <= 0x7e) || ((c) >= 0x80 && (c) <= 0xfc))
 
 /*
+ * Currently PostgreSQL supports 5 types of mule internal encodings:
+ *
+ * 1) 1-byte ASCII characters, each byte is below 0x7f.
+ *
+ * 2) "Official" single byte charsets such as ISO 8859 Latin 1.  Each
+ *    mule character consists of 2 bytes: LC1 + C1, where LC1
+ *    corresponds to each charset and is in the range 0x81 to 0x8d, and
+ *    C1 is in the range 0xa0 to 0xff (ISO 8859-1 for example, with
+ *    each high bit set).
+ *
+ * 3) "Private" single byte charsets such as SISHENG.  Each mule
+ *    character consists of 3 bytes: LCPRV1 + LC12 + C1, where LCPRV1
+ *    is either 0x9a (if LC12 is in the range 0xa0 to 0xdf) or 0x9b (if
+ *    LC12 is in the range 0xe0 to 0xef).
+ *
+ * 4) "Official" multibyte charsets such as JIS X0208.  Each mule
+ *    character consists of 3 bytes: LC2 + C1 + C2, where LC2
+ *    corresponds to each charset and is in the range 0x90 to 0x99.
+ *    C1 and C2 are in the range 0xa0 to 0xff (each high bit set).
+ *
+ * 5) "Private" multibyte charsets such as CNS 11643-1992 Plane 3.
+ *    Each mule character consists of 4 bytes: LCPRV2 + LC22 + C1 +
+ *    C2, where LCPRV2 is either 0x9c (if LC22 is in the range 0xf0 to
+ *    0xf4) or 0x9d (if LC22 is in the range 0xf5 to 0xfe).
+ *
  * Leading byte types or leading prefix byte for MULE internal code.
  * See http://www.xemacs.org for more details. (there is a doc titled
  * "XEmacs Internals Manual", "MULE Character Sets and Encodings"
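[Editor's note: the five formats in the comment patch above determine the total byte length of a mule character from its leading byte alone. A hypothetical standalone sketch of that mapping, with byte ranges taken from the comment rather than from the actual header:]

```c
/*
 * Expected total byte length of one mule-internal character, keyed on
 * its leading byte, per the five formats described above.  Sketch for
 * illustration; ranges are those quoted in the comment patch.
 */
static int
mule_char_len(unsigned char lb)
{
    if (lb <= 0x7f)
        return 1;               /* 1) ASCII */
    if (lb >= 0x81 && lb <= 0x8d)
        return 2;               /* 2) official single-byte: LC1 + C1 */
    if (lb == 0x9a || lb == 0x9b)
        return 3;               /* 3) private single-byte: LCPRV1 + LC12 + C1 */
    if (lb >= 0x90 && lb <= 0x99)
        return 3;               /* 4) official multibyte: LC2 + C1 + C2 */
    if (lb == 0x9c || lb == 0x9d)
        return 4;               /* 5) private multibyte: LCPRV2 + LC22 + C1 + C2 */
    return -1;                  /* not a valid leading byte */
}
```

This is the length check a validator would do before reading the rest of the character; Robert's earlier complaint about IS_LCPRV1/IS_LCPRV2 was precisely that the old code's length demands didn't match the number of bytes actually read.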
On Tue, Jul 3, 2012 at 10:17 AM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>> OK. So, in that case, I suggest that if the leading byte is non-zero,
>> we emit 0x9d followed by the three available bytes, instead of first
>> testing whether the first byte is >= 0xf0. That test seems to serve
>> no purpose but to confuse the issue.
>
> Probably the code should look like this (see below comment):
>
> else if (lb >= 0xf0 && lb <= 0xfe)
> {
>     if (lb <= 0xf4)
>         *to++ = 0x9c;
>     else
>         *to++ = 0x9d;
>     *to++ = lb;
>     *to++ = (*from >> 8) & 0xff;
>     *to++ = *from & 0xff;
>     cnt += 4;

It's likely we also need to assign names to all these numbers (0xf0, 0xf4, 0xfe, 0x9c, 0x9d). But it's hard for me to invent such names.

>> I further suggest that we improve the comments on the mule functions
>> for both wchar->mb and mb->wchar to make all this more clear.
>
> I have added comments about the mule internal encoding by refreshing my
> memory and from an old document found on the web
> (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).
> Please take a look. BTW, it seems the conversion between multibyte and
> wchar can round-trip in the case where the leading character is LCPRV2:
> if the second byte of the wchar (out of the 4 bytes of the wchar; the
> first byte is always 0x00) is in the range 0xf0 to 0xf4, then the first
> byte of the multibyte form must be 0x9c. If the second byte of the wchar
> is in the range 0xf5 to 0xfe, then the first byte of the multibyte form
> must be 0x9d.

Should I integrate these code changes into my patch? Or would we like to commit them first?

------
With best regards,
Alexander Korotkov.
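[Editor's note: Tatsuo's round-trip observation can be sketched as a tiny standalone helper. The function name is hypothetical; the byte values 0x9c/0x9d and the 0xf0-0xf4 vs. 0xf5-0xfe split are the ones quoted in the thread.]

```c
#include <stdint.h>

/*
 * Sketch: recover the mule leading byte for a "private" multibyte
 * (LCPRV2) character from its pg_wchar form.  The charset ID sits in
 * bits 16-23 of the wchar; per the rule above, IDs 0xf0-0xf4 take
 * leading byte 0x9c and IDs 0xf5-0xfe take 0x9d, so WCHAR->MB can be
 * an exact mirror of MB->WCHAR even though the leading byte itself is
 * not stored in the wchar.
 */
static uint8_t
lcprv2_leading_byte(uint32_t wc)
{
    uint8_t id = (wc >> 16) & 0xff; /* charset ID byte */

    if (id >= 0xf0 && id <= 0xf4)
        return 0x9c;
    if (id >= 0xf5 && id <= 0xfe)
        return 0x9d;
    return 0;                   /* not an LCPRV2 charset ID */
}
```

For instance, a wchar 0x00f6b0a1 (charset ID 0xf6, i.e. CNS 11643-1992 Plane 3 per the constants in this thread) would get leading byte 0x9d, matching the `else` branch of the code fragment Tatsuo posted.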
Alexander Korotkov <aekorotkov@gmail.com> writes:
> It's likely we also need to assign some names to all these numbers
> (0xf0, 0xf4, 0xfe, 0x9c, 0x9d). But it's hard for me to invent such names.

The encoding ID byte values already have names (see pg_wchar.h), but the private prefix bytes don't. I griped about that upthread. I agree this code needs some basic readability cleanup that's independent of your feature addition. It'd likely be reasonable to do that as a separate patch.

regards, tom lane
> I have added comments about the mule internal encoding by refreshing my
> memory and from an old document found on the web
> (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).

Any objection to applying my patch?

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Tatsuo Ishii <ishii@postgresql.org> writes:
>> I have added comments about the mule internal encoding by refreshing my
>> memory and from an old document found on the web
>> (http://mibai.tec.u-ryukyu.ac.jp/cgi-bin/info2www?%28mule%29Buffer%20and%20string).
> Any objection to applying my patch?

It needs a bit of copy-editing, and I think we need to do more than just add comments: the various byte values should have #defines so that you can grep for code usages. I'll see what I can do with it.

regards, tom lane
I wrote:
> Tatsuo Ishii <ishii@postgresql.org> writes:
>>> I have added comments about mule internal encoding by refreshing my
>>> memory and from an old document found on the web.
>> Any objection to apply my patch?
> It needs a bit of copy-editing, and I think we need to do more than just
> add comments: the various byte values should have #defines so that you
> can grep for code usages. I'll see what I can do with it.

I cleaned up the comments in pg_wchar.h some more, added #define symbols for the LCPRVn marker codes, and committed it.

So far as I can see, the only LCPRVn marker code that is actually in use right now is 0x9d --- there are no instances of 9a, 9b, or 9c that I can find.

I also read in the xemacs internals doc, at http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145 that XEmacs thinks the marker code for private single-byte charsets is 0x9e (only) and that for private multi-byte charsets is 0x9f (only); moreover they think 0x9a-0x9d are potential future official multibyte charset codes. I don't know how we got to the current state of using 0x9a-0x9d as private charset markers, but it seems pretty inconsistent with XEmacs.

Since only 0x9d could possibly be on-disk anywhere at the moment (unless I'm missing something), I think we would be well advised to redefine our marker codes thus:

    LCPRV1  0x9e (only)  (matches XEmacs spec)
    LCPRV2  0x9d (only)  (doesn't match XEmacs, but too late to change)

This would simplify and speed up the IS_LCPRVn macros, simplify the conversions that Alexander is worried about, and get us out from under the risk that XEmacs will assign their next three official multibyte encoding IDs. We're still in trouble if they ever get to 0x9d, but since that's the last code they have, I bet they will be in no hurry to use it up.

regards, tom lane
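[Editor's note: concretely, Tom's proposal shrinks each IS_LCPRVn test from a two-value check to a single equality. The _OLD/_NEW suffixes below are illustrative only, not names from the tree, and the _NEW definitions are a sketch of the proposal rather than what was ultimately committed.]

```c
/* Current definitions: each macro accepts two marker byte values. */
#define IS_LCPRV1_OLD(c)    ((c) == 0x9a || (c) == 0x9b)
#define IS_LCPRV2_OLD(c)    ((c) == 0x9c || (c) == 0x9d)

/*
 * Tom's proposed simplification: one marker value each.
 * LCPRV1 moves to 0x9e (matching the XEmacs spec); LCPRV2 stays at
 * 0x9d, the only value that can already be on disk.
 */
#define IS_LCPRV1_NEW(c)    ((c) == 0x9e)
#define IS_LCPRV2_NEW(c)    ((c) == 0x9d)
```

Besides being marginally faster, the single-value form removes the ambiguity that the WCHAR->MB direction currently has to resolve: with only one possible marker per class, the leading byte can be regenerated unconditionally.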
> So far as I can see, the only LCPRVn marker code that is actually in
> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
> that I can find.
>
> I also read in the xemacs internals doc, at
> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
> that XEmacs thinks the marker code for private single-byte charsets
> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
> moreover they think 0x9a-0x9d are potential future official multibyte
> charset codes. I don't know how we got to the current state of using
> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
> with XEmacs.

At the time when the mule internal code was introduced to PostgreSQL, xemacs did not have multi-encoding capability, and mule (a patch to emacs) was the only implementation allowed to use multiple encodings. So I used the specification of mule documented in the URL I wrote.

> Since only 0x9d could possibly be on-disk anywhere at the moment (unless
> I'm missing something), I think we would be well advised to redefine our
> marker codes thus:
>
>     LCPRV1  0x9e (only)  (matches XEmacs spec)
>     LCPRV2  0x9d (only)  (doesn't match XEmacs, but too late to change)
>
> This would simplify and speed up the IS_LCPRVn macros, simplify the
> conversions that Alexander is worried about, and get us out from under
> the risk that XEmacs will assign their next three official multibyte
> encoding IDs. We're still in trouble if they ever get to 0x9d, but
> since that's the last code they have, I bet they will be in no hurry
> to use it up.
On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
> [ new patch ]

With the improved comments in pg_wchar.h, it seemed clear what needed to be done here, so I fixed up the MULE conversion and committed this. I'd appreciate it if someone would check my work, but I think it's right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> [ new patch ]
>
> With the improved comments in pg_wchar.h, it seemed clear what needed
> to be done here, so I fixed up the MULE conversion and committed this.
> I'd appreciate it if someone would check my work, but I think it's
> right.

Your commit looks good to me.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>> [ new patch ]
> With the improved comments in pg_wchar.h, it seemed clear what needed
> to be done here, so I fixed up the MULE conversion and committed this.
> I'd appreciate it if someone would check my work, but I think it's
> right.

Hm, several of these routines seem to neglect to advance the "from" pointer?

regards, tom lane
Tatsuo Ishii <ishii@postgresql.org> writes:
>> So far as I can see, the only LCPRVn marker code that is actually in
>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>> that I can find.
>>
>> I also read in the xemacs internals doc, at
>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145
>> that XEmacs thinks the marker code for private single-byte charsets
>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only);
>> moreover they think 0x9a-0x9d are potential future official multibyte
>> charset codes. I don't know how we got to the current state of using
>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent
>> with XEmacs.

> At the time when the mule internal code was introduced to PostgreSQL,
> xemacs did not have multi-encoding capability, and mule (a patch to
> emacs) was the only implementation allowed to use multiple encodings.
> So I used the specification of mule documented in the URL I wrote.

I see. Given that upstream has decided that a simpler definition is more appropriate, is there any reason not to follow their lead, to the extent that we can do so without breaking existing on-disk data?

regards, tom lane
On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Jul 1, 2012 at 5:11 AM, Alexander Korotkov <aekorotkov@gmail.com> wrote:
>>> [ new patch ]
>
>> With the improved comments in pg_wchar.h, it seemed clear what needed
>> to be done here, so I fixed up the MULE conversion and committed this.
>> I'd appreciate it if someone would check my work, but I think it's
>> right.
>
> Hm, several of these routines seem to neglect to advance the "from"
> pointer?

Err... yeah. That's not a bug I introduced, but I should have caught it... and it does make me wonder how well this code was tested.

Does the attached look like an appropriate fix?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, several of these routines seem to neglect to advance the "from"
>> pointer?

> Err... yeah. That's not a bug I introduced, but I should have caught
> it... and it does make me wonder how well this code was tested.

> Does the attached look like an appropriate fix?

I'd be inclined to put the from++ and len-- at the bottom of each loop, and in that order every time, just for consistency and obviousness. But yeah, that's basically what's needed.

regards, tom lane
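[Editor's note: the loop shape Tom is asking for might look like the toy single-byte copier below. This is illustrative only, not the actual conversion code; it just shows the pointer/length bookkeeping kept together at the bottom of the loop, so no branch through the body can forget to advance `from`.]

```c
#include <stddef.h>

/*
 * Toy "conversion" loop illustrating the bug class under discussion:
 * if the body forgets "from++", every iteration re-reads the first
 * input character and the loop never terminates on non-empty input.
 * Advancing the pointer and decrementing the remaining length together,
 * at the bottom, in the same order in every loop, keeps that obvious.
 */
static size_t
copy_chars(const unsigned char *from, unsigned char *to, size_t len)
{
    size_t cnt = 0;

    while (len > 0 && *from)
    {
        *to++ = *from;          /* "decode" one single-byte character */
        cnt++;

        from++;                 /* advance source ... */
        len--;                  /* ... and shrink remaining length */
    }
    return cnt;
}
```

Copying the string "abc" with a budget of 8 bytes yields 3 characters; with a budget of 2 it stops after 2, since `len` is decremented in lockstep with `from`.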
> Tatsuo Ishii <ishii@postgresql.org> writes:
>>> So far as I can see, the only LCPRVn marker code that is actually in
>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c
>>> that I can find. [...]
>
>> At the time when the mule internal code was introduced to PostgreSQL,
>> xemacs did not have multi-encoding capability, and mule (a patch to
>> emacs) was the only implementation allowed to use multiple encodings.
>> So I used the specification of mule documented in the URL I wrote.
>
> I see. Given that upstream has decided that a simpler definition is
> more appropriate, is there any reason not to follow their lead, to the
> extent that we can do so without breaking existing on-disk data?

Please let me spend the weekend understanding their latest spec.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On Thu, Jul 5, 2012 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jul 5, 2012 at 7:11 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Hm, several of these routines seem to neglect to advance the "from"
>>> pointer?
>
>> Err... yeah. That's not a bug I introduced, but I should have caught
>> it... and it does make me wonder how well this code was tested.
>
>> Does the attached look like an appropriate fix?
>
> I'd be inclined to put the from++ and len-- at the bottom of each loop,
> and in that order every time, just for consistency and obviousness.
> But yeah, that's basically what's needed.

OK, I've committed a slightly tweaked version of that patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>> I see. Given that upstream has decided that a simpler definition is
>> more appropriate, is there any reason not to follow their lead, to the
>> extent that we can do so without breaking existing on-disk data?
>
> Please let me spend the weekend understanding their latest spec.

This is an intermediate report on the internal multi-byte charset implementation of the emacsen. I have read the link Tom showed. I also made a quick scan of the xemacs-21.4.0 source code, especially xemacs-21.4.0/src/mule-charset.h. It seems the web document is essentially a copy of the comments in that file. I also looked into other places in the xemacs code, and I think I can conclude that xemacs 21.4's multi-byte implementation is based on the doc on the web.

Next I looked into the emacs 24.1 source code, because I could not find any doc regarding emacs's (not xemacs's) implementation of the internal multi-byte charsets. I found the following in src/charset.h:

/* Leading-code followed by extended leading-code.    DIMENSION/COLUMN */
#define EMACS_MULE_LEADING_CODE_PRIVATE_11	0x9A /* 1/1 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_12	0x9B /* 1/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_21	0x9C /* 2/2 */
#define EMACS_MULE_LEADING_CODE_PRIVATE_22	0x9D /* 2/2 */

And these are used like this:

/* Read one non-ASCII character from INSTREAM.  The character is
   encoded in `emacs-mule' and the first byte is already read in
   C.  */

static int
read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
{
  :
  :
  else if (len == 3)
    {
      if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
          || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
        {
          charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
          code = buf[2] & 0x7F;
        }

As far as I can tell, this is exactly the same way PostgreSQL handles private single-byte character sets: they consist of 3 bytes, and the leading byte is either 0x9a or 0x9b. Other examples regarding private single-byte/multi-byte charsets can be seen in coding.c.

As far as I can tell, emacs and xemacs employ different implementations of the multi-byte charsets with regard to "private" charsets. Emacs's is the same as PostgreSQL's, while xemacs's is not. I am contacting the original Mule author to see if he knows anything about this.

BTW, while looking into emacs's source code, I found their charset definitions in lisp/international/mule-conf.el. According to the file, several new charsets have been added. Included is a patch to follow their changes. This makes no changes to current behavior, since the patch just changes some comments and non-supported charsets.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

diff --git a/src/include/mb/pg_wchar.h b/src/include/mb/pg_wchar.h
index 1bcdfbc..e44749e 100644
--- a/src/include/mb/pg_wchar.h
+++ b/src/include/mb/pg_wchar.h
@@ -108,7 +108,7 @@ typedef unsigned int pg_wchar;
 #define LC_KOI8_R		0x8b	/* Cyrillic KOI8-R */
 #define LC_ISO8859_5		0x8c	/* ISO8859 Cyrillic */
 #define LC_ISO8859_9		0x8d	/* ISO8859 Latin 5 (not supported yet) */
-/* #define FREE		0x8e	free (unused) */
+#define LC_ISO8859_15		0x8e	/* ISO8859 Latin 15 (not supported yet) */
 /* #define CONTROL_1	0x8f	control characters (unused) */
 
 /* Is a leading byte for "official" single byte encodings? */
@@ -119,14 +119,13 @@ typedef unsigned int pg_wchar;
  * 0x9a-0x9d are free. 0x9e and 0x9f are reserved.
  */
 #define LC_JISX0208_1978	0x90	/* Japanese Kanji, old JIS (not supported) */
-/* #define FREE		0x90	free (unused) */
 #define LC_GB2312_80		0x91	/* Chinese */
 #define LC_JISX0208		0x92	/* Japanese Kanji (JIS X 0208) */
 #define LC_KS5601		0x93	/* Korean */
 #define LC_JISX0212		0x94	/* Japanese Kanji (JIS X 0212) */
 #define LC_CNS11643_1		0x95	/* CNS 11643-1992 Plane 1 */
 #define LC_CNS11643_2		0x96	/* CNS 11643-1992 Plane 2 */
-/* #define FREE		0x97	free (unused) */
+#define LC_JISX0213_1		0x97	/* Japanese Kanji (JIS X 0213 Plane 1) (not supported) */
 #define LC_BIG5_1		0x98	/* Plane 1 Chinese traditional (not supported) */
 #define LC_BIG5_2		0x99	/* Plane 1 Chinese traditional (not supported) */
 
@@ -184,6 +183,12 @@ typedef unsigned int pg_wchar;
 					 * (not supported) */
 #define LC_TIBETAN_1_COLUMN	0xf1	/* Tibetan 1-column width glyphs
 					 * (not supported) */
+#define LC_UNICODE_SUBSET_2	0xf2	/* Unicode characters of the range U+2500..U+33FF
+					 * (not supported) */
+#define LC_UNICODE_SUBSET_3	0xf3	/* Unicode characters of the range U+E000..U+FFFF
+					 * (not supported) */
+#define LC_UNICODE_SUBSET	0xf4	/* Unicode characters of the range U+0100..U+24FF
+					 * (not supported) */
 #define LC_ETHIOPIC		0xf5	/* Ethiopic characters (not supported) */
 #define LC_CNS11643_3		0xf6	/* CNS 11643-1992 Plane 3 */
 #define LC_CNS11643_4		0xf7	/* CNS 11643-1992 Plane 4 */
>>> Tatsuo Ishii <ishii@postgresql.org> writes: >>>>> So far as I can see, the only LCPRVn marker code that is actually in >>>>> use right now is 0x9d --- there are no instances of 9a, 9b, or 9c >>>>> that I can find. >>>>> >>>>> I also read in the xemacs internals doc, at >>>>> http://www.xemacs.org/Documentation/21.5/html/internals_26.html#SEC145 >>>>> that XEmacs thinks the marker code for private single-byte charsets >>>>> is 0x9e (only) and that for private multi-byte charsets is 0x9f (only); >>>>> moreover they think 0x9a-0x9d are potential future official multibyte >>>>> charset codes. I don't know how we got to the current state of using >>>>> 0x9a-0x9d as private charset markers, but it seems pretty inconsistent >>>>> with XEmacs. >>> >>>> At the time when mule internal code was introduced to PostgreSQL, >>>> xemacs did not have multi encoding capabilty and mule (a patch to >>>> emacs) was the only implementation allowed to use multi encoding. So I >>>> used the specification of mule documented in the URL I wrote. >>> >>> I see. Given that upstream has decided that a simpler definition is >>> more appropriate, is there any reason not to follow their lead, to the >>> extent that we can do so without breaking existing on-disk data? >> >> Please let me spend week end to understand the their latest spec. > > This is an intermediate report on the internal multi-byte charset > implementation of emacen. I have read the link Tom showed. Also I made > a quick scan on xemacs-21.4.0 source code, especially > xemacs-21.4.0/src/mule-charset.h. It seems the web document is > essentially a copy of the comments in the file. Also I looked into > other place of xemacs code and I think I can conclude that xeamcs > 21.4's multi-byte implementation is based on the doc on the web. > > Next I looked into emacs 24.1 source code because I could not find any > doc regarding emacs's(not xemacs's) implementation of internal > multi-byte charset. 
> I found the following in src/charset.h:
>
> /* Leading-code followed by extended leading-code.  DIMENSION/COLUMN */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_11	0x9A /* 1/1 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_12	0x9B /* 1/2 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_21	0x9C /* 2/1 */
> #define EMACS_MULE_LEADING_CODE_PRIVATE_22	0x9D /* 2/2 */
>
> And these are used like this:
>
> /* Read one non-ASCII character from INSTREAM.  The character is
>    encoded in `emacs-mule' and the first byte is already read in
>    C.  */
>
> static int
> read_emacs_mule_char (int c, int (*readbyte) (int, Lisp_Object), Lisp_Object readcharfun)
> {
>   :
>   :
>   else if (len == 3)
>     {
>       if (buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_11
>           || buf[0] == EMACS_MULE_LEADING_CODE_PRIVATE_12)
>         {
>           charset = CHARSET_FROM_ID (emacs_mule_charset[buf[1]]);
>           code = buf[2] & 0x7F;
>         }
>
> As far as I can tell, this is exactly how PostgreSQL handles private
> single-byte character sets: they consist of 3 bytes, and the leading
> byte is either 0x9a or 0x9b. Other examples regarding private
> single-byte/multi-byte charsets can be seen in coding.c.
>
> As far as I can tell, emacs and xemacs employ different implementations
> of multi-byte charsets regarding "private" charsets. Emacs's is the
> same as PostgreSQL's, while xemacs's is not. I am contacting the
> original Mule author to ask if he knows anything about this.

I got a reply from the Mule author, Kenichi Handa (the mail is in
Japanese, so I do not quote it here; if somebody wants to read the
original mail, please let me know). First of all, my understanding of
emacs's implementation is correct, according to him. He did not know
about xemacs's implementation. Apparently the implementation of xemacs
was not led by the original Mule author.

So which one of emacs/xemacs should we follow?
My suggestion is not to follow xemacs, and to leave the current
treatment of the private leading bytes as it is, because emacs seems to
be the more "right" upstream compared with xemacs.

> BTW, while looking into emacs's source code, I found their charset
> definitions are in lisp/international/mule-conf.el. According to that
> file, several new charsets have been added. Included is a patch to
> follow their changes. This makes no change to current behavior, since
> the patch only touches comments and unsupported charsets.

If there's no objection, I would like to commit this. Objection?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
>> As far as I can tell, emacs and xemacs employ different
>> implementations of multi-byte charsets regarding "private" charsets.
>> Emacs's is the same as PostgreSQL's, while xemacs's is not.
>
> So which one of emacs/xemacs should we follow? My suggestion is not
> to follow xemacs, and to leave the current treatment of the private
> leading bytes as it is, because emacs seems to be the more "right"
> upstream compared with xemacs.
>
> If there's no objection, I would like to commit this. Objection?

Done, along with a comment that we follow emacs's implementation, not
xemacs's.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Tatsuo Ishii <ishii@postgresql.org> writes:
> Done along with comment that we follow emacs's implementation, not
> xemacs's.

Well, when the preceding comment block contains five references to
xemacs and the link for more information leads to www.xemacs.org,
I don't think it's real helpful to add one sentence saying "oh
by the way we're not actually following xemacs".

I continue to think that we'd be better off to follow the xemacs
spec, as the subdivisions the emacs spec is insisting on seem like
entirely useless complication. The only possible reason for doing
it the emacs way is that it would provide room for twice as many
charset IDs ... but the present design for wchar conversion destroys
that advantage, because it requires the charset ID spaces to be
nonoverlapping anyhow. Moreover, it's not apparent to me that
charset standards are still proliferating, so I doubt that we need
any more ID space.

			regards, tom lane
> Well, when the preceding comment block contains five references to
> xemacs and the link for more information leads to www.xemacs.org,
> I don't think it's real helpful to add one sentence saying "oh
> by the way we're not actually following xemacs".
>
> I continue to think that we'd be better off to follow the xemacs
> spec, as the subdivisions the emacs spec is insisting on seem like
> entirely useless complication. The only possible reason for doing
> it the emacs way is that it would provide room for twice as many
> charset IDs ... but the present design for wchar conversion destroys
> that advantage, because it requires the charset ID spaces to be
> nonoverlapping anyhow. Moreover, it's not apparent to me that
> charset standards are still proliferating, so I doubt that we need
> any more ID space.

Well, we have been following the emacs spec, not the xemacs spec, from
day 0. I don't see any value in switching to the xemacs way at this
point, because I think the reason we support a particular encoding is
to keep supporting existing user data, not to "enhance" our internal
architecture. If you like xemacs's spec, I think you'd better add a
new encoding rather than break data compatibility.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp