Discussion: Illegal SJIS mapping
Hi,

I found a useless entry in utf8_to_sjis.map

> {0xc19c, 0x815f},

which is apparently illegal as UTF-8, and which PostgreSQL deliberately rejects. So it should be removed, and the attached patch does that. 0x815f (SJIS) is also mapped from 0xefbcbc (U+FF3C FULLWIDTH REVERSE SOLIDUS), and that is the right mapping.

By the way, the file comment at the beginning of UCS_to_SJIS.pl is the following.

# Generate UTF-8 <--> SJIS code conversion tables from
# map files provided by Unicode organization.
# Unfortunately it is prohibited by the organization
# to distribute the map files. So if you try to use this script,
# you have to obtain SHIFTJIS.TXT from
# the organization's ftp site.

The file was found at the following place, thanks to Google.

ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/

As the URL shows, and as written in the file Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete, and the *live* definition *may* be found in the Unicode Character Database. But I haven't found any SJIS-related information there.

If I'm not missing anything, the only available authority would be JIS X 0208/0213, but what should be implemented seems to be a possibly modified MS932, for which I don't know the authority.

Anyway, I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and got mapping files quite different from the current ones.

So I wonder how the mappings related to SJIS (and/or EUC-JP) are maintained. If no authoritative information is available, the generating script is no longer usable. If some other authority is chosen, the script has to be modified according to whatever the new source format is.

Any suggestions? Or opinions?
regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

diff --git a/src/backend/utils/mb/Unicode/utf8_to_sjis.map b/src/backend/utils/mb/Unicode/utf8_to_sjis.map
index bcb76c9..47f5fdf 100644
--- a/src/backend/utils/mb/Unicode/utf8_to_sjis.map
+++ b/src/backend/utils/mb/Unicode/utf8_to_sjis.map
@@ -1,5 +1,4 @@
-static const pg_utf_to_local ULmapSJIS[ 7398 ] = {
-  {0xc19c, 0x815f},
+static const pg_utf_to_local ULmapSJIS[ 7397 ] = {
   {0xc2a2, 0x8191},
   {0xc2a3, 0x8192},
   {0xc2a5, 0x5c},
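As a quick sanity check of the claim in the message above, here is a minimal sketch in Python (not part of the patch). It assumes nothing about PostgreSQL's own validation code; the helper name `naive_two_byte_decode` is made up for illustration. It shows that the bytes 0xc19c are an overlong two-byte form whose payload bits spell U+005C, which a strict UTF-8 decoder rejects, while 0xefbcbc decodes cleanly to U+FF3C.

```python
import unicodedata


def naive_two_byte_decode(data: bytes) -> int:
    """Extract the payload of a 2-byte UTF-8 sequence WITHOUT validity checks."""
    lead, trail = data
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


bogus = bytes.fromhex("c19c")

# The payload is U+005C (backslash), which fits in one byte, so encoding it
# in two bytes is an "overlong" form and therefore not valid UTF-8.
overlong_target = naive_two_byte_decode(bogus)

try:
    bogus.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The other source of SJIS 0x815f mentioned in the message is well-formed.
legit = bytes.fromhex("efbcbc").decode("utf-8")

print(f"0xc19c payload: U+{overlong_target:04X}, accepted by strict decoder: {strict_ok}")
print(f"0xefbcbc decodes to U+{ord(legit):04X} {unicodedata.name(legit)}")
```

Python's built-in decoder is used here only as a convenient strict reference; any conforming UTF-8 decoder should reject the 0xc1 lead byte in the same way.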
On 09/07/2016 09:50 AM, Kyotaro HORIGUCHI wrote:
> Hi,
>
> I found an useless entry in utf8_to_sjis.map
>
>> {0xc19c, 0x815f},
>
> which is apparently illegal as UTF-8 which postgresql
> deliberately refuses. So it should be removed and the attached
> patch does that. 0x815f(SJIS) is also mapped from 0xefbcbc(U+FF3C
> FULLWIDTH REVERSE SOLIDUS) and it is a right mapping.

Yes, I think you're right. Committed, thanks!

> By the way, the file comment at the beginning of UCS_to_SJIS.pl
> is the following.
>
> # Generate UTF-8 <--> SJIS code conversion tables from
> # map files provided by Unicode organization.
> # Unfortunately it is prohibited by the organization
> # to distribute the map files. So if you try to use this script,
> # you have to obtain SHIFTJIS.TXT from
> # the organization's ftp site.
>
> The file was found at the following place thanks to google.
>
> ftp://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/
>
> As the URL is showing, or as written in the file
> Public/MAPPINGS/EASTASIA/ReadMe.txt, it is already obsolete and
> the *live* definition *may* be found in Unicode Character
> Database. But I haven't found SJIS-related information there.
>
> If I'm not missing anything, the only available authority would
> be JIS X 0208/0213 but what should be implemented seems to be
> maybe-modified MS932 for which I don't know the authority.
>
> Anyway I ran UCS_to_SJIS.pl with the SHIFTJIS.TXT above and I got
> a quite different mapping files from the current ones.
>
> So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> maintained. If no authoritative information is available, the
> generating script no longer usable. If any other authority is
> chosen, it is to be modified according to whatever the new
> source format is.

The script is clearly intended to read CP932.TXT, rather than SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at

ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

However, running the script with that doesn't produce exactly what we have in utf8_to_sjis.map, either. It's otherwise the same, but we have some extra mappings:

- {0xc2a5, 0x5c},
- {0xc2ac, 0x81ca},
- {0xe28096, 0x8161},
- {0xe280be, 0x7e},
- {0xe28892, 0x817c},
- {0xe3809c, 0x8160},

Those mappings were added in commit a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus mapping for the invalid 0xc19c UTF-8 byte sequence was also added by that commit, as well as a few valid mappings that UCS_to_SJIS.pl also produces.

I can't judge if those mappings make sense. If we can't find an authoritative source for them, I suggest that we leave them as they are, but also hard-code them in UCS_to_SJIS.pl, so that running that script produces those mappings in utf8_to_sjis.map, even though they are not present in the CP932.TXT source file.

- Heikki
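The hard-coding suggested above could look roughly like the following. This is only a sketch in Python, whereas UCS_to_SJIS.pl itself is Perl; the names `EXTRA_MAPPINGS`, `merge_extras` and `format_map` are hypothetical, and the `generated` dict merely stands in for whatever the script derives from CP932.TXT (two entries taken from the patch context). The output layout imitates the entries visible in the patch and is not claimed to match the script's exact output.

```python
from typing import Dict

# UTF-8 -> SJIS pairs present in utf8_to_sjis.map but absent from CP932.TXT,
# as listed in the message above.
EXTRA_MAPPINGS: Dict[int, int] = {
    0xC2A5: 0x5C,
    0xC2AC: 0x81CA,
    0xE28096: 0x8161,
    0xE280BE: 0x7E,
    0xE28892: 0x817C,
    0xE3809C: 0x8160,
}


def merge_extras(generated: Dict[int, int], extras: Dict[int, int]) -> Dict[int, int]:
    """Add hard-coded pairs to the generated table, refusing silent conflicts."""
    merged = dict(generated)
    for utf8, sjis in extras.items():
        if utf8 in merged and merged[utf8] != sjis:
            raise ValueError(f"conflicting mapping for 0x{utf8:x}")
        merged[utf8] = sjis
    return merged


def format_map(mapping: Dict[int, int]) -> str:
    """Render the table as a C array, sorted by the UTF-8 key."""
    header = f"static const pg_utf_to_local ULmapSJIS[ {len(mapping)} ] = {{"
    entries = [f"  {{0x{utf8:x}, 0x{sjis:x}}}" for utf8, sjis in sorted(mapping.items())]
    return "\n".join([header, ",\n".join(entries), "};"])


# Stand-in for the table built from CP932.TXT (illustrative subset only).
generated = {0xC2A2: 0x8191, 0xC2A3: 0x8192}
merged = merge_extras(generated, EXTRA_MAPPINGS)
map_text = format_map(merged)
print(map_text)
```

Raising on a conflict rather than overwriting is a design choice made for this sketch: if a future source file started to define one of these keys differently, the discrepancy would surface instead of being hidden.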
Hello,

At Fri, 7 Oct 2016 23:58:45 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in <9c544547-7214-aebe-9b04-57624aedde96@iki.fi>
> > So, I wonder how the mappings related to SJIS (and/or EUC-JP) are
> > maintained. If no authoritative information is available, the
> > generating script no longer usable. If any other authority is
> > chosen, it is to be modified according to whatever the new
> > source format is.
>
> The script is clearly intended to read CP932.TXT, rather than
> SHIFTJIS.TXT, despite the comments in it. CP932.TXT can be found at
>
> ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
>
> However, running the script with that doesn't produce exactly what we
> have in utf8_to_sjis.map, either. It's otherwise same, but we have
> some extra mappings:
>
> - {0xc2a5, 0x5c},
> - {0xc2ac, 0x81ca},
> - {0xe28096, 0x8161},
> - {0xe280be, 0x7e},
> - {0xe28892, 0x817c},
> - {0xe3809c, 0x8160},
>
> Those mappings were added in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
> mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
> that commit, as well a few valid mappings that UCS_to_SJIS.pl also
> produces.
>
> I can't judge if those mappings make sense. If we can't find an
> authoritative source for them, I suggest that we leave them as they

These mappings exist for historical reasons: they stem from differences between the Unicode definitions and the Oracle and Microsoft implementations, and from the evolution of the Unicode specification. As a result, several SJIS (and EUC-JP) characters have two or more mappings to Unicode. There are also several variations of the opposite mapping. None of them is authoritative, and which one to adopt depends on system requirements. The only requirement that PostgreSQL should keep seems to be round-trip consistency starting from SJIS input.

> are, but also hard-code them to UCS_to_SJIS.pl, so that running that
> script produces those mappings in utf8_to_sjis.map, even though they
> are not present in the CP932.TXT source file.

Agreed. I will do that, at least for the Japanese charsets.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center
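The round-trip requirement mentioned above can be illustrated with a toy table. The following is a sketch in Python using code points rather than UTF-8 bytes for readability; the names `TO_SJIS` and `FROM_SJIS` and the two helper functions are made up for illustration, the tables hold only three entries, and the assumption that SJIS 0x5c maps back to U+005C is chosen for the example rather than taken from the PostgreSQL map files.

```python
# Unicode -> SJIS: several code points may share one SJIS code (many-to-one).
TO_SJIS = {
    0x005C: 0x5C,    # REVERSE SOLIDUS
    0x00A5: 0x5C,    # YEN SIGN, also sent to 0x5c
    0xFF3C: 0x815F,  # FULLWIDTH REVERSE SOLIDUS
}

# SJIS -> Unicode: exactly one target per SJIS code (assumed for this sketch).
FROM_SJIS = {
    0x5C: 0x005C,
    0x815F: 0xFF3C,
}


def sjis_round_trip_failures():
    """SJIS codes that do not survive SJIS -> Unicode -> SJIS."""
    return [s for s, u in FROM_SJIS.items() if TO_SJIS.get(u) != s]


def unicode_round_trip_failures():
    """Code points that do not survive Unicode -> SJIS -> Unicode."""
    return [u for u, s in TO_SJIS.items() if FROM_SJIS.get(s) != u]


print("SJIS-origin failures:   ", [hex(s) for s in sjis_round_trip_failures()])
print("Unicode-origin failures:", [f"U+{u:04X}" for u in unicode_round_trip_failures()])
```

With these toy tables every SJIS code survives the trip, while U+00A5 comes back as U+005C, which is the asymmetry that makes "round-trip consistency starting from SJIS input" the achievable guarantee.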
> However, running the script with that doesn't produce exactly what we
> have in utf8_to_sjis.map, either. It's otherwise same, but we have
> some extra mappings:
>
> - {0xc2a5, 0x5c},

0xc2a5 is U+00a5. The glyph is "YEN SIGN", which corresponds to 0x5c in SJIS. So this is a valid mapping. Meanwhile, Microsoft wants to map U+005c to 0x5c in CP932. The glyph of U+005c is "REVERSE SOLIDUS" (backslash). So MS decided that the glyph of U+005c is "YEN SIGN" in CP932! In summary, we need to keep both mappings: U+00a5 (UTF-8 0xc2a5) -> 0x5c and U+005c -> 0x5c. Obviously, this breaks round-trip conversion between the UTF8 and SJIS encodings in this case, though.

> - {0xc2ac, 0x81ca},

U+00ac (NOT SIGN). Exists in SJIS.

> - {0xe28096, 0x8161},

U+2016 (DOUBLE VERTICAL LINE). Exists in SJIS.

> - {0xe280be, 0x7e},

U+203e (OVERLINE). Mapped to ASCII 0x7e, which is "half width tilde".

> - {0xe28892, 0x817c},

U+2212 (MINUS SIGN). Mapped to "double width minus sign" in SJIS.

> - {0xe3809c, 0x8160},

U+301c (WAVE DASH). Mapped to "double width wave dash" in SJIS.

> Those mappings were added in commit
> a8bd7e1c6e026678019b2f25cffc0a94ce62b24b, back in 2002. The bogus
> mapping for the invalid 0xc19c UTF-8 byte sequence was also added by
> that commit, as well a few valid mappings that UCS_to_SJIS.pl also
> produces.
>
> I can't judge if those mappings make sense. If we can't find an
> authoritative source for them, I suggest that we leave them as they
> are, but also hard-code them to UCS_to_SJIS.pl, so that running that
> script produces those mappings in utf8_to_sjis.map, even though they
> are not present in the CP932.TXT source file.

Sounds acceptable.

In summary, the current PostgreSQL UTF8 <--> SJIS mapping is something of a mixture of SJIS (Shift_JIS) and MS932. There's no cleaner way out of this situation. I think we need to live with it.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
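The code points attached to each extra mapping can be cross-checked mechanically. The sketch below (Python; the names `EXTRA_KEYS`, `utf8_key_to_char` and `described` are hypothetical) treats each left-hand value from the list as a big-endian UTF-8 byte string, decodes it, and looks up the Unicode character name. Decoded this way, 0xe280be comes out as U+203E OVERLINE.

```python
import unicodedata

# Left-hand (UTF-8) values of the six extra mappings discussed in this thread.
EXTRA_KEYS = [0xC2A5, 0xC2AC, 0xE28096, 0xE280BE, 0xE28892, 0xE3809C]


def utf8_key_to_char(key: int) -> str:
    """Interpret an integer such as 0xe280be as UTF-8 bytes and decode it."""
    length = (key.bit_length() + 7) // 8
    return key.to_bytes(length, "big").decode("utf-8")


# Map each key to (code point, Unicode character name).
described = {}
for key in EXTRA_KEYS:
    ch = utf8_key_to_char(key)
    described[key] = (ord(ch), unicodedata.name(ch))
    print(f"0x{key:x} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

This only confirms which Unicode characters the keys denote; whether each SJIS target is the right choice remains the judgment call discussed in the messages.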