Discussion: Almost bug in COPY FROM processing of GB18030 encoded input
Hi,

I happened to notice that when CopyReadLineText() calls mblen(), it passes only the first byte of the multi-byte characters. However, pg_gb18030_mblen() looks at the first and the second byte. CopyReadLineText() always passes \0 as the second byte, so pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded characters as 2.

It works out fine, though, because the second half of the 4-byte encoded character always looks like another 2-byte encoded character, in GB18030. CopyReadLineText() is looking for delimiter and escape characters and newlines, and only single-byte characters are supported for those, so treating a 4-byte character as two 2-byte characters is harmless.

Attached is a patch to explain that in the comments. Grepping for mblen(), I didn't find any other callers that used mblen() like that.

- Heikki
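The behaviour described above can be reproduced with a small standalone sketch. This is not the PostgreSQL source: the `sketch_*` functions are hypothetical stand-ins, and the length rule is a simplified model of what pg_gb18030_mblen() is described as doing (inspect the first and second byte). The sample input is one byte sequence in the 4-byte form followed by a newline.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified stand-in for pg_gb18030_mblen() (an assumption, not the real
 * code): the length is decided from the first and the second byte.
 */
static int
sketch_gb18030_mblen(const unsigned char *s)
{
    if ((s[0] & 0x80) == 0)
        return 1;               /* ASCII */
    if (s[1] >= 0x30 && s[1] <= 0x39)
        return 4;               /* second byte 0x30..0x39: 4-byte form */
    return 2;
}

/*
 * Mimics the caller described in the message: only the first byte is
 * passed, and the second byte is always \0.
 */
static int
sketch_mblen_first_byte_only(unsigned char first)
{
    unsigned char buf[2];

    buf[0] = first;
    buf[1] = '\0';
    return sketch_gb18030_mblen(buf);
}

/*
 * Scan for a newline, skipping multi-byte characters using the
 * first-byte-only length. Returns the newline's index, or len if none.
 */
static size_t
sketch_find_newline(const unsigned char *buf, size_t len)
{
    size_t      i = 0;

    while (i < len)
    {
        if (buf[i] == '\n')
            return i;
        i += (size_t) sketch_mblen_first_byte_only(buf[i]);
    }
    return len;
}

static void
sketch_selfcheck(void)
{
    /* One 4-byte sequence (0x81 0x30 0x81 0x30) followed by a newline. */
    static const unsigned char input[] = {0x81, 0x30, 0x81, 0x30, '\n'};

    /* With the real second byte visible, the length is 4. */
    assert(sketch_gb18030_mblen(input) == 4);
    /* With \0 as the second byte, the length is misreported as 2 ... */
    assert(sketch_mblen_first_byte_only(input[0]) == 2);
    /* ... and the second half also looks like a 2-byte character. */
    assert(sketch_mblen_first_byte_only(input[2]) == 2);
    /* The scan still lands exactly on the newline. */
    assert(sketch_find_newline(input, sizeof(input)) == 4);
}
```

Calling sketch_selfcheck() from a main() exercises all four checks. The two 2-byte steps cover the same four bytes as one 4-byte step, which is why the scan for single-byte delimiters is unaffected.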
On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I happened to notice that when CopyReadLineText() calls mblen(), it
> passes only the first byte of the multi-byte characters. However,
> pg_gb18030_mblen() looks at the first and the second byte.
> CopyReadLineText() always passes \0 as the second byte, so
> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
> characters as 2.
>
> It works out fine, though, because the second half of the 4-byte encoded
> character always looks like another 2-byte encoded character, in
> GB18030. CopyReadLineText() is looking for delimiter and escape
> characters and newlines, and only single-byte characters are supported
> for those, so treating a 4-byte character as two 2-byte characters is
> harmless.

Yikes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 24/01/2019 23:27, Robert Haas wrote:
> On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I happened to notice that when CopyReadLineText() calls mblen(), it
>> passes only the first byte of the multi-byte characters. However,
>> pg_gb18030_mblen() looks at the first and the second byte.
>> CopyReadLineText() always passes \0 as the second byte, so
>> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
>> characters as 2.
>>
>> It works out fine, though, because the second half of the 4-byte encoded
>> character always looks like another 2-byte encoded character, in
>> GB18030. CopyReadLineText() is looking for delimiter and escape
>> characters and newlines, and only single-byte characters are supported
>> for those, so treating a 4-byte character as two 2-byte characters is
>> harmless.
>
> Yikes.

Committed the comment changes, so it's less of a gotcha now.

- Heikki