Discussion: Almost bug in COPY FROM processing of GB18030 encoded input

Almost bug in COPY FROM processing of GB18030 encoded input

From: Heikki Linnakangas

Date: January 23, 2019, 14:23:23

Hi,

I happened to notice that when CopyReadLineText() calls mblen(), it 
passes only the first byte of the multi-byte characters. However, 
pg_gb18030_mblen() looks at the first and the second byte. 
CopyReadLineText() always passes \0 as the second byte, so 
pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded 
characters as 2.
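
[Editorial illustration, not part of the original message.] The effect described above can be reproduced with a small sketch. The function names here are hypothetical stand-ins, not the PostgreSQL source: `gb18030_mblen_sketch` assumes the length rule as described (decided from the first and second byte), and `mblen_first_byte_only` mimics a caller that supplies only the first byte followed by \0.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the length function described above:
 * the result depends on the first AND the second byte.
 */
int
gb18030_mblen_sketch(const unsigned char *s)
{
    assert(s != NULL);

    if ((s[0] & 0x80) == 0)
        return 1;               /* ASCII */
    if (s[1] >= 0x30 && s[1] <= 0x39)
        return 4;               /* second byte is a digit: 4-byte form */
    return 2;                   /* otherwise: 2-byte form */
}

/*
 * Mimics the caller described above: only the first byte is copied into
 * a two-byte scratch buffer whose second byte is always \0.
 */
int
mblen_first_byte_only(unsigned char first)
{
    unsigned char scratch[2] = { first, '\0' };

    return gb18030_mblen_sketch(scratch);
}
```

Given the valid 4-byte sequence 0x81 0x30 0x81 0x30, `gb18030_mblen_sketch` returns 4 when it sees the whole sequence, while `mblen_first_byte_only(0x81)` returns 2, because the \0 in the second position never falls in the 0x30..0x39 range.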

It works out fine, though, because the second half of the 4-byte encoded 
character always looks like another 2-byte encoded character, in 
GB18030. CopyReadLineText() is looking for delimiter and escape 
characters and newlines, and only single-byte characters are supported 
for those, so treating a 4-byte character as two 2-byte characters is 
harmless.
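
[Editorial illustration, not part of the original message.] A minimal sketch of why the misreported length does no damage when the scan is only looking for single-byte characters. `first_byte_only_len` and `find_delim_sketch` are hypothetical names, and the loop is a simplified model of such a scan, not the actual CopyReadLineText() code. The target byte '0' is used purely for illustration, because the second and fourth bytes of a 4-byte GB18030 character fall in 0x30..0x39 and so look like ASCII digits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: the length as seen when only the first byte is
 * available, so any byte with the high bit set reports 2.
 */
int
first_byte_only_len(unsigned char c)
{
    return (c & 0x80) ? 2 : 1;
}

/*
 * Return the offset of the first occurrence of the single-byte target
 * in buf, or -1 if there is none. Multi-byte characters are skipped
 * using the (possibly too short) reported length, so their trail bytes
 * are never compared against the target.
 */
long
find_delim_sketch(const unsigned char *buf, size_t len, unsigned char delim)
{
    size_t i = 0;

    assert(buf != NULL);
    assert((delim & 0x80) == 0);    /* single-byte targets only */

    while (i < len)
    {
        unsigned char c = buf[i];

        if (c & 0x80)
        {
            /* lead byte: skip it together with one trail byte */
            i += (size_t) first_byte_only_len(c);
            continue;
        }
        if (c == delim)
            return (long) i;
        i++;
    }
    return -1;
}
```

For the buffer 0x81 0x30 0x81 0x30 '0' ',' 'x', searching for '0' returns offset 4, not 1 or 3: the 4-byte character is consumed as two 2-byte steps, and both embedded 0x30 bytes are skipped as trail bytes.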

Attached is a patch to explain that in the comments. Grepping for 
mblen(), I didn't find any other callers that used mblen() like that.

- Heikki

Attachments

0001-Fix-comments-to-that-claimed-that-mblen-only-looks-a.patch

Re: Almost bug in COPY FROM processing of GB18030 encoded input

From: Robert Haas

Date: January 25, 2019, 00:27:11

On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I happened to notice that when CopyReadLineText() calls mblen(), it
> passes only the first byte of the multi-byte characters. However,
> pg_gb18030_mblen() looks at the first and the second byte.
> CopyReadLineText() always passes \0 as the second byte, so
> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
> characters as 2.
>
> It works out fine, though, because the second half of the 4-byte encoded
> character always looks like another 2-byte encoded character, in
> GB18030. CopyReadLineText() is looking for delimiter and escape
> characters and newlines, and only single-byte characters are supported
> for those, so treating a 4-byte character as two 2-byte characters is
> harmless.

Yikes.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Almost bug in COPY FROM processing of GB18030 encoded input

From: Heikki Linnakangas

Date: January 25, 2019, 15:56:27

On 24/01/2019 23:27, Robert Haas wrote:
> On Wed, Jan 23, 2019 at 6:23 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> I happened to notice that when CopyReadLineText() calls mblen(), it
>> passes only the first byte of the multi-byte characters. However,
>> pg_gb18030_mblen() looks at the first and the second byte.
>> CopyReadLineText() always passes \0 as the second byte, so
>> pg_gb18030_mblen() will incorrectly report the length of 4-byte encoded
>> characters as 2.
>>
>> It works out fine, though, because the second half of the 4-byte encoded
>> character always looks like another 2-byte encoded character, in
>> GB18030. CopyReadLineText() is looking for delimiter and escape
>> characters and newlines, and only single-byte characters are supported
>> for those, so treating a 4-byte character as two 2-byte characters is
>> harmless.
> 
> Yikes.

Committed the comment changes, so it's less of a gotcha now.

- Heikki
