Thread: Corrupted subjects on the archive website

Corrupted subjects on the archive website

From
Thomas Munro
Date:
Hi

Why do some messages display with corrupted subjects on the mailing
list archives site?  The replies to the message below, but not the
message itself, are displayed with a corrupted subject.  They appear
fine in my mail client though.

http://www.postgresql.org/message-id/20150922134404.5050.75087@wrigleys.postgresql.org

The website shows "Re: [BUGS] BUG #13632: violation de l'intégrité rQ1|ɕѥ".
My mail client shows "Re: [BUGS] BUG #13632: violation de l'intégrité
référentielle".

The original message that displays correctly has the following raw header:

Subject: BUG #13632: violation de l'intégrité référentielle

The reply that doesn't display correctly has the following raw header:

Subject: Re: [BUGS] BUG #13632: violation de l'intégrité référentielle

A wise denizen of #postgresql pointed out that 'UTF-8' decoded as
base64 produces 'Q1\377' of which we see at least the 'Q1' in the
corrupted string.
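
The base64 observation can be checked with the Python standard library. This is only a sketch: it assumes a lenient decoder that skips "-", which is not part of the standard base64 alphabet, and it drops that character explicitly so the step is visible. Under that assumption the three decoded bytes come out as 'Q1|' rather than 'Q1\377', which matches the 'Q1|' visible in the corrupted subject shown on the website.

```python
import base64

# The charset label that appears inside an RFC 2047 encoded-word.
charset_name = "UTF-8"

# "-" is outside the standard base64 alphabet; a lenient decoder would
# skip it, so remove it here to make that assumption explicit.
alphabet_only = charset_name.replace("-", "")

# Four base64 characters ("UTF8") decode to exactly three bytes.
decoded = base64.b64decode(alphabet_only)
print(decoded)  # b'Q1|'
```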

Some other examples of corruption can be seen in the French, Spanish,
Russian, etc. sections:

http://www.postgresql.org/list/pgsql-es-ayuda/2015-09/

--
Thomas Munro
http://www.enterprisedb.com



Re: Corrupted subjects on the archive website

From
Stefan Kaltenbrunner
Date:
On 09/23/2015 06:59 AM, Thomas Munro wrote:
> Hi
> 
> Why do some messages display with corrupted subjects on the mailing
> list archives site?  The replies to the message below, but not the
> message itself, are displayed with a corrupted subject.  They appear
> fine in my mail client though.
> 
> http://www.postgresql.org/message-id/20150922134404.5050.75087@wrigleys.postgresql.org
> 
> The website shows "Re: [BUGS] BUG #13632: violation de l'intégrité rQ1|ɕѥ".
> My mail client shows "Re: [BUGS] BUG #13632: violation de l'intégrité
> référentielle".
> 
> The original message that displays correctly has the following raw header:
> 
> Subject: BUG #13632: violation de l'intégrité ré
>  férentielle
> 
> The reply that doesn't display correctly has the following raw header:
> 
> Subject: Re: [BUGS] BUG #13632: violation de l'intégrité r
>  éférentielle
> 
> A wise denizen of #postgresql pointed out that 'UTF-8' decoded as
> base64 produces 'Q1\377' of which we see at least the 'Q1' in the
> corrupted string.

I looked a bit at the code and did some testing. The difference between
the original mail (which is stored and displayed correctly in the
archives database) and the two replies that have it corrupted is how the
line wrapping for the Subject is done (basically linebreak + space in the
first version and linebreak + tab in the broken ones).

We use decode_header() from the Python email package to parse headers
and it is actually capable of correctly decoding both variants.
However, there is a special hack in our importer code citing
http://bugs.python.org/issue504152 that removes \n\t unconditionally
from the raw string.
I don't know the details of why that was put in originally, but that
surely must be wrong in general because it removes the required
separation between different header words through linear whitespace
per RFC 2047 (in this case it leaves no separation at all, causing
decode_header() to go haywire).
I think it was Magnus who put that special case in, so maybe he can shed
some light on the issue this change was targeted at?
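
To illustrate the point about folding, here is a minimal sketch in Python 3. The raw headers of the real messages are not reproduced in this thread, so the two encoded-words below are reconstructed for illustration, and encoded_word(), decode_subject() and STRICT_WORD are hypothetical helpers, not the archive importer's code. STRICT_WORD in particular is an assumed pattern for a parser that only accepts an encoded-word when whitespace or the end of the string follows the closing "?="; it is not taken from any specific Python release.

```python
import base64
import re
from email.header import decode_header


def encoded_word(text):
    """Wrap text as an RFC 2047 base64 encoded-word in UTF-8."""
    payload = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return "=?UTF-8?B?" + payload + "?="


def decode_subject(raw):
    """Decode a raw header value with decode_header() and join the parts."""
    parts = []
    for value, charset in decode_header(raw):
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii")
        parts.append(value)
    return "".join(parts)


# Reconstructed for illustration; the split point mirrors the quoted header.
first = encoded_word("violation de l'intégrité r")
second = encoded_word("éférentielle")

folded_with_space = first + "\n " + second   # linebreak + space
folded_with_tab = first + "\n\t" + second    # linebreak + tab

# Both folding styles decode to the same subject.
print(decode_subject(folded_with_space))
print(decode_subject(folded_with_tab))

# The importer hack described above: remove every newline + tab pair.
# Nothing separates the two encoded-words any more.
glued = folded_with_tab.replace("\n\t", "")
print("?==?" in glued)  # True

# Assumed strict matcher: "?=" only closes a word when followed by
# whitespace or the end of the string.
STRICT_WORD = re.compile(r"=\?([^?]*?)\?([bBqQ])\?(.*?)\?=(?=[ \t]|$)")
payload = STRICT_WORD.match(glued).group(3)

# The second word's "=?UTF-8?B?" marker is now inside the payload of the
# first, so the charset label would be fed to the base64 decoder.
print("=?UTF-8?B?" in payload)  # True
```

With the whitespace intact, both variants decode to the full subject. Once the "\n\t" is stripped, a parser behaving like the assumed strict pattern sees a single encoded-word whose payload contains the literal text "UTF-8", which is consistent with the 'Q1' fragment reported at the start of the thread.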


Stefan



Re: Corrupted subjects on the archive website

From
Magnus Hagander
Date:


On Wed, Sep 23, 2015 at 7:30 PM, Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> wrote:
> I looked a bit at the code and did some testing. The difference between
> the original mail (which is stored and displayed correctly in the
> archives database) and the two replies that have it corrupted is how the
> line wrapping for the Subject is done (basically linebreak + space in the
> first version and linebreak + tab in the broken ones).
>
> We use decode_header() from the Python email package to parse headers
> and it is actually capable of correctly decoding both variants.
> However, there is a special hack in our importer code citing
> http://bugs.python.org/issue504152 that removes \n\t unconditionally
> from the raw string.
> I don't know the details of why that was put in originally, but that
> surely must be wrong in general because it removes the required
> separation between different header words through linear whitespace
> per RFC 2047 (in this case it leaves no separation at all, causing
> decode_header() to go haywire).
> I think it was Magnus who put that special case in, so maybe he can shed
> some light on the issue this change was targeted at?



He cannot, unfortunately. That was years ago and I don't have a test case around for it.

As discussed with Stefan, we need to set up a proper testbench to make sure we don't break something else when/if we remove this change. It's on my TODO list, and I just wanted to ack that in this thread. This is clearly a bug in the archives code that needs to get fixed; it'll just take a bit longer, as we don't yet have a way to test across the 1M+ messages that are in the archives today.

--