Thread: Annoying corruption in PostgreSQL.

Annoying corruption in PostgreSQL.

From: Kirill Reshke
Date:
Hi hackers!

We run a large number of PostgreSQL clusters in our production environment. They differ in version (we have PostgreSQL 11 through 16), load, amount of data, schema, etc. From time to time, PostgreSQL corruption happens. It says
ERROR,XX001,"missing chunk number 0 for toast value 18767319 in pg_toast_2619",,,,,,"vacuum full ;"

in the logs. The missing chunk number is almost always zero, while the other values vary. There is no known pattern that triggers the issue. Moreover, rerunning the VACUUM statement against the relations from the log message succeeds every time, so we just ignore these errors. Maybe it is just some weird data race?
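
(For context, pg_toast_2619 is the TOAST table of pg_statistic: the numeric suffix of a pg_toast_NNNN relation is the OID of its owning table, so it can be resolved with something like the sketch below; the database name is just a placeholder.)

    # resolve the owning table from the TOAST table name in the error
    # (2619 is the OID of the table that owns pg_toast_2619)
    psql -d postgres -Atc "SELECT 2619::regclass;"   # -> pg_statistic
    # or go the other way, from a table to its TOAST table
    psql -d postgres -Atc \
      "SELECT reltoastrelid::regclass FROM pg_class WHERE oid = 'pg_statistic'::regclass;"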

We don't know how to trigger this problem or why it occurs. I'm not asking you to resolve the issue, just to help with debugging it. What can we do to deduce the cause of the failure? Maybe we can add more logging somewhere (we can deploy a specially patched PostgreSQL build everywhere) so that we have more information when the issue happens next time?



Re: Annoying corruption in PostgreSQL.

From: Tomas Vondra
Date:

On 10/27/23 14:19, Kirill Reshke wrote:
> Hi hackers!
> 
> We run a large number of PostgreSQL clusters in our production
> environment. They differ in version (we have PostgreSQL 11 through 16),
> load, amount of data, schema, etc. From time to time, PostgreSQL
> corruption happens. It says
> ERROR,XX001,"missing chunk number 0 for toast value 18767319 in
> pg_toast_2619",,,,,,"vacuum full ;"
> 
> in the logs. The missing chunk number is almost always zero, while the
> other values vary. There is no known pattern that triggers the issue.
> Moreover, rerunning the VACUUM statement against the relations from the
> log message succeeds every time, so we just ignore these errors. Maybe
> it is just some weird data race?
> 
> We don't know how to trigger this problem or why it occurs. I'm not
> asking you to resolve the issue, just to help with debugging it. What
> can we do to deduce the cause of the failure? Maybe we can add more
> logging somewhere (we can deploy a specially patched PostgreSQL build
> everywhere) so that we have more information when the issue happens
> next time?
> 

For starters, it'd be good to know something about the environment, and
stuff that'd tell us if there's some possible pattern:

1) Which exact PG versions are you observing these errors on?

2) In the error example you shared it's pg_toast_2619, which is the
TOAST table for pg_statistic (probably). Is it always this relation? Or
which relations have you noticed this for?

3) What kind of commands are triggering this? In the example it seems to
be vacuum full. Did you see it for other commands too? People generally
don't do VACUUM FULL very often, particularly not in environments with
concurrent activity.

Considering you don't know what's causing this, or what to look for, I
think it might be interesting to use pg_waldump, and investigate what
happened to the page containing the TOAST chunk and to the page
referencing it. Do you have physical backups and the ability to do PITR?
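
Something along these lines, for example (just an untested sketch; the
WAL directory, LSN range, and tablespace/database/relfilenode values
are placeholders):

    # dump heap-related WAL records for the suspected time window
    # (-r Heap2 shows the vacuum/prune records, -r Heap the inserts/updates)
    pg_waldump -p "$PGDATA/pg_wal" -r Heap -s 0/5000028 -e 0/6000000
    # PG 15+ can filter the records touching a single relation, e.g. the
    # TOAST table or its parent, given as tablespace/database/relfilenode
    pg_waldump -p "$PGDATA/pg_wal" -s 0/5000028 -e 0/6000000 \
      --relation=1663/16384/2619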


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Fwd: Annoying corruption in PostgreSQL.

From: Kirill Reshke
Date:

Sorry, it seems I replied only to Tomas, so I'm forwarding the message.
---------- Forwarded message ---------
From: Kirill Reshke <reshkekirill@gmail.com>
Date: Sat, 28 Oct 2023 at 02:06
Subject: Re: Annoying corruption in PostgreSQL.
To: Tomas Vondra <tomas.vondra@enterprisedb.com>


Hi Tomas!

Thanks for the explanation!

1) 11 to 15. This week there were reproductions on 14.9 and 12.16. Two weeks ago there were reproductions on 15.4 and 11.21. Unfortunately, there is no information about reproductions a month old or more, but I found in our work chats that there was a reproduction on PostgreSQL 13 in April, minor version unknown. Overall, we have observed this issue for over a year on all PGDG-supported versions.

2) Searching our bug tracker, I have found:

1. missing chunk number 0 for toast value 592966012 in pg_toast_563953150 (some user relation)
2. missing chunk number 0 for toast value 18019714 in pg_toast_17706963 (some user relation)
3. missing chunk number 0 for toast value 52677740 in pg_toast_247794

So, this is not always pg_catalog; these TOAST tables belonged to ordinary user relations. (A crude way to probe a suspect table for damaged rows is sketched after point 3 below.)

3) It is always about VACUUM FULL (FREEZE/VERBOSE/ANALYZE) / autovacuum.
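
The probe mentioned under 2) would be something like this: force detoasting of every row and report the ones that fail. It is untested, and the database, table and column names are made up:

    -- probe.sql (hypothetical): report rows whose TOASTed values cannot be read
    DO $$
    DECLARE
      r record;
    BEGIN
      FOR r IN SELECT ctid FROM some_user_table LOOP
        BEGIN
          -- length() over a cast to text forces full detoasting of the column
          PERFORM length(big_column::text) FROM some_user_table WHERE ctid = r.ctid;
        EXCEPTION WHEN OTHERS THEN
          RAISE WARNING 'broken row at ctid %: %', r.ctid, SQLERRM;
        END;
      END LOOP;
    END $$;

    -- run with: psql -d somedb -f probe.sql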

We have physical backups and we can do PITR. But restoring a cluster to some point in the past is a somewhat different matter: we need our client's approval for such operations, since we are a managed-database cloud provider. I will try to ask someone.
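
If we do get approval, the restore itself would look roughly like this on PG 12+ (the backup location and recovery target are placeholders; PG 11 still uses recovery.conf instead of recovery.signal):

    # restore the base backup into a scratch data directory first, then:
    echo "restore_command = 'cp /backups/wal/%f %p'"       >> "$PGDATA/postgresql.conf"
    echo "recovery_target_time = '2023-10-20 12:00:00+00'" >> "$PGDATA/postgresql.conf"
    echo "recovery_target_action = 'pause'"                >> "$PGDATA/postgresql.conf"
    touch "$PGDATA/recovery.signal"
    pg_ctl -D "$PGDATA" start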

Best regards
 


Re: Fwd: Annoying corruption in PostgreSQL.

From: Tomas Vondra
Date:
On 10/27/23 23:10, Kirill Reshke wrote:
> 
> Sorry, it seems I replied only to Tomas, so I'm forwarding the message.
> ---------- Forwarded message ---------
> From: Kirill Reshke <reshkekirill@gmail.com>
> Date: Sat, 28 Oct 2023 at 02:06
> Subject: Re: Annoying corruption in PostgreSQL.
> To: Tomas Vondra <tomas.vondra@enterprisedb.com>
> 
> 
> Hi Tomas!
> 
> Thanks for the explanation!
> 
> 1) 11 to 15. This week there were reproductions on 14.9 and 12.16. Two
> weeks ago there were reproductions on 15.4 and 11.21. Unfortunately,
> there is no information about reproductions a month old or more, but I
> found in our work chats that there was a reproduction on PostgreSQL 13
> in April, minor version unknown. Overall, we have observed this issue
> for over a year on all PGDG-supported versions.
> 
> 2) Searching our bug tracker, I have found:
> 
> 1. missing chunk number 0 for toast value 592966012 in
> pg_toast_563953150 (some user relation)
> 2. missing chunk number 0 for toast value 18019714 in
> pg_toast_17706963 (some user relation)
> 3. missing chunk number 0 for toast value 52677740 in pg_toast_247794
> 
> So, this is not always pg_catalog; these TOAST tables belonged to
> ordinary user relations.
> 

OK.

> 3) It is always about VACUUM FULL (FREEZE/VERBOSE/ANALYZE) / autovacuum.
> 

Hmm, so it's always one of these VACUUM processes complaining?

> We have physical backups and we can do PITR. But restoring a cluster to
> some point in the past is a somewhat different matter: we need our
> client's approval for such operations, since we are a managed-database
> cloud provider. I will try to ask someone.
> 

That's what I'd try, to get some sense of what state the vacuum saw,
what the transactions modifying the TOAST + parent table were doing, how
much work those transactions did, whether maybe some of them aborted,
that sort of thing. It's hard to try reproducing this without any
knowledge of the workload. The WAL might tell us at least part of that.

How often do you actually see this issue? Once or twice a week?

Are you using some extensions that might interfere with this?

And you mentioned you're running a large number of clusters - are those
running similar workloads, or are they unrelated?

Actually, can you elaborate on why you are running VACUUM FULL and so
on? That generally should not be necessary, so maybe we can learn
something about your workload from that.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Fwd: Annoying corruption in PostgreSQL.

From: Stephen Frost
Date:
Greetings,

Please don't top-post on these lists.

* Kirill Reshke (reshkekirill@gmail.com) wrote:
> We have physical backups and we can do PITR. But restoring a cluster to
> some point in the past is a somewhat different matter: we need our
> client's approval for such operations, since we are a managed-database
> cloud provider. I will try to ask someone.

Do you have page-level checksums enabled for these PG instances?

Are you able to see if these clusters which are reporting the corruption
have been restored from a backup in the past?  What are you using to
perform your backups and restores?

Are you able to see if these clusters have ever crashed and come back up
afterwards by doing WAL replay?

Where I'm heading with these questions is essentially: I suspect either
your backup/restore procedure is broken or you're running on a system
that doesn't properly fsync data.  Or possibly both.

Oh, and you should probably have checksums enabled.
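
Roughly like this (the data directory is a placeholder; pg_checksums
exists in v12 and later, and the cluster has to be cleanly shut down to
verify or enable checksums offline):

    # on a running cluster, check whether data checksums are enabled
    psql -Atc "SHOW data_checksums;"
    # with the cluster cleanly shut down (v12+), verify or enable them
    pg_checksums --check  -D "$PGDATA"
    pg_checksums --enable -D "$PGDATA"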

Thanks,

Stephen
