Discussion: Data corruption

Data corruption

From: Konrad Garus
Date:
Hello,

we have a database that apparently got corrupted in a crash a few months ago, long enough that there is no "good" backup.

There are at least two tables occasionally emitting errors like this:

ERROR: invalid page header in block 39684 of relation base/12926/32397 while performing SQL query: SQL DML: insert into foo (a, b, c) values (?, ?, ?) | PARAMETERS: foo, bar, baz: org.postgresql.util.PSQLException: ERROR: invalid page header in block 39684 of relation base/12926/32397

The files in question (e.g. 32397) correspond to the table itself, not an index. The larger of these tables has two files (32397 & 32397.1), almost 2G total.

To make things worse, vacuum ends with:

ERROR:  could not access status of transaction 2828785257
DETAIL:  Could not open file "pg_clog/0A89": No such file or directory.

That file is nowhere to be found.

When I do this:

set zero_damaged_pages = on;
select count(*) from foo

It shows 3 errors like this:

WARNING:  invalid page header in block 78550 of relation base/12926/31642; zeroing out page
WARNING:  invalid page header in block 78551 of relation base/12926/31642; zeroing out page
WARNING:  invalid page header in block 78552 of relation base/12926/31642; zeroing out page

One idea I found is to generate a zero file for the missing clog, set zero_damaged_pages=on, then vacuum or recreate the table. Looks like it might work, but I'm trying to understand the implications.
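
For concreteness, this is roughly the recipe I mean (I have not run it; it assumes the default 8 kB block size, the standard 256 kB clog segment size, that $PGDATA points at the data directory, and that the server is stopped while the file is created as the postgres OS user):

# create the missing 256 kB clog segment, filled with zeroes
# (zeroed status bits make those transactions count as not committed,
#  so rows they wrote may simply become invisible)
dd if=/dev/zero of=$PGDATA/pg_clog/0A89 bs=8192 count=32
chown postgres:postgres $PGDATA/pg_clog/0A89

-- then, after restarting the server:
set zero_damaged_pages = on;
vacuum full foo;   -- or recreate the table with CREATE TABLE ... AS SELECT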

1. When I generate a zero clog file like this, what impact will it have on everything? Can it break anything else in the database, introduce some kind of inconsistency with other (correctly committed) table data?

2. How can I gauge how much actual data (e.g. how many rows) will be lost by doing this (zero clog file + vacuum/recreate table)?

3. Is there some place I can look to reliably tell how many rows the table should have, so I can compare to how many I can read after setting zero_damaged_pages?

4. Is there any way at all to recover/repair the corrupted data at this point?

--
Konrad Garus

Re: Data corruption

From: Vijaykumar Jain
Date:

I have not tried this (I'm not sure how to simulate corruption on a laptop), but maybe the extension and demo below will help to see how much of it has got corrupted?

Also, does pg_dump on those relations complete, or does it throw errors and abort?
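
For example, something along these lines (table and database names here are only placeholders):

# try to dump only the suspect tables; if it aborts, the error shows where the unreadable data is
pg_dump --table=foo --table=bar --format=custom --file=suspect_tables.dump yourdb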



--
Thanks,
Vijay
Mumbai, India

Re: Data corruption

From: soumitra bhandary
Date:
This is a very nasty error scenario.

You can try faking the clog blocks. Anyway, your data is corrupted, and if you don't have any backup, then it's lost.

Thanks
Soumitra 


Re: Data corruption

From: Scott Ribe
Date:
Do you have any clue as to how this happened?

If you have a hardware problem (a disk, controller or cable that is corrupting data as it is written), then any efforts you make at recovery on the same hardware could just result in more & more corruption.

You don't mention, I think, whether you've copied the data to different storage (and preferably a different server), but if you haven't already done so, STOP RIGHT NOW and do nothing else with this db until you've done that.
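
For instance, roughly like this (paths and host name are placeholders; stop the server first so the copy is consistent):

# stop the cluster, then take a file-level copy of the whole data directory
pg_ctl stop -D /var/lib/postgresql/data -m fast
rsync -a /var/lib/postgresql/data/ backuphost:/backups/pgdata-copy/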





Re: Data corruption

From: Ron
Date:
I've zeroed out Pg 8.4 pg_clog files using dd (instructions are out there on the web) without any visible negative effect.

Naturally, YMMV.


--
Angular momentum makes the world go 'round.

Re: Data corruption

From: Laurenz Albe
Date:
On Wed, 2021-07-21 at 18:14 +0200, Konrad Garus wrote:
> we have a database that apparently got corrupted in a crash a few months ago, long enough that there is no "good" backup.
> 
> There are at least two tables occasionally emitting errors like this:
> 
> ERROR: invalid page header in block 39684 of relation base/12926/32397 while performing SQL query: SQL DML: insert into foo (a, b, c) values (?, ?, ?) | PARAMETERS: foo, bar, baz: org.postgresql.util.PSQLException: ERROR: invalid page header in block 39684 of relation base/12926/32397
> 
> The files in question (e.g. 32397) correspond to the table itself, not an index.
>  The larger of these tables has two files (32397 & 32397.1), almost 2G total.
> 
> To make things worse, vacuum ends with:
> 
> ERROR:  could not access status of transaction 2828785257
> DETAIL:  Could not open file "pg_clog/0A89": No such file or directory.
> 
> That file is nowhere to be found.
> 
> When I do this:
> 
> set zero_damaged_pages = on;
> select count(*) from foo
> 
> It shows 3 errors like this:
> 
> WARNING:  invalid page header in block 78550 of relation base/12926/31642; zeroing out page
> WARNING:  invalid page header in block 78551 of relation base/12926/31642; zeroing out page
> WARNING:  invalid page header in block 78552 of relation base/12926/31642; zeroing out page
> 
> One idea I found is to generate a zero file for the missing clog, set zero_damaged_pages=on,
>  then vacuum or recreate the table. Looks like it might work, but I'm trying to understand the implications.
> 
> 1. When I generate a zero clog file like this, what impact will it have on everything?
>    Can it break anything else in the database, introduce some kind of inconsistency with other (correctly committed) table data?

Certainly.  I would create a fake clog file with only 0x55 in it, that corresponds to
all transactions being committed, so you will "see" more of the damaged data.
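
Roughly like this (only a sketch: 0x55 is binary 01010101, i.e. every two-bit transaction status set to "committed", and a clog segment is 256 kB; run it from inside the data directory with the server stopped):

# create the missing segment filled with 0x55 (octal 125) instead of zeroes
dd if=/dev/zero bs=262144 count=1 | tr '\0' '\125' > pg_clog/0A89
chown postgres:postgres pg_clog/0A89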

But of course there is no guarantee that the data are consistent.  For example,
foreign keys could be violated.  You will have to manually fix these data
before you can create the constraints.

> 2. How can I gauge how much actual data (e.g. how many rows) will be lost by doing this (zero clog file + vacuum/recreate table)?

You'd have to look at the binary data of the affected blocks to get an idea.
The "pageinspect" contrib can extract a block from a data file.

> 3. Is there some place I can look to reliably tell how many rows the table should have,
>    so I can compare to how many I can read after setting zero_damaged_pages?

No.

> 4. Is there any way at all to recover/repair the corrupted data at this point?

With the techniques you describe, you can salvage some data.  By carefully reading the damaged blocks, you may glean some more.

But typically, at that point you are happy if you lose only some data.  You may have to
delete some more to make the data consistent.  Then cut your losses and load the dump
into a new cluster on good hardware.
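
For example, for a custom-format dump (file, host and database names are placeholders):

# restore the salvaged dump into a freshly created database on the new cluster
pg_restore --host=newhost --dbname=newdb --no-owner salvaged.dump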

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com