Обсуждение: 8.2.3 PANIC with "corrupted item pointer"

Поиск
Список
Период
Сортировка

8.2.3 PANIC with "corrupted item pointer"

От
"Henka"
Дата:
Hello all,

I'm using PG 8.2.3:

PostgreSQL 8.2.3 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.3.6

I happened to notice this error in the log when my application was refused
a db connection (quite unexpectedly):

PANIC:  corrupted item pointer: offset = 3308, size = 28
LOG:  autovacuum process (PID 18165) was terminated by signal 6
LOG:  terminating any other active server processes
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted at 2007-06-21 12:27:13 SAST
LOG:  checkpoint record is at D6/F00F418C
LOG:  redo record is at D6/F00C24B8; undo record is at 0/0; shutdown FALSE
LOG:  next transaction ID: 0/2090506603; next OID: 101300203
LOG:  next MultiXactId: 35676; next MultiXactOffset: 85365
LOG:  database system was not properly shut down; automatic recovery in
progress
LOG:  redo starts at D6/F00C24B8
FATAL:  the database system is starting up
LOG:  record with zero length at D6/F138DDA4
LOG:  redo done at D6/F138DD7C
LOG:  database system is ready

Just prior to the panic, an INSERT was executed three times and rejected
as expected because of a unique constraint index.

Database size is about 180GB and growing.

Any comments would be appreciated.

Regards
Henry


Re: 8.2.3 PANIC with "corrupted item pointer"

От
Gregory Stark
Дата:
"Henka" <henka@cityweb.co.za> writes:

> Hello all,
>
> I'm using PG 8.2.3:

You should update to 8.2.4, it includes a security fix and several bug fixes.
However afaik none of them look like this.

> PANIC:  corrupted item pointer: offset = 3308, size = 28
> LOG:  autovacuum process (PID 18165) was terminated by signal 6

Huh, that's pretty strange.

My first thought is bad memory. It's always good to rule that out since it's
quite common and can cause a lot of hair-pulling. If you can schedule some
downtime download memtest86 and run it overnight.

Other than that it might be interesting to know the values of some server
parameters: "fsync" and "full_page_writes". Have you ever had this machine
crash or had a power failure? And what kind of i/o controller is this?

Ideally it would be good to get a dump of this block, it looks like it's
probably a block of an index (unless you have a table with extremely narrow
rows?). But there doesn't seem to be enough information in this error to tell
which block it happened on.

If you manually "vacuum verbose" each table does it cause a crash? If so send
that along and at least we'll know which table or index has the corrupted
data. You probably don't want to do this during peak production time though...

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: 8.2.3 PANIC with "corrupted item pointer"

От
"Henka"
Дата:

>> I'm using PG 8.2.3:
>
> You should update to 8.2.4, it includes a security fix and several bug
> fixes.

That was my next option.  My last backup dump looks suspiciously small,
but the day before that looks about right.


> My first thought is bad memory. It's always good to rule that out since
> it's
> quite common and can cause a lot of hair-pulling. If you can schedule some
> downtime download memtest86 and run it overnight.

Thanks for the suggestion - will give it a try.


> Other than that it might be interesting to know the values of some server
> parameters: "fsync" and "full_page_writes". Have you ever had this machine
> crash or had a power failure? And what kind of i/o controller is this?

fsync = off
full_page_writes = default

Sadly yes, the machine has experienced a power failure about 3 weeks ago
(genset startup problem).  With fsync=off this presents a problem wrt safe
recovery, I know...


> Ideally it would be good to get a dump of this block, it looks like it's
> probably a block of an index (unless you have a table with extremely
> narrow
> rows?). But there doesn't seem to be enough information in this error to
> tell
> which block it happened on.
>
> If you manually "vacuum verbose" each table does it cause a crash? If so
> send

Giving that a try now on the suspect table.




Re: 8.2.3 PANIC with "corrupted item pointer"

От
Gregory Stark
Дата:
"Henka" <henka@cityweb.co.za> writes:

>> Other than that it might be interesting to know the values of some server
>> parameters: "fsync" and "full_page_writes". Have you ever had this machine
>> crash or had a power failure? And what kind of i/o controller is this?
>
> fsync = off
> full_page_writes = default
>
> Sadly yes, the machine has experienced a power failure about 3 weeks ago
> (genset startup problem).  With fsync=off this presents a problem wrt safe
> recovery, I know...

Ugh. The worst part is that you won't even know that there's anything wrong
with your data. I would actually suggest that if you run with fsync off and
have a power failure or kernel crash you should just immediately restore from
your last backup and not risk running with the possibly corrupt database.

Honestly this seems like a weird error to occur as a result of crashing with
fsync off but anything's possible. More likely is you have records that you
have partial transactions in your database, ie, records which were inserted or
deleted in a transaction but missing other records that were inserted or
deleted in the same transaction.

You could probably fix this particular problem by reindexing the corrupted
index. But you may never know if some of the data is incorrect.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: 8.2.3 PANIC with "corrupted item pointer"

От
Tom Lane
Дата:
"Henka" <henka@cityweb.co.za> writes:
> I happened to notice this error in the log when my application was refused
> a db connection (quite unexpectedly):

> PANIC:  corrupted item pointer: offset = 3308, size = 28
> LOG:  autovacuum process (PID 18165) was terminated by signal 6

FWIW, the only occurrences of that specific message text are in
PageIndexTupleDelete and PageIndexMultiDelete, so you can be pretty sure
that this is just a corrupted-index problem.  Once you've identified
which table has the problem, a REINDEX should fix it.

I concur with the other comments that a crash with fsync off might well
have allowed more corruption than just this to sneak in, though :-(

            regards, tom lane

Re: 8.2.3 PANIC with "corrupted item pointer"

От
Henk - CityWEB
Дата:
On Thu, 21 Jun 2007, Tom Lane wrote:

> "Henka" <henka@cityweb.co.za> writes:
> > I happened to notice this error in the log when my application was refused
> > a db connection (quite unexpectedly):
>
> > PANIC:  corrupted item pointer: offset = 3308, size = 28
> > LOG:  autovacuum process (PID 18165) was terminated by signal 6
>
> FWIW, the only occurrences of that specific message text are in
> PageIndexTupleDelete and PageIndexMultiDelete, so you can be pretty sure
> that this is just a corrupted-index problem.  Once you've identified
> which table has the problem, a REINDEX should fix it.
>
> I concur with the other comments that a crash with fsync off might well
> have allowed more corruption than just this to sneak in, though :-(

Thanks for the comments, Tom.  I've started the reindex on the offending
index, so let's see how it goes.

I think once that's done, and I'm able to dump-all without error, I'll
upgrade to 8.2.4 and perform a restore.  That process should
uncover any other funnies.

Regards
Henry

Re: 8.2.3 PANIC with "corrupted item pointer"

От
Henk - CityWEB
Дата:

On Thu, 21 Jun 2007, Gregory Stark wrote:
> Ugh. The worst part is that you won't even know that there's anything wrong
> with your data. I would actually suggest that if you run with fsync off and
> have a power failure or kernel crash you should just immediately restore from
> your last backup and not risk running with the possibly corrupt database.
>
> Honestly this seems like a weird error to occur as a result of crashing with
> fsync off but anything's possible. More likely is you have records that you
> have partial transactions in your database, ie, records which were inserted or
> deleted in a transaction but missing other records that were inserted or
> deleted in the same transaction.
>
> You could probably fix this particular problem by reindexing the corrupted
> index. But you may never know if some of the data is incorrect.

Thanks, Greg.  Luckily the data is for internal/behind-the-scenes
use only, with no customer access.  So the situation isn't dire.

I can't wait to get a decent master/multi-slave setup going where I can
turn fsync on and still get semi-decent performance...

Regards
Henry

Re: 8.2.3 PANIC with "corrupted item pointer"

От
Hannes Dorbath
Дата:
Henk - CityWEB wrote:
> I can't wait to get a decent master/multi-slave setup going where I can
> turn fsync on and still get semi-decent performance...

I don't see how replication can help you with fsync performance
problems. Controllers with battery backed write cache are cheap. What is
the point of disabling fsync these days?


--
Best regards,
Hannes Dorbath

Re: 8.2.3 PANIC with "corrupted item pointer"

От
Henk - CityWEB
Дата:

On Thu, 21 Jun 2007, Tom Lane wrote:

> "Henka" <henka@cityweb.co.za> writes:
> > I happened to notice this error in the log when my application was refused
> > a db connection (quite unexpectedly):
>
> > PANIC:  corrupted item pointer: offset = 3308, size = 28
> > LOG:  autovacuum process (PID 18165) was terminated by signal 6
>
> FWIW, the only occurrences of that specific message text are in
> PageIndexTupleDelete and PageIndexMultiDelete, so you can be pretty sure
> that this is just a corrupted-index problem.  Once you've identified
> which table has the problem, a REINDEX should fix it.

I've identified the offending index, but REINDEX also causes the PANIC.
I also tried DROPping the index (and REINDEX DATABASE foo), but that too
causes a panic.

Is it possible to identify the physical file/s used by this index and
remove it manually from the filesystem (while pg is offline) and update
system tables, or is there some other way I can reindex/drop/remove it?

The only thing I haven't tried is dropping the entire table, but I suspect
that will also cause a panic.

If I can't trash the index externally, is possible to trash the entire
table/index group externally?

Any comments are appreciated.

Regards
Henry

Re: 8.2.3 PANIC with "corrupted item pointer"

От
Tom Lane
Дата:
Henk - CityWEB <henka@cityweb.co.za> writes:
> On Thu, 21 Jun 2007, Tom Lane wrote:
>> FWIW, the only occurrences of that specific message text are in
>> PageIndexTupleDelete and PageIndexMultiDelete, so you can be pretty sure
>> that this is just a corrupted-index problem.  Once you've identified
>> which table has the problem, a REINDEX should fix it.

> I've identified the offending index, but REINDEX also causes the PANIC.
> I also tried DROPping the index (and REINDEX DATABASE foo), but that too
> causes a panic.

In that case you've not correctly identified the broken index.  It
sounds to me like the problem could possibly be in a system catalog
index.  REINDEX SYSTEM with use of system indexes disabled might help
(see the REINDEX man page).

            regards, tom lane