Discussion: Corrupted Table


Corrupted Table

From
"Bryan White"
Date:
I have a table that seems to have become corrupted.

---------------
$ pg_dump -t archivewords ec > aw.dump
pqWait() -- connection not open
PQendcopy: resetting connection
SQL query to dump the contents of Table 'archivewords' did not execute
correctly.  After we read all the table contents from the backend,
PQendcopy() failed.  Explanation from backend: 'The Data Base System is
starting up
'.
The query was: 'COPY "archivewords" TO stdout;
'.
---------------

As a result of this event, in the log file I see:
---------------
Server process (pid 21764) exited with status 139 at Mon Jul 31 14:51:44
2000
Terminating any active server processes...
Server processes were terminated at Mon Jul 31 14:51:44 2000
Reinitializing shared memory and semaphores
The Data Base System is starting up
000731.14:51:44.967 [21766] DEBUG:  Data Base System is starting up at Mon
Jul 31 14:51:44 2000
000731.14:51:44.967 [21766] DEBUG:  Data Base System was interrupted being
in production at Mon Jul 31 14:50:24 2000
000731.14:51:44.968 [21766] DEBUG:  Data Base System is in production state
at Mon Jul 31 14:51:44 2000
---------------

Here is what the table looks like:
CREATE TABLE "archivewords" (
        "docid" text NOT NULL,
        "word" text NOT NULL,
        "weight" int4 DEFAULT 0 NOT NULL
);
CREATE UNIQUE INDEX "iarcwordsword" on "archivewords" using btree ( "word"
"text_ops", "docid" "text_ops" );

Does anyone know what would cause this or how to recover from it?  In this
particular case I can recreate the file, but I am concerned that this might
happen on a more critical file, so I would like to pursue recovering it if
possible.





Re: Corrupted Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
> As a result of this event, in the log file I see:
> ---------------
> Server process (pid 21764) exited with status 139 at Mon Jul 31 14:51:44

Status 139 indicates a SEGV trap on most Unixen.  There should be a core
dump left by the crashed backend --- can you get a backtrace from it
with gdb?

I concur that this probably indicates corrupted data in the file.  We
may or may not be able to guess how it got corrupted, but a stack trace
seems like the place to start.

            regards, tom lane

Re: Corrupted Table

From
"Bryan White"
Date:
> Status 139 indicates a SEGV trap on most Unixen.  There should be a core
> dump left by the crashed backend --- can you get a backtrace from it
> with gdb?
>
> I concur that this probably indicates corrupted data in the file.  We
> may or may not be able to guess how it got corrupted, but a stack trace
> seems like the place to start.

Here is the backtrace:
#0  0x808b0e1 in CopyTo ()
#1  0x808ae2f in DoCopy ()
#2  0x80ec7c1 in ProcessUtility ()
#3  0x80ead48 in pg_exec_query_dest ()
#4  0x80eaca7 in pg_exec_query ()
#5  0x80ebba2 in PostgresMain ()
#6  0x80d61f2 in DoBackend ()
#7  0x80d5dd1 in BackendStartup ()
#8  0x80d518a in ServerLoop ()
#9  0x80d4c14 in PostmasterMain ()
#10 0x80ab736 in main ()
#11 0x401029cb in __libc_start_main (main=0x80ab6d0 <main>, argc=8,
    argv=0xbffffb54, init=0x8063fac <_init>, fini=0x812969c <_fini>,
    rtld_fini=0x4000ae60 <_dl_fini>, stack_end=0xbffffb4c)
    at ../sysdeps/generic/libc-start.c:92

BTW this is Postgres 7.0.2 on i386/RedHat 6.2.

The core file was made when I tried to dump the table.  As far as I can tell
the corruption occurred on Friday, because that is the date of my last good
automated backup.



Re: Corrupted Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
>> I concur that this probably indicates corrupted data in the file.  We
>> may or may not be able to guess how it got corrupted, but a stack trace
>> seems like the place to start.

> Here is the backtrace:
> #0  0x808b0e1 in CopyTo ()

Hmm.  Assuming that it is a corrupted-data issue, the only likely
failure spot that I see in CopyTo() is the heap_getattr macro.
A plausible theory is that the length word of a variable-length field
(eg, text column) has gotten corrupted, so that when the code tries to
access the next field beyond that, it calculates a pointer off the end
of memory.

You will probably find that plain SELECT will die too if it tries to
extract data from the corrupted tuple or tuples.  With judicious use of
SELECT last-column ... LIMIT you might be able to narrow down which
tuples are bad, and then dump out the disk block containing them (use
the 'tid' pseudo-attribute to see which block a tuple is in).  I'm not
sure if the exercise will lead to anything useful or not, but if you
want to pursue it...
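
A sketch of that narrowing-down against the archivewords table from the
first message (the offsets here are placeholders, and `ctid` is the
spelling of the tid pseudo-attribute in later releases):

```sql
-- Probe the table in slices; a slice that has to extract data from a
-- damaged tuple will crash the backend, so halve the window until a
-- single row reproduces the crash.
SELECT weight FROM archivewords LIMIT 1000 OFFSET 0;
SELECT weight FROM archivewords LIMIT 1000 OFFSET 1000;
-- ...
-- Once one row is isolated, fetch its tid; the first number in the
-- pair is the disk block holding the damaged tuple.
SELECT ctid FROM archivewords LIMIT 1 OFFSET 1234;
```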

            regards, tom lane

Re: Corrupted Table

From
"Bryan White"
Date:
> Hmm.  Assuming that it is a corrupted-data issue, the only likely
> failure spot that I see in CopyTo() is the heap_getattr macro.
> A plausible theory is that the length word of a variable-length field
> (eg, text column) has gotten corrupted, so that when the code tries to
> access the next field beyond that, it calculates a pointer off the end
> of memory.
>
> You will probably find that plain SELECT will die too if it tries to
> extract data from the corrupted tuple or tuples.  With judicious use of
> SELECT last-column ... LIMIT you might be able to narrow down which
> tuples are bad, and then dump out the disk block containing them (use
> the 'tid' pseudo-attribute to see which block a tuple is in).  I'm not
> sure if the exercise will lead to anything useful or not, but if you
> want to pursue it...

I am willing to spend some time to track this down.  However, I would prefer
not to keep crashing my live database.  I would like to copy the raw data
files to a backup machine.  Are there any catches in doing this?  This
particular table is only updated at predictable times on the live system.  I
am guessing that as long as it is stable for at least a few minutes before I
copy the file it will work.

How hard would it be to write a utility that would walk a table looking for
this kind of corruption?  Are the on-disk data formats documented anywhere?



Re: Corrupted Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
> I am willing to spend some time to track this down.  However, I would prefer
> not to keep crashing my live database.  I would like to copy the raw data
> files to a backup machine.  Are there any catches in doing this?

Shut down the postmaster and then copy the entire db (including pg_log
file) and it should work.  The catch is to make sure pg_log is in sync
with your table files.

> How hard would it be to write a utility that would walk a table looking for
> this kind of corruption?  Are the on-disk data formats documented anywhere?

I don't think there's a single handy-dandy reference, although you can
learn a lot from the header files in src/include/storage/.

Not sure about the value of a utility like that --- seems like it'd
always be prepared to fight the last war rather than the next one,
since anything we learned about likely failure patterns would probably
get folded into the main code.  But if you feel like working on one,
far be it from me to stop you...

            regards, tom lane

Re: Corrupted Table

From
"Bryan White"
Date:
> Shut down the postmaster and then copy the entire db (including pg_log
> file) and it should work.  The catch is to make sure pg_log is in sync
> with your table files.

I would rather not leave my database down long enough to copy the entire db
(3.5GB).  I have control over when changes are applied to this table.  If I
restart the database and make certain no updates/inserts/deletes occur then
will all info be flushed from the pg_log file?  If not, how about if I first
vacuum the table?




Re: Corrupted Table

From
Tom Lane
Date:
"Bryan White" <bryan@arcamax.com> writes:
>> Shut down the postmaster and then copy the entire db (including pg_log
>> file) and it should work.  The catch is to make sure pg_log is in sync
>> with your table files.

> I would rather not leave my database down long enough to copy the entire db
> (3.5GB).

Well, you don't really need the *whole* db, just pg_log (and the other
tables at the top level of the data directory), the system tables
(pg_foo in the database subdir), and the table of interest.
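
A sketch of that selective copy.  The paths and the database name "ec"
are assumptions about a 7.0-era layout; it is demonstrated here against
a mock directory so it can run anywhere, but on the real system you
would stop the postmaster first and point PGDATA at the live data
directory.

```shell
# Mock layout standing in for a real $PGDATA with database "ec".
PGDATA=/tmp/mock-pgdata
DEST=/tmp/ec-snapshot
mkdir -p "$PGDATA/base/ec" "$DEST/base/ec"
touch "$PGDATA/pg_log" "$PGDATA/base/ec/pg_class" "$PGDATA/base/ec/archivewords"

# The selective copy: pg_log and the other top-level files, the pg_*
# system tables inside the database subdirectory, and only the file
# backing the damaged table.
cp -p "$PGDATA"/pg_log "$DEST"/
cp -p "$PGDATA"/base/ec/pg_* "$DEST"/base/ec/
cp -p "$PGDATA"/base/ec/archivewords "$DEST"/base/ec/
ls -R "$DEST"
```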

> If not, how about if I first
> vacuum the table?

Don't do that --- it might wipe out vital clues.

            regards, tom lane