Discussion: Recovering database after disk crash
Hello,

A disk hosting an instance of 8.2 crashed on me (hardware failure). I was able to pull most of the data off the drive, but the one database that I need the most is corrupt. I'm not really sure where to start... so here are some error messages:

postgres@vsimkus-laptop:/var/log/postgresql$ pg_dump EPC
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR: index "pg_opclass_am_name_nsp_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
pg_dump: The command was: SELECT tableoid, oid, *, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = '10') as lanowner FROM pg_language WHERE lanispl ORDER BY oid

postgres@vsimkus-laptop:/var/log/postgresql$ reindexdb EPC
NOTICE: table "pg_class" was reindexed
reindexdb: reindexing of database "EPC" failed: ERROR: catalog is missing 4 attribute(s) for relid 10762

Almost every operation I attempt ends in a similar error message:

EPC=# \dt
ERROR: index "pg_opclass_am_name_nsp_index" contains unexpected zero page at block 0
HINT: Please REINDEX it.
EPC=# select * from siemr_sys.zip_codes;
ERROR: catalog is missing 10 attribute(s) for relid 1337158
EPC=# select * from siemr_sys.sex_list;
ERROR: catalog is missing 2 attribute(s) for relid 1337150
EPC=# select * from siemr_data.visits;
ERROR: catalog is missing 7 attribute(s) for relid 1337095

Is there some way I can extract the data from the tables even if the metadata is corrupt?

Thanks
Vic

--
Men never do evil so completely and cheerfully as when they do it from religious conviction.
-Blaise Pascal
After doing some more reading I've come to the conclusion that I'm in completely over my head. I got a fresh copy of the corrupt data and am starting from the very beginning. Here's the error I get on startup:

postgres@vsimkus-laptop:/usr/lib/postgresql/8.2/bin$ ./postgres -D /var/lib/postgresql/8.2/main
2008-05-12 16:16:02 CDT LOG: could not load root certificate file "root.crt": no SSL error reported
2008-05-12 16:16:02 CDT DETAIL: Will not verify client certificates.
2008-05-12 16:16:02 CDT LOG: database system was interrupted at 2008-04-16 09:28:36 CDT
2008-05-12 16:16:02 CDT LOG: unexpected pageaddr 0/1114A000 in log file 0, segment 17, offset 14295040
2008-05-12 16:16:02 CDT LOG: invalid primary checkpoint record
2008-05-12 16:16:02 CDT LOG: unexpected pageaddr 0/1114A000 in log file 0, segment 17, offset 14295040
2008-05-12 16:16:02 CDT LOG: invalid secondary checkpoint record
2008-05-12 16:16:02 CDT PANIC: could not locate a valid checkpoint record
2008-05-12 16:16:02 CDT LOG: startup process (PID 8359) was terminated by signal 6
2008-05-12 16:16:02 CDT LOG: aborting startup due to startup process failure

pg_control output:

./pg_controldata /var/lib/postgresql/8.2/main
pg_control version number:            822
Catalog version number:               200611241
Database system identifier:           5126483664462806975
Database cluster state:               in production
pg_control last modified:             Wed 16 Apr 2008 09:28:36 AM CDT
Current log file ID:                  0
Next log file segment:                18
Latest checkpoint location:           0/11DA3C38
Prior checkpoint location:            0/11DA3BF0
Latest checkpoint's REDO location:    0/11DA3C38
Latest checkpoint's UNDO location:    0/0
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/908105
Latest checkpoint's NextOID:          2001259
Latest checkpoint's NextMultiXactId:  7
Latest checkpoint's NextMultiOffset:  13
Time of latest checkpoint:            Wed 16 Apr 2008 09:13:56 AM CDT
Minimum recovery ending location:     0/0
Maximum data alignment:               4
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Date/time type storage:               64-bit integers
Maximum length of locale name:        128
LC_COLLATE:                           en_US.UTF-8
LC_CTYPE:                             en_US.UTF-8

I ran pg_resetxlog in my previous mucking around, but that's when all hell broke loose, so I'm holding off until someone more qualified advises me to do so.

TIA
Vic
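For reference, the kind of pg_resetxlog invocation being held off on here looks roughly like the following. This is only a sketch: the data directory path and the -l argument are inferred from the pg_controldata output above (TimeLineID 1, log file 0, next segment 18 = 0x12), and since pg_resetxlog discards WAL contents it should only ever be run against a scratch copy of the data directory.

```shell
# DESTRUCTIVE: run only against a throwaway copy of the data directory,
# with no server running on it.
#   -f               force the reset
#   -l 0x1,0x0,0x12  start new WAL past the last known segment
#                    (TimeLineID, log file ID, segment — values taken
#                    from the pg_controldata output above)
pg_resetxlog -f -l 0x1,0x0,0x12 /var/lib/postgresql/8.2/main

# If the server then starts, pg_dump whatever is readable immediately.
```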
"Vic Simkus" <vic.simkus@gmail.com> writes:
> postgres@vsimkus-laptop:/var/log/postgresql$ reindexdb EPC
> NOTICE: table "pg_class" was reindexed
> reindexdb: reindexing of database "EPC" failed: ERROR: catalog is
> missing 4 attribute(s) for relid 10762

If you're really lucky, doing the reindex with ignore_system_indexes enabled will go through. Otherwise, it's time to go back to your last backups :-(

			regards, tom lane
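The suggestion above can be tried from the 8.2 standalone backend, whose -P switch ignores system indexes while rebuilding them. A minimal sketch, using the binary and data-directory paths from earlier in the thread; whether the damaged catalogs are intact enough for this to succeed is exactly what's in question:

```shell
# Start a standalone (single-user) backend on database EPC with system
# indexes ignored (-P), then rebuild them from the backend> prompt:
./postgres --single -P -D /var/lib/postgresql/8.2/main EPC
# backend> REINDEX SYSTEM EPC;
# Exit the standalone backend with Ctrl-D, then try pg_dump again.
```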
If I'm understanding the errors correctly it seems that the corruption is in the system catalogs (metadata). The database does not use any fancy datatypes. Is there any way for me to rebuild the metadata manually? If I can see the leftover metadata and the data minus the missing metadata I can probably piece everything together enough to get the data out. Does that make sense or am I off my rocker?

On 5/12/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Vic Simkus" <vic.simkus@gmail.com> writes:
> > postgres@vsimkus-laptop:/var/log/postgresql$ reindexdb EPC
> > NOTICE: table "pg_class" was reindexed
> > reindexdb: reindexing of database "EPC" failed: ERROR: catalog is
> > missing 4 attribute(s) for relid 10762
>
> If you're really lucky, doing the reindex with ignore_system_indexes
> enabled will go through. Otherwise, it's time to go back to your
> last backups :-(
>
> regards, tom lane

--
Men never do evil so completely and cheerfully as when they do it from religious conviction.
-Blaise Pascal
"Vic Simkus" <vic.simkus@gmail.com> writes:
> If I'm understanding the errors correctly it seems that the corruption
> is in the system catalogs (metadata). The database does not use any
> fancy datatypes. Is there any way for me to rebuild the metadata
> manually? If I can see the leftover metadata and the data minus the
> missing metadata I can probably piece everything together enough to
> get the data out.

Based on the evidence so far, the disk failure has zeroed out multiple, randomly-chosen pages of your system catalogs. I'd think it very likely indeed that random pages of your table files got the same favor. You won't have any way to know what is missing ...

			regards, tom lane
During the initial ill-educated messing around I had set zero_damaged_pages to yes, but I'm guessing that the end result is the same... I'll try it with the fresh copy of [the corrupt] data.

What kind of a database can't deal with a bit of random values injected into its system files anyways? :)

On 5/12/08, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Vic Simkus" <vic.simkus@gmail.com> writes:
> > If I'm understanding the errors correctly it seems that the corruption
> > is in the system catalogs (metadata). The database does not use any
> > fancy datatypes. Is there any way for me to rebuild the metadata
> > manually? If I can see the leftover metadata and the data minus the
> > missing metadata I can probably piece everything together enough to
> > get the data out.
>
> Based on the evidence so far, the disk failure has zeroed out multiple,
> randomly-chosen pages of your system catalogs. I'd think it very likely
> indeed that random pages of your table files got the same favor.
> You won't have any way to know what is missing ...
>
> regards, tom lane

--
Men never do evil so completely and cheerfully as when they do it from religious conviction.
-Blaise Pascal
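For the archives: the zero_damaged_pages approach is usually combined with a table-at-a-time dump, so that the salvageable tables still come out even when others fail. A sketch, assuming a superuser connection (zero_damaged_pages is superuser-only) and using a table name from earlier in the thread; note that data on the zeroed pages is discarded, not recovered:

```shell
# Have the dump session zero out damaged pages instead of erroring,
# then try to extract one table at a time:
PGOPTIONS='-c zero_damaged_pages=on' pg_dump -t siemr_sys.zip_codes EPC > zip_codes.sql
```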