Discussion: Database Corruption - last chance recovery options?
Had some database corruption problems today. Since they came on the heels of making some minor database changes yesterday, they may or may not be related to that.

CentOS 4.x, PostgreSQL 8.1.4

I modified the following settings and then issued a reload. I hadn't turned up kernel.shmmax to allow for these bigger memory settings, but the database continued to run fine.

shared_buffers = 10000
work_mem = 2048
autovacuum = on                          # enable autovacuum subprocess?
autovacuum_naptime = 60                  # time between autovacuum runs, in secs
autovacuum_vacuum_threshold = 1000       # min # of tuple updates before
autovacuum_analyze_threshold = 500       # min # of tuple updates before
autovacuum_vacuum_scale_factor = 0.4     # fraction of rel size before
autovacuum_analyze_scale_factor = 0.2    # fraction of rel size before
autovacuum_vacuum_cost_delay = -1        # default vacuum cost delay for
autovacuum_vacuum_cost_limit = -1        # default vacuum cost limit for

After reloading I ran vacuumdb -a -z a number of times, which completed successfully. Sometime after the vacuum or the reload, one of our clients started to have database problems. The other databases in the same PostgreSQL instance on the server don't seem to be affected.

When I finally got the error report in the morning the database was in this state:

$ psql dbname

dbname=# \dt
ERROR:  cache lookup failed for relation 20884

A select * from pg_tables seemed to indicate that some of the tables were no longer in the database; some other tables were inaccessible. I made a backup, and then some functionality was restored by issuing a

reindex system dbname

Using the "broken database", pg_dump on all tables in pg_tables gives this for some tables:

pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  cache lookup failed for relation 20871
pg_dump: The command was: SELECT tableoid, oid, conname, pg_catalog.pg_get_constraintdef(oid) as condef FROM pg_catalog.pg_constraint WHERE conrelid = '20876'::pg_catalog.oid AND contype = 'f'
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  relation "public.auth_groups_permissions" does not exist
pg_dump: The command was: LOCK TABLE public.auth_groups_permissions IN ACCESS SHARE MODE
pg_dump: SQL command failed
pg_dump: Error message from server: ERROR:  relation with OID 21186 does not exist
pg_dump: The command was: LOCK TABLE public.ght_ght_shippingorders IN ACCESS SHARE MODE

Our backups failed, of course, and we have made a good attempt at recovery, which we are willing to accept as all the recovery we can do. As a last chance to get some of the data back, I would be interested to know if there is any way to read through the raw database files to see if I can recover some more data from them.

I'm also curious whether any of the settings/reload caused this problem, or whether the vacuum or autovacuum is what caused our error. I'll file a bug report if it's somehow repeatable.

Oh, and make backups.

-Mike
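P.S. In case it helps, here is roughly what I had in mind for reading the raw files (just a sketch -- I haven't tried it on this cluster yet, and the table name below is simply one of the broken ones from the pg_dump errors). The idea is to map a table to its file on disk via the catalogs, copy that file aside, and inspect it with an external tool such as pg_filedump:

dbname=# SHOW data_directory;
dbname=# SELECT oid FROM pg_database WHERE datname = current_database();
dbname=# SELECT relname, relfilenode FROM pg_class
dbname-#  WHERE relname = 'ght_ght_shippingorders';

The table's heap should then live in the file <data_directory>/base/<database oid>/<relfilenode>. For tables that still answer queries, a per-table dump is probably the safer first step:

$ pg_dump -t some_working_table dbname > some_working_table.sql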
On Jan 4, 2007, at 11:24 PM, Michael Best wrote:

> When I finally got the error report in the morning the database was
> in this state:
>
> $ psql dbname
>
> dbname=# \dt
> ERROR:  cache lookup failed for relation 20884

Do you have your error logs, and were there any relevant errors in them preceding your inability to get a table listing via psql?

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance
: http://o.ptimized.com/
615-260-0005
Thomas F. O'Connell wrote:
>
> On Jan 4, 2007, at 11:24 PM, Michael Best wrote:
>
>> When I finally got the error report in the morning the database was in
>> this state:
>>
>> $ psql dbname
>>
>> dbname=# \dt
>> ERROR:  cache lookup failed for relation 20884
>
> Do you have your error logs, and were there any relevant errors in them
> preceding your inability to get a table listing via psql?

From pgstartup.log I have the following, but I think these were from when I actually restarted the database the next day after finding out about the corruption:

FATAL:  could not create shared memory segment: Invalid argument
DETAIL:  Failed system call was shmget(key=5432001, size=86441984, 03600).

Set your memory requirement too high in postgresql.conf, reload instead of restarting the database, and it silently fails sometime later? Just a theory; as far as I know those errors were probably just from failed startup attempts when I was rebooting the server after seeing the data corruption.

From Wednesday, when the problems started occurring, I have a SIGHUP which I assume is the reload, and then a number of errors such as:

LOG:  received SIGHUP, reloading configuration files
LOG:  transaction ID wrap limit is 1074563238, limited by database "template1"

The last line is repeated 34 times before I start seeing things like

ERROR:  relation "auth_users" does not exist

On Thursday, after the corruption, there are more "cache lookup failed" log entries and more "relation does not exist" errors, as well as some lines that say "dbname" instead of "template1":

LOG:  transaction ID wrap limit is 1074563466, limited by database "dbname"

Interspersed with all these entries are NOTICEs from ongoing development on a different database on the same server (I should talk to them about insanely long names): 236 of these NOTICEs on the day of the problems and about 638 since last Saturday.

NOTICE:  identifier "docmanager_document__document_type__workflow__workflowstatehasaction" will be truncated to "docmanager_document__document_type__workflow__workflowstatehasa"
NOTICE:  identifier "m2m_docmanager_documenttype__workflow__workflowstatehasaction__permitted_roles" will be truncated to "m2m_docmanager_documenttype__workflow__workflowstatehasaction__"

That's all I have. No timestamps on the logs means I have to guess at some of the timing as well.

-Mike
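P.S. For reference, the size in that shmget() failure (~86 MB) is about what shared_buffers = 10000 (8 kB pages) plus overhead would request, so kernel.shmmax was presumably still below that at restart time. If so, I'd expect something like the following, run as root, to let the postmaster start with the new settings (the value is only illustrative -- it just has to exceed the segment size from the log). The first command shows the current limit, the second raises it for the running kernel, and the third persists it across reboots:

# sysctl kernel.shmmax
# sysctl -w kernel.shmmax=134217728
# echo "kernel.shmmax = 134217728" >> /etc/sysctl.conf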
Michael Best <mbest@pendragon.org> writes:
> Set your memory requirement too high in postgresql.conf, reload instead
> of restarting the database, it silently fails sometime later?

Yeah, wouldn't surprise me, since the reload is going to ignore any changes related to resizing shared memory. I think that 8.2 might warn you that it was ignoring the un-applyable changes, but the warnings would only go to the postmaster log, where they're easily missed :-(

			regards, tom lane
On Jan 5, 2007, at 10:01 PM, Tom Lane wrote:

> Michael Best <mbest@pendragon.org> writes:
>> Set your memory requirement too high in postgresql.conf, reload instead
>> of restarting the database, it silently fails sometime later?
>
> Yeah, wouldn't surprise me, since the reload is going to ignore any
> changes related to resizing shared memory. I think that 8.2 might warn
> you that it was ignoring the un-applyable changes, but the warnings
> would only go to the postmaster log, where they're easily missed :-(

Wait, now I'm curious. If a change in postgresql.conf that requires a restart doesn't take effect on reload, then how could a related failure manifest at all, regardless of when?

--
Thomas F. O'Connell

optimizing modern web applications
: for search engines, for usability, and for performance
: http://o.ptimized.com/
615-260-0005
"Thomas F. O'Connell" <tf@o.ptimized.com> writes: >> Michael Best <mbest@pendragon.org> writes: >>> Set your memory requirement too high in postgresql.conf, reload >>> instead of restarting the database, it silently fails sometime later? > Wait, now I'm curious. If a change in postgresql.conf that requires a > restart doesn't take effect on reload, then how could a related > failure manifest at all, regardless of when? The point is that when you eventually shut down and restart the postmaster, it'd fail then ... perhaps long after you've forgotten about your configuration changes, so it might be a bit mystifying. regards, tom lane