Обсуждение: failure with pg_dump
Hi: I have a script that I use to do regular dumps of my database. Over the weekend it failed, and produced the following error message. I'm not sure why this would have happened, how I would find out which index is referenced by 136451098, or where this select came from. pg_dump.sqlhost: Error message from server: ERROR: cache lookup failed for index 136451098 pg_dump.sqlhost: The command was: SELECT t.tableoid, t.oid, t.relname as indexname, pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef, t.relnatts as indnkeys, i.indkey, i.indisclustered, c.contype, c.conname, c.tableoid as contableoid, c.oid as conoid, (SELECT spcname FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) as tablespace, array_to_string(t.reloptions, ', ') as options FROM pg_catalog.pg_index i JOIN pg_catalog.pg_class t ON (t.oid = i.indexrelid) LEFT JOIN pg_catalog.pg_depend d ON (d.classid = t.tableoid AND d.objid = t.oid AND d.deptype = 'i') LEFT JOIN pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND d.refobjid = c.oid) WHERE i.indrelid = '136451090'::pg_catalog.oid ORDER BY indexname Any help would be greatly appreciated. Mija
Mija Lee <mija@scharp.org> writes: > I have a script that I use to do regular dumps of my database. Over the > weekend it failed, and produced the following error message. I'm not > sure why this would have happened, how I would find out which index is > referenced by 136451098, or where this select came from. It sounds like system catalog corruption, which is not good :-(. > pg_dump.sqlhost: Error message from server: ERROR: cache lookup failed > for index 136451098 > pg_dump.sqlhost: The command was: SELECT t.tableoid, t.oid, t.relname as > indexname, pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef, > t.relnatts as indnkeys, i.indkey, i.indisclustered, c.contype, > c.conname, c.tableoid as contableoid, c.oid as conoid, (SELECT spcname > FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) as > tablespace, array_to_string(t.reloptions, ', ') as options FROM > pg_catalog.pg_index i JOIN pg_catalog.pg_class t ON (t.oid = > i.indexrelid) LEFT JOIN pg_catalog.pg_depend d ON (d.classid = > t.tableoid AND d.objid = t.oid AND d.deptype = 'i') LEFT JOIN > pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND d.refobjid > = c.oid) WHERE i.indrelid = '136451090'::pg_catalog.oid ORDER BY indexname That looks like pg_dump's query to get information about the indexes of a particular table. So apparently the problem index is one of the ones for the table with OID 136451090. The easiest way to find out which one that is is select '136451090'::regclass; Trying \d on each of that table's indexes in succession would tell you which one is trashed. As for fixing it, the $64 question is how extensive is the catalog corruption. I see no very good reason to hope that only this one index is affected :-(. What you probably want to do is try to get a clean pg_dump then initdb and reload --- at least that's how I'd approach it, rather than hoping that there's no lurking problems remaining after you hack your way around the one you can see. What I'd try first is a REINDEX on pg_class. If that doesn't help, try to delete the pg_index row linking 136451098 and 136451090. What PG version is this, anyway, and did anything weird happen on your system that might explain data corruption? regards, tom lane
We've had a number of odd things that have been going on that I can't really explain, and that don't seem to result in log entries. Here's some info: - this is running 8.2.4 on a solaris 10 machine - I reran the dump after posting and these problems did not reoccur - We have a number of replicated schemas and tables on this server. There were other problems with the replication that happened earlier in the evening. - we have been having some very odd problems where our replication scripts hang intermittantly. For the life of me I can't figure out why, but when this happens, I look for processes that are idle in transaction that are more than one day old and kill them. That seems to allow the replication to finish. I have a few users that use a variety of products to view and manipulate the data in these tables (tableau, access, excel, ems, phppgadmin, dbvisualizer) and it seems like some connections/transactions never terminate, but I can't figure out which ones or why. I've been struggling with this problem for some time, but have never had an issue with the stalled replication affecting the dump. I was actually hoping that this error would help shed light on the replication problem. Mija Tom Lane wrote: > Mija Lee <mija@scharp.org> writes: >> I have a script that I use to do regular dumps of my database. Over the >> weekend it failed, and produced the following error message. I'm not >> sure why this would have happened, how I would find out which index is >> referenced by 136451098, or where this select came from. > > It sounds like system catalog corruption, which is not good :-(. > >> pg_dump.sqlhost: Error message from server: ERROR: cache lookup failed >> for index 136451098 >> pg_dump.sqlhost: The command was: SELECT t.tableoid, t.oid, t.relname as >> indexname, pg_catalog.pg_get_indexdef(i.indexrelid) as indexdef, >> t.relnatts as indnkeys, i.indkey, i.indisclustered, c.contype, >> c.conname, c.tableoid as contableoid, c.oid as conoid, (SELECT spcname >> FROM pg_catalog.pg_tablespace s WHERE s.oid = t.reltablespace) as >> tablespace, array_to_string(t.reloptions, ', ') as options FROM >> pg_catalog.pg_index i JOIN pg_catalog.pg_class t ON (t.oid = >> i.indexrelid) LEFT JOIN pg_catalog.pg_depend d ON (d.classid = >> t.tableoid AND d.objid = t.oid AND d.deptype = 'i') LEFT JOIN >> pg_catalog.pg_constraint c ON (d.refclassid = c.tableoid AND d.refobjid >> = c.oid) WHERE i.indrelid = '136451090'::pg_catalog.oid ORDER BY indexname > > That looks like pg_dump's query to get information about the indexes of > a particular table. So apparently the problem index is one of the ones > for the table with OID 136451090. The easiest way to find out which one > that is is > select '136451090'::regclass; > Trying \d on each of that table's indexes in succession would tell you > which one is trashed. > > As for fixing it, the $64 question is how extensive is the catalog > corruption. I see no very good reason to hope that only this one index > is affected :-(. What you probably want to do is try to get a clean > pg_dump then initdb and reload --- at least that's how I'd approach it, > rather than hoping that there's no lurking problems remaining after you > hack your way around the one you can see. > > What I'd try first is a REINDEX on pg_class. If that doesn't help, > try to delete the pg_index row linking 136451098 and 136451090. > > What PG version is this, anyway, and did anything weird happen on your > system that might explain data corruption? > > regards, tom lane
Mija Lee <mija@scharp.org> writes: > We've had a number of odd things that have been going on that I can't > really explain, and that don't seem to result in log entries. Here's > some info: > - this is running 8.2.4 on a solaris 10 machine > - I reran the dump after posting and these problems did not reoccur > - We have a number of replicated schemas and tables on this server. > There were other problems with the replication that happened earlier in > the evening. Hmm, are you using Slony? If so, you ought to take this to the Slony mailing lists. There are various restrictions in Slony on what it assumes can be done to a replicated table, and I believe that things like what you are seeing are one of the possible consequences of breaking Slony's expectations. That's about as far as my knowledge goes though. regards, tom lane
We are not using slony - we are replicating from sql server using perl. Tom Lane wrote: > Mija Lee <mija@scharp.org> writes: >> We've had a number of odd things that have been going on that I can't >> really explain, and that don't seem to result in log entries. Here's >> some info: > >> - this is running 8.2.4 on a solaris 10 machine >> - I reran the dump after posting and these problems did not reoccur >> - We have a number of replicated schemas and tables on this server. >> There were other problems with the replication that happened earlier in >> the evening. > > Hmm, are you using Slony? If so, you ought to take this to the Slony > mailing lists. There are various restrictions in Slony on what it > assumes can be done to a replicated table, and I believe that things > like what you are seeing are one of the possible consequences of > breaking Slony's expectations. That's about as far as my knowledge > goes though. > > regards, tom lane