Обсуждение: Database corruption

Поиск
Список
Период
Сортировка

Database corruption

От
Waldo Nell
Дата:
We have recently upgraded from PostgreSQL 7.4.5 to 8.1.4.  Our DB is
about 45GB in size and has about 100 tables, lots of stored
procedures, triggers etc.

The DB lived on a SAN and is accessed via Fibre Channel cards in an
IBM BladeCenter.  The system is running RedHat Linux Enterprise 3 I
think. We connect to it using the latest JDBC driver from Java 1.5.

Here is the problem.  Yesterday night I saw one of my statements in
my code threw this exception:

net.za.pwnconsulting.dblayer.query.DBQueryLayerException: Could not
execute SQL query [select c.mc_ca_id as c_mc_ca_id, d.mc_ca_id as
d_mc_ca_id, b.ca_id, b.ca_ctr_id from contract a, contract_agreement
b left outer join monthend_contract c on c.mc_ca_id = ? and
c.mc_last_proc_date = ? left outer join monthend2_contract d on
d.mc_ca_id = ? and d.mc_last_proc_date = ? where a.ctr_id =
b.ca_ctr_id and b.ca_agreement_type = ? and b.ca_start_date > ? and
b.ca_start_date <= ? and a.ctr_id = ? and ca_level_company = ? and
ca_level_data = ? and ca_level_type = 'V' and ca_id <> ? limit 1] in
pool [Oasis] because: ERROR: out of memory
   Detail: Failed on request of size 20.
org.postgresql.util.PSQLException: ERROR: out of memory
   Detail: Failed on request of size 20.
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse
(QueryExecutorImpl.java:1525)
    at org.postgresql.core.v3.QueryExecutorImpl.processResults
(QueryExecutorImpl.java:1309)
    at org.postgresql.core.v3.QueryExecutorImpl.execute
(QueryExecutorImpl.java:188)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.execute
(AbstractJdbc2Statement.java:452)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags
(AbstractJdbc2Statement.java:340)
    at org.postgresql.jdbc2.AbstractJdbc2Statement.executeQuery
(AbstractJdbc2Statement.java:239)
    at
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInter
nal(StandardQueryLayer.java:289)
    at
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabaseInter
nalExceptionWrapper(StandardQueryLayer.java:158)
    at
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase
(StandardQueryLayer.java:68)
    at
net.za.pwnconsulting.dblayer.query.StandardQueryLayer.queryDatabase
(StandardQueryLayer.java:58)
    at net.za.pwnconsulting.dblayer.AbstractDBLayer.queryDatabase
(AbstractDBLayer.java:57)
    at
za.co.massmart.oasis.daemons.ContractExtensionTimerTask.safeExecuteInter
nal(ContractExtensionTimerTask.java:183)
    at za.co.massmart.oasis.daemons.ExtendedHACronTimerTask.safeExecute
(ExtendedHACronTimerTask.java:94)
    at net.za.pwnconsulting.javaconfig.utils.time.HACronTimerTask.execute
(HACronTimerTask.java:156)
    at net.za.pwnconsulting.javaconfig.utils.time.CronTimerTask.run
(CronTimerTask.java:103)
    at java.lang.Thread.run(Thread.java:595)

So I checked and saw another application connecting to another DB in
the same instance showed a very weird error.  For an "update
tablename set field1 = X where primary_field = Y" statement I got a
duplicate key exception on the serial field "primary_field" - which
is impossible since I am not changing its value - I am updating a
different column.  Vacuum full analyze did not fix this.  I dropped
and recreated the index, then the error went away.  This was just
after I restarted the postgres instance (I stopped it using pg_ctl
stop -m immediate" since there were connections I could not close
from other systems.

I restarted the application that encountered the first error above
and ran the task again - it worked.  So I then started a vacuum full
analyze on the main production DB.  It produced several warnings
about page sizes not being large enough I think, but it completed
successfully.

This morning we found some data generated this morning at 01:00 was
present, but most of the tables lost all data captured since 25
August 2006.  That amounts to about 500 contracts, a while billing
cycle etc.  We upgraded postgresql on the 20th.

I checked the file system and it yielded no errors (ext3).  I checked
the server and there were no obvious hardware issues.

How can part of the DB just be lost like that?  I have just verified
by restoring a backup made yesterday night at 20:00 - before the
errors I showed above happened - that DB also has the data missing.
The bill run happened yesterday afternoon so somewhere between that
something broke.  Any ideas on how to trouble shoot this?

To summarise - the DB lost about 6 days of transactions for no known
reason.



Вложения