Discussion: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

From
Ned Wolpert
Date:
(I originally posted this to pgsql-admin and was pointed here instead.)

Folks-

  I'm doing a postmortem on a corruption event we had. I have an idea of
what happened, but I'm not sure. I figured I'd share what happened and see
if I'm close to right here.

  Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
and even a nightly pg_dumpall. 50G database.  Trying to update or delete a
row in a small (21-row, but heavily used) table would lock up completely
and never finish. Removed all clients, restarted the db instance, no joy.
Checked pg_stat_activity: nothing that wasn't idle... ran the delete,
locked up.

  Ran (SELECT * FROM pg_class JOIN pg_locks ON pg_locks.relation =
pg_class.oid;) with no clients touching this row and a fresh restart of the
db, and saw virtualtransactions against this same table where the values
were -1/nnnnn, where nnnnn was a huge number.  It turns out we had about
159 such entries across different tables in the database.  Checked the
hot-standby and, of course, no locks or anything.  Switched to the
hot-standby.
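
(For reference, a cleaned-up sketch of that pg_locks query, filtered down
to the suspicious entries; the LIKE filter on '-1/' is an addition here
rather than exactly what I ran:)

    SELECT c.relname, l.locktype, l.mode, l.virtualtransaction, l.granted
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
    WHERE l.virtualtransaction LIKE '-1/%';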

  The hot-standby instantly gained these locks, and we also noticed that 2
prepared transactions had migrated to it. Binary-upgraded to 9.1.8; the
locks still existed. Ended up executing the one thing we knew would work:
take the site down and pg_dumpall to a fresh instance. Everything is fine.

A little more background: We were running 9.1.4 back when 9.1.6 came out.
We saw there were possible corruption issues and did a binary upgrade and
reindexing. Everything seemed to be fine, but I wonder if we really had
problems back then. We rebuilt the hot-standby after the binary upgrade via
a normal restore and wal-file replays.  I should also note that the row
that had the lock that would not go away was created by an app server that
was killed (via -9) because it was non-responsive, and the row 'create
date' (in db and logs) is the exact time the app server was killed.

I was wondering a) whether these virtualtransactions that start with '-1/'
indicate a problem, and b) whether this could have happened from pre-9.1.6
corruption that was fixed in 9.1.6. Or could this have occurred when we
killed that app server? Or... am I looking in the wrong place?

I do still have the old data directories so I can start them up and check
out the dataset. Any advice?

Currently running environment: CentOS 5.x
Used the http://yum.postgresql.org/9.1 repo...
Currently using PostgreSQL 9.1.8 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit
but was on 9.1.6 at failure time.
The hardware on both primary and secondary servers is in good condition:
RAID5 via a hardware card with no issues.  Each is a 2x4x2 (2 hyperthreaded
Xeon E5620 2.4GHz CPUs, 4 cores each, 16 threads total) with 32G RAM.

Data from pg_dumpall checked out fine; no data loss occurred that we could
tell. Just this transaction lock and what seemed like invalid backend IDs
listed in the virtualtransaction column of the pg_locks view
(-1/verylargenumber).






--
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe

Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

From
Kevin Grittner
Date:
Ned Wolpert <ned.wolpert@imemories.com> wrote:

> I'm doing a postmortem on a corruption event we had. I have an
> idea on what happened, but not sure. I figure I'd share what
> happened and see if I'm close to right here.
>
> Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> and even a nightly pg_dump all.

Those WAL files aren't going to be of much use without an
occasional base backup to apply them to.

> 50G database.  Trying to update or delete a row in a small (21
> row, but heavily used table) would lock up completely. Never
> finish.

How long is "never" in hours or days?

> Removed all clients, restarted the db instance, no joy.
> Check pg_stat_activity, and nothing that wasn't idle.... run the
> delete, locked up.

Did you have any rows in pg_prepared_xacts that had lingered for
longer than you were waiting for the delete?

> Ran (SELECT*FROM pg_class JOIN pg_locks ON pg_locks.relation =
> pg_class.oid;) with no clients touching this row

Of course you need to be connected to the right database when you
run this, and you need to look at relation locks -- row locks don't
show up in that view.

> fresh restart of the db, and saw virtualtransactions against this
> same table where the values would be -1/nnnnn were nnnnn was a
> huge number.  Turns out we had about 159 entries from different
> tables in the database.

Sounds like prepared transactions.  Again, what was in
pg_prepared_xacts?

> Checked hot-standby and, of course, no locks or anything.
> Switched to hot-standby.

Meaning you promoted it to master?

> Hot-standby instantly gained these locks, Also noticed that 2
> prepared transactions migrated to the hot-standby. Binary
> upgraded to 9.1.8, locks still existed. Ended up executing the
> one thing we knew would work. Take the site down, pg_dumpall to
> fresh instance. Everything is fine.

That all sounds consistent with a flaky transaction manager.

> A little more background: We were running 9.1.4 back when 9.1.6
> came out. We saw there was possible corruption issues and did a
> binary upgrade and reindexing. Everything seemed to be fine, but
> I wonder if we really had problems back then. We rebuilt the
> hot-standby after the binary upgrade via normal restore and
> wal-file replays.  I should also note that this row that had the
> lock on it that would not go away, was created by an app server
> that was killed (via -9) since it was non-responsive, and the row
> 'create date' (in db and logs) is the exact time the app server
> was killed.

It sounds as though you abruptly terminated the process running
your transaction manager, which left it unaware of one or more
dangling prepared transactions.  Further, it sounds like your
transaction manager doesn't go looking for such detritus.  If it's
not going to watch for such problems, you had better do so.  Any
prepared transaction which is sitting in pg_prepared_xacts for more
than a few seconds, I would consider suspect.  After a few minutes
or hours I would consider them to be a problem.  After a day I would
consider the transaction manager to have fallen on its face, and
would go clean things up by either committing or rolling back the
prepared transaction(s).

> I was wondering if a) these virtualtransactions that start with
> '-1/' indicate a problem, b) if this could have happened from pre
> 9.1.6 corruption that was fixed in 9.1.6. Or, could this have
> occurred when we killed that app server? Or.... am I looking in
> the wrong place.

-1 as the process doesn't indicate a problem per se.  It indicates
the transaction has been "prepared" and is no longer associated
with a backend process or connection.

Something which was assuming the role of a transaction manager told
a transaction (or many of them) to prepare for commit as part of a
distributed transaction.  A transaction which says it successfully
completed such a PREPARE statement must hold all its locks and keep
all changes pending until it is told to commit or roll back, even
across database restarts.  It sounds like things were left in this
state for a very long time, which can lead to all kinds of
problems, notably bloat and blocking.

> I do still have the old data directories so I can start them up
> and check out the dataset. Any advice?

I would start it up and see what's in pg_prepared_xacts.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
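
To make Kevin's suggested check concrete, a minimal sketch of a query
against pg_prepared_xacts (the age expression and ordering are only for
readability; anything more than a few seconds old is what he is calling
suspect):

    SELECT gid, prepared, now() - prepared AS age, owner, database
    FROM pg_prepared_xacts
    ORDER BY prepared;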

Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

From
Tom Lane
Date:
Ned Wolpert <ned.wolpert@imemories.com> writes:
>   Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> and even a nightly pg_dump all. 50G database.  Trying to update or delete a
> row in a small (21 row, but heavily used table) would lock up completely.
> Never finish. Removed all clients, restarted the db instance, no joy. Check
> pg_stat_activity, and nothing that wasn't idle.... run the delete, locked
> up.

I'm betting one of those prepared transactions had updated or deleted
this row, and thus held a row lock on it.  (Come to think of it, a
SELECT FOR UPDATE/SHARE might have been enough.)  Did you try committing
or rolling back those xacts?

            regards, tom lane
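
To illustrate the failure mode Tom describes, a minimal sketch of how a row
lock can end up held by a prepared transaction (assumes
max_prepared_transactions > 0; the table name and gid here are made up):

    BEGIN;
    SELECT * FROM busy_table WHERE id = 1 FOR UPDATE;  -- takes a row lock
    PREPARE TRANSACTION 'demo_gid';
    -- The session can now exit or be killed, but the row lock survives,
    -- even across restarts, until COMMIT PREPARED 'demo_gid' or
    -- ROLLBACK PREPARED 'demo_gid' is issued.  Any later UPDATE or DELETE
    -- of that row blocks until then.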

Re: Database corruption event, unlockable rows, possibly bogus virtual xids? Invalid backend server xid

From
Ned Wolpert
Date:
Tom and Kevin-

  There were two entries in pg_prepared_xacts. In the test bed, executing
'ROLLBACK PREPARED' on both allowed the system to continue processing. All
locks I saw in pg_locks where the virtualtransaction started with '-1/'
were also gone. That was indeed the issue. More importantly to me, there
was likely no issue left over from the 9.1.4->9.1.6 upgrade we did; just a
'flaky transaction manager', as you suspected.
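
For the record, this is roughly what we ran, with the placeholder gid
swapped for the real values from pg_prepared_xacts.gid (COMMIT PREPARED
would have been the other option, had we wanted to keep the work):

    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;
    ROLLBACK PREPARED 'gid-reported-by-the-query-above';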

  Thanks to both of you for your help in tracking this down.

P.S. Kevin, we also do tar archives of the data directories nightly to
accompany the wal files we store, using pg_start_backup/pg_stop_backup. :-)
Full restores are tested monthly.


On Thu, Feb 21, 2013 at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Ned Wolpert <ned.wolpert@imemories.com> writes:
> >   Event: Running 9.1.6 with hot-standby, archiving 4 months of wal files,
> > and even a nightly pg_dump all. 50G database.  Trying to update or
> delete a
> > row in a small (21 row, but heavily used table) would lock up completely.
> > Never finish. Removed all clients, restarted the db instance, no joy.
> Check
> > pg_stat_activity, and nothing that wasn't idle.... run the delete, locked
> > up.
>
> I'm betting one of those prepared transactions had updated or deleted
> this row, and thus held a row lock on it.  (Come to think of it, a
> SELECT FOR UPDATE/SHARE might have been enough.)  Did you try committing
> or rolling back those xacts?
>
>                         regards, tom lane
>



--
Virtually, Ned Wolpert

"Settle thy studies, Faustus, and begin..."   --Marlowe