Discussion: corruption issue after server crash - ERROR: unexpected chunk number 0

corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
Hello, we are running Postgres 9.2.5 on RHEL6. Our production server crashed hard, and when it came back up our logs were flooded with:

STATEMENT:  SELECT "session_session"."session_key", "session_session"."session_data", "session_session"."expire_date", "session_session"."nonce" FROM "session_session" WHERE ("session_session"."session_key" = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3'  AND "session_session"."expire_date" > '2013-11-21 13:27:33.107913' )

ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

We restarted the application and whatever session was constantly hitting that row stopped, but I'm concerned about remediation.  When I attempt to read from that row, the error occurs:

select * from session_session where session_key = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3';
ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

When I attempt to delete this row I get this error:
delete from session_session where session_key = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3';
ERROR:  tuple concurrently updated

We happen to have a maintenance window tonight, so I will have some time when the app is down to run some database fixes.  I saw other threads suggesting a reindex of the toast table, but this is a 14GB table and I'm not sure how long that will take or whether it will even be successful.  We also have a full db vacuum/analyze scheduled nightly for 2am, so I am expecting to learn whether there are other impacted tables, but it's troubling not knowing what the remediation is.  This particular table could be truncated if necessary, if that is an option, but I'm not sure about other tables.
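
For reference, this is the kind of thing I am considering for tonight (a sketch only - I am assuming the parent table of pg_toast_19122 is session_session, and the database name is a placeholder):

# Confirm which table owns the TOAST relation named in the error:
psql -d proddb -c "SELECT relname FROM pg_class WHERE reltoastrelid = 'pg_toast.pg_toast_19122'::regclass;"
# Rebuild the parent table's indexes; REINDEX TABLE also rebuilds the index on its TOAST table:
psql -d proddb -c "REINDEX TABLE session_session;"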

Any suggestions for how to handle the tuple concurrently updated error? Or if a reindex is likely to help with the unexpected chunk error? 

Thanks
Mike

Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
Update - I have two hot replication slaves of this db, and both have the problem.  I took one out of recovery and ran REINDEX TABLE session_session, and it fixed the errors about this row.  Now I'm going to run vacuum and see if there are other tables that complain; if so, I'm guessing I will need to see if there is a way to force vacuum to continue on error, and worst case I might have to script a table-by-table vacuum.  If anyone has a better suggestion for determining the extent of the damage, I'd appreciate it.
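
Roughly what I have in mind for that table-by-table pass, in case it helps (a sketch only - the database name and log path are placeholders):

# Vacuum each table individually so a failure on one table does not stop the rest.
psql -d proddb -At -c "SELECT oid::regclass FROM pg_class WHERE relkind = 'r'" |
while read tbl; do
  echo "=== $tbl ===" >> /tmp/vacuum_check.log
  psql -d proddb -c "VACUUM ANALYZE $tbl;" >> /tmp/vacuum_check.log 2>&1
done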


On Thu, Nov 21, 2013 at 2:10 PM, Mike Broers <mbroers@gmail.com> wrote:
Hello, we are running Postgres 9.2.5 on RHEL6. Our production server crashed hard, and when it came back up our logs were flooded with:

STATEMENT:  SELECT "session_session"."session_key", "session_session"."session_data", "session_session"."expire_date", "session_session"."nonce" FROM "session_session" WHERE ("session_session"."session_key" = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3'  AND "session_session"."expire_date" > '2013-11-21 13:27:33.107913' )

ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

We restarted the application and whatever session was constantly hitting that row stopped, but I'm concerned about remediation.  When I attempt to read from that row, the error occurs:

select * from session_session where session_key = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3';
ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

When I attempt to delete this row I get this error:
delete from session_session where session_key = 'gk9aap5d7btp6tzquh0kf73gpfmik5w3';
ERROR:  tuple concurrently updated

We happen to have a maintenance window tonight, so I will have some time when the app is down to run some database fixes.  I saw other threads suggesting a reindex of the toast table, but this is a 14GB table and I'm not sure how long that will take or whether it will even be successful.  We also have a full db vacuum/analyze scheduled nightly for 2am, so I am expecting to learn whether there are other impacted tables, but it's troubling not knowing what the remediation is.  This particular table could be truncated if necessary, if that is an option, but I'm not sure about other tables.

Any suggestions for how to handle the tuple concurrently updated error? Or if a reindex is likely to help with the unexpected chunk error? 

Thanks
Mike

Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Kevin Grittner
Date:
Mike Broers <mbroers@gmail.com> wrote:

> Hello we are running postgres 9.2.5 on RHEL6, our production
> server crashed hard and when it came back up our logs were
> flooded with:

> ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

Your database is corrupted.  Unless you were running with fsync =
off or full_page_writes = off, that should not happen.  It is
likely to be caused by a hardware problem (bad RAM, a bad disk
drive, or network problems if your storage is across a network).

If it were me, I would stop the database service and copy the full
data directory tree.
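
Something along these lines, for example (the service name and paths are only placeholders and will vary by install):

# Stop the cluster and take a file-level copy before attempting any repairs.
service postgresql-9.2 stop
cp -a /var/lib/pgsql/9.2/data /backups/data_copy_before_repair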

http://wiki.postgresql.org/wiki/Corruption

If fsync or full_page_writes were off, your best bet is probably to
go to your backup.  If you don't go to a backup, you should try to
get to a point where you can run pg_dump, and dump and load to a
freshly initdb'd cluster.
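
In outline, something like this (paths and port numbers are placeholders):

pg_dumpall -p 5432 > /backups/full_dump.sql          # dump everything from the suspect cluster
initdb -D /var/lib/pgsql/9.2/data_new                # create a fresh cluster
pg_ctl -D /var/lib/pgsql/9.2/data_new -o "-p 5433" start
psql -p 5433 -d postgres -f /backups/full_dump.sql   # load into the new cluster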

If fsync and full_page_writes were both on, you should run hardware
diagnostics at your earliest opportunity.  When hardware starts to
fail, the first episode is rarely the last or the most severe.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
Thanks for the response.  fsync and full_page_writes are both on.  

Our database runs on a managed hosting provider's vmhost server/SAN; I can possibly request that they provide some hardware test results - do you have any specific diagnostics in mind?  The crash was apparently due to our vmhost suddenly losing power.  The only row it has complained about with the chunk error also migrated into both standby servers, and as previously stated it was fixed with a reindex of the parent table in one of the standby servers after taking it out of recovery.  The vacuumdb -avz on this test copy didn't produce any errors or warnings; I'm also going to run a pg_dumpall on this host to see if any other rows are problematic.
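
For what it's worth, the verification pass on the test copy is basically this (the host name is a placeholder):

vacuumdb -avz -h testhost 2>&1 | tee /tmp/vacuumdb_check.log
# pg_dumpall has to read and detoast every row, so a clean run is a reasonable sanity check:
pg_dumpall -h testhost > /dev/null && echo "pg_dumpall completed without error"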

Is there something else I can run to confirm we are more or less ok at the database level after the pg_dumpall, or is there no way to be sure short of a fresh initdb?

I am planning on running the reindex in actual production tonight during our maintenance window, but I was hoping that if that worked we would be out of the woods.



On Thu, Nov 21, 2013 at 3:56 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Mike Broers <mbroers@gmail.com> wrote:

> Hello we are running postgres 9.2.5 on RHEL6, our production
> server crashed hard and when it came back up our logs were
> flooded with:

> ERROR:  unexpected chunk number 0 (expected 1) for toast value 117927127 in pg_toast_19122

Your database is corrupted.  Unless you were running with fsync =
off or full_page_writes = off, that should not happen.  It is
likely to be caused by a hardware problem (bad RAM, a bad disk
drive, or network problems if your storage is across a network).

If it were me, I would stop the database service and copy the full
data directory tree.

http://wiki.postgresql.org/wiki/Corruption

If fsync or full_page_writes were off, your best bet is probably to
go to your backup.  If you don't go to a backup, you should try to
get to a point where you can run pg_dump, and dump and load to a
freshly initdb'd cluster.

If fsync and full_page_writes were both on, you should run hardware
diagnostics at your earliest opportunity.  When hardware starts to
fail, the first episode is rarely the last or the most severe.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Kevin Grittner
Date:
Mike Broers <mbroers@gmail.com> wrote:

> Thanks for the response.  fsync and full_page_writes are both on.

> [ corruption appeared following power loss on the machine hosting
> the VM running PostgreSQL ]

That leaves three possibilities:
  (1)  fsync doesn't actually guarantee persistence in your stack.
  (2)  There is a hardware problem which has not been recognized.
  (3)  There is a so-far unrecognized bug in PostgreSQL.

Based on my personal experience, those are listed in descending
order of probability.  I seem to recall reports of some VM for
which an fsync did not force data all the way to persistent
storage, but I don't recall which one.  You might want to talk to
your service provider about what guarantees they make in this
regard.

> Is there something else I can run to confirm we are more or less
> ok at the database level after the pg_dumpall or is there no way
> to be sure and a fresh initdb is required.

Given that you had persistence options in their default state of
"on", and the corruption appeared after a power failure in a VM
environment, I would guess that the damage is probably limited.
That said, damage from this sort of event can remain hidden and
cause data loss later.  Unfortunately we do not yet have a
consistency checker that can root out such problems.  If you can
arrange a maintenance window to dump and load to a fresh initdb,
that would eliminate the possibility that some hidden corruption is
lurking.  If that is not possible, running VACUUM FREEZE ANALYZE
will reduce the number of things that can go wrong, without
requiring down time.
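
For example (the database name is a placeholder):

# Freezes and analyzes every table in the named database; repeat per database as needed.
psql -d yourdb -c "VACUUM FREEZE ANALYZE;"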

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
Thanks.  After this pg_dumpall, I am going to see what kind of impact I can expect from running VACUUM FREEZE ANALYZE (normally I just run vacuumdb -avz nightly via a cron job) and schedule time to run this in production against all the tables in the database.  Is there anything I should look out for with vacuum freeze?
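
For context, the nightly job is essentially a crontab entry along these lines (the log path is a placeholder):

# 2am vacuum/analyze of all databases:
0 2 * * * vacuumdb -a -v -z >> /var/log/postgresql/vacuumdb_nightly.log 2>&1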

Much appreciated,
Mike


On Thu, Nov 21, 2013 at 4:51 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Mike Broers <mbroers@gmail.com> wrote:

> Thanks for the response.  fsync and full_page_writes are both on.

> [ corruption appeared following power loss on the machine hosting
> the VM running PostgreSQL ]

That leaves three possibilities:
  (1)  fsync doesn't actually guarantee persistence in your stack.
  (2)  There is a hardware problem which has not been recognized.
  (3)  There is a so-far unrecognized bug in PostgreSQL.

Based on my personal experience, those are listed in descending
order of probability.  I seem to recall reports of some VM for
which an fsync did not force data all the way to persistent
storage, but I don't recall which one.  You might want to talk to
your service provider about what guarantees they make in this
regard.

> Is there something else I can run to confirm we are more or less
> ok at the database level after the pg_dumpall or is there no way
> to be sure and a fresh initdb is required.

Given that you had persistence options in their default state of
"on", and the corruption appeared after a power failure in a VM
environment, I would guess that the damage is probably limited.
That said, damage from this sort of event can remain hidden and
cause data loss later.  Unfortunately we do not yet have a
consistency checker that can root out such problems.  If you can
arrange a maintenance window to dump and load to a fresh initdb,
that would eliminate the possibility that some hidden corruption is
lurking.  If that is not possible, running VACUUM FREEZE ANALYZE
will reduce the number of things that can go wrong, without
requiring down time.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Kevin Grittner
Date:
Mike Broers <mbroers@gmail.com> wrote:

> Is there anything I should look out for with vacuum freeze?

Just check the logs and the vacuum output for errors and warnings.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
John R Pierce
Date:
On 11/21/2013 2:51 PM, Kevin Grittner wrote:
> That leaves three possibilities:
>    (1)  fsync doesn't actually guarantee persistence in your stack.

I'll put my $5 on (1).... virtualization stacks add way too much ooga-booga to the storage stack, and tend to play fast and loose with write buffering to maintain some semblance of performance.



--
john r pierce                                      37N 122W
somewhere on the middle of the left coast



Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Scott Marlowe
Date:
On Thu, Nov 21, 2013 at 4:14 PM, John R Pierce <pierce@hogranch.com> wrote:
> On 11/21/2013 2:51 PM, Kevin Grittner wrote:
>>
>> That leaves three possibilities:
>>    (1)  fsync doesn't actually guarantee persistence in your stack.
>
>
> I'll put my $5 on (1).... virtualization stacks add way too much ooga-booga
> to the storage stack, and tend to play fast and loose with write buffering
> to maintain some semblence of performance.

If you really hate your database, be sure and include an NFS mount at
the vm image level in there somewhere.


Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
vacuumdb -avz, pg_dumpall, and VACUUM FREEZE ANALYZE on the former standby database that received the corruption via replication all came back without errors.  Is the vacuum freeze intended to potentially fix problems, or just to reveal whether other tables may have corruption?  I'm trying to decide if this needs to be run in production.


On Thu, Nov 21, 2013 at 5:09 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
Mike Broers <mbroers@gmail.com> wrote:

> Is there anything I should look out for with vacuum freeze?

Just check the logs and the vacuum output for errors and warnings.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Kevin Grittner
Date:
Mike Broers <mbroers@gmail.com> wrote:

> vacuumb avz, pg_dumpall, and vacuum freeze analyze on the former
> standby database that received the corruption via replication all
> came back without errors.  Is the vacuum freeze intended to
> potentially fix problems or just reveal if other tables may have
> corruption, Im trying to decide if this needs to be run in
> production.

You do know that you could cause the vacuumdb run to freeze by
adding the F option, right?
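
That is, something like:

# The usual run with freezing added:
vacuumdb -a -v -z -F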

Anyway, there are two reasons I would run VACUUM FREEZE (or use the
F option):

  (1)  The vacuum will pass every page in each table, rather than
just visiting pages which are not known to be all-visible already,
so it is more likely to find any corruption that is there.  You
kinda have that covered anyway with the pg_dumpall run, though.

  (2)  The freezing, which then releases no-longer-needed clog
space at the next checkpoint, has been known to dodge some bad
transaction ID / visibility problems.  The odds that it will fix
existing corruption are very slim, but non-zero.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Shaun Thomas
Date:
> Update - I have two hot replication slaves of this db, both have the problem.
> I took one out of recovery and ran REINDEX table session_session and it
> fixed the errors about this row.  Now Im going to run vacuum and see if
> there are other tables that complain, but Im guessing if so I will need to see
> if there is a way to force vacuum to continue on error, worst case I might
> have to script a table by table vacuum script I guess..  If anyone has a better
> suggestion for determining the extent of the damage Id appreciate it.

Oh man. I'm sorry, Mike.

One of the cardinal rules I have is to disconnect any replication following a database crash. It's just too easy for damaged replicated rows to be propagated unless you're on 9.3 and have checksums enabled. If you want to perform a table-by-table check, don't vacuum the database, but the individual tables. I'd go with a DO loop and have it raise notices into the log so you can investigate further:

COPY (
SELECT 'VACUUM ' || oid::regclass::text || ';'
  FROM pg_class
 WHERE relkind = 'r'
) to '/tmp/vac_all.sql';
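
For example (the database name is a placeholder):

psql -d yourdb -f /tmp/vac_all.sql > /tmp/vac_all.log 2>&1
grep -iE 'error|warning' /tmp/vac_all.log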

Run the /tmp/vac_all.sql through psql and pipe the contents into a log file. Any table that doesn't vacuum successfully will need to be repaired manually. One way you can do this if there are dupes, is by checking the ctid value after disabling index scans:

SET enable_indexscan TO False;

SELECT ctid, * FROM [broken_table] WHERE ...;

Just construct the WHERE clause based on the error output, and you should get all rows if there are dupes. You'll need to figure out which row to keep, then delete the bad row based on the ctid. Do this as many times as it takes, then reindex to make sure the proper row versions are indexed.
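
In other words, something like this (the table name and ctid here are made up - use the real values from the query above):

# Delete the bad row version by its physical location, then rebuild the indexes:
psql -d yourdb -c "DELETE FROM broken_table WHERE ctid = '(1234,5)';"
psql -d yourdb -c "REINDEX TABLE broken_table;"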

It's also a good idea to dump any table that came back with an error, just in case.

After you've done all of that, you should re-base your replicas once you've determined your production system is usable. In the meantime, I highly recommend you set up a VIP you can assign to one of your replicas if your production system dies again, and remove any autostart code. If your production system crashes, switch the VIP immediately to a replica, and invalidate your old production system. Data corruption is insidious when streaming replication is involved.

Look into tools like repmgr to handle managing your replicas as a cluster to make forced invalidation and re-basing
easier.

Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
Thanks Shaun, 

I'm planning to schedule a time to do the vacuum freeze suggested previously.  So far the extent of the problem seems limited to the one session table and the one session row that was being used by a heavy bot scan at the time of the crash.  Currently I'm testing a recovery of a production backup from today to rebase one of the replication targets that I was using to test fixes last week.  Hopefully that validates the current backups, and I can proceed with asking our managed services provider about the false notification of the disk write and ways to prevent that going forward.

I'll update the list if I uncover anything interesting in the process and/or need more advice.  Thanks again for your input - it's much appreciated as always.  Nothing like a little crash corruption to get the blood flowing!

Mike


On Mon, Nov 25, 2013 at 10:29 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> Update - I have two hot replication slaves of this db, both have the problem.
> I took one out of recovery and ran REINDEX table session_session and it
> fixed the errors about this row.  Now Im going to run vacuum and see if
> there are other tables that complain, but Im guessing if so I will need to see
> if there is a way to force vacuum to continue on error, worst case I might
> have to script a table by table vacuum script I guess..  If anyone has a better
> suggestion for determining the extent of the damage Id appreciate it.

Oh man. I'm sorry, Mike.

One of the cardinal rules I have is to disconnect any replication following a database crash. It's just too easy for damaged replicated rows to be propagated unless you're on 9.3 and have checksums enabled. If you want to perform a  table-by-table check, don't vacuum the database, but the individual tables. I'd go with a DO loop and have it raise notices into the log so you can investigate further:

COPY (
SELECT 'VACUUM ' || oid::regclass::text || ';'
  FROM pg_class
 WHERE relkind = 'r'
) to '/tmp/vac_all.sql';

Run the /tmp/vac_all.sql through psql and pipe the contents into a log file. Any table that doesn't vacuum successfully will need to be repaired manually. One way you can do this if there are dupes, is by checking the ctid value after disabling index scans:

SET enable_indexscan TO False;

SELECT ctid, * FROM [broken_table] WHERE ...;

Just construct the WHERE clause based on the error output, and you should get all rows if there are dupes. You'll need to figure out which row to keep, then delete the bad row based on the ctid. Do this as many times as it takes, then reindex to make sure the proper row versions are indexed.

It's also a good idea to dump any table that came back with an error, just in case.

After you've done all of that, you should re-base your replicas once you've determined your production system is usable. In the meantime, I highly recommend you set up a VIP you can assign to one of your replicas if your production system dies again, and remove any autostart code. If your production system crashes, switch the VIP immediately to a replica, and invalidate your old production system. Data corruption is insidious when streaming replication is involved.

Look into tools like repmgr to handle managing your replicas as a cluster to make forced invalidation and re-basing easier.

Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email

Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Mike Broers
Date:
The restore of a post-crash production backup worked as hoped and the 2nd replication slave is back into its happy hot standby state. 

So if this problem replicated to our standby servers, does that indicate that the potentially problematic fsync occurred during a pg_xlog write?  Would breaking replication at the time of the crash have prevented this from cascading, or was it already too late at that point?

Thanks again for the input, it's been very helpful!
Mike




On Mon, Nov 25, 2013 at 12:20 PM, Mike Broers <mbroers@gmail.com> wrote:
Thanks Shaun, 

I'm planning to schedule a time to do the vacuum freeze suggested previously.  So far the extent of the problem seems limited to the one session table and the one session row that was being used by a heavy bot scan at the time of the crash.  Currently I'm testing a recovery of a production backup from today to rebase one of the replication targets that I was using to test fixes last week.  Hopefully that validates the current backups, and I can proceed with asking our managed services provider about the false notification of the disk write and ways to prevent that going forward.

I'll update the list if I uncover anything interesting in the process and/or need more advice.  Thanks again for your input - it's much appreciated as always.  Nothing like a little crash corruption to get the blood flowing!

Mike


On Mon, Nov 25, 2013 at 10:29 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> Update - I have two hot replication slaves of this db, both have the problem.
> I took one out of recovery and ran REINDEX table session_session and it
> fixed the errors about this row.  Now Im going to run vacuum and see if
> there are other tables that complain, but Im guessing if so I will need to see
> if there is a way to force vacuum to continue on error, worst case I might
> have to script a table by table vacuum script I guess..  If anyone has a better
> suggestion for determining the extent of the damage Id appreciate it.

Oh man. I'm sorry, Mike.

One of the cardinal rules I have is to disconnect any replication following a database crash. It's just too easy for damaged replicated rows to be propagated unless you're on 9.3 and have checksums enabled. If you want to perform a  table-by-table check, don't vacuum the database, but the individual tables. I'd go with a DO loop and have it raise notices into the log so you can investigate further:

COPY (
SELECT 'VACUUM ' || oid::regclass::text || ';'
  FROM pg_class
 WHERE relkind = 'r'
) to '/tmp/vac_all.sql';

Run the /tmp/vac_all.sql through psql and pipe the contents into a log file. Any table that doesn't vacuum successfully will need to be repaired manually. One way you can do this if there are dupes, is by checking the ctid value after disabling index scans:

SET enable_indexscan TO False;

SELECT ctid, * FROM [broken_table] WHERE ...;

Just construct the WHERE clause based on the error output, and you should get all rows if there are dupes. You'll need to figure out which row to keep, then delete the bad row based on the ctid. Do this as many times as it takes, then reindex to make sure the proper row versions are indexed.

It's also a good idea to dump any table that came back with an error, just in case.

After you've done all of that, you should re-base your replicas once you've determined your production system is usable. In the meantime, I highly recommend you set up a VIP you can assign to one of your replicas if your production system dies again, and remove any autostart code. If your production system crashes, switch the VIP immediately to a replica, and invalidate your old production system. Data corruption is insidious when streaming replication is involved.

Look into tools like repmgr to handle managing your replicas as a cluster to make forced invalidation and re-basing easier.

Good luck!

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com

______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email


Re: Re: corruption issue after server crash - ERROR: unexpected chunk number 0

From:
Shaun Thomas
Date:
> So if this problem replicated to our standby servers does that indicate
> that the potential problematic fsync occurred during a pg_xlog write?

Pretty much. You have a couple issues here, and no easy way to approach them. Primarily, you got data corruption during a sync operation. This means either the OS or the hardware somewhere along the line lied about the write, or the write was corrupted and the filesystem log replayed incorrectly upon reboot. Once that happens, you can't trust *any* data in your database. Pre-checksum PostgreSQL has no way to verify the integrity of existing data, and system crashes can corrupt quite a bit of data that was only tangentially involved.

What likely happens in these scenarios is that the database startup succeeds, and then it reads some rows in from a corrupted table. By corrupted, I mean even a single data page with a mangled pointer. That mangled pointer gave the database incorrect information about the state of that data page's contents, and the database continued on that information. That means subsequent transaction logs from that point are *also* corrupt, and hence any streaming or warm standby replicas are subsequently damaged as well. But they'll be damaged differently, because they likely didn't have the initial corruption, just the byte changes dictated by the WAL stream.

Unless you know where the initial corruption came from, the system that caused it should be quarantined for verification. RAM, disk, CPU, everything should pass integrity checks before putting it back into production.

> Would breaking replication at the time of the crash have prevented
> this from cascading or was it already too late at that point?

Most likely. If, at the time of the crash, you switched to one of your replicas and made it the new master, it would give you the opportunity to check out the crashed system before it spread the love. Even if you don't have a true STONITH model, starting up a potentially data-compromised node in an active cluster is a gamble.

I did something similar once. One of our DRBD nodes crashed and came back up and re-attached to the DRBD pair after a quick data discard and replay. I continued with some scheduled system maintenance, and performed a node failover with no incident. It wasn't until 20 minutes later that the corrupt disk pages started making their presence felt, and by then it was too late. Luckily we were still verifying, but with our secondaries ruined, we had to restore from backup. A 30-minute outage became a 4-hour one.

Afterwards, we put in a new policy that any crash means a DRBD verify at minimum, and until the node passes, it is to be considered invalid and unusable. If you haven't already, I suggest something similar for your setup. Verify a crashed node before using it again, no matter how much pressure you're under. It can always get worse.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com


______________________________________________

See http://www.peak6.com/email_disclaimer/ for terms and conditions related to this email