I’m currently working for a client on a PowerBuilder to WPF migration. It’s one of those “I could tell you, but I’d have to kill you” kinds of clients, and the quicklime pits are currently occupied by the EMC tech… but I’ve said too much already.
At approximately 3 or 4 PM one day, users of the Batman application here in Gotham started to experience problems accessing the application. Batman is a document management system that also integrates with the ERP system; very little goes on here that doesn’t involve Batman in some way. The errors being received seemed to point to network issues (“TCP protocol error,” “connection forcibly closed by the remote host,” etc.), but the real issue was much more insidious.
Connecting to the database via SSMS and running SELECTs against the tables underlying the problem areas of the application started to reveal the issue. You couldn’t do a SELECT * FROM MyTable without it bombing with the same errors noted above. A run of DBCC CHECKDB revealed 14 tables with corruption. One of them was the Document table, which is pretty central to a “document management” system. IT reported that a single drive in the SAN had gone bad in the night; a new drive was in place and working fine. The partition that holds the Batman database is configured for RAID Level 5, so a single drive failure shouldn’t have caused any trouble, and yet the database was corrupted.
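The check itself is a one-liner. A sketch of the kind of command involved (the database name is from the story; the options just keep the output readable):

```sql
-- Full consistency check of the database.
-- NO_INFOMSGS suppresses the informational chatter so only errors show;
-- ALL_ERRORMSGS makes sure no errors are truncated from the output.
DBCC CHECKDB (N'Batman') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```

Reading that output carefully, rather than reacting to it, is the whole point of the next few paragraphs.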
They do hourly incremental backups here, so the first thing done was to try a restore. A restore of the most recent backup failed, so they worked backwards until they hit a good one. The successful restore was of a backup taken at 3 AM, a full day behind. That time also roughly corresponds to when the SAN started to report the drive failure. The plot thickens…
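Working backwards through backups looks roughly like this; a sketch only, with hypothetical file paths and logical file names, restoring to a copy rather than over production while the investigation is still underway:

```sql
-- Check a backup file's integrity before attempting the full restore
-- (path is hypothetical).
RESTORE VERIFYONLY FROM DISK = N'D:\Backups\Batman_0300.bak';

-- Restore to a separate copy so production is untouched while we dig.
RESTORE DATABASE BatmanCopy
FROM DISK = N'D:\Backups\Batman_0300.bak'
WITH MOVE N'Batman_Data' TO N'D:\Data\BatmanCopy.mdf',
     MOVE N'Batman_Log'  TO N'D:\Data\BatmanCopy.ldf',
     RECOVERY;
```

Note that VERIFYONLY only validates the backup media, not the consistency of the pages inside it, which is why the restores of the newer backups could still fail.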
I got my hands on the output from DBCC CHECKDB and noticed a pattern. What’s sad is that nobody who should have noticed the pattern in the DBCC output actually did. There was a rush to do things to try to recover the data before anybody really understood what was wrong with it in the first place. Cooler heads must prevail in these circumstances: investigate first and lay out a plan of action, or you could end up making things worse. DBCC CHECKDB also told us that:
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB
Yikes. That means the database is so messed up that you’re definitely going to lose some data when you repair it back to a consistent state. All the more reason to do a little more investigation into the problem. Rescuing this database is preferable to exporting everything salvageable into a new one: this is a fifteen-year-old application with about seven hundred tables, triggers everywhere, and referential integrity constraints to deal with, and only fourteen of the tables have an issue. We have a good backup that is missing the last 24 hours of business, which means we could have a “do-over” of yesterday, but that’s not a very palatable option either.
All of the affected tables had TEXT columns and all of the errors were about LOB data types and orphaned off-row data which basically means TEXT, IMAGE or NTEXT columns. If we did a SELECT on an affected table and excluded those columns, we got all of the rows. We exported that data into a separate database. Things are looking up.
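The export of the intact columns is simple once you know the LOB columns are the problem. A sketch, with hypothetical table, column, and database names:

```sql
-- Copy every column EXCEPT the TEXT column, so the corrupted off-row
-- LOB data is never touched, into a separate rescue database.
SELECT DocumentId, Title, CreatedDate
INTO   Rescue.dbo.Document
FROM   Batman.dbo.Document;
```

Repeating this for each of the fourteen affected tables gave us a clean copy of every row, minus the LOB values.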
Working on a copy of the production database, we then ran DBCC CHECKDB with REPAIR_ALLOW_DATA_LOSS and that “fixed” everything up. The allow-data-loss option deletes the bad rows. This isn’t too horrible, since we have all of those rows minus the text fields from our earlier export. Now I could LEFT JOIN to the exported data to find the missing rows and INSERT them minus the TEXT column data.
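The repair-and-reinsert step looks roughly like this; again a sketch with hypothetical names, run against the copy, never production. Repair requires single-user mode:

```sql
-- Run the repair on the copy; this DELETES the unrecoverable rows.
ALTER DATABASE BatmanCopy SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'BatmanCopy', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE BatmanCopy SET MULTI_USER;

-- Put back the rows the repair deleted, minus the TEXT column,
-- using the earlier export. The LEFT JOIN finds rows that exist in
-- the export but no longer exist in the repaired table.
INSERT INTO BatmanCopy.dbo.Document (DocumentId, Title, CreatedDate)
SELECT e.DocumentId, e.Title, e.CreatedDate
FROM   Rescue.dbo.Document AS e
LEFT JOIN BatmanCopy.dbo.Document AS d
       ON d.DocumentId = e.DocumentId
WHERE  d.DocumentId IS NULL;
```

Comparing row counts before and after each step is what let us say with confidence later that nothing went missing.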
We had the restored data from the good 3AM backup that we could now JOIN to and, with fingers crossed, recover the missing TEXT column information. We got lucky in that all of the affected rows were old and in the end we didn’t lose anything. :O All of the row counts along the way worked out and it looks like we dodged a major bullet here.
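Back-filling the TEXT values from the restored 3 AM copy is one UPDATE per table; a sketch under the same hypothetical naming assumptions:

```sql
-- Pull the TEXT column back from the 3 AM restore for the rows that
-- were reinserted without it (database/column names hypothetical).
UPDATE d
SET    d.Body = r.Body
FROM   BatmanCopy.dbo.Document AS d
JOIN   Batman_3AM.dbo.Document AS r
       ON r.DocumentId = d.DocumentId
WHERE  d.Body IS NULL;
```

This only works because the affected rows were all old enough to exist, unchanged, in the day-old backup; any row modified after 3 AM would have come back stale.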
We’ve heard back from EMC, and it turns out the SAN firmware they were running here is buggy. This thing is only a couple of months old. Grrr… They dispatched a technician that night to come and update it. That explains why RAID didn’t save us.
All in all, this could have been a lot worse. Given the root cause, they basically won the lottery in not losing anything.
Here are a few links to some helpful posts on the SQL Server Engine blog. I love the title of the first one:
Which part of 'REPAIR_ALLOW_DATA_LOSS' isn't clear?
CHECKDB (Part 8): Can repair fix everything?
(in fact, read the whole series)
Ta da! Emergency mode repair
(we didn’t have to resort to this one thank goodness)
Just because I can…
 Names have been changed to protect the guilty.
 I'm Batman.
 And if I'm the coolest head in the room, you've got even bigger problems...