Friday, 14 August 2009

Admin

Disaster Day
The postmortem remains to be done and will require the assistance of IBM and EMC (VMware), but we’ve had a fun day of it. It started on Thursday (while I was out) during work with IBM on an issue that had caused the replication connection between two SAN systems to break. IBM determined that there was an Efix specifically for this condition. During the day it was tested without incident on the target SAN, which has a VMware cluster connected to it. Therefore, somewhere between 9 and 10 PM Thursday evening, the first part of the Efix was applied to the source SAN. Well, the production VMware cluster got really pissed and several guest systems lost access to their disks. I am not sure, but I suspect that all guests saw some interruption to their activities. Not only did these systems lose their disks, but VMware appeared to have lost access to the files that represented their disks. I’m not sure what was tried in the next 3.5 hours, but I was paged at 1:30 AM Friday morning to recover our primary LDAP server from backups (essentially a bare metal restoration). We spent a couple of hours on it, during which we discovered that our support contract with EMC/VMware had expired: despite guarantees from the sales people that we’d be sent an automatic renewal notice, we had no support contract. It took time to convince the support structure that we were not bums and would be contacting sales to negotiate a renewal, and finally VMware was working to help resolve the issue.
At about 2:30 I started the bare metal recovery of the LDAP server (create a new VMware guest and recover all the files from the NetWorker backup system) and I was amazed that it completed in less than two hours. It wasn’t perfect, because the backup had been taken before the previous day’s batch update and so was missing some information that was needed. I retrieved the LDAP database from one of the replica servers and got the LDAP server back up and running. I didn’t realize that I was missing the previous day’s batch run data until after I’d manually kicked off the day’s batch run, which produced lots of errors because it was comparing today’s authoritative feed data to that from two days ago instead of yesterday’s. To correct this, I’ve updated the nightly batch script so that it issues a NetWorker save of the batch update data directory when the nightly batch process is complete. So, as long as the batch process runs to completion, the backup system will have the information.
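A minimal sketch of that "save only after the batch completes" idea, written in Python for illustration only (the real nightly script is not shown here and may look nothing like this). The function name, the batch command, the data directory, and the backup server name are all hypothetical; the `save -s <server> <path>` invocation is an assumption about how the NetWorker client is called in this environment.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes an argument vector and returns the exit status.
Runner = Callable[[Sequence[str]], int]


def run_command(argv: Sequence[str]) -> int:
    """Run a command and return its exit status."""
    return subprocess.run(list(argv)).returncode


def nightly_batch_then_save(
    batch_cmd: Sequence[str],
    data_dir: str,
    backup_server: str,
    run: Runner = run_command,
) -> bool:
    """Run the batch; only if it exits 0, back up the batch data directory.

    Returns True when both the batch and the save succeeded.
    """
    if run(batch_cmd) != 0:
        # Batch did not run to completion: no save is issued,
        # so the backup never captures a half-written data directory.
        return False
    # Assumed NetWorker client invocation (hypothetical server and path).
    save_cmd = ["save", "-s", backup_server, data_dir]
    return run(save_cmd) == 0
```

The runner is passed in as a parameter so the ordering can be exercised without a NetWorker client installed. The caveat is the one already noted: if the batch fails partway through, nothing is saved that night.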
VMware did come through eventually, with not one but two filesystem dd patches to correct the corruption. The files that had been unavailable all eventually became visible again and their guests could be started. Which was good.
Sadly, the recovery was not free of problems. We discovered that the ASR recovery process on the Windows 2003 print server tripped over a bug in NetWorker. An issue is open with NetWorker about that. It was opened as a Sev 1, but has been lowered because VMware was able to get the VMFS corruption fixed and we were able to get the print server running again before the 1 hour callback time on the Sev 1 had expired. We will build a test Windows system, do an ASR recovery on it, and work through this problem with EMC/NetWorker. In short, the problem is that in our environment we send full saves to one set of disks and incrementals to another set of disks. Those sets of disks are mounted on different backup servers (storage nodes). NetWorker’s ASR recovery process reads the information from the full save and then demands that the incremental disk be removed from the storage node it is attached to and moved to the same storage node the full save disk is attached to. That’s just not physically possible with direct-attached disk.
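To make the layout problem concrete, here is a small Python sketch that flags clients whose save sets live on more than one storage node, which is the situation the ASR recovery could not handle. It is not NetWorker code and does not query NetWorker: the `SaveSet` record, the client names, and the storage node names are all made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class SaveSet(NamedTuple):
    """One backup of one client (hypothetical record, not a NetWorker type)."""
    client: str
    level: str         # "full" or "incr"
    storage_node: str  # backup server whose direct-attached disk holds it


def clients_split_across_nodes(savesets: Iterable[SaveSet]) -> Dict[str, Set[str]]:
    """Return clients whose save sets sit on more than one storage node."""
    nodes_by_client: Dict[str, Set[str]] = defaultdict(set)
    for s in savesets:
        nodes_by_client[s.client].add(s.storage_node)
    # A recovery that needs the full and the incrementals on the same
    # node cannot proceed for any client listed here.
    return {c: n for c, n in nodes_by_client.items() if len(n) > 1}


# Made-up inventory mirroring the layout described above:
# fulls on one node, incrementals on another.
inventory = [
    SaveSet("printsrv", "full", "node-a"),
    SaveSet("printsrv", "incr", "node-b"),
    SaveSet("testbox", "full", "node-a"),
]
split = clients_split_across_nodes(inventory)
```

With this made-up inventory, `split` contains only `printsrv`, mapped to both nodes. In our environment every client with at least one incremental would show up, since fulls and incrementals always go to different storage nodes.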
The final notice that all was recovered was sent out by Geoff at 3:50pm Friday afternoon.
