The week from HELL
First up…
I normally get up about 7-7:30am. On Tuesday I woke up at 4am and could not get back to sleep. Got up and planned my day, read online news sites, etc… At 7:15am my phone rings. It’s an employee at my largest client calling. They can’t log in to the server. I tried to remote in. Couldn’t. Walked her through checking the server. Appeared to be down – black screen w/ blinking cursor. Had her reboot it. Same thing.
Got down there about 8am Tuesday morning. The RAID had crashed (RAID 10 with Drives 0 and 1 reported missing and Drives 2 and 3 still members of the now-failed array). Discussed options and a plan of attack.
Left the site to get new drives for the RAID and to bring a shop system in to assist with data recovery.
The server had crashed either before or after the planned nightly tape backup.
The disk image backup (to enable a bare metal recovery) that runs on Sunday had failed too (maybe because something was already going bad with the RAID). The last good backup was Friday night’s tape.
Pulled the RAID drives and replaced them. Used RAID reconstruction software to try to recover the data from two of the original drives: imaging one of them (a 4 hr process), then mounting that image alongside one of the other drives in the software to analyze for data (another 4 hr process), then pulling the discovered data (a 2.5 hr process).
Most of that data was damaged.
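Why most of it was damaged makes sense if drives 0 and 1 happened to be the same mirror pair: in a 4-drive RAID 10 the data is striped across two mirrored pairs, so losing both members of one pair takes out roughly every other stripe chunk, and anything larger than a single chunk spans the hole. A minimal Python sketch of that layout (the 64KB chunk size, the round-robin pair ordering, and the drive pairing are all assumptions, since controllers lay this out differently):

```python
# Sketch of a 4-drive RAID 10: two mirrored pairs, data striped across them.
# Assumes 64KB chunks and round-robin striping; real controllers vary.
CHUNK = 64 * 1024   # assumed stripe chunk size in bytes
PAIRS = 2           # 4 drives -> 2 mirrored pairs

def chunk_location(offset):
    """Map a logical byte offset to (mirror_pair, offset_within_pair)."""
    chunk_index = offset // CHUNK
    pair = chunk_index % PAIRS           # round-robin striping across pairs
    pair_chunk = chunk_index // PAIRS    # which chunk on that pair
    return pair, pair_chunk * CHUNK + offset % CHUNK

def file_is_recoverable(start, length, surviving_pairs={1}):
    """A file survives only if every chunk it touches lives on a surviving pair."""
    offset = start
    while offset < start + length:
        pair, _ = chunk_location(offset)
        if pair not in surviving_pairs:
            return False
        offset = (offset // CHUNK + 1) * CHUNK   # jump to the next chunk boundary
    return True

# With pair 0 (drives 0 and 1) gone, only small files that happen to sit
# entirely on pair 1 come back intact; anything spanning a chunk boundary is damaged.
print(file_is_recoverable(0, 4 * 1024))        # small file on pair 0 -> False
print(file_is_recoverable(CHUNK, 4 * 1024))    # small file on pair 1 -> True
print(file_is_recoverable(0, 10 * CHUNK))      # larger file -> False
```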
During this time I loaded the server OS and apps (backup software to recover Friday’s tape, antivirus, etc…)
Restored Friday’s data backup tape (86GB) to an alternate location on the server. Used the backup software to recover the Exchange message store (73GB).
Had to rejoin the 30+ client workstations to the domain and reconnect their shares and printers.
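Doing that by hand on 30+ machines is tedious; if it ever has to happen again, something like the sketch below could batch the domain rejoin with the Windows netdom tool. The domain name, admin account, and workstations.txt file are placeholders I’m assuming here, and each machine still needs a reboot after joining.

```python
# Rough sketch: batch-rejoin workstations to a domain with "netdom join".
# CONTOSO, the admin account, and workstations.txt are hypothetical placeholders;
# netdom must be installed and the account needs rights to join machines.
import subprocess

DOMAIN = "CONTOSO"            # placeholder domain name
ADMIN = "CONTOSO\\admin"      # placeholder domain admin account

with open("workstations.txt") as f:      # one workstation name per line
    machines = [line.strip() for line in f if line.strip()]

for machine in machines:
    cmd = [
        "netdom", "join", machine,
        f"/Domain:{DOMAIN}",
        f"/UserD:{ADMIN}",
        "/PasswordD:*",       # prompt for the password rather than hard-coding it
        "/REBoot:30",         # reboot the workstation 30 seconds after joining
    ]
    result = subprocess.run(cmd)   # leave output on the console so the prompt shows
    print(f"{machine}: {'rejoined' if result.returncode == 0 else 'FAILED'}")
```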
Left the site at 6pm Wednesday evening. 34 hours on site.
Then, while still dealing with residual issues (Outlook address book replication, slow printers, etc…), this hit our hosting server on Friday:
 We experienced a profound system failure this morning.
At around 1am on the 13th we installed some pending Windows updates on the server, and upon restarting it to activate the updates the server’s operating system failed.
After spending considerable time trying to resurrect the OS and bring the server back online, we were forced to replace the drive, preserving the data on the original in case a system restore from backup proved ineffective.
Whatever was causing the issue (corruption or a compromise) had been captured in the backup, because after bringing the system online from the restore we saw the same problem.
We ended up having to rebuild the server from scratch, then copy the data over from the saved original drive and modify the software configuration to point at the saved data.
The server started accepting mail again around 7pm. The web server was brought back online around 8pm, with most of the additional services back in operation around 9pm.
Within about 45 minutes of the mail server coming online we could see almost 800 inbound emails sitting in the queue waiting to be processed.