Imagine a scenario where you have an offsite SAN running a hot-standby legacy Exchange 2003 server and a new Exchange 2010 DAG server, both of them combined Exchange and domain controller boxes. Close your eyes and now imagine a total crash of the SAN because the firmware on the disk controllers was not current, losing the whole SAN in one go! Frightful, isn’t it?
Now close your eyes even tighter and imagine saying #@|@{{ did we back up those systems yet? ……. NO
Well, this is exactly the scenario I have been working on! That’s right, I call this IT’s worst nightmare. Well, maybe almost the worst nightmare, as it is a hot standby and no production is affected by the fallout.
So, what do you do when you run into this type of scenario? DON’T PANIC, LIFE GOES ON :-)
After ascertaining the depth of the SAN loss (there were also some other machines lost), we decided to open a PS call with Microsoft. Even though we have extensive knowledge in house, the depth of the loss was deemed big enough to justify some external support.
Together with MS support we created a high-level plan of action to make sure our recovery steps were correct.
The end result was the following steps:
1. Steps on the Windows platform:
a. Using ntdsutil, do a metadata cleanup of Active Directory for both servers' domain controller function.
b. Rebuild both servers at OS level, bringing them back to member server level.
c. Re-promote both reinstalled servers back to their AD role.
d. Confirm full AD restoration and replication.
2. Steps on the Exchange platform:
a. Reinstall Exchange 2010 using the setup /m:recoverserver switch.
b. Reinstall Exchange 2003 using the /disasterrecovery switch.
c. Confirm Exchange is back in operation.
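For reference, the Windows-side steps (1.a and 1.d) might look roughly like the sketch below when run from an elevated PowerShell prompt on a surviving domain controller. The server name, site name and domain in the distinguished name are placeholders, not our real values, and the one-line ntdsutil form assumes Windows Server 2003 SP1 or later. Check the syntax against your own environment before running anything.

```powershell
# Hypothetical DN of the lost domain controller's server object (placeholder values).
$deadDc = "CN=DC-OFFSITE1,CN=Servers,CN=Offsite,CN=Sites,CN=Configuration,DC=example,DC=com"

# Step 1.a: remove the lost DC's metadata from Active Directory.
# ntdsutil asks for confirmation before it deletes anything.
ntdsutil "metadata cleanup" "remove selected server $deadDc" quit quit

# Step 1.d: after the rebuild and re-promotion, check that replication is healthy.
repadmin /replsummary          # per-DC summary of replication failures
repadmin /showrepl             # inbound replication partners and last results
dcdiag /test:replications      # targeted replication health test
```

The cleanup has to be repeated for each lost domain controller, in our case once per server.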
To give you a heads up, I’m on step 2.a at the moment and have run into some issues when running setup /m:recoverserver.
1) The first issue was the media. We had originally installed from an Exchange 2010 RTM disk and upgraded to SP1, but you need full SP1 media when using the recovery switch. So we needed to get this to the offsite location.
2) There is a great TechNet article on recovering a DAG member, http://technet.microsoft.com/en-us/library/dd638206.aspx. The only issue is that one of the commands did not work for us out of the box. The system kept telling us it could not run: RPC server unavailable…
After speaking to PS, we learned we needed to add an extra switch to the command:
original command: Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1
updated command: Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1 -ConfigurationOnly
After doing this, the recovery of the Exchange systems started to run, but we now have an issue halfway through the Hub Transport role install, which is still being handled by PS. I’ll keep you all posted on how we get along. This is a worst-nightmare scenario, but on the other hand a great experience and challenge.
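Pulling the DAG-related pieces together, the recovery sequence for a lost DAG member might look like the sketch below, loosely following the TechNet article linked above. DAG1 and MBX1 are the example names used in the commands earlier; DB1 is a hypothetical database name. Treat it as an outline under those assumptions rather than a tested runbook.

```powershell
# Run from the Exchange Management Shell on a surviving Exchange 2010 server.
# DB1 is a hypothetical database name; DAG1 and MBX1 are example names.

# Note any replay/truncation lag settings so they can be reapplied later.
Get-MailboxDatabase DB1 | Format-List *lag*

# Remove the database copy that lived on the lost server.
Remove-MailboxDatabaseCopy -Identity DB1\MBX1

# Remove the lost server from the DAG configuration. -ConfigurationOnly is used
# because the server is unreachable (hence the "RPC server unavailable" error).
Remove-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1 -ConfigurationOnly

# Next, outside this shell: reset the MBX1 computer account in Active Directory,
# rebuild the OS under the same name, then run on the rebuilt server from full SP1 media:
#   Setup.com /m:RecoverServer

# Once setup completes, rejoin the DAG and re-create the database copy.
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX1
Add-MailboxDatabaseCopy -Identity DB1 -MailboxServer MBX1
```

One caveat, as far as the documentation goes: with -ConfigurationOnly the server is only removed from the DAG object in Active Directory, so a stale node may also have to be evicted from the failover cluster by hand.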