One of the benefits of Active Directory is built-in redundancy. When you lose a single domain controller, the impact can be insignificant. With many services, such as DHCP, the architecture dictates a dependency on a specific server, and when that server becomes unavailable, clients are impacted. Over the years, failover or redundancy has been built into most of these services, including DHCP. Active Directory, by contrast, is built around redundancy. Clients are not dependent on a single DC; they can fail over to another DC seamlessly if a failure occurs.
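When a failure does occur, you should ask yourself several questions to assess the impact. Is the failed machine the only domain controller in its domain? Does it own a FSMO role? Is it the only Global Catalog server in its site? Are the remaining domain controllers running near capacity? Do any applications depend on that specific domain controller?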
Losing the only domain controller in a domain is the worst-case scenario. The redundancy in Active Directory applies only if you have more than one domain controller in a domain. If there is only one, you have a single point of failure. You could irrevocably lose the domain unless you can get that domain controller back online or restore it from backup.
The five FSMO roles outlined in Chapter 2 play an important part in Active Directory. FSMO roles are not redundant, so if a FSMO role owner becomes unavailable, you'll need to seize the FSMO role on another domain controller. Check out the FSMO recovery section later in this chapter for more information.
The Global Catalog is a function that any domain controller can perform if enabled. But if you have only one Global Catalog server in a site and it becomes unavailable, it can affect users' ability to log in. As long as clients can access a Global Catalog, even one that is not in the optimal location, they will be able to log in. If a site without a Global Catalog loses connectivity with the rest of the network, users in that site may be unable to log in. With Windows Server 2003, you can enable universal group caching on a per-site basis to limit this potential issue.
If your domain controllers are running near capacity and one fails, it could overwhelm the remaining servers. At this point, clients could start to experience login failures or extreme slowness when authenticating.
Early versions of Exchange 2000 did not handle domain controller failures well. In fact, once an Exchange 2000 server targeted a specific domain controller, you would have to manually force it to use another one if that domain controller became unavailable. During the outage period, mail delivery could be impacted along with client lookups. Exchange is just one example, but it illustrates that you need to understand how each Active Directory-enabled service handles a domain controller failure before you introduce it into your environment.
These questions can help you assess the urgency of restoring the domain controller. If you answered "no" to all of the questions, the domain controller can stay down for a short period without significant impact.
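The questions above can be captured as a simple checklist. The following is a minimal sketch, not a tool that ships with Active Directory; the FailedDC class, its field names, and the wording of the messages are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FailedDC:
    """Answers to the assessment questions (field names are hypothetical)."""
    only_dc_in_domain: bool
    holds_fsmo_role: bool
    only_gc_in_site: bool
    peers_near_capacity: bool
    apps_bound_to_it: bool


def assess_impact(dc: FailedDC) -> list[str]:
    """Return one message for every question answered "yes"."""
    checks = [
        (dc.only_dc_in_domain, "only DC in the domain: the domain itself is at risk"),
        (dc.holds_fsmo_role, "FSMO role owner: the role may need to be seized"),
        (dc.only_gc_in_site, "only Global Catalog in the site: logins may fail"),
        (dc.peers_near_capacity, "remaining DCs may be overwhelmed"),
        (dc.apps_bound_to_it, "dependent applications may need to be retargeted"),
    ]
    return [message for flagged, message in checks if flagged]


def can_wait(dc: FailedDC) -> bool:
    """True only when every question was answered "no"."""
    return not assess_impact(dc)
```

Calling can_wait on a record with every field set to False returns True, which corresponds to the case where the domain controller can stay down for a short period.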
When you've identified that you need to restore a domain controller, there are two options to choose from: restoring from replication or restoring from a backup.
One option for restoring a domain controller is to bring up a freshly installed or repaired machine and promote it into Active Directory. You would use this option if you had a single domain controller failure due to hardware and did not have a recent backup of the machine. This method allows you to replace the server in AD by promoting a newly installed machine and allowing replication to copy all of the data to the DC. Here are the steps to perform this type of restore:
Rebuild OS. Reinstall the operating system and any other applications you support on your domain controllers.
Remove DC from AD. The old remnants of the domain controller must be removed from Active Directory before you promote the freshly installed server. We describe the exact steps to do this shortly.
Promote server. After you've allowed time for the DC removal process to replicate throughout the forest, you can then promote the new server into AD.
Configure any necessary roles. If the failed server had any FSMO roles or was a GC, you can configure the new server to have these roles.
The biggest potential drawback with this method is the restore time. Depending on the size of your DIT file and how fast your network connections are between the new DC and the server it will replicate with, the restore time could be several hours or even days. If this is problematic for you, you'll want to look at the restore from backup option that we describe next.
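To get a feel for how DIT size and link speed combine, a back-of-the-envelope estimate can help. The sketch below treats replication as a plain copy of the whole DIT, which is a simplifying assumption: it ignores compression and replication overhead. The function name and the default efficiency value are placeholders, not measured figures.

```python
def estimate_replication_hours(dit_size_gb: float, link_mbps: float,
                               efficiency: float = 0.5) -> float:
    """Rough number of hours needed to move a DIT of this size over a link.

    efficiency is the assumed fraction of nominal bandwidth that
    replication actually gets; 0.5 is a guess, not a measurement.
    """
    if dit_size_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, rate > 0, efficiency in (0, 1]")
    size_megabits = dit_size_gb * 1024 * 8  # GB -> MB -> megabits
    seconds = size_megabits / (link_mbps * efficiency)
    return seconds / 3600


# A 10 GB DIT over a 1.5 Mbps link at 50% efficiency: roughly 30 hours.
print(round(estimate_replication_hours(10, 1.5), 1))
```

Under these assumptions, a large DIT behind a slow WAN link quickly reaches the "hours or even days" range, which is when the restore from backup option becomes attractive.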
One of the key steps with the restore from replication method is removing the objects that are associated with the domain controller before it gets added to AD again. This is a three-step process. The first step is to remove the associated metadata. That can be accomplished with the ntdsutil utility. The following example shows the commands necessary to remove the DC3 domain controller, which is in the RTP site, from the emea.mycorp.com domain.
C:\>ntdsutil
ntdsutil: metadata cleanup
metadata cleanup: connections
Next, we need to connect to an existing domain controller in the same domain as the domain controller we want to remove. In this case, we connect to DC2.
server connections: connect to server dc2
Binding to dc2 ...
Connected to dc2 using credentials of locally logged on user.
server connections: quit
metadata cleanup: select operation target
Now we need to select the domain that contains the domain controller. In this case, it is emea.mycorp.com.
select operation target: list domains
Found 2 domain(s)
0 - DC=mycorp,DC=com
1 - DC=emea,DC=mycorp,DC=com
select operation target: select domain 1
No current site
Domain - DC=emea,DC=mycorp,DC=com
No current server
No current Naming Context
Next we must select the site the domain controller is in. In this case, it is the RTP site.
select operation target: list sites
Found 4 site(s)
0 - CN=Default-First-Site-Name,CN=Sites,CN=Configuration,DC=mycorp,DC=com
1 - CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
2 - CN=SJC,CN=Sites,CN=Configuration,DC=mycorp,DC=com
3 - CN=NYC,CN=Sites,CN=Configuration,DC=mycorp,DC=com
select operation target: select site 1
Site - CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
Domain - DC=emea,DC=mycorp,DC=com
No current server
No current Naming Context
After listing the servers in the site, we must select the server we want to remove. In this case, it is DC3.
select operation target: list servers in site
Found 3 server(s)
0 - CN=DC1,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
1 - CN=DC2,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
2 - CN=DC3,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
select operation target: select server 2
Site - CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
Domain - DC=emea,DC=mycorp,DC=com
Server - CN=DC3,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
DSA object - CN=NTDS Settings,CN=DC3,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
Computer object - CN=DC3,OU=Domain Controllers,DC=emea,DC=mycorp,DC=com
No current Naming Context
select operation target: quit
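Because the number you pass to select server depends on the order in which ntdsutil lists the servers, it is easy to pick the wrong entry in a large site. The following sketch shows one way to look up the number from captured listing text. It assumes the listing has one entry per line in the "N - CN=..." form shown in the transcript above; the function name find_selection_index is hypothetical, and the code only parses text, it does not run ntdsutil.

```python
import re


def find_selection_index(listing: str, server_name: str) -> int:
    """Return the number printed next to the entry whose first CN is server_name.

    listing is assumed to be text captured from "list servers in site",
    with one "N - CN=...,CN=Servers,..." entry per line.
    """
    entry = re.compile(r"^\s*(\d+)\s*-\s*CN=([^,]+),", re.IGNORECASE)
    for line in listing.splitlines():
        match = entry.match(line)
        if match and match.group(2).lower() == server_name.lower():
            return int(match.group(1))
    raise LookupError(f"{server_name} not found in listing")


listing = """Found 3 server(s)
0 - CN=DC1,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
1 - CN=DC2,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com
2 - CN=DC3,CN=Servers,CN=RTP,CN=Sites,CN=Configuration,DC=mycorp,DC=com"""

# Prints the command to type at the "select operation target:" prompt.
print(f"select server {find_selection_index(listing, 'DC3')}")
```

The same pattern would work for the domain and site listings if the regular expression were adjusted to match the first component of those entries.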
The last step removes the metadata for the selected domain controller.
metadata cleanup: remove selected server
At this point, you should receive confirmation that the DC was removed successfully. If you receive an error that the object could not be found, it might have already been removed if you tried to demote the server with dcpromo.
You will then need to manually remove a couple more objects from Active Directory. Via the Active Directory Users and Computers tool, you should remove the computer object in the Domain Controllers OU for the DC. Finally, bring up the Active Directory Sites and Services tool and delete the server object for the DC, which is contained under the site the DC was located in.
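When you clean up several domain controllers, it can help to write down the distinguished names of the two leftover objects before opening the tools. The sketch below builds them from the DC name, site, and domain. It assumes the computer object is still in the default Domain Controllers OU and that none of the names contain characters that need escaping in a DN; the function names are illustrative.

```python
def dns_to_dn(dns_name: str) -> str:
    """Convert a DNS domain name such as emea.mycorp.com to DC=emea,DC=mycorp,DC=com."""
    return ",".join(f"DC={label}" for label in dns_name.split("."))


def leftover_object_dns(dc_name: str, site: str,
                        domain: str, forest_root: str) -> dict[str, str]:
    """Return the DNs of the computer and server objects for a removed DC.

    Assumes the default Domain Controllers OU and simple names that
    need no DN escaping.
    """
    return {
        "computer": f"CN={dc_name},OU=Domain Controllers,{dns_to_dn(domain)}",
        "server": (f"CN={dc_name},CN=Servers,CN={site},CN=Sites,"
                   f"CN=Configuration,{dns_to_dn(forest_root)}"),
    }


for kind, dn in leftover_object_dns("DC3", "RTP", "emea.mycorp.com", "mycorp.com").items():
    print(f"{kind}: {dn}")
```

For the DC3 example, the two values match the Computer object and Server lines that ntdsutil displayed when the server was selected.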
Another option to reestablish a failed domain controller is to restore the machine using a backup. This approach is cleaner than the restore from replication method we just described because you do not have to remove any objects from Active Directory. When you restore a DC from a backup, the latest changes will replicate to make it current. If time is of the essence, this will be the quicker approach, because only the latest changes since the last backup, instead of the whole directory tree, will be replicated over the network.
Here are the steps to restore from backup:
Rebuild OS. Reinstall the operating system and any other applications you support on your domain controllers. Leave the server as a standalone or member server.
Restore from backup. Use your backup package, e.g., NT Backup, to restore at least the System State onto the machine. In the next section, we will walk through the NT Backup utility to show how this is done.
Reboot server and allow replication to complete. If the failed server had any FSMO roles or was a GC, you can configure the new server to have these roles.
It is also possible to restore the backup of a machine onto a machine that has different hardware. Here are some issues to be aware of when doing so:
The number of drives and drive letters should be the same.
The disk drive controller and configuration should be the same.
The attached hardware, such as network cards, video adapters, and processors, should be the same. After the restore you can install the new cards, which should be recognized by Plug and Play.
The boot.ini from the failed machine will be restored, which may not be compatible with the new hardware, so you'll need to make any necessary changes.
If the HAL is different between machines, you can run into problems. For example, if the failed machine was a single-processor system and the new machine is multiprocessor, you will have a compatibility problem. The only workaround is to copy the Hal.dll, which is not included as part of System State, from the old machine and put it on the new machine. The obvious drawback is that the new multiprocessor machine will then act like a single-processor machine.
Since there are numerous things that can go wrong with restoring to different hardware, we highly suggest you test and document the process thoroughly. The last thing you want to do is troubleshoot hardware compatibility issues when you are trying to restore a crucial domain controller.
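As part of documenting the process, you might record a hardware profile for each domain controller and compare it against the candidate replacement before you need it. The following is a minimal sketch of such a comparison. The HardwareProfile fields mirror the issues listed above, but the class, the field names, and the sample values are hypothetical, and the profiles would have to be filled in by hand or by your own inventory tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    """Hand-recorded facts about a machine (all fields are illustrative)."""
    drive_letters: tuple[str, ...]
    disk_controller: str
    cards: tuple[str, ...]
    multiprocessor_hal: bool


def restore_risks(old: HardwareProfile, new: HardwareProfile) -> list[str]:
    """List the differences that could complicate restoring old onto new."""
    risks = []
    if sorted(old.drive_letters) != sorted(new.drive_letters):
        risks.append("number of drives or drive letters differ")
    if old.disk_controller != new.disk_controller:
        risks.append("disk controller differs")
    if sorted(old.cards) != sorted(new.cards):
        risks.append("attached cards differ")
    if old.multiprocessor_hal != new.multiprocessor_hal:
        risks.append("HAL differs (single processor vs. multiprocessor)")
    return risks


failed = HardwareProfile(("C", "D"), "ExampleRAID", ("nic-a", "video-a"), False)
spare = HardwareProfile(("C", "D"), "ExampleRAID", ("nic-b", "video-a"), True)
for risk in restore_risks(failed, spare):
    print(risk)
```

An empty result does not guarantee a clean restore; it only means that none of the known issues listed above were detected in the recorded profiles.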