Planning and rolling out a high availability and disaster recovery plan that meets all service level agreements is a non-trivial undertaking and requires a very clear understanding of SQL Server's native strengths and weaknesses. When matching up requirements against a combination of features, some of the critical details may be glossed over, and in this post I'll walk through a few common distortions and bad assumptions that can creep into a solution – ultimately causing us to miss the mark on our recovery point and recovery time objectives. Some of the examples of distortions or self-delusions I detail here can be generalized across different features and some are feature-specific.
"We tested our disaster recovery plan when we first launched our project and we know it will work"
I've worked with clients who indeed got their disaster recovery approach "right" – one time. But once everyone felt confident in the efficacy of the solution, no other disaster recovery exercises were performed again. Of course – in the meantime the data tier and application keeps changing over time. Those changes introduce new objects and processes that are critical to the application. And even if everything remains static after launch, you still have to account for personnel turnover and varying skill sets. Can today's staff successfully perform a disaster recovery exercise? And even the best staff needs ongoing practice.
"We will have no data loss because we are using synchronous database mirroring"
Let's say you're using synchronous database mirroring between two SQL Server instances, with each instance in a separate data center. In this example, assume that the transaction latency is acceptable in spite of this being a synchronous database mirroring session with a few miles between the data centers. You don't have a witness in the mix because you want to control database mirror failover manually.
Now let's say your disaster recovery data center goes away – but your primary data center is still available. Your principal database is disconnected from the mirror database, but it is still accepting connections and data modifications. What about the "no data loss" requirement now? If transactions ran against the disconnected principal for another hour, what is your plan if the primary data center is also lost?
"Our business owner says we can lose up to 12 hours of data"
It is important to ask some questions more than once and to more than one individual in an organization. If someone tells you that "12 hours of data loss is acceptable" – ask them again, or ask them what the consequence of that data loss would be. Ask other people as well. They might give you a much stricter requirement. I've found that recovery point objectives require some negotiation and more than a few thoughtful, deliberate discussions.
"We're using [database mirroring or availability groups] and so we're covered for what we need in the event of a disaster"
Database mirroring and availability groups can certainly be used to protect you at the database level, but what about everything else? What about your logins? SSIS packages? Jobs? Non-FULL recovery model databases? Linked servers?
"We're using SQL Server feature XYZ, so we won't lose any in-flight transactions"
Nope, sorry. While some features allow for transparent redirection, this isn't the same as retaining and persisting open transactions at the time of failover. No SQL Server feature provides this functionality today.
"The team supporting the data tier for this application hates SQL Server feature XYZ, but we're moving forward with it because that's what was recommended to us by an outside consultant"
While it would be nice if people didn't develop specific biases around features in SQL Server, this often isn't the case. If you try to force solutions on a staff that isn't on board with them, you're running the risk of pre-determined failure. As part of the HA/DR exercises I've helped out with in the past, I'm always interested to hear about people's past experiences with specific features. For example, some companies leverage hundreds of Failover Cluster Instances very well – while others avoid them because of bad experiences from previous versions. When planning a solution, you can't ignore history or the predispositions of the staff who will ultimately be responsible for deploying and supporting the recommended solution.
"As the DBA, I decide which HA/DR technology to use for the data tier, so we're going to use availability groups moving forward"
A database administrator could potentially set up database mirroring with little or no involvement with other teams. Now with availability groups, even if a database administrator could do it all on their own, they might be unwise to do so. After all – if you're deploying availability groups for disaster recovery purposes, you'll want everyone involved in a disaster recovery operation to be aware of your solution and the requirements needed to successfully get back online and recover data. With availability groups you'll need to think about the Windows Server Failover Cluster, quorum configurations, node votes, the availability group listener and more. If you'll need other people to facilitate a solution, be sure they are involved with the initial recommendation.
"We're using availability groups so we can have read-only availability in the event of an outage of our read-write replica"
This is just one example of a "what if" scenario. With any implementation of functionality, you'll want to imagine the various ways that failure can occur – and then be sure to test it to ensure your requirements are still being met. For example – if you think that your SQL Server 2012 availability group asynchronous read-only replicas will be available when the read-write replica is unavailable, you'll be unpleasantly surprised to see the
Unable to access database 'XYZ' because its replica role is RESOLVING which does not allow connections message in production.
"We tested SQL Server feature XYZ, and failover was fast, so we have established that we can meet our recovery time objectives easily"
Let's say you decide to use database mirroring for user-database level high availability. You want quick failover (measured in seconds), and you indeed see quick failover during testing. But was it a realistic test? Were you pushing a significant workload? In the example of database mirroring, the longest part of your failover operation can be for the redo operations. If you aren't driving realistic workloads, then you cannot truly say that your failover will actually be "fast".
"If we have a disaster and need to salvage and reconcile data, we'll figure it out when the time comes"
This is a tough one. Let's say you have a disaster and need to make your secondary data center operational. You decide to force service for the most critical databases in the secondary data center and you now have a split in the lineage of data modifications (some unreconciled rows in the primary data center, and now new modifications in the secondary data center). Eventually your primary data center is brought online – but now you have data that needs to be salvaged and reconciled before you can re-establish the overall HA/DR solution. What do you do? What can you do? This discussion is rarely an easy one to have and may depend on several factors such as the software packages you've purchased, the complexity of the data tier, and the data convergence tools at your disposal. In fact, the discussion is usually so difficult that people don't have it at all. But if the data is critical enough for you to have set up a secondary data center, than a key part of the discussion should include how best to salvage data and also reconcile it after a disaster has occurred.
This article included just a few examples of how we can delude ourselves into thinking that a solution is fully meeting their requirements. While it is human nature to do this – when it comes to data loss and business continuity, our jobs will depend on us being more aggressive in testing our own beliefs and assumptions.