I want a simple way of restoring service in the event of a disaster
Foreword
I wrote this over 5 years ago and wanted to see if it stood the test of time. I also see too often that organisations don’t have the right capabilities to recover, so I figured this is a good post on the subject.
Back to the Future
Having a single, simple method for providing data protection would be ideal: one big red button to rule them all that fails over to a standby secondary datacentre, which you pay for on a consumption-only basis. Sounds too good to be true, right? Well, that’s because (today at least) it is!
We often need to leverage a number of different solutions to support the business requirements, maintain supportability with vendors and ensure efficiency.
High availability, disaster recovery, backup, and business continuity
I’ve often seen these terms used synonymously. This is no surprise given the number of phrases we bandy around the IT industry, but I think it’s important to understand the differences and agree on the terms of reference.
Business Continuity encompasses the following:
- Resilience (High availability)
- Recoverability (Backup)
- Contingency (Disaster Recovery)
Splitting these out we can see the following attributes of each component:
- High availability
  - The ability of a service to sustain failure of one or more components and continue to function in its ‘production’ state. This can be local to a geography or can span cities, regions or countries.
- Recoverability
  - The ability to restore a service and/or its data in the event of failure, incorrect change or data corruption.
- Disaster Recovery
  - The ability to restore a service in the event of a major incident or disaster.
Requirements and Objectives
Before we begin thinking about solutions, it’s important to understand our services and what capability we need from a business perspective across each of the three domains.
To demonstrate this, I’ve put together a simplified view. In reality, we may need to analyse a service at a far more detailed and granular level, depending upon the size, scale and complexity of the business we are looking at:
Service Name | Supporting or LOB | Impact to business if unavailable | Dependent Services | Production Service Uptime | Service Availability Requirement | Backup and Recoverability | Disaster Recovery |
Identity & Access Management | Supporting | Severe | All | 99.999% | Local & regional | Must be protected against data loss & corruption. Must be able to conduct granular file level restores. | Required to operate an active/active model across regions and be recoverable in the event of loss of region |
File services | Supporting | Low | None | 98% | None | Must be protected against data loss & corruption. Must be able to conduct granular file level restores. | Must be able to restore the service to the secondary region in the event of a disaster with minimal administrative overhead |
CRM | LOB | Severe | Billing | 99.999% | Local | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a service linked to generating revenue. |
Billing | LOB | Severe | None | 99.999% | Local & regional | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a revenue generating service. |
E-Commerce Web Services | LOB | Severe | Billing | 99.999% | Local & regional | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a revenue generating service. |
Web Services | LOB | Medium | None | 99.9999% | Local, regional and country | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Required to operate across multiple geographies |
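To make the uptime column a little more tangible, here is a minimal sketch that converts each target from the table above into a yearly downtime budget. The function name `allowed_downtime_minutes` and the 365-day year are my own assumptions for illustration; how downtime is actually measured (planned maintenance included or not, per month or per year) is something to agree with the business.

```python
# Illustrative only: converts an uptime percentage into a yearly downtime budget.
# Assumes a 365-day year and that all downtime counts against the target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Uptime targets copied from the requirements table.
targets = {
    "Identity & Access Management": 99.999,
    "File services": 98.0,
    "CRM": 99.999,
    "Billing": 99.999,
    "E-Commerce Web Services": 99.999,
    "Web Services": 99.9999,
}

for service, uptime in targets.items():
    print(f"{service}: {allowed_downtime_minutes(uptime):.2f} minutes/year")
```

Under these assumptions, 99.999% leaves roughly 5.26 minutes of downtime per year, 98% leaves about 7.3 days, and 99.9999% leaves around 32 seconds, which is a useful sanity check on whether a target is realistic for the service in question.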
Current State Architecture
Now that we have an understanding of our service requirements we can start to look at our current state architecture:
Current State |
Service Name | Infrastructure Architecture Implementation Supports HA? | Application Architecture Supports HA? | Application Architecture Supports replication and recovery across geographies? |
Identity & Access Management | Yes | Yes | Yes |
File services | No | No | No |
CRM | No | No | No |
Billing | No | No | No |
E-Commerce Web Services | No | No | No |
Web Services | Yes | Yes | No |
Current and Future State comparison (GAP)
Now that we understand our requirements and current state architecture, we can complete a gap analysis to understand where architectural change is required. The table below provides a high-level gap analysis.
State | Service availability requirement met? | Backup and Recoverability requirement met? | Disaster Recovery requirement met? |
Current State | | | |
Future State | | | |
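As a rough illustration of how the availability side of the gap analysis could be derived, the sketch below compares the availability requirement of each service with the three current state columns. The `Service` class, its field names and the mapping rules (any availability requirement needs both infrastructure and application HA; any requirement beyond "Local" needs replication across geographies) are my own simplifying assumptions, not a formal method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One row of the requirements and current state tables (hypothetical model)."""
    name: str
    needs_local_ha: bool        # availability requirement is anything other than "None"
    needs_cross_region: bool    # availability requirement goes beyond "Local"
    infra_ha: bool              # infrastructure architecture supports HA?
    app_ha: bool                # application architecture supports HA?
    app_geo_replication: bool   # application supports replication across geographies?


def availability_gaps(svc: Service) -> list[str]:
    """List the availability capabilities that are required but not yet in place."""
    gaps = []
    if svc.needs_local_ha and not (svc.infra_ha and svc.app_ha):
        gaps.append("high availability")
    if svc.needs_cross_region and not svc.app_geo_replication:
        gaps.append("cross-geography replication")
    return gaps


# Values transcribed from the requirements and current state tables.
services = [
    Service("Identity & Access Management", True, True, True, True, True),
    Service("File services", False, False, False, False, False),
    Service("CRM", True, False, False, False, False),
    Service("Billing", True, True, False, False, False),
    Service("E-Commerce Web Services", True, True, False, False, False),
    Service("Web Services", True, True, True, True, False),
]

report = {svc.name: availability_gaps(svc) for svc in services}

for name, gaps in report.items():
    print(f"{name}: {', '.join(gaps) if gaps else 'no availability gap'}")
```

This only covers the availability column. The backup and disaster recovery columns would need their own inputs (for example, whether application level restores are possible today), which the current state table does not capture.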
Solution Capability Mapping
The following table can be used to assess solution capability. For this example, we have looked at the identity management service:
Capability area | Capability | Native Application | 3rd Party Solution/s |
Availability | Application native high availability (site) | | |
Availability | Application native high availability (region) | | |
Backup & Recoverability | Crash consistent backup and restore | | |
Backup & Recoverability | Application level restore granularity | | |
Backup & Recoverability | File level granular restore | | |
Disaster Recovery | Active/Active Deployment | | |
Disaster Recovery | Active/Passive Deployment | | |
Disaster Recovery | Restore from backup (standby service) | | |
Disaster Recovery | (Active/Passive) Replication | | |
Disaster Recovery | Active/Active Replication | | |
Solution Options
Once we understand the current state implementation and the potential capability we can provide, we can review the potential options per service. I’m only going to break down two services from the above list: identity and access management, and CRM:
Service Name | Availability | Recoverability | DR Options |
Identity & Access Management | Directory services supports multiple nodes to provide an architecture that supports a distributed model providing availability of service at the local & regional level. | The operating system features out of the box backup and recovery functionality which supports the operating system and application-level recoverability. If granular application recoverability is required there is a recycle bin feature however a 3rd party product would be required to support granular restore. | Utilise multi-site active/active models utilising out of the box features. In the event of loss or corruption of data a restore can be invoked utilising the recoverability solution. |
CRM | The CRM application vendor does not support a highly available topology. | The operating system features out of the box backup and recovery functionality which supports the operating system and application-level recoverability. Granular recoverability is not supported by the application vendor. | To provide disaster recovery capability we have the following options:
level data replication
replication
Replication using crash consistent techniques |
Here we can establish how we can meet our requirements on a per service basis. The realisation here is that it is very rare to have one solution that will meet all our requirements.
Knowing this, we now want to achieve a standardised and, if possible, rationalised set of capabilities to keep our architecture as simple as possible, while catering for the reality that multiple solutions will be needed. To accomplish this at a broad level, I’ve suggested the following capability principles:
Availability | Recoverability | Disaster Recovery |
Utilise application availability architectures to provide high availability. E.g., multiple nodes at each application layer such as multiple Exchange CAS/Mailbox roles | Standardise on a guest level backup solution that supports application-level backup for core applications, supports virtualisation solutions and can leverage granular snapshot management. Recoverability should utilise disk-based storage for rapid recovery. Periodic off site data replication/shipping should be utilised to provide recoverability in the event of the loss of the primary site. | For applications that support active/active scenarios leverage those. Examples of these are: availability groups |
Utilise an application load balancing solution that can be leveraged across multiple applications e.g., hardware load balancer | | For mission critical services utilise replication services which support active/passive near real time replication. Examples of this include: |
Utilise virtualisation technologies to fill gaps when application architecture does not provide high availability. E.g., Fault Tolerance for single virtual machines | | Link to the recoverability solution to enable restore from backup. This provides efficient recoverability for non-critical services and enables restore of services in the event of data corruption. |
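The disaster recovery principles above read as a simple order of preference, which can be sketched as a small decision function. The function name `choose_dr_approach` and its two boolean inputs are hypothetical, and the classification of CRM as mission critical and file services as non-critical is my reading of the impact column rather than something the tables state outright.

```python
def choose_dr_approach(supports_active_active: bool, mission_critical: bool) -> str:
    """Pick a DR approach following the order of preference in the principles.

    1. Use active/active where the application supports it.
    2. Otherwise, for mission critical services, use active/passive
       near real time replication.
    3. Otherwise, restore from backup via the recoverability solution.
    """
    if supports_active_active:
        return "active/active"
    if mission_critical:
        return "active/passive replication"
    return "restore from backup"


# Illustrative classifications only; a real assessment would be per service.
examples = {
    "Identity & Access Management": choose_dr_approach(True, True),
    "CRM": choose_dr_approach(False, True),
    "File services": choose_dr_approach(False, False),
}

for service, approach in examples.items():
    print(f"{service}: {approach}")
```

Keeping the decision this small is deliberate: the aim of the principles is a short, standard list of approaches, so any service that does not fit one of the three branches is a prompt for a design conversation rather than a fourth branch.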
Summary
Providing capability to support business continuity is technically achievable utilising a combination of native and 3rd party solutions. It’s key to understand our business requirements, define standardised solutions to cater for the requirements, then establish an appropriate architecture and solution capability on a per service basis. As with most things, the solutions should be appropriate to meet requirements from a people, process, technology, and financial perspective.