I want a simple way of restoring service in the event of a disaster

Foreword

I wrote this over 5 years ago and wanted to see if it stood the test of time. I also see too often that organisations don’t have the right capabilities to recover, so I figured this is a good post on the subject.

Back to the Future

Having a simple, singular method for providing data protection would be ideal: one big red button to rule them all that fails over to a standby secondary datacentre, which you pay for on a consumption-only basis. Sounds too good to be true, right? Well, that’s because (today at least) it is!

We often need to leverage a number of different solutions to provide a mechanism to support the business requirements, maintain supportability with vendors and ensure efficiency.

High availability, disaster recovery, backup, and business continuity

I’ve often seen these terms used synonymously, which is no surprise given the number of phrases we bandy around the IT industry, but I think it’s important to understand the differences and agree on the terms of reference.

Business Continuity encompasses the following:

  • Resilience (High availability)
  • Recoverability (Backup)
  • Contingency (Disaster Recovery)

Splitting these out we can see the following attributes of each component:

  • High availability
    • High availability refers to the ability of a service to sustain failure of one or more components and continue to function in its ‘production’ state. This can be local to a geography or can span cities, regions or countries.
  • Recoverability
    • The ability to restore a service and/or its data in the event of failure, incorrect change or data corruption.
  • Disaster Recovery
    • The ability to restore a service in the event of a major incident or disaster.

Requirements and Objectives

Before we begin thinking about solutions, it’s important to understand our services and what capability we need from a business perspective across each of the 3 domains.

To demonstrate this, I’ve put together a simplified view. In reality we may need to analyse a service at a far more detailed and granular level, depending upon the size, scale and complexity of the business we are looking at:

| Service Name | Supporting or LOB | Impact to business if unavailable | Dependant Services | Production Service Uptime | Service availability Requirement | Backup and Recoverability | Disaster Recovery |
|---|---|---|---|---|---|---|---|
| Identity & Access Management | Supporting | Severe | All | 99.999% | Local & regional | Must be protected against data loss & corruption. Must be able to conduct granular file level restores. | Required to operate an active/active model across regions and be recoverable in the event of loss of region |
| File services | Supporting | Low | None | 98% | None | Must be protected against data loss & corruption. Must be able to conduct granular file level restores. | Must be able to restore the service to the secondary region in the event of a disaster with minimal administrative overhead |
| CRM | LOB | Severe | Billing | 99.999% | Local | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a service linked to generating revenue. |
| Billing | LOB | Severe | None | 99.999% | Local & regional | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a revenue generating service. |
| E-Commerce Web Services | LOB | Severe | Billing | 99.999% | Local & regional | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a revenue generating service. |
| Web Services | LOB | Medium | None | 99.9999% | Local, regional and country | Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores. | Required to operate across multiple geographies |
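To make the uptime figures more tangible, here is a minimal sketch that converts an uptime percentage into a yearly downtime budget. The function name and the 365-day year are my own assumptions for illustration; the percentages come from the requirements above.

```python
# Hypothetical helper: convert an uptime target into a yearly downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Return the minutes of downtime per year that an uptime target allows."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


# Uptime targets taken from the requirements table.
targets = {
    "Identity & Access Management": 99.999,
    "File services": 98.0,
    "Web Services": 99.9999,
}

for service, uptime in targets.items():
    minutes = downtime_minutes_per_year(uptime)
    print(f"{service}: {uptime}% allows about {minutes:.2f} minutes per year")
```

Under these assumptions, 99.999% leaves a little over five minutes of downtime a year, while 98% leaves roughly a week, which is why the two targets tend to call for very different architectures.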

Current State Architecture

Now that we have an understanding of our service requirements we can start to look at our current state architecture:

Current State

| Service Name | Infrastructure Architecture Implementation Supports HA? | Application Architecture Supports HA? | Application Architecture Supports replication and recovery across geographies? |
|---|---|---|---|
| Identity & Access Management | Yes | Yes | Yes |
| File services | No | No | No |
| CRM | No | No | No |
| Billing | No | No | No |
| E-Commerce Web Services | No | No | No |
| Web Services | Yes | Yes | No |

Current and Future State comparison (GAP)

Now that we understand our requirements and current state architecture, we can complete a gap analysis to understand where architectural change is required. The table below provides a high-level gap analysis.

| | Service availability requirement met? | Backup and Recoverability requirement met? | Disaster Recovery requirement met? |
|---|---|---|---|
| Current State | | | |
| Future State | | | |
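As a rough illustration of how the availability column of a gap analysis could be derived, the sketch below compares each service's required availability scope with its current state. The `CurrentState` class, the field names and the rule in `availability_met` are my own assumptions for illustration, not a definitive method; the data is taken from the requirements and current state tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CurrentState:
    infra_ha: bool  # infrastructure implementation supports HA
    app_ha: bool    # application architecture supports HA
    app_geo: bool   # application supports replication/recovery across geographies


# Availability scope required per service (from the requirements table).
required_scope = {
    "Identity & Access Management": {"local", "regional"},
    "File services": set(),
    "CRM": {"local"},
    "Billing": {"local", "regional"},
    "E-Commerce Web Services": {"local", "regional"},
    "Web Services": {"local", "regional", "country"},
}

# Current state per service (from the current state table).
current = {
    "Identity & Access Management": CurrentState(True, True, True),
    "File services": CurrentState(False, False, False),
    "CRM": CurrentState(False, False, False),
    "Billing": CurrentState(False, False, False),
    "E-Commerce Web Services": CurrentState(False, False, False),
    "Web Services": CurrentState(True, True, False),
}


def availability_met(scope: set[str], state: CurrentState) -> bool:
    """Assumed rule: local HA needs both layers; wider scope also needs geo support."""
    if not scope:
        return True  # no availability requirement, so nothing to miss
    local_ok = state.infra_ha and state.app_ha
    if scope == {"local"}:
        return local_ok
    return local_ok and state.app_geo


# Services whose availability requirement is not met today.
gaps = [name for name, scope in required_scope.items()
        if not availability_met(scope, current[name])]
print(gaps)
```

The same pattern could be repeated for the recoverability and disaster recovery columns once the rules for "met" are agreed with the business.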

Solution Capability Mapping

The following table can be utilised to assess solution capability. For this example we have looked at the identity management service:

| Capability | | Native Application | 3rd Party Solution/s |
|---|---|---|---|
| Availability | Application native high availability (site) | | |
| | Application native high availability (region) | | |
| Backup & Recoverability | Crash consistent backup and restore | | |
| | Application-level restore granularity | | |
| | File level granular restore | | |
| Disaster Recovery | Active/Active Deployment | | |
| | Active/Passive Deployment | | |
| | Restore from backup (standby service) | | |
| | (Active/Passive) Replication | | |
| | Active/Active Replication | | |

Solutions Options

Once we understand the current state implementation and the potential capability we can provide, we can review the potential options per service. I’m only going to break down two services from the above list, so I’ve decided to look at identity and access management and CRM:

| Service Name | Availability | Recoverability | DR Options |
|---|---|---|---|
| Identity & Access Management | Directory services supports multiple nodes to provide an architecture that supports a distributed model providing availability of service at the local & regional level. | The operating system features out of the box backup and recovery functionality which supports the operating system and application-level recoverability. If granular application recoverability is required there is a recycle bin feature however a 3rd party product would be required to support granular restore. | Utilise multi-site active/active models utilising out of the box features. In the event of loss or corruption of data a restore can be invoked utilising the recoverability solution. |
| CRM | The CRM application vendor does not support a highly available topology. | The operating system features out of the box backup and recovery functionality which supports the operating system and application-level recoverability. Granular recoverability is not supported by the application vendor. | To provide disaster recovery capability we have the following options: OS and/or Application level data replication; Physical/Virtual machine storage replication; Storage Level Replication using crash consistent techniques |

Here we can establish how we can meet our requirements on a per service basis. The realisation here is that it is very rare to have one solution that will meet all our requirements.

Knowing this, we now want to achieve a standardised and, if possible, rationalised set of capabilities to keep our architecture as simple as possible, while catering for the reality that multiple solutions will be needed. To accomplish this at a broad level I’ve suggested the following capability principles:

  • Availability
    • Utilise application availability architectures to provide high availability, e.g., multiple nodes at each application layer such as multiple Exchange CAS/Mailbox roles.
    • Utilise an application load balancing solution that can be leveraged across multiple applications, e.g., a hardware load balancer.
    • Utilise virtualisation technologies to fill gaps when application architecture does not provide high availability, e.g., Fault Tolerance for single virtual machines.
  • Recoverability
    • Standardise on a guest level backup solution that supports application-level backup for core applications, supports virtualisation solutions and can leverage granular snapshot management.
    • Recoverability should utilise disk-based storage for rapid recovery.
    • Periodic off site data replication/shipping should be utilised to provide recoverability in the event of the loss of the primary site.
  • Disaster Recovery
    • For applications that support active/active scenarios, leverage those. Examples of these are Active Directory Domain Services and Exchange Server Database Availability Groups.
    • For mission critical services, utilise replication services which support active/passive near real time replication. Examples of this include hypervisor-based replication, 3rd party replication services and storage-based replication technologies.
    • Link to the recoverability solution to enable restore from backup. This provides efficient recoverability for non-critical services and enables restore of services in the event of data corruption.
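The disaster recovery principles above read naturally as a decision rule, so here is a minimal sketch of one way to express them. The function name, its parameters and the returned labels are hypothetical, and treating a "Severe" business impact as "mission critical" is my own assumption.

```python
def choose_dr_approach(supports_active_active: bool, mission_critical: bool) -> str:
    """Pick a DR approach following the order of the principles above (assumed)."""
    if supports_active_active:
        # Prefer the application's native active/active capability when it exists.
        return "active/active (application native)"
    if mission_critical:
        # No native option, but the service cannot wait for a restore.
        return "active/passive replication"
    # Non-critical services fall back to the recoverability solution.
    return "restore from backup"


# Illustrative inputs based on the services discussed in this post.
services = {
    "Identity & Access Management": (True, True),
    "CRM": (False, True),
    "File services": (False, False),
}

for name, (active_active, critical) in services.items():
    print(f"{name}: {choose_dr_approach(active_active, critical)}")
```

Keeping the rule this small is deliberate: the point is to standardise on a handful of approaches rather than design a bespoke one per service.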

Summary

Providing capability to support business continuity is technically achievable utilising a combination of native and 3rd party solutions. It’s key to understand our business requirements, define standardised solutions to cater for those requirements, then establish an appropriate architecture and solution capability on a per service basis. As with most things, the solutions should be appropriate to meet requirements from a people, process, technology and financial perspective.
