I want a simple way of restoring service in the event of a disaster

Foreword

I wrote this over 5 years ago and wanted to see if it stood the test of time. I also see too often that organisations don’t have the right capabilities to recover, so I figured this is a good post on the subject.

Back to the Future

Having a single, simple method for providing data protection would be ideal: one big red button to rule them all that fails over to a standby secondary datacentre, which you pay for on a consumption-only basis. Sounds too good to be true, right? Well, that’s because (today at least) it is!

We often need to leverage a number of different solutions to support the business requirements, maintain supportability with vendors and ensure efficiency.

High availability, disaster recovery, backup, and business continuity

I’ve often seen these terms used synonymously. This is no surprise given the number of phrases we bandy around the IT industry, but I think it’s important to understand the differences and agree on the terms of reference.

Business Continuity encompasses the following:

Resilience (High availability)
Recoverability (Backup)
Contingency (Disaster Recovery)

Splitting these out we can see the following attributes of each component:

High availability
- High availability refers to the ability of a service to sustain failure of one or more components and continue to function in its ‘production’ state. This can be local to a geography or can span cities, regions or countries.
Recoverability
- The ability to restore a service and/or its data in the event of failure, incorrect change or data corruption.
Disaster Recovery
- The ability to restore a service in the event of a major incident or disaster.

Requirements and Objectives

Before we begin thinking about solutions, it’s important to understand our services and what capability we need from a business perspective across each of the 3 domains.

To demonstrate this, I’ve put together a simplified view. In reality we may need to analyse a service at a far more detailed and granular level, depending upon the size, scale and complexity of the business we are looking at:

Service Name	Supporting or LOB	Impact to business if unavailable	Dependent Services	Production Service Uptime	Service availability Requirement	Backup and Recoverability	Disaster Recovery
Identity & Access Management	Supporting	Severe	All	99.999%	Local & regional	Must be protected against data loss & corruption. Must be able to conduct granular file level restores.	Required to operate an active/active model across regions and be recoverable in the event of loss of region
File services	Supporting	Low	None	98%	None	Must be protected against data loss & corruption. Must be able to conduct granular file level restores.	Must be able to restore the service to the secondary region in the event of a disaster with minimal administrative overhead
CRM	LOB	Severe	Billing	99.999%	Local	Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores.	Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a service linked to generating revenue.
Billing	LOB	Severe	None	99.999%	Local & regional	Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores.	Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a revenue generating service.
E-Commerce Web Services	LOB	Severe	Billing	99.999%	Local & regional	Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores.	Requirement to be able to recover in the event of a disaster with a low RTO/RPO as this is a revenue generating service.
Web Services	LOB	Medium	None	99.9999%	Local, regional and country	Must be protected against data loss & corruption. Must be able to conduct granular file level restores as well as application level restores.	Required to operate across multiple geographies
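
To put the uptime column into perspective, it can help to translate each percentage into a yearly downtime budget. The following is a minimal sketch of that arithmetic, assuming a 365-day year; the function name is my own for illustration and the uptime figures are taken from the table above.

```python
# Minutes in a non-leap year (an assumption made for this sketch).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(uptime_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


# Uptime targets copied from the requirements table.
targets = {
    "Identity & Access Management": 99.999,
    "File services": 98.0,
    "Web Services": 99.9999,
}

for service, uptime in targets.items():
    minutes = allowed_downtime_minutes(uptime)
    print(f"{service}: {minutes:.2f} minutes of downtime per year")
```

A 99.999% target leaves a little over 5 minutes of downtime a year, whereas 98% leaves roughly 7.3 days, which is why the two services call for very different architectures.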

Current State Architecture

Now that we have an understanding of our service requirements we can start to look at our current state architecture:

		Current State
Service Name	Infrastructure Architecture Implementation Supports HA?	Application Architecture Supports HA?	Application Architecture Supports replication and recovery across geographies?
Identity & Access Management	Yes	Yes	Yes
File services	No	No	No
CRM	No	No	No
Billing	No	No	No
E-Commerce Web Services	No	No	No
Web Services	Yes	Yes	No

Current and Future State comparison (GAP)

Now that we understand our requirements and current state architecture, we can complete a gap analysis to understand where architectural change is required. The table below provides a high-level gap analysis.

	Service availability requirement met?	Backup and Recoverability requirement met?	Disaster Recovery requirement met?
Current State
Future State
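
As a rough illustration of the availability part of the gap analysis, the requirements and current state tables from earlier can be compared programmatically. This is a sketch only: the `Service` class, the simplified requirement levels and the rule that a regional or wider requirement needs cross-geography replication are my assumptions, not a definitive method.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    availability_requirement: str  # simplified: "none", "local", "regional" or "country"
    infra_ha: bool                 # infrastructure architecture supports HA?
    app_ha: bool                   # application architecture supports HA?
    geo_replication: bool          # replication and recovery across geographies?


def availability_gaps(svc: Service) -> list[str]:
    """List the capabilities a service is missing for its availability requirement."""
    if svc.availability_requirement == "none":
        return []
    gaps = []
    if not svc.infra_ha:
        gaps.append("infrastructure HA")
    if not svc.app_ha:
        gaps.append("application HA")
    # Assumption: anything beyond a local requirement needs cross-geography replication.
    if svc.availability_requirement in ("regional", "country") and not svc.geo_replication:
        gaps.append("cross-geography replication")
    return gaps


# Values transcribed from the requirements and current state tables.
services = [
    Service("Identity & Access Management", "regional", True, True, True),
    Service("File services", "none", False, False, False),
    Service("CRM", "local", False, False, False),
    Service("Billing", "regional", False, False, False),
    Service("E-Commerce Web Services", "regional", False, False, False),
    Service("Web Services", "country", True, True, False),
]

report = {svc.name: availability_gaps(svc) for svc in services}
for name, gaps in report.items():
    status = "requirement met" if not gaps else "gap in " + ", ".join(gaps)
    print(f"{name}: {status}")
```

Under these assumptions only identity and access management and file services meet their availability requirement today; every other service shows at least one gap.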

Solution Capability Mapping

The following table can be used to assess solution capability. For this example we have looked at the identity management service:

	Capability	Native Application	3rd Party Solution/s
Availability	Application native high availability (site)
Availability	Application native high availability (region)
Backup & Recoverability	Crash consistent backup and restore
	Application-level restore granularity
	File Level granular restore
Disaster Recovery	Active/Active Deployment
	Active/Passive Deployment
	Restore from backup (standby service)
	(Active/Passive) Replication
	Active/Active Replication

Solutions Options

Once we understand the current state implementation and the potential capability we can provide, we can review the potential options per service. I’m only going to break down two services from the above list: identity and access management, and CRM.

Service Name	Availability	Recoverability	DR Options
Identity & Access Management	Directory services supports multiple nodes to provide an architecture that supports a distributed model providing availability of service at the local & regional level.	The operating system features out of the box backup and recovery functionality which supports the operating system and application-level recoverability. If granular application recoverability is required there is a recycle bin feature; however, a 3rd party product would be required to support granular restore.	Utilise multi-site active/active models utilising out of the box features. In the event of loss or corruption of data a restore can be invoked utilising the recoverability solution.
CRM	The CRM application vendor does not support a highly available topology.	The operating system features out of the box backup and recovery functionality which supports the operating system and application-level recoverability. Granular recoverability is not supported by the application vendor.	To provide disaster recovery capability we have the following options: OS and/or application-level data replication; physical/virtual machine storage replication; storage-level replication using crash-consistent techniques

Here we can establish how we can meet our requirements on a per service basis. The realisation here is that it is very rare to have one solution that will meet all our requirements.

Knowing this, we now want to achieve a standardised and, if possible, rationalised set of capabilities to keep our architecture as simple as possible, while catering for the reality that multiple solutions will be needed. To accomplish this at a broad level I’ve suggested the following capability principles:

Availability	Recoverability	Disaster Recovery
Utilise application availability architectures to provide high availability, e.g., multiple nodes at each application layer such as multiple Exchange CAS/Mailbox roles	Standardise on a guest-level backup solution that supports application-level backup for core applications, supports virtualisation solutions and can leverage granular snapshot management. Recoverability should utilise disk-based storage for rapid recovery. Periodic off-site data replication/shipping should be utilised to provide recoverability in the event of the loss of the primary site.	For applications that support active/active scenarios, leverage those. Examples of these are: Active Directory Domain Services; Exchange Server Database Availability Groups
Utilise an application load balancing solution that can be leveraged across multiple applications, e.g., a hardware load balancer		For mission critical services utilise replication services which support active/passive near real time replication. Examples of this include: hypervisor-based replication; 3rd party replication services; storage-based replication technologies
Utilise virtualisation technologies to fill gaps when application architecture does not provide high availability, e.g., Fault Tolerance for single virtual machines		Link to the recoverability solution to enable restore from backup. This provides efficient recoverability for non-critical services and enables restore of services in the event of data corruption.
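
The disaster recovery principles above can be read as a simple decision order: prefer native active/active, fall back to active/passive replication for mission critical services, and restore from backup for everything else. The sketch below is one hedged way to express that; the function name, the two boolean inputs and the classification of each example service are my assumptions for illustration.

```python
def dr_approach(supports_active_active: bool, mission_critical: bool) -> str:
    """Pick a DR principle for a service, following the order of the table above."""
    if supports_active_active:
        # e.g. directory services that replicate natively between sites
        return "native active/active"
    if mission_critical:
        # e.g. hypervisor-based, 3rd party or storage-based replication
        return "active/passive replication"
    return "restore from backup"


# Hypothetical classification of three services from the requirements table.
examples = {
    "Identity & Access Management": (True, True),
    "CRM": (False, True),
    "File services": (False, False),
}

plan = {name: dr_approach(*flags) for name, flags in examples.items()}
for name, approach in plan.items():
    print(f"{name}: {approach}")
```

Keeping the decision this small is deliberate: the fewer branches there are, the easier it is to standardise on a handful of solutions rather than one per service.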

Summary

Providing capability to support business continuity is technically achievable using a combination of native and 3rd party solutions. It’s key to understand our business requirements, define standardised solutions to cater for those requirements, then establish an appropriate architecture and solution capability on a per service basis. As with most things, the solutions should be appropriate to meet requirements from a people, process, technology and financial perspective.
