Image of AWS DR map and Azure Backup and Recovery Services Console

Disaster Recovery of Digital Resources

Why disaster recovery does not have to be a disaster in the cloud

I have often wondered at the nervousness in the engineering community around disaster recovery (DR). Digital resources are only useful while they are reachable: if your customers cannot reach them, you effectively have no digital resources. To me, accounting for infrastructure failure, even in very resilient environments, is not only a wise investment but a priority, because failure will eventually happen. All digital infrastructure, by its very nature, will eventually fail.

The cloud industry has made impressive progress in resilient infrastructure and service availability, which is reflected in the evolution of cloud architecture design patterns. This has pressed many vendors into enhancing their resiliency SLAs and providing customers with products for data backup, managed failover and resource deployment, both in daily operations and in major incidents. AWS's Well-Architected Framework, for example, puts reliability front and centre as one of its pillars. The result is an environment that can fail over databases, VMs and more with little to no downtime. When, however, this daily experience breaks down in a regional failure, the digital experience can easily become a digital nightmare. If this eventuality is not prepared for, confusion can reign, with real business consequences. This is where disaster recovery steps in.

The industry has agreed on a few things around structuring disaster recovery. Firstly, good planning is essential. This means deciding what balance between risk of loss and cost is acceptable. Once this is determined from a risk review of digital resources, the objectives for disaster recovery are set using two main KPIs.

RTO (Recovery Time Objective) – how long will it take me to recover my digital assets from a disaster?

RPO (Recovery Point Objective) – how much data, measured as the time since the last recoverable copy, can I afford to lose in such a recovery?

These two KPIs set expectations for the whole disaster recovery plan, including the cloud services adopted, the cost of adopting them and the follow-up actions needed after the initial recovery. The latter is often overlooked and can be a nasty surprise in a major incident, so detailed and rigorous validation of the plan is advised.
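To make the two KPIs concrete, here is a minimal Python sketch that checks a recovery (real or rehearsed) against RTO and RPO targets. The class and function names, the four-hour and one-hour targets, and the timestamps are all illustrative assumptions rather than recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrObjectives:
    """Hypothetical container for the two KPIs."""
    rto: timedelta  # maximum tolerated downtime
    rpo: timedelta  # maximum tolerated window of lost data


def evaluate_recovery(objectives, outage_start, last_good_backup, service_restored):
    """Compare what actually happened against the agreed objectives."""
    # Downtime runs from the start of the outage until service is back.
    downtime = service_restored - outage_start
    # Anything written after the last recoverable copy is lost.
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Assumed targets: back online within 4 hours, lose at most 1 hour of data.
objectives = DrObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
result = evaluate_recovery(
    objectives,
    outage_start=datetime(2024, 1, 10, 9, 0),
    last_good_backup=datetime(2024, 1, 10, 8, 30),
    service_restored=datetime(2024, 1, 10, 14, 0),
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this made-up incident the backup was recent enough to satisfy the RPO, but five hours of downtime missed the four-hour RTO, which is exactly the kind of gap a plan review should surface.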

Some strategies for meeting RTO/RPO targets for disaster recovery, ranging from active/passive to active/active, are as follows:

  • Backup & Restore (provision) Resources – This is generally the cheapest option, but downtime runs into hours, so the RTO would need to accommodate an extended outage.
  • Pilot Light – Core infrastructure is provisioned in a backup region with data kept replicated, but most application resources stay switched off until they are needed. This is somewhat dearer, and RTO can be up to an hour of downtime depending on scale.
  • Warm Standby – Business-critical resources are always running in a backup region and can be scaled up quickly, reducing the RTO even further. This is correspondingly dearer, given that you have live resources running on a redundant basis in the backup region.
  • Multi-Site/Hot Standby – This is a fully provisioned, live backup site with full resource deployment on an active/active basis. It suits use cases with a very low RTO, as recovery is near immediate. Understandably, this comes at the highest cost but delivers the lowest RTO and RPO.
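The trade-off in the list above can be sketched as a simple lookup: choose the cheapest strategy whose typical RTO still fits the target. The cost ranks follow the ordering in the list; the RTO figures attached to each strategy are illustrative assumptions only, and real values depend on your workload and provider.

```python
from datetime import timedelta

# (strategy name, relative cost rank, assumed typical RTO).
# The RTO values are placeholders for illustration, not vendor figures.
STRATEGIES = [
    ("Backup & Restore", 1, timedelta(hours=8)),
    ("Pilot Light", 2, timedelta(hours=1)),
    ("Warm Standby", 3, timedelta(minutes=15)),
    ("Multi-Site/Hot Standby", 4, timedelta(minutes=1)),
]


def cheapest_strategy(target_rto):
    """Return the lowest-cost strategy whose assumed RTO meets the target."""
    candidates = [s for s in STRATEGIES if s[2] <= target_rto]
    if not candidates:
        return None  # no listed strategy is fast enough for this target
    return min(candidates, key=lambda s: s[1])[0]


print(cheapest_strategy(timedelta(hours=2)))  # Pilot Light
```

A fuller version would apply the same filter to RPO and use actual monthly cost estimates instead of a rank, but the shape of the decision stays the same.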

Finally, training exercises in DR are often run against live production sites. Despite the attractive cost point of such an exercise, this can lead to a disaster in and of itself. I would always recommend testing the process in a test/quality assurance environment. Redeploying current resources to a test environment and synthesizing data loads makes for a far safer and relatively accurate test of your DR automation. Such a project would consume technical resources and incur costs upfront, but it can be proceduralized for safer DR testing on an ongoing basis. These process structures are relatively new, but they do scale.

Whilst there is no single way to do things, not having a DR plan is truly the one option any digital business should seek to leave behind.

Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. So as not to miss any updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!
