Image of Azure's regional health dashboard

Disaster Recovery in the Cloud

Why disaster recovery does not need to be a disaster

If you work on a technical team in the cloud, you will be very familiar with the unease that follows any mention of disaster recovery in team meetings. It's my contention that a culture of pushing disaster recovery to the bottom of the list, so that it exists in name only, has made its way from on-premises environments to the cloud. Looking forward towards positive change, I would advise that a successful disaster recovery strategy is based on effective planning and calm implementation. A plan for developing a disaster recovery strategy should meet company expectations in the following two areas.

  1. How long am I willing to accept my digital product range being inaccessible to my customers due to a disaster, as defined by my company? This is known as the Recovery Time Objective (RTO).
  2. When executing my disaster recovery plan, how much data loss am I willing to accept, measured as how far back in time my most recent recoverable data may be? This is known as the Recovery Point Objective (RPO).
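To make the two objectives concrete, here is a minimal Python sketch that checks a single incident against an RTO and an RPO. The class and function names (`RecoveryObjectives`, `evaluate_incident`) and the timestamps are illustrative assumptions, not part of any vendor's tooling. It treats downtime as the gap between the outage and restored service, and the data loss window as the gap between the last good backup and the outage.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Hypothetical container for the two targets a company agrees on."""
    rto: timedelta  # maximum tolerated downtime
    rpo: timedelta  # maximum tolerated window of lost data


def evaluate_incident(objectives, outage_start, service_restored, last_good_backup):
    """Compare one incident's timings against the agreed RTO and RPO."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


# Illustrative numbers only: a 4-hour RTO and a 1-hour RPO.
objectives = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
report = evaluate_incident(
    objectives,
    outage_start=datetime(2024, 1, 10, 12, 0),
    service_restored=datetime(2024, 1, 10, 15, 30),
    last_good_backup=datetime(2024, 1, 10, 11, 15),
)
print(report["rto_met"], report["rpo_met"])
```

In this made-up incident the service was down for three and a half hours and the last backup was 45 minutes old, so both objectives are met. Running the same check after each simulation gives you a simple record of whether your plan is keeping pace with your targets.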

Bearing these two objectives in mind, a project management approach should be adopted to design and implement a disaster recovery model. I would recommend an iterative methodology, as a discovery cycle is required to assess the cloud vendor's product offerings around disaster recovery. AWS, for example, has its Well-Architected Framework, which has disaster recovery baked into it. For an active project, architectural changes should be assessed along with cloud vendor-managed products during the discovery cycle. Finding the best fit for your use case requires good technical know-how along with good knowledge of your digital products and circumstances. Following this assessment, the best fit for your needs should feed directly into meeting your RTO and RPO, on which business recovery and longer-term success depend.

When your use case is fully explored, cost becomes the next consideration, given that cloud vendors support a variety of disaster recovery strategies at various price points. Whichever you choose, disaster recovery can be integrated into your cloud architecture rather than bolted on afterwards. Here are some well-recognised strategies, which form the basis of an effective evaluation for your project.

  • Backup and Restore – This is an active/passive approach at the lowest price point, given that only your active site has infrastructure fully deployed, with a plan to deploy your architecture to another region using backups and infrastructure as code (IaC). This provisioning takes place after the event, so this approach has the highest RTO.
  • Pilot Light – As the name indicates, this is a minimalist approach in which a small set of resources, such as storage gateways, is provisioned in the secondary region, and backups are deployed on top of this supporting infrastructure when needed, at a slightly higher cost. It is also an active/passive approach with a slightly lower RTO than Backup and Restore, yet it has a similar RPO, which is backup dependent.
  • Warm Standby – This is an active/passive approach at a higher cost again, where infrastructure resources are provisioned and asynchronously replicated between the primary/active site and the secondary/backup site. The approach incurs a high cost, but its secondary networking and computing charges do not come near the price point of an active/active (a.k.a. hot standby) solution. For price-sensitive companies, critical architecture that can tolerate a modest RTO makes a good candidate for this approach. Also, warm standby can be used with managed products such as Azure Backup on individual VMs, so it can neatly scale down from the architecture level for some use cases.
  • Multi-Site – This is by far the dearest of the solutions, intended for critical architecture running with a very low RTO and RPO. This active/active solution has active load balancing between regions with all resources synchronised, making sure disaster recovery at the regional level does not skip a beat for customers. Very high availability SLAs, supported by a corresponding budget, are required for this to work.
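The trade-off between the four strategies above can be sketched as a small selection routine in Python: given a target RTO and RPO, pick the cheapest strategy that satisfies both. The cost ranking follows the ordering described in the list, but every RTO and RPO figure below is a placeholder assumption for illustration only. Real values depend on your architecture, your vendor, and your own test results, and the names `Strategy` and `cheapest_strategy` are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    relative_cost: int       # 1 = cheapest, 4 = dearest (ordering from the list above)
    typical_rto: timedelta   # ASSUMED placeholder, not a vendor figure
    typical_rpo: timedelta   # ASSUMED placeholder, not a vendor figure


# Placeholder numbers chosen only to respect the relative ordering in the text.
STRATEGIES = [
    Strategy("Backup and Restore", 1, timedelta(hours=24), timedelta(hours=24)),
    Strategy("Pilot Light", 2, timedelta(hours=12), timedelta(hours=24)),
    Strategy("Warm Standby", 3, timedelta(hours=1), timedelta(minutes=15)),
    Strategy("Multi-Site", 4, timedelta(minutes=1), timedelta(minutes=1)),
]


def cheapest_strategy(target_rto: timedelta, target_rpo: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy meeting both targets, or None if none does."""
    candidates = [
        s for s in STRATEGIES
        if s.typical_rto <= target_rto and s.typical_rpo <= target_rpo
    ]
    return min(candidates, key=lambda s: s.relative_cost, default=None)


choice = cheapest_strategy(timedelta(hours=2), timedelta(hours=1))
print(choice.name if choice else "No listed strategy meets these targets")
```

Replacing the placeholder figures with timings measured in your own simulations turns this from a thought exercise into a quick sanity check on whether you are paying for more, or less, resilience than your objectives call for.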

As you can see, your product's use case and your customers' expectations around availability, together with your budget, are clear drivers of a successful disaster recovery plan. I would also recommend that, if your disaster recovery strategy is active/passive in nature, you budget for a test environment to run simulations in. This means either notifying your customers of data transfer into a controlled test environment or simulating data loads that match production, with all the production controls in place for the exercise. Major incident management simulations should also be done in a test environment. It is best to invest upfront and recover your investment over time as your plans and skills develop with your business. Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. So as not to miss any updates on my availability, tips on related areas or anything of general interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!
