Blog article category covering DevOps, Cloud Infrastructure, Site Reliability, Technical Writing, Project Management and Commercial Writing, along with Event Management and associated areas.

Disaster Recovery of Digital Resources


Why disaster recovery does not have to be a disaster in the cloud

I have often wondered at the nervousness in the engineering community around disaster recovery (aka DR). Digital resources only exist for your customers while they are reachable; if your infrastructure is down, you effectively have no digital resources. To me, accounting for infrastructure failure, even in very resilient environments, is not only a wise investment but a priority, because it is something that will eventually happen. All digital infrastructure, by its very nature, will eventually fail.

The cloud industry has made impressive progress in resilient infrastructure and service availability, and that progress is reflected in the evolution of cloud architecture design patterns. It has pressed many vendors into strengthening their resiliency SLAs and offering products for data backup, managed failover and resource deployment, both in daily operations and in major incidents. AWS's Well-Architected Framework, for example, places reliability front and centre among its pillars, structuring environments that can fail over databases, VMs and more with little to no downtime. When that daily experience breaks down in a regional failure, however, the digital experience can quickly become a digital nightmare. If this eventuality is not prepared for, confusion can reign, with real business consequences. This is where disaster recovery steps in.

The industry has settled on a few fundamentals for structuring disaster recovery. First, good planning is essential: decide what balance of loss risk versus cost is acceptable. Once that is determined through a risk review of your digital resources, the objectives for disaster recovery are set using two main KPIs.

RTO (Recovery Time Objective) – how long will it take to recover my digital assets after a disaster?

RPO (Recovery Point Objective) – how much data loss can be tolerated, i.e. how far behind the live system may the restored data be?

These two KPIs set the expectations for a disaster recovery plan: the cloud services adopted, the cost involved in adopting them, and the follow-up actions once the initial recovery has been carried out. The last of these is often overlooked and can be a nasty surprise in a major incident, so detailed and rigorous validation of the plan is advised.

Some active/passive strategies for meeting RTO/RPO targets for disaster recovery are as follows:

  • Backup & Restore – Back up data and re-provision resources only after a failure. Generally the cheapest option, but recovery runs into hours, so the RTO must accommodate an extended downtime period.
  • Pilot Light – Core infrastructure is provisioned in a backup region with data kept replicated, while application resources stay switched off until needed. Somewhat dearer, but the RTO can come down to an hour or less depending on scale.
  • Warm Standby – A scaled-down copy of business-critical resources always running in a backup region, which can be scaled up quickly, reducing the RTO timeline even further. Correspondingly dearer again, given live resources run on a redundant basis in the backup region.
  • Multi-Site/Hot Standby – A fully provisioned and live backup site with full resource deployment on an active/active basis. Suitable for use cases with a very low RTO, as failover is near immediate. Understandably this comes at the highest cost, but it delivers the strongest solution.
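The trade-off the list above describes can be sketched as a simple lookup: given an RTO target from your risk review, pick the cheapest strategy that still meets it. The RTO figures and relative costs below are illustrative assumptions for demonstration, not vendor SLAs.

```python
# Illustrative sketch: choose the cheapest DR strategy meeting an RTO target.
# The worst-case RTO minutes and relative costs are assumed figures only --
# replace them with numbers from your own risk review.

# (strategy, worst-case RTO in minutes, relative cost where 1 = cheapest)
DR_STRATEGIES = [
    ("Backup & Restore", 8 * 60, 1),
    ("Pilot Light", 60, 2),
    ("Warm Standby", 15, 3),
    ("Multi-Site/Hot Standby", 1, 4),
]

def cheapest_strategy(rto_target_minutes: int) -> str:
    """Return the lowest-cost strategy whose worst-case RTO meets the target."""
    candidates = [s for s in DR_STRATEGIES if s[1] <= rto_target_minutes]
    if not candidates:
        raise ValueError("No strategy meets this RTO target")
    return min(candidates, key=lambda s: s[2])[0]
```

For example, with an RTO target of 30 minutes the sketch would select Warm Standby, since Backup & Restore and Pilot Light are too slow and Multi-Site is needlessly expensive.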

Finally, DR training exercises are often run against live production sites. Despite the attractive cost point, this can lead to a disaster in and of itself. I would always recommend testing the process in a test/quality-assurance environment. Redeploying current resources to a test environment and synthesizing data loads makes for a far safer and reasonably accurate test of your DR automation. Such a project consumes technical resources and incurs costs upfront, but it can be proceduralized for safer DR testing on an ongoing basis. These process structures are relatively new, but they do scale. Whilst there is no single way to do things, not having a DR plan is truly the one option any digital business should leave behind.

Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. To make sure you don't miss updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!

Site Reliability and Change


Managing change and why site reliability should always be a priority...

Have you ever had one of those moments of clarity in a team standup, noticing that site reliability is not making the headlines on a daily basis? As a cloud infrastructure and systems engineer who has been heavily involved in site reliability over the years, I have often wondered whether I could have created a better outcome by sharing more insights and pushing for wider change beyond my role in the company.

My pondering has led me to some key points on site reliability that affect everyone in a SaaS company.

  • When your company relies on digital products, site reliability is like marketing: it is everybody's responsibility. A company culture that does not embrace site reliability practices can often experience the following realities in its organisation.
  • Personal bias within the technical team is a major cause of errors in how cloud products are built, maintained and supported. What worked in an on-prem data centre in 2014 is likely obsolete in today's cloud infrastructure paradigm.
  • Documentation is often dismissed as a nuisance, which has led to poor adoption of modern technical writing practices. The impact is traditionally felt internally, in poorly curated document repositories that follow neither a house standard nor an open standard like DITA.
  • Production support carries a real risk of human error, and often a blame culture around major incidents. These poor practices tend to be carried forward into the cloud. Modern SRE practice requires a highly curated, standardized suite of alert-response runbooks and how-to documents with a common convention applied across them. The point of this investment in documentation is to give the on-call engineer a management-approved (up-to-date) procedure for dealing with each alert. This approach allows risk to be measured and rated in line with the system's SLOs (service level objectives), and well-written documents also mitigate error risk while on-call.
  • Divergent deployment and infrastructure practices across development teams bypass good SRE practice and many cloud features that standardization could unlock. If your development team(s) have always done it that way, why not review what they are doing and assess whether a standard deployment workflow could serve their project's functional and non-functional requirements? Such a review is a great place to start building a deployment standard that can be automated through a DevOps approach using IaC (infrastructure as code) and CI/CD. It also shines a light on requirements that are not practical for the business to support going forward.
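One small, concrete piece of the runbook discipline described above is making sure every alert resolves to an approved, up-to-date procedure. As a minimal sketch (the alert names and URLs here are hypothetical), that lookup with a safe fallback might look like this:

```python
# Sketch: map alert names to management-approved runbooks so the on-call
# engineer always lands on a current procedure. Alert names and URLs are
# hypothetical examples, not a real alerting catalogue.

RUNBOOKS = {
    "db-replica-lag-high": "https://docs.example.com/runbooks/db-replica-lag",
    "api-5xx-rate-high": "https://docs.example.com/runbooks/api-5xx",
    "disk-usage-critical": "https://docs.example.com/runbooks/disk-usage",
}

# Generic triage procedure for alerts that have no dedicated runbook yet --
# the gap itself is a signal to go and write one.
DEFAULT_RUNBOOK = "https://docs.example.com/runbooks/triage-unknown-alert"

def runbook_for(alert_name: str) -> str:
    """Return the runbook URL for an alert, falling back to generic triage."""
    return RUNBOOKS.get(alert_name, DEFAULT_RUNBOOK)
```

The fallback matters: an unknown alert should never leave the on-call engineer with nothing, and each hit on the default runbook is a documentation gap you can measure against your SLOs.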

Company culture does not change overnight for businesses pivoting to a cloud model. The sooner new leadership and business practices are adopted, the sooner attitudes change, allowing new practices to be embraced by everyone as one product team rather than a host of silos that may work together. One of the first places for this change in a digital/SaaS company is making sure the value of site reliability enters the software development lifecycle early. Practices such as modularizing code and creating awareness of cloud infrastructure features that enhance resiliency, availability and cost are just the beginning. Bedding down a modern, robust, SRE-friendly company culture will underpin future success in an ever-demanding marketplace where the bar for speed, accuracy and quality rises with every deployment.

Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. To make sure you don't miss updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!

Azure DevOps, Integration and Project Success


Why communication around DevOps tools can lead you to project success

Speaking honestly, I can’t imagine anyone wanting to do everything on the command line for every server deployment ever again. This is why ‘DevOps’ is regarded as progress in this new age of automation. As my initial certification path as a cloud engineer came to a close, I pondered which demo projects to focus on. DevOps was my first thought out of the gate, and it brought me just as quickly to my favourite DevOps tool, Azure DevOps.

I started work on Maolte Technical Solution’s very first DevOps project: deploying an Azure VNet with two subnets onto my company’s brand-new demo subscription in Azure. I naturally reached for Azure DevOps, as it has been my go-to whenever former employers, with their Azure DevOps organizations ready to use, gave me the choice. So when my ARM-coded deployment failed with an error about parallelism, I realised I had never created my own Azure DevOps organization before. Azure DevOps disables parallelism by default on free-tier organizations as an anti-fraud measure, so I waited a couple of days for Azure support to complete my enablement request. After shelving the project for those few days, I reflected on my tool selection. Is Azure DevOps the best tool for deploying an Azure VNet with subnets via an ARM template? Well, it's Azure, so yes, it's a good choice, but coincidence rather than design made it so. I guess the old saying ‘be careful what you wish for’ applies to DevOps tool selection too. When designing a structured approach to your DevOps project, remember that project requirements are key to your choice of tool. Here are some good questions to consider before you search for your DevOps tool of choice.

  • Am I deploying an application, SDN infrastructure objects like workspaces and/or provisioning infrastructure resources?
  • Is this project a once-off or to be repeated? 
  • Who is handling the build? Where is the handover from the development team?
  • Does the project require performance, integration and security testing, and should these run manually via integration modules or in a release pipeline?
  • Can the pipeline be run manually and if so, why?
  • What kind of deployment is required? Rolling, A/B, etc?
  • Depending on what type of deployment is required, is the Infrastructure As Code (IaC) written and peer-reviewed to manage the deployment behaviour?
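On that last question, the ARM template for the VNet-with-two-subnets deployment mentioned earlier can itself be generated and sanity-checked before it ever reaches a pipeline. This is a sketch only: the resource names, address ranges and API version are illustrative, and the real template would be peer-reviewed and deployed from your chosen DevOps tool.

```python
# Sketch: programmatically build a minimal ARM template for an Azure VNet
# with two subnets. Names, CIDR ranges and apiVersion are illustrative
# assumptions; validate against the ARM resource reference before use.

def vnet_template(vnet_name: str, address_space: str, subnets: dict) -> dict:
    """Return an ARM template dict describing one VNet and its subnets."""
    return {
        "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Network/virtualNetworks",
                "apiVersion": "2023-04-01",
                "name": vnet_name,
                "location": "[resourceGroup().location]",
                "properties": {
                    "addressSpace": {"addressPrefixes": [address_space]},
                    "subnets": [
                        {"name": name, "properties": {"addressPrefix": prefix}}
                        for name, prefix in subnets.items()
                    ],
                },
            }
        ],
    }

# A demo VNet with front-end and back-end subnets, as in the project above.
template = vnet_template(
    "demo-vnet",
    "10.0.0.0/16",
    {"frontend": "10.0.1.0/24", "backend": "10.0.2.0/24"},
)
```

Generating the template in code makes it trivial to unit-test structure (subnet count, address prefixes) in a pipeline before any `az deployment` step runs, which is exactly the kind of peer-reviewable gate the question list is driving at.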

I can't say the above is an exhaustive list, but it's a great place to start. Questions like these can sometimes be met with below-par responses from fellow engineers and managers, when they should be heartily welcomed: they show enthusiasm, commitment and professional interest in doing the right thing. Clearly setting out what a DevOps engineer should build, together with the stakeholders involved and the resources in existence, gives great clarity on what needs to be done. When we are all on the same page, we can write a great book!

Stay tuned for more on DevOps in this blog, along with articles on other areas of interest in the Writing and Cloud Infrastructure arenas. To make sure you don't miss updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!