Fail to Plan, Plan to Fail
I ran the 2022 disaster recovery exercise on Maolte’s production site and, on reflection, such exercises are not for the faint-hearted. My recovery point objective (RPO) and recovery time objective (RTO) were both met, which showed that my process was sufficient to restore my infrastructure to where it should be within the projected recovery timeline. In this particular case, I lost one blog article on Cloud Migrations and Major Incidents, so I am now going to write about this important topic from a slightly different angle.
When a cloud migration happens, business ambitions, including a shorter implementation timeline, tend to exceed the technical competencies of the day, and the result is a replica in the cloud of what is on-prem. This should never be considered the only available option. However, this 'lift and shift' choice can often be unavoidable unless clear leadership and structured project management objectives are set for the cloud migration project. As with all projects, there is a process of requirements analysis leading to project objectives that form the basis of a project plan. The technical project manager who is tasked with project planning should include discovery around the cloud vendor platform along with a data-driven assessment of the current on-prem infrastructure, to see what one has and what one is moving to. Only then can objectives be assessed for realism and feasibility within the scope of a project plan.
As with all great intentions, a project may be well planned and meet management expectations, yet still not deliver the production impact expected of a successful migration to the cloud. Some of the key factors that limit that impact are:
- Missed opportunities to rearchitect to a cloud-based platform embracing cloud-native products and services where it makes sense.
- Current code architecture is not modular in nature and does not embrace API-based architecture, which underpins modern practices in the cloud.
- No monitoring or observability in place on-prem, with a ‘why bother, it has always worked’ attitude. The development of monitoring solutions adds true value and increases the quality of digital products.
- Old infrastructure patterns carried forward in an effort to ‘lift and shift’ what is on-prem into the cloud miss opportunities for value-creating change. Even if the migration succeeds, it will bring existing issues into the cloud.
Bearing this in mind, the value of major incident management during and after cloud migration projects becomes very clear. Yet in so many otherwise successful cloud migration projects, major incident management as a process and a discipline is overlooked. I would submit that this oversight raises a project’s risk of failure and limits what operations can achieve on site reliability going forward. Here are some pointers on major incident process design that I consider key before migration to the cloud.
- Time to detect, time to report, time to mitigate, and time to resolve are key indicators of operational response readiness. If you don’t know them, then you do not know your infrastructure. Observability implementation should predate a cloud migration project.
- Security and operations should have similar yet separate incident response workflows as they address similar yet separate risks.
- All incidents should be tracked on a ticketing system (e.g. Jira/Zendesk) for audit and follow-up.
- Process workflows should be documented in a manner that supports post-incident resolution workflows around technical root cause analysis and remediation.
- Major incidents on-prem are often not held to SLAs or even a defined process workflow. This is resolvable in the cloud, where SRE practices can be adopted for the structured development of operations, resulting in increased product quality over time.
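To make the first pointer concrete, here is a minimal sketch of how the four response indicators could be derived from incident records exported from a ticketing system. Everything in it is an assumption for illustration: the `Incident` field names are hypothetical, and each indicator is measured from the moment the incident started, which is only one of several reasonable definitions.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass(frozen=True)
class Incident:
    # Hypothetical field names; map them to whatever your ticketing export provides.
    started_at: datetime
    detected_at: datetime
    reported_at: datetime
    mitigated_at: datetime
    resolved_at: datetime


def response_metrics(incidents):
    """Return the mean time, in minutes, from incident start to each milestone."""
    if not incidents:
        raise ValueError("at least one incident is required")

    def mean_minutes(milestone):
        # Assumption: every indicator is measured from started_at.
        return mean(
            (getattr(i, milestone) - i.started_at).total_seconds() / 60
            for i in incidents
        )

    return {
        "time_to_detect": mean_minutes("detected_at"),
        "time_to_report": mean_minutes("reported_at"),
        "time_to_mitigate": mean_minutes("mitigated_at"),
        "time_to_resolve": mean_minutes("resolved_at"),
    }
```

Running something like this against on-prem incident history before the migration gives a baseline to compare against once the workload is in the cloud.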
There is more to this area, but as you can see, creating infrastructure in the cloud is not enough to obtain the value that management seeks in the delivery of digital products. One must unlock the potential of the cloud by building a major incident management process that enhances the availability, resiliency, and quality of your digital product range. There is no better time to begin this journey than today. Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. So as not to miss any updates on my availability, tips on related areas, or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!