As more businesses move into the digital world, they look for the best and easiest adoption strategy, which often points to the cloud. Many cloud migration projects take a bare-bones approach to getting existing digital infrastructure onto the cloud, called a lift and shift. New entrants may adopt a strategy based on the advice of their ‘IT guy’, whose real expertise is in Windows and maybe Mac operating systems and hardware.
Even with cloud experts at the helm, a cloud migration project can be reasonably well formed and still overlook major incident management, the process side of infrastructure management once the project is over. So why is that an issue? Picture a successful deployment to the cloud that meets the non-functional requirement of scaling. Whilst all will be well at the current level of customer usage, scaling infrastructure over time carries two risks. Firstly, what happens if product performance degrades as your infrastructure scales? Will it impact your customer retention rate? Secondly, the larger your infrastructure, the more likely it is that something in it will fail. Over time, the chance of an outage approaches certainty. What is your plan to deal with an outage?
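To put a rough number on the second risk, here is a minimal sketch. It assumes every component fails independently with the same small probability in a given period, which is a simplification (real failures are often correlated). The function name and the 0.1% failure rate are illustrative assumptions, not measured figures.

```python
def prob_any_failure(components: int, p_fail: float) -> float:
    """Chance that at least one of `components` fails in a period.

    Assumes independent failures with an identical probability `p_fail`
    per component (an illustrative simplification).
    """
    if components < 0:
        raise ValueError("components must be non-negative")
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    # P(at least one failure) = 1 - P(no component fails)
    return 1.0 - (1.0 - p_fail) ** components


# Hypothetical 0.1% chance of failure per component per period.
for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} components: {prob_any_failure(n, 0.001):.1%}")
```

Under these assumptions, the chance of at least one failure climbs from about 1% with 10 components to well over half with 1,000, which is why an outage plan matters more as you scale.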
SRE practices are the culmination of lessons learned, and they offer some guidelines for migrating to the cloud or into the digital world. Bear the following questions in mind when defining the non-functional requirements of a digital migration project.
- Does my company have a major incident management process, and is it up to date? Has it proven effective at dealing with major incidents in a way that scales through process, rather than through the tribal knowledge of a few select engineers?
- Is scalability a requirement of the project and if so, will the major incident management process be updated to reflect the new digital reality in the post-project stage?
- Even if my new post-project architecture is highly available, do I have the technical staff and resources to effectively manage major incidents across the full stack in line with my major incident management process? Does the process focus on steps to define the problem or root cause, mitigate the outage and resolve the incident? Is there a follow-up stage to mitigate the risk of it happening again (a.k.a. root cause analysis)?
- Is my migration project at least checking for, if not providing, appropriately trained technical staff to run an effective major incident process in operations and/or security?
- If my intent is to scale quickly, do I have separate major incident management processes in place or a plan to put them in place for both operations and security?
- Do my major incident processes report on key metrics like ‘time to detect’, ‘time to report’, ‘time to mitigate’ and ‘time to resolve’?
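The metrics in the last question can be derived from a handful of timestamps per incident. The following is a minimal sketch, assuming each incident record carries the times it started, was detected, reported, mitigated and resolved. The `Incident` class, its field names and the choice to measure every metric from the start of the incident are illustrative assumptions, not a standard definition.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class Incident:
    # Hypothetical record layout; adapt to whatever your tooling captures.
    started: datetime
    detected: datetime
    reported: datetime
    mitigated: datetime
    resolved: datetime

    def metrics(self) -> Dict[str, timedelta]:
        """Each metric measured from the start of the incident (an assumption)."""
        return {
            "time_to_detect": self.detected - self.started,
            "time_to_report": self.reported - self.started,
            "time_to_mitigate": self.mitigated - self.started,
            "time_to_resolve": self.resolved - self.started,
        }


def mean_metrics(incidents: List[Incident]) -> Dict[str, timedelta]:
    """Average each metric across a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    totals: Dict[str, timedelta] = {}
    for incident in incidents:
        for name, value in incident.metrics().items():
            totals[name] = totals.get(name, timedelta()) + value
    return {name: total / len(incidents) for name, total in totals.items()}


# Usage with made-up timestamps for a single incident.
example = Incident(
    started=datetime(2024, 1, 1, 10, 0),
    detected=datetime(2024, 1, 1, 10, 5),
    reported=datetime(2024, 1, 1, 10, 10),
    mitigated=datetime(2024, 1, 1, 10, 40),
    resolved=datetime(2024, 1, 1, 12, 0),
)
for name, value in mean_metrics([example]).items():
    print(f"{name}: {value}")
```

Tracking the averages over successive reporting periods is one way to see whether the process is keeping pace as the infrastructure scales.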
The timeline of every major incident affects a digital brand over time. End-state processes like this are a wise longer-term requirement in any digital migration project. There is a vast number of known practices and development points that deepen and widen the success of a cloud migration project after it closes. Even the best cloud migration projects, however, can only benefit from a step or cycle that evaluates and/or implements a major incident management process based on the project’s requirements for the new digital infrastructure.
Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. So as not to miss any updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!