Image of Tony Kirtly from Secureworks speaking at Firstcon22

Major Incident Command

Why incident response leadership is so important...

In technology, we all know what to do when things go right, and we have runbooks to tell us what to do when things go wrong. However, in the operational and security domains, things can go wrong in a way that leaves a trail of damage in its wake. I covered Firstcon22 for Irish Tech News and got to talk to loads of interesting people, so stay tuned for my upcoming article there. I had a very interesting chat on incident response with Tony and Jeff from Secureworks, which got me thinking about my time as a major incident manager in the cloud infrastructure space.

Tony's talk on the emotional journey of a victim organisation in a ransomware attack was excellent. Some of his comments are not only excellent advice for security incident response but also apply to any incident management process. This crossover led me to think about what works for a company, and what a company's leadership team should treat as good general advice as well as good incident response advice. Here are some key points:

  • Invest in monitoring so that you catch as many issues as you can before they become major incidents, in any context.
  • Make sure your security posture does not leave you vulnerable to avoidable attacks. Monitor your attack surface rigorously.
  • 24/7 production support from your teams is required to respond to alerts, as technology issues never sleep.
  • When a major incident happens, leaders and engineers need to take a breath and not panic. Panicked recovery actions intended to fix an issue quickly not only waste time that should go towards mitigation but can also be counterproductive by introducing new problems.
  • When a massive outage happens, empower your teams to resolve it through a defined process rather than ad hoc instructions. Consider using logic flows in your company's troubleshooting methodologies.
  • Avoid a culture of blame by design. A company culture of blame, or even one that is autocratic in nature, can lead to a blame game. This often extends the time to mitigation, and dysfunctional politics can distract key talent from the issue at hand.
  • Have a major incident management process that defines the roles and expectations for all parties: major incident commanders, incident scribes, subject matter experts, management and ad-hoc personnel. Make sure that only participants who are key to the incident management process are on the event bridge, and that no digressing inputs or demands are made. Digression by politically strong individuals often extends the time to mitigation.
  • When the issue is defined, ensure your path to mitigation prioritises quality so that the outcome is effective. Avoid succumbing to the push for a quicker turnaround at the expense of quality, as recurring events are not only costly but also unnecessary.
  • In large companies that span the globe, incident commanders should take great care to let all key parties be heard. Local culture, for example, may expect an engineer to speak only through their manager. This should not be tolerated. Address it early by counselling the manager on the company policy of allowing all key personnel to be heard directly. If the manager in this example pushes back, assert yourself by raising your volume but not your tone as you restate what is acceptable and what is not. This asserts dominance, and the incident commander's post-incident report to management should note the confrontation.
  • When the technical root cause is established, questions of blame and guilt should be addressed through consistent leadership around a blameless culture, one that requires everyone to learn from the mistakes made and work together as a team to not repeat them. Team members should feel empowered to grow from the experience, not suffer crushed morale and possibly seek a job elsewhere as a result.
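
To make the point about logic flows a little more concrete, here is a minimal Python sketch of how a troubleshooting flow could be written down as data rather than left to memory on the bridge. Everything in it is an assumption for illustration: the `Step` class, the `walk` function and the example questions and actions are hypothetical and are not taken from any particular company's process.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    """One node in a troubleshooting flow: a yes/no question or a final action."""
    text: str
    if_yes: Optional["Step"] = None
    if_no: Optional["Step"] = None

    @property
    def is_action(self) -> bool:
        # A node with no branches is a leaf, i.e. an action to take.
        return self.if_yes is None and self.if_no is None


def walk(flow: Step, answers: dict[str, bool]) -> list[str]:
    """Follow the flow using the recorded answers and return the path taken."""
    path: list[str] = []
    node: Optional[Step] = flow
    while node is not None:
        path.append(node.text)
        if node.is_action or node.text not in answers:
            # Stop at an action, or at the first question nobody has answered yet.
            break
        node = node.if_yes if answers[node.text] else node.if_no
    return path


# Hypothetical example flow; the wording is illustrative only.
Q_IMPACT = "Is customer traffic affected?"
Q_CHANGE = "Was a change deployed shortly before the alert?"
A_ROLLBACK = "Roll back the change using the runbook"
A_ENGAGE = "Engage the subject matter expert for the failing service"
A_STANDARD = "Handle as a standard incident"

FLOW = Step(
    Q_IMPACT,
    if_yes=Step(Q_CHANGE, if_yes=Step(A_ROLLBACK), if_no=Step(A_ENGAGE)),
    if_no=Step(A_STANDARD),
)

if __name__ == "__main__":
    for line in walk(FLOW, {Q_IMPACT: True, Q_CHANGE: False}):
        print(line)
```

The path returned by `walk` doubles as a record of which questions were asked and how they were answered, which is the kind of detail an incident scribe would want to capture anyway.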

There are many other aspects of major incident management, but as you can see, the leadership of people is vital in this process. Without it, you will struggle to achieve goals that serve the company's best interests. With it, goals can be achieved in a manner that builds teams up rather than breaks them down. Stay tuned for more on infrastructure in this blog, along with articles on other areas of interest in the writing and DevOps arenas. To not miss out on any updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!
