Blog

Blog article category covering DevOps, Cloud Infrastructure, Site Reliability, Technical Writing, Project Management and Commercial Writing, along with Event Management and associated areas.

Image of Tony Kirtly from Secureworks speaking at Firstcon22

Major Incident Command

Why incident response leadership is so important...

In technology, we all know what to do when things go right and have runbooks to tell us what to do when things go wrong. However, in the operational and security domains, things can go really wrong in an impactful way, leaving a trail of damage in their wake. I covered Firstcon22 for Irish Tech News and got to talk to loads of interesting people, so stay tuned for my pending article there. I had a very interesting chat on incident response with Tony and Jeff from Secureworks, which got me thinking about my time as a major incident manager in the cloud infrastructure space.

Tony's talk on the emotional journey of a victim organisation in a ransomware attack was excellent. Much of his commentary is not only excellent advice in the security incident response context but is also applicable to any incident management process. This crossover led me to think about what works for a company, and what a company's leadership team should treat as good general advice in addition to good incident response advice. Here are some key points:

  • Make sure you invest in monitoring and catch as many issues as you can before they become major incidents in any context.
  • Make sure your security posture does not leave you vulnerable to avoidable attacks. Monitor your attack surface in a rigorous, quality-focused manner.
  • 24/7 production support from your teams is required to respond to alerts, as technology issues never sleep.
  • When a major incident happens, leaders and engineers need to take a breath and not panic. Panicked recovery actions intended to quickly fix an issue not only waste time on the way to mitigation but can also be counterproductive by introducing new problems.
  • When a massive outage happens, empower your teams to resolve it through a defined process rather than ad-hoc instructions. Consider using logic flows in your company's troubleshooting methodologies (a minimal sketch follows this list).
  • Avoid a culture of blame by design. A blame culture, or even one that is autocratic in nature, can lead to a blame game. This often extends the time to mitigation, and the dysfunctional politics can distract key talent from focusing on the issue at hand.
  • Have a defined major incident management process that sets out the roles and expectations for all parties: major incident commanders, incident scribes, subject matter experts, management and ad-hoc personnel. Make sure only participants who are key to the incident management process are on the event bridge and that no digressing inputs or demands are made. Digression by politically strong individuals often extends the time to mitigation.
  • When the issue is defined, ensure your path to mitigation is quality-driven and delivers an effective outcome. Avoid succumbing to the push for a quicker turnaround at the expense of quality, as recurring events are not only costly, they are unnecessary.
  • In large companies that span the globe, incident commanders should take care to ensure all key parties are heard. Local culture, for example, may expect an engineer to speak only through her manager. This should not be tolerated. Address it early by counselling the manager on the company policy of allowing all key personnel to be heard directly. If there is pushback from the offending manager in this example, assert yourself by raising your volume but not your tone when restating what is acceptable and what is not. This asserts dominance, and the post-incident follow-up should note the confrontation in the incident commander's report to management.
  • When the technical root cause is established, the key points on blame and guilt should be addressed through consistent leadership around a blameless culture, one that requires everyone to learn from the mistakes made and work together as a team not to repeat them. Team members should feel empowered to grow from the experience, not suffer crushed morale and possibly seek a job elsewhere as a result.
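To make the logic-flow point concrete, here is a minimal, illustrative sketch in Python of a pre-agreed triage decision tree. The fields, checks and actions are assumptions for the example rather than a prescription; the point is simply that routing decisions are defined in advance, not improvised on the bridge.

    # A minimal sketch of a pre-agreed triage logic flow. The fields and
    # actions below are illustrative assumptions, not a real process.
    from dataclasses import dataclass

    @dataclass
    class Incident:
        service: str            # affected service, e.g. "checkout-api"
        customer_impact: bool   # is the customer journey degraded?
        security_related: bool  # is a breach or attack suspected?

    def triage(incident: Incident) -> str:
        """Walk a simple decision tree to a defined next step."""
        if incident.security_related:
            return "Engage the security incident commander and open a dedicated bridge"
        if incident.customer_impact:
            return "Declare a major incident and page the on-call incident commander"
        return f"Route {incident.service} to the owning team's standard on-call"

    if __name__ == "__main__":
        print(triage(Incident("checkout-api", customer_impact=True, security_related=False)))

However simple, a flow like this keeps the bridge focused on agreed next steps instead of competing instructions.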

There are many other aspects of major incident management but as you can see, the leadership of people is vital in this process. Without it, you will find it a struggle to achieve goals that serve the company's best interests. With this approach, goals can be achieved in a manner that builds up teams rather than breaks them down. Stay tuned for more on infrastructure in this blog along with articles on other areas of interest in the writing and DevOps arenas. To not miss out on any updates on my availability, tips on related areas or anything of interest to all, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!

Technical Troubleshooting and Logic Flows

Why Logic Flows are Critical to Success

We all know troubleshooting is a systematic approach to solving technical problems. We also know that a technical problem, especially on a system that is new to us, requires a troubleshooting approach. Why then, in spite of this knowledge, do we fail to resolve the issue through troubleshooting and so extend the time to mitigation? It's fair to say that my career, as with most engineers, has involved a large amount of technical discovery and troubleshooting once we accept it's required. Being confronted with a technical problem on an unfamiliar system exposes a gap that even good technical writing does not always fill. Bad or missing technical writing only makes it worse, as you have to troubleshoot the problem, solve it and then document your solution to make it reusable in a manner aligned with your company's technical writing convention, should one exist.

So, with practice comes know-how. Knowing how to troubleshoot is as valuable as knowing the features and operational eccentricities of a system when it comes to technical problem-solving. On my learning curve to date, I can say that the more I learn about the technology, the less I know, which makes troubleshooting as a superpower even more important. This realisation has led me to refine my troubleshooting techniques, some of which are as follows:

  • Know your system operation. If a runbook or how-to document exists documenting the response to an issue, make sure you are familiar with it.
  • Understand the architecture of your system. This is where logic flows come into it and are critical to problem diagnosis. Make sure your system's workflow is known in terms of the request's journey down the OSI stack, from the layer 7 application in the customer's hands to layer 2 data transmission over the network to your system. Its journey should then be known in detail up your stack, from the internet gateway through your network routers and switches up to the resources you own, such as storage and VMs.
  • When something is broken, you have to find out what it is. Understanding your architecture so you are not looking in the wrong place is the first part, along with testing the failure in your narrowed-down area, assuming your monitoring does not provide this detail via automation. When you don't have the luxury of automation, test your failure point by recreating the issue. For example, have 'tail -f /var/log/messages' open on the server's command line if it is a suspected VM issue and you want to see new log entries to confirm your diagnosis (a scripted version of this log watch is sketched after this list). Another example would be a suspect deployment on Kubernetes, where you would recreate the issue as with the VM example and run 'kubectl logs -f -l name=mydeployment --all-containers' from a machine with kubectl access to the cluster. Bear in mind that command-line know-how around these examples is assumed.
  • The logic flow features again once you have found where the issue is and know how to fix it. Resource dependencies are important to understand, even if your documentation does not cover them. Ensure you understand the consequences of executing a remedial action in terms of its outcome and its effect on dependencies.
  • Escalation is required when you cannot build out your logic flows on the architecture or on remedial actions around resource dependencies. The key to addressing this is being honest with yourself about your knowledge level. Weigh up your estimated discovery time versus escalation costs. If escalation to more experienced colleagues is the only way to progress in a reasonable timeframe, document your escalation and make it happen. The only silly technical question is the one not asked.
  • When you complete the technical mitigation, think of time spent documenting the mitigation process as time well spent after the fact. Do it once, and a well-documented runbook or how-to document will save countless engineer hours going forward. After all, a recurrence of the technical issue does not need such a thorough discovery phase when the runbook or how-to you documented efficiently presents the case and identifies the problem for the investigating engineer.
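As promised above, here is a minimal sketch in Python of watching a log for failure indicators while you recreate an issue, along the lines of the 'tail -f' example. The log path and keywords are assumptions to adapt to your own system; this is a diagnostic aid for confirming a suspected failure point, not a substitute for proper monitoring.

    # A minimal sketch of following a log while recreating a failure.
    # LOG_PATH and KEYWORDS are assumptions; adjust them to your system.
    import time

    LOG_PATH = "/var/log/messages"           # path from the VM example above
    KEYWORDS = ("error", "fail", "timeout")  # assumed indicators of the failure

    def follow(path: str):
        """Yield new lines appended to the file, similar to 'tail -f'."""
        with open(path, "r", errors="replace") as handle:
            handle.seek(0, 2)  # start at the end so only new entries are seen
            while True:
                line = handle.readline()
                if not line:
                    time.sleep(0.5)
                    continue
                yield line

    if __name__ == "__main__":
        for entry in follow(LOG_PATH):
            if any(word in entry.lower() for word in KEYWORDS):
                print(entry.rstrip())

Run it, recreate the issue, and watch for matching lines to confirm or rule out your diagnosis before moving on to remediation.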

These are some of the things I've learned over the years that my training did not quite cover. They are good to know for those among us engaged in technical problem-solving, especially in a production environment. Stay tuned for more on writing in this blog along with articles on other areas of interest in the infrastructure and DevOps arenas. To not miss out on any updates on my availability, tips on related areas or anything of interest to all, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!

Managing Change

Why fortune always favours the well prepared...

We have all heard the British SAS motto "Who Dares Wins". There was even a movie of that name, about wartime heroics somehow executed with gusto and charisma. The truth is somewhat different in the real world, with former British SAS soldier Andy McNab confirming that the SAS's real motto is 'Check and Test'. After all, managing adversity in special operations embraces nearly all of the change management best practice recommendations out there, and I would not be surprised if it gave rise to many of them.

In business and technology, our relatively mundane existence can be just as tricky to navigate if we are not careful. This made me very cautious about giving up my full-time (permanent) job and investing a considerable amount of money, time and effort to become a contractor and business owner. The steps involved will seem alien to many seasoned professionals in established and consistent working environments. As strange as change may seem, everything is up for grabs as one reality is sunsetted for another. As I enter the final set-up stages of Maolte Technical Solutions Limited and the start of the manically busy post-launch stage, here are some key change management practices that have brought me to this point and maximised my chance of success in my risk-averse approach.

  • Understand your goals, in terms of where you want to be and what you want to feel when you achieve them. If you can connect your passion with your goals, you will reach them in a more quality-driven way, making for a more sustainable result. Detailed research and validation of your goals should always be the first step in a change management project.
  • Evaluate risk in continuous cycles over the life of your change project, noting which elements, both positive and negative, were ranked and whether the data behind your rankings is estimated or actual. This approach allowed me to build a data-driven picture of risk on a simple spreadsheet over time, showing a lowering score, and it helped me connect the dots to new risks I had not previously seen. Ranking each element's likelihood and impact on a scale of 1-10, with commentary around them, is the best way to quantify risk, assuming you are not using any complex algorithms associated with managed risk services (a minimal sketch of this scoring follows this list).
  • Project plans should break your project into critical and non-critical paths, with an approach that takes in the objectives of each stage. Remember, change affects people more than processes, so if people are impacted along your project path, ensure they are brought along on the critical path of change. The goal here is to ensure they are informed and educated about the substance of the change once it is clear to you, not before.
  • Be credible from the start. It's a good idea to ensure competencies and education around new tasking and requirements are in place. An ambitious yet phased approach is recommended where people feel in control. You need to ensure you bring your people with you along the project path.
  • Scenario plan for failure and success. Expect the best and plan for the worst, but don't be caught out by your own success. Be sure of your competencies and those of your compatriots in whatever you are doing. In short, capitalise on your success using a planned, data-driven approach that is subject to review and refinement.
  • In change management, every next step and next phase is new to everybody, so ensure your planned approach is timed to include support processes such as training and documentation steps. Also, change products always need to be auditable at all levels of a company, given the inherent risk they bring to an organisation.
  • Define processes in a manner that is woven into your overall process infrastructure at a base working level where steps are carried out manually at first. Automate what you can but make sure you have a central repository of process knowledge, not several.
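For the risk-scoring point above, here is a minimal sketch in Python of the kind of spreadsheet logic described: likelihood and impact ranked 1-10, a score as their product, and a note of whether the data behind each entry is estimated or actual. The example risks are hypothetical, and the total is the figure you would expect to trend downwards over review cycles.

    # A minimal sketch of a simple risk register. The entries below are
    # hypothetical examples, not real figures.
    from dataclasses import dataclass

    @dataclass
    class Risk:
        name: str
        likelihood: int   # 1 (rare) to 10 (almost certain)
        impact: int       # 1 (negligible) to 10 (severe)
        data: str         # "estimated" or "actual"
        comment: str = ""

        @property
        def score(self) -> int:
            return self.likelihood * self.impact

    def total_exposure(register: list[Risk]) -> int:
        """Sum of scores; reviewed each cycle, this should trend downwards."""
        return sum(risk.score for risk in register)

    if __name__ == "__main__":
        register = [
            Risk("Loss of steady income", 6, 8, "estimated", "mitigate with savings runway"),
            Risk("Company set-up delays", 4, 5, "actual", "based on advisor's timeline"),
        ]
        for risk in sorted(register, key=lambda r: r.score, reverse=True):
            print(f"{risk.name}: {risk.score} ({risk.data})")
        print("Total exposure this cycle:", total_exposure(register))

Keeping the scoring this simple makes it easy to re-run each cycle and to see at a glance where estimated data still needs to be replaced with actuals.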

There is more to change management best practice, but the points above feature in the prior and planned steps of my own journey. Change management requires good planning that is flexible, quick thinking and comfort with both structured and unstructured environments. Above all, it requires a calm and methodical approach, as all change management success starts with people, then ideas, then process and finally their execution. Stay tuned for more on writing in this blog along with articles on other areas of interest in the infrastructure and DevOps arenas. To not miss out on any updates on my availability, tips on related areas or anything of interest to all, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!