Why Logic Flows are Critical to Success
We all know troubleshooting is a systematic approach to solving technical problems. We also know that a technical problem, especially on a system that is new to us, demands a troubleshooting approach. So why, in spite of this knowledge, do we fail to resolve the issue systematically and thereby extend the time to mitigation? It is fair to say that my career, like that of most engineers, has involved a great deal of technical discovery and troubleshooting once we accept it is required. Confronting a technical problem on an unfamiliar system exposes a gap that even good technical writing does not always fill. Bad or absent technical writing only makes it worse: you have to troubleshoot the problem, solve it, and then document your solution to make it reusable, in line with your company's technical writing conventions, should they exist.
So, with practice comes know-how. Knowing how to troubleshoot is as valuable as knowing the features and operational eccentricities of a system when it comes to technical problem-solving. On my learning curve to date, I can say that the more I learn about the technology, the more I realise how much I do not know, which makes the superpower of troubleshooting even more important. This realisation has led me to refine my troubleshooting techniques, some of which are as follows:
- Know your system's operation. If a runbook or how-to document exists describing the response to an issue, make sure you are familiar with it.
- Understand the architecture of your system. This is where logic flows come in, and they are critical to problem diagnosis. Know your system's workflow in terms of the request's journey down the OSI stack, from the layer 7 application in the customer's hands to layer 2 data transmission over the network towards your system. Then know its journey up your own stack in detail, from the internet gateway through your routers and switches to the resources you own, such as storage and VMs.
- When something is broken, you have to find out what it is. Understanding your architecture, so that you are not looking in the wrong place, is the first step, along with testing the failure in your narrowed-down area, assuming your monitoring does not surface this detail automatically. When you don't have the luxury of automation, test your suspected failure point by recreating the issue. For example, keep 'tail -f /var/log/messages' running on the server's command line if you suspect a VM issue and want to watch new log entries confirm your diagnosis. Similarly, for a suspect deployment on Kubernetes, recreate the issue as in the VM example while running 'kubectl logs -f -l name=mydeployment --all-containers' from a control plane node. Bear in mind that command-line know-how around these examples is assumed.
- The logic flow features again once you have found where the issue is and know how to fix it. Resource dependencies are important to understand even if your documentation does not cover them. Ensure you understand the consequences of executing remedial action, both its outcome and its effect on dependencies.
- Escalation is required when you cannot build out your logic flows for the architecture or the remedial actions around resource dependencies. The key to addressing this is being honest with yourself about your knowledge level. Weigh your estimated discovery time against the cost of escalation. If escalating to more experienced colleagues is the only way to progress in a reasonable timeframe, document your escalation and make it happen. The only silly technical question is the one not asked.
- When you complete technical mitigation, treat the time spent documenting the mitigation process as time well spent. Do it once, and a well-documented runbook or how-to will save countless engineer hours going forward. After all, a recurrence of the technical issue does not need such a thorough discovery phase when the runbook or how-to you wrote presents the case and efficiently identifies the problem for the investigating engineer or technologist.
These are some of the things I've learned over the years that my training did not quite cover. They are worth knowing for those among us engaged in technical problem-solving, especially in a production environment. Stay tuned for more on writing in this blog, along with articles on other areas of interest in the infrastructure and DevOps arenas. To not miss out on any updates on my availability, tips on related areas, or anything of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!