image of an azure insights monitor on a cosmosdb account.

Monitoring Solutions and Digital Business

Why monitoring solution design is key to digital success

Digital transformation continues at pace, accelerated by the COVID pandemic, yet many companies are struggling with it, including deciding which approach will lead to a successful outcome. Answering that question in full would take far more than one article, so I would like to focus on monitoring and why it should be a prominent feature of any successful digital company. A well-designed and well-implemented monitoring solution, in the cloud or on-premises, has its primary value as an early warning system. With the correct monitoring automation in place across the infrastructure fleet, the value of the solution extends to the surrounding processes that underpin site reliability.

Think of a good monitoring solution with integrated runbook automation. One instance behind a load balancer suffers a CPU spike and fails a health check. The load balancer’s runbook automation takes that node out of the active pool and, in a separate action, notifies the on-call engineer of the single-node issue via an automatically generated ticket that requires investigation. If sticky sessions are not the cause, then what is? Does it present a larger danger to our digital product’s SLA? The automation can even extend, if desired, to restarting the node, which relieves the CPU pressure and restores it automatically to the load balancer’s active node pool. None of this process automation is possible without a well-designed monitoring solution, which triggers both automated and manual process workflows.
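The scenario above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the API of any real load balancer or ticketing tool: the `Node` and `LoadBalancer` classes, the 90% CPU threshold, the node names and the simulated restart are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    cpu_percent: float
    healthy: bool = True


@dataclass
class LoadBalancer:
    """Hypothetical load balancer with runbook automation on health checks."""
    active_pool: List[Node] = field(default_factory=list)
    tickets: List[str] = field(default_factory=list)

    def run_health_check(self, cpu_threshold: float = 90.0,
                         auto_restart: bool = False) -> None:
        # Iterate over a copy because nodes may be removed from the pool.
        for node in list(self.active_pool):
            if node.cpu_percent < cpu_threshold:
                continue
            # Action 1: take the failing node out of the active pool.
            node.healthy = False
            self.active_pool.remove(node)
            # Action 2 (separate): raise a ticket for the on-call engineer.
            self.tickets.append(
                f"CPU spike on {node.name}: {node.cpu_percent:.0f}% "
                f"(threshold {cpu_threshold:.0f}%)"
            )
            # Optional action 3: restart the node and restore it to the pool.
            if auto_restart:
                self.restart(node)

    def restart(self, node: Node) -> None:
        # Simulated restart: assume CPU pressure is relieved afterwards.
        node.cpu_percent = 5.0
        node.healthy = True
        self.active_pool.append(node)


# Without auto-restart: the node leaves the pool and a ticket is raised.
lb = LoadBalancer(active_pool=[Node("web-1", 35.0), Node("web-2", 97.0)])
lb.run_health_check()

# With auto-restart: the node is restarted and returns to the pool.
lb_auto = LoadBalancer(active_pool=[Node("web-1", 35.0), Node("web-2", 97.0)])
lb_auto.run_health_check(auto_restart=True)
```

In both runs the ticket is still raised, because a restart relieves the symptom while the engineer still needs to find the cause.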

To ensure the availability and maximize the performance of your digital fleet, I would recommend the following:

  • Choose your infrastructure platform carefully. For most industries, on-premises infrastructure is less attractive than cloud-based alternatives, given the latter’s wide range of managed services, shorter time to deploy and very high SLA commitments on key infrastructure resources.

  • Model your infrastructure management processes to support site reliability by setting service level objectives (SLOs) for key infrastructure resources, together with service level indicators (SLIs) that your monitoring solution can measure automatically.

  • Design a monitoring solution in which time to detect, time to mitigate and auditability feature strongly, both in the design and in the subsequent management metrics.

  • Automated monitoring agents (e.g. AWS SSM) on nodes provide application-level metrics. This materially reduces time to detect compared with a monitoring solution built on lagging indicators such as logs. The metrics-based approach can take up to 1 hour off the time to detect for an incident. In the case of major incidents impacting customers, that saving can mitigate the risk of severe damage to your digital products and your company’s brand reputation.

  • Time to mitigate, especially for major incidents, can be reduced when your early warning monitoring solution alerts you sooner, automates lower-level remediation (without making matters worse) and parses relevant log data for review by the on-call engineer investigating the incident.

  • Centralized, off-node logging not only improves node health by lightening the storage burden, it also streamlines log review during an incident via centralized query tools. It also creates a better audit trail and investigative path for technical root cause analysis after the incident.
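To make the recommendations on service level objectives, time to detect and time to mitigate concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Incident` class, the 99.9% availability target, the 30-day window and the timestamps are hypothetical. Time to mitigate is measured here from detection to the end of customer impact, and some teams measure it from the start of the fault instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime    # when the fault began
    detected: datetime   # when monitoring raised the alert
    mitigated: datetime  # when customer impact ended

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_mitigate(self) -> timedelta:
        # Assumption: measured from detection, not from the start of the fault.
        return self.mitigated - self.detected


def availability_sli(window: timedelta, incidents: List[Incident]) -> float:
    """Fraction of the window with no customer impact (the indicator)."""
    downtime = sum((i.mitigated - i.started for i in incidents), timedelta())
    return 1 - downtime / window


def meets_slo(sli: float, slo_target: float = 0.999) -> bool:
    """Compare the measured indicator against the objective."""
    return sli >= slo_target


window = timedelta(days=30)
start = datetime(2021, 3, 1, 10, 0)

# Hypothetical incident detected by agent metrics five minutes in.
metrics_based = Incident(start, start + timedelta(minutes=5),
                         start + timedelta(minutes=35))

# The same incident if detection lagged by one hour, as with log-based alerts.
log_based = Incident(start, start + timedelta(minutes=65),
                     start + timedelta(minutes=95))

for label, incident in [("metrics", metrics_based), ("logs", log_based)]:
    sli = availability_sli(window, [incident])
    print(label, incident.time_to_detect, incident.time_to_mitigate,
          f"SLI={sli:.5f}", "SLO met" if meets_slo(sli) else "SLO missed")
```

With these made-up numbers, the time to mitigate is 30 minutes in both cases, yet the extra hour spent detecting the fault is enough to push a single incident past a 99.9% monthly availability objective.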

None of this process infrastructure and automation will succeed unless the end user is considered when designing the monitoring solution. When its importance to the digital business is recognized, monitoring can act as an internal productivity lever in operations and a competitive advantage over the competitor who overlooks it as a key infrastructure resource. Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. So that you do not miss any updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!
