
Site Reliability and Monitoring Solutions

Why SREs need more than just Monitoring...

Site Reliability practices have grown in popularity alongside the mass adoption of the Cloud, where in-house solutions are being abstracted away in favour of value-creating cloud solutions. Whilst production monitoring will never lose the big three infrastructure metrics of memory, CPU and storage, cloud providers and cloud-agnostic monitoring vendors are delivering unprecedented capabilities in their solutions. Gone are the days of bash scripts pinging a host IP on a cron and processing the responses in the same script. That kind of solution is being replaced by managed monitors that ping a domain and test the responses, deriving real availability that accounts for dropped ICMP packets and the other gremlins that get Site Reliability Engineers (SREs) out of bed in the middle of the night to record false alarms.
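The difference between a one-shot ping and a monitor that tests real responses can be sketched in a few lines of Python. This is a minimal illustration, not any provider's implementation: the names `http_probe` and `is_available`, the three-attempt threshold and the timeout values are all assumptions made for the example.

```python
import time
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True when the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (raised for 4xx/5xx) and socket timeouts
        # all derive from OSError, so any of them counts as a failed probe.
        return False


def is_available(probe: Callable[[], bool], attempts: int = 3,
                 delay: float = 1.0) -> bool:
    """Report an outage only when every attempt fails.

    One dropped packet or slow response no longer pages anyone:
    the probe must fail `attempts` times in a row.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause before the next attempt
    return False
```

To try it against a health endpoint you own, call `is_available(lambda: http_probe(url))` with that endpoint's address. Passing the probe in as a callable keeps the retry logic separate from the transport, so the same check can wrap a TCP or DNS probe later.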

All that said, the effectiveness of monitoring is driven by the understanding and vision of the engineers who develop the logic behind it. Before implementing or revising a monitoring solution, consider the following:

  • What are your goals? What is the expected outcome of your monitoring solution? Are your service level objectives (SLOs) defined, and are the availability, resiliency and performance objectives clear? With clarity on what is needed, the design of the monitoring solution becomes clear.
  • Is the infrastructure supportive of an effective monitoring solution, and will implementing it open any security holes? The key questions here are which platform is chosen for monitoring and what level of risk is acceptable to the company. For instance, many logging based solutions require unencrypted data sources for log calls to the monitoring solution. Such a solution carries a risk that should be reviewed and accepted by the company's security team.
  • If the monitoring solution relies on logs as a data source, does your risk profile tolerate the added time to detect while logs are pulled upstream by the logging provider's infrastructure and parsed for your monitors and alerts?
  • Does your monitoring provider integrate metrics with your cloud platform for both infrastructure and agent-based monitoring? If so, take agent-collected memory metrics as an example: will the time to detect benefits outweigh the overhead the agents add to your nodes and the cost of implementing them?
  • Have you identified the key indicators of your fleet's health, made the riskiest ones the most alert sensitive, and configured your alerts with a risk-driven approach that directs your SREs to the highest risks first?
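To make the SLO question in the first bullet concrete, an availability target can be converted into an error budget, meaning the downtime you can afford in a given window. The sketch below is a simple worked calculation; the function names and the 30-day window are assumptions chosen for illustration, not a standard your provider necessarily uses.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of downtime allowed in the window for an availability SLO.

    slo_target is a fraction, for example 0.999 for "three nines".
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining_fraction(slo_target: float, downtime_minutes: float,
                              window_days: int = 30) -> float:
    """Share of the error budget still unspent (negative once it is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget
```

A 99.9% target over 30 days works out to roughly 43.2 minutes of allowed downtime, which gives a concrete number to design alert thresholds and paging policies around.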

So, as you can see, monitoring infrastructure in the cloud is a much broader task, with wider maintenance needs that are risk orientated rather than infrastructure-based. This fluid requirement demands that SREs be quick learners who understand full-stack infrastructure, application-level processes and major incident management. The SRE must also understand the risk associated with each monitor, because that understanding informs the split-second mitigation decisions that can save the company enormously when a major incident strikes.
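One way to picture a risk-driven approach is to score each alert and work the queue from the highest score down. The sketch below assumes a simple impact times likelihood score on a 1 to 5 scale; the `Alert` class, the scale and the alert names are hypothetical, and a real fleet would tune the scoring to its own SLOs.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Alert:
    name: str
    impact: int      # 1 (minor) to 5 (customer-facing outage), assumed scale
    likelihood: int  # 1 (rare) to 5 (almost certain), assumed scale

    @property
    def risk(self) -> int:
        """Simple risk score: impact multiplied by likelihood."""
        return self.impact * self.likelihood


def triage(alerts: Iterable[Alert]) -> List[Alert]:
    """Order alerts so the highest-risk ones are handled first."""
    return sorted(alerts, key=lambda alert: alert.risk, reverse=True)


# Hypothetical alerts firing at the same time.
firing = [
    Alert("disk-usage-warning", impact=2, likelihood=3),
    Alert("checkout-5xx-rate", impact=5, likelihood=4),
    Alert("cert-expiry", impact=4, likelihood=2),
]
queue = [alert.name for alert in triage(firing)]
```

With these scores the checkout error rate lands at the front of the queue and the disk warning at the back.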

This brings me to technical writing and the need for effective process documentation via Runbooks attached to your individual monitors and their associated alerts. As an experienced Engineer who sometimes gets paged at 3 am to answer a production monitoring alert, I know that the risk of making a mistake while still half asleep is real. Effective, up-to-date Runbooks give structured guidance on what to do with an alert, helping the Engineer do the right thing in response. They also account for the fact that it is unrealistic to expect engineers, experienced or not, to remember every monitoring response or where it lives in a Confluence repository. The range of possible alerts is risk-driven and therefore very wide, so by attaching the relevant Runbook URL to your monitoring alert you actively mitigate the risk of an outage in response to the triggered alert.
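Attaching a Runbook to every alert is easy to enforce in code. The sketch below is one possible shape for that check, not a feature of any particular monitoring product: the `Monitor` class, the helper names and the `wiki.example.com` URL are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class Monitor:
    name: str
    runbook_url: str  # link to the Runbook page for this monitor


def has_runbook(monitor: Monitor) -> bool:
    """True when the monitor carries a well-formed http(s) Runbook link."""
    parsed = urlparse(monitor.runbook_url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def missing_runbooks(monitors: Iterable[Monitor]) -> List[str]:
    """Names of monitors that would page someone without guidance."""
    return [m.name for m in monitors if not has_runbook(m)]


def page_message(monitor: Monitor, detail: str) -> str:
    """Build the text sent to the on-call Engineer, Runbook link included."""
    if not has_runbook(monitor):
        raise ValueError(f"monitor {monitor.name!r} has no Runbook attached")
    return f"[ALERT] {monitor.name}: {detail}\nRunbook: {monitor.runbook_url}"
```

Running `missing_runbooks` over your monitor definitions as part of a review or deployment step would surface gaps before a 3 am page does.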

All monitoring solution implementations should start with service level objectives and end with effective Runbooks. Needless to say, every solution, once implemented, should be maintained from monitor to Runbook over time to prevent today's good work from becoming tomorrow's outage. Stay tuned for more on Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. So that you do not miss any updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!

Best Regards

