Image of Chamartín train station in Madrid, Spain

Site Reliability and Change

Managing change and why site reliability should always be a priority...

Have you ever had one of those moments of clarity in a team standup, when you notice that site reliability is not making the headlines day to day? As a cloud infrastructure and systems engineer who has been heavily involved in site reliability over the years, I have often wondered if I could have done things differently and created a better outcome by sharing more insights and pushing for wider change beyond my role in the company.

My pondering has led me to some key points on site reliability that affect everyone in a SaaS company.

  • When your company relies on digital products, site reliability is like marketing: it is everybody's responsibility. A company culture that does not embrace site reliability practices will often experience the problems below.
  • Personal bias in the technical team is a huge cause of errors in how cloud products are constructed, maintained and supported. What worked in an on-prem data centre in 2014 is likely now obsolete in the cloud infrastructure paradigm. 
  • Documentation is often dismissed as a nuisance, which has led to poor adoption of modern technical writing practices. The impact has traditionally been felt internally, in poorly curated document repositories that follow neither a house standard nor an open standard like DITA.
  • Production support carries the risk of human error and, on major incidents, a blame culture. These poor practices are often carried forward to the cloud. Modern SRE practices require a highly curated and standardized document suite of alert response runbooks and how-to documents that follow a common convention. The point of this investment in documentation is to give the on-call engineer a management-approved, up-to-date procedure for dealing with alerts x, y and z. This approach allows risk to be measured and rated in line with the system's SLOs (service level objectives). Well-written documents also reduce the risk of error whilst on-call.
  • Deployment and infrastructure practices that diverge between development teams bypass good SRE practices and leave many cloud features unused. If your development team(s) have always done it a certain way, why not look at what they are doing and assess whether a standard deployment workflow could meet their project's functional and non-functional requirements? Such a review is a great place to start creating a deployment standard that can be automated via a DevOps approach using IaC (infrastructure as code) and CI/CD. It also shines a light on requirements that are not practical for the business to support going forward.
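
To make the point about measuring risk against SLOs a little more concrete, here is a minimal Python sketch of an error budget calculation paired with an alert-to-runbook lookup. Everything in it is an assumption for illustration: the `Slo` class, the 99.9% target, the alert names and the runbook paths are hypothetical and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical availability SLO measured over a rolling window."""
    name: str
    target: float      # e.g. 0.999 means 99.9% of minutes should be healthy
    window_days: int   # length of the rolling window in days


def error_budget_minutes(slo: Slo) -> float:
    """Minutes of unavailability the SLO tolerates within its window."""
    window_minutes = slo.window_days * 24 * 60
    return window_minutes * (1.0 - slo.target)


def budget_remaining(slo: Slo, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo)
    return (budget - downtime_minutes) / budget


# Hypothetical index mapping each alert to its approved runbook.
RUNBOOKS = {
    "HighErrorRate": "runbooks/high-error-rate.md",
    "DiskAlmostFull": "runbooks/disk-almost-full.md",
}


def runbook_for(alert_name: str) -> str:
    """Return the runbook for an alert, or an escalation guide if none exists."""
    return RUNBOOKS.get(alert_name, "runbooks/escalation.md")


if __name__ == "__main__":
    checkout = Slo(name="checkout", target=0.999, window_days=30)
    print(f"Budget: {error_budget_minutes(checkout):.1f} minutes")
    print(f"Remaining: {budget_remaining(checkout, 10.8):.0%}")
    print(f"On-call should open: {runbook_for('HighErrorRate')}")
```

With these assumed numbers, a 99.9% target over 30 days tolerates roughly 43 minutes of downtime, so a single long incident can spend most of the budget. Falling back to an escalation guide for unknown alerts is one possible design choice; a stricter team might prefer to fail loudly when an alert has no runbook.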

Company culture does not change overnight for businesses pivoting to a cloud model. The sooner new leadership and business practices are adopted, the sooner attitudes change, and the sooner everyone embraces new practices as one product team rather than a host of silos that may or may not work together. One of the first places to make this change in a digital or SaaS company is the software development lifecycle: the value of site reliability should make its way in early. Practices such as modularizing code and building awareness of the cloud infrastructure features that improve resiliency, availability and cost are just the beginning. Bedding down a modern, robust, SRE-friendly company culture will underpin future success in an ever more demanding marketplace, where the bar for speed, accuracy and quality gets higher with every deployment.

Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. To avoid missing updates on my availability, tips on related areas or anything else of interest, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!
