Image of St Stephen's Green in Dublin, Ireland, during the Christmas season 2022

Site Reliability and Santa Claus

There is no doubt that as the holiday season starts, Murphy’s law kicks up a gear or two to ensure that what can go wrong does go wrong. While you can’t really hide from Murphy’s law, there are things you can do during the holiday season to improve your chances of finishing your festive dinner in peace, with a quieter on-call pager.

Firstly, if your company has embraced site reliability practices such as highly available design, cloud-native services that increase resiliency, and effective monitoring solutions, your pager will be quieter already. If your company also embraces technical documentation levelling and repository curation, your time on incidents will be reduced, as you can follow a structured path to mitigation for known and ranked issues. With this substantial head start in infrastructure architecture and implementation along SRE design lines, you are in a position to further delight your customers whilst reducing your page count and time away from your festive pudding.

Many in our infrastructure world feel removed from the customer, and even talking to customer-facing staff can prove a challenge. However, I urge you to be courageous, as this collaboration can substantially improve your availability SLA through a season that is still hard to predict: sharper spikes in activity (a.k.a. long tails) can happen with fewer technical staff on the ground to remediate any issues. Here are some key areas to think about in your festive planning for supporting your production fleet:

  • Does business development or sales have a special promotion, marketing drive or campaign that is expected to materially increase site traffic? Will the conversion rate be high, driving payments processing, and are the associated resources in place to meet the load demands?
  • If a specific marketing distribution is expected to generate a huge traffic load, is it accounted for with specific infrastructure resources, like a subdomain pointing to an S3 bucket fronted by Amazon CloudFront (a similar solution is available in Azure)? Such a proactive solution would help level loads on existing resources through a thinly staffed period of the year.
  • Does your digital solution have an analytics layer? If so, is there sufficient infrastructure capacity to process load increases based on expected and past trends?
  • Does the monitoring solution you have in place record application metrics like memory usage? Do these metrics have risk-profiled alarms set up for traffic spikes once products like AWS Auto Scaling Groups (Virtual Machine Scale Sets in Azure) have been altered to accommodate the predicted traffic increases? Do note that scaling groups which have been in use for a longer period offer features that can use predictive analytics to do this for you.
  • If using container solutions like Kubernetes or managed services like EKS/AKS, is your deployment edited to reflect the increased demand expected on the cluster? Does your cluster have enough nodes to support it?
  • Is there a development and infrastructure change freeze in place for all application deployments, save emergency deployments, over the holidays (e.g. Dec 21st to the end of the first week in January)? This is a big one for countering the effects of Murphy's law.
  • Is your documentation readily accessible via repository access and navigation?
  • Is your on-call schedule fair? If one engineer is left with the entire holiday season support, the chances of errors are high, as is the time to report on major incidents. Spread the workload so everybody gets a break.
  • If there is a failure, do your digital products have the ability to fail over automatically to a backup? If not, does your time to mitigate meet the recovery time objective (RTO) under your disaster recovery metrics?
  • Do you have a major incident process you are trained in for major outages?
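
Two of the checklist items above, the change freeze and the node count for a predicted traffic peak, lend themselves to a small pre-deployment check. What follows is only a rough sketch in Python under stated assumptions: the freeze window uses the example dates from the list (with "end of the first week in January" assumed to mean January 7th), and the function names, the requests-per-second capacity model and the 30% headroom default are all hypothetical illustrations rather than part of any particular tool.

```python
import math
from datetime import date

# Hypothetical freeze window, taken from the example dates in the checklist:
# Dec 21st through the end of the first week in January (assumed to be Jan 7th).
FREEZE_START = (12, 21)  # (month, day)
FREEZE_END = (1, 7)      # (month, day)


def in_change_freeze(day: date) -> bool:
    """Return True if `day` falls inside the year-wrapping freeze window."""
    month_day = (day.month, day.day)
    # The window wraps over New Year, so a day is inside it when it is on or
    # after the start OR on or before the end.
    return month_day >= FREEZE_START or month_day <= FREEZE_END


def deployment_allowed(day: date, emergency: bool = False) -> bool:
    """Allow routine deployments outside the freeze; emergencies always pass."""
    return emergency or not in_change_freeze(day)


def nodes_needed(peak_rps: float, rps_per_node: float, headroom: float = 0.3) -> int:
    """Estimate the node count for a forecast peak, with spare headroom.

    A `headroom` of 0.3 means provisioning for 30% more than the forecast.
    Both the forecast and the per-node capacity are inputs you must measure.
    """
    if rps_per_node <= 0:
        raise ValueError("rps_per_node must be positive")
    return max(1, math.ceil(peak_rps * (1 + headroom) / rps_per_node))
```

A deployment pipeline could call `deployment_allowed(date.today())` as a gate and require an explicit `emergency=True` override during the freeze, while `nodes_needed` gives a first estimate to compare against the current size of your cluster or scaling group. The real per-node capacity should come from load testing rather than guesswork.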

There is more, of course, to a great infrastructure solution, but if you can comfortably answer the above with "yup, it’s all in place" then you are indeed ahead of the game. This product-focused approach is an all-around winner for you as the infrastructure engineer enjoying your festive pudding. It's also a winner for the business team, thanks to the availability and reliability of the digital product range, and above all for the customer, who can delight in what you are offering online during this holiday season. Stay tuned for more on Cloud Infrastructure in this blog, along with articles on other areas of interest in the Writing and DevOps arenas. To not miss out on any updates on my availability, tips on related areas, or anything of interest to all, sign up for one of my newsletters in the footer of any page on Maolte. I look forward to us becoming pen pals!
