By Adrian Bridgett | 2021-02-15
What can we learn from aircraft accident investigations, and how can we apply those lessons to SRE?
Over Christmas I fell ill - dizzy, unable to concentrate on work or even complete personal projects.
Instead I watched YouTube. Lots of YouTube - including several accident-investigation channels.
At one point I rolled my eyes at some of the safety practices of the 1960s airline industry.
Yet we have much to learn. The airline industry has shown how to take an inherently dangerous activity and by applying the right pressures, procedures and most of all learning, turn it into one of the safest forms of transport. Even within the last 20 years there has been a 5-fold reduction in deaths.
Prelude - Alerting
Beep, beep, beep.
“Oh, it always does that, just ignore it”.
You’d be horrified if this was a “low fuel pressure” warning on your flight. Or a “low blood pressure” alarm during surgery. Yet we do this all the time in the software industry - just as pilots and surgeons have done…
From operating theatres experiencing 10-30 alarms per hour, to automated aircraft warnings, to database query latency alerts, false positives are a big problem.
Nearly everyone knows the story of “The boy who cried wolf”. I doubt there’s a silver bullet here.
One approach is to alert only on “customer-centric” symptoms (e.g. API response time) rather than underlying metrics (such as disk latency). Alerting on SLO burn is also a useful approach.
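One way to make “SLO burn” concrete is multi-window burn-rate alerting: page only when the error budget is being consumed fast over both a long and a short window, so the short window confirms the problem is still happening. A minimal sketch - the 99.9% target, the 1h/5m window pair, and the 14.4 threshold are illustrative choices, not a prescription:

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 exhausts the budget exactly at the end of the
    SLO window; 14.4 exhausts a 30-day budget in roughly two days.
    """
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / error_budget

def should_page(err_1h: float, err_5m: float, slo_target: float = 0.999) -> bool:
    """Page only when both the long and short windows burn fast."""
    return (burn_rate(err_1h, slo_target) > 14.4
            and burn_rate(err_5m, slo_target) > 14.4)

# 2% of requests failing against a 99.9% SLO burns budget 20x too fast:
print(should_page(err_1h=0.02, err_5m=0.02))  # True
```

The short window stops you paging on a problem that has already resolved itself; the long window stops a brief blip from waking anyone.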
Fundamentally, don’t alert unless you want someone to do something. Draw some clear distinctions:
- Alerts - actively monitored. When triggered, someone takes action (see note).
- Notifications - a “flight recorder” - looked at only to determine what’s gone wrong. e.g. when an alert is triggered, the notifications might give more of a clue about where to look.
- Metrics - any measurement we take. If something isn’t measured, we won’t be able to use it in a post-mortem (or in live troubleshooting).
All too often I see PagerDuty alerts which result in no action. This causes mental stress and frustration - more so at 3am. We must balance false positives and false negatives. What price do we pay in excess alerts by trying to reduce missed alerts to zero?
When an alert proves noisy, reduce its sensitivity, fire it only when it triggers in combination with another alert, or downgrade it to a notification.
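The “in combination” idea can be as simple as routing on corroboration: a low-level symptom pages only when a customer-visible one confirms it. A sketch - the metric names and three-way severity split are hypothetical:

```python
from enum import Enum

class Severity(Enum):
    PAGE = "page"      # wake someone up
    NOTIFY = "notify"  # record for later investigation
    IGNORE = "ignore"

def route(disk_latency_high: bool, api_latency_high: bool) -> Severity:
    """Page only when the low-level symptom corroborates a
    customer-visible one; otherwise downgrade to a notification."""
    if disk_latency_high and api_latency_high:
        return Severity.PAGE
    if disk_latency_high or api_latency_high:
        return Severity.NOTIFY
    return Severity.IGNORE
```

The lone disk-latency trigger still lands in the “flight recorder” as a notification, so nothing is lost for the post-mortem.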
This was what first made me spot the similarity between the two industries. “You expect someone to type in the amount of fuel needed, accurately… during the era when the imperial/metric conversion was occurring?” Given the number of planes taking off each day, human nature and inevitable error, this seems more than a little risky.
At this point, automation on planes was rather minimal. There were safeguards (both pilot and co-pilot entering the numbers, then repeating them back to each other), but they were easy to skip (time pressure, laziness, inadvertent error) - this is human nature. I hope that these days the on-board computers can measure the quantity of fuel (e.g. by weight) and alert if the fuel on board isn’t sufficient for the planned destination.
Bringing this back to SRE, I’ve frequently seen a lack of checklists and automation. Procedures exist as “institutional knowledge” - passed by word of mouth, implemented differently by each team member, steps skipped, onboarding painful and stressful.
I believe that documentation is one of the best investments a team can make. Disappointingly, I find it’s often one of the most neglected areas - leading to “ask so-and-so, they know that” (and no-one else does).
The airline industry also relies heavily on what the computer industry might call “runbooks”. Procedures that have been carefully prepared by experts, examined for problems and tested. “If your airplane has lost an engine, here’s what to do…” I’ve often thought that we are rather lacking in this area - often relying on “seat of the pants” flying.
Why are runbooks and procedures important? Let’s see:
- Written without time constraints
- Input from more people
- Reduces stress on the person dealing with the crisis
- Leads to a faster, safer fix
- Benefit of hindsight “take a backup before…”, “disable batch job before disabling foo service”
- More/better solutions to problems
- Continuously improved
- Automated/scripted (worst case, having a clear procedure makes it much easier to automate in future)
If there’s a procedure, write it down. Do so as you take the actions - it’s more accurate and faster than attempting to do so afterwards. Even better, start to script and automate it. Even if it’s just one simple command, it won’t be long before someone adds a test at the end to catch the times it fails, or a new version that needs different options.
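That evolution - a written step, then a script, then a verification step bolted on after the first silent failure - can be sketched in a few lines. The service name is hypothetical, and `echo` makes this a dry run (drop it to execute for real):

```python
import subprocess

def run_step(step: str, cmd: list[str]) -> None:
    """Run one runbook step, echoing it first so the transcript
    doubles as the incident timeline, and stop the procedure on
    the first failure (check=True raises CalledProcessError)."""
    print(f"STEP: {step}: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

# The written procedure, turned into code one step at a time.
# "example-worker" is a stand-in service name.
run_step("restart the worker", ["echo", "systemctl restart example-worker"])
# The "test at the end" that someone adds after the restart
# silently fails once:
run_step("verify it is up", ["echo", "systemctl is-active example-worker"])
```

Even this trivial wrapper gives you an ordered, repeatable record of what was done - exactly what the post-mortem needs.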
One of the best tips I’ve seen recently was requiring the operator to type in the number of machines a command will affect before it runs. It’s a gloriously simple approach that would have prevented quite a few outages. Even when the command is correct, that little bit of confirmation that it’s going to do what you expect helps reduce unnecessary stress.
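The check itself is tiny. A sketch, kept as a pure function so it’s easy to test - in practice the typed count would come from `input()` and a mismatch would abort the command:

```python
def confirm_blast_radius(hosts: list[str], typed_count: str) -> bool:
    """True only when the operator's typed count matches the number
    of machines the command will actually touch. A mismatch is a
    near-miss caught before it becomes an outage."""
    return typed_count.strip() == str(len(hosts))

hosts = ["web-1", "web-2", "web-3"]  # hypothetical target list
# The operator thought this would drain 2 hosts - stop and re-check:
assert not confirm_blast_radius(hosts, "2")
# Counts agree - proceed:
assert confirm_blast_radius(hosts, "3")
```

The point is that the operator must form an independent expectation before the tool reveals the real number - the same idea as the co-pilot reading the fuel figure back.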
If you can, add time estimates to the runbooks. Knowing that the database took 5 hours to restore helps set expectations (for customers as well as the responder). Add pertinent detail - was the DB 1GB or 50TB? Was it replaying 1 hour of logs or 23? This helps improve the accuracy of the estimates.
Communication
This is another area where we have much to learn.
Firstly, a large number of problems are caused or exacerbated by failures in communication - whether through absence, assumption, or suppression due to deference to seniority.
How do you communicate changes, planned maintenance or outages in your service to your end users? Is it pull or push? Are important announcements drowned out in a sea of noise? Is there a clear way to obtain “flight status”, or are your users expected to remember to check a website?
Providing a status API allows users to trigger circuit breakers automatically. Another great idea is to implement a “change freeze” API that can limit risky actions or pause non-essential automation.
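A change-freeze check might look like the sketch below. The endpoint URL and JSON shape are assumptions, and it deliberately fails closed: if the status service is unreachable, automation assumes a freeze is in effect rather than charging ahead during an incident.

```python
import json
import urllib.request

FREEZE_URL = "https://status.example.com/api/freeze"  # hypothetical endpoint

def deploys_frozen(url: str = FREEZE_URL) -> bool:
    """Ask the status service whether a change freeze is in effect.

    Fails closed: an unreachable status service counts as frozen,
    since mid-incident is the worst time for surprise automation.
    """
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return bool(json.load(resp).get("frozen", True))
    except OSError:  # DNS failure, timeout, connection refused...
        return True

# Non-essential automation checks before acting:
# if deploys_frozen():
#     print("Change freeze in effect - skipping scheduled deploy")
```

The fail-closed default is the design decision worth copying: automation that cannot confirm it is safe to act should not act.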
In an aircraft there are several groups the flight crew need to communicate with:
- Cockpit crew
- Entire flight crew including cabin staff
- Air traffic control
- (Report - or “postmortem” in SRE parlance)
- (Press and families are out of scope)
Each of these has different needs and can provide different levels of assistance.
- During major incidents do you “clear the runway” by requesting that your “control tower” imposes a moratorium on any other changes?
- Do you alert other teams to “brace for impact”?
- Do you seek help from other teams? The equivalent of the fire-crew being ready at the end of the runway might be provisioning extra servers, or preparing a restored DB just in case.
The Flight Recorder
Probably the most famous part of flight safety is the “flight recorder” - or black box. Not least because they aren’t black (they’re bright orange). It captures two components.
The first component is the flight data recorder - what we’d call metrics. It’s critical to note that we cannot go back in time to ask what the metrics were, so we need the forethought to record all the metrics we need, at sufficient resolution and for sufficient duration. If during a postmortem we discover some are inadequate, adding them should be a priority - even if purely to aid future investigations by (in)validating theories.
Generally we are lucky compared with the airline industry - tracking thousands or millions of metrics for years is commonplace.
One area where we fall short is the durability of metrics during problems - it’s not unusual to find time gaps in monitoring exactly when there is a problem (often due to overloaded monitoring, or unqueued writes).
The second component is one that, in my experience, is often missed: the cockpit voice recorder. Postmortem timelines vary in quality - sometimes they are stitched together from vague memories, snippets of Slack conversation spread across many channels, and some git commits (IaC).
I have some explicit tips here:
- Read the PagerDuty Incident Response process
- Open a new Slack channel for the incident so that it’s clearly separated from normal traffic. I strongly believe in open communication, so make it a public channel (to internal users). Use bots to combine this with raising an incident in your favourite incident-handling system.
- Create a video conference room and turn on recording (and auto-captioning). This is often faster than typing in Slack and, most importantly, improves camaraderie. Surprisingly it can be less distracting, as people can think/listen out loud whilst working (rather than having to bounce their eyes back and forth between Slack and a terminal window).
- Post updates to Slack as needed (this helps to form the timeline). Even better, I like to post them straight into the incident report as we go - a clear record for incident handovers.
- Keep everyone else informed - in particular of the current situation, the ETA for the next update and the ETA for a fix. This can be as simple as using the Slack channel notes. No-one should need to ask for an update - they should be able to rely upon the responders to provide periodic updates.
- Use of automation (GitOps, ChatOps, or adding timestamps to your terminal prompts) helps when constructing the timeline.
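A small wrapper can generate that timestamped record automatically, so the timeline is reconstructed from a log rather than from memory. A sketch - `true` below is a stand-in for a real remediation command:

```python
import datetime
import subprocess

def logged(cmd: list[str], log: list[str]) -> None:
    """Run a command and append a timestamped entry to the log,
    giving the postmortem an exact record of what ran and when."""
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat(
        timespec="seconds")
    log.append(f"{stamp} $ {' '.join(cmd)}")
    subprocess.run(cmd, check=True)

timeline: list[str] = []
logged(["true"], timeline)  # stand-in for a real remediation command
print(timeline[0])          # e.g. 2021-02-15T03:12:45+00:00 $ true
```

Pasting such a log into the incident report is the software equivalent of the cockpit voice recorder: cheap to capture in the moment, invaluable afterwards.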
Reports
Or “postmortems” as they are typically called in SRE. Much has been written about these elsewhere, so I won’t repeat it here.
Primarily through painstaking, thorough investigations and a clear drive to improve airplane safety, flying is now incredibly safe.
By comparison, the software industry operates with a far higher tolerance for failure. This may be acceptable - it allows us to move faster.
I wish to leave you with two thoughts:
- We should make a conscious decision about how much risk we wish to take
- By adopting some techniques from other industries we can find novel approaches to improving software system reliability