Incident Write-ups
(Cross-posted from certomodo.io)
When is an incident considered ‘done’? Is it when the production impact has been addressed and the on-call goes back to bed? If that were true, teams would pass up a huge opportunity to learn and improve from the incident, and the on-call (and more importantly, customers) would continue to have a sub-par experience from repeat incidents.
This post discusses the importance and process of the write-up: the document that captures an incident in preparation for the post-mortem and later discussions.
Write-ups are common in other industries, such as aviation, healthcare, and manufacturing, where incidents often involve harm to human life. They drive an investigation into all of the details of the incident, such as contributing factors and potential improvements, in order to eliminate the risk of future harm. As professionals in the tech industry, we have a similar responsibility to create safer products for our customers, even if the stakes aren’t as high.
Let’s go over the process of creating an effective write-up. A few tips when adopting this process:
The author of the write-up should be the engineer with the most context about the incident. It can be the on-call who responded, the subject matter expert for the service that was affected, or a combination of the two.
Begin the write-up as soon as possible after the incident is remediated, ideally immediately or on the next business day. Memories fade, and so do logs and time series!
Describe The Impact
Being able to explain the incident’s impact on the business brings into focus the importance of learning from it. The impact statement should be able to answer the following questions:
How long did the incident last?
Precisely how did it affect customers?
How much money was lost?
A good example of an impact statement would be:
“On 2023-05-12 at 14:00 UTC, 15 customers were completely unable to use Service for 45 minutes, which led to SLA violations that cost us $50,000 in revenue.”
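If your team captures incident data in structured form, a statement like this can be generated consistently. Here’s a minimal Python sketch; the Impact dataclass and its fields are hypothetical, not part of any particular incident-management tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Impact:
    start: datetime          # when customer impact began (UTC)
    duration: timedelta      # how long the impact lasted
    customers_affected: int  # scope: how many customers were hit
    description: str         # precisely how customers were affected
    revenue_lost: int        # dollars lost to SLA violations, refunds, etc.

    def statement(self) -> str:
        minutes = int(self.duration.total_seconds() // 60)
        return (
            f"On {self.start:%Y-%m-%d at %H:%M} UTC, "
            f"{self.customers_affected} customers {self.description} "
            f"for {minutes} minutes, which cost us "
            f"${self.revenue_lost:,} in revenue."
        )

# Hypothetical values that reproduce the example statement above.
impact = Impact(
    start=datetime(2023, 5, 12, 14, 0, tzinfo=timezone.utc),
    duration=timedelta(minutes=45),
    customers_affected=15,
    description="were completely unable to use Service",
    revenue_lost=50_000,
)
print(impact.statement())
```

The code itself matters less than the observation that a statement covering time, duration, scope, and cost falls out naturally once those fields are recorded.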
Create a Timeline
Gather as much information as possible about all events from the start of the incident impact to the moment of remediation. Useful data sources are:
Graphs and dashboards
Alerts from the monitoring system
Logs from production infrastructure or CI/CD
Chat messages
Command line histories
Put the events in chronological order and express them all in one explicitly stated time zone. I strongly recommend UTC, especially for global teams.
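Merging these sources by hand gets tedious, so a small script can handle the normalizing and sorting. Below is a minimal sketch, assuming events have already been exported as tuples of (local timestamp, time zone, source, message); the input format and sample events are illustrative, not from any specific tool:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Raw events exported from different sources, each in its own time zone.
raw_events = [
    ("2023-05-12 10:05", "America/New_York", "chat", "on-call acknowledges the page"),
    ("2023-05-12 14:45", "UTC", "dashboard", "error rate returns to baseline"),
    ("2023-05-12 14:00", "UTC", "alert", "error-rate alert fires"),
]

def to_utc(ts: str, tz_name: str) -> datetime:
    """Parse a local timestamp and normalize it to UTC."""
    local = datetime.strptime(ts, "%Y-%m-%d %H:%M").replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)

# Normalize every event to UTC, then sort chronologically.
timeline = sorted((to_utc(ts, tz), source, msg) for ts, tz, source, msg in raw_events)

for when, source, msg in timeline:
    print(f"{when:%Y-%m-%d %H:%M} UTC  [{source}] {msg}")
```

Note how the 10:05 chat message sorts correctly between the alert and the recovery once it is converted to 14:05 UTC; sorting the raw strings would have placed it first.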
Use the Timeline to Create a Story
Use the timeline as raw material to create a story clearly explaining the important details of the incident to the reader. This will be the main section of the write-up and will be the content presented at post-mortem meetings.
Meta provides a useful mnemonic for this: DERP!, which stands for Detection, Escalation, Remediation, and Prevention.
Detection: What notified the on-call that the incident was happening?
Escalation: Was the on-call able to respond on their own, or did they need to reach out to other teams?
Remediation: How was the incident specifically addressed?
Prevention: What steps need to be taken to prevent recurrence?
Take the important items from the story and separate them into Detection, Escalation, and Remediation. Prevention will be filled out in later steps.
Perform a Cursory Retrospective
It’s important to identify gaps in the response to the incident itself, not just in the systems that failed. Many teams focus solely on ‘root causes’, which is a mistake. How can the team respond better to a similar incident next time?
Fill out the story further by doing a retrospective for each section. Retrospectives tend to answer the following questions:
What went well?
What could have gone better?
What was lucky or circumstantial? (This is just as important as what could have gone better, as luck means it could have been worse!)
More details will be revealed and discussed later in post-mortem meetings, so the goal right now is to identify and track the low-hanging fruit. Examples:
Detection: Was monitoring missing, failing, or taking too long to trigger? Did the customer have to do the monitoring system’s job?
Escalation: Was the escalation path unclear? Did teams take too long to respond to escalations? Did the escalation get bounced between multiple teams?
Remediation: Were tools, runbooks, or operational metrics missing, preventing a swift response? Did the team struggle with fixing forward or rolling back? Did the team have to attempt remediation multiple times? Did an attempt at remediation make the impact worse?
Describe What Triggered the Incident
For many incidents, this step is the most time-consuming. Understanding why the incident happened in the first place can require significant effort spent debugging and making sense of what the logs and metrics suggest. Once discovered, a software bug or defect can be a very straightforward explanation. In other cases, it can require analysis of human factors like emotional states, processes, and incentives, which are much less clear-cut.
Entire books have been written about this subject. Many websites discuss the ‘5 Whys’ and Ishikawa diagramming, which can be useful tools but can also fall short of revealing useful results. I’ll provide some additional guidance here:
Avoid the term ‘root cause’, as incidents tend to be complex and arise from multiple contributing factors. John Allspaw wrote an article about this issue and has spent a lot of time exploring it.
Never use ‘human error’ as an explanation for an incident. Keep the write-up blameless and focus on process and technology, not people. An excellent book on this topic is “The Field Guide to Understanding Human Error” by Dr. Sidney Dekker.
Describe each contributing factor in detail, in its own section.
Identify Fixes and Improvements
Using the incident retrospective and the list of contributing factors, create a list of tasks that, if performed, would prevent recurrence or reduce impact and duration. These go into the Prevention section of the write-up.
Again, focus on listing the items that are obvious. Future post-mortem meetings will reveal additional work with the help of the team and will contextualize that work in terms of impact and priority.
Write-up Format
Here’s a suggested format for write-ups for easy consumption by the reader:
Title of Incident
Impact statement, including time and duration
Main Body
    Detection
    Escalation
    Remediation
    Prevention
What Triggered the Incident
Timeline and supporting data
Conclusion
It is our responsibility in software engineering to understand production incidents and improve from them.
Write-ups are a crucial step: they facilitate data collection, investigation, and the creation of an easy-to-understand story about what happened. They then allow us to conduct effective post-mortems and identify the work that prevents recurrence and reduces impact.
I am passionate about learning from production incidents. Schedule an intro call with me to see how I can help your team with this process!