On-Call Retrospectives
(Cross-posted from certomodo.io)
Last time I shared my thoughts on blameless postmortems and how they create a safe space for revealing the process and technology gaps that contributed to past incidents.
Today I want to introduce another opportunity for teams to learn and improve: the ‘on-call retrospective’, which:
Keeps the team in touch with the operational reality of their service(s);
Reveals opportunities to improve the on-call experience.
I was introduced to this practice by Jos Visser while onboarding at Meta, which inspired me to make it my own and implement it in every on-call team I’m on. Here’s how it works:
Once a week, hold a scheduled 30-minute meeting, ideally attended by everyone on the team. Designate someone to moderate the meeting and take notes (I call them the ‘scribe’).
The agenda will look like this:
First, the scribe will present top-level metrics from the previous week:
If you have Service Level Objectives (and I hope you do :-) ), how did they perform?
How many alerts did the on-call rotation receive? How many of them were critical (meaning: they were tied to an emergency)?
How many helpdesk requests did the team get (if applicable)?
Compare these metrics with past periods to see if there is an improvement or regression in general on-call health. Ideally, this data is automatically collected and made available on a dashboard to make that a trivial process.
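As a rough illustration, the week-over-week comparison could look like the Python sketch below. Everything in it is an assumption made for the example: the WeeklyMetrics fields, the compare_weeks function, and the idea that SLO performance is stored as a single attainment fraction. In practice the numbers would come from your monitoring and ticketing systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyMetrics:
    """Top-level on-call numbers for one week (hypothetical schema)."""
    slo_attainment: float   # assumed: fraction of good events, e.g. 0.9995
    alerts_total: int
    alerts_critical: int
    helpdesk_requests: int


def compare_weeks(previous: WeeklyMetrics, current: WeeklyMetrics) -> dict[str, str]:
    """Label each metric as 'improved', 'regressed', or 'unchanged'."""
    # Higher is better for SLO attainment; lower is better for the counts.
    higher_is_better = {"slo_attainment"}
    result = {}
    for name in ("slo_attainment", "alerts_total", "alerts_critical", "helpdesk_requests"):
        before, after = getattr(previous, name), getattr(current, name)
        if after == before:
            result[name] = "unchanged"
        elif (after > before) == (name in higher_is_better):
            result[name] = "improved"
        else:
            result[name] = "regressed"
    return result


if __name__ == "__main__":
    last_week = WeeklyMetrics(0.999, 40, 5, 12)
    this_week = WeeklyMetrics(0.9995, 31, 5, 15)
    for metric, trend in compare_weeks(last_week, this_week).items():
        print(f"{metric}: {trend}")
```

A simple label per metric is enough to start the conversation in the meeting; a dashboard can show the full trend.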
Next: The scribe will debrief the previous week’s on-calls on their experience. They should be able to present the following information:
What incident(s) they were involved in, and their general impact(s).
WIP handed off to the next shift (incidents, alerts, and support requests still in flight). (IMO: an on-call should work the alerts/support requests they received to completion. It’s totally reasonable to hand off incidents, however.)
How painful on-call was, on a scale of 1 to 5 (1: no impact at all; 5: took up all of my time, as well as after hours).
At Meta we actually put this information in a weekly report to be read before the meeting.
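If you want to standardize that report, here is a minimal sketch of one possible shape for it. This is not the format used at Meta; the ShiftReport class, its field names, and the plain-text layout are all hypothetical and only mirror the three items listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ShiftReport:
    """One on-call's debrief for the week (hypothetical structure)."""
    on_call: str
    incidents: list[str] = field(default_factory=list)   # incident plus general impact
    handed_off: list[str] = field(default_factory=list)  # WIP passed to the next shift
    pain_score: int = 1                                   # 1: no impact, 5: all of my time

    def __post_init__(self) -> None:
        # Reject scores outside the 1-5 scale so week-over-week numbers stay comparable.
        if not 1 <= self.pain_score <= 5:
            raise ValueError("pain_score must be between 1 and 5")

    def render(self) -> str:
        """Return a plain-text report suitable for reading before the meeting."""
        lines = [f"On-call: {self.on_call}", f"Pain score: {self.pain_score}/5"]
        lines.append("Incidents:")
        lines += [f"  - {item}" for item in self.incidents] or ["  (none)"]
        lines.append("Handed off to next shift:")
        lines += [f"  - {item}" for item in self.handed_off] or ["  (none)"]
        return "\n".join(lines)


if __name__ == "__main__":
    report = ShiftReport(
        on_call="alex",
        incidents=["Checkout latency spike, degraded for some users"],
        pain_score=3,
    )
    print(report.render())
```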
Finally: discuss specifics about the shift’s experience, such as:
Were there any alerts that were noisy or unactionable?
Are the alerts’ runbooks out of date?
Can alert remediation be better performed with a tool or automated away entirely?
Is there a bug that we need to bump the priority on?
How can we make this support request self-service?
This is the moment where the scribe is doing the most important work: taking notes on what can be improved for future on-call shifts.
Immediately after the meeting, the scribe converts those notes into tasks in the issue queue. I also suggest assigning the tasks to the on-call who was debriefed so that they may provide additional details.
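The conversion step is easy to script. The sketch below turns the scribe's notes into generic task records assigned to the debriefed on-call. The Note class, the category names, and the dictionary keys are hypothetical; you would map them onto whatever fields your own issue tracker expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """One improvement noted by the scribe (hypothetical structure)."""
    summary: str
    category: str  # e.g. "noisy-alert", "runbook", "automation", "bug", "self-service"


def notes_to_tasks(notes: list[Note], debriefed_on_call: str) -> list[dict]:
    """Build one task record per note, assigned to the on-call who was debriefed."""
    return [
        {
            "title": f"[on-call retro] {note.summary}",
            "labels": ["on-call-retro", note.category],
            "assignee": debriefed_on_call,
        }
        for note in notes
    ]


if __name__ == "__main__":
    notes = [
        Note("Disk-usage alert fires on every nightly backup", "noisy-alert"),
        Note("Restart runbook references a retired host", "runbook"),
    ]
    for task in notes_to_tasks(notes, debriefed_on_call="alex"):
        print(task)
```

Tagging every task with a shared label makes it easy to review in the next retrospective how many of them were actually completed.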
If you and your team then prioritize and tackle those tasks, on-call shifts will be more pleasant and your team will be happier and more productive in doing the things that provide value!
Struggling with on-call burden on your team? Let’s talk!