On-Call Stories: Flying Blind
(cross-posted from certomodo.io)
Let’s try something new and recall one of my most memorable production incidents!
Earlier in my career, I managed an operations team at a medium-sized tech company. The main revenue-generating product consisted of thousands of EC2 instances, all depending on Puppet for configuration management.
Puppet, unlike more recent CM systems like Ansible, used a centralized server to store config manifests and required clients to authenticate in order to apply them. We configured our hosts to phone home periodically to correct any drift in their state over time, and we monitored their ability to do so using Nagios (keep that in the back of your mind for now). Every production environment had its own Puppet server. We knew it was a single point of failure in our architecture, but if it went down, our systems would continue to run. No problem!
And at a particularly unlucky moment one weekend all those years ago, the Puppet server for the largest production environment was claimed by inexorable hardware failure.
William, the current on-call, got paged for the Puppet server’s demise and started to follow a rather elaborate runbook to recover it. Meanwhile, he started to get a steady stream and then a torrent of Nagios emails for every host that couldn’t successfully do a Puppet run.
Our alerts dashboarding tool struggled to display this gargantuan list of failures. The on-call was now flying blind and couldn’t extract the signal of real customer-affecting issues from the noise.
“No problem,” William thought. As soon as the Puppet server was back up, all of the alerts would clear.
And indeed, once he recovered Puppet and triggered successful runs across the affected hosts (which were basically no-ops), the alerts dashboard shrank to a readable size again.
The pager remained quiet for several hours after this incident… too quiet. William checked the alerts dashboard with skeptical curiosity and discovered that it showed at least a couple of customers down!
William started to get flustered. After all, he had gotten hundreds of alerts just a few hours earlier, and now that something had happened that actually required his attention: silence. What gives?
Now, back in those days, we weren’t using alerting systems like PagerDuty (it was an early-stage product back then!). We instead configured Nagios to send alert emails to special Gmail addresses that we would enable on our phones during on-call. We could even filter those emails by alert content and play specific sounds depending on the nature of the alert, to help us triage. At the time, really clever stuff.
William checked his alerts Gmail account and noticed that no emails had been delivered over the past several hours. Zero. None.
Fearing the worst, he SSHed into the Nagios server and tailed the mail logs. William’s eyes bulged in horror as he read pages upon pages of mail delivery errors in which Gmail refused to accept any new alert messages. The Nagios server was effectively banned from talking to Gmail thanks to the recent flood of alerts.
So now, the only way William could ascertain the state of production and respond to incidents in a timely manner was to stare at the eye-melting alerts dashboard and wait for something important to arrive or for Gmail to allow alerts to flow again.
After a while, he got sick of waiting and called his manager: me.
On the other end of the phone, I heard the voice of a very tired and frustrated member of my team dealing with a truly absurd situation. And something deep inside of me just… snapped.
You have to understand, on-call was pretty challenging already. The company was in a period of hyper-growth, and the effort required just to stay afloat was high. Needless to say, it was a stressful time for the Ops team.
In that moment of empathetic fury, I decided that I was going to solve this problem right now and make life easier for William and the next engineer on duty.
I immediately called for reinforcements (Tom) to relieve William so he could rest, buying me time to work on designing the solution.
I started building a Postfix mail server called alerts.company.tld with a corresponding MX record. The Nagios server would then be configured to send alerts to email aliases set up on that server. This let me control and manage the stream of alerts any way I wished before it was relayed on to its final destination, where the SMTP sender’s reputation would matter. But how would I do the actual filtering?
Procmail. That’s right, the 1990s-era mail filter written in C that was used by universities and turbo nerds the world over. I pored over the manpage, did some Googling for example rules, and created a ruleset that would forward only the most important alerts to our mailboxes. After all, we didn’t want to get our SMTP access blocked by Gmail again.
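I no longer have the original config, but here’s a minimal sketch of the shape of it, assuming a local alias on alerts.company.tld that pipes incoming mail through procmail. The alias name, rcfile path, Subject patterns, and destination address below are all illustrative, not the real ones:

    # /etc/aliases on alerts.company.tld: pipe everything sent to the
    # "oncall" alias through procmail (run newaliases after editing)
    oncall: "|/usr/bin/procmail -m /etc/procmail/oncall.rc"

    # /etc/procmail/oncall.rc: forward only the alerts we care about
    LOGFILE=/var/log/procmail-oncall.log

    # Host/service problems at CRITICAL or DOWN get forwarded on
    :0
    * ^Subject:.*PROBLEM.*(CRITICAL|DOWN)
    ! oncall-pager@company.tld

    # Everything else gets delivered straight to /dev/null
    :0
    /dev/null

The important part is that final catch-all recipe: anything that doesn’t match an explicit rule never leaves the box, so the volume hitting Gmail stays tiny.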
Over the course of a few hours, I cobbled this solution together in a fit of rage coding.
I then configured our Nagios servers to forward alerts to this special host, and… it worked. Alerts flowed again!
This special mail server and its procmail filter became the way the Ops team explicitly accepted alerts from production monitoring. Delegating new alerts to us now required a conversation and a pull request against the filter rules. I particularly enjoyed knowing that messages that didn’t match the filter literally got routed to /dev/null!
We eventually turned it into a redundant system and then configured it to route alerts to PagerDuty, which made for really easy adoption by the team.
It’s rumored that this service is still in production to this day…
(Image produced by OpenAI DALL·E 2 with prompt “an expressionist oil painting of a blindfolded airline pilot”.)