(cross-posted from LinkedIn)
Hot Take: If you build it, you run it.
If you build software that people depend on and are not operationally responsible for it (particularly on-call), you should be. 🛑
Seriously. It’s 2024. The Phoenix Project was published over a decade ago. You know more about the system than anyone else; we shouldn’t be delegating on-call to another team. 🙅
Why do I feel so strongly about this? Time for a story. 📖
Earlier in my career, I worked in a cloud ops team that was catching fire due to all the alerts we received.
It was so bad that we:
- were getting critical alerts from PagerDuty once every 10 minutes. Yes, ALL of them were emergencies.
- handed off the pager every two hours during the business day because anything longer than that would have been unbearable. Our on-call schedule looked like a candy cane.
How did we get into this situation? The classic ‘wall of confusion’: the Dev/Ops team split. They wrote the code, we did everything else. Therefore, we were shielding them from the consequences of their changes to the codebase.
One time, I was completely stuck troubleshooting a customer-facing incident as the on-call, and I couldn’t get a hold of a single engineer for help. I was resorting to scouring chat messages and emails and wiki pages for phone numbers to call. I had to call the VP of engineering to get someone to help me. Imagine having to do that.
To be clear, the engineers weren’t being malicious. (Many of them today are my dear friends!) They were under high pressure from the business to deliver new features. Many of them didn’t even know what was going on. The feedback loop was completely broken.
The thing that really angered me about the whole situation was that culturally, the business got the idea that it was our job to constantly firefight. I remember overhearing someone at an all-hands saying, “Why are they complaining? That’s the job they signed up for”. It took EVERYTHING in me to not verbally rip him a new one.
Eventually, after a lot of campaigning, I managed to get buy-in and budget from the C-level to roll out a DevOps and SRE program, starting with introducing engineers to being on-call. They began to receive escalations from our team, fixing that feedback loop, and things started to change for the better.
The fact is, when you’re on-call, you’re automatically incentivized to improve the state of production. Nobody wants to be woken up in the middle of the night, I get it, but nobody else can recover a production system more effectively than the engineers who build it.
Repeat after me: If you build it, you run it.
Introducing software engineers to on-call isn’t as simple as adding them to the rotation; you need organizational buy-in, enablement, and a process that protects morale and work-life balance. ✅
If you need to do this in your organization, let’s schedule an intro! 🗓️
As you can see, I’m pretty passionate about it. 😅
> I remember overhearing someone at an all-hands saying, “Why are they complaining? That’s the job they signed up for”.
This misconception that DevOps/CloudOps/SRE etc. teams are solely reactive is such a driver of poor reliability culture. Imagine if product teams thought about customer complaints that way: "If the users don't like the product, that's customer support's problem". Reliability is a huge part of customer experience, and if everyone at your org isn't bought into it, YNGMI.