How to Get an SRE Role
(cross-posted from certomodo.io)
Are you a software engineer or an IT professional interested in transitioning to an SRE role? You’ve come to the right place! This article provides guidance on the skills and behaviors you need to successfully apply for an SRE position at medium-to-large tech companies.
(This article was inspired by a discussion that took place in the Boston DevOps community chatroom.)
Before we begin, I want to mention the “DevOps Roadmap” that comes up in numerous online discussions about how to prepare for a role in this space. While useful, it emphasizes learning specific technologies and tools without the context as to why they are important. Really large organizations like Meta or Google don’t use many of those tools at all, relying on their own in-house versions instead. For example, neither of those companies uses Kubernetes for workload orchestration. Therefore, this article presents a more generalized approach that will enable you to apply for a broader set of job openings.
Character Traits
SRE as a practice involves a lot of communication, especially with software engineers as you introduce, sell, and implement new ideas such as toil management and SLOs. I would argue that SRE is a form of change leadership, as one of the effects of a successful engagement is a healthier engineering culture. Therefore, cultivating these skills will differentiate you from other candidates (the same can be said for senior roles in software engineering as well).
Expect to have behavioral interviews where you will be asked questions about how you have addressed past situations with others. Make sure your answers showcase these character traits!
Emotional Intelligence
As mentioned in my previous article, emotional intelligence is defined as “the ability to identify and manage one’s own emotions as well as the emotions of others”.
This is a very important skill for an SRE as it allows you to quickly gather feedback on how your work is perceived by others. When embedding on a team, building credibility and trust with your SWE teammates is a prerequisite for any lasting changes. If your message is falling flat, your efforts will ultimately be ineffective. Without EQ, you won’t easily know where you stand with the team.
Resilience
According to the American Psychological Association, resilience is “adapting to difficult or challenging life experiences, especially through mental, emotional, and behavioral flexibility and adjustment to external and internal demands”.
Let’s be frank: site reliability engineers are assigned to teams where help is needed the most. Product launches, production incidents, and team dysfunction are all sources of stress, and skillfully managing how those situations impact you during and after work will be key.
Having tools such as setting healthy boundaries, taking the appropriate time to rest and recharge, and managing problematic emotions such as guilt and shame makes it possible for you to lead an embattled engineering team through their current crisis.
Assertiveness
You have work items related to reliability in your backlog that are getting stale, and you know they are still important. How do you get them prioritized in your team’s roadmap? It won’t naturally happen, and it will be your responsibility to start that conversation.
Being an assertive SRE means that you are speaking up for the things that you know need to happen to create a more reliable product. That requires being confident in your knowledge and experience and being able to use that in your communication to change the perception and behavior of others.
Software engineers will commonly look to SRE for direction on how to approach all sorts of operational challenges, and they won’t feel confident in your insight if you aren’t! There will also be moments when you need to push back on something that you clearly know is problematic or harmful to the team, process, or technology.
Coding Skills
One of the primary responsibilities of an SRE is toil management, which is the practice of removing the manual effort involved in running a service by automating it away. It is always better to write a script to perform an operations task than to follow a runbook, especially when managing infrastructure containing thousands or even millions of hosts. The best solution is to eliminate the need to run scripts entirely by building services/features that handle the operational tasks for you.
Ben Treynor Sloss, the inventor of SRE, said it best: “Hire only coders.” Site Reliability Engineering relies upon software engineering as a main part of the discipline. If you want to be an SRE, you will need to write code. A good first language to learn is Python; it has a very large community and free resources, and many companies use it in production.
What does that mean when interviewing? Expect at least one coding interview that is roughly 45 minutes long. Typically these interviews are made up of two questions. Some companies will conduct it in person and use a markerboard; others will use an online collaboration tool such as coderpad.io. Some guidance:
Be comfortable with tasks such as file I/O and parsing.
Be aware of common data structures and which ones to use when tackling programming problems. Note that the language used will typically have specific ones built in (eg: Python’s list, dict, set, and tuple).
Program defensively. Know how exception handling works in your language and proactively handle likely problems encountered during runtime.
Have an awareness of the time and space complexity of the code you are writing using Big O Notation.
Be able to clearly explain your solution and respond to feedback.
Be honest if you don’t remember a specific method name and invocation: just say so to the interviewer, make the method up, and move on.
Programming interviews for SREs tend not to have a huge emphasis on algorithms; only top-tier companies like Google tend to have that requirement.
Cracking the Coding Interview is a decent book to help you think about the various types of programming questions that an interview can have, as well as the general format of how the interview is conducted. Sites like Codewars and Leetcode can be useful to find example questions to practice with, but avoid the temptation to provide clever, one-liner solutions. Write your solutions as if they were production code.
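To make the guidance above a bit more concrete, here is a minimal Python sketch of the kind of exercise a coding interview might include: counting HTTP status codes in a log. It is not an actual interview question. The log format (status code as the last whitespace-separated field) and the function names are assumptions made purely for illustration.

```python
from collections import Counter


def count_status_codes(lines):
    """Count status codes in an iterable of log lines.

    Assumed (hypothetical) format: the status code is the last
    whitespace-separated field, e.g. "GET /index.html 200".
    Time is O(n) for n lines; space is O(k) for k distinct codes.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        try:
            status = int(fields[-1])
        except ValueError:
            continue  # skip malformed lines instead of crashing
        counts[status] += 1
    return counts


def count_status_codes_in_file(path):
    """Open the log at `path` and count its status codes."""
    try:
        with open(path, encoding="utf-8") as log_file:
            return count_status_codes(log_file)
    except OSError as err:
        # Covers a missing file, bad permissions, and similar I/O errors.
        print(f"could not read {path}: {err}")
        return Counter()
```

A Counter (a dict subclass) is a reasonable choice here because lookups and increments are O(1) on average, and iterating over the file object reads it line by line rather than loading it all into memory.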
All of that said, coding interviews are admittedly an inaccurate assessment of a candidate’s actual programming ability. Some more enlightened companies provide take-home questions as part of the interview process to remove the pressure of writing code on the spot.
Systems Knowledge and Experience
Being able to automate tasks as an SRE is not enough. In order to improve the reliability of the system, there needs to be an understanding of the infrastructure and its hardware resources, and how your product utilizes them. Without this skill, you will not be able to identify performance bottlenecks or resource failures that contribute to outages.
Companies will conduct at least one systems interview to assess this area. Questions can range from discussing how best to perform a task on a production system to a live simulated troubleshooting exercise. To prepare, here are some recommendations:
Linux Internals
The vast majority of production systems run on the Linux kernel and the GNU toolchain. If you gain an understanding of how this system works, you will be able to handle most questions in a systems interview. Example topics:
Process execution, scheduling, and threading
Memory utilization, paging, and swapping
Functions of the kernel
System calls and file descriptors
Interprocess communication: pipes, locks/semaphores, signals, and signal handling
Files, directories, permissions, and inodes
When preparing for my interview at Meta back in 2019, I created a stack of flashcards containing all of these concepts!
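Several of the topics listed above (process creation, system calls, file descriptors, and pipes) fit into a few lines of code. The following is a small sketch, assuming a Linux or other Unix-like host where os.fork is available. The function name is hypothetical.

```python
import os


def ask_child_for_pid():
    """Fork a child that reports its own PID to the parent over a pipe."""
    read_fd, write_fd = os.pipe()  # two file descriptors backed by a kernel buffer
    pid = os.fork()                # system call: duplicate the current process
    if pid == 0:
        # Child process: close the unused read end, write, and exit.
        try:
            os.close(read_fd)
            os.write(write_fd, str(os.getpid()).encode())
        finally:
            os._exit(0)  # exit immediately without running the parent's cleanup code
    # Parent process: close the unused write end so the read can reach EOF.
    os.close(write_fd)
    with os.fdopen(read_fd) as reader:
        reported_pid = int(reader.read())
    _, status = os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return pid, reported_pid, os.WEXITSTATUS(status)
```

The PID that fork returns to the parent should match the PID the child reports about itself. Running the script under strace is one way to see the underlying system calls.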
I would also recommend learning the basics of networking and network services:
TCP and UDP protocols and their differences
Addressing (IPv4, IPv6)
Packet routing
Ports
HTTP protocol and response codes
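One way to make the TCP versus UDP distinction stick is to exercise both over the loopback interface. The sketch below is illustrative only: the function names are hypothetical, and it assumes loopback networking on 127.0.0.1 is available.

```python
import socket


def udp_round_trip(message: bytes) -> bytes:
    """Send one datagram over loopback; no connection is set up first."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2)
        sender.sendto(message, receiver.getsockname())
        data, _addr = receiver.recvfrom(4096)
        return data


def tcp_round_trip(message: bytes) -> bytes:
    """Send bytes over an established TCP connection on loopback."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(2)
        with socket.create_connection(listener.getsockname(), timeout=2) as client:
            # The kernel completes the three-way handshake; the new
            # connection then waits in the listener's accept queue.
            server_side, _addr = listener.accept()
            with server_side:
                server_side.settimeout(2)
                client.sendall(message)
                client.shutdown(socket.SHUT_WR)  # signal end of stream
                chunks = []
                while chunk := server_side.recv(4096):
                    chunks.append(chunk)
                return b"".join(chunks)
```

UDP delivers a self-contained datagram with no handshake and no delivery guarantee, while TCP sets up a connection and delivers an ordered byte stream, which is why the TCP reader loops until it sees end of stream.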
Troubleshooting Tools
When dealing with a production issue, there is always a likelihood that observability and dashboarding tools like Grafana, Splunk, or Datadog won’t surface the specific data you need to troubleshoot. And in the general case when those tools successfully do the job, it is really important to know where those metrics come from.
It is for this reason that I strongly recommend having a working knowledge of CLI tools that can be used to gather information from a running production system. There are many online resources on what tools are available, but here is a short list that I’ve used for many years:
CPU Utilization: top, dstat, vmstat
Memory Utilization: top, free
Storage Capacity Utilization: df, du
Storage I/O Utilization: iostat
Network Bandwidth Utilization: bwm-ng, iftop, vnstat, iptraf
Network Troubleshooting: ping, netcat, traceroute, netstat, tcpdump, nmap, cURL
Process Introspection: strace, bpftrace, lsof
Brendan Gregg’s Systems Performance book is a great reference that goes over all of the above plus even more advanced topics.
I also recommend being familiar with the command line in order to effectively navigate systems as they will not have GUIs. The Command Line Crash Course is excellent for this.
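As an example of knowing where metrics come from: on Linux, tools such as free read memory statistics from the virtual file /proc/meminfo. Here is a rough Python sketch that parses the same data. The function names are hypothetical, and it assumes a kernel recent enough to report the MemAvailable field.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if not sep or not fields:
            continue  # skip lines that are not "Name: value [unit]"
        try:
            values[name.strip()] = int(fields[0])
        except ValueError:
            continue  # skip lines whose value is not an integer
    return values


def memory_available_percent(meminfo):
    """Percentage of total memory that the kernel estimates is available."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def read_meminfo(path="/proc/meminfo"):
    """Linux only: read and parse the kernel-generated meminfo file."""
    with open(path, encoding="ascii") as meminfo_file:
        return parse_meminfo(meminfo_file.read())
```

On a Linux host, memory_available_percent(read_meminfo()) gives a figure you can compare against what free and your dashboards report.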
Linux System Administration Basics
In order to automate away toil, it makes sense to understand what goes into running systems manually. This will help explain why containerization and configuration management tools are essential for running systems at scale.
Concepts to be familiar with:
User Management and file permissions
Package management and host patching
Setting up an application stack (eg: LAMP)
Configuration using /etc, sysctl, and virtual filesystems (/proc, /sys)
Service management using the init system (eg: systemd)
Logging
The best way to learn is by doing. In a virtual machine, install an LTS version of a popular Linux distribution such as Ubuntu or a free RHEL derivative (eg: Rocky Linux), and follow free online guides on how to manually set up services such as webservers, databases, file storage, or even entire production stacks (eg: WordPress on LAMP).
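While practicing, it can help to inspect from code what commands like ls -l and stat show you. As a small companion to the file permissions item above, this sketch reads a file’s inode number, owner, and permission bits using Python’s standard library. The function name and the dictionary keys are my own choices for illustration.

```python
import os
import stat


def describe_file(path):
    """Return the inode number, owner UID, and permissions for a path."""
    info = os.stat(path)  # wraps the stat() system call
    return {
        "inode": info.st_ino,
        "owner_uid": info.st_uid,
        # Symbolic form, as shown in the first column of `ls -l`.
        "mode": stat.filemode(info.st_mode),
        # Octal permission bits, in the form you would pass to chmod.
        "octal": oct(stat.S_IMODE(info.st_mode)),
    }
```

For a regular file that was set with chmod 640, the "mode" entry would read -rw-r----- and the "octal" entry 0o640.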
Conclusion
If you can code, understand what production systems are doing at the process, OS, and resource level, and communicate effectively with your peers in stressful situations, you have the makings of a successful Site Reliability Engineer!
I have built SRE teams, interviewed candidates, and been a successful SRE myself. If you are interested in applying for an SRE role, please contact me and I’ll be happy to provide a free hour of coaching to help you prepare!