SRE Engagement Models
(Cross-posted from certomodo.io)
Last time we went over the basics of what it means to run an SRE team based on the original ideas that came from Google. Let’s talk about the ‘engagement model’, which describes the way that an individual SRE or team works with software engineering organizations to help them achieve their goals.
The SRE Workbook describes the various types of activities at length. In my experience, individual SRE engagements tend to fall into a few distinct types depending on the situation. Important factors include which stage of the product development life cycle (PDLC) the team is in, its current operational load, and the size of the R&D organization.
For example, in the original engagement model from Google, an SRE team assumes most of the operational responsibility for a service on behalf of a SWE team. While effective, this method is costly, as the SRE team needs to be large enough to staff a sustainable on-call rotation. Another constraint is that the service must go through substantial production readiness work before the engagement can begin. Other companies, especially smaller ones, won’t be able to implement SRE in this fashion.
In my experience, running an SRE organization is similar to running a consultancy: you have multiple clients with different needs and timelines, and not enough personnel to serve them all. The primary challenge is to provide the maximum business value by serving the right teams in the right ways at the right times.
Here are the basic types of engagement models that I’ve encountered as an IC and used as a manager:
Consulting: SREs operate outside of software engineering teams and provide guidance and assistance where needed. They work from their own independent roadmap. This model is great when trying to establish a reliability culture and centralize tools and processes across R&D.
Activities: conducting production readiness assessments, assisting with the design phase of a software project, running blameless postmortems, defining and publishing operational best practices, and providing on-call incident management.
Advantages: able to serve entire organizations rather than specific teams; can quickly reprioritize around organizational needs; can operate outside of the challenges and dysfunctional dynamics of specific engineering teams.
Disadvantages: can’t provide in-depth technical direction; difficult to establish and maintain credibility to drive change on individual teams.
Embedded: SREs onboard to a software engineering team and champion and drive reliability from the inside. As team members, they work from the same roadmap and are part of the on-call rotation. This is the typical engagement model used by Meta’s Production Engineering, and Google also uses it to help teams tackle operational overload.
Activities: directly contribute to product features and reliability improvements; lead initiatives such as creating monitoring and SLOs and improving on-call health.
Advantages: can develop an in-depth understanding of a given product and address highly technical issues for a given team relating to architecture or scaling; easier to drive change by building relationships with fellow engineers.
Disadvantages: without support from SWE leadership, an SRE can have limited influence on team process and roadmap prioritization, especially if only one engineer is embedded on the team. Misaligned incentives can diminish an SRE’s effectiveness, and the engagement can be seen as free SWE headcount or, worse, as dedicated ops personnel. Onboarding carries an up-front time cost, usually measured in months. SREs will need to be experienced, or receive frequent coaching, to thrive on difficult teams.
Infra Team: A team composed entirely of SREs builds and operates a key low-level system that the company depends on.
Activities: similar to any software engineering team, with the additional focus on building resilience from the start.
Advantages: total control of the roadmap and product scope; the team can serve as an example or case study showing other teams how to operate reliable systems.
Disadvantages: the opportunity cost of not working with other teams; if team scope and boundaries are not clearly defined, there is a risk of the team regressing to Ops/IT behaviors.
As you can see, there are various approaches to implementing SRE in your organization based on challenge, budget, and need, and like any other engineering discipline, each comes with its own set of tradeoffs. Next time we’ll discuss the specifics of starting an engagement with a software engineering team and how to avoid the common pitfalls!