A site reliability engineer (SRE) can be a generalist or a specialist. Recently, the team at Blameless elaborated on the advantages of a specialized SRE team. Specialization can be emphasized as early as the recruitment process. Depending on the individual's skill set, organizations can engage an SRE in a number of specialist roles, such as educator, SLO guard, infrastructure architect, and incident response leader.
The SRE role has a unique premise that allows specialization alongside awareness of the bigger picture. SREs take the role of "reliability guardian", ensuring adherence to service level objectives (SLOs). SREs also encourage teams to run experiments and take calculated risks in order to learn from them.
When progressing toward reliability, organizations often create SRE teams that work in either distributed or centralized models. SREs may contribute to the codebase of a service or write development policies and procedures. Sometimes, SREs take up glue work: work that is essential for the project to succeed but doesn’t entail contributing code.
Organizations can let specialized SREs spend more time and energy on their strongest areas. For instance, the SREs with strong technical backgrounds can work on infrastructure or in-house tools, and those with less-technical backgrounds can focus on being full-time educators or policy writers.
The theme resonates with the forward-deployed SRE model presented at QCon Plus in July 2021, which describes an SRE who meets the operational needs of teams while also serving a strategic role in the long-term operational excellence of the organization.
Organizations can apply the specialization mindset when writing job postings to build an SRE team. A recent post from DevOps.com analyzes SRE job descriptions and notes the varied expectations placed on SREs.
An organization can look for the following specialist roles in its SRE team:
- The Educator
  - Builds development policies, procedures, cultural values, and infrastructure that will benefit the organization
  - Conducts informational sessions to teach new practices and analyzes how widely they are adopted
  - Needs empathy, along with the ability to convince people to adopt new practices
- The SLO Guard
  - Ensures that each SLO measures what it needs to and is not in breach
  - Sets up SLO review meetings, incorporating additional tools to capture relevant data
  - Cultivates the "ability to say NO". Communicating to someone that development needs to be delayed to preserve the SLO is a critical skill
- The Infrastructure Architect
  - Builds SRE infrastructure for different projects, including documentation for internal tools, runbooks for procedures, and processes for completing projects
  - Works closely with the development team, playing a role similar to an SRE-Developer
  - As this is a technical role, the desired skill set includes an understanding of development processes. Deep knowledge of the organization’s codebase is a must-have
- The Incident Response Leader
  - Responds to incidents effectively and articulately, and ensures that the organization is "incident-ready"
  - Is essential before, during, and after an incident. The responsibilities range from creating an on-call schedule in preparation, to collaborating with teams during the incident, to creating retrospectives afterward
  - Requires people skills, coupled with prioritization skills and awareness of the tools
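To make the SLO Guard's breach check concrete, here is a minimal Python sketch of an error-budget calculation. It is an illustration only and does not come from the Blameless post: the `SloWindow` data shape, the function names, and the request-based definition of an SLO are all assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Request counts for one SLO window (hypothetical data shape)."""
    target: float          # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow) -> float:
    """Fraction of the error budget still unspent; negative when overspent."""
    allowed_failures = (1.0 - window.target) * window.total_requests
    if allowed_failures <= 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if window.failed_requests == 0 else float("-inf")
    return 1.0 - window.failed_requests / allowed_failures


def is_in_breach(window: SloWindow) -> bool:
    """The SLO is in breach once more failures occurred than the budget allows."""
    return error_budget_remaining(window) < 0


if __name__ == "__main__":
    healthy = SloWindow(target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"budget remaining: {error_budget_remaining(healthy):.0%}")
    print(f"in breach: {is_in_breach(healthy)}")
```

A number like the remaining budget gives the SLO Guard something objective to point to when arguing that a release should be delayed to preserve the SLO.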
In related news, the DevOps Institute recently announced the launch of the 2022 State of Site Reliability Engineering Survey. With site reliability engineering being a must-have process and framework, the survey aims to provide the global IT community with practical guidance and support for SRE adoption.