Over recent years we've discussed the role of Site Reliability Engineering (SRE), and in particular how it has grown from being the domain of companies such as Google to being an expectation in other sectors such as finance and healthcare. Recently, technology journalist Alex Handy wrote about how SREs and microservices architectures fit together:
[...] while SREs and microservices evolved in parallel inside the world’s software companies, the former actually makes life far more difficult for the latter.
For Handy, the reason for this is fairly clear:
[...] SREs live and die by their full stack view of the entire system they are maintaining and optimizing. The role combines the skills of a developer with those of an admin, producing an employee capable of debugging applications in production environments when things go completely sideways.
Handy goes on to cover some of the background of SRE and how the function works at scale within Google. He quotes Todd Underwood, one of Google's SRE directors, on the practices and systems Google has put in place to help development groups consider reliability and availability, as well as on technology approaches such as using Paxos for consensus in distributed systems.
Underwood highlights another aspect of the SRE job that is essential, here, however: visibility. When microservices are throwing billions of packets across constantly changing ecosystems of cloud-based servers, containers, and databases, finding out what went wrong where is essential to troubleshooting any type of problem. This is where the full stack aspects of an SRE’s job come into place.
According to Morgan McLean, a product manager at Google, the key here is monitoring and traceability of microservices, something others have stated in the past and we've covered elsewhere. In his article, Handy mentions a few new tools Google has released to help tackle the problem:
[...] Google recently released Stackdriver Trace, Stackdriver Debugger, and Stackdriver Profiler. There’s a reason these tools sound like old-school testing and operations tools from traditional enterprise vendors: they perform the more traditional troubleshooting tasks developers and operations people are used to, but with a focus on microservices and performing these duties in the cloud.
McLean is quoted in Handy's article summarising what these tools do to help SRE groups better manage microservice-based architectures. He states that although tracing is important, Google believes the profiling and debugging aspects of its tools are unique at this stage and bring key benefits to developers and SREs. Handy finishes his article by covering monitoring, metrics and observability further, with more references from Google and others in the industry; these are worth considering because they are likely to be relevant to a growing number of companies.
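To make the traceability point more concrete, here is a minimal sketch of how a trace identifier might be passed between services so that one request can be followed end to end. It is an illustration only and is not how Stackdriver Trace is implemented. The service names, the `x-trace-id` and `x-parent-span-id` header names, and the in-memory `COLLECTED` list are all hypothetical stand-ins for real services, real propagation headers and a real tracing backend.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One timed unit of work done by one service for one request."""
    trace_id: str              # shared by every span in the same request
    span_id: str               # unique to this unit of work
    parent_id: Optional[str]   # span that called us; None for the root
    service: str
    start: float = field(default_factory=time.monotonic)
    duration: Optional[float] = None


# Hypothetical stand-in for a tracing backend.
COLLECTED: List[Span] = []


def start_span(service: str, headers: Dict[str, str]) -> Span:
    # Reuse the caller's trace ID if present, otherwise start a new trace.
    return Span(
        trace_id=headers.get("x-trace-id") or uuid.uuid4().hex,
        span_id=uuid.uuid4().hex[:16],
        parent_id=headers.get("x-parent-span-id"),
        service=service,
    )


def finish_span(span: Span) -> None:
    span.duration = time.monotonic() - span.start
    COLLECTED.append(span)


def outgoing_headers(span: Span) -> Dict[str, str]:
    # Headers a service would attach to its own downstream calls.
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


def inventory_service(headers: Dict[str, str]) -> None:
    span = start_span("inventory", headers)
    finish_span(span)


def checkout_service(headers: Dict[str, str]) -> None:
    span = start_span("checkout", headers)
    inventory_service(outgoing_headers(span))  # simulated remote call
    finish_span(span)


def frontend(headers: Optional[Dict[str, str]] = None) -> str:
    span = start_span("frontend", headers or {})
    checkout_service(outgoing_headers(span))   # simulated remote call
    finish_span(span)
    return span.trace_id
```

Calling `frontend()` records three spans that share one trace ID, with each span's `parent_id` pointing at the span that called it. That parent chain is what lets someone investigating a failure reconstruct which service called which, and how long each step took.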
As more and more developers and companies adopt microservices, and many of them also use or begin to use SRE teams, it will be interesting to see how architectures and tooling evolve to ensure that reliability, availability, consistency and similar properties are maintained, so that developers and SRE teams can work in harmony. If you have any experiences to share in that regard, positive or negative, it would be useful for the wider community to hear about them.