When we design team and departmental processes, we want to know what’s happening in the software teams. Asking team members to provide information or fill in fields in tools adds a burden and distorts reality. Setting up observability in the software can provide comparable insights in a less intrusive way, which makes it an asset for organizing teams.
Jessica Kerr spoke about applying observability for speed and flow of delivery at QCon London 2022 and QCon Plus May 2022.
When you want to go fast, it helps to see where you’re going, Kerr stated. To deliver fast and focus on flow, teams can use observability, as Kerr explained:
Observability gives developers clues into the consequences of each change they make: did it do what we expected? Did it do anything else harmful?
Observability lets developers see performance regressions and error rates, check usage on features, and show the functionality to others in an intelligible, referenceable way, Kerr said.
Kerr explained how observability in the software can become an asset to organizing teams:
As leaders of teams, we can use observability by adding tracing to continuous integration. Then we can measure deploy frequency and build times. We can graph those the same way we measure performance in software. And when it’s time to improve lead time (from commit to production), we can see what’s taking so long in our builds and fix it.
A little bit of system knowledge plus a distributed trace gives a lot of insight, Kerr concluded.
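To make the idea of measuring builds like software concrete, here is a minimal sketch in Python. It assumes each CI step has already been exported as a span with a build ID, a step name, and start and end timestamps. The `BuildSpan` type, the step names, and the `"deploy"` convention are hypothetical and chosen only for illustration; they are not taken from Kerr's talk or from any particular CI tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Dict, Iterable


@dataclass(frozen=True)
class BuildSpan:
    """One traced CI step (hypothetical shape, for illustration only)."""
    build_id: str
    name: str
    start: datetime
    end: datetime

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def slowest_step_per_build(spans: Iterable[BuildSpan]) -> Dict[str, BuildSpan]:
    """Return, for each build, the step that took the longest."""
    slowest: Dict[str, BuildSpan] = {}
    for span in spans:
        current = slowest.get(span.build_id)
        if current is None or span.duration > current.duration:
            slowest[span.build_id] = span
    return slowest


def deploys_per_day(spans: Iterable[BuildSpan], deploy_step: str = "deploy") -> Dict[date, int]:
    """Count finished deploy steps per calendar day (deploy frequency)."""
    counts: Dict[date, int] = defaultdict(int)
    for span in spans:
        if span.name == deploy_step:
            counts[span.end.date()] += 1
    return dict(counts)
```

Grouping by step name rather than by build would show which step is slow across many builds, which is closer to the "what's taking so long" question in the quote.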
InfoQ interviewed Jessica Kerr about how observability can be applied to increase the speed and flow of delivery.
InfoQ: How can building in observability help to see performance and cost impacts?
Jessica Kerr: When Honeycomb added the ability to store our customers’ event data for up to 60 days, instead of only what fits in local storage, lots of consequences happened. Queries over a wide range of data took minutes instead of seconds — even tens of minutes. Querying our traces, we could see exactly how much. Looking at a trace, we could see why: hundreds of fetches to S3 bogged down our database servers.
To fix this, we moved those fetches to AWS Lambda functions (I gave a talk at Strange Loop 2020 on how we used serverless to speed up our servers). This lets us scale our compute power with the scope of the query, live on demand. It also scales our AWS costs rather unpredictably. To help with this, we built observability into our Lambda functions, so we can see exactly which queries (and whose queries) are costing us a lot. We got in touch with some customers to help them use Honeycomb more efficiently.
And then, when AWS released Graviton2 for Lambda (a different processor architecture, cheaper and supposedly faster), we tried it out. We easily measured the difference. At first it was less predictable and slower, so we scaled back our use of it until we made our function more compatible.
Serverless is particularly inscrutable without observability. With it, we can measure the cost of each operation, such as this database query.
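As a rough illustration of attributing serverless cost to individual queries and customers, the sketch below assumes each function invocation emits an event with a customer, a query ID, a duration, and a configured memory size. The `Invocation` fields and the price constant are assumptions made for this example; they do not describe Honeycomb's actual instrumentation, and the rate should be replaced with the current one for your region and architecture.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple

# Assumed illustrative rate in USD per GB-second, not an authoritative price.
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667


class Invocation(NamedTuple):
    """One function invocation event (hypothetical fields)."""
    customer: str
    query_id: str
    duration_ms: float
    memory_mb: int


def invocation_cost(inv: Invocation, price: float = ASSUMED_PRICE_PER_GB_SECOND) -> float:
    """Estimate compute cost as GB-seconds multiplied by a per-GB-second rate."""
    gb_seconds = (inv.memory_mb / 1024) * (inv.duration_ms / 1000)
    return gb_seconds * price


def cost_by(invocations: Iterable[Invocation], key: str,
            price: float = ASSUMED_PRICE_PER_GB_SECOND) -> Dict[str, float]:
    """Sum estimated cost grouped by a field such as 'customer' or 'query_id'."""
    totals: Dict[str, float] = defaultdict(float)
    for inv in invocations:
        totals[getattr(inv, key)] += invocation_cost(inv, price)
    return dict(totals)
```

Calling `cost_by(events, "query_id")` answers "which queries", and `cost_by(events, "customer")` answers "whose queries". In a real setup these fields would more likely be attached as attributes on the trace spans and aggregated by the observability tool itself.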
InfoQ: How can developers benefit from adding observability to their software?
Kerr: Let me give an example. In one of my personal toy applications, I started with traces instead of logs. As soon as I looked at one, I found a concurrency error: the waterfall view of the distributed trace clearly showed an overlap where I knew there shouldn’t be one. That would have been really hard to find any other way.
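The overlap Kerr describes seeing in the waterfall view can also be checked programmatically. The following is a minimal sketch, assuming spans that are supposed to run one after another and that carry start and end times in milliseconds. The `Span` type and the span names are hypothetical; this is not the code from Kerr's application.

```python
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    """A simplified trace span (hypothetical shape)."""
    name: str
    start_ms: float
    end_ms: float


def find_overlaps(spans: List[Span]) -> List[Tuple[str, str]]:
    """Return pairs of span names whose time ranges overlap.

    Spans that only touch (one ends exactly when the next starts)
    are treated as sequential, not overlapping.
    """
    ordered = sorted(spans, key=lambda s: s.start_ms)
    overlaps: List[Tuple[str, str]] = []
    for i, earlier in enumerate(ordered):
        for later in ordered[i + 1:]:
            if later.start_ms >= earlier.end_ms:
                # Sorted by start time, so no later span can overlap `earlier`.
                break
            overlaps.append((earlier.name, later.name))
    return overlaps
```

For example, with spans `write-a` (0 to 50 ms), `write-b` (40 to 90 ms), and `write-c` (90 to 120 ms), the function reports only the pair `("write-a", "write-b")`, which is the kind of unexpected concurrency a waterfall view makes visible at a glance.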