Pam Selle on Serverless Observability

On today’s podcast, Pam Selle (an engineer at IOpipe who builds tooling for serverless observability) talks about the case for serverless and the challenges of developing observability solutions. Topics discussed include tips for creating boundaries between serverless and non-serverless resources, and how to think about distributed tracing in serverless environments.

Key Takeaways

  • Coca-Cola saw a productivity gain of 29 percentage points after adopting serverless, as measured by the share of developer time spent on business functionality (up from 39% to 68%).
  • Tooling for serverless is often a challenge because resources are ephemeral. To address the ephemeral nature of serverless, you need to think about what information you will need to log ahead of time.
  • Monitoring should focus on events important to the business.
  • Build barriers between serverless and flat-scaling non-serverless resources to prevent issues. Queues are one way to protect flat-scaling resources.
  • In-memory caches are a handy way to help serverless functions scale when fronting databases.
  • There are limitations with tracing and profiling on serverless. Several external products are available to help.
  • Serverless (and microservices) are not for every solution. If you are choosing between two things, and one of them lets you ship and the other does not, choose the thing that lets you ship.

Why did your lightning talk “go the pluck home” resonate so well with people?

  • 1:50 Someone told me recently that they found it on a self-care tech website.
  • 2:00 The talk was in 2009 and has apparently aged quite well - people still ping me about it.
  • 2:15 I wrote and performed it at a time when I realised that other people in the industry weren’t expressing the same opinions I had about how to deal with not working all the time.
  • 2:40 It’s a very entertaining 5 minute talk, and with quite a catchy title.
  • 3:20 There’s a pattern that teams reflect their leadership, for better or for worse.
  • 3:30 That goes from work-life balance to any personality tics - it’s been interesting moving into a leadership position over the last few years.

What do you do at IOpipe?

  • 3:50 We have been around for a few years, and we focus on developers in serverless compute and helping them with observability, monitoring and instrumenting their functions.
  • 4:15 Serverless computing is a new way of looking at cloud computing, that we think is revolutionary.
  • 4:30 At IOpipe, we focus on helping developers overcome challenges and be even more productive.

What’s your favourite serverless revolution story?

  • 5:00 I love Coca-Cola’s example - they were at re:Invent, and they talked about the productivity gains of their development team.
  • 5:20 When you see the numbers on how much more developers are able to get done when they change to this new model - I was really impressed by the gains they had.
  • 5:40 They went from 39% of their time working on business functionality to 68% - a huge gain in productivity.
  • 6:00 They went from 24% unplanned work - the machines are down - to 6% unplanned work.
  • 6:10 The ultimate limiting resource in the world is our time.
  • 6:20 The fact that a new cloud compute model can have such impact for developer productivity is my favourite serverless success story.

How is observability different for serverless functions rather than traditional cloud compute?

  • 7:05 When you think about observability, the promise of serverless is supposedly that you don’t have to worry about the servers.
  • 7:15 As you move to serverless, you tend to care less about the machines that run your compute - they are ephemeral.
  • 7:30 When you do want to know things, the usual tooling that you would reach for isn’t available.
  • 7:45 You have to think about how you’re going to get information from a system that’s no longer there.
  • 7:50 The answer is that you have to think about it ahead of time - what you need to observe - and collect information about the system at runtime.
  • 8:25 You have to do it a bit differently than you would on a running system, where you’d have an agent that sends the data, and you could poke at the machine later.
  • 8:40 Those standard methods for debugging the machine’s state afterwards are no longer present - the serverless function only exists while it is running.
  • 8:50 The container (which is running your function) can change between requests, and can be cold started to a new image.
  • 9:10 We’re investigating how noisy neighbours have an impact on serverless applications.
  • 9:20 You don’t know who is running next to you, or how isolated you are from the neighbour.

What are people using to monitor their apps in serverless?

What types of events should be emitted by a serverless system?

  • 12:25 I want people to care more about business events in the function, rather than the platform.
  • 12:35 You can pay attention to cold starts - or any other similar technical event.
  • 12:45 The first time your function is loaded into a fresh container - the cold start - needs to be fast enough for the percentage of requests that hit it.
  • 13:00 After that, you should focus on events important to the business.
  • 13:10 It depends on how you organise your functions - some people have a super-function that is at the end of many API calls - I have seen that Nordstrom has an example of that.
  • 13:30 That’s how you expose more value to the business stakeholders - they don’t really care whether the container died, but they care that something happened when someone was trying to complete a sale.
  • 13:50 Try to think more about the business event, and less about the container event.
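One lightweight way to surface business events from a function, as discussed above, is to log them as structured records rather than relying on platform metrics. This is a minimal sketch, not IOpipe's product; the event names and fields are hypothetical, and it relies on the fact that anything a Lambda function prints to stdout ends up in CloudWatch Logs, where metric filters and alerts can pick it up:

```python
import json
import time

def emit_business_event(name, **attrs):
    # Emit the event as one JSON log line; in AWS Lambda, stdout lands in
    # CloudWatch Logs, where it can drive metric filters and dashboards.
    record = {"event": name, "timestamp": time.time(), **attrs}
    print(json.dumps(record))
    return record

def handler(event, context=None):
    # ... process the sale here ...
    # Stakeholders care that a sale completed, not that a container died.
    return emit_business_event("sale_completed",
                               order_id=event["order_id"],
                               amount=event["amount"])
```

Because the record is structured JSON rather than free text, queries like "how many sales failed in the last hour" become log queries instead of code changes.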

What about the serverless function limits?

  • 14:20 The memory and time limitations are valuable - if you run into them, the natural question is why you are using serverless.
  • 14:35 Serverless architectures are best for event-driven architectures - they should not be long-running.
  • 14:40 Five minutes is a long time for computers - if you’re trying to take longer than that you should think about why.

Are there different things you might monitor in an AMI or container world versus serverless?

  • 15:25 Having serverless things interact directly with non-serverless things is an anti-pattern in serverless.
  • 15:40 The best thing you can do is have a barrier between the serverless and non-serverless things of some kind.
  • 15:50 My favourite example is of a database; it isn’t going to scale like the serverless functions.
  • 16:00 Leaving the database un-guarded is a very bad idea.
  • 16:10 It applies to any situation where you have limited, flat-scaling resources being used by an environment that scales elastically.
  • 16:20 We’ve used queues in the past to separate the two.
  • 16:30 It makes a lot of sense - queues are part of any event-driven architecture system.
  • 16:40 Events need to be organised in some way, so that they don’t overwhelm your system.
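The queue-as-barrier idea above can be sketched in a few lines. In production the handler would send to a managed queue such as SQS and a separate consumer would poll it; here an in-process `queue.Queue` stands in so the shape of the pattern is visible, with the function names (`handler`, `drain`) being illustrative only:

```python
import queue

# Stand-in for a managed queue such as SQS; in production the handler
# would call sqs.send_message and a separate consumer would poll.
work_queue = queue.Queue()

def handler(event, context=None):
    # The elastically scaling serverless side only enqueues work -
    # it never touches the flat-scaling database directly.
    work_queue.put(event)
    return {"queued": True}

def drain(write_batch, batch_size=10):
    # The consumer side pulls a bounded batch per cycle, so the database
    # only ever sees load it can absorb, no matter how many functions fire.
    batch = []
    while len(batch) < batch_size and not work_queue.empty():
        batch.append(work_queue.get())
    if batch:
        write_batch(batch)  # e.g. one bulk INSERT instead of N writes
    return batch
```

The key design choice is that the write rate is set by the consumer, not by however many function invocations a traffic spike produces.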

Are there any other tips?

  • 17:00 Connections are something that you have to tune.
  • 17:10 You don’t want to be opening connections all the time, and don’t want to leave connections open afterwards.
  • 17:35 We’re thinking how to optimise that more, and it’s an evolving field.
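A common way to tune connections in serverless, consistent with the advice above, is to create the client once at module scope so warm invocations of the same container reuse it instead of reconnecting. This is a minimal sketch with a counter added purely to make the reuse visible; a real handler would create, for example, a database client here:

```python
_connection = None
_connect_count = 0  # instrumentation for this sketch only

def get_connection():
    # Create the client once per container; warm invocations reuse it.
    global _connection, _connect_count
    if _connection is None:
        _connect_count += 1
        _connection = object()  # stands in for e.g. a database client
    return _connection

def handler(event, context=None):
    conn = get_connection()
    # ... use conn; do NOT close it here - the container may be kept
    # warm and serve more requests with the same connection ...
    return {"connections_opened": _connect_count}
```

The trade-off is the one the podcast raises: you avoid opening a connection per request, but you also must not leak connections when containers are recycled, which is why connection limits on the database side still matter.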

What do you look for in code reviews?

  • 18:05 As well as connections, we think about cache.
  • 18:10 We do leverage some in-memory cache for our serverless functions.
  • 18:20 Some things that I would be loath to do in a non-serverless environment, I would do without concern in a serverless environment.
  • 18:30 If you get some data, and you want it to be for the lifecycle of that serverless container, in-memory cache works very well.
  • 18:45 You have to think about it - if you use in-memory cache, how do you notice if a memory leak occurs?
  • 19:00 You tend to see memory leaks expose themselves through containers restarting fresh, which clears the leak for that particular container.
  • 19:10 That’s where profiling tools can be used - ones that you can turn on for a particular function to capture a full trace.

When does in-memory caching work well?

  • 19:50 If you have something that you can get from the database, you can build up a key-value store from the database, as long as it’s a small set of data.
  • 20:05 When you look something up, you can see if it’s in the cache first of all, and if it’s not, then query the database.
  • 20:10 This allows us to optimise the database lookups when we have a lot of hits in some of our serverless functions.
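The cache-first lookup described above maps to a very small pattern: a module-level dictionary that lives for the lifetime of the warm container. This is a sketch rather than IOpipe's implementation, and `fetch_from_db` is a placeholder for whatever database call applies:

```python
_cache = {}  # module-level: survives across warm invocations of the container

def lookup(key, fetch_from_db):
    # Check the in-memory cache first; on a miss, hit the database
    # and remember the result for the rest of the container's life.
    if key in _cache:
        return _cache[key]
    value = fetch_from_db(key)
    _cache[key] = value
    return value
```

As the podcast notes, this only suits a small, stable set of data - the cache is bounded by the function's memory limit, and each fresh container starts cold and empty.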

How do you profile a serverless function?

  • 20:20 Emitting a bunch of metrics is one option, but it’s really nice to use the tools we already have, such as the runtime’s own profiler.
  • 20:35 That’s something our IOpipe libraries have, and we want to be able to use the native profiler in Node.js, Python or Java.
  • 20:55 You need to be able to take the tools that you have outside of serverless, and be able to use those in a serverless environment.
  • 21:00 When you need profiling data, you always want it to be available.

What can you do with profiling in a serverless environment?

  • 21:20 There are limitations and they are very real.
  • 21:25 For example, you can mark the start of an invocation and the end, to profile that particular function.
  • 21:40 However, you can’t profile before the start of an invocation or after the end of one.
  • 21:55 I haven’t run into cases where that is needed, but it will happen one day.
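The "mark the start of an invocation and the end" idea above can be sketched with Python's standard-library profiler. This is not IOpipe's library, just a minimal decorator showing the shape: profiling is enabled at the invocation boundary and disabled at the end, the only window a serverless function gives you:

```python
import cProfile
import io
import pstats

def profiled(handler):
    # Wrap a handler so profiling starts when the invocation starts
    # and stops when it ends - nothing outside that window is visible.
    def wrapper(event, context=None):
        profiler = cProfile.Profile()
        profiler.enable()
        try:
            return handler(event, context)
        finally:
            profiler.disable()
            out = io.StringIO()
            pstats.Stats(profiler, stream=out) \
                  .sort_stats("cumulative").print_stats(5)
            print(out.getvalue())  # in Lambda this would land in CloudWatch Logs
    return wrapper

@profiled
def handler(event, context=None):
    # Hypothetical workload standing in for real business logic.
    return sum(range(event.get("n", 1000)))
```

Because the wrapper is a decorator, it can be turned on for one particular function without touching the others, matching the "turn it on for a particular function" workflow mentioned earlier.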

What about the system as a whole with tracing a request all the way through?

  • 22:25 Distributed tracing gets thrown around a lot, and I think it’s not really well defined.
  • 22:35 When I think of distributed tracing, I think of the full story of what happened across all my services for a particular, defined period of time.
  • 22:50 So as an example, consider an e-commerce transaction where someone is buying something.
  • 23:00 In a serverless environment, that kind of tracing is pretty readily available.
  • 23:10 If you’re on AWS Lambda, you can use X-Ray to see the full architecture.
  • 23:25 You can see how all your services are talking to each other, when they hit this Lambda, and what that talks to.
  • 23:40 When you talk about analysing just your lambda and just your serverless functions, you aren’t going to get the full story.
  • 23:50 It can still give you an interesting story - creating tracing for your serverless functions and sending data along with those requests to other services.
  • 24:15 So at IOpipe we do offer serverless function tracing - we can measure request times, and turn on tracing automatically for network requests.
  • 25:00 I love that distributed tracing is getting so much attention, but I think it still is underserved.
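The "sending data along with those requests" step above boils down to propagating an identifier. This sketch uses a hypothetical `x-trace-id` header (real systems use formats such as the W3C `traceparent` header or X-Ray's trace header) to show the mechanic: reuse the caller's trace ID if one arrived, otherwise start a new trace, and attach the same ID to every downstream call:

```python
import uuid

TRACE_HEADER = "x-trace-id"  # hypothetical header name for this sketch

def get_trace_id(event):
    # Reuse the caller's trace ID if one arrived; otherwise this function
    # is the start of the story, so mint a new one.
    headers = event.get("headers") or {}
    return headers.get(TRACE_HEADER) or str(uuid.uuid4())

def handler(event, context=None):
    trace_id = get_trace_id(event)
    # Attach the same ID to every downstream request so a collector can
    # stitch the spans from each service back into one request story.
    downstream_headers = {TRACE_HEADER: trace_id}
    return {"trace_id": trace_id, "downstream_headers": downstream_headers}
```

This is what lets a trace cross the serverless boundary: each function only sees its own invocation, but a shared ID lets an external collector reassemble the full path.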

What are some of the things that bite you?

  • 25:30 So, should you be in a serverless environment?
  • 25:35 If you’re in a system that doesn’t have an event-driven architecture, then you probably shouldn’t.
  • 25:45 Building an event-driven architecture or a microservices-driven architecture is not always the best answer.
  • 25:55 If you can avoid building a distributed system, then do so.
  • 26:10 That said, there are plenty of times where they do make sense - unknown load, events coming into a system - and when the gains work out, use serverless.
  • 26:50 Make sure that when your serverless system interacts with flat-scaling systems, such as databases or any kind of limited resource, those systems can handle the load.
  • 27:20 If the limitations of the platform get in your way, then don’t use it.
  • 27:45 From a friend: “If you are choosing between two things, and one of them lets you ship and the other does not, choose the thing that lets you ship”
  • 27:55 If it doesn’t make sense for you then don’t do it.

More about our podcasts

You can keep up-to-date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and YouTube. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.
