In this podcast, Liran Haimovitch, CTO at Rookout, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: the concept of “understandability” and how this relates to building modern software systems, how complexity impacts a system’s understandability, and the benefits of live debugging tooling.
Key Takeaways
- Understandability is focused on presenting a system in a way that an engineer can comprehend it easily. A system is understandable if it is complete, concise, clear, and organized.
- There is a static and a dynamic component to understandability. Design time understandability is focused on architecture diagrams and code, and runtime understandability is concentrated on data, data flows, and side effects.
- All architecture styles provide their own understandability challenges. System complexity, software age, size of the team, employee turnover, and overall code quality and architectural quality are strongly correlated with understandability.
- Understandability and observability are linked. Observability is often more ops-driven, and focuses on understanding the state of the system from the outside and being able to ask questions about how the state occurred.
- Live debugging tools, such as Rookout, are focused on enabling developers to dynamically build an understanding of what is happening as a system runs in a production-like setting.
Transcript
00:01 Daniel Bryant: Just before introducing my guest, I wanted to mention that InfoQ have launched a new virtual event called InfoQ Live. These will be half day events that run once a month, and there'll be deep dives into one or two related technical topics. InfoQ Live is focused on software practitioners and tech leaders, and will offer real time Q&A with speakers and peer sharing with your fellow attendees. Head over to live.infoq.com to find out more.
00:26 Daniel Bryant: Hello and welcome to the InfoQ podcast. I'm Daniel Bryant, news manager at InfoQ, and product architect at Datawire. And I recently had the pleasure of sitting down with Liran Haimovitch, CTO at Rookout. Liran has recently been exploring the topic of understandability in relation to software development and delivery. He provided a great overview of the fundamental ideas behind this concept in a recent InfoQ article that I'll link in the show notes.
00:47 Daniel Bryant: And one of my main goals of the chat today was to dive a little deeper into understandability and other related properties of software systems, such as completeness, conciseness, clarity, and organization. I also wanted to understand how the concept of understandability relates to observability, and to explore how the dynamic versus static nature of software impacts this. I know from my past experiences working as a software architect that the well-drawn architectural diagrams an organization has don't always help us understand how the application actually runs in production.
01:16 Daniel Bryant: Hello, Liran, and welcome to the InfoQ podcast. Thanks for joining me today.
01:19 Liran Haimovitch: Hey Daniel, it's great being here.
01:21 Introductions
01:21 Daniel Bryant: Could you briefly introduce yourself, please?
01:24 Liran Haimovitch: So my name is Liran Haimovitch. I'm a co-founder and CTO at Rookout. And before founding Rookout, I actually spent most of my career doing cybersecurity.
01:32 What made you write your recent InfoQ article, “Understandability: The Most Important Metric You're Not Tracking”?
01:32 Daniel Bryant: So today I wanted to focus on understandability. You very kindly wrote a great piece on InfoQ for us recently, "Understandability: The Most Important Metric You're Not Tracking". Great title, right? Always a bit of clickbait there, getting people thinking. Well, we can dive into understandability a bit more in a moment, but could you share the core premise of the article? What made you write it?
01:53 Liran Haimovitch: So what made me write it is our experiences with our customers. We've built this live debugger, and I can go on for hours about it, but working with customers, they've kind of tried framing it into their existing processes, their existing needs, and how it relates to their existing tooling, especially around observability. How does debugging, which is often equated with observability, come together with tools such as tracing and so on? And working with them, we found that understandability, the ability to comprehend their code, the ability to adapt it, to fix bugs, to build new features, is what best describes the needs we can address.
02:34 At the fundamentals, what is understandability?
02:34 Daniel Bryant: Very nice. At the fundamentals, what is understandability?
02:38 Liran Haimovitch: So I kind of stole the definition from finance, where there's a term, financial understandability. Paraphrasing the original definition, it's about presenting a system in a way that an engineer can comprehend it easily.
02:51 Could you break down some of the related concepts, such as completeness?
02:51 Daniel Bryant: Nice. And you mentioned in the article that a system is understandable if it's complete, concise, clear, and organized. Could you break down some of those concepts for us a bit more please?
03:02 Liran Haimovitch: So starting with complete, which is kind of obvious. If you want somebody to understand something, then provide them with all the information they need. And you would be surprised: the first thing you think of when you hear of a software application is source code, especially if you're a software engineer. But at the end of the day, when a software application is running in the real world, there is so much beyond just the source code. There is configuration, there is state, there is the runtime itself. There are inputs and outputs, and there are various service dependencies. And you have to keep track of all of those. When you speak of all the criteria, but especially completeness, you must provide your engineers with visibility and information about all of those components. I mean, it doesn't help if you have the source code, but you don't know how the system is configured, or what records exist within the database.
03:48 And what about conciseness, clarity, and organization?
03:48 Daniel Bryant: What about conciseness?
03:48 Liran Haimovitch: So conciseness is kind of the opposite in a way, because we have so much information. I mean, you can't just brain dump everything on someone and tell them, "That's it." All of us have worked on software applications with hundreds of thousands or millions of lines of code, and databases with billions of records. And you need some way to drill down. Instead of saying, "Here, those are the billion database records we have. Go ahead and read all of them," you can provide some insights into those. We have 100 million records of people. Each person has a first name, a last name, and a social security number. And those are the common values for those. And so we're trying to keep the data concise, and we're trying to keep it clear, which is the third criterion. You want the information to be easily understandable, in bite-sized pieces.
04:34 Liran Haimovitch: And then last but not least, you want the information organized, because there is only so much you can feed someone in the first bite, and you want to be able to move from the first bite to the second or third, based on what you're interested in, without losing yourself midway. So you have multiple components in the system, you have multiple tables in the databases, you have multiple APIs, each with its own input, and you want an engineer to be able to easily find the piece of data they're looking for within that complete data set.
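To make the conciseness point concrete, here is a minimal Java sketch of summarizing a data set rather than dumping it; the `Person` record and its fields are hypothetical, mirroring Liran's first name/last name/SSN example:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical record type, standing in for a 100-million-row table.
record Person(String firstName, String lastName, String ssn) {}

public class DatasetSummary {
    public static void main(String[] args) {
        List<Person> people = List.of(
                new Person("Ada", "Lovelace", "000-00-0001"),
                new Person("Alan", "Turing", "000-00-0002"),
                new Person("Ada", "Byron", "000-00-0003"));

        // Concise: report the shape of the data, not the data itself.
        System.out.println("Total records: " + people.size());

        // Common values per field: an insight instead of a raw dump.
        Map<String, Long> firstNameCounts = people.stream()
                .collect(Collectors.groupingBy(Person::firstName, Collectors.counting()));
        System.out.println("First name frequencies: " + firstNameCounts);
    }
}
```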
05:03 Is there a static and dynamic nature in relation to a system’s understandability?
05:03 Daniel Bryant: Very interesting, Liran. Very interesting. So, as I heard you talking there, I'm guessing there's some kind of dynamic nature to understandability, as well as a static nature. I can look at architecture diagrams. I can look at ERD diagrams. But also, when the system's running, that's a bit different. Data is flowing in a different way. Do you look at it like that? Is there a static and a dynamic nature?
05:24 Liran Haimovitch: Very much. I mean, the static nature of the system is basically how it was designed to be operated: what's the source code, what's the architecture diagram, what are the tables within the databases? But a lot of that data is very dynamic in nature. I mean, you might have implemented a system that supports three encryption algorithms, but only one is being used. And it's very important to know which one. Take TLS, for instance. There are dozens, if not hundreds, of ciphers that can be used. Some of them are secure. Many of them aren't, at least today.
05:54 Liran Haimovitch: And so it's very important, not just to know that I support 200 ciphers, but which exact one is going to be used at runtime. If I have a table, and that table has records with, as I mentioned, a first name, a last name, and a social security number, does that table contain a hundred records, or a hundred million records? That's going to be very different, from a system perspective. And some of that information can only be gathered at runtime.
06:19 Liran Haimovitch: In fact, one might argue that the real picture is the runtime, and the static, compile-time view is kind of a simulation, an expectation. It doesn't always reflect what's happening in the real world.
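As a concrete illustration of this static-versus-dynamic gap, here is a minimal Java sketch (the host name is a placeholder) contrasting the ciphers a runtime *could* negotiate with the one actually negotiated for a live connection:

```java
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

public class CipherCheck {
    public static void main(String[] args) throws Exception {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket("example.com", 443)) {
            // Static view: everything this runtime is configured to offer.
            System.out.println("Enabled ciphers: " + socket.getEnabledCipherSuites().length);

            // Dynamic view: the single cipher actually in use for this session.
            socket.startHandshake();
            System.out.println("Negotiated cipher: " + socket.getSession().getCipherSuite());
        }
    }
}
```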
06:32 Is understandability different for greenfield and brownfield projects?
06:32 Daniel Bryant: And I'm guessing greenfield versus brownfield projects are very different, right? When I'm starting out something, like you said, the design of my system is very clear, but over time we all know, right? Entropy creeps in, new requirements pop up. So I'm guessing understandability, from a static or dynamic mindset, is very different for greenfield and brownfield projects?
06:51 Liran Haimovitch: So the thing about brownfield versus greenfield is that one of the most significant causes of understandability issues is complexity. If you have a very complex system, then understandability tends to be hard, while if you have a very simple system, it's much easier to achieve the same degree of understandability, with less knowledge and less documentation.
07:13 Liran Haimovitch: Now, if you're in a greenfield project, there are many things you can do to reduce complexity. Use high-level languages, such as JavaScript running on Node. Use web frameworks, use cloud services, use high-level building blocks, and you can solve complex problems with simple solutions, because you're handing off much of the heavy lifting to somebody else. But if you're working with an existing project, where a lot of requirements accumulated over the years, where a lot of things got patched up, where different technologies were deployed and then had to be patched over, then you get a monster of complexity.
07:48 Liran Haimovitch: And just saying, "I'm going to make it simpler," doesn't work all that often. And saying, "I'm going to migrate it to the cloud or microservices. I'm going to rebuild it with ECMAScript 2020," doesn't work all that well either. I mean, those are nice dreams, but at the end of the day, most of us don't have the time and resources to rewrite projects from scratch. And so much more has to be done about understanding the application as it is, rather than wishing it was a nice greenfield technology stack.
08:18 How do we get an understandability baseline for a brownfield system?
08:18 Daniel Bryant: Yes, that makes sense. Makes sense. So, you mentioned the title of your article is about the most important metric you're not tracking. How do we get a baseline for a brownfield system? Imagine I, as an engineer, inherited a brownfield system, and I want to make sure the understandability does not get worse. How do I establish that baseline and make sure that happens?
08:38 Liran Haimovitch: So the way I like to think of it is: take one of those first-semester computer science problems, or a "hello world" kind of front-end task. If I were to ask you to sort an array, or make a button larger, or add a validation somewhere, those are the kinds of tasks that, if I were to give you a pen and paper, or just an empty scratchpad, you could probably solve in anywhere between five and 15 minutes. How long would it take you to do that in a complex software application? How long would it take you to find the right place to make the change, to understand its impact across the system? It's probably not going to be 15 minutes, because unless you wrote the application, and you wrote it yesterday, you aren't going to be fluent enough.
09:19 Liran Haimovitch: There's going to be some overhead, but the question is, is it going to be, I don't know, a couple of hours, or are you going to be spending two days running around trying to figure out where you have to make the change? And so that can provide you with a baseline for the understandability of your software, and how well you, and potentially your team, understand the system. Overall, you obviously aim to make it better, so that there's going to be less overhead for each of those engineering tasks, rather than more.
09:47 Do you use metrics like cyclomatic complexity for understandability?
09:47 Daniel Bryant: Do you use any measures like that? I know back in the day, when we obviously had a lot of Java coding, we used cyclomatic complexity for measuring the complexity that was popping up in our systems. Do you recommend any kind of metrics like that for software?
10:01 Liran Haimovitch: That's actually a very interesting question, because when that article went live, it got posted on Reddit, and that thread on Reddit got hundreds of responses. It was crazy. And people started arguing around it, because people love arguing on the internet. And one of the topics they picked up was readability versus understandability, going into whether source code is readable or not, versus whether or not it's understandable. And it actually brought up a lot of interesting points.
10:29 Liran Haimovitch: And one example I like is, let's say, going back to the cipher example, or another choice example: you might have a huge switch statement that has dozens of options. And the choice is based on configuration, state, a database, whatever. And observing the system as a whole, you might know that the value is a constant, that it's always option number five, regardless of how many options there are. And even if the code is a little bit messy, and it's hard to read, if you know the value is number five, you're going to figure out which one of those is the relevant one. Even though the code might not be readable, it might only take you 10 to 15 minutes.
11:05 Liran Haimovitch: On the other hand, even if the code is very readable, very clean, very nice, but you don't know what the configuration value is, then you're still stuck looking at dozens of options and you can't understand them. And those options might be very similar, as in the example of ciphers, or they might be very different. And regardless of how clean and readable the code is, you would have a hard time understanding the system as a whole.
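A minimal Java sketch of the kind of code being described here; the cipher names and helper methods are made up for illustration. The switch is equally readable either way, but only runtime knowledge, that the configured value is always the same, makes it understandable:

```java
public class CipherDispatch {

    // Readable code, but which branch runs depends on a value that only
    // exists at runtime (configuration, state, a database row...).
    static byte[] encrypt(String configuredCipher, byte[] plaintext) {
        switch (configuredCipher) {
            case "AES_128_GCM": return aesGcm(plaintext, 128);
            case "AES_256_GCM": return aesGcm(plaintext, 256);
            case "CHACHA20":    return chaCha20(plaintext);
            // ...dozens more cases in a real TLS stack...
            default: throw new IllegalArgumentException(configuredCipher);
        }
    }

    // Stubs standing in for real cipher implementations.
    static byte[] aesGcm(byte[] p, int keyBits) { return p; }
    static byte[] chaCha20(byte[] p) { return p; }

    public static void main(String[] args) {
        // If you know the deployed configuration always says AES_256_GCM,
        // only one branch matters: understandable even if hard to read.
        System.out.println(encrypt("AES_256_GCM", new byte[] {1, 2, 3}).length);
    }
}
```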
11:29 How much does architectural style affect understandability: monolith, microservices, functions etc?
11:29 Daniel Bryant: Do you think different architectures give different results? So, for example, monolith versus microservices versus functions, and obviously there's some level of bounded context and encapsulation there. Are you seeing from customers that perhaps folks are struggling more to understand a monolith than functions as a service, or not?
11:48 Liran Haimovitch: I would say all architectures provide their own challenges. Some of the bigger drivers of poor understandability we're seeing are software age, size of the team, employee turnover, and overall code quality and architectural quality. I would say that as much as we like to hate the monolith these days, we have a pretty good grasp of how to build a good monolith, from an architecture perspective, because we had dozens of years to get it wrong. So now we sometimes get it right.
12:15 Liran Haimovitch: When it comes to microservices, you can see that over the last decade or so there has been so much movement, going smaller and smaller and smaller, and now the industry is actually moving back into larger components. There is more talk of moving back to larger services: Istio, which was broken down into, I don't know, five microservices, is now being combined back into a single one.
12:34 Liran Haimovitch: We are still learning architecture for microservices, and I find the term "function as a service" to be misleading. Just because you can run a single function doesn't mean you should. It's awesome that every function is an entry point. But in a way, it's not that different from an API, which is also a function as an endpoint. Just because an application is 1,000 functions doesn't mean that it translates very well into 1,000 serverless functions running on Amazon. Maybe you should be running it as a single one. Maybe you should have 10, or a hundred. And I think we're still struggling to find the answer to that. What's the right boundary for a function? What's the right boundary for a microservice?
13:14 How do you think testing relates to understandability?
13:14 Daniel Bryant: So how do you think testing relates to understandability? And I think testing of different architectures is very different. Testing a monolith is one thing; testing microservices, or testing functions, as you've already alluded to, is another. There are a lot of things to test, and they often have to be orchestrated. So how do you think testing relates to understandability?
13:33 Liran Haimovitch: I think testing has a huge impact on understandability, and the other way around. On the one hand, the better you understand the system, the better you can test it. You know which use cases matter and which use cases don't. By knowing which parts of the system are complex, you know those probably deserve extra testing, while if you know that other parts of the application are very simple, then they probably don't require as much testing.
13:59 Liran Haimovitch: And so I think much of the value of a good engineer is knowing where to test and what to test, because they understand the system, they understand the business domain, and how they come together. At the same time, testing is also a great opportunity to learn the system, whether it's exploratory testing, or just having a high-quality lab where you can test the system. And that's actually becoming even harder, because as data becomes more regulated, and as tech stacks become more sprawling and more complex, it becomes even harder to set up high-quality labs for testing. But if you do manage to get one, then that's a great place to learn your own software, see how it acts, hopefully where you have fewer restrictions on performance, availability, and security, so you can dig in deeper than you could in a real production environment, and let your engineers toy around with the software and understand it, truly understand it.
14:56 Daniel Bryant: No, I like that. And definitely as an engineer myself, being able to have something I can poke and prod without causing production issues is super important.
15:04 Liran Haimovitch: Definitely.
15:05 How does understandability relate to observability?
15:05 Daniel Bryant: Brilliant. Moving on to observability, Liran. How does understandability relate to observability?
15:09 Liran Haimovitch: So as I'm sure most of the audience knows, observability comes from physics, not finance, so it's an entirely different thing. Observability is about understanding the state of the system from the outside. And in a way, when you're trying to understand the state of the system, you assume you know the behavior of the system. And usually when you're speaking to ops, to SREs, and sometimes even to sales and marketing, they kind of know the purpose of the system. If you're Facebook, then you are the biggest social network out there, and you're serving messages and photos, and having people like those. And then you're monitoring those. You can monitor it through metrics or tracing or logging: how many people are liking the posts? What's the latency of my APIs? What are the error rates of my APIs? And those questions tend to be the same, day in, day out, because you pretty much understand what you're trying to do.
16:01 Liran Haimovitch: If you're a software engineer and you're discussing understandability, then your tasks tend to change on a daily basis. It's not just about how many people are liking posts; I want to know exactly how the newsfeed algorithm works. I want to know exactly what parameters each of those users is getting, and why this user is getting this post, while that user is getting the other post. Those are much more in-depth questions that are changing much more frequently. And so the nature of the data tends to change much more often. You tend to collect data much deeper from within the business logic of the app, rather than just from the perimeter of the app. And so essentially, you're serving a different audience, SREs versus application engineers, and different types of data, for the most part.
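To picture the "perimeter" monitoring being contrasted here, a small sketch using the Micrometer library (assuming the micrometer-core dependency; the metric names and handler are hypothetical). The stable, day-in-day-out questions, latency and error rate, live at the edge; the changing "why did this user get this post?" questions live inside the handler, below what these metrics can see:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class PerimeterMetrics {
    public static void main(String[] args) {
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer latency = registry.timer("api.newsfeed.latency");

        try {
            latency.record(PerimeterMetrics::handleRequest); // how slow are we?
        } catch (RuntimeException e) {
            registry.counter("api.newsfeed.errors").increment(); // how often failing?
        }

        System.out.println("Requests timed: " + latency.count());
    }

    static void handleRequest() {
        // ...business logic; the in-depth, frequently changing questions
        // live in here, out of reach of perimeter metrics...
    }
}
```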
16:48 What kind of tooling helps with understandability?
16:48 Daniel Bryant: Yes, totally makes sense. This has been a great tour of the concepts of understandability and observability. If we now take a step further in, what kind of tooling helps with understandability?
17:00 Liran Haimovitch: So we've mentioned some of the options: reduce complexity, which is something we, as software engineers, have been trying to do since forever. Build test labs; those are great options for learning about the application. And we've also mentioned observability tools, which do provide some limited view from the perimeter, but that's also useful for understanding the system.
17:19 Liran Haimovitch: And there are a couple of options we haven't mentioned. One is just documentation: very hard to write, hard to keep up-to-date, but that knowledge is useful for understanding things. And we're also seeing a new category of understandability tools. I mean, debuggers have always been about understandability, but their function has always been limited to our laptops and to small-scale systems. The next generation of debuggers, of debugging platforms such as Rookout, allow you to debug remotely, to debug in distributed environments, to debug at scale, even to debug in production. As you mentioned a couple of minutes ago, you said, "I like poking around and experimenting without hurting stuff." So that's the spirit of Rookout. Go ahead, stick your breakpoint in, get the data you need, and don't worry about anything. We've got you covered.
18:07 What does Rookout offer?
18:07 Daniel Bryant: Interesting, interesting. So is it some kind of library? How does it work? You mentioned running it in production, and I'm always super nervous about running things in production, as we should be. How do I run Rookout and understand things when it's in production?
18:19 Liran Haimovitch: Well, you add Rookout as a library, a software library. It's a one-time deployment. And then you can go into our web console and select one or more of your applications or application instances. And then you just visually set a breakpoint, like you would in a traditional debugger, except that breakpoint is in fact live instrumentation. And so we go into the app, in that specific instance you've selected, or multiple instances if you want, and we add to the code so that the next time the line is hit, we're going to get you a snapshot of that line. We're going to get you a new log line and send it to Elasticsearch or Splunk. We're going to get you a new metric and send it to Datadog or AppDynamics, anything you want, with the click of a button.
19:04 How does Rookout work, under the hood?
19:04 Daniel Bryant: Very interesting. How does that work under the hood, Liran? Is it doing some kind of bytecode manipulation in the Java world, for example?
19:07 Liran Haimovitch: Exactly. So we implemented different SDKs for Java, .NET, Python, and Node. And for each of those, we do our own version of bytecode manipulation, or other low-level techniques for interacting with the runtime. It's kind of similar in a way to aspect-oriented programming, except it's much deeper. And so we just update the application on the fly and get you the data you need, ensuring there is no risk to the operation.
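For readers unfamiliar with the mechanism, here is a minimal sketch of the standard JVM hook that bytecode-manipulation tools are typically built on. This is not Rookout's implementation, and the class name is a made-up example; a real tool would rewrite the class bytes (e.g. with a library like ASM) rather than just logging:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Packaged as an agent jar (Premain-Class in the manifest) and loaded
// with -javaagent:agent.jar; the JVM calls premain before main().
public class DebugAgent {
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain domain, byte[] classfileBuffer) {
                // A real tool would rewrite the bytecode here, inserting a
                // data-collection callback at the chosen source line.
                if ("com/example/OrderService".equals(className)) {
                    System.out.println("Would instrument " + className);
                }
                return null; // null means "leave this class unchanged"
            }
        });
    }
}
```

Setting a breakpoint in an already-running process additionally relies on class retransformation (`Instrumentation.retransformClasses`), which is what makes this kind of instrumentation "live".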
19:33 Daniel Bryant: What kind of performance overhead does that have?
19:35 Liran Haimovitch: Negligible. I mean, globally, it's less than 1% in CPU and memory, barely measurable. And for breakpoints, for most runtimes, it's under one millisecond. And even if it does grow, we have built-in rate limiting, so that if a breakpoint gets too hot, we automatically disable it, or move it into sampling mode, so that it won't have any impact on the application.
19:58 Liran Haimovitch: We have a lot of security features built into that, like data redaction. We've made sure to include everything your ops and security folks would want, to ensure that you're not stepping out of bounds, while providing engineers with an easy-to-use tool to get the data they need, on the fly.
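The rate-limiting idea generalizes beyond any one product. Here is an illustrative, deliberately simplified Java sketch (not Rookout's code, and not thread-safe; the budget and sampling ratio are made up) of a breakpoint callback that degrades to sampling when it fires too often:

```java
// Illustrative only: a "breakpoint" that collects snapshots until it
// fires too often within a window, then degrades to 1-in-100 sampling
// so the host application is never slowed down.
public class RateLimitedBreakpoint {
    private static final long MAX_HITS_PER_SECOND = 1_000; // made-up budget

    private long windowStart = System.currentTimeMillis();
    private long hitsThisWindow = 0;

    public void onLineHit(Runnable collectSnapshot) {
        long now = System.currentTimeMillis();
        if (now - windowStart >= 1_000) {   // start a fresh one-second window
            windowStart = now;
            hitsThisWindow = 0;
        }
        hitsThisWindow++;
        if (hitsThisWindow <= MAX_HITS_PER_SECOND) {
            collectSnapshot.run();          // within budget: collect everything
        } else if (hitsThisWindow % 100 == 0) {
            collectSnapshot.run();          // too hot: sample one hit in 100
        }                                   // otherwise: skip, near-zero overhead
    }

    public static void main(String[] args) {
        RateLimitedBreakpoint bp = new RateLimitedBreakpoint();
        for (int i = 0; i < 5; i++) {
            bp.onLineHit(() -> System.out.println("snapshot collected"));
        }
    }
}
```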
20:15 What do you think is going to be the biggest thing in software development over the next five years?
20:15 Daniel Bryant: So wrapping up now, what do you think is going to be the biggest thing in software development over say the next five years? Obviously we've got fantastic tools these days, fantastic programming languages, clouds, containers, Kubernetes, new ways of debugging things. What do you think is going to be the most interesting thing over the next coming years?
20:34 Liran Haimovitch: If you had asked me that six months ago, I would have said that we're going to see software development moving to a much more value-oriented, revenue-oriented operating model, because software is becoming such a big part of business today. I think COVID-19 is accelerating that process. I think more and more businesses are finding a way to be digital first, and that's making software even more valuable. At the same time, we have to answer at a higher level: we can't just say, "We're up." We can't just say, "Latency is good." We have to validate the value we're bringing in. And I'm also thinking that many businesses are going to find a way to stay digital first. Industries that would never have thought to do so just a year ago are now finding those models. And I'm thinking we're going to see a lot of new software, and software-based industries and products, that are enabling more and more digital experiences.
21:32 What do you think the future development experience will look like?
21:32 Daniel Bryant: What do you think the future development experience will look like? Will we all be using Kubernetes, cloud, platform as a service? What do you think?
21:40 Liran Haimovitch: That's a tough one. I think we're seeing that we're still searching, as a group, as a culture, as a profession, for the right way of doing things. And things are rapidly evolving. Kubernetes is doing great. Serverless is doing great. But neither of those was here 10 years ago, and I'm not sure if they will be here 10 years from now. I mean, when I was studying computer science, people were making fun of COBOL, and now I kind of feel that Java Enterprise Edition is where COBOL was 20 years ago. And so things are moving so fast, it's crazy. I have no idea what the future is going to look like. I'm sure it's going to be interesting though.
22:19 Daniel Bryant: Probably a very sensible answer, right? We just can't tell. With the pace you've alluded to several times in the conversation, the pace our industry moves at, it's just so fast, isn't it?
22:27 Liran Haimovitch: Yes.
22:27 Daniel Bryant: Awesome. So thanks for joining us, Liran.
22:29 Liran Haimovitch: Thanks for having me.