Evolutionary Architecture as Product @ CircleCI

Summary

Robert Zuber discusses how the evolution of software development since 2011 has driven the evolution of CircleCI's architecture. From the explosive adoption of Docker to the steady rise of microservice architectures, the changing demands of software engineering teams have proven to be deeply coupled with the structure of their service, far more than they anticipated when they started the business.

Bio

Rob Zuber is a 20-year veteran of software startups; a four-time founder, three-time CTO. Since joining CircleCI, he has seen the company through its Series B, C and D, and delivered on product innovation at scale. He leads a team of 100+ engineers who are distributed around the globe.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Zuber: One quick correction, I wouldn't normally make this correction because it sounded great but it's important for how I start my first story. If you read the keynote this morning, Randy [Shoup] described me as the CTO and co-founder of CircleCI. I was not a co-founder. I was a co-founder of a business that was acquired by CircleCI and my co-founder in that business is now the CEO of CircleCI. There is a thread there, it's just a little bit confusing. Given that my first story opens with when I joined CircleCI, I thought that might be confusing for folks.

When I joined CircleCI, we were about 20 people, mostly engineers, and the way that we ran support was through a week-long rotation. Every week, two different engineers would sit on what was Intercom then. I don't know if there are Intercom users in the house, maybe your orgs are big enough that you don't know if you use Intercom or not. We also had a live, public chat room in HipChat. Hopefully some of you remember HipChat. We would sit all week and basically answer customer questions. This is actually a new chat widget that we have on our site, but we had something like this, and this is how you got to Intercom and you showed up and you said, "I have this question."

If you've ever used one of these on someone else's site, there are real people on the other side of that. It was great to have engineers because you would say, "I'm trying to build this thing, I'm having trouble with my CI," and you'd be talking to me or you'd be talking to some other engineer and we pretty much understood and could help you really well.

Early in my time, I discovered some interesting things about being on support. Luckily, when I was in high school, I worked in retail as probably lots of people did as a first job. In particular, I worked at a member service desk, so I was really used to the changing moods of customers, let's say. Some days, they were very happy because you could help them and they were, "it's so great that you were able to help me out and this is awesome," and some days you couldn't help them and they were sad. Other days you couldn't help them and they had interesting ideas to share with you, maybe some feedback on your product.

io.js

This is a story about one of those times at CircleCI. Does anyone remember io.js? One person? Awesome. This will land perfectly. I started in August of 2014, this is maybe February 2015. I'm doing my support rotation, and someone shows up with that emoji that I just showed you previously, saying, "Why don't you have first-class support for io.js?" "The world is running on this platform for software development and we're trying to use CircleCI to do our builds. Why don't you have support for this framework?" They were angry and, of course, my initial reaction was, "What's io.js?" It wasn't my response to the customer. I turned to a bunch of people internally, which means I used HipChat, and I said, "Does anybody know what io.js is?" and everyone was, "No, we don't know what that is." I was, "Cool, people are angry that we don't support it, so we should probably figure that out really quickly."

Let me give you, for those of you who don't know, which is the whole room, a brief history of io.js. In November of 2014 there was a spat in the Node community. There was some frustration over Joyent's stewardship of Node, and so io.js was forked; and you can tell by the date, this is basically Thanksgiving weekend. Nobody was even paying attention and all of a sudden, half of Node users were using a different thing called io.js. By early February, they're screaming at me in our support channel saying, "I can't believe you don't support this thing," and "You're obviously terrible people who don't care about software development." It's a while ago, but imagine that terminology.

At the end of February, two weeks later, basically, we released support. We diverted a bunch of energy, we went and figured out what this io.js thing was and why we were having problems supporting it in our platform. We implemented support for io.js and we released that support and the docs and we got all that out into the public and we were champions of the io.js community. Then in June, io.js and Node re-merged to form the Node Foundation, and this platform no longer existed.

On this slide, I was going to say, "Meet my customer," but this is you. You are the people who behave like this. We love, in this community, to try new technologies, to do new things, to experiment and that's great. We learn lots of interesting new things, we push the envelope constantly, but this is my customer. From a day to day perspective, the customer that I'm trying to solve for, that I am at the whim of, if you will, is a very whimsical customer. The choices that we make as software developers drive the kinds of solutions that we at CircleCI need to provide. We'll get into the detail a little bit later, but the way that our system is architected defines our ability to support this whimsical behavior.

As I sat down and started to think about this, I thought at first that we were the only ones. I thought that we were unique and different in this way, in our pursuit of our customer and the constant change in how people develop software, but it turns out this really does apply to all of you. Regardless of who your customer is, understanding who your customer is and how they think is really tied to how you think about software delivery, how you think about architecting your systems, the trends in your industry, and how people continue to grow.

I want to talk a little bit today about a few things that we've done. I'll give you some specific cases and examples of where these things worked or failed miserably, really driving to three points. One, it's very important to design for change. Two, you need to keep your head up, and we'll talk about what it means to keep your head up and be looking ahead at where things are going. Then third, you should really be thinking like a product manager.

My Biases

A tiny little bit about me. I'm not big on doing personal interests and these are generally things that people don't care about, but a really smart friend of mine told me recently, "Never take advice from someone if you don't understand their biases." I made a biases meter, I don't really know what to call it, and I randomly selected these three axes upon which to measure my own biases that I thought might inform you a little bit about how I think. First of all, I entered the software development world, the startup world, doing more systems engineering or what's probably called SRE today; we called it systems engineering back then. Then I migrated into software, so I often think more about the overall system than I do about small bits of software.

I am a CTO now. I would say this line has shifted significantly over my career, so I spend a lot more time thinking about customers and delivering customer value than I do about specific, cool, exciting new technology, except for when it's causing me a lot of problems. I've spent most of my career in small startups. CircleCI continues to scale and grow and so I'm shifting more to the scaling side, but I'm constantly thinking about how to be scrappy and how to deliver customer value without lighting my systems on fire but doing it in a way that really delivers customer value effectively and quickly.

Design for Change

Let's talk about designing for change. In order to talk about this, I need to tell you a little bit about what we do. Given the number of hands that were up in the room, I think everybody has a general sense of what CI and CD are, but I want to give you a sense of how we manage it, in terms of what we do, so you can see how horribly things can go wrong.

You're a developer, you're sitting at your laptop, you do some work, you push that, let's say, to GitHub, and then what happens at that point is we get a webhook, a notification that you've made some change. Then we go back to GitHub and we pull the latest version of your source code, all of your tests that you have defined, your configuration that you have for how your build should run. Then we execute your build, run your tests, and then we have some feedback, some status. We probably send that back to GitHub, maybe we update your PR and we say, "this is good to go, it's ready to merge because all the tests have passed," or it's not because the tests have failed and so you can't merge it. Again, I think people generally understand this flow. Then if it's passed, maybe we'll deploy it to AWS or to GCP or to Heroku or, who knows, probably Kubernetes somewhere these days, regardless of which of those platforms it's working out.
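To make the flow above concrete, here is a minimal Python sketch of one pass through it. This is an illustration only, not CircleCI's code: `Webhook`, `BuildResult`, `handle_webhook`, and the three callables passed in are all hypothetical names, and the code host, test runner, and deploy target are stubbed out as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Webhook:
    """Hypothetical push notification from the code host."""
    repo: str
    commit: str


@dataclass
class BuildResult:
    commit: str
    passed: bool
    deployed: bool


def handle_webhook(
    hook: Webhook,
    fetch_steps: Callable[[str, str], List[Callable[[], bool]]],
    report_status: Callable[[str, bool], None],
    deploy: Callable[[str], None],
) -> BuildResult:
    """Run one pass of the push -> build -> report -> deploy flow."""
    # 1. Go back to the code host for the source, tests, and build config.
    steps = fetch_steps(hook.repo, hook.commit)
    # 2. Run the steps in order; all() stops at the first failing step.
    passed = all(step() for step in steps)
    # 3. Send the status back so the PR can be marked mergeable or not.
    report_status(hook.commit, passed)
    # 4. Deploy only when every step passed.
    if passed:
        deploy(hook.commit)
    return BuildResult(hook.commit, passed, deployed=passed)
```

In this sketch a failing step short-circuits the rest of the build, the status is always reported, and the deploy is skipped on failure, which mirrors the "good to go" versus "can't merge" outcomes described above.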

No big surprises there but that's the general flow, so let's dig into that green box a little bit. For the purposes of this, I'm going to go back in time, back in history a little bit - quite a lot, actually - and talk about what is inside of that green box. Back in the day, in 2014-2015 when we had the io.js incident, we had a monolith, which is described as application there. This big green box would be one very large instance or a very large host in AWS. The application or monolith receives that hook because yes, the thing that ran all the builds and talked to all the containers locally was also listening to the public internet to receive webhooks and do pretty much everything else, that's why we call them monoliths.

Then it would take your work and push it into an LXC container. CircleCI was started late 2011, launched at the beginning of 2012, so this predates Docker. In order to slice up these boxes and create isolated space to run builds, we would use LXC containers. LXC, you can argue, is the precursor to Docker. We'll talk later about how Docker was originally built on top of LXC. Each one of these might be running one of our customer's builds, or if you took your build and split it up into multiple parallel pieces, then it would be running in multiple of these LXC containers. In fact, they might be on different instances, it doesn't super matter. The point is that the container is the environment in which your build runs.

Every single one of those containers that we spun up was built from one single image. We talk all the time about monoliths, this was our mono image. We had one specific image that contained everything that we thought anyone would ever want to run their test environment. We all have the pleasure of hindsight at this point. It sounds like a terrible idea, but it was actually a fantastic idea. In terms of operating our system, it created great simplicity for us. You can tell the era because it was built with Ubuntu 12.04 and that was what the majority of our customers were using.

Then, it came to languages. People use different versions of different languages, which is awesome, and languages, for the most part, we have tools to deal with this. People use multiple versions locally on their laptops and they use things like rvm, virtualenv, nvm to manage and deploy multiple versions of the same programming language, as well as maybe multiple sets of dependencies and stuff like that. We were able to take this one container and make it possible to have multiple different versions of Ruby, multiple different versions of Python, Node, io.js very briefly, which was great.

Then we had to add the databases, but we could only have one version of MySQL, one version of Postgres, one version of Mongo, one version of Redis, and so on. We had to be much more careful, first of all, about how we did upgrades and changes. I remember very early in my time at CircleCI, we did a patch release upgrade of Neo4j and let me tell you, everyone's favorite place to massively change core behavior is in the last dot of the version number. Basically, they changed the way that the authentication worked for the user in going from 1.1.1 to 1.1.2 and we were, "Dot release, let's just upgrade," and every single user who used Neo4j, their tests were immediately broken basically.

Luckily, we had a tool we called pseudo-hacks where we could change the state of the container after it spun up before your tests got into it. This was a very tough situation to manage, so that was a problem. Second, over time, as you can imagine, our customers started to change the versions that they were using. Let's say half of our customers are using MySQL 5.5 and then half of them upgrade to 5.6 but, again, don't take version numbers as [inaudible] it's never true. There will be some change in the default behavior such that if you were using 5.5, 5.6 wasn't compatible, but you couldn't get the capabilities you needed in 5.6 if you were stuck on 5.5.

What our customers would do is, at the beginning of their build, they would have a script that completely uninstalled all of MySQL, and then reinstalled MySQL, adding two minutes, three minutes, depends to the startup of their build. People talk about build times in tens of seconds and now we're adding two minutes for that and if Mongo is different and they use Redis and all these other things, it's getting longer and longer to actually get a build running. On top of that, every time we added another version of Ruby and Python and Node and whatever else, our container image got bigger and bigger.

This is what a typical day in our life looks like. This is typical because this is the day, the previous day, and the day seven days prior, so you can see they generally run together. This is the scale of our system. There are no scale numbers on the side but you can see it more than doubles during the course of the day. We have to be constantly bringing machines online and you can see there are some significant spikes in there. You have a choice, which is, "I overscale and constantly have a huge buffer pool and spend tons of money on machines that I'm not using," or, "I try to run it thinner, much closer, keep a thin buffer pool and be able to react quickly."

At the end of this mono container's existence, it took us about 45 to 50 minutes to bring a machine online. Imagine trying to react to a spike of traffic that comes in when someone runs 100 builds at the same time or something with machines that take 45 to 50 minutes to bring online because we're moving these giant containers around the inside of AWS. In May 2015, we hired our first human scaler, on a three-month contract, because our autoscaler was no longer working. We're going to do this with people until we get this fixed. After four contract extensions, we made that person full-time and we hired two more because getting out of this situation was not easy.
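The trade-off between a fat buffer pool and a thin one comes down to simple arithmetic on boot time. The Python sketch below shows that arithmetic under stated assumptions: the function names are hypothetical, demand is modeled as growing at a constant rate, and the growth rate and idle-pool size in the usage lines are made-up numbers. Only the 45-minute boot time and the 100-build spike come from the talk.

```python
import math


def buffer_needed(growth_per_minute: float, boot_minutes: float) -> int:
    """Idle machines needed to absorb demand growth while new machines boot.

    Assumes demand rises at a constant rate for the whole boot window.
    """
    return math.ceil(growth_per_minute * boot_minutes)


def backlog_after_spike(spike_jobs: int, idle_machines: int) -> int:
    """Jobs left waiting a full boot time when a spike arrives all at once.

    Assumes one job per machine and no machines already booting.
    """
    return max(0, spike_jobs - idle_machines)


# Illustrative numbers: demand growing by 2 machines per minute.
print(buffer_needed(2, 45))          # 90 idle machines with a 45-minute boot
print(buffer_needed(2, 2))           # 4 idle machines with a 2-minute boot
print(backlog_after_spike(100, 20))  # 80 builds queue behind the slow boot
```

The buffer you have to pay for scales linearly with boot time, which is why shrinking the image mattered as much as any autoscaler tuning.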

Also, not all days look the same. This is a real day at CircleCI. This real day is Monday and the purple line was Sunday, so we don't have as much traffic on the weekends. What happened here was, GitHub wasn't able to send us webhooks for a few hours and then we got all those webhooks, basically, at once. You can imagine trying to scale up that cliff, again, with the 45-minute rise time. We have customers who are frustrated because we can't support their new tools that they're trying to use. They have long build times because they can't get the container into the state that they wanted it in in time and we have massive costs and overhead to scaling because the size of systems that we're moving around is huge.

Then we had to support Ubuntu 14.04. Now we have two parallel systems that we're managing and scaling with two different mono containers, the one true container and the other one true container, with all the same problems duplicated. The overhead is high, the work being put in is high, the people we're investing to keep the system running are high.

Now, good things happen. I talked a little bit about Docker earlier. Some things changed in how people thought about delivering software and luckily enough, as much as a massive rewrite of your entire core infrastructure can be considered lucky, we were able to change the way that this system worked such that we pulled out some capabilities from that monolithic application that you saw to basically receive hooks, schedule things, hand them off to an agent, and then pull back the output. We use tiny little machines, a small cluster - well, it's not that small - but a small cluster of machines to run those services, and then a local agent, which then is able to compose all of the Docker containers that it needs in order to actually create the environment exactly as the customer wants it.

I as a customer can say, "Yes, I want Ruby 2.6.5, I want that running on stretch because that's what I use. I want this version of MySQL, a down version of Redis, etc., etc., and compose all that together at runtime." The databases in here, we build our own versions of those images for a couple of reasons. One, that way, we get much better caching inside of our internal network as we're moving around consistent base layers. I don't know how many people know how Docker actually works under the covers, but the base layers are all consistent and therefore can be shared.
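Why consistent base layers help caching can be shown with a small Python sketch. It is a simplification under stated assumptions: each image is modeled as an ordered list of layer identifiers, a machine's cache is a set of layers already on disk, and the image names, layer names, and both functions are hypothetical. Real Docker layer sharing depends on content-addressed layers, which this model only approximates.

```python
from typing import Dict, List, Set


def layers_to_pull(image_layers: List[str], cache: Set[str]) -> List[str]:
    """Return the layers of one image that are not already cached locally."""
    return [layer for layer in image_layers if layer not in cache]


def compose_environment(
    requested: List[str],
    catalog: Dict[str, List[str]],
    cache: Set[str],
) -> List[str]:
    """Pull every requested image, reusing cached layers; returns what was pulled."""
    pulled: List[str] = []
    for name in requested:
        missing = layers_to_pull(catalog[name], cache)
        pulled.extend(missing)
        cache.update(missing)  # later images can now share these layers
    return pulled


# Hypothetical catalog: three images built on one shared base layer.
catalog = {
    "ci-ruby": ["stretch-base", "build-tools", "ruby-2.6.5"],
    "ci-mysql": ["stretch-base", "mysql"],
    "ci-redis": ["stretch-base", "redis"],
}
pulled = compose_environment(["ci-ruby", "ci-mysql", "ci-redis"], catalog, set())
print(len(pulled))  # 5 layers transferred instead of 7
```

On a machine that has already run a similar build, the cache starts non-empty and even less has to move across the network.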

We very tightly optimize them for use inside of a CI and CD environment, specifically fsync is not super important when you're running tests. All of the writing to disk and flushing things to make sure everything is consistent is really expensive on a container that you're going to use for about 30 to 60 seconds, so we do tons of stuff to pull all that stuff up and make it run much faster inside of our environment.

That is lesson number one, which is design for change. Now, did we design for change? No, but what came out of this was we came up with a system that wasn't now supporting io.js or now supporting whatever new technology, but rather now able to support any combination of tools that our customers wanted to use.

Head Up

Let's talk about being heads up because a big part of that problem and a big part of many problems we have is actually seeing far enough ahead to identify these problems before they are crises. This is a terrible photo and I apologize for that. I was just going to say that my kids wouldn't sign off on the release to give me their likeness but really, this is just a really bad picture. It doesn't matter, though.

My kids play soccer. I don't know if anybody recognizes all the stuff that's on the field here. Right now, they're doing some core training but there's agility ladders and little posts. If you watch kids do foot skills or honestly anyone, I just don't have access to watch actual professional soccer players do this stuff, they do some really amazing things in here. Watching my kids jump in and out of an agility ladder with one foot while they volley the ball back with their other foot is really cool and being able to control the ball in and out of these little posts.

All the while they're doing that, their coach is yelling at them. It's more like soft spoken, reminding them - it is 2019 after all - to keep their head up. Why are they able to keep their head up when they play? First of all, it's important to keep your head up when you play so you can see what's developing on the field but why can they do that? Because now they are so proficient with their feet, they're so good at the basics that they can turn their attention to what's happening around them. I feel like in software, we love to tie our shoelaces together before we go into this situation. We do things that make it really hard to focus on what's ahead because we make some easy problems really hard for ourselves.

After the launch of 2.0, the platform change that I showed you, we were able to run jobs much more quickly, much more easily in smaller amounts. It didn't take a huge amount of time to get up and going. Then we were able to introduce a capability that we call Workflows and this is our monolith, actually, as it's built. I had this brief moment where I put this in and thought, "Is there anything in here that I should worry about?" Whatever, it's there now. This is our monolith being built. There's a large set of different activities that happen, some are dependent, some are not, and only at the very end do we get to the point where we get some deploy stages, all of which happen in parallel because when you have a monolith, you're probably doing some pretty interesting things around your deploy. Then, finally, notifying that the deploy has happened and downstream systems can do their thing.

This is fantastic, this is how people actually think about building, testing, and deploying software for the most part. We love saying it's a DAG, I think the DAG came up in the keynote this morning, everybody loves a DAG. You just feel better about yourself when you can say that you built a DAG. Unfortunately, we build DAGs in YAML, so it's 50/50 whether you love it or not, but this is great. This is exactly what our customers always wanted, so we're really excited about that.

The problem is, somehow, in our user interface, we still are using an old system. This is exactly the same architecture that I was just describing, with the insertion of what we call the workflow conductor. When we receive a hook, we pass it to the workflow conductor, it breaks it down and manages the state: are these jobs running? Have they completed? Are there dependencies? Then it hands those to the scheduler, which runs them inside of this execution system. Getting back to the point about the user interface, this is still the dashboard that you see as a CircleCI user when you come to our platform.
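The conductor's job of tracking dependencies and releasing jobs once their upstream work has completed can be sketched with Python's standard `graphlib` module (Python 3.9+). This is a minimal illustration, not CircleCI's implementation: `run_workflow`, the job names, and the `run_job` callback standing in for the scheduler are all hypothetical, and jobs in a batch run one after another here rather than truly in parallel.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(
    jobs: Dict[str, Set[str]],
    run_job: Callable[[str], bool],
) -> List[List[str]]:
    """Release jobs in dependency order; returns the batches that were released.

    `jobs` maps each job name to the set of jobs it depends on.
    Jobs within one batch have no dependencies on each other.
    """
    sorter = TopologicalSorter(jobs)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        if not ready:
            break  # everything left is blocked behind a failed job
        batches.append(ready)
        for job in ready:
            # Only a successful job unblocks the jobs that depend on it.
            if run_job(job):
                sorter.done(job)
    return batches


# Hypothetical workflow: tests and lint fan out from build, then deploy, then notify.
workflow = {
    "build": set(),
    "unit-tests": {"build"},
    "lint": {"build"},
    "deploy": {"unit-tests", "lint"},
    "notify": {"deploy"},
}
print(run_workflow(workflow, lambda job: True))
# [['build'], ['lint', 'unit-tests'], ['deploy'], ['notify']]
```

If `lint` fails in this sketch, `deploy` and `notify` are never released, which is the behavior you want from a gate in front of a deploy.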

This dashboard is based on jobs, which were the old unit, really, what we used to call builds. If you look over to the right, you can see a little squiggly that says, "This is a workflow and this is the workflow that it was part of and this is the job, the name of the job that got run," etc. Why is our user interface still in this old domain when we shifted our experience to this new domain?

Let's talk about Clojure for a second. We're a Clojure shop. Is anybody familiar with Clojure? A few people. Really want to use it on your home projects, love it, super expressive. Everybody loves Lisp. Clojure is a Lisp that runs on the JVM and our monolith was built in Clojure, most of our services are built in Clojure. Somewhere back in early 2014, before I joined CircleCI, someone made the decision, and I've heard this in many different domains, that we would be better at front end development if we used the same language that we use in the back end. We adopted something called ClojureScript. In particular, we used a framework called Om, and Om is a ClojureScript wrapper around React.

Given that we've had a long keynote this morning about state and immutability, people appreciate a lot of the capabilities of functional programming. In particular, Clojure and ClojureScript are designed to allow you to think effectively about immutability; the data structures are immutable. React, for anyone who is not familiar with React, from a front-end programming perspective, is driven off of this concept of defining your state in one single place and then doing reactive programming off of that. Having met Pete Hunt from Facebook/Instagram many years ago, who was a big part of building out React and managing that team, he stated at some ClojureScript meetups, so maybe he was just trying to be friendly, that it was actually Om and the ClojureScript community that really drove early adoption of React because the models made a lot of sense, which is super cool. We were excited about that because we were functional programmers and ClojureScript programmers and so it was exciting to hear that we were a big part of that.
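The idea of keeping state in one place and deriving the view from it can be shown in a few lines. The sketch below uses Python rather than ClojureScript or React, so it illustrates the model only and is not anyone's actual API: `AppState`, `render`, and `select` are hypothetical names, and a frozen dataclass stands in for an immutable data structure.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class AppState:
    """The single, immutable source of truth for the view."""
    jobs: Tuple[str, ...] = ()
    selected: int = 0


def render(state: AppState) -> str:
    """Pure function: the same state always produces the same view."""
    lines = []
    for index, job in enumerate(state.jobs):
        marker = ">" if index == state.selected else " "
        lines.append(f"{marker} {job}")
    return "\n".join(lines)


def select(state: AppState, index: int) -> AppState:
    """Return a new state instead of mutating the old one."""
    return replace(state, selected=index)


before = AppState(jobs=("build", "test"))
after = select(before, 1)
print(render(after))
```

Because `before` is never mutated, any component still holding it sees a consistent view. Running two state models side by side, as described a little later in the talk, gives up exactly that guarantee.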

For anyone who's not familiar with StackShare, StackShare is this cool site, stackshare.io, you should check it out, where you can go describe the stack that you use. You basically go through and click little buttons and say, "Yes, we use React and we use Postgres and we use Redis and we use Java," or Python or whatever, and then you can look at what other companies use and then there's some long form content describing why people made the choices they made, etc. Then if you pick specific items in the stack, you can see who's using them, and particularly by number. These aren't, of course, all the numbers in the world. This is the number of companies and developers who chose to sign up for StackShare and chose to say on StackShare, "Yes, I use this technology."

7,856 companies using React and 18,312 developers. Here is Om. The first one on the left there, Precursor, those are two people that left CircleCI and started another company. There's a Slack channel called Clojurians, I'm not even sure how to pronounce it, and we have historically gone in there (maybe this only happened once, because when this happens you stop) and said, "We're trying to do this thing with Om, does anybody know how to do this effectively?" And we got links to our own code. "I heard that CircleCI is using Om." We're, "Yes, we know about that. We'd like to know who else is using it."

Two years ago, David Nolen, the person who started Om, decided he didn't like the model, so he created Om Next. We tried to implement Om Next iteratively inside of our Om stack. Speaking of problems of state, we then ended up with two state models inside of our front end. Some components reacting to one, some reacting to the other, and then us trying to synchronize, and that wasn't working and we finally gave up and said, "We should just switch to React." We tried to do that iteratively inside the same stack, now we have three and we're just rewriting the whole thing.

Good news if you're a customer, the workflows-based user interface is coming soon, but this is not how you get your head up and look at the next thing, this is looking down at your feet. If you can't do basic stuff like deliver a front end because you've chosen all these tools that don't make a ton of sense, you're never going to get on to the next thing and really drive value for your customers. What's so interesting to me about this is I think we all believe that we are going to get that step function advantage. Or, there's this cool piece of technology that no one else is smart enough to use, so we are going to take that and it's going to make us better than everybody else because we're going to move faster. Please don't believe that. The thing that's going to keep you moving fastest and delivering real value to your customers is ignoring these desires. It's focusing on the things that allow you to easily build solutions and value for your customers.

You Are a Product Manager

It seems somewhat related to make this claim: you are a product manager. As you advance in your technical career and move from worrying about the basics of implementing functions and then methods, objects, classes, whatever, however you think about your code, and get to the size of services and multiple services talking to each other in large scale systems, you need to think more and more about the customer and think about your product in the way that your product managers think about it.

I don't know how many people have read this book. I don't even know how I came across this book, someone that worked with us, I think, recommended it to me. John Ousterhout, I think he's from Stanford. Anyway, he wrote this book called "A Philosophy of Software Design." He says, "If we want to make it easier to write software, so that we can build more powerful systems more cheaply, then we must find ways to make software simpler." The inverse of simple is complex, we all talk about complexity. His whole approach in this book, his whole philosophy in this book – philosophy is a good word – is we got to make software simpler, so let's figure out how to do that, let's remove complexity at every turn.

In this paper, "Out of the Tar Pit," Ben Moseley and Peter Marks describe essential versus accidental complexity. This has been done before, I think Brooks describes essential and accidental complexity, he maybe uses different words, but they're very absolute in their definition, which I like. They say that, "Essential Complexity is inherent in, and is the essence of, the problem, as seen by the users." In your domain, the only complexity that is essential, the only complexity you must have in your system is the complexity of the user problem itself. They take a lot of words to say accidental complexity is everything else. Anything that has to do with performance, suboptimal languages, infrastructure, whatever, is accidental, and therefore, you should spend your time reducing that. If your job, according to Ousterhout, is to simplify your systems and reduce complexity as much as possible and the only way to understand what is essential in terms of complexity in the system is to understand the user problem, then you are a product manager.

This is my kitchen – not right now, this was my kitchen – we remodeled it a couple years ago. About partway through the process, somewhere around this stage, maybe a little bit later, the contractors turned to us, myself and my wife, and said, "we're ready to put in the cabinets, where are the cabinets?" We're, "What cabinets?" We made frantic calls to our architect and it's super confusing to use an example with architects in this room. This is a person who designs buildings and houses and stuff, it's where the word came from. We made frantic calls and he said, "Of course you should have ordered cabinets, I thought that was obvious." We're, "Obvious? You do this every day, we've remodeled one kitchen in our lifetime. Which one of us should have known that at this point, we should have already had cabinets delivered?" There is like a six or eight-week timeline to get cabinets delivered. This was a very exciting stage.

Now, invert that because in this case, you are the architect/product manager, you're the one who is seeing all the patterns. You're seeing the constant use and you should be ahead of what you're being asked for.

I'll tell a little story about Docker. I referred to Docker at the beginning as not even being in existence when CircleCI started, then it was in existence. Not only did we use it to create this system, which was much more flexible and capable, but over my time, certainly, at CircleCI, all of our customers started using it. How many people in this room package Docker containers as part of a build cycle? Ok, it's actually not quite as many as I thought, but I will tell you that in 2014, not one of you would have put up your hand unless you were, "Yes, I'm trying it out at home." Luckily, you could come into our HipChat channel and talk to me about Docker now and it's, "Cool, what's Docker?"

Our customers started using it more and more and that was a primary use case of our system. If you go back to the LXC containers that we had at that time, luckily, Docker was built on top of LXC and so we were able to do construction of Docker, like actually execute Docker commands to build images and push them to repositories or whatever inside of an LXC container. We were using the Docker LXC driver. By default, Docker was built on top of LXC. Then in March of 2014, Docker launched something called libcontainer and they pulled apart the access to the underlying system. Basically, "they reimplemented LXC in Go" is probably a high-level way of looking at it, and then they created execution drivers. So, you could use libcontainer but you could also use the Docker LXC driver and others. That was fine. Then in late summer 2015, they deprecated the LXC driver. Ok, we'll use deprecated technology. We'll live on the edge like that, no problem.

This part wasn't as good. Try to explain to your customer that they have to continue to use Docker 1.9 because the LXC driver has been replaced with libcontainer. They're, "What are you talking about? I type Docker build, it doesn't work." Nobody cares about the internals of Docker, this is our job. We're the ones that are seeing hundreds or thousands of companies using Docker to build their containers or their images and then to deploy that into production. We're the ones that are responsible for identifying the trend and, honestly, this was below our product team. This is stuff that our product team doesn't necessarily understand but our engineers were talking about it all day every day. They're, "Yes, this is going to go badly. Coffee?"

This is why, as an engineer, as an architect, as a leader in technology, you need to be thinking about the direction of the product and thinking like a product manager.

Coming back into this world, we're almost there, except Docker, and Docker is still a disaster. Also, the way that we cache data between builds is we tar up a bunch of stuff and ship it off to S3 and then we ship it back and un-tar it. One of the greatest things that gets you performance in Docker, I talked about images being moved around, is stuff sitting inside the Docker cache. Our suggestion to customers who wanted to improve the performance of their Docker builds, because every time they tried to build, they're pulling down the entire set of base images to this machine, we would say, "What you should do is tar it all up and then we'll stick it in S3 for you." They're "That's taking longer than just pulling the stuff in the first place." We're "Ok, no problem, what we'll do is instead of all of this stuff, all of this environment that we've given you, we'll just give you a VM, we'll just give you a machine and we'll mount some persistent block store that we have in the cloud and allow you to store your cache in there." They're "Awesome, but all that other stuff that you built for me, I want it. Right now, I'm basically back to 1.0 container land where I have to install everything that I need." Ultimately, we came back and shifted the way that this worked, got you back into the container and then used a remote Docker engine, so instead of using the local socket, we built an entire system to talk Docker, basically, over to another system that held your persistent storage, and to you, it felt like just a natural use of Docker.
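As a rough illustration of the tar-and-ship caching described above, here is a minimal Python sketch. It is not CircleCI's implementation: the function names `save_cache` and `restore_cache` are hypothetical, and a local directory stands in for S3 so the example runs without any cloud account. It shows why the round trip can cost more than it saves: every save and restore archives and unpacks the whole directory.

```python
import tarfile
import tempfile
from pathlib import Path


def save_cache(paths, store_dir, key):
    """Tar up the given directories and place the archive in the store.

    store_dir is a local directory standing in for an object store such as S3
    (an assumption made so the sketch is self-contained).
    """
    archive = Path(store_dir) / f"{key}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        for path in paths:
            # Store each directory under its own name so it can be restored anywhere.
            tar.add(path, arcname=Path(path).name)
    return archive


def restore_cache(store_dir, key, dest):
    """Fetch the archive for key and un-tar it into dest; return False on a miss."""
    archive = Path(store_dir) / f"{key}.tar.gz"
    if not archive.exists():
        return False
    # Use the safer "data" extraction filter where this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest, **kwargs)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        deps = root / "build" / "deps"
        deps.mkdir(parents=True)
        (deps / "package.txt").write_text("dependency data")
        store = root / "store"
        store.mkdir()
        fresh = root / "next-build"
        fresh.mkdir()

        save_cache([deps], store, "deps-v1")
        print(restore_cache(store, "deps-v1", fresh))
        print((fresh / "deps" / "package.txt").read_text())
```

With a real object store, both directions also pay network transfer time, which is the cost customers were comparing against simply pulling the base images again.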

These, all of these steps to get there, were very much done under duress, is probably the best way to describe it. We knew all of this was coming but we're sitting around waiting for our product team to be, "This is important, we should probably work on this," instead of saying, "Let's get together and talk about these things that we need to do."

Where Are We Now?

Where are we now? Trust me, nothing is perfect. We have lots of places to go and lots of things to build, but if I go back not quite as far, here's another example that I personally remember. This is late 2016 now and io.js has already gone and Elixir comes around. This one lasted longer, I think people are doing this and actually quite enjoying it. I don't know if there are Elixir devs in here? I didn't actually say, "Raise your hands," but I feel like Elixir devs are super proud and throw their hands up anyway. Same deal, "We just started this new language and it doesn't seem very well supported on your platform and going from time to run your tests," which is super important, "going from 20 seconds to 10 minutes and 9 seconds" – I love the precision. By the way, if you're going to shame someone publicly, shame them with precision.

Then it ends with, "CircleCI is awesome, though." I'm, "I don't even know how to react." Are they the happy customer? Are they the sad customer? I'm not sure. I need one of those little things that they have at the back here and they could badge and be "No, this was great." "How are you feeling today?" "I'm not sure."

The good news, though, the same day, I was "we're building some new stuff, email me if you want to check it out." That was a Friday night, I remember, and by Saturday morning, my good friend, Jay Tregunna, had his builds down to 45-50 seconds. It's not 20 but let's all be honest, if you expect your CI platform to run your tests at exactly the same speed as your laptop, your CI prices will go up.

To wrap it up, these are the things I want you to take away from this. Design for change. Don't just fix the problem that you have but think, "Ok, we're seeing this problem repeatedly, how can we build a system in a way that's going to absorb the next round of the same problem?" Simplify the basics so you can look ahead. You're trying to look at the future, don't tie your shoes together and be awkwardly running and trying to figure out how to not fall down. You have a big job, which is to understand where your market is going, where your domain is going, what your customers are going to need so you can design for that. Finally, think like a product manager. Go meet your customers. What's great is you guys can meet them on Twitter, you don't even have to leave the building, they're all out there. Understand their goals and then architect better.

Questions and Answers

Participant 1: Thank you, I appreciate you being really honest about the failures you've gone through and so we can learn from it. On one hand, you could make the argument that you're still here, you're successful. I bet that you could go the opposite extreme, "I'm trying to look too far ahead instead of solving for the problems we have now." How do you think you will balance that going forward? In some sense, you made tactical decisions that turned out well in the long-term, perhaps, but weren't too far ahead, so how do you find that right sweet spot?

Zuber: I champion this everywhere – I was joking with Randy [Shoup] about it earlier and you mentioned talking openly about failures – everyone talks about evolutionary architecture as if before you even start, you should understand all the places that are going to change, all of the systems that are important to invest in and maybe build some microservices too while you're at it before you even launch your product. I had that little scrappy meter. I think companies that survive and get to the stage that we're at have to be on that scrappy end.

There are two things. One, it's easy to listen to people talk in large scale in spaces like this about the end state, "This is how awesome and scalable our system is now," and not reveal all of the pain that they went through to get there. I think to answer your question, I would say, one, it doesn't matter how big you are. Don't do things that are unnecessarily complex just to try out new things or be cool or whatever. Save that for your side projects and invest in doing the most you can with the smallest amount of effort.

Then on that second pass, third pass, we talked a little bit about the rule of three; don't abstract the system until you've had to do it twice. If we were building CircleCI 3.0 right now, I'd probably be fired. That was a big investment to rebuild that whole thing. We ended up baking in a bunch of different constraints and requirements but as a result, we did it in a very incremental way. We made sure that we could pull off a tiny bit of traffic with the smallest amount of change and learn from that and then do a little more and do a little more, so that we could mitigate that risk.

I do think it's not going to be perfect from the beginning, you're not going to be perfectly abstracted, but pay attention to what's going on around you. It's embarrassing as a software developer to not know what's going on in the software industry but some days, we're just totally down in our terminal worried about the one thing that we're trying to get done. It's a balance, it's always a balance. Go simple from the start and go back to complexity and simplicity, stay simple. Simple is easier to change, it's always easier to change even if you didn't design for that specific change.
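The incremental cutover mentioned above, moving a tiny bit of traffic first and then a little more, can be sketched as a percentage-based router. This is a minimal Python sketch under stated assumptions, not a description of how CircleCI actually routed builds: the names `rollout_bucket` and `use_new_system` and the idea of keying on a project ID are hypothetical.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a key (for example a project ID) to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_system(key: str, percent: int) -> bool:
    """Route the key to the new system if its bucket falls under the rollout percentage."""
    return rollout_bucket(key) < percent


if __name__ == "__main__":
    projects = [f"project-{i}" for i in range(1000)]
    for percent in (1, 5, 25, 100):
        moved = sum(use_new_system(p, percent) for p in projects)
        print(f"{percent}% rollout: {moved} of {len(projects)} projects on the new system")
```

Because the bucket depends only on the key, a project that has moved to the new system stays there as the percentage rises, and setting the percentage back to zero reverses the change for everyone.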

Participant 2: I have a question about looking ahead. As an engineer, we can see we have some abstract that needs to be upgraded. Also, the product managers, they also are looking ahead for the features that they want to build. As an engineer, how do I convince the product manager that it is important to invest into upgrading the technology?

Zuber: I've been in this track all day and every single talk has had the question, "How do I convince management?" You're not alone is the first thing that I would say, this is the problem that everyone is facing. It's interesting because I am management and so I find the question really fascinating. For me, one of the tools that I love is called cost of delay and it's from Reinertsen's book, "The Principles of Product Development Flow," but you don't have to read the book, you can just look at cost of delay. You don't have to necessarily use cost of delay, but the concept is really good because ultimately, it's about prioritizing work based on financial impact.

I think we have a tendency in software to say, "This will be better, we'll go faster if we have this." I don't know if everyone does, but I demand from product clarity about what is the opportunity we're going after and what is the business impact that we're expecting to have, and you should be able to define the business impact of your technical work on the same terms. Ultimately, you can convert everything into dollars or euros or whatever currency you're spending to run your business and if you can do that effectively, then you can have the discussion. Then, it's a discussion about data, first of all, versus a discussion about narrative.

I don't know if anyone has explored those two approaches to having a conversation, but if you can have a conversation about a model and say, "This input to the model doesn't seem right," and then get to the place where you agree on the model, then the answer will become obvious. If you have a discussion about narrative, which is my personal opinion that we should do this thing and I'm going to tell you a cool story about why that's really important and it's your opinion that we should do that thing and you're going to tell me a story about why that's really cool, then we just end up arguing with each other. We're arguing about opinions and it doesn't end up being effective.

Then, the product will win, honestly. I think regardless of the framework that you use, being able to get to the same common currency in terms of the value of something you're going to do is a great way to do that. Then, make it smaller. Whatever it is you're trying to do, find a way to deliver it incrementally so you can manage the risk.
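To make the "common currency" idea concrete, here is a minimal Python sketch of one common way to apply cost of delay: divide the estimated cost of delay by the estimated duration and do the highest ratio first. The `WorkItem` class, the item names, and all of the figures are hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class WorkItem:
    name: str
    cost_of_delay_per_week: float  # estimated money lost for each week the work is not done
    duration_weeks: float          # estimated time to deliver the work

    @property
    def priority(self) -> float:
        """Cost of delay divided by duration: value recovered per week of effort."""
        return self.cost_of_delay_per_week / self.duration_weeks


def prioritize(items):
    """Order work so the highest cost-of-delay-per-duration item comes first."""
    return sorted(items, key=lambda item: item.priority, reverse=True)


if __name__ == "__main__":
    backlog = [
        WorkItem("new product feature", cost_of_delay_per_week=10_000, duration_weeks=4),
        WorkItem("upgrade build infrastructure", cost_of_delay_per_week=6_000, duration_weeks=1),
        WorkItem("rewrite the UI", cost_of_delay_per_week=8_000, duration_weeks=8),
    ]
    for item in prioritize(backlog):
        print(f"{item.name}: {item.priority:,.0f} per week of effort")
```

In this made-up backlog the technical upgrade ranks first even though the feature has the larger weekly cost of delay, because it is so much shorter. The conversation then becomes about whether the inputs are right, not about whose story is better.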

Participant 3: You mentioned that we should focus on how we could quickly and easily deliver the solution or product to our customers instead of focusing on the fancy latest new tech. When you find something that looks helpful for your solution but isn't very widely adopted by developers at this moment, how would you decide if you'd like to use it or not?

Zuber: I think every time I answer any question, I talk somewhere about small iterations. First of all, understand the problem you're trying to solve. I can't speak for everyone in the room, but I think the thing that's very common practice in tech is, "This is a really interesting piece of technology, I wonder if I can find a problem that I could solve with it?" As opposed to, "This is the problem I'm trying to solve," and then I'm going to evaluate on those parameters. If it turns out there is one thing that you can do and that's going to be a game changer in terms of your ability to do that, then cool, but think about total cost.

I think we often think about early delivery cost. I used the ClojureScript example and our ability as software developers who understood Clojure to write ClojureScript is great, but the ability to manage the ecosystem, to debug, honestly, you can't get a frame out of that in a browser, etc. All of that stuff comes in and user interfaces tend to be all in, "I'm all in on this framework," or, "I'm all in on this framework." Whereas if you have a back end comprised of services, for example, then try something out on one service and when it fails, just throw it out and rewrite it. Find a spot where you can test something out and the cost is really low to reverse it.

 


Recorded at:

Jan 20, 2020
