Introduction
In this interview, Sam Haskins shares his experience of how DevOps is done at Etsy. Etsy uses Git for source control and deploys code updates around 30 times a day on average, sometimes as many as 70. At such a high deployment frequency, how do they review their code? How do they monitor system performance after each update? How do they deal with failures? These questions are covered in this interview.
InfoQ: Sam, thank you for taking the interview. So when did you join Etsy?
Sam: I started about 2 1/2 years ago. I was actually an intern then; I was still in school. But since then I've graduated and now I'm full-time.
InfoQ: So this is your first job, then?
Sam: Yeah, it's a very good one to have. I'm very lucky.
InfoQ: You work in a DevOps role. Are you more Dev or more Ops?
Sam: I work more on the dev side of DevOps, but my team is definitely in the larger development organization. We're definitely very close to ops.
InfoQ: How many people are there in your team?
Sam: My team is about 10 or 15 I think.
InfoQ: What team is that?
Sam: So that's the core platform team. We work on basically just the interface between the servers and operations level things and where people developing features work. So our customers are engineers developing features.
InfoQ: So what are the main responsibilities? Is it more like developing features?
Sam: I work primarily on our database access layer, on our ORM, and also a little bit on things like our web framework, sort of like Rails, our framework for developing things. Other people on my team work on our photo storage system and asynchronous job management. And during emergencies, we tend to know most of the code base, so usually it's us who work on fixing things.
Can you tell us something about your code deployment system? What tools do you use (like Git, Puppet or BT)? What are the code audit procedures?
InfoQ: The first question is about the deployment system. What tools do you use basically?
Sam: We use Git for source control, and then we use a tool that we wrote ourselves, Deployinator, for taking what's in Git and putting it on all the servers. It basically just copies the files; it's not anything special. We use Chef for configuration management. That's about it for keeping the code. For managing the code, we actually have an internal GitHub.
As for code audit procedures, we don't have a lot of formal checks. Review isn't required, so as long as your code will run, as long as it compiles, it can go into source control. But we encourage people to use code review and they often do. Code review usually just means you show someone your patch, they look at it, and they give you comments. Most people will do that before they push something.
InfoQ: Is there someone like a QA?
Sam: No, we don't have traditional QA. Because we work in such small chunks, it's usually fairly easy to determine that you have broken things. We do have people who work sort of like QA; their job is just to try to break things and let us know what broke. Because we do so many deployments, sometimes as many as 50 in a day, sometimes more, it would be impossible to run a full QA pass before or after every deploy; there's not enough time. But we use automated tests and things like code review, and because we work in such small changes, we believe that gets rid of the need for a large formal QA process.
InfoQ: When you roll out an update, how do you manage the environments and keep the code secure?
Sam: So the tool that we have, Deployinator, what it does is you click a button and it just takes what's in Git and ships it.
InfoQ: So just source code? No packages?
Sam: No, we don't package it up. Deep down I think it just syncs to a folder on the web servers, but it's not very complicated. We don't do a lot of work that way.
It's very simple, and it takes only a few minutes for it to run so it's very easy which is why we do it so many times.
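(For illustration: Deployinator itself isn't shown here, but conceptually a push like the one Sam describes amounts to syncing a Git checkout out to every server. A minimal sketch in Python, with hypothetical host names and paths rather than Etsy's actual setup:)

    #!/usr/bin/env python3
    """Minimal sketch of a "copy what's in Git onto every server" deploy.
    Host names and paths are hypothetical, not Etsy's actual configuration."""

    import subprocess

    WEB_HOSTS = ["web01.example.com", "web02.example.com"]  # hypothetical pool
    CHECKOUT = "/var/deploy/checkout"   # local clone tracking the deploy branch
    DOCROOT = "/var/www/site"           # folder the web servers serve from

    def deploy():
        # Bring the local checkout up to date with what's in Git.
        subprocess.check_call(["git", "-C", CHECKOUT, "pull", "--ff-only"])
        # Copy the files to every server; no packaging step, just a sync.
        for host in WEB_HOSTS:
            subprocess.check_call(
                ["rsync", "-az", "--delete", CHECKOUT + "/", host + ":" + DOCROOT + "/"])

    if __name__ == "__main__":
        deploy()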
InfoQ: So that 50 a day is like a typical frequency?
Sam: Yeah. The actual average this year was about 30 or 33, and that includes weekends, when we don't do any at all, and other things. I think the most we've ever done has been maybe 60 or almost 70 in one day, but nowadays we almost always do at least 20 to 30. It's a lot, but we've changed our style of coding to make that a reasonable thing. If we were pushing big changes all the time, that would not work; that would be a disaster. But because we work in such small pieces, it works well.
What are the failure response procedures / guidelines? How do you grade failures, and what are the recommended procedures for dealing with failures at each level?
InfoQ: Since you push out so much in one day, when you push out a change, do you observe it for a few minutes first and then push another?
Sam: Yeah. So we have a very large number of metrics, hundreds of thousands of metrics. When you push a change, at the very least, in addition to checking that what you changed actually worked, there's a standard deploy dashboard that you need to look at. It has a bunch of graphs, and you're supposed to make sure the things on that dashboard are still behaving normally.
Also, we have a tool for reading the logs live, and you're supposed to look at that as well. If anything looks broken, then you need to fix it. It's usually only one or two minutes of watching, enough for some graph data to come in. As long as nothing has broken and you believe your change worked correctly, then the next group goes off.
InfoQ: So when you push a change, do you push it to a test environment first?
Sam: The assumption is that you've tested it on your own before you push, but we do: we push it to our staging environment, which is really our production environment. It's the same environment, just a version ahead, and no users hit those servers; it's just us. So when you push code, first it goes to staging, and at the same time our CI suite, we use Jenkins, runs the unit tests. Once everyone who's pushing code believes they're all right in staging, meaning they've tested out what they changed and think it works correctly, and once Jenkins is done, which usually takes about five minutes total, then we move to production.
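(For illustration: the "staging, then CI, then production" gate Sam describes could be approximated by a script that polls Jenkins' standard JSON API before allowing the production push. The job URL and the deploy() helper below are hypothetical:)

    """Sketch of the staging -> CI -> production gate described above.
    The Jenkins URL, job name, and deploy() helper are hypothetical; the
    /lastBuild/api/json endpoint is Jenkins' standard JSON API."""

    import json
    import time
    import urllib.request

    JENKINS_JOB = "https://jenkins.example.com/job/unit-tests"  # hypothetical

    def last_build_result():
        # Returns "SUCCESS", "FAILURE", or None while the build is still running.
        with urllib.request.urlopen(JENKINS_JOB + "/lastBuild/api/json") as resp:
            return json.load(resp).get("result")

    def wait_for_ci(timeout=600):
        deadline = time.time() + timeout
        while time.time() < deadline:
            result = last_build_result()
            if result in ("SUCCESS", "FAILURE"):
                return result == "SUCCESS"
            time.sleep(10)  # build still running; poll again
        return False

    # deploy("staging")          # humans verify their changes on staging
    # if wait_for_ci():
    #     deploy("production")   # the same button then pushes everywhere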
InfoQ: When you mean production, you mean all the machines?
Sam: All the machines. All of them. The tool is just two buttons. It's staging and production.
InfoQ: Do you divide node clusters by services? Like one service per cluster?
Sam: Sort of. These deploys are all one stack; those numbers are all one web stack. We do have a separate stack for search and for photo storage, but those are about it. We don't do many services. We have one code base, so it's large, something like 2 gigabytes of code, but for the most part we only use one stack.
So when you deploy, it goes to the web servers and also the API servers, our internal support systems, asynchronous jobs through Gearman, and various things like the boxes running cron. It all goes at once.
InfoQ: So do you divide these pushes by their importance? You know, some pushes might have a bigger impact on the users.
Sam: We like people to work in such a way that that doesn't happen very often. If you have something very important going on, then you can take over the push line, do it on your own, and make sure things are working a little better. But normally, we encourage people to develop in a style so that doesn't happen.
InfoQ: So you're kind of preventing big changes at one point in time? You're pushing small changes all the time?
Sam: Right. As for the way we launch new features, since you can't launch a new feature piece by piece, we use a configuration file where you say this feature is off, this feature is on. So when it's time to launch, you turn it on. But to make that not have a large effect, to make it something that's safe to do, we have the ability to launch it just internally, to launch it just to beta groups of people, and also to turn it on a percentage at a time.
So when something is just an implementation change, often we'll turn it on for 1% of users to make sure it doesn't break anything, and then we'll slowly bring it up to 100% instead of doing it all at once. If it's a feature, it's more likely we'll just turn it on, but sometimes we launch a feature to only a certain percentage of the site and see how those users behave over a few days. We'll watch it.
InfoQ: So is it like all the beta users go to a specific cluster of nodes?
Sam: It's all the same nodes. We check at runtime; it's just their user ID that gets checked.
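(For illustration: the runtime check Sam describes, a config file of flags plus a deterministic bucket derived from the user ID, might look something like the sketch below. The flag names and hashing scheme are hypothetical, not Etsy's actual feature-flag code:)

    """Minimal sketch of config-file feature flags with internal, beta, and
    percentage rollouts. Flag names and the hashing scheme are hypothetical."""

    import hashlib

    # Hypothetical contents of the deployed configuration file.
    FEATURES = {
        "new_listing_page": {"enabled": 1},           # 1% of users
        "search_redesign":  {"enabled": "on"},        # fully launched
        "new_checkout":     {"enabled": "internal"},  # staff only
    }

    def feature_enabled(name, user_id, is_internal=False):
        setting = FEATURES.get(name, {}).get("enabled", "off")
        if setting == "off":
            return False
        if setting == "on":
            return True
        if setting == "internal":
            return is_internal
        # Percentage rollout: hash the (feature, user) pair so each user lands
        # in a stable bucket 0-99 and stays enabled as the percentage is raised.
        digest = hashlib.md5(("%s:%s" % (name, user_id)).encode()).hexdigest()
        return int(digest, 16) % 100 < int(setting)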
InfoQ: Do you write your own system for that, or do you use something else to do it?
Sam: We wrote the code ourselves for running that. An older version of it is open source. The newer version is much, much, much better and is not yet released but it will be.
The project doesn't have a very good name, though. I think it's just Feature Flags or something, but it will be on our GitHub page when it's out.
I can point you to some other talks that describe developing that way. We do that instead of using branches in Git. People don't work on branches for a long period of time; they only work for a short period and then deploy the code.
Can you briefly describe a recent failure you encountered, and how you solved it?
Sam: Do you want to talk about how we deal with failures then?
InfoQ: Yes, please.
Sam: Because people are trained to look at the graphs and the logs, we usually detect failures very, very fast. There are always people looking at the metrics, and we have a lot of metrics, so people know what the site is supposed to look like.
InfoQ: So is it like something checked once a minute or once a second?
Sam: We use Nagios to do automated checks. It depends on the check. Some checks are run very, very often; some of them are not. We have a number of Nagios checks but then humans are also watching graphs and trying to see patterns as well.
So when we detect those problems, it's because people are normally looking at the graphs while they're deploying code all the time; it's not something that's unusual to them. They've practiced it, and so they do a good job.
InfoQ: That sounds very human intensive.
Sam: It is. But we do have things like Nagios checks for the things that can just go wrong on their own. We don't have very many failures that don't get caught quickly; they're almost always caught fairly instantly. Things don't tend to go wrong in the middle of the night, because no one is deploying code when everyone is asleep. The kinds of things that go wrong then are usually system-level things, and those are checked by Nagios. During the day when we're working, other kinds of things might go wrong because of a deploy, but then people are also watching because they're deploying code. So the stuff that's hard to catch with a computer tends not to go wrong at night.
InfoQ: So let's say, when a failure happens, what kind of normal procedures do you take?
Sam: If something has broken the site in a very bad way, we'll ask people to stop pushing code, put a hold on the push line, and then whoever is watching will start digging in and trying to figure out what's wrong. We have access to the logs and the graphs, and as we determine what feature is going wrong, we start contacting the people who are involved in that feature or who might know. Usually, if it was related to a deployment, those people are already around. We use IRC chat all the time.
So they're in the channel, they are around, and if it's related to a deploy, the people who are deploying are watching and they know what went wrong. They probably already know how to fix it, because it was a small change that they made, so they can revert the change and push that. For smaller bugs, we like to roll forward instead of rolling backward: instead of reverting, you fix it the way you wanted it to be.
InfoQ: But then that takes longer.
Sam: Right. So if it's a small problem and it's something that you can fix quickly, then you roll forward. Because even rolling back, you still have to run a push, which is about ten minutes total: you have to go to staging and then to production. So it's going to take about that long no matter what you do. If you spend the next five minutes fixing it the correct way, that's not that bad and it's probably better.
So that's usually what we do. When it's a bad bug, then obviously we'll just roll it back and figure it out later. But oftentimes it's not clear that the revert is actually what you want and so rather than blindly revert, we try and figure out what the real problem is.
InfoQ: So what's the normal way of rolling forward?
Sam: Most bugs that get encountered at deploy time don’t affect the entire site. They affect a very small portion. If you're rolling forward, then other people can keep pushing their code and keep doing their job while you're making the fix. But obviously, you make that decision based on how bad things broke.
InfoQ: What happens when that doesn't fix it?
Sam: If it doesn’t fix the thing, you would start reverting.
InfoQ: So let's say after 10 minutes you still can't fix it?
Sam: It's not a hard and fast rule, but while people are waiting for you to fix your problem, other people can't deploy code. If you prevent people from deploying code for too long, then we want you to do whatever it is that lets people keep deploying. So yeah, if it's been about 10 minutes and you don't know what the problem is, but you think the revert will get us back to a good place, then absolutely: revert, and try to figure it out on your machine.
InfoQ: Ok then. So can you describe a recent failure?
Sam: Well, more typically we see very small problems. Say someone is changing a page, for example the page where a seller who is selling an item on our website edits how many items there are. Let's say someone made a change to that page, and all of a sudden every time a seller changes the number of items there's an error, but it's not preventing people from selling things on the site. For the moment we're okay. That person looks into what they did and tries to fix it, and normally, because it was such a small change, they just fix the code and roll forward. And again, if it were a bad one, they would roll back.
An example of a more recent, very bad problem: we generate the IDs in our database ourselves; they are not auto-increment columns. We don't use the database to do them because we run all of our databases with two masters, a master-master setup, and you can't do auto-increment because you don't know which master incremented last. So instead we have separate servers whose only job is to hand out numbers. There are two of those as well: one of them only generates odd numbers (1, 3, 5, 7) and the other only even numbers (2, 4, 6, 8). That's where you get your IDs from.
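(For illustration: the odd/even ID servers Sam describes behave like two disjoint counters; either one can hand out the next ID without coordinating with the other. A toy sketch, in-memory rather than database-backed like the real servers:)

    """Toy sketch of the two ID ("ticket") servers described above: one hands
    out odd numbers, the other even, so the sequences never collide."""

    import itertools

    class TicketServer:
        def __init__(self, offset, increment=2):
            self._counter = itertools.count(start=offset, step=increment)

        def next_id(self):
            return next(self._counter)

    odd_server = TicketServer(offset=1)    # 1, 3, 5, 7, ...
    even_server = TicketServer(offset=2)   # 2, 4, 6, 8, ...

    # A writer can take an ID from whichever server answers first; because the
    # two sequences are disjoint, the master-master databases never conflict.
    assert odd_server.next_id() == 1
    assert even_server.next_id() == 2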
A few months ago, those ID counters went over 2^31 (2,147,483,648), so the IDs overflowed a signed 32-bit integer column. That's supposed to be okay, because you're supposed to store them in 64-bit columns, but there was a small number of places in the code base that did not. There was a small number of things people had written where they didn't realize why the column needed to be 64-bit. And so when the IDs got too large, those features stopped working.
So those features couldn't get an ID that made sense; they would try to store it and the database didn't like it. When that happened, we immediately stopped people from pushing code. It took us about a minute to determine that that was the problem, because the logs had a lot of errors with those numbers in them, and if you've seen that number before, you know that number. You know the bad feelings that come with seeing it.
So we saw that and we immediately stopped pushing; my team, core platform, was the one who detected it. Because we knew this was a potentially very bad problem, and we did not yet know how many tables had it, since there's nowhere to just look that up, we gathered some operations engineers and some of my team, went into a conference room, and started planning. We decided to go through all the tables and see which ones had columns that were not large enough.
We also keep our schemas in code as well as in the database, so we had to go through the code and determine how many schemas were wrong that way. Some of them were wrong only in the code, some were wrong only in the database, and some were both. So we just started going down the list and determining what was wrong. At the same time, other people were trying to determine which features were visibly broken, because they had error logs or because their graphs looked bad, and they were going into the configuration file and turning those features off.
Once we were able to determine all the right ones to change, we ran the ALTERs on the database, changed those columns to 64-bit, and turned the features back on, and then we started looking to see what data might have gone bad during that time. The whole process lasted only a few hours, and the only things that broke were minor side features, but we didn't know how many features were broken until we had fixed all of them, because it wasn't obvious.
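(For illustration: the audit step of finding ID columns that were still 32-bit could be done by querying the database's information_schema. A sketch assuming MySQL and the PyMySQL driver; the host, credentials, schema name, and the "%_id" naming convention are hypothetical:)

    """Sketch of the column audit described above: find ID-like columns still
    typed as 32-bit INT, which overflow once values pass 2**31 - 1. Assumes
    MySQL's information_schema and the PyMySQL driver; connection details and
    naming conventions are hypothetical."""

    import pymysql

    QUERY = """
        SELECT table_name, column_name
        FROM information_schema.columns
        WHERE table_schema = %s
          AND column_name LIKE '%%_id'
          AND data_type = 'int'    -- should be 'bigint' for ticket-server IDs
    """

    def find_narrow_id_columns(schema):
        conn = pymysql.connect(host="db.example.com", user="audit",
                               password="secret", database=schema)
        try:
            with conn.cursor() as cur:
                cur.execute(QUERY, (schema,))
                return cur.fetchall()
        finally:
            conn.close()

    for table, column in find_narrow_id_columns("example_schema"):
        print("ALTER TABLE %s MODIFY %s BIGINT UNSIGNED NOT NULL;" % (table, column))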
InfoQ: That's a rather fast fix.
Sam: Yeah, and I mean that's one of the worst. I don't think we've had many problems worse than that in the last few years; I think we haven't had more than three hours of downtime in the past couple of years. That kind of thing happens very seldom, and it's very bad when it does.
For server monitoring, how do you deal with network performance monitoring? On the application side, how do you monitor all the way up to the user-experience level?
InfoQ: How do you do server monitoring? I mean, especially network performance monitoring and application-side monitoring.
Sam: For network performance I think we use Cacti, and we also use another tool that we wrote internally, which I believe is open source, called FITB.
So FITB looks at all the switches and keeps metrics on them. I think it's a lot like Cacti; I don't remember off the top of my head what the difference was, since I don't often deal with the network metrics, but we use those two tools for monitoring the performance of the switches and their various ports. We very seldom have trouble with that. The network is almost never giving us trouble; we have a lot of capacity.
InfoQ: What are the common bottlenecks for your system?
Sam: Much more often it's things like the database or memcache. Recently we had trouble with memcache that actually was network related. There was a particular key stored in memcache that many, many people needed to use, and our site traffic went up recently in the United States; it was Cyber Monday.
So traffic was very high, and that one key was being accessed such a large number of times that the memcache server holding it saturated its network connection. The fix for that was just to not store that data in memcache; it didn't need to be there, so we removed it. It was just bad coding, but it didn't show up until we saturated the network card.
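(For illustration: the reason a single hot key can saturate one box is that memcache clients hash each key to exactly one server, so every request for that key hits the same machine no matter how large the pool is. A simplified sketch with hypothetical hosts:)

    """Why one hot key saturates one server: client-side hashing maps a key to
    a single memcache host. Simplified sketch; real clients typically use
    consistent hashing, but a given key still lands on exactly one server."""

    import hashlib

    MEMCACHE_HOSTS = ["mc01.example.com", "mc02.example.com", "mc03.example.com"]

    def server_for(key):
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return MEMCACHE_HOSTS[digest % len(MEMCACHE_HOSTS)]

    # Every request for the hot key goes to the same host's network card.
    print(server_for("sitewide:hot_config_blob"))  # always the same host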
But yeah, we use Cacti, and while I don't know the resolution of that data, how often we collect it, we certainly use Nagios on it to determine that everything is all right.
For client-side data, we have our own set of beacons that log actions people take on the site, and that information gets put into our big data stack. For monitoring user behavior we have our Hadoop cluster, and we process all of these beacon logs for things like "I loaded the home page, clicked on this item, and then checked out"; we look at patterns that way. We also log JavaScript errors from the frontend to the backend. We have both server-side performance monitoring and things like webpage tests running often; we have a webpage-test cluster that we use for determining whether performance is okay or not.
We also correlate application metrics with user actions. We have metrics for things like people listing items for sale, which is some measure of how people are using the site; if that action stops happening, then maybe we changed something about the user experience. We also keep graphs of our help forums: if more people are asking for help in the forums, then we probably broke something. And then obviously there's the support team; we communicate with them often to try to determine user issues.
About the Interviewee
Sam Haskins works on the Core Platform team at Etsy, which focuses on the interface between the server and operations level and the engineers developing features. He joined Etsy in mid-2010 and currently works primarily on the database access layer. Sam graduated from Carnegie Mellon University with a BS in Mathematics.