Transcript
Palomino: Welcome to my QCon presentation on production infrastructure cloning, with an emphasis on reliability and repeatability. I am your host and emcee, J.D. Palomino.
Introduction
Firstly, a little bit of introduction. Right now, I'm an infrastructure engineer at Flexport, where my team and I maintain and procure all of our AWS and Kubernetes resources across our various accounts. As I mentioned, I'm working at Flexport. What's Flexport? We're a modern freight forwarder. We're trying to bring global shipping into the 21st century with good software. Our headquarters are in San Francisco, but we have offices all across the country and on three different continents. We currently have about 200 engineers, and a total headcount of about 2,000 employees.
QA Story
Suppose in your company there is a QA engineer. One day, they need to test a really important change. They deploy it to a QA environment, and everything looks spot on. All the tests pass, and all of the reviewers who were supposed to review the code have done so. It all looks great. They green-light it, and it goes to production. Unfortunately for you, however, there is a private DNS bug that wasn't surfaced in the QA account, because the private hosted zone didn't exist there, but it does in production. It fails in production, and you've got downtime. That's not good.
Load Test Story
Suppose that in your company there are important changes coming along, and you need to load test them. You need to make sure they're not going to break in production. You spin up a load test environment, and you give it all you've got. All the code seems to be working. There have been no massive CPU spikes, and the databases have been working correctly with no queuing. It all looks perfect. You and your team are happy, so you approve it. Unfortunately for you, however, the production system runs with a different server size. The load test environment may have had bigger servers, and the production environment can't really handle the same load. It starts failing. You've got downtime. That's not good either.
Sales Story
Suppose that your sales team pitches your product to a potential customer and they love it. You have all the features that they want, it's a spot-on match, and they want to buy. Before they do, however, they ask whether there is some sort of demo account where they can test their integration, some sort of sandbox. Unfortunately for you, that's where your salesperson scratches their head and says, "There is a sandbox account, but the code there is two months old. No one's maintaining the infrastructure there, so it goes down from time to time." The client doesn't like that. The client says, "Your competitor does have this. It gives us a lot of assurance that our integration will work well. Unfortunately, we're going to go with your competitor and not you." Just like that, you lost the customer.
Common Theme
So all these stories are unique and occur in different departments, but they all share a common theme. That theme is that businesses need as close a copy of live production as possible. You may need it for QA. You may need it for demos. You may need it for load testing. Maybe you even need it as a backup environment in case something happens to production, or there may be other use cases. One way or another, you will need this.
It Will Affect You
This will affect you. The issues we've talked about aren't specific to one industry; they're cross-industry. It's going to happen to your company, whether it's big or small. The one thing that is consistent is the consequences: you are going to lose money with downtime. The question now is what you are going to do about it. In the spring of 2020, Flexport faced this exact problem, and we wanted a solution. My team and I were tasked with providing one. We sat down for a while and thought, "What are some good principles that we need to follow? How can we stop this from happening? How can we get a reliable clone of production?" We came up with three key principles that would guide our implementation: mass production, velocity, and reliability of results.
Key Principles: Mass Production
The first one is mass production. When you think about mass production, I want you to think about it like an assembly line, like the Ford Model T assembly line, but instead of creating cars, we are creating new accounts and the infrastructure they need. The first thing we must keep in mind is that once all of this is set up for one account, we must be able to repeat it reliably again and again. Every time we stand up a new account, we need the same infrastructure over and over. For each of these accounts, there will need to be configuration and customization. For example, they might have different CIDRs for their IPs, or different hosted zones or DNS names that they should resolve. This sort of customization is a top priority; whatever assembly line we create needs to be able to handle it. It must also be able to handle different types of accounts, or different archetypes. For example, a data engineering account may look very different from a service platform account. Think about it like different models of cars. What's really important is that with all of these customizations and different archetypes, we keep the manual intervention to a minimum. This needs to be as automated as possible.
Key Principles: Velocity
For the next key principle, we have velocity. This one is much more straightforward to explain. We need to make sure that whatever we create with mass production can be operational ASAP. Not tomorrow, not in a week; it needs to be ready today. For this, we need a robust deployment pipeline. We need to make sure that if something happens when we're deploying, it's fine, it's no biggie, we can just retry, debug it easily, and make sure that it's self-healing. That also needs to be put in place.
Key Principles: Reliability of Results
Last but not least, we need to make sure that there's reliability of results. What I mean by that is, if you want to test and make sure that it looks like production, then you're going to need production-like data for production-like behavior. For example, you can't just have one user in the test account, because then, I promise you, you won't be able to simulate the sort of traffic that you'll be getting in production, where you have 2,000, 20,000, or 2 million users. You need production-like data there. If you have a service-oriented architecture, this doesn't mean just one select service; you need to be able to handle this for all of the services that you have. Lastly, you need to be able to sync with production at will. If important new data comes in, you need to be able to handle that as well.
Principles in Action: Mass Production
We took all these different principles, we thought about them, and we came up with an implementation piece for each of them. The first one was mass production. The state Flexport was in in 2020 was that we already had infrastructure as code via Terraform. We have a whole Terraform blog post; feel free to go and look at it for the wonders of infrastructure as code. However, we still had shared modules and handcrafted directories. We shared libraries, but with different implementations again and again. With these different implementations, we had ad-hoc add-ons, special [inaudible 00:09:00] IPs that some of them had, or databases that were here but not in other places. All of that was done manually. The problem here is that copying and pasting is not good enough. It's very manual, and it doesn't handle the customization. We needed to do better.
When we started this project in the spring of 2020, we sat down, thought about it for a while, and came up with a solution that we call Account Blueprints. We took a look at our IaC and realized that Terraform is great, but it did not go the whole way in templating as we wished. There were some things we found missing. What we decided to do was take all of these Terraform files that we would write for a given type of account, and they were often the same ones, and templatize the whole thing as well. We would be adding a templating wrapper on top of Terraform to make up for its deficits. We used ERB for this wrapper, as we're a Ruby [inaudible 00:10:06], so ERB was a natural choice.
Once we had this Account Blueprint, we made sure that we had a specific one for each of the different archetypes. This way we could handle all of them. The way the execution would work is that we would have one of these shared Account Blueprints, and then a config YAML specific to an account. Then we would run our ERB executor over it, and it would take the ERB templates and populate them with all of the values stored in the config YAMLs. Once that's done, we would have the account-specific infrastructure as code we were looking for.
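To make that concrete, here is a minimal sketch of what such an executor could look like. It is not our actual tool; the directory layout, file names, and config keys (accounts/loadtest.yaml, blueprints/service-platform/, vpc_cidr, and so on) are hypothetical stand-ins.

    # Render every ERB blueprint for one account using that account's config YAML.
    # Hypothetical layout: blueprints/<archetype>/*.tf.erb -> generated/<account>/*.tf
    require 'erb'
    require 'yaml'
    require 'fileutils'

    account = 'loadtest'
    config  = YAML.load_file("accounts/#{account}.yaml")  # e.g. vpc_cidr, hosted_zones, region

    Dir.glob('blueprints/service-platform/*.tf.erb').each do |template_path|
      template = ERB.new(File.read(template_path), trim_mode: '-')   # Ruby >= 2.6 keyword form
      rendered = template.result_with_hash(config: config)           # exposes `config` inside the template
      out_path = File.join('generated', account, File.basename(template_path, '.erb'))
      FileUtils.mkdir_p(File.dirname(out_path))
      File.write(out_path, rendered)
    end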
That's the solution we came up with, and this is a little bit of what it looks like. On the left, we have some Terraform templates with some ERB for loops at the beginning, as well as some conditional statements. On the right, we see the blueprint directories where we have all of the ERB files. A little lower, we see what happens when we execute the template generation: we get the different Terraform files based on the ERB ones, with all of the configuration applied to them. Just like that, we've got mass production.
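If you haven't mixed ERB and Terraform before, a toy example might help. This is not one of our real blueprints; the zone names and the enable_private_dns flag are made-up config values, but it shows how a loop and a conditional expand into account-specific Terraform.

    # Expand a list of hosted zones into Route 53 resources, optionally private.
    require 'erb'

    config = {
      'hosted_zones'       => ['internal.example.com', 'qa.example.com'],
      'enable_private_dns' => true,
    }

    template = <<~'TEMPLATE'
      <%- config['hosted_zones'].each do |zone| -%>
      resource "aws_route53_zone" "<%= zone.tr('.', '_') %>" {
        name = "<%= zone %>"
      <%- if config['enable_private_dns'] -%>
        vpc {
          vpc_id = aws_vpc.main.id
        }
      <%- end -%>
      }
      <%- end -%>
    TEMPLATE

    puts ERB.new(template, trim_mode: '-').result_with_hash(config: config)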
Principles in Action: Velocity
Moving on to the next principle, we have velocity. This one was much more straightforward at the time, in spring 2020. We already had CI/CD for the Terraform infrastructure-as-code changes. However, it did not take into account special circumstances such as account setup, or post-deployment tasks that needed to run afterwards to do some kind of cleanup. The solution for that was actually straightforward. We took our existing CI/CD and augmented it by adding pre-deploy customization hooks, followed by the infrastructure-as-code generation from the blueprint. Then we did the Terraform plan/apply as before. Last, we added the post-deploy hooks for customization. This was all done in a builder account in order to get out of the chicken-and-egg situation, and we made sure it could assume admin roles in the new account. Just like that, we got our infrastructure-as-code CI/CD.
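As a rough illustration of those stages, here is how a plain Ruby driver for such a pipeline could be laid out. The hook scripts, the generator path, and the directory names are hypothetical; a real CI/CD system would run these as separate, retryable steps.

    # Run the augmented pipeline: pre-deploy hooks -> blueprint generation ->
    # terraform plan/apply -> post-deploy hooks. Each step is safe to re-run.
    require 'open3'

    account = ENV.fetch('TARGET_ACCOUNT')  # the new account being spawned

    def run!(cmd)
      puts "==> #{cmd}"
      output, status = Open3.capture2e(cmd)
      puts output
      raise "step failed: #{cmd}" unless status.success?
    end

    run!("./hooks/pre_deploy.sh #{account}")            # account setup Terraform can't do yet
    run!("./bin/generate_from_blueprint.rb #{account}") # ERB blueprints -> account-specific *.tf
    Dir.chdir("generated/#{account}") do
      run!('terraform init -input=false')
      run!('terraform plan -out=tfplan -input=false')
      run!('terraform apply -input=false tfplan')
    end
    run!("./hooks/post_deploy.sh #{account}")           # post-deploy cleanup / customization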
Principles in Action: Reliability
Last but not least, we had reliability. For reliability, we wanted production-like data, and we already had the production databases with the data we needed. What we needed to do was take the data that was there and get it into this new account. We sat down for a while and realized that there is a rather simple pipeline we could use to get this working. The way it turned out, in the beginning, in the production account, we would create a shared encryption key that we could encrypt snapshots with, to make sure the transfer to the destination account was secure, and we would share that key with the destination account. We would then take that key and create encrypted snapshots of the databases of our choosing. Once those were done, we would share them with the new duplicate account that we were creating. In the duplicate account, once it saw these new snapshots, we would try to restore the databases from them. This would be one database for each service that had a database, so it would be done multiple times.
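Here is a hedged sketch of that hand-off using the AWS SDK for Ruby. It assumes a KMS key whose policy already grants the duplicate account access, and all identifiers (instance names, account IDs, snapshot names) are made up.

    # Snapshot a production database, re-encrypt it with a shared key, share it
    # with the duplicate account, and restore it there. Identifiers are examples.
    require 'aws-sdk-rds'

    SHARED_KEY_ARN    = 'arn:aws:kms:us-west-2:111111111111:key/example'  # policy already shared
    DUPLICATE_ACCOUNT = '222222222222'

    prod_rds = Aws::RDS::Client.new(region: 'us-west-2')  # production-account credentials

    # 1. Snapshot the chosen database, then copy it so it's encrypted with the shared key.
    prod_rds.create_db_snapshot(db_instance_identifier: 'orders-prod',
                                db_snapshot_identifier: 'orders-clone-src')
    prod_rds.wait_until(:db_snapshot_available, db_snapshot_identifier: 'orders-clone-src')
    prod_rds.copy_db_snapshot(source_db_snapshot_identifier: 'orders-clone-src',
                              target_db_snapshot_identifier: 'orders-clone-shared',
                              kms_key_id: SHARED_KEY_ARN)
    prod_rds.wait_until(:db_snapshot_available, db_snapshot_identifier: 'orders-clone-shared')

    # 2. Share the encrypted snapshot with the duplicate account.
    prod_rds.modify_db_snapshot_attribute(db_snapshot_identifier: 'orders-clone-shared',
                                          attribute_name: 'restore',
                                          values_to_add: [DUPLICATE_ACCOUNT])

    # 3. In the duplicate account, restore one instance per shared snapshot.
    dup_rds = Aws::RDS::Client.new(region: 'us-west-2')   # duplicate-account credentials
    dup_rds.restore_db_instance_from_db_snapshot(
      db_instance_identifier: 'orders-clone',
      db_snapshot_identifier: 'arn:aws:rds:us-west-2:111111111111:snapshot:orders-clone-shared',
      db_instance_class: 'db.r5.large'
    )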
Some cleanup needed to be done: we had to make sure we did not use the same database username and password as production. That would be a rather large leak. We just went and scrubbed them out and created new ones. Following that, we also modified the data as necessary on our side. For example, maybe you want to deactivate some special users; you can do so at that point.
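For the credential part of that scrub, one way to do it, assuming the hypothetical instance name from the restore step above, is to rotate the restored instance's master password via the SDK; any data-level scrubbing would then run as SQL against the clone.

    # Rotate the restored clone's master password so production credentials never
    # live in the duplicate account. Store the new value in your secrets manager.
    require 'aws-sdk-rds'
    require 'securerandom'

    dup_rds = Aws::RDS::Client.new(region: 'us-west-2')
    dup_rds.modify_db_instance(db_instance_identifier: 'orders-clone',
                               master_user_password: SecureRandom.hex(16),
                               apply_immediately: true)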
Once you have these new databases up and running and ready to be used in your production-like account, you can update the references that direct your service traffic to the old databases and redirect them to the new ones. In our case, we were using Route 53 entries; we updated them to point to the new RDS endpoints. Just like that, we were able to replicate the latest production-like databases into our duplicate account. Afterwards, we just destroyed the old databases and snapshots.
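That DNS flip could look something like this with the Ruby SDK. The hosted zone ID, record name, and instance identifier are, again, made-up examples.

    # Look up the restored instance's endpoint and UPSERT the CNAME that services use.
    require 'aws-sdk-rds'
    require 'aws-sdk-route53'

    rds      = Aws::RDS::Client.new(region: 'us-west-2')
    endpoint = rds.describe_db_instances(db_instance_identifier: 'orders-clone')
                  .db_instances.first.endpoint.address

    route53 = Aws::Route53::Client.new
    route53.change_resource_record_sets(
      hosted_zone_id: 'Z0000000EXAMPLE',
      change_batch: {
        changes: [{
          action: 'UPSERT',
          resource_record_set: {
            name: 'orders-db.internal.example.com',
            type: 'CNAME',
            ttl: 60,
            resource_records: [{ value: endpoint }]
          }
        }]
      }
    )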
Learnings
This was a large project; it took us five months, I want to say. We learned a lot along the way. The first thing that we learned, and I can't stress this enough, is that infrastructure as code is a fantastic investment. The availability we had, the repeatability, just being able to use GitOps with it, saved us so many times. It's a prerequisite for attempting any of this. Beyond just having the templates, it was also really good to have a teardown tool like AWS Nuke handy, because we would make small mistakes and then need to start from scratch again and again. Something like AWS Nuke that just blasts the whole thing to smithereens was really helpful when we wanted a fresh start.
Much like the emphasis we had on the infra CI/CD being idempotent and robust, we needed to make sure that the migration pipeline was idempotent as well, because things will go wrong there too. You need to be able to recover from them elegantly, so idempotency goes a long way in pretty much all of these use cases. Also, when you're trying to make all of these infrastructure changes, it's fantastic if your engineers can at least see the Terraform plans locally, so that they don't need to commit, push, and wait for CI/CD to reveal them. Having that sort of instant validation really helps with productivity.
Of course, everybody wants to be the hero and create their own templating thing, like what we did with ERB, but I'm telling you, it's not that good an idea. I don't recommend it. We had no choice in some aspects; in others, not so much. There are lots of things I would go back and rewrite. A word of advice: use any wrapper you put on top as sparingly as possible.
Current State
Where does that leave us now, in the fall? Flexport now has these archetype blueprints for the Kubernetes- and ECS-based service platforms, and we're working on new ones as we speak. We have the full production test data migration and duplication that I mentioned; that's up and running. With all of this, we've been able to get down to a four-hour spawn time, and everything can be done within a single day. We have successfully created a load test environment, and we've successfully created a demo environment. Those have been great to have, and we've really started to reap the benefits. People are quite happy with this change. Mission accomplished? Unfortunately not. There's always new future work that needs to be done, and new changes that can be applied. It's a constant struggle to get a better, more feature-rich, and more robust system. Definitely not mission accomplished yet. There's always more to do.
Thank you all for watching my presentation. You may contact me on my LinkedIn here. We are hiring at Flexport, in case you're interested in the sort of projects we've been discussing. I'd like to acknowledge my teammates, Hailong Li and Andrew Tsui. They have been a terrific help.