
How DoorDash Ensures Velocity and Reliability through Policy Automation


Summary

Lin Du discusses DoorDash's approach in detail, and how they enabled their engineers to self-serve infrastructure through policy automation while ensuring both reliability and high velocity.

Bio

Lin Du is a software engineer on the cloud team at DoorDash, where he focuses on infrastructure self-service for their cloud primitives and governance. Prior to DoorDash, he worked at Nutanix, mainly building hyper-converged infrastructure for on-prem private clouds.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Du: For this session, I want to start with a story about an incident that happened at DoorDash two years ago. Although our infrastructure engineers were able to quickly identify the problem, recreate a database cluster, and restore everything from backup, it still caused our order volume to drop a bit, as you can see from the screenshot. This is one of our daily order volume charts; I made the order volume blurry because it's sensitive data for us. It took us an hour to mitigate the problem, which is not good. During troubleshooting, our team tried to figure out how this issue occurred. We looked at the order microservice dashboards and metrics to identify correlations between different components, and we found that the only change made before this incident was this Terraform pull request in GitHub. Terraform is an infrastructure as code tool that allows us to manage our infrastructure more efficiently with human-readable files, and it automates the lifecycle of infrastructure changes for us. According to the plan on this Terraform pull request, you can clearly see there are 91 resources to be added and 92 resources to be destroyed. Those destroy operations were particularly concerning, and a big red flag for us. If you open up the plan output from this pull request, this is what it looks like.

It's still not done yet. There we go. It's almost impossible for people to review this pull request. The screenshot shown here is only one-third of the Terraform plan output from that pull request. Even if our core infra engineers are patient enough to review it, doing so is simply not a good use of our time. If you quickly go through the Terraform plan from this particular pull request, the majority of changes are AWS IAM role updates, security version updates, and secret version updates. They all seem like harmless updates. That's why, in our case, we had two senior engineers review, stamp, and approve the change request. I made their names and faces blurry so we don't blame them. If I were them, I probably would have done the same, because it's a really complicated Terraform pull request to review. Upon closer inspection of the Terraform pull request for that change, we discovered that two critical resource deletions were part of the change request, among all the other harmless updates, but were overlooked during the review process. When that pull request was merged, it destroyed one of our database nodes and one of our ElastiCache clusters. That's how this incident got triggered. It was pretty bad for us. These types of issues can easily be caught by policy automation.

Overview

I'm going to cover the following topics to help you identify some of the policies that could benefit your organization. By the end of the session, hopefully you'll be able to start writing similar policies that best fit your needs. My name is Lin. I'm a software engineer on the cloud team at DoorDash. My main focus is on infrastructure self-service for our cloud primitives and governance, and also some cost-efficiency work. I will walk you through the background of how we manage our infrastructure at DoorDash, including the automation tooling. I will also discuss some of the infrastructure as code tools that we leverage at DoorDash to help us manage all our infrastructure changes, and share some of the policies we wrote internally to help us increase engineering velocity and efficiency. Finally, I will briefly share the future of the infrastructure change workflow and what our team is currently working on, including the approach we are exploring.

History of Infrastructure at DoorDash

Just to give some background, DoorDash started its infrastructure as code journey back in 2017 with a small team managing the entire infrastructure. What engineers typically did was write HCL code locally. HCL stands for HashiCorp Configuration Language, which is mainly used for Terraform. They then ran a Terraform plan locally from their machine to see what was going to happen, and applied changes from local as well. This setup worked for the most part, and worked for a while for the team. Obviously, there are major downsides to doing this. What if another engineer tried to modify the same resource from their local machine? How do we keep the Terraform state consistent? Although we could leverage Terraform remote state config, it was still hard for us to know who did what and when. This setup also lacked collaboration between engineers. After some research and investigation, we started using Atlantis, an open source tool, in late 2018, and we're still using it today. The slide shows 2021 because that's when the incident happened, and by then we had already fully adopted the Atlantis workflow. Basically, it's a GitOps workflow: you push your Terraform code to a version control system, in our case GitHub, and Atlantis runs as an admin user operating Terraform on your behalf. Since then, we've been moving our engineering teams towards the Atlantis workflow, so users are no longer able to run any Terraform commands from their local machines.

Key Takeaways from the Incident

Even with the Atlantis GitOps workflow, if you think about the incident I mentioned at the beginning, it highlights the fact that humans are prone to error; even senior engineers can easily stamp or approve pull requests that are inevitably bad. There are a lot of chances for an engineer to introduce a new security risk, and, as in our incident, engineers can unintentionally destroy critical resources in production. Also, a lot of Terraform pull requests have long Terraform plans that are really time consuming to review, which creates a new bottleneck that requires more code reviewers and more time spent on approvals. This decreases engineering velocity. As DoorDash's business continued to grow, our cloud infrastructure expanded significantly. On one hand, we realized that if our team couldn't scale with that growth, we could become the new bottleneck that slowed down our engineering teams from deploying resources quickly. On the other hand, we also knew that our infrastructure team can only scale so much. We needed to find a balance between scaling the infrastructure team and ensuring everyone on the team can work more efficiently and effectively, so we are not the bottleneck for the wider engineering org.

Put Things Together to Win

As part of the post-mortem of that incident, we tried to figure out what kind of rules we could add as guardrails to prevent such an incident from happening again. We did some research, and this is how we put everything together at DoorDash at a very high level. Let me briefly walk you through how this workflow looks. In step 1, a user on the left side creates an infrastructure change pull request in GitHub. Once the user commits their change to GitHub, it triggers a webhook event to our Atlantis worker running on AWS as an ECS task, which is like a container on the cloud. Atlantis runs a Terraform plan of the latest commit on the pull request branch, while simultaneously pulling down all the rules we define from an S3 bucket. As you can see at the top, we have a dedicated GitHub repo to maintain all the policies we wrote internally, and a CI/CD pipeline to build the package and ship it to the S3 bucket. In step 3, Atlantis evaluates the Terraform plan against the rules we just downloaded from the S3 bucket, using a tool called Open Policy Agent. Finally, Atlantis posts all the evaluation results and the plan output back as GitHub comments. This enables everyone in our organization to see what the change was, who made it, who approved it, and whether the change was expected or not. Once the user has received all the necessary approvals, they can run Atlantis apply in step 7. Atlantis then drives all the infrastructure changes to make sure the resources get provisioned in whatever cloud provider your company has configured; in our case, it's AWS. This is a very high-level view of how everything fits together at DoorDash. Now let's zoom in on each of the components to get some details.

One of the core components of policy automation is OPA, Open Policy Agent. It's a policy engine that enables us to write policy in a single language and enforce it across different stacks, including microservices, Kubernetes, CI/CD pipelines, API gateways, and more. Here is how the OPA policy engine works in general. It integrates with any JSON input, which makes working with Terraform easier, since Terraform can generate JSON output. In this example, we are parsing a JSON file to serve as input for the Open Policy Agent. OPA then evaluates the JSON input against the policy we defined in step 2, and generates a decision accordingly, either deny or allow. You might be wondering what language is used to write the policies in this GitOps workflow. OPA uses a language called Rego, a policy language for defining policies that can be evaluated by the OPA engine. For engineers who are familiar with imperative languages like JavaScript or Python, Rego looks a bit different: it's a declarative policy language, and every policy we're going to talk about here is written in Rego.

I just want to give you a high-level view of how Rego syntax works. Every Rego file starts with a package name, which defines the scope of the policy, and all policies with the same package name are within the same namespace. This is similar to other programming languages. In this example, we defined two packages: on the left side is a utility package, and on the right side are the demo IAM role policies. The utility package is where we define all the helper functions, so other packages can import it and use the functions from there. In this example, we defined a couple of functions: the first checks whether a resource change is a no-op, the next makes sure the operation is not a delete operation, the next gets all the module source addresses, and the last gets all the resource change types from the input. On the right side is our main IAM role policy, which includes one of our hardcoded AWS IAM role policies that only allows access from our internal VPN CIDR range. The final decision depends on the truth values of all the conditions in the deny block. In our case, if any AWS resource change updates an IAM role policy without using our allowed VPN CIDR range, a deny message shows up. This is a very simple example of how Rego syntax works. For anyone who wants to learn, the official Rego documentation is a pretty good place to start; it has a lot of good examples and explanations. For anyone who wants to try the Rego syntax, I would recommend the Rego Playground. It's a very good place.
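To make that concrete, here is a minimal sketch of what such a pair of policies could look like. This is not DoorDash's actual code: the package names, helper names, and VPN CIDR are illustrative assumptions, and it assumes the Terraform plan JSON is what gets passed in as input.

```rego
# util.rego: shared helper package (illustrative; one package per file)
package lib.util

# true when the planned change deletes the resource
is_delete(change) {
    change.actions[_] == "delete"
}

# source addresses of every module called from the root module
module_sources[src] {
    src := input.configuration.root_module.module_calls[_].source
}
```

```rego
# iam_role.rego: deny IAM role policies that skip the VPN range (hypothetical CIDR)
package main

import data.lib.util

allowed_vpn_cidr := "10.0.0.0/8"  # placeholder for the internal VPN CIDR range

deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "aws_iam_role_policy"
    not util.is_delete(rc.change)
    not contains(rc.change.after.policy, allowed_vpn_cidr)
    msg := sprintf("%v must restrict access to the internal VPN CIDR range", [rc.address])
}
```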

I want to show a real example of how you can test your OPA logic before it goes to production. Here we take one of the Terraform plan JSON files as input. We define a single OPA policy that denies any S3 bucket creation without a lifecycle rule attached. If you look at the input on the right side, we don't have any lifecycle configuration defined there. That's why, when you evaluate this plan, you get a deny message at the bottom. This is a very straightforward way to test your OPA logic and syntax; it doesn't require any local installation, and it's what I do very often. OPA seems like a good solution for anyone who wants to start with policy automation, but it didn't quite fit our particular case at DoorDash, because OPA is primarily designed for server-side services, and many people use it either as a library or as an independent server. Around that time, we discovered another open source tool called Conftest. Unlike OPA, Conftest provides a simple CLI, which makes it easy for us to run the policies locally and to integrate them with the CI/CD pipeline. Additionally, Conftest is mainly for testing data against policy assertions in a very CLI-driven way. Also, OPA only supports JSON input, whereas Conftest supports a wide range of data formats, including HCL, Dockerfile, INI files, and more, which is especially useful when you're doing static code analysis on Terraform config, or when you want to write very comprehensive policy automation.
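As a sketch of that Playground example (assuming an older AWS provider setup where lifecycle_rule is an attribute of aws_s3_bucket, rather than the separate aws_s3_bucket_lifecycle_configuration resource), the policy could look something like this:

```rego
package main

# deny creating an S3 bucket that has no lifecycle rule configured
deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "aws_s3_bucket"
    rc.change.actions[_] == "create"
    count(object.get(rc.change.after, "lifecycle_rule", [])) == 0
    msg := sprintf("S3 bucket %v must have a lifecycle rule attached", [rc.address])
}
```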

Now we have all the necessary tools for doing the policy automation, but everything I have demonstrated so far only runs locally. How do we integrate all the tools with our pipeline so everything can be done automatically? First, I want to show you what Conftest offers us. To get started, we define a couple of Terraform files here, and we also define one of the Rego policies in the same directory. Then you run the terraform plan command to generate the plan and save it to a file. Next, you run a terraform show command to convert the plan into JSON format. Finally, you run a conftest command to evaluate the Terraform plan JSON file against the policy we defined. Here, the conftest command returns a nonzero exit code along with a failure message indicating that we didn't use any tagging configuration in our Terraform config. That's why you get all these failure messages.
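Roughly, the local steps described above look like the following sketch; it assumes the Rego files live in a local policy/ directory, which is Conftest's default policy location.

```shell
terraform init
terraform plan -out=tfplan                  # generate the plan and save it to a file
terraform show -json tfplan > tfplan.json   # convert the plan to JSON
conftest test tfplan.json                   # evaluate the plan against policy/*.rego
```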

To go back to the point, everything I've done here runs locally, so we need something that can run it all automatically for us. At DoorDash, we use a tool called Atlantis. It lets us define a customized workflow as part of the Terraform plan step: we output the plan in JSON format so we can evaluate it using Conftest from a helper script as part of the Atlantis workflow. It works super well for us. All you have to do is open a pull request and commit your changes to GitHub. Atlantis will automatically run the Terraform plan for you and post the results back to GitHub as comments, or you can manually type atlantis plan in a GitHub comment, which does the same thing. Once you've got all the necessary approvals for your change request, you run atlantis apply, and it provisions all the resources for you. Once that's done, your pull request gets merged automatically.

Now we have built the entire streamlined pipeline. The next question is, what if there is a policy failure? What can we do to alert the appropriate team to take a closer look at the pull request? At DoorDash we use a service called PullApprove, which is a framework for code review assignment and policies, and it integrates with GitHub directly. It allows us to determine which reviewers to pull in for a pull request based on GitHub status checks, including the OPA status checks. It is configured with a .pullapprove.yml file at the root of your GitHub repo. In this example, we configure PullApprove to only be triggered when the Atlantis plan ran successfully. Then we define which teams are required to review a particular pull request based on the status checks. In this case, if the OPA check for the cloud team fails, we'd like the cloud team to take a closer look at the pull request and check all the changes. This is how the OPA status looks on the GitHub pull request. In this example, it shows that both the cloud and cost OPA checks failed. If you look at the PullApprove status, you can see it now requires both the cloud and the cost efficiency teams to review this particular pull request. Here I just took a screenshot of the cloud team.

Now it should make more sense how everything is integrated together. There are various ways to support this setup. You could also use GitHub Actions, which already support Conftest and Terraform, to do the same thing, or you might just want to integrate Conftest as part of your CI/CD pipeline after Terraform runs. The implementation may vary depending on the specific requirements of your company. The screenshot shown here is just how we implemented it at DoorDash to best fit our needs.

What Kind of Policies Could We Write?

Since we've already built the entire pipeline end-to-end, the only remaining question for this policy automation is: what policies can we write to help us prevent such an incident from happening again, and what simple recipes can I share so your teams can get the benefits as well? I would like to categorize the policies into four different types. The first is reliability. A good scenario is when you're doing code review for infrastructure changes and the change is very complex, like the incident I mentioned at the beginning: you want to make sure no deletion happens on certain critical resources. You also want to make sure pull requests don't have too many changes, and you might want to write a policy so that users can only update certain attributes of particular resources.

I just want to show you a real example of what this reliability policy looks like. This is our critical-resources-in-production policy. First, we define what the critical resources are for us. In this case, we consider an ElastiCache cluster, a database node, an S3 bucket, and network routes to be critical resources at DoorDash. Then we check whether any critical resource is deleted as part of the Terraform plan. If we detect a deletion in the Terraform plan, you get a deny message requiring a core infra admin review for this particular pull request. Now, if any engineer creates a pull request that involves deleting any of the critical resources, the OPA check fails, and you get an error message asking you to reach out to the core infra admins for additional review. When the core infra admins get notified about this OPA check failure, they already know what to look for, and they will check whether the resource deletion is expected or not.
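A minimal sketch of such a policy might look like the following; the set of critical resource types mirrors the ones named above, but the exact type names and message are illustrative assumptions rather than DoorDash's actual rule.

```rego
package main

# resource types we treat as critical in production (illustrative list)
critical_types := {
    "aws_elasticache_cluster",
    "aws_db_instance",
    "aws_s3_bucket",
    "aws_route"
}

# deny any plan that deletes a critical resource
deny[msg] {
    rc := input.resource_changes[_]
    critical_types[rc.type]
    rc.change.actions[_] == "delete"
    msg := sprintf("%v is a critical resource and would be destroyed; please request a core infra admin review", [rc.address])
}
```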

Often, a Terraform pull request is very complex and long, and involves various Terraform modules to provision the different resources required for a task. To optimize review time, we can have well-written Terraform modules with predefined parameters that an engineer can modify within specific limits, which reduces the chance of mistakes. We can then write a policy so that anyone using our pre-approved Terraform modules doesn't require any additional reviews. This reduces the burden on infrastructure engineers by not requiring them to review certain types of pull requests. We can also write policies so that updates to particular attributes of some resources don't require review at all. This is the type of policy I'd call a velocity policy, and it's super useful for large organizations with many teams constantly modifying their cloud resources. We can do something like this: first, we define which Terraform modules we wrote internally. Then we check whether any resource change in the Terraform plan uses a module, and we look up the module source addresses to see whether they are part of our pre-approved list. If not, you get a deny message saying you are not using our pre-approved Terraform modules to create the cloud resources. This policy not only helps us increase engineering velocity, it also encourages people to use the Terraform modules we wrote internally, since those modules are designed mainly to reduce human error and make sure all resources are provisioned consistently. Anyone who doesn't use a pre-approved Terraform module gets the deny message asking them to do so.
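A sketch of that check against Terraform's plan JSON could look like this; the module source strings in the allow-list are hypothetical placeholders, not DoorDash's real module addresses.

```rego
package main

# hypothetical allow-list of internally maintained module sources
approved_sources := {
    "app.terraform.io/example-org/microservice/aws",
    "app.terraform.io/example-org/tags/aws"
}

# deny module calls whose source is not on the pre-approved list
deny[msg] {
    call := input.configuration.root_module.module_calls[name]
    not approved_sources[call.source]
    msg := sprintf("module %q uses source %q, which is not a pre-approved Terraform module", [name, call.source])
}
```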

The next important category is cost efficiency policies. It is crucial for large organizations to optimize their cost spending. When I joined DoorDash, our AWS resource tagging rate was low by industry standards, and the tagging values were applied inconsistently in terms of enumeration and formatting. This created a challenge for our FinOps team in doing any cost optimization work and gaining visibility into resource ownership. To address these issues, we really needed to create a unified tagging standard that could be adopted by all the engineering teams to increase cost visibility, as well as unlock efficiency work in cost forecasting, optimization, and automation. A good way to enforce tags on every pull request in GitHub is to leverage our OPA policies: we can write a policy enforcing that every cloud infrastructure change pull request requires tags. Another good example of a cost efficiency policy is making sure every S3 bucket creation has a lifecycle rule attached. Without a lifecycle rule, unused data can result in unnecessary storage costs. By requiring engineers to configure a lifecycle rule for their S3 buckets, we can manage the object lifecycle in the bucket and automatically transition objects into different storage tiers. We could also write policies that ensure we are using the appropriate volume type or instance type, based on many different factors, for cost-saving purposes.

I just want to show a tagging enforcement policy as an example of an efficiency policy. In order to have a tagging standard, we wrote a tagging module internally, which contains our self-defined metadata for our microservices and all the predefined tags. It's basically a map between the microservices and all the tagging values. When an engineer instantiates our tagging module from their project, all the relevant tags are pulled in automatically for their cloud resources when they deploy. In this example, first we check whether a default_tags configuration is used as part of the AWS provider config, which is a requirement to leverage the default tagging feature from AWS. Then we check whether any module call is present in the Terraform plan. If there is, we check whether the DoorDash tagging module is part of that module call list, which means you are actually instantiating our tagging module as required. If not, you get a message saying you need to include your tagging configuration. This policy not only helped us speed up the review process, it also makes sure our tagging values get applied accurately and consistently across all cloud resources. Anyone not including any tagging values in their pull request gets this deny message on their PR.
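Here is a rough sketch of those two checks. The tagging module source string is a hypothetical placeholder, and it assumes the AWS provider block and module calls appear in the plan's configuration section in the usual way.

```rego
package main

# the AWS provider must configure default_tags
deny[msg] {
    not input.configuration.provider_config.aws.expressions.default_tags
    msg := "AWS provider must set default_tags so resources inherit the standard tags"
}

# the internal tagging module must be instantiated (placeholder source address)
deny[msg] {
    sources := {src | src := input.configuration.root_module.module_calls[_].source}
    not sources["app.terraform.io/example-org/tags/aws"]
    msg := "include the internal tagging module so the standard tags are applied"
}
```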

The last category is security policies. It's also important for organizations to make sure all the resources being provisioned comply with the security policies. With everyone making pull requests to change their infrastructure resources, it's very hard for our security team to keep up with all the changes and ensure all the resources meet the security regulations. We can write policies to prevent security vulnerabilities from being introduced into production, and make sure everything we provision meets the security requirements. I just want to give you a very simple security policy. This policy simply checks whether any AWS security group ingress rule has port 22, the SSH port, open with a CIDR range of 0.0.0.0/0, which means SSH is open to the entire network, and we don't want that to happen. If we detect this particular change, you get a deny message saying it requires the security team to review it.
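A minimal sketch of that rule, assuming the ingress rule is expressed as a standalone aws_security_group_rule resource, could be:

```rego
package main

# deny ingress rules that open SSH (port 22) to the whole internet
deny[msg] {
    rc := input.resource_changes[_]
    rc.type == "aws_security_group_rule"
    rc.change.actions[_] == "create"
    rc.change.after.type == "ingress"
    rc.change.after.from_port <= 22
    rc.change.after.to_port >= 22
    rc.change.after.cidr_blocks[_] == "0.0.0.0/0"
    msg := sprintf("%v opens SSH to 0.0.0.0/0; this change requires security team review", [rc.address])
}
```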

Anyone new joining the team might copy and paste code from external sources. In our case, you might ask ChatGPT, how do I write Terraform code to create an EC2 instance with the SSH port open? This is the code ChatGPT generates for you. I'm not trying to discourage people from using ChatGPT, but you have to pay attention to precisely what you're asking in order to get an accurate response. In this case, if you are an inexperienced infrastructure engineer, you might think the code looks fine, copy it, and push it to production. This is a very common scenario leading to security issues for companies. With the security policy in place, the engineer's code is checked for compliance, and if we detect any security rule that leaves SSH open to the whole network, you get a message saying it requires additional security review. This helps us prevent potential security issues.

Demo

I have a quick demo. This just shows that we can define a PullApprove YAML file at the root of the GitHub repo, where we define the required groups for a particular OPA check failure. In this case, if the OPA check for the cloud team fails, we require the cloud team for review; if the OPA check for the core infra cloud admins fails, we require a core infra admin review. It's similar for all the other OPA policies. This is how you configure your PullApprove configuration. Next, I can show you how we create a pull request in GitHub. What if we remove the S3 bucket configuration? I just comment out all the Terraform configuration, which essentially tells Atlantis we are going to remove the S3 bucket. Once you do that, you get an OPA policy check failure for the cloud admins, because, as I mentioned, the S3 bucket is one of the critical resources we define in our policies. Since the policy failed, if you look at PullApprove, it now requires a core infra admin to review your change request.

I just want to show you another quick example; this is the efficiency tagging demo. We create a pull request here. I just created a simple VPC subnet and an EC2 instance. In the provider configuration, I comment out the default tags configuration, which tells Terraform not to apply any default tag values. I also comment out the tagging module we wrote internally. When this pull request runs, you see the cost efficiency check fail. Now if we look at the PullApprove configuration, it is asking for the cost efficiency team to review. The cost efficiency team gets notified, takes a look at the pull request, and sees that you are missing tags in your change request. You can easily fix the issue by looking at the results on the pull request. It's the same for all the other policies I already talked about.

What Were the Outcomes?

What are the main benefits of doing policy automation at DoorDash? By automating the policy checks, the required review time for our infrastructure changes is significantly reduced, because OPA can help us quickly identify issues based on the results of the policy evaluations. This freed up a considerable amount of time for our infrastructure engineers, so they can work on more important projects rather than performing code reviews daily. With policy automation, a lot of security violations can be flagged and identified at an early stage. From the pull request, you can easily look at the evaluation results to know what went wrong and fix it yourself. This ultimately helped us reduce the number of incidents caused by infrastructure pull requests. Additionally, policy automation has helped us increase our tagging coverage from 20% to 97% in less than a year. That improvement unlocked our cost efficiency team to do cost optimization work, and we are able to manage our resources more efficiently.

The GitOps workflow works great for us, but during our journey we discovered that running Terraform is not meant for non-infrastructure engineers, as it requires time for them to learn and relearn the Terraform-Atlantis workflow, which is not part of their daily work. They have to spend a good amount of time understanding the configurations in order to make an appropriate code change for their resources. Additionally, a lot of engineers tend to copy and paste code from others, leading to a lot of code duplication in our codebase. As for our core infrastructure engineers, the GitOps workflow involves many different components, increasing the system complexity and the maintenance effort.

What's the Future?

The question is: is there any way we can further increase engineering velocity, not only for our core infrastructure team, but also for all the backend teams as well? Is there a way we can create a simple, streamlined interface or another abstraction layer that allows engineers to self-serve infrastructure creation without fear of breaking anything, or without worrying about reviewers at all? There are many different ways to increase engineering velocity and simplify resource creation. At DoorDash, one approach we are taking is to create a developer-friendly self-serve platform with a simple UI. Now engineers can create and provision resources through this self-serve platform with a few clicks and minimal configuration, which is super helpful for non-infrastructure engineers. As core infrastructure engineers, we only expose the most common resource configurations that are relevant to backend engineering needs, rather than overwhelming them with all the possible configurations. Additionally, since we have full control of our resource primitives, we can reduce the investment in the GitOps workflow. This example shows a user requesting an AWS Elastic Container Registry repo from the self-serve platform. This approach gives us a no-review option: anyone who uses our self-serve platform to create an infrastructure resource doesn't require any additional review at all, which speeds up deployment time for the backend engineering teams as well. To go back to my original point, each company has its own unique, specific requirements. The self-serve platform is not the best fit for everyone, since it requires a significant engineering investment to develop both the backend and the frontend. Regardless of the chosen approach, policy automation is always the first crucial step. It helps us identify problems quickly, enforce best practices, and ensure we have efficient infrastructure for the company. Hopefully, my talk has inspired you to start thinking about how policy automation works and how it best fits your organization, or maybe your team has already started doing the same, which is great. You might be facing the same problems we do today at DoorDash. Policy automation is always the first step when it comes to self-serve.

How Diff is Handled at DoorDash

Du: How do we handle the diff at DoorDash?

Normally, we require at minimum one reviewer for any infrastructure change pull request. If you want to make a pull request, you need at least one reviewer from your team to take a look at it first. The only time OPA comes into the picture is when you're breaking the rules. Like in the examples I showed earlier, if you are not using tagging, or if you are trying to destroy resources, that's when the other teams get notified and alerted so they can jump in. They know what's going to happen because you're breaking certain rules. We don't have any policy that focuses mainly on the diff. Our OPA policies mainly focus on the results of the plan: what are you trying to change, and what action are you actually trying to take? That's where we get the benefit from the OPA policies.

Preventing Critical Resource Deletion, Other than Utilizing OPA

Is there any other way we can prevent resources from getting deleted, other than using OPA?

We do have other mechanisms to prevent such critical resource deletions from happening; we don't rely on OPA alone, because OPA, as you mentioned, is only for humans to look at proactively to prevent such issues. We also leverage AWS SCP rules. This way we can lock down the Atlantis permissions, so Atlantis won't be able to destroy particular resources at all. Even if a destroy shows up and gets past the OPA check and the engineers' review, you're still not able to destroy those resources, because we have other guardrails added on top of OPA to prevent such an issue from happening.

Questions and Answers

Participant 1: I was wondering how you approach new AWS services, or changes in requirements from AWS, and Terraform changes as well. What is your approach to reviewing the changes in the policy? For example, sometimes AWS will come up with new services, or they'll change the requirements from an IAM perspective for what is needed to create a resource. Terraform will do the same thing; they'll sometimes come up with new ways of creating the same resource. I'm just wondering how much effort you had to put into initially creating these policies, and then what the ongoing effort is to maintain them and make sure no one can get around them, whether it be a new resource or new Terraform syntax for creating the same resource.

Du: We do face some of those issues at DoorDash as well. For example, we have a lot of inline security rules, and sometimes AWS updates the recommended approach, like using standalone resources instead of rules embedded in other resources in your Terraform. So we do face that sort of issue. As of today, we don't have a good way to solve all these dependency issues. Every time we come across this, we recommend users use whatever the latest recommended way is to provision the resources; we don't have an OPA policy to ensure that happens. We do have a lot of legacy resources still using very old versions of Terraform and Atlantis to provision resources the old way. For every new project, we recommend people use the new ways to provision resources and leverage the new attributes. For the older resources, we don't leverage OPA to check on that. OPA is only for the particularly critical, important factors for DoorDash.

Participant 2: You described setting up OPA first and then determining that it probably wasn't as good as Conftest. I'm just wondering, if you were setting this up from day one, would you use Conftest directly, or do you still think there are other things that are worth having OPA do?

Du: Actually, when we started to think about this workflow, we used Conftest directly, because at that time there was no CLI option for OPA, and it was very hard for us to integrate it with our CI/CD pipeline. The way we use the Atlantis workflow and Conftest is not only for the cloud resource pull requests; we also integrate it with our CI/CD pipeline. At that time, OPA didn't have a CLI option for us, so it was very hard to integrate anything with it. That's why we chose Conftest at the very beginning. We stuck with it and it has worked super well for us.


Recorded at:

Feb 15, 2024
