The Great Lambda Migration to Kubernetes Jobs—a Journey in Three Parts

Key Takeaways

  • Infrastructure as code has become the backbone of modern cloud operations, but its benefits in configuring systems are just the tip of the iceberg.
  • Serverless is a common infrastructure choice for growing SaaS companies, as it’s easy to get started, provides minimal overhead, and complements IaC practices.
  • As an ephemeral runtime, serverless breaks down with overly complex, high-scale systems and also carries a high cost tradeoff.
  • Our IaC backbone provided an added layer of portability and extensibility, and exploring new systems to migrate to became less daunting (starting with Elastic Container Service and then Kubernetes).
  • Kubernetes eventually afforded us the cost control, visibility at scale, and portability we needed to maintain the high throughput, volume, and scale our systems required.

The world of Infrastructure as Code has taken cloud native by storm and today serves as a general best practice when configuring cloud services and applications. When cloud operations grow exponentially, which happens quite rapidly with today’s SaaS-based and cloud-hosted on-demand applications, things quickly start to break down for companies still leveraging ClickOps, which creates drift in cloud apps through manual configuration.

It’s interesting to note that while IaC is a best practice and provides many benefits (including avoiding drift), we witnessed its true value when we needed to undertake a major infrastructure migration. We found that because we leveraged the power of IaC early and aligned ourselves to best practices in configuration, formerly complex migration processes became much simpler. (Remember companies like VeloStrata and CloudEndure that were built for this purpose?) When we talk about cloud and vendor lock-in, we have now learned that how we package, configure, and deploy our applications directly impacts their portability and extensibility.

In this article, I’d like to share our journey at Firefly on a great migration from serverless to Kubernetes jobs, lessons learned, and the technologies that helped us do so with minimal pain.

Station I: Our Lambda Affair

Serverless is becoming a popular architecture of choice for many nascent and even well-established companies, as it provides all of the benefits of the cloud (scale, extensibility, elasticity) with the added value of minimal management and maintenance overhead. Not only was it fast and scalable, it was also quite fun to build upon.

Lambdas, functions, the services that connect them, and event-based architecture are a playground for developers to experiment in and rapidly iterate on formerly complex architecture. This is particularly true with Terraform and Terraform modules built for just this type of job. It was suddenly possible to build infrastructure to support large-scale concurrent operations, through lambda runners, in hours, something that used to take days or weeks.

Over time, though, we started encountering issues due to our event-driven architecture and design. With the diversity of services required to have our data and flow work properly (API Gateway, SQS, SNS, S3, EventBridge, and more), the number of events and their inputs/outputs started to add up. This is where we started to hit the known serverless timeout wall. As serverless is essentially an ephemeral runtime, a task has a limited window to complete: an AWS Lambda invocation can run for at most 15 minutes. If a task does not complete in time, it fails.
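To make the timeout wall concrete, here is a minimal Go sketch of budget-aware batch processing. It is not our production code: `processWithinBudget`, its parameters, and the item names are hypothetical. The idea is to stop picking up new work once the remaining time budget is smaller than the estimated cost of one more item, so leftovers can be re-queued instead of the whole task being killed mid-item.

```go
package main

import (
	"fmt"
	"time"
)

// processWithinBudget handles items one at a time and stops early when the
// remaining budget is smaller than the estimated cost of one more item.
// It returns how many items were handled; the caller re-queues the rest.
func processWithinBudget(items []string, budget, perItem time.Duration, handle func(string)) int {
	deadline := time.Now().Add(budget)
	for i, item := range items {
		if time.Until(deadline) < perItem {
			return i // not enough time left: stop cleanly
		}
		handle(item)
	}
	return len(items)
}

func main() {
	// A 50 ms budget stands in for the 15-minute limit (assumed values).
	items := []string{"a", "b", "c", "d", "e", "f", "g", "h"}
	done := processWithinBudget(items, 50*time.Millisecond, 20*time.Millisecond, func(string) {
		time.Sleep(20 * time.Millisecond) // simulated work
	})
	fmt.Printf("processed %d of %d items; re-queue the rest\n", done, len(items))
}
```

In a Lambda handler, the budget could presumably be derived from the deadline carried by the invocation context. This only softens the failure mode; it does not remove the limit itself.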

We started to realize that the honeymoon might be over and that we needed to rethink our infrastructure choice for the specific nature of our use case and operation. When you go down the microservices route (and in our case, we chose to leverage goroutines for concurrent services, so we’re talking about a lot of services), you often start to lose control of the number of running jobs and services.

Our “microservices to rule them all!” mindset, which we formerly took as a sign of our incredible scalability, was ultimately also the source of our breakdown. We tried to combat the timeouts by adding limitations, but this slowed down our processes significantly, which is not a good thing for a SaaS company and certainly not the outcome we hoped for. When we increased our clusters, this incurred significant cost implications, which is also not ideal for a nascent startup.

The technical debt accumulated, and this is when we started to consider our options: rewrite or migrate? What other technologies could we look at or leverage without a major overhaul?

Station II: A Stop at ECS (Elastic Container Service)

The next stop on our journey was ECS. Choosing ECS was actually a byproduct of our initial choice for packaging and deploying our apps to serverless. We chose to dockerize all of our applications and configure them via Terraform. This early choice ultimately enabled us to choose our architecture and infrastructure.

We decided to give ECS a try largely because of its profiling capabilities and the fact that, unlike with serverless, there are no time limitations on processing tasks, events, and jobs.

The benefit of ECS is its control mechanism, the core of its capabilities, where AWS manages task scheduling, priority, what runs where, and when. However, for us, this was also a double-edged sword.

The nature of our specific events required us to have greater control over task scheduling, such as finer-grained prioritization, ordering of tasks, and pushing dynamic configurations based on predefined metrics and thresholds. We needed not just static programmatic limits, but ones that are dynamic and leverage telemetry data. For example, if a specific account or tenant is overloading or spamming the system, we want to limit its events dynamically, with custom configurations per tenant.
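The per-tenant control described above can be sketched in Go as a small limiter whose caps are adjustable at runtime. This is an illustrative sketch only: `TenantLimiter`, its methods, and the tenant names are hypothetical, and a real system would feed `SetLimit` from telemetry rather than call it by hand.

```go
package main

import (
	"fmt"
	"sync"
)

// TenantLimiter caps the number of in-flight jobs per tenant.
// Limits can be changed at runtime, for example from telemetry data.
type TenantLimiter struct {
	mu           sync.Mutex
	defaultLimit int
	limits       map[string]int // per-tenant overrides
	inFlight     map[string]int // currently running jobs per tenant
}

func NewTenantLimiter(defaultLimit int) *TenantLimiter {
	return &TenantLimiter{
		defaultLimit: defaultLimit,
		limits:       make(map[string]int),
		inFlight:     make(map[string]int),
	}
}

// SetLimit overrides the cap for one tenant, e.g. to throttle a noisy one.
func (l *TenantLimiter) SetLimit(tenant string, limit int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.limits[tenant] = limit
}

// TryAcquire reserves a slot for the tenant and reports whether it succeeded.
func (l *TenantLimiter) TryAcquire(tenant string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	limit, ok := l.limits[tenant]
	if !ok {
		limit = l.defaultLimit
	}
	if l.inFlight[tenant] >= limit {
		return false
	}
	l.inFlight[tenant]++
	return true
}

// Release frees a slot previously reserved with TryAcquire.
func (l *TenantLimiter) Release(tenant string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.inFlight[tenant] > 0 {
		l.inFlight[tenant]--
	}
}

func main() {
	limiter := NewTenantLimiter(2)
	limiter.SetLimit("noisy-tenant", 1) // throttle one tenant only

	fmt.Println(limiter.TryAcquire("noisy-tenant")) // true
	fmt.Println(limiter.TryAcquire("noisy-tenant")) // false: cap of 1 reached
	fmt.Println(limiter.TryAcquire("quiet-tenant")) // true: default cap applies
}
```

The point of the sketch is that the limit is data, not code: changing a tenant's cap does not require a redeploy, which is the kind of control we found hard to get from a scheduler we did not own.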

When we analyzed the situation, we realized that what was lacking was a “controller,” or an operator in the Kubernetes world. (And this is a great article on how to write your first Kubernetes operator, in case you’d like to learn more).

Station III: Our Journey Home to Kubernetes Jobs

Coming back to our choice of using containerized lambdas, we realized we were not limited to AWS-based infrastructure as a result of this choice, and suddenly an open, community-standard option started to look like the right move for us and our needs.

If we were to look at the benefits of migrating to Kubernetes, there were many to consider:

  • With Kubernetes jobs, you have an operator that enables much more dynamic configuration
  • As an IaC-first company, Helm was a great way to configure our apps
  • Greater and finer-grained profiling, limits, and configuration, at practically any scale

For us, the ability to manually configure and manage CPU and memory allocation, as well as customize and automate this through deep profiling, was extremely important. This is particularly true at a scale comprising a diversity of clients with highly disparate usage behavior, where one tenant’s job can run for two hours and another’s for only three seconds. This customizability was a key feature for us and what eventually convinced us the move was necessary.
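As a rough illustration of profile-driven sizing, the Go sketch below turns observed peaks into Kubernetes-style resource quantities with some headroom. The `Profile` fields, the 20% headroom, and the floors (100m CPU, 64Mi memory) are assumptions made for the example, not values from our system.

```go
package main

import "fmt"

// Profile summarizes what profiling observed for one tenant (hypothetical fields).
type Profile struct {
	PeakCPUMilli  int // peak CPU usage in millicores
	PeakMemoryMiB int // peak memory usage in MiB
}

// Resources holds Kubernetes-style quantity strings such as "500m" and "256Mi".
type Resources struct {
	CPU    string
	Memory string
}

// atLeast returns v, or floor if v is smaller.
func atLeast(v, floor int) int {
	if v < floor {
		return floor
	}
	return v
}

// requestsFor adds headroom on top of the observed peaks and applies a floor
// so that tiny jobs still get a sane minimum.
func requestsFor(p Profile, headroomPct int) Resources {
	cpu := atLeast(p.PeakCPUMilli*(100+headroomPct)/100, 100)
	mem := atLeast(p.PeakMemoryMiB*(100+headroomPct)/100, 64)
	return Resources{
		CPU:    fmt.Sprintf("%dm", cpu),
		Memory: fmt.Sprintf("%dMi", mem),
	}
}

func main() {
	heavy := requestsFor(Profile{PeakCPUMilli: 400, PeakMemoryMiB: 1000}, 20)
	light := requestsFor(Profile{PeakCPUMilli: 20, PeakMemoryMiB: 10}, 20)
	fmt.Printf("cpu=%s memory=%s\n", heavy.CPU, heavy.Memory) // cpu=480m memory=1200Mi
	fmt.Printf("cpu=%s memory=%s\n", light.CPU, light.Memory) // cpu=100m memory=64Mi
}
```

A two-hour tenant and a three-second tenant end up with very different requests, which is the kind of customization that a fixed per-function memory setting makes awkward.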

Next was examining the different layers of our applications to understand the complexity of such a migration.

How Do You Convert Lambdas to Kubernetes Jobs?

Now’s our moment to get philosophical. Ultimately, what’s a lambda? It’s a type of job that needs to be done a single time, with a specific configuration, and that runs a bunch of workers to get the job done. This brought us to the epiphany that this sounds a lot like…K8s jobs.
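The mapping can be sketched in Go by rendering a `batch/v1` Job manifest from a Lambda-like description. This is a sketch under assumptions: `LambdaConfig`, the image name, and the retry count are hypothetical, and the structs mirror only a handful of Job fields. A real project would more likely use the official Kubernetes API types or a Helm template.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// Minimal mirrors of a few batch/v1 Job fields, just enough for the example.
type EnvVar struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

type Container struct {
	Name  string   `json:"name"`
	Image string   `json:"image"`
	Env   []EnvVar `json:"env,omitempty"`
}

type PodSpec struct {
	RestartPolicy string      `json:"restartPolicy"`
	Containers    []Container `json:"containers"`
}

type PodTemplate struct {
	Spec PodSpec `json:"spec"`
}

type JobSpec struct {
	BackoffLimit          int         `json:"backoffLimit"`
	ActiveDeadlineSeconds int         `json:"activeDeadlineSeconds,omitempty"`
	Template              PodTemplate `json:"template"`
}

type Metadata struct {
	Name string `json:"name"`
}

type Job struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   Metadata `json:"metadata"`
	Spec       JobSpec  `json:"spec"`
}

// LambdaConfig is a hypothetical summary of a containerized Lambda function.
type LambdaConfig struct {
	Name           string
	Image          string
	TimeoutSeconds int // 0 means no deadline on the Job
	Env            map[string]string
}

// jobFromLambda reuses the image and environment and maps the timeout to
// activeDeadlineSeconds, which is no longer capped at 15 minutes.
func jobFromLambda(cfg LambdaConfig) Job {
	keys := make([]string, 0, len(cfg.Env))
	for k := range cfg.Env {
		keys = append(keys, k)
	}
	sort.Strings(keys) // stable output regardless of map iteration order
	env := make([]EnvVar, 0, len(keys))
	for _, k := range keys {
		env = append(env, EnvVar{Name: k, Value: cfg.Env[k]})
	}
	return Job{
		APIVersion: "batch/v1",
		Kind:       "Job",
		Metadata:   Metadata{Name: cfg.Name},
		Spec: JobSpec{
			BackoffLimit:          2, // assumed retry count
			ActiveDeadlineSeconds: cfg.TimeoutSeconds,
			Template: PodTemplate{Spec: PodSpec{
				RestartPolicy: "Never",
				Containers:    []Container{{Name: cfg.Name, Image: cfg.Image, Env: env}},
			}},
		},
	}
}

func main() {
	job := jobFromLambda(LambdaConfig{
		Name:           "tenant-scan",
		Image:          "registry.example.com/scanner:1.0.0", // hypothetical image
		TimeoutSeconds: 7200,
		Env:            map[string]string{"REGION": "us-east-1", "TENANT_ID": "t-123"},
	})
	out, err := json.MarshalIndent(job, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out)) // JSON manifests can be applied with kubectl like YAML ones
}
```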

Our containerized lambdas and fully-codified configurations enabled us to reuse the runtime and configurations, with very minimal adjustments when moving between environments. Let’s take a look at some of the primary elements.

Networking

The large majority of the networking elements, including security groups and much more, were covered via containerization. The flip side is that if your networking is not configured as code and well defined, communication between your services can break. Ensuring that all of your security groups and the resources around them, starting with VPCs, are properly configured makes for a much more seamless transition and is essentially the backbone of democratizing your infrastructure choice.

Permissions and External Configurations

Another critical aspect that can make or break this transition is permissions and access control. Since serverless (on AWS), ECS, and Kubernetes can all work with IAM roles, it’s a matter of designing your roles so that they can be ported across environments without breaking your flows. There are minor changes and optimizations, such as configuring trust relationships; however, this beats configuring all of your permissions from scratch.

Changing your IAM Role’s Trust relationship from this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "lambda.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

to this, which trusts the EKS cluster’s OIDC provider instead of the Lambda service, makes it portable and reusable:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::123456789:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/XXXXXXXXXXXXXXXXX"
            },
            "Action": "sts:AssumeRoleWithWebIdentity"
        }
    ]
}

Other changes you need to cover include converting environment variables to the ConfigMap format used by Kubernetes deployments. Then you are ready to attach to your preferred runtime and environment.
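As a minimal sketch of that conversion, the Go program below turns `KEY=VALUE` entries into a `v1` ConfigMap manifest. The function name, the ConfigMap name, and the sample variables are hypothetical. Sensitive values would normally belong in a Secret rather than a ConfigMap.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Minimal mirror of the ConfigMap fields needed for the example.
type Metadata struct {
	Name string `json:"name"`
}

type ConfigMap struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   Metadata          `json:"metadata"`
	Data       map[string]string `json:"data"`
}

// configMapFromEnv converts KEY=VALUE entries (the format os.Environ returns)
// into a ConfigMap. Only the first "=" separates key from value.
func configMapFromEnv(name string, env []string) (ConfigMap, error) {
	data := make(map[string]string, len(env))
	for _, entry := range env {
		key, value, ok := strings.Cut(entry, "=")
		if !ok || key == "" {
			return ConfigMap{}, fmt.Errorf("malformed entry %q, want KEY=VALUE", entry)
		}
		data[key] = value
	}
	return ConfigMap{
		APIVersion: "v1",
		Kind:       "ConfigMap",
		Metadata:   Metadata{Name: name},
		Data:       data,
	}, nil
}

func main() {
	cm, err := configMapFromEnv("scanner-config", []string{
		"REGION=us-east-1",
		"LOG_LEVEL=info",
	})
	if err != nil {
		panic(err)
	}
	out, err := json.MarshalIndent(cm, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```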

The Unhappy Runtime Flow

This doesn’t mean there can’t be unhappy flows. Docker is not a panacea, and compatibility issues do arise: base images can differ from service to service or between OS distributions, and there are Linux-level issues such as dependencies on specific file directories.

However, you can overcome these challenges by building your own Docker images and dependencies with as much abstraction as possible. Two practices helped us avoid runtime issues: compiling our Golang app in a separate builder image and using the result in our target image, and managing our environment variables in a struct with explicit references rather than relying on any runtime to inject them for us.
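The second practice might look like the following Go sketch: every variable the service depends on is named in one place and validated at startup. The `Config` fields and the variable names `QUEUE_URL` and `WORKERS` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strconv"
)

// Config lists every environment variable the service depends on.
type Config struct {
	QueueURL string
	Workers  int
}

// loadConfig reads variables through an explicit lookup function, so the same
// code works under Lambda, ECS, or Kubernetes and is easy to test with a fake.
func loadConfig(lookup func(string) (string, bool)) (Config, error) {
	queueURL, ok := lookup("QUEUE_URL")
	if !ok || queueURL == "" {
		return Config{}, errors.New("QUEUE_URL is required")
	}

	workers := 4 // default when WORKERS is not set
	if raw, ok := lookup("WORKERS"); ok {
		n, err := strconv.Atoi(raw)
		if err != nil || n < 1 {
			return Config{}, fmt.Errorf("WORKERS must be a positive integer, got %q", raw)
		}
		workers = n
	}

	return Config{QueueURL: queueURL, Workers: workers}, nil
}

func main() {
	cfg, err := loadConfig(os.LookupEnv)
	if err != nil {
		// A real service would exit non-zero here; the sketch only reports it.
		fmt.Fprintln(os.Stderr, "config error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Failing fast on a missing variable turns a confusing runtime error on the new platform into an obvious startup error.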

Blue/Green >> GO!

So what does the final rollout look like? Although there was some downtime, it wasn’t significant. Our team chose the blue/green method for deployment and monitored this closely to ensure that all the data was being received as it should and the migration went smoothly.

Before we dive into this further, here is a brief word about monitoring and logging, another aspect you need to migrate properly before you deploy anything. If you were previously monitoring lambdas, you now need to convert that monitoring to clusters and pods. You also need to validate that logs are shipping and arriving as they should, for example via fluentd rather than CloudWatch.

Once we had all of this in place, we were ready to reroute our traffic as part of the blue/green rollout. We routed some of our event streams via SQS to the new infrastructure and ran continuous sanity checks to ensure the business logic didn’t break, that everything was transferring, and that the monitoring and logging were working as they should. Once we had verified this flow and slowly shifted the traffic from our previous infrastructure to our new infrastructure without any breakage, we knew our migration was complete.

This can take hours or days, depending on how large your deployment is and how sensitive your operations, SLAs, data, and more are. The only obvious recommendation is to ensure you have the proper visibility in place to know it works.
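One way to decide which events go to the new infrastructure during such a gradual shift is a stable hash on a routing key. The Go sketch below is an assumption about how the split could be made, not a description of our SQS setup; `routeToGreen` and the tenant keys are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// routeToGreen reports whether an event with the given key should go to the
// new (green) infrastructure. greenPercent is expected to be in 0..100.
// Hashing keeps routing stable per key, and a key that is already on green
// stays there as the percentage grows.
func routeToGreen(key string, greenPercent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(key)) // hash.Hash writes never return an error
	return h.Sum32()%100 < greenPercent
}

func main() {
	const total = 1000
	for _, percent := range []uint32{0, 10, 50, 100} {
		green := 0
		for i := 0; i < total; i++ {
			if routeToGreen(fmt.Sprintf("tenant-%d", i), percent) {
				green++
			}
		}
		fmt.Printf("green=%d%% -> %d of %d tenants on new infrastructure\n", percent, green, total)
	}
}
```

Keeping a tenant on one side for the duration of a step makes the sanity checks easier to reason about, since any discrepancy can be traced to a single infrastructure.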
