Plaid's engineering team cut deployment times on AWS ECS by 95% by writing a custom wrapper that relaunches their Node.js processes without recreating the containers.
Plaid.com, a financial technology company that enables applications to connect with users' bank accounts, has integrations with over 9600 different financial institutions, from which it pulls and processes data for later analysis. Plaid runs over 20 internal services, with 50+ code commits per day for its core services. The bank integration service, which runs as Node.js processes in containers on ECS, suffered from slow startup times during deployments, which in turn slowed down how quickly code could be shipped. Multiple environments in the pipeline added to the slowdown. The long-term plan was to move to Kubernetes. As a short-term solution, the team wrote a custom process wrapper that relaunches the application in the same container, thus avoiding container recreation.
Plaid runs 4000 Node.js processes in containers. A profiling exercise by the team exposed some possible areas for optimization in the deployment process, specifically during application startup. ECS health checks (similar to Kubernetes liveness checks) were tweaked, but to little avail. Reducing the number of containers was another option, but it would have required re-architecting the service. Spinning up more instances was not cost-effective. The team managed to shave off a few minutes with these approaches. InfoQ got in touch with Evan Limanto, engineer at Plaid, to learn more about the internals.
The team came up with a hot reloading technique by writing a process wrapper. Internally called Bootloader, the wrapper runs in the containers and launches the actual application as a sub-process. Bootloader also traps and forwards signals, and handles logging output. The application was modified to listen on a gRPC endpoint for a message sent from the Jenkins deployment pipeline. Limanto says that "each container advertises its own address on a Redis set with an application level heartbeat." This Redis set is used to keep track of all healthy containers at any given time.
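The heartbeat-based membership described above can be illustrated with a small in-memory stand-in. This is only a sketch under stated assumptions: the class name `HeartbeatRegistry`, the 15-second TTL, and the expiry rule are hypothetical, and the article does not say how Plaid structures the Redis set. With Redis itself, one common way to get the same effect is a sorted set scored by timestamp.

```python
import time


class HeartbeatRegistry:
    """In-memory stand-in for a heartbeat-backed set of container addresses.

    Hypothetical sketch: an address counts as healthy if it has sent a
    heartbeat within the last `ttl_seconds`.
    """

    def __init__(self, ttl_seconds=15.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}         # address -> time of last heartbeat

    def heartbeat(self, address):
        """Record that `address` is alive right now."""
        self._last_seen[address] = self._clock()

    def healthy(self):
        """Return the addresses whose last heartbeat is within the TTL."""
        cutoff = self._clock() - self._ttl
        return {a for a, seen in self._last_seen.items() if seen >= cutoff}
```

A container that stops sending heartbeats simply ages out of the healthy set, so the deployment pipeline never needs an explicit deregistration step.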
The gRPC message has the commit hash in its payload, so it's possible to perform a rollback by sending an older hash, explains Limanto. The message triggers a download of application code from AWS S3, and the app exits with a special status code. Bootloader traps this code and relaunches the app, thus loading the new code in memory. How does the reload happen across all containers? Limanto explains:
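The relaunch loop at the heart of a wrapper like Bootloader might look roughly like the following. It is a minimal sketch in Python, not Plaid's implementation: the function name `run_with_reload` and the exit code value `75` are assumptions (the article does not state the actual code), and log handling is left out.

```python
import signal
import subprocess
import sys

# Hypothetical value: the article only says the app exits with "a special status code".
RELOAD_EXIT_CODE = 75


def run_with_reload(command):
    """Run `command` as a child process, relaunching it on RELOAD_EXIT_CODE.

    Returns (final_exit_code, number_of_launches).
    """
    launches = 0
    child = None

    def forward(signum, _frame):
        # Pass termination signals on to the child so it can shut down cleanly.
        if child is not None and child.poll() is None:
            child.send_signal(signum)

    # Install forwarding handlers, remembering the old ones to restore later.
    previous = {s: signal.signal(s, forward) for s in (signal.SIGTERM, signal.SIGINT)}
    try:
        while True:
            child = subprocess.Popen(command)
            launches += 1
            code = child.wait()
            if code != RELOAD_EXIT_CODE:
                # Any other exit code ends the wrapper with the child's status.
                return code, launches
            # Otherwise loop: the next launch picks up the freshly downloaded code.
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)


if __name__ == "__main__" and len(sys.argv) > 1:
    exit_code, _ = run_with_reload(sys.argv[1:])
    sys.exit(exit_code)
```

Because the wrapper itself never exits during a reload, the container stays up from the orchestrator's point of view, which is what avoids the container recreation cost.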
The reload happens in a phased manner according to a simple formula:
Reload the current container if the hash of its address is less than `min(TargetPercentage, MaxUnhealthyPercentage + % of containers on new commit)`. Some background job runs this reloading logic on an interval.
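One way to read this formula is sketched below. The hashing scheme is an assumption: the article does not say which hash Plaid uses or how it is scaled, so here a SHA-256 digest of the address is mapped to a value in the range 0 to 100. The function names `address_bucket` and `should_reload` are hypothetical.

```python
import hashlib


def address_bucket(address: str) -> float:
    """Map a container address to a stable value in [0, 100).

    Assumption: SHA-256 stands in for whatever hash the real system uses.
    """
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32 * 100


def should_reload(address, target_pct, max_unhealthy_pct, pct_on_new_commit):
    """Apply the phased-rollout rule to a single container."""
    threshold = min(target_pct, max_unhealthy_pct + pct_on_new_commit)
    return address_bucket(address) < threshold
```

On this reading, each run of the background job raises the threshold as more containers report the new commit, so the rollout advances in waves of at most `max_unhealthy_pct` until it reaches `target_pct`. Because the bucket for a given address never changes, a container that has qualified for reload keeps qualifying as the threshold grows.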
A reload can be triggered while a process is still handling requests. How is this handled? Opinions differ. Plaid keeps track of the requests being processed and exits only after they have all completed, while another view favors writing the app so that it can recover from abrupt shutdowns.
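The first approach, tracking in-flight requests and exiting only once they finish, can be sketched as a small counter with a drain step. This is an illustration rather than Plaid's code: the class `DrainingTracker` and its methods are hypothetical, and refusing new requests once draining starts is an assumption the article does not confirm.

```python
import threading


class DrainingTracker:
    """Counts in-flight requests so a process can exit only when idle."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # no requests yet, so the process starts idle

    def try_begin(self) -> bool:
        """Register a new request; returns False once draining has started."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def end(self) -> None:
        """Mark one request as finished."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def drain(self, timeout=None) -> bool:
        """Stop accepting requests and wait until none are in flight.

        Returns True if the process is idle, False if the timeout expired.
        """
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout)
```

In this sketch, a reload handler would call `drain()` and exit with the special status code only when it returns `True`. The timeout is a design choice: without one, a single stuck request could block the reload indefinitely.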