Building Tech at Presidential Scale

Key Takeaways

  • Technology played a central role in the 2020 US Presidential election cycle.
  • The Biden for President 2020 campaign built a robust microservices architecture.
  • Speed of delivery, reliability, and security were critical requirements.
  • CI/CD best practices were employed to deliver changes with unprecedented velocity.
  • Data-movement automation and artificial intelligence were among the architectures the campaign built.

I was the chief technical officer for US President Joe Biden’s campaign during the 2020 election. At QCon Plus in November 2020, I spoke about some of the elaborate architectures our tech team built and the specific tools we created to solve a variety of problems. This article is a distillation of that talk.

As CTO with Biden for President 2020, I led the technology organization within the campaign. I was responsible for technology operations as a whole, and I had a brilliant team that built the best tech stack in presidential politics.

We covered everything from software engineering to IT operations to cybersecurity and all pieces in between.

I joined the campaign from Target, where I was a distinguished engineer focused on infrastructure and operations with an emphasis on building high-scale, reliable distributed systems. I had previously worked on the Hillary for America 2016 technology platform by way of The Groundwork, which was a campaign-adjacent technology company.

Campaign structure

The intricacies of a political campaign’s moving parts influence all of the specific choices when considering tech in that environment.

Figure 1 lays out the different departments within the campaign and gives you a sense of the varying degrees of responsibility across the organization.

Figure 1: The organizational structure of a political campaign

Each of the many teams across the campaign had its own specific focus. Although campaign tech gets a lot of attention in the press, tech is not the most important thing in politics. The goal for everyone is to reach voters, drive the candidate's message, and connect as many people as possible to the democratic process.

Role of tech on a campaign

Tech on a campaign does quite literally whatever needs to be done. Nearly every vertical of the campaign needs tech in one form or another. In practical terms, my team was responsible for building and managing our cloud footprint, all IT operations, vendor onboarding, and so forth.

Our approach to building technology during the election cycle was simply to build the glue that ties vendors and systems together. A campaign is not the right environment for delving into full-blown product development.

When a vendor or open-source tool we needed didn't exist or, frankly, when we didn't have the money to buy something, we would build tools and solutions.
A huge portion of the work that needs to be done by a campaign tech team is simply getting data from point A to point B. Strictly speaking, this involves creating a lot of S3 buckets.

In addition, we ended up developing a wide range of technology over the year and a half of the campaign cycle. We built dozens of websites, some big and some small. We built a mobile app for our field organizing efforts. We built Chrome extensions to help simplify the workloads for dozens of campaign team members. Though unglamorous, we developed Word and Excel macros to drive better efficiency. On a presidential campaign, time is everything, and anything that we could do to simplify a process to save a minute here or there was well worth the investment.

Much of what we did boils down to automating tasks to reduce the load on the team in any way that we could. It's tempting at times to distill campaign tech down to one specific focus: data, digital, IT, or cybersecurity. In reality, the technology of the campaign is all of those things and more, and is a critical component of everything the campaign does.

What we did

We were a small and scrappy team of highly motivated technologists. We'd take on any request and do our best where we could. Over the course of the campaign, we built and delivered over 100 production services. We built more than 50 Lambda functions that delivered a variety of functionality. We built a best-in-class relational organizing mobile app for the primary cycle, the Team Joe app, which gave the campaign the leverage to connect thousands of eager voters and volunteers directly with the people they knew.

We had more than 10,000 deployments in the short time we were in operation, all with zero downtime and a focus on stability and reliability.

We implemented a bespoke research platform with robust automation that built on cloud-based machine learning, which saved the campaign tens of thousands of hours of human work, and gave us incredible depth of insight. We built a number of services on top of powerful machine-learning infrastructure.

Early on, the campaign made a commitment to hold ourselves to a higher standard and to make sure that we would not accept donations from organizations that harm our planet or from individuals who may have ulterior motives. To keep that promise, we built an automation framework that vetted our donors in near real time for FEC compliance and campaign commitments.

When we won the primary, we wanted new, fresh branding on the website for the general election so we designed and delivered a brand-new web experience for joebiden.com.

A huge aspect of every campaign is directly reaching voters to let them know of an event in their area or when it's time to get out to vote. For that, we built an SMS outreach platform that scaled nationwide and saved us millions of dollars in operational expense along the way. Beyond that, we operationalized IT for campaign headquarters, state headquarters, and ultimately facilitated a rapid transition to being a fully remote organization at the onset of quarantine.

Through all of this, we made sure we had world-class cybersecurity. That was a core focus for everything we did.

Ultimately, though, no job on the campaign has a single responsibility, which means everyone needs to help wherever and whenever help is needed, whether that's calling voters to ask them to go vote, sending text messages, or collecting signatures to get on the ballot. We did it all.

Infrastructure and platform

We relied on a fully cloud-based infrastructure for everything we did. It was paramount that we didn't spend precious hours reinventing any wheels for our core tech foundations. We used Pantheon for hosting the main website, joebiden.com. For non-website workloads, APIs, and services, we built entirely on top of AWS. In addition, we had a small Kubernetes deployment within AWS that helped us deliver simple workloads faster, mostly for cron jobs.

We still needed to deliver with massive scale and velocity overall. As a small team, it was critical that our build-and-deploy pipeline was reproducible. For continuous integration, we used Travis CI. For continuous delivery, we used Spinnaker.

Once we deployed services and they were up and running in the cloud, they all needed a core set of capabilities, like being able to find other services and securely access config and secrets. For those, we used HashiCorp's Consul and Vault. This helped us build fully immutable and differentiated environments between development and production, with very few handcrafted servers along the way — not zero, but very few.
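
As a concrete illustration, here is a minimal sketch of how a service might read a secret from Vault at startup using the Python hvac client; the Vault path and field names are hypothetical and not the campaign's actual layout.

import os

import hvac

# Connect to Vault using an address and token provided by the environment
# (in practice, tokens come from a secure bootstrap, never hardcoded).
client = hvac.Client(
    url=os.environ["VAULT_ADDR"],
    token=os.environ["VAULT_TOKEN"],
)

# Read a database credential from the KV v2 secrets engine.
# The path "services/example-api/db" and the "password" field are hypothetical.
secret = client.secrets.kv.v2.read_secret_version(path="services/example-api/db")
db_password = secret["data"]["data"]["password"]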

A huge part of the technology footprint was dedicated to the work being done by the analytics team. To ensure they had best-in-class access to tools and services, the analytics data platform was built on top of AWS Redshift. This afforded a highly scalable environment with granular control over resource utilization.

We used PostgreSQL via RDS as the back-end datastore for all of the services that we built and deployed.

From an operational perspective, we wanted a centralized view into the logging activity of every application so that we could quickly troubleshoot and diagnose any problem and achieve the quickest possible recovery. For that, we deployed an ELK Stack backed by AWS Elasticsearch. Logs were important, but metrics were the primary insight into the operational state of our services and were critical when integrating with our on-call rotation. For service-level metrics, we deployed InfluxDB and Grafana, and wired them into PagerDuty to make sure we never experienced an unknown outage.

Many of the automation workloads and tasks didn't fit the typical deployment model, and for those we favored AWS Lambda where possible. We also used Lambda for any workload that needed to integrate with the rest of the AWS ecosystem and for fan-out jobs based on the presence of data.

We created a truly polyglot environment. We had services and automation built in a wide variety of languages and frameworks, and we were able to do it with unparalleled resiliency and velocity.

We covered a lot of ground during the campaign cycle. It would take a lifetime to thoroughly detail everything we did, so I'm going to pick out a few of the more interesting architectures the software-engineering side of the tech team put together. By no means should my focus detract from the impressive array of work that the IT and cybersecurity sides accomplished. While I dive deep on the architectures, be aware that for each detail I discuss there are at least a dozen more conversations, covering the breadth of what we accomplished across the entire tech team.

Donor vetting

As noted, the campaign committed early on to reject donations from certain organizations and individuals. Traditionally, the only way to do this at scale is to have a group of people comb through the donations periodically, usually once a quarter, and flag donors who might fit the filtering criteria.

This process is difficult, time consuming, and prone to error. When done by hand, it usually involves setting a threshold for the amount of money an individual has given, and then researching whether or not the donor fits a category we flagged as problematic.

To make this more efficient, we built a highly scalable automation process that would correlate details about a donation against a set of criteria we wanted to flag. Every day, a process would kick off. It would go to NGP VAN, which is a source of truth for all contributions, export the contribution data to a CSV file, and dump that file into S3. The act of the CSV file appearing in S3 would trigger an SNS notification, which in turn would activate a sequence of Lambda functions. These Lambda functions would split the file into smaller chunks, re-export them to S3, and kick off the donor-vetting process. We saw massive scale in Lambda while this workflow was executing: up to 1,000 concurrent executions of the donor-vetting code.
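
To make the fan-out concrete, here is a minimal sketch of the splitting step, assuming a Python Lambda with boto3; the bucket name, key layout, and chunk size are hypothetical rather than the campaign's actual code, but it shows how one exported CSV becomes many small chunks that downstream vetting Lambdas can process in parallel.

import csv
import io

import boto3

s3 = boto3.client("s3")
CHUNK_SIZE = 500  # rows per chunk (hypothetical)

def handler(event, context):
    # For this sketch, assume the bucket and key arrive directly in the event;
    # in the real pipeline they came through the SNS notification of the S3 upload.
    bucket = event["bucket"]
    key = event["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    rows = list(csv.DictReader(io.StringIO(body)))

    for i in range(0, len(rows), CHUNK_SIZE):
        chunk = rows[i:i + CHUNK_SIZE]
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=list(chunk[0].keys()))
        writer.writeheader()
        writer.writerows(chunk)
        # Each chunk written under chunks/ triggers a concurrent vetting Lambda.
        s3.put_object(
            Bucket=bucket,
            Key=f"chunks/part-{i // CHUNK_SIZE:04d}.csv",
            Body=out.getvalue().encode("utf-8"),
        )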

For example, we committed to not accepting contributions from lobbyists and executives in the gas and oil industry. It would take only minutes for the process to go through the comprehensive list of donors and check them against lobbyist registries, foreign-agent registries, and block lists for gas and oil executives.

Once the process completed, the flagged entries were collated into a single CSV file, which was then re-exported to S3 and made available for download. SES would then send an email to Danielle, the developer who built the system, indicating that the flags were ready for validation. After validation, the results were forwarded to the campaign compliance team, who would review the findings and take the appropriate action, whether that be refunding a contribution or investigating it further.

Figure 2: The automated donor-vetting process

To say this process saved a material amount of time for the campaign would be a dramatic understatement. The donor-vetting pipeline quickly became a core fixture in the technology platform and an important part of the campaign operations.

Tattletale

Early on, when we were a small and very scrappy team, we had so many vendors and cloud services, and not nearly enough time or people, to constantly check the security posture of each of them. We needed an easy way to develop rules against critical user-facing systems to ensure we were always following cybersecurity best practices.

Tattletale was developed as a framework for doing exactly that. Tattletale was one of the most important pieces of technology we built on the campaign.

We developed a set of tasks that would leverage vendor system APIs to ensure things like two-factor authentication were turned on or would notify us if a user account was active on the system but that user hadn't logged in in a while. Dormant accounts present a major security risk, so we wanted to be sure everything was configured toward least privilege.

Furthermore, the rules within Tattletale could check that load balancers weren't inadvertently exposed to the internet, that IAM permissions were not scoped too widely, and so forth. At the end of the audit ruleset, if Tattletale recorded a violation, it would drop a notification into a Slack channel to prompt someone to investigate further. It also could notify a user of a violation through email so they could take their own corrective action. If a violation breached a certain threshold, Tattletale would record a metric in Grafana that would trigger a PagerDuty escalation policy, notifying the on-call tech person immediately.
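
As an illustration of what one such rule could look like, here is a minimal sketch in Python with boto3 and a Slack incoming webhook; the webhook URL, dormancy threshold, and the rule itself are hypothetical examples rather than Tattletale's actual code.

from datetime import datetime, timedelta, timezone

import boto3
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # hypothetical
DORMANT_AFTER = timedelta(days=90)  # hypothetical threshold

def dormant_iam_users():
    # Flag IAM users who have never logged in, or not recently.
    iam = boto3.client("iam")
    cutoff = datetime.now(timezone.utc) - DORMANT_AFTER
    for page in iam.get_paginator("list_users").paginate():
        for user in page["Users"]:
            last_used = user.get("PasswordLastUsed")
            if last_used is None or last_used < cutoff:
                yield user["UserName"]

def run_rule():
    violations = list(dormant_iam_users())
    if violations:
        # Drop a notification into the security Slack channel for follow-up.
        requests.post(SLACK_WEBHOOK_URL, json={
            "text": "Tattletale: dormant IAM accounts found: " + ", ".join(violations)
        })

if __name__ == "__main__":
    run_rule()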

Figure 3: The Tattletale schema

Tattletale became our cybersecurity eyes when we didn't have the time or resources to look for ourselves. It also made sure we were standardizing on a common set of cybersecurity practices, and holding ourselves to the highest possible standard.

Conductor and Turbotots

When we reached a certain point of complexity and span of internal tools, we knew we would need to better organize the API footprint and consolidate the many UIs we had built to manage those systems. We also needed to standardize on a unified security model, so we wouldn't have bespoke authentication and authorization all over the place.
And so, we created Conductor (named after Joe Biden’s affinity for trains), our internal tools UI, and Turbotots (one of the many tools we named after potatoes), our platform API.

Figure 4: Conductor and Turbotots

Figure 4 is a dramatically simplified diagram of a complex architecture, but these broad strokes are enough to give you an idea of what Conductor and Turbotots did and represented.

Conductor became the single pane of glass for all the internal tools we were developing for use by people on the campaign. In other words, this was the one place anybody on the campaign needed to go to access the services we'd made. Conductor was a React web app that was deployed via S3 and distributed with CloudFront.

Turbotots was a unified API that provided a common authentication and authorization model for everything we did, and it was what Conductor talked to. We built the AuthN and AuthZ portions of Turbotots on top of AWS Cognito, which saved loads of work and provided simple single sign-on (SSO) through G Suite/Google Workspace, which we were able to configure for only our internal domains. Authentication was achieved by passing JWT tokens through the workflow. For managing this on the front end, we used the React bindings for AWS Amplify, which integrated seamlessly into the app.

Things get a little bit trickier to understand on the right side of figure 4, but I'll try to simplify. The forward-facing API was an API Gateway deployment with a full proxy resource. API Gateway integrates easily with Cognito, which gave us full API security at request time. The Cognito authorizer that integrates with API Gateway also performs JWT validation before a request makes it through the proxy. We could rely on that system to know that a request was fully validated before it reached the back end.

When developing an API Gateway proxy resource, you can connect code running in your VPC by way of a VPC-link network load balancer. This is complicated, but as I understand it, it effectively creates an elastic network interface that spans the internal zone in AWS between API Gateway and your private VPC. The NLB, in turn, was attached to an autoscaling group with a set of NGINX instances, which served as our unified API. This is the actual portion that we called Turbotots, and it is essentially a reverse proxy to all of our internal services running inside the VPC. Following this model, we never needed to expose any of our VPC resources to the public internet. We were able to rely on the security fixtures baked into AWS, and that made us all a lot more comfortable.

Once the requests made it through to Turbotots, a lightweight Lua script within Turbotots extracted the user portion of the JWT token and passed that data downstream to services as part of a modified request payload. Once the request reached the destination service, it could then inspect the user payload, and determine whether or not the user was validated for the request.

Users could be added to authorization groups within Cognito, which would then give them different levels of access in the downstream services. The principle of least privilege still applied here, and there were no default permissions.
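
The sketch below shows one way a downstream service could enforce that model, assuming Turbotots forwarded the user claims in a request header; the header name, group name, and Flask framing are hypothetical illustrations, not the campaign's implementation.

import json
from functools import wraps

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

def require_group(group):
    # Reject any request whose forwarded user claims don't include the group.
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            raw = request.headers.get("X-Turbotots-User")  # hypothetical header
            if raw is None:
                abort(401)  # no user claims were forwarded upstream
            user = json.loads(raw)
            if group not in user.get("cognito:groups", []):
                abort(403)  # least privilege: no default permissions
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@app.route("/donor-flags")
@require_group("compliance-team")  # hypothetical Cognito group
def list_donor_flags():
    return jsonify(flags=[])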

Conductor and Turbotots represented a single user interface for the breadth of our internal tools, fully secured with seamless SSO to the user's campaign G Suite account. We reused this same architecture, as described in the next section, to expose a limited subset of the API to non-internal users as well.

Pencil and Turbotots

Pencil was the campaign peer-to-peer SMS platform. Pencil grew from a simple set of early requirements into a bit of a beast of an architecture to which we dedicated a significant amount of time.

The original motivation to build our own P2P SMS platform was cost savings. I knew that we could send text messages through Twilio for less money than any vendor would demand to resell us the same capabilities. At the start, we didn't need the broad set of features that vendors were offering, so it was fairly easy to put together a simple texting system that met the needs of the moment.

The scope of the project grew dramatically as it became popular. Pencil came to be used by thousands of volunteers to reach millions of voters and became a critical component in our voter outreach workflow. If you received a text message from a volunteer on the campaign, it was likely sent through Pencil.
Pencil’s external architecture is going to look familiar because we were able to reuse everything that we did with Conductor and just change a few settings without having to change any code.

Figure 5: The Pencil architecture looks a lot like Conductor’s

Pencil’s user-facing component is a React web app deployed to S3 and distributed via CloudFront. The React app in turn talks to an API Gateway resource that is wired to a Cognito authorizer.

Users were invited to the Pencil platform via email, which was a separate process that registered their accounts in Cognito. At each user’s first login, Pencil would automatically add them to the appropriate user group in Cognito, which would give them access to the user portion of the Pencil API.
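
That first-login step boils down to a single Cognito call. Here is a minimal sketch, assuming boto3 and hypothetical user pool and group names:

import boto3

cognito = boto3.client("cognito-idp")

def on_first_login(username: str):
    # Add the newly registered volunteer to the group that unlocks
    # the user-facing portion of the Pencil API.
    cognito.admin_add_user_to_group(
        UserPoolId="us-east-1_EXAMPLE",  # hypothetical pool ID
        Username=username,
        GroupName="pencil-users",        # hypothetical group name
    )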

That's a little confusing to reason about — it's enough to say that from a user perspective, they clicked a link to sign up to send text messages. After a short onboarding process with the campaign digital field-organizing team, a user was all set up and able to get to work immediately.

It was very easy to do and worked great because it was built on top of the same Turbotots infrastructure we'd already developed.

Tots

Tots represents a different slice of the architecture than Turbotots, but it was something of a preceding reference implementation for what Turbotots eventually became. Tots sat on a path directly downstream from Turbotots, and a number of the services listed beneath Turbotots in the earlier figures were part of the Tots architecture.

Figure 6: The Tots platform

Tots was an important platform. As we built more tools that leveraged machine learning, we quickly realized that we needed to consolidate logic and DRY up the architecture.
Our biggest use case for machine learning in various projects was extracting entity embeddings from blocks of text. We were able to organize that data in an index in Elastic and make entire subjects of data available for rapid retrieval. We used AWS Comprehend to extract the entity embeddings from text. It's a great service — give it a block of text and it'll tell you about the people, places, subjects, and so forth discussed in that text. The texts came into the platform in a variety of forms, including multimedia, news articles, and document formats.
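
A minimal sketch of that core flow, assuming Python with boto3 and the Elasticsearch client; the index name and endpoint are hypothetical, and the real platform had far more plumbing around it.

import boto3
from elasticsearch import Elasticsearch

comprehend = boto3.client("comprehend")
es = Elasticsearch("https://search.internal.example:9200")  # hypothetical endpoint

def index_document(doc_id: str, text: str):
    # Ask Comprehend for the people, places, and organizations in the text.
    entities = comprehend.detect_entities(Text=text, LanguageCode="en")["Entities"]

    # Store the text alongside its entities for rapid retrieval by subject.
    es.index(
        index="campaign-research",  # hypothetical index name
        id=doc_id,
        document={
            "text": text,
            "entities": [
                {"text": e["Text"], "type": e["Type"], "score": e["Score"]}
                for e in entities
            ],
        },
    )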

Much of the mechanics in the Tots platform in figure 6 involved figuring out what to do with something before sending it to Comprehend. The process codified in this platform saved the campaign quite literally tens of thousands of hours of human work on organizing and understanding these various documents. This includes time spent transcribing live events like debates and getting them into a text format that could be read later on by people on the campaign.

CouchPotato

CouchPotato helped us to solve the hardest problem in computer science: making audio work on Linux. CouchPotato became very important to us because of the sheer amount of time it saved on working with audio and video material. It was also one of the most technically robust platforms that we built.

Figure 7: The campaign used CouchPotato to handle multimedia

CouchPotato is the main actor in this architecture in figure 7 but the lines connecting the many independent services illustrate how it needed a lot of supporting characters along the way.

CouchPotato's main operating function was to take a URL or media file as input, open that media in an isolated X11 virtual frame buffer, listen to the playback on a PulseAudio loopback device with FFmpeg, and then record the contents of the X11 session. This produced an MP3 file, which it then sent to AWS Transcribe for automated speech recognition (ASR).
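
A minimal sketch of that capture-and-transcribe step, assuming FFmpeg with a PulseAudio input and the boto3 Transcribe API; the device, bucket, and job names are hypothetical, and the real system also drove the media playback inside the X11 virtual frame buffer.

import subprocess

import boto3

s3 = boto3.client("s3")
transcribe = boto3.client("transcribe")

def capture_and_transcribe(duration_seconds: int, job_name: str):
    # Record audio from the PulseAudio loopback monitor into an MP3 file.
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "pulse", "-i", "loopback.monitor",  # hypothetical device name
        "-t", str(duration_seconds),
        "/tmp/segment.mp3",
    ], check=True)

    # Upload the recording and kick off automated speech recognition.
    s3.upload_file("/tmp/segment.mp3", "couchpotato-audio-example", f"{job_name}.mp3")
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": f"s3://couchpotato-audio-example/{job_name}.mp3"},
        MediaFormat="mp3",
        LanguageCode="en-US",
    )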

Once the ASR process completed, CouchPotato would go through the transcript to correct some common mistakes. For example, Transcribe rarely got Mayor Pete's name (Pete Buttigieg) correct, so we had RegEx replacements in place for some common errors.
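
The cleanup pass amounted to a table of regular-expression substitutions applied to the raw transcript. A minimal sketch with hypothetical example patterns:

import re

# Hypothetical examples of common ASR misrecognitions and their corrections.
COMMON_FIXES = [
    (re.compile(r"\bboot\s*edge\s*edge\b", re.IGNORECASE), "Buttigieg"),
    (re.compile(r"\bjoe\s+bidden\b", re.IGNORECASE), "Joe Biden"),
]

def clean_transcript(text: str) -> str:
    for pattern, replacement in COMMON_FIXES:
        text = pattern.sub(replacement, text)
    return text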

Once all that was done, the transcription would be shipped to Comprehend, which would extract the entity embeddings. Finally, the text would be indexed in Elasticsearch.

The real key to CouchPotato was that it made use of FFmpeg's segmentation feature to produce smaller chunks of audio for analysis. It would run the whole process for each segment and make sure that they all remained in uniform alignment as they were indexed in Elasticsearch.

The segmentation was one of CouchPotato’s initial features, because we used it to transcribe debates in real time. Normally, a campaign would have an army of interns watching the debate and typing out what was being said. We didn't have an army of interns to do that, and so CouchPotato was born.

However, the segmentation created a real-world serialization problem. Sometimes Transcribe would finish a later segment’s ASR before the prior segment had completed — meaning all the work needed to be done asynchronously and recompiled in the proper order after the processing. It sounds like a reactive programming problem to me.
We devoted a lot of time to solving the problem of serializing asynchronous events across stateless workers. That was complicated but we made it work.
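
One way to serialize those out-of-order results is to buffer completed segments by index and only emit each one once every earlier segment has arrived. The sketch below illustrates that idea; it is not the campaign's actual implementation.

from typing import Callable, Dict

class SegmentSerializer:
    def __init__(self, emit: Callable[[int, str], None]):
        self.emit = emit                  # e.g. index into Elasticsearch or append to a doc
        self.buffer: Dict[int, str] = {}  # completed but not-yet-emitted segments
        self.next_index = 0

    def segment_completed(self, index: int, transcript: str):
        # Transcribe may finish segment N+1 before segment N; hold results
        # until everything earlier has been emitted in order.
        self.buffer[index] = transcript
        while self.next_index in self.buffer:
            self.emit(self.next_index, self.buffer.pop(self.next_index))
            self.next_index += 1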

That's also where DocsWeb came in handy for the debate. DocsWeb connected CouchPotato’s transcription output for the segments to a shared Google Doc, allowing us to share with the rest of the campaign the transcribed text nearly in real time — except for the lag from Transcribe to Elastic, which wasn't too bad. We transcribed every debate and loads of other media that would have taken a fleet of interns a literal human lifetime to finish.

Part of solving the serialization problem for the segments was figuring out how we could swap a set of active CouchPotato instances for a new one — say, during a deploy or when a live Transcribe event was already running. There's a lot more to say on that subject alone, like how we made audio frame-stitching work so that we could phase out an old instance and phase in a new one. It's too much to go into here. Just know that HotPotato encapsulated a lot of work that allowed us to do hot deployments of CouchPotato with zero downtime and zero state loss.

KatoPotato was the entry point into the whole architecture. It's an orchestration engine, and it coordinated the movement of data between CouchPotato, DocsWeb, and Elastic, as well as acting as the main API for kicking off CouchPotato workloads. Furthermore, it was responsible for monitoring the state of CouchPotato and deciding if HotPotato needed to step in with a new instance.

Sometimes audio on Linux can be finicky. CouchPotato was able to report to KatoPotato whether it was actually capturing audio or not. If it was not capturing audio, KatoPotato could scale up a new working instance of CouchPotato and swap it out via HotPotato.
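
Conceptually, that monitoring loop could look like the hypothetical sketch below; the client objects and method names are illustrative only and do not come from the campaign's code.

import time

def monitor(couchpotato, hotpotato, check_interval_seconds: int = 10):
    while True:
        status = couchpotato.get_status()  # hypothetical API call
        if not status.get("capturing_audio", False):
            # Audio capture has stalled: bring up a fresh CouchPotato
            # instance and swap it in without losing state.
            hotpotato.swap_in_new_instance()  # hypothetical API call
        time.sleep(check_interval_seconds)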

The marketing line for CouchPotato is that it's a platform built on cloud machine learning that gave us rapid insight into the content of multimedia without having to spend valuable human time on the task. It's a great piece of engineering, and I'm glad it was able to deliver the value it did.

Conclusion

There's so much more I could add about the technology we built, like the unmatched relational organizing app, the data pipelines we created to connect S3 and RDS to Redshift, or the dozens of microsites we built along the way. I could also explain how we managed to work collaboratively in an extremely fast-paced and dynamic environment.

I’ve covered some of the most interesting architectures we built for the Biden for President 2020 campaign. It’s been a pleasure to share this and I look forward to sharing more in the future.

About the Author

Dan Woods is CISO and VP of cybersecurity at Shipt. Before and after working as the CTO of Biden for President 2020, he was a distinguished engineer at Target, where he focused on infrastructure and operations with an emphasis on building high-scale, reliable distributed systems. Prior to joining Target, he worked on the Hillary for America 2016 campaign's technology platform by way of The Groundwork, which was a campaign-adjacent technology company. Before his foray into presidential politics, Woods worked as a senior software engineer at Netflix in the operations engineering organization, and helped create Netflix's open-source continuous-delivery platform, Spinnaker. Woods is also a member of the open-source team for the Ratpack web framework. He wrote Learning Ratpack, published in 2016 by O'Reilly Media.
