
The World Is on Fire and so Is Your Website


Summary

Ann Lewis discusses how MoveOn architects and scales an ecosystem of custom tools that power political organizing work like rapid response mobilizations, vote programs, and data-driven campaigns.

Bio

Ann Lewis is the CTO of MoveOn.org. She is a technical leader, architect, and active coder with 15+ years of experience in software engineering and software management, with a focus on distributed systems and scalability.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Lewis: What exactly is MoveOn? MoveOn is a grassroots campaigning organization that fights for social justice. We advocate across a variety of social justice issues. We support progressive policies. Typically, in election years, we also endorse a slate of progressive candidates. MoveOn is really a community of millions of Americans who identify as progressive in all 50 states. We're also a small, scrappy, fully distributed team. We've all been working remotely for the last 10-plus years, and we're embedded in the grassroots communities that make up America. This allows us to run nationally impactful programs. To be able to achieve that national scale, all of these programs are powered by tech tools and data. To make this happen, we have a complex ecosystem of 30-plus websites and tools that all work together, and need to scale on a nonprofit budget.

Background

Who am I? I'm MoveOn's Chief Technology Officer. I've been at MoveOn since 2015. I've been in tech for more than 15 years working at a variety of Fortune 500 companies, startups, consulting companies. As I've progressed in my tech career, I've deepened my expertise in distributed system scaling. I'm really excited about using those skills to build tech that powers collective action.

Outline

We're going to talk about the new attention economy, how we get our information online and where we get it from. I'd like to tell you a story about a huge protest we pulled off in 2018 that was deeply affected by our place in the attention economy. Then we'll get into all the systems, tools, and architecture that make this all possible, and how we manage to scale on a nonprofit budget.

A Walk Down Memory Lane

Who remembers Slashdot? Who remembers the internet before big social media? The rest of you can get off my lawn. Let's talk about the Slashdot effect. This was a nerdy nickname for website scaling events in the late '90s: when a website with significantly more engagement and viewership linked to a smaller website with less viewership, the resulting firehose of web traffic could often take the smaller website down. This is a classic late '90s graph to go with a classic late '90s nickname. This is like a baby picture for internet scaling.

The Attention Economy

Today, we live in a more complicated internet. The total amount of information to consume has steadily grown over the last few years, but the amount of human attention available to consume it, even when we're psychologically incentivized to consume as much as possible, has become a real limit. Content publishers have evolved into deeply personalized content ecosystems driven by algorithmic feedback loops. I'm talking about the social media platforms here, and what you see in your feed. These social media platforms all compete for our eyeballs, and they want to control what we see based on behavior they want to drive, which is usually more clicks.

Previously, the internet was more spread out: many content creators, many content publishers, and amplification was a job owned by viewers and readers. Today, we have just a few dominant content publishers who not only control what we see when we're using their platforms, but also control the shape and characteristics of engagement. Because of the algorithmic feedback loops involved in customizing your content feeds and incentivizing particular types of engagement, social media platforms are always optimizing for more clicks. They end up amplifying highly inflammatory content, like basically everything that's happened so far in 2020. They do this directly, and also via behavioral feedback loops.

Oligarchy

As you may have noticed, either from your pile of bills or the giant pile of money in your backyard if you've been lucky, Americans now live in an economic oligarchy. We also participate daily in another oligarchy, a social media oligarchy. Not only do we get most of our information from deeply personalized social media platforms, we often get most of our social media content directly or indirectly from the people on these platforms who have the most reach. This follows an oligarchy pattern: 0.1% of users have the majority of followers, more than 100,000, typically. Then there's a tier of folks below that, representing about 2% of social media users, who have 10,000 to 100,000 followers. Then everyone else, including myself, has maybe 500 to 700 followers or fewer. Social media is an oligarchy, too.

Influencers

We call these social media users with huge follower counts and reach influencers. Influencers with a capital I are people with more than 100,000 followers, and micro-influencers are social media users with 10,000 to 100,000 followers. These influencers control the nature of virality in today's attention economy. They're also the only thing that's more powerful right now than the social media algorithmic feedback loops. I don't know what to say about this other than I am not an influencer on Twitter, so you all can get off my lawn.

Story: A Protest Goes Viral

I want to tell you a story about a moment that we in the progressive movement are all really proud of, and it's also a story about the influencer economy. In 2018, MoveOn and a handful of other organizations banded together and created a coalition called No One Is Above The Law, to protect the ongoing Mueller investigation from tampering or being undermined. As a part of this, red lines were identified: actions that, if the Trump administration took them, we'd immediately protest in response, and by we, I mean hundreds of thousands of people. By November 2018, over 500,000 people had pledged to protest if any red lines were crossed. Through the summer, the threat of mass protests appeared to be keeping various risks to the Mueller investigation at bay via the threat of collective action. Then, the day after the election, Trump did cross a red line by firing Jeff Sessions and replacing him with a loyalist. This coalition had an emergency meeting about an hour after this happened and decided to launch the "Trump Is Not Above the Law" protest network.

Trump Is Not Above the Law

A few hours later, we launched the protest network, released posts and calls to action on all of our social media pages, emailed our list, and broadcast to everyone who subscribed to our SMS list. On social media we saw a little bit of a lift, mostly from our own micro-influencers, of tens of thousands of retweets, and the website itself observed moderate surges of traffic. A few hours after that, when our favorite influencer, Rachel Maddow, mentioned the protest website on her evening show, we saw a traffic surge up to 3.5 million views all of a sudden, just in minutes, which made our site fall over, but we quickly brought it back up. Here you can see the Google Analytics graph of our Rachel Maddow moment. We were able to capture most of that influencer energy and attention after we quickly brought our website back up, and we converted it into even more protest events and RSVPs. By the end of the day, we had 1,000 protest events scheduled nationwide, and over 500,000 people had RSVPed to these events. That's an additional 300 events and 100,000 more RSVPs in just 24 hours, most of that due to the Rachel Maddow mention. Then, at the end of the day, we had nationwide protests. Here you can see some beautiful pictures of people protesting all around the nation, in big cities and in small towns. It's just so much incredible solidarity and such a show of collective power. Getting back to the subject of lawns, I was at my local Charlottesville, Virginia protest, and the city of Charlottesville literally told the protesters during this protest to get off the county office building's lawn. We had to stretch around the sidewalk around that building instead. This just made us look even more powerful. Look at all of this beautiful collective action.

Key Technical Takeaways

The key technical takeaways from this protest event, for me, were that the observed behavior of virality in the modern social media attention economy is tightly controlled by the social media platforms. Going viral only means a huge surge of traffic in the Slashdot sense if the platforms decide it does, with one major exception: influencers can still generate that organic viral behavior if you get their attention.

The Tech behind Protest Networks

What do you think is involved in pulling off mass national protests? It turns out, it's mostly logistics. Where do people go? Who's handling which tasks? How do you get hundreds of thousands of people to know about a thing or to do a thing? It's all about disseminating information at all the right times to guide and amplify this energy that's already there. Specifically, for this protest, and for our protest infrastructure in general, we have a hub website: a server on top of a database managing events, which lets you sign up to host events and sign up to RSVP to particular events. We usually have a map showing where all the events are and some search tools. One key to scaling here is crowdsourced event creation: anyone can host a protest. We have rolling post-launch prep and training calls after the website launches to make sure that all the hosts have all the information they need to run a really great protest. The vast majority of the technical complexity here is in the mobilization tools. These are the buffet of ways we try to meet people where they already are and find people who are already interested in attending a protest event. We send emails. We send SMS messages. We do social media posts. We do targeted ad buys. Anything we can do to try and find people who might be interested in attending a nearby protest.
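
To make the shape of that hub website concrete, here is a minimal, hypothetical sketch of the kind of data model that might sit under an event hub: an events table for crowdsourced hosts and an rsvps table for attendees. The schema, table and column names, and the SQLite backend are illustrative assumptions, not MoveOn's actual stack.

```python
# Hypothetical data model for an event hub: crowdsourced events plus RSVPs.
import sqlite3

conn = sqlite3.connect("protests.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    host_email TEXT NOT NULL,
    city TEXT,
    state TEXT,
    starts_at TEXT            -- ISO-8601 timestamp
);
CREATE TABLE IF NOT EXISTS rsvps (
    id INTEGER PRIMARY KEY,
    event_id INTEGER NOT NULL REFERENCES events(id),
    email TEXT NOT NULL,
    zip_code TEXT,
    UNIQUE (event_id, email)  -- one RSVP per person per event
);
""")

def create_event(title, host_email, city, state, starts_at):
    """Crowdsourced event creation: anyone can host a protest."""
    cur = conn.execute(
        "INSERT INTO events (title, host_email, city, state, starts_at) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, host_email, city, state, starts_at),
    )
    conn.commit()
    return cur.lastrowid

def rsvp(event_id, email, zip_code):
    """Attendee signup for a specific event; duplicate RSVPs are ignored."""
    conn.execute(
        "INSERT OR IGNORE INTO rsvps (event_id, email, zip_code) VALUES (?, ?, ?)",
        (event_id, email, zip_code),
    )
    conn.commit()
```

The map and search tools described above would then just be queries over these two tables, filtered by city, state, or zip code.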

Stepping Up to Big Moments

To be able to step up to big moments like this, and to make our work and mission matter, we need to be able to turn around a protest action very quickly and get all the related logistics in place in just days, sometimes even a single day. This in itself presents a series of technical challenges, and we have to pull them all off on a nonprofit budget.

Problems to Solve

The problems to solve here are generally around timing. You don't know when the protest is going to launch. You don't know when a moment will happen that you need to organize a protest for. You typically need to be able to carry out all the related preparation and technical scaling within just a few hours. Some big companies may deal with issues like this by pre-scaling up 4x or 10x, but we can't afford to do that on a nonprofit budget. Also, our infrastructure is relatively complex. We use a variety of tools, especially when it comes to mobilization tools, that are a combination of in-house and vendor tools. This ecosystem itself presents its own emergent scaling challenges. Scale testing by itself is actually very complicated and time consuming.

Monitoring and Measurement

How do we pull this off at all? First and foremost, the most important thing is monitoring. You need to monitor everything from your high-level page SLAs and clickstream throughput, all the way down to things like database CPU. You need to monitor not just your systems but vendors' systems too. Identifying your key workflows and critical failure points ahead of time, and setting SLAs across your stack for those workflows, is key. You should assume that, at your most critical workflows and failure points, the most important parts of your system will fail. Like Senator Warren, you should have a plan for that.
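
As one small illustration of monitoring a key workflow against a page-level SLA, here is a minimal sketch of a synthetic check. The URL, the 2-second SLA, and the alerting hook are all assumptions for the example; in practice the alert would page whoever the incident response plan names.

```python
# A synthetic check of one key workflow, measured against an assumed SLA.
import time
import requests

SLA_SECONDS = 2.0                             # example page-level SLA
RSVP_PAGE = "https://example.org/event/123"   # hypothetical key workflow URL

def alert(message):
    # Stand-in for paging the on-call person named in the response plan.
    print(message)

def check_workflow():
    """Fetch the page, compare latency and status against the SLA."""
    start = time.monotonic()
    try:
        resp = requests.get(RSVP_PAGE, timeout=10)
        elapsed = time.monotonic() - start
        ok = resp.status_code == 200 and elapsed <= SLA_SECONDS
    except requests.RequestException:
        elapsed, ok = time.monotonic() - start, False
    if not ok:
        alert(f"SLA breach: {RSVP_PAGE} took {elapsed:.2f}s (limit {SLA_SECONDS}s)")
    return ok
```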

Vendors

This is true in general, but it's especially true for scrappy nonprofits that have to exist in a vendor ecosystem we can't control. Your system doesn't scale if your vendors don't scale. The best way to make sure your vendors scale is to identify very specifically what scaling means. Then, write down your scaling needs in your contract and contractually bind your vendor to them. This can be difficult to pull off, especially if you're a small organization or you're working with relatively small-scale vendors. Sometimes making this work involves building a strong relationship, or running a regular RFP process to make sure that vendors are competing. The critical point is: get these scaling SLAs into your contract as much as you can.

Scaling Incident Response Plans

Who has a cybersecurity incident response plan? Hopefully, most folks in the audience do. Who has a scaling incident response plan? A scaling incident response plan is basically documentation: a description of what to do before, during, and after a scaling incident. Which thresholds to monitor, which thresholds are actionable, who to call, what to check, which decisions need to get made, and who makes them? It's a clear mapping out of a workflow that allows a system of computers, tools, humans, and data to scale together. There should be just enough detail in your scaling response plan to ensure that everyone involved knows how to diagnose what's happening and knows what to do next. A scaling incident response plan should, of course, include your systems and also your vendors' systems. Even if you can't control them, you can control what action you take when they go down, and roll out things like static backup sites or stopgap solutions.
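
One way to keep such a plan reviewable alongside the code is to capture it as structured data rather than prose. Below is a hypothetical sketch of what that could look like; every threshold, metric name, and role in it is a made-up placeholder, not MoveOn's actual plan.

```python
# A scaling incident response plan captured as data: what to watch before,
# who decides and what to do during, and what to record after.
SCALING_RUNBOOK = {
    "workflow": "protest RSVP signup",
    "before": {
        "monitor": ["p95 page latency", "RSVP writes/minute", "database CPU %"],
        "actionable_thresholds": {
            "p95 page latency": "over 2s for 1 minute",
            "database CPU %": "over 70% for 2 minutes",
        },
    },
    "during": {
        "first_responder": "on-call engineer (see paging rotation)",
        "decision_owner": "CTO or delegate decides on static fallback site",
        "checks": [
            "Is the surge from our own send or an influencer mention?",
            "Are vendor mobilization tools (email/SMS) still inside their SLAs?",
        ],
        "actions": ["raise autoscaling ceiling", "enable cached/static fallback"],
    },
    "after": ["record peak traffic and costs", "update thresholds and this plan"],
}
```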

Granular Autoscaling

It's important to get really granular with autoscaling. Most web architectures today are deployed to some cloud system that comes with a variety of autoscaling levers at your disposal. For really fast turnaround work, it's important to get really granular with your autoscaling thresholds and actions. That means down to the level of minutes for political work or anything related to the news cycle. Typically, you only have 24 hours to put together a meaningful response. If you aren't able to get it done in time, you've basically just missed that moment and have no impact. There's also an action curve involved in most news-related user behaviors: the amount of time between when a thing happens and when people find out about that thing. Then some percentage of them will take an action, and that typically follows a bell curve. In most news or media related work, the bell curve is spread out over 24 or 48 hours. We don't have hours to respond, and that means we can't lose even 15 minutes waiting for autoscaling to kick in. Autoscaling thresholds have to be super granular, and responsive down to the minute.
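
As a sketch of what minute-level granularity can look like on AWS, the boto3 snippet below wires a one-minute CloudWatch alarm to a step-scaling policy on an Auto Scaling group: a single breached 60-second period adds capacity, with a bigger step for a bigger surge. The group name, load balancer dimension, and thresholds are illustrative assumptions, not MoveOn's configuration.

```python
# Minute-granularity autoscaling: a step-scaling policy triggered by a
# CloudWatch alarm that evaluates one 60-second period.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="protest-hub-web",       # hypothetical group name
    PolicyName="surge-step-up",
    PolicyType="StepScaling",
    AdjustmentType="ChangeInCapacity",
    EstimatedInstanceWarmup=60,                    # count new instances quickly
    StepAdjustments=[
        # Bounds are relative to the alarm threshold below.
        {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 2000.0,
         "ScalingAdjustment": 2},                  # modest surge: +2 instances
        {"MetricIntervalLowerBound": 2000.0,
         "ScalingAdjustment": 8},                  # influencer moment: +8
    ],
)

cloudwatch.put_metric_alarm(
    AlarmName="protest-hub-request-surge",
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/protest-hub/123456"}],
    Statistic="Sum",
    Period=60,                 # one-minute granularity
    EvaluationPeriods=1,       # act on a single breached minute
    Threshold=5000.0,          # requests/minute that indicate a surge
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```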

Another way to make this work even better and more cheaply, depending on how your application is structured, is to consider using microservices for a small set of scaling bottlenecks. It's always possible to throw more servers behind a CPU- or memory-bound website, but it's helpful to take your application into consideration and see which pieces of it could be abstracted out into microservices. I'm not suggesting you take your entire website, break it into 100 overdesigned microservices, and spend 4 years working on this, only to find that the cloud computing paradigms have changed out from under you. Don't microservice everything, but take a look at one or two scaling bottlenecks. Consider whether these could be abstracted out, potentially vendored up or containerized out. In our experience, this has driven down the cost of autoscaling significantly, because the per-invocation cost of a microservice is typically well below the per-hour cost of a virtual machine. We were able to drive down our scaling costs to 10% of what they were, year-over-year, compared to the cost of scaling up dedicated hardware.
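
To show the flavor of abstracting out a single bottleneck, here is a minimal sketch of one hot endpoint (RSVP submission) as an AWS Lambda handler writing to DynamoDB, so that piece is billed per invocation rather than per VM-hour. The function, table name, and request shape are assumptions for the example, not MoveOn's actual implementation.

```python
# One scaling bottleneck abstracted into a serverless function.
import json
import boto3

table = boto3.resource("dynamodb").Table("rsvps")  # hypothetical table

def handler(event, context):
    """AWS Lambda entry point for a hypothetical POST /rsvp endpoint."""
    body = json.loads(event.get("body") or "{}")
    if not body.get("email") or not body.get("event_id"):
        return {"statusCode": 400, "body": json.dumps({"error": "missing fields"})}
    table.put_item(Item={
        "event_id": str(body["event_id"]),
        "email": body["email"],
        "zip_code": body.get("zip_code", ""),
    })
    return {"statusCode": 201, "body": json.dumps({"status": "rsvp recorded"})}
```

With per-invocation pricing, you pay only during the surge itself, which is where the large year-over-year cost reduction described above comes from.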

Your scaling response plan should also include all of the levers at your disposal as a distributed system. Many of us are already familiar with horizontally scaling web servers. In most cloud computing ecosystems, while it's easy to add servers or additional containerized capacity, it's also possible to upgrade, just in time, the brainpower of the hardware that you're using. Typically, scaling a huge number of writes to a database is a hard problem, but the scaling considerations for a database are usually that you're going to be CPU bound if your queries are computationally expensive, or, if you're just doing a whole lot of writing, you're going to be bound at the network I/O level. It's possible to upgrade your database's network I/O capacity or CPU, just in time, as needed. Don't forget that is an option. At the application level, it's also possible to add additional caching at a variety of different layers in between user behavior, frontend, backend, and the database. This is a significant investment in developer time, as debugging caching issues is notoriously difficult, but it is an option at your disposal. If your system is bound by a firehose of writes instead of reads, then you can also add a queue, or several queues, to your architecture, and that will allow you to collect a burst of writes and process them in a way that you can control.
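
For the write-firehose case, here is a minimal sketch of putting a queue between the web tier and the database using Amazon SQS: the web tier enqueues cheaply during the surge, and a worker drains the queue at a rate the database can sustain. The queue URL, batch size, and throttle are illustrative assumptions.

```python
# A queue absorbing a burst of writes so the database is drained at a
# controlled rate instead of taking the surge directly.
import json
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/rsvp-writes"  # hypothetical

def enqueue_rsvp(payload: dict):
    """Called from the web tier: absorb the write instead of hitting the DB."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(payload))

def drain_forever(write_to_db):
    """Worker loop: pull up to 10 messages at a time and write them at our pace."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            write_to_db(json.loads(msg["Body"]))        # caller-supplied DB insert
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
        time.sleep(0.1)  # throttle to what the database can sustain
```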

Don't Forget the CAP Theorem

When considering distributed system scaling issues overall, it's important to pull back even further and remember the CAP theorem, which states that you can only have two of consistency, availability, and partition tolerance. This is not a suggestion. This is a theorem, proved by Seth Gilbert and Nancy Lynch at MIT back in 2002. It's helpful to analyze your architecture as a whole ahead of any scaling incidents, and map out the choices that you may be forced to make, and the choices that you can make, in the event of a loss of data consistency, a loss of component availability, or a network partition. Be prepared to make hard choices: map this out and decide which hard choices you're willing to make.

Conclusion

The big social media companies have changed the shape of the attention economy, where we get our information, how we get it, whose voices shout louder based on their reach and followership. Social media is a bit of an oligarchy right now. Influencers have the majority of the power. Because of the power of micro-influencers and algorithmic feedback loops, if your website is going to get a traffic surge, this is going to happen in minutes instead of hours, and so you need to be able to respond in minutes. The most important way of preparing yourself for being able to do this is to do scale planning and have scaling incident response plans. Scaling today is much harder than it used to be, but it's also very important. Monitor everything, create emergency response plans, and get really granular.

 


 

Recorded at:

Jun 12, 2021
