In a recent blog post, Twitter announced the start of their journey to the public cloud, a project known internally as Partly Cloudy. The post outlines some of the constraints that have prevented the move in the past and describes why now is a good time to embark on this transformation. Twitter began an investigation more than three years ago, but parked the idea as it wasn’t feasible at the time. Ultimately, the ability to take advantage of new cloud offerings and to provide a broader geographic footprint prompted Twitter to revisit the move to the cloud.
When Twitter first evaluated moving workloads to the cloud more than three years ago, it determined that the timing was not right and that lifting and shifting Twitter’s infrastructure was not feasible. Joep Rottinghuis, senior software manager at Twitter, explains some of the previous concerns:
Putting our infrastructure in the cloud sounds simple: just lift and shift, right? However, our data infrastructure alone ingests over a trillion events, processes hundreds of petabytes, and runs many tens of thousands of jobs on over a dozen clusters each day—it would be anything but simple.
The Twitter team identified several use cases that were candidates for moving to the cloud. However, there were concerns about freezing existing projects while other teams focused on cloud migrations, so a decision was made to continue managing their existing on-premises infrastructure.
Twitter had already established cost-effective scale in managing their own data centers, but felt they were missing out on new capabilities offered by hyper-scale cloud providers, particularly around availability, elasticity and scalability.
In May 2018, Twitter launched a new initiative, establishing a partnership with Google to manage cold data storage for their Hadoop clusters. Parag Agrawal, CTO at Twitter, explains the driver for leveraging the cloud for this use case:
[The move to the cloud] enabled us to enhance the experience and productivity of our engineering teams working with our data platform.
Twitter has identified four different use cases for their Hadoop clusters, which host more than 300 PB of data across tens of thousands of servers:
- Real-time clusters which ingest incoming data generated by users
- Processing clusters where regularly scheduled jobs execute
- Ad-hoc clusters which support one-off queries and occasional analysis
- Dedicated dense storage clusters that manage cold data
The Twitter team looked for opportunities where they could benefit most from cloud capabilities with the least amount of risk. This led to the decision to move the clusters that support the ad-hoc and cold data use cases to the Google Cloud Platform.
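For the cold data use case in particular, a common pattern, though one Twitter’s post does not confirm, is to expose Google Cloud Storage to existing Hadoop tooling through the Cloud Storage connector, so that jobs can address cloud-hosted data with gs:// paths through the same FileSystem API they already use on-premises. The Java sketch below is a minimal illustration of that idea; the bucket name and paths are hypothetical, and it assumes the gcs-connector jar and credentials are already in place:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class ColdStorageListing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the Cloud Storage connector so Hadoop can resolve gs:// URIs.
        // (Assumes the gcs-connector jar is already on the classpath.)
        conf.set("fs.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl",
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");

        // Hypothetical bucket and prefix, used purely for illustration.
        URI coldStore = URI.create("gs://example-cold-data-bucket/");
        FileSystem gcs = FileSystem.get(coldStore, conf);

        // Existing HDFS-oriented code can list and read cloud-hosted cold data
        // through the same FileSystem API it uses against on-premises clusters.
        for (FileStatus status : gcs.listStatus(new Path("gs://example-cold-data-bucket/archive/"))) {
            System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
        }
    }
}
```

The same gs:// addressing works for query engines built on the Hadoop FileSystem abstraction, so ad-hoc jobs could, in principle, read cloud-hosted data with minimal changes to job code.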
Another challenge for Twitter is rationalizing the infrastructure tooling that has been developed over the past decade to support software deployments, monitoring, alerting, security and logging. These tools were not necessarily designed with the cloud in mind and require modifications to support a new cloud-based model.
As Twitter’s cloud journey matures, they plan on sharing more details on their engineering blog.