Allegro achieved significant savings for one of its Dataflow pipelines running on Google Cloud Platform (GCP). The company continues to improve the cost-effectiveness of its data workflows by evaluating resource utilization, enhancing pipeline configurations, optimizing input and output datasets, and improving storage strategies.
Allegro runs many data pipelines on the Google Cloud Dataflow processing engine and has identified opportunities for cost savings by optimizing them. Jakub Demianowski, a senior software engineer at Allegro, shared a case study detailing the steps taken to achieve an estimated 60% cost reduction for a single pipeline.
CPU Utilization Statistics (Source: Allegro Technology Blog)
The cost optimization effort focused on several areas and involved testing hypotheses about potential inefficiencies contributing to the overall cost of running the pipeline. The first hypothesis explored was possible underutilization of compute resources. The analysis of CPU utilization metrics revealed a mean CPU utilization of 85%, with the dips attributable to data shuffling, which indicated that the CPUs were not underutilized.
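Such utilization figures can also be pulled programmatically rather than read off the console. The snippet below is a minimal sketch of querying worker CPU utilization through the Cloud Monitoring API, assuming the google-cloud-monitoring client library; the project ID and time window are illustrative and not taken from the case study.

```python
# Minimal sketch: average CPU utilization of worker VMs over the last week,
# queried via the Cloud Monitoring API. Project ID and window are illustrative.
import time

from google.cloud import monitoring_v3

project_id = "my-gcp-project"  # hypothetical project
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 7 * 24 * 3600}, "end_time": {"seconds": now}}
)

# CPU utilization is reported per Compute Engine instance backing the Dataflow job.
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    points = [p.value.double_value for p in series.points]
    if points:
        instance = series.resource.labels.get("instance_id")
        print(instance, f"mean utilization: {sum(points) / len(points):.2%}")
```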
Memory Utilization Statistics (Source: Allegro Technology Blog)
Looking at the memory utilization metrics, Demianowski concluded that only 50% of the available memory was being used and opted to change the compute instance type to adjust the CPU-to-memory ratio, which resulted in 10% cost savings.
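On Dataflow, the worker machine type is a pipeline option rather than a code change. The snippet below is a minimal sketch of applying such a change when launching an Apache Beam job; the project, region, bucket, and custom machine type are illustrative assumptions, not the values used in the case study.

```python
# Minimal sketch: launching a Beam job on Dataflow with an adjusted
# CPU-to-memory ratio. All identifiers below are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    # A custom machine type keeps the vCPU count but trims unused memory,
    # changing the CPU-to-memory ratio of the workers (8 vCPUs, 16 GB here).
    "--worker_machine_type=e2-custom-8-16384",
])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["placeholder"])  # stand-in for the real source
        | "Process" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```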
The second hypothesis assumed that the price-to-performance ratio of the virtual machine type was suboptimal. Based on CoreMark scores published by Google Cloud, the t2d-standard-8 VM type offered the best cost-effectiveness, which the author confirmed by running the data pipeline against 3% of the original dataset and achieving a 32% cost reduction. The third hypothesis focused on the VM storage type; Demianowski compared different VM families working with HDD and SSD disks, and it turned out that using SSDs was cheaper.
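Both findings translate into worker-level pipeline options. The snippet below is a minimal sketch of selecting the t2d-standard-8 machine type and SSD persistent disks for Dataflow workers; the surrounding options are illustrative placeholders, and the disk-type resource path follows the format documented for Dataflow, where the service fills in the project and zone segments.

```python
# Minimal sketch: t2d workers with SSD persistent disks. Project, region,
# and bucket are illustrative placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    # VM family chosen for its CoreMark-per-dollar ratio.
    "--worker_machine_type=t2d-standard-8",
    # SSD persistent disks for the workers; Dataflow resolves the empty
    # project and zone segments of the resource path at launch time.
    "--worker_disk_type=compute.googleapis.com/projects//zones//diskTypes/pd-ssd",
])
```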
The last hypothesis covered possible cost inefficiencies in the job configuration. One specific area of concern was the disproportionately high cost of the Dataflow Shuffle service. The author evaluated running the job with and without the shuffle service and concluded that turning off the service reduced costs considerably and additionally allowed worker nodes to fully utilize the available memory.
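Disabling the service-based shuffle is likewise a configuration switch. The snippet below is a minimal sketch of opting a batch job out of the Dataflow Shuffle service via the shuffle_mode=appliance experiment so that shuffling runs on the worker VMs; the remaining options are illustrative placeholders.

```python
# Minimal sketch: opting a batch job out of the service-based Dataflow Shuffle
# so the shuffle runs on the worker VMs. Identifiers are illustrative.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    "--worker_machine_type=t2d-standard-8",
    # Fall back to VM-based shuffle instead of the Dataflow Shuffle service.
    "--experiments=shuffle_mode=appliance",
])
```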
After implementing the steps described in the blog post, Demianowski estimates that the annual cost of running the pipeline was reduced from $127k to around $48k. He summarized the efforts aimed at improving the cost-effectiveness of running the pipeline:
We achieved excellent outcome without even touching the processing code. Speculative approach provided good results. There may still be some space for optimization, but within the timeframe I was given, I treat these results as first-rate and do not find any more reasons to further optimize the environment and configuration of the Dataflow job.
The author highlighted that each data pipeline is different, and engineers need to methodically assess and pursue different avenues for reducing operational costs, empirically evaluating the impact of each change.