Allegro achieved significant savings for one of its Dataflow pipelines running on Google Cloud Platform (GCP). The company continues to improve the cost-effectiveness of its data workflows by evaluating resource utilization, enhancing pipeline configurations, optimizing input and output datasets, and improving storage strategies.
Allegro runs many data pipelines on the Google Cloud Dataflow processing engine and has identified opportunities for cost savings by optimizing them. Jakub Demianowski, a senior software engineer at Allegro, shared a case study detailing the steps taken to achieve an estimated 60% cost reduction for a single pipeline.
CPU Utilization Statistics (Source: Allegro Technology Blog)
The cost optimization effort focused on several areas and involved testing hypotheses about potential inefficiencies contributing to the overall cost of running the pipeline. The first hypothesis explored was possible underutilization of compute resources. The analysis of CPU utilization metrics revealed a mean CPU utilization of 85%, with the dips attributable to data shuffling, which indicated that the CPUs were not underutilized.
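Such utilization figures can also be pulled programmatically rather than read off the console. The snippet below is a minimal sketch of querying worker CPU utilization through the Cloud Monitoring API, assuming the google-cloud-monitoring client library; the project ID and time window are illustrative and not taken from the case study.

```python
# Minimal sketch: average CPU utilization of worker VMs over the last week,
# queried via the Cloud Monitoring API. Project ID and window are illustrative.
import time

from google.cloud import monitoring_v3

project_id = "my-gcp-project"  # hypothetical project
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 7 * 24 * 3600}, "end_time": {"seconds": now}}
)

# CPU utilization is reported per Compute Engine instance backing the Dataflow job.
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    points = [p.value.double_value for p in series.points]
    if points:
        instance = series.resource.labels.get("instance_id")
        print(instance, f"mean utilization: {sum(points) / len(points):.2%}")
```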
Memory Utilization Statistics (Source: Allegro Technology Blog)
Looking at the memory utilization metrics, Demianowski concluded that only 50% of the available memory was being used and opted to change the compute instance type to adjust the CPU-to-memory ratio, which resulted in 10% cost savings.
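On Dataflow, the worker machine type is a pipeline option rather than a code change. The snippet below is a minimal sketch of applying such a change when launching an Apache Beam job; the project, region, bucket, and custom machine type are illustrative assumptions, not the values used in the case study.

```python
# Minimal sketch: launching a Beam job on Dataflow with an adjusted
# CPU-to-memory ratio. All identifiers below are illustrative placeholders.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    # A custom machine type keeps the vCPU count but trims unused memory,
    # changing the CPU-to-memory ratio of the workers (8 vCPUs, 16 GB here).
    "--worker_machine_type=e2-custom-8-16384",
])

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["placeholder"])  # stand-in for the real source
        | "Process" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```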
The second hypothesis assumed that the price-to-performance ratio of the virtual machine type was suboptimal. Based on CoreMark scores published by Google Cloud, the t2d-standard-8 VM type offered the best cost-effectiveness, which the author confirmed by running the data pipeline against 3% of the original dataset and achieving a 32% cost reduction. The third hypothesis focused on the VM storage type; Demianowski compared different VM families working with HDD and SSD disks, and it turned out that using SSDs was cheaper.
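Both findings translate into worker-level pipeline options. The snippet below is a minimal sketch of selecting the t2d-standard-8 machine type and SSD persistent disks for Dataflow workers; the surrounding options are illustrative placeholders, and the disk-type resource path follows the format documented for Dataflow, where the service fills in the project and zone segments.

```python
# Minimal sketch: t2d workers with SSD persistent disks. Project, region,
# and bucket are illustrative placeholders.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    # VM family chosen for its CoreMark-per-dollar ratio.
    "--worker_machine_type=t2d-standard-8",
    # SSD persistent disks for the workers; Dataflow resolves the empty
    # project and zone segments of the resource path at launch time.
    "--worker_disk_type=compute.googleapis.com/projects//zones//diskTypes/pd-ssd",
])
```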
The last hypothesis covered possible cost inefficiencies in the job configuration. One specific area of concern was the disproportionately high cost of the Dataflow Shuffle service. The author evaluated running the job with and without the shuffle service and concluded that turning off the service reduced costs considerably and additionally allowed worker nodes to fully utilize the available memory.
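Disabling the service-based shuffle is likewise a configuration switch. The snippet below is a minimal sketch of opting a batch job out of the Dataflow Shuffle service via the shuffle_mode=appliance experiment so that shuffling runs on the worker VMs; the remaining options are illustrative placeholders.

```python
# Minimal sketch: opting a batch job out of the service-based Dataflow Shuffle
# so the shuffle runs on the worker VMs. Identifiers are illustrative.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-gcp-project",
    "--region=europe-west1",
    "--temp_location=gs://my-bucket/tmp",
    "--worker_machine_type=t2d-standard-8",
    # Fall back to VM-based shuffle instead of the Dataflow Shuffle service.
    "--experiments=shuffle_mode=appliance",
])
```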
After implementing the steps described in the blog post, Demianowski estimates that the annual cost of running the pipeline was reduced from $127k to around $48k. He summarized the efforts aimed at improving the cost-effectiveness of running the pipeline:
We achieved excellent outcome without even touching the processing code. Speculative approach provided good results. There may still be some space for optimization, but within the timeframe I was given, I treat these results as first-rate and do not find any more reasons to further optimize the environment and configuration of the Dataflow job.
The author highlighted that each data pipeline is different, and engineers need to methodically assess and pursue different avenues for reducing operational costs, empirically evaluating the impact of each change.