At a recent QCon San Francisco talk, Amitai Stern, engineering manager at Logz.io and a member of the OpenSearch Leadership Committee, shared practical insights on managing OpenSearch clusters efficiently in environments with fluctuating workloads. His "OpenSearch Cluster Topologies for Cost-Saving Autoscaling" session explored strategies to scale OpenSearch effectively while minimizing costs.
OpenSearch, a fork of Elasticsearch, is designed to handle large-scale data processing and analytics. Its distributed architecture leverages nodes and shards to manage data across the cluster. However, many clusters experience fluctuating workload patterns—for example, day/night or weekday/weekend cycles—leading to periods of under- or over-utilized resources.
While OpenSearch supports scaling by adding nodes (horizontal scaling), scaling down or removing resources when workloads decrease is more complex. This limitation can result in unnecessary costs, especially in cloud-based environments.
Stern explained the intricacies of OpenSearch scaling and identified factors that make cost-effective autoscaling difficult:
- Horizontal Scaling:
- Adding nodes distributes data across shards, improving performance and reducing hotspots (overloaded nodes).
- However, imbalanced workloads—caused by uneven shard distribution or resource use (CPU, memory, disk)—can create inefficiencies.
- Vertical Scaling:
- Increasing machine resources (e.g., CPU, memory) is less flexible in the cloud and primarily addresses disk constraints, making it a limited solution.
- Shard Management:
- OpenSearch partitions data using shards. However, shards are not dynamically resizable. When scaling in, data must be redistributed across fewer nodes, which is time-consuming and resource-intensive.
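In practice, scaling in means draining shards off a node before it is removed rather than simply terminating it. The snippet below is a minimal sketch of that pattern, assuming the opensearch-py client; the connection details and the node name `data-node-3` are placeholders, not values from the talk.

```python
from opensearchpy import OpenSearch

# Placeholder connection details; adjust host, port, and auth for your cluster.
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Ask the allocator to move all shards off the node we intend to remove.
# "data-node-3" is a hypothetical node name.
client.cluster.put_settings(body={
    "persistent": {
        "cluster.routing.allocation.exclude._name": "data-node-3"
    }
})

# Check how many shards remain on the excluded node; once it reports zero,
# the node can be terminated without losing replicas mid-move.
allocation = client.cat.allocation(format="json")
remaining = [row for row in allocation if row.get("node") == "data-node-3"]
print(remaining)  # expect "shards": "0" before removing the node
```

This relocation step is exactly the time- and resource-intensive redistribution Stern described, which is why scale-in is harder than scale-out.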
Stern went on to introduce practical approaches that mitigate these challenges and reduce costs:
- Oversharding for Flexibility:
By creating more shards than are currently needed, clusters can accommodate future growth without requiring immediate scaling actions. This strategy avoids hotspots during high workloads (a combined oversharding-and-rollover sketch follows this list).
- Rollover Indices:
Rollover strategies dynamically create new indices to handle write-heavy operations, reducing the risk of overloaded nodes while maintaining balanced shard distribution.
- Burst Topologies:
Stern highlighted two burst-oriented designs (see the burst-index sketch after this list):
- Burst Indices: Temporary indices to handle spikes in write operations.
- Burst Clusters: Additional nodes activated during peak loads, then scaled down during idle periods.
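To illustrate how oversharding and rollover fit together, the sketch below creates a write index with more primary shards than the current node count strictly requires and rolls it over once size, document-count, or age conditions are met. It assumes the opensearch-py client; the index names, the `logs-write` alias, the shard count, and the rollover thresholds are illustrative, not figures from the talk.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Overshard: more primary shards than today's node count strictly needs,
# so newly added nodes can pick up existing shards without splitting or
# reindexing anything.
client.indices.create(
    index="logs-000001",
    body={
        "settings": {"number_of_shards": 12, "number_of_replicas": 1},
        "aliases": {"logs-write": {"is_write_index": True}},
    },
)

# Roll over to a fresh index (logs-000002, ...) once the active index grows
# past the given size, document count, or age.
client.indices.rollover(
    alias="logs-write",
    body={
        "conditions": {
            "max_size": "50gb",
            "max_docs": 100_000_000,
            "max_age": "1d",
        }
    },
)
```

In a production cluster this rollover would typically be driven by an Index State Management policy rather than a manual call; the API above just makes the mechanism explicit.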
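A burst index can be approximated by pointing the write alias at a temporary index during a spike and dropping or archiving it afterwards. The following sketch again assumes opensearch-py and hypothetical index and alias names; it is a simplified illustration, not the exact mechanism Stern presented.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Hypothetical burst index sized for the spike (extra primary shards so the
# temporary write load spreads across burst-capacity nodes).
client.indices.create(
    index="logs-burst-001",
    body={"settings": {"number_of_shards": 6, "number_of_replicas": 0}},
)

# Atomically move the write alias: new writes land in the burst index while
# the regular index keeps serving reads.
client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "logs-000002", "alias": "logs-write"}},
        {"add": {"index": "logs-burst-001", "alias": "logs-write",
                 "is_write_index": True}},
    ]
})
```

Once the spike passes, writes return to the regular rollover chain and the burst index can be merged, snapshotted, or deleted, which in turn lets burst nodes or burst clusters scale back down.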
To optimize OpenSearch for fluctuating workloads, Stern emphasized focusing on three key resources:
- Disk: Use searchable snapshots to reduce disk usage while maintaining access to archived data (a restore sketch follows this list).
- CPU and Memory: Plan ahead to provision resources for anticipated spikes. High-performing nodes or burst clusters can handle temporary increases in workload.
- Load Distribution: Rollover and oversharding help distribute resources evenly, preventing hotspots and inefficiencies.
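For the disk dimension, OpenSearch's searchable snapshots let older indices be served directly from a snapshot repository instead of local storage. Below is a minimal sketch of restoring an index as a remote (searchable) snapshot, assuming opensearch-py, a cluster with search-capable nodes, and placeholder repository, snapshot, and index names.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Restore an archived index as a searchable (remote) snapshot: the data stays
# in the repository (e.g., object storage) and only queried segments are
# pulled onto local disk. "my-repo", "daily-snapshot", and "logs-2024-01"
# are placeholder names.
client.snapshot.restore(
    repository="my-repo",
    snapshot="daily-snapshot",
    body={
        "indices": "logs-2024-01",
        "storage_type": "remote_snapshot",
    },
)
```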
Lastly, Stern discussed potential advancements in OpenSearch, such as reader/writer separation, which would decouple resource-heavy write operations from read operations. This feature could simplify scaling by enabling clusters to allocate resources better for distinct workloads.