If you could handle all of the data you need to work with on one machine, then there is no reason to use big data techniques. So clustering is pretty much assumed for any installation larger than a basic proof of concept. In Splunk Enterprise, the most common type of cluster you’ll be dealing with is the Indexer Cluster.
Basic Architecture
In a Splunk Indexer Cluster, there are three roles: Master, Search Head, and Indexer. Within a cluster, the indexers are referred to as peer nodes.
The master node manages the cluster. It coordinates the replicating activities of the peer nodes and tells the search head where to find data. It also helps manage the configuration of peer nodes and orchestrates remedial activities if a peer goes down.
The peer nodes receive and index incoming data, just like non-clustered, stand-alone indexers. Unlike stand-alone indexers, however, peer nodes also replicate data from other nodes in the cluster. A peer node can index its own incoming data while simultaneously storing copies of data from other nodes. You must have at least as many peer nodes as the replication factor. That is, to support a replication factor of 3, you need a minimum of three peer nodes.
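The peer-count rule above is simple enough to check in a few lines. The following is a minimal sketch, not part of Splunk itself; the function names are hypothetical and exist only to illustrate the arithmetic.

```python
def min_peers_required(replication_factor: int) -> int:
    """Return the smallest number of peer nodes that can hold
    `replication_factor` copies of each bucket (one copy per peer)."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor


def can_support(peer_count: int, replication_factor: int) -> bool:
    """True if the cluster has enough peers for the replication factor."""
    return peer_count >= min_peers_required(replication_factor)
```

For example, can_support(2, 3) is False because two peers cannot hold three separate copies, while can_support(3, 3) is True.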
The search head runs searches across the set of peer nodes. You must use a search head to manage searches across indexer clusters.
If you are using multisite replication, each site has its own search heads and peer nodes, while a single master node is responsible for all of the sites. “Although the master resides physically on a site, the master is not actually a member of any site.”
Know the logs
There are two key logs used in Splunk Indexer Clustering.
The first is splunkd_access.log. This tracks each administrative message sent between the master and the peers. For example, you can use this to see exactly when a peer tells the master that one of its buckets changed state.
Related to this is metrics.log. As the name implies, it tells you how well the cluster is performing by logging metrics such as how long each operation took, the number of searches currently running, and the number of queued searches.
Splunk recommends that you become familiar with both these logs as there is no better source for information when troubleshooting an Indexer Cluster.
More Buckets mean More Problems
In Splunk, data is grouped into buckets. As the number of buckets increases, so does the amount of time spent managing them. By default, Splunk scans the status of all the buckets every second. As the bucket count grows, that interval should be lengthened to roughly 1 second per 50K buckets.
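As a rough illustration of the 1-second-per-50K-buckets guideline, here is a small sketch. The function name is hypothetical, and rounding up to a whole second (with a floor of 1 second) is my own assumption, not something the guideline specifies.

```python
import math

# Guideline from the text: about 1 second of scan interval per 50K buckets.
BUCKETS_PER_SECOND = 50_000


def scan_interval_seconds(bucket_count: int) -> int:
    """Suggest a bucket scan interval in whole seconds.

    Assumption: round up, and never go below the 1 second default.
    """
    if bucket_count < 0:
        raise ValueError("bucket count cannot be negative")
    return max(1, math.ceil(bucket_count / BUCKETS_PER_SECOND))
```

With this sketch, a cluster of 120,000 buckets comes out at 3 seconds, while anything up to 50,000 buckets stays at the 1 second default.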
Heartbeat intervals should also be increased as you scale up. If you have more than 50 peers or more than 100K buckets, increase the time between heartbeats from 1 second to somewhere between 5 and 15 seconds.
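The heartbeat thresholds can be expressed the same way. This is a sketch under the numbers quoted above; the function name is hypothetical, and it only returns the suggested range rather than picking a specific value within it.

```python
from typing import Tuple


def recommended_heartbeat_range(peer_count: int, bucket_count: int) -> Tuple[int, int]:
    """Return (min_seconds, max_seconds) between heartbeats.

    Large clusters (more than 50 peers or more than 100K buckets)
    get the 5 to 15 second range; smaller ones keep 1 second.
    """
    if peer_count > 50 or bucket_count > 100_000:
        return (5, 15)
    return (1, 1)
```

Where you land inside the 5 to 15 second range is a judgment call that depends on how quickly you need a failed peer to be noticed.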
Multisite Search Affinity
When using multisite clustering, it is generally preferable to enable “multisite search affinity”. Under this model, each bucket is copied to every site. Searches default to using only the local copy, but automatically reach out to other sites if a local peer fails.
When troubleshooting, keep in mind that falling back to a remote site can cause a sudden and significant drop in search performance. So if searches that were previously fast start taking a long time, check for failed peers or stalled data replication.
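To make the local-first behavior concrete, here is a toy model of how a search might choose which copies of a bucket to use. It is not how Splunk implements search affinity internally; the Peer class and peers_to_search function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Peer:
    name: str
    site: str
    is_up: bool = True


def peers_to_search(bucket_copies: List[Peer], local_site: str) -> List[Peer]:
    """Pick peers for one bucket: healthy local copies first,
    otherwise fall back to healthy copies on any other site."""
    healthy = [p for p in bucket_copies if p.is_up]
    local = [p for p in healthy if p.site == local_site]
    return local if local else healthy
```

If the only local peer holding a bucket is marked down, the function returns the remote copy instead. That cross-site hop is the point where a previously fast search would start to slow down.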