“Fast and Provably Good Seedings for k-Means” by Olivier Bachem et al. was presented at the 2016 Neural Information Processing Systems (NIPS) conference. It describes AFK-MC², an alternative method for generating initial seedings for the k-Means clustering algorithm that is several orders of magnitude faster than the state-of-the-art method, k-Means++.
k-Means clustering can be used to group data points, or observations, for which you don't know the labels but you do know how many groups there are in total (K). The objective is to build K clusters that group similar data together under some metric of similarity (e.g. Euclidean distance). The standard algorithm, often called Lloyd’s algorithm, alternates between assigning each data point to its nearest cluster center and moving each center to the mean of its assigned points, thereby minimizing the total squared distance between each data point and the center of its cluster.
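To make the objective concrete, here is a minimal numpy sketch of Lloyd’s algorithm; the function name and the n_iter cap are illustrative choices, not something taken from the paper:

```python
import numpy as np

def lloyd(X, seeds, n_iter=100):
    """Plain Lloyd's algorithm: alternate assignment and centroid update.

    X: (n, d) array of data points; seeds: (k, d) array of initial centers.
    A minimal sketch for illustration, not an optimized implementation.
    """
    centers = seeds.copy()
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.allclose(new_centers, centers):  # reached a fixed point
            break
        centers = new_centers
    return centers, labels
```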
Like all algorithms for non-convex optimization problems, Lloyd’s algorithm may converge to a local minimum. To improve solution quality, the algorithm is bootstrapped with initial cluster centers called seeds. Random seeds are fast to generate, but may lead the algorithm to converge to solutions that are far from optimal.
k-Means++ improves the seeds by adaptively sampling the data points. It starts by selecting one data point uniformly at random as the initial seed. It then computes the distance from every data point to its nearest seed (the initial seed, in the first iteration). The next seed is drawn from the data points with probability proportional to the square of that distance, and the process repeats until K seeds have been chosen.
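As an illustration, a minimal sketch of this D² sampling scheme might look as follows (function and variable names are mine):

```python
import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    """k-Means++ seeding: sample each new seed with probability
    proportional to its squared distance to the nearest existing seed.

    A minimal sketch; X is an (n, d) array, k the number of clusters.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = len(X)
    seeds = [X[rng.integers(n)]]            # first seed: uniform at random
    # d2[i] = squared distance from X[i] to its nearest seed so far
    d2 = ((X - seeds[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        probs = d2 / d2.sum()               # D^2 sampling distribution
        seeds.append(X[rng.choice(n, p=probs)])
        d2 = np.minimum(d2, ((X - seeds[-1]) ** 2).sum(axis=1))
    return np.array(seeds)
```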
The drawback of k-Means++ is that it does not scale easily to massive data sets: its seeding step alone requires K full passes over the data, since each new seed needs the distance from every data point to its nearest existing seed, and every iteration of Lloyd’s algorithm likewise computes the distances between all cluster centers and all data points.
The proposed AFK-MC² method is presented as a simple yet fast alternative for k-Means seeding that produces provably good clusterings without assumptions on the data.
The method approximates k-Means++ using Markov chains in which the data points are the states. The first state of the chain is a randomly sampled data point. A randomized process then decides whether the chain transitions to other randomly proposed data points. The transition decision depends on a proposal distribution over all data points, computed once from their distances as part of a pre-processing step. Unlike k-Means++, this pre-processing requires only a single full pass through the dataset.
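A rough sketch of how such a Metropolis-Hastings chain could look is below. The proposal distribution q and the acceptance ratio follow my reading of the paper; drawing the first seed from q is a simplification, so treat this as an illustrative approximation rather than the reference implementation.

```python
import numpy as np

def afkmc2_seeds(X, k, m=200, rng=None):
    """Sketch of AFK-MC²-style seeding via Metropolis-Hastings chains.

    m is the chain length. An illustrative approximation of the paper's
    algorithm, not the reference implementation.
    """
    rng = rng if rng is not None else np.random.default_rng()
    n = len(X)
    # Pre-processing (one pass): the proposal distribution q mixes the squared
    # distance to the dataset mean with a uniform term, so q(x) > 0 for all x.
    d2_mean = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    q = 0.5 * d2_mean / d2_mean.sum() + 0.5 / n

    seeds = [X[rng.choice(n, p=q)]]         # simplification: first seed from q

    def d2_to_seeds(i):
        # Squared distance from X[i] to its nearest seed chosen so far.
        return (((X[i] - np.array(seeds)) ** 2).sum(axis=1)).min()

    for _ in range(k - 1):
        x = rng.choice(n, p=q)              # initial chain state
        dx = d2_to_seeds(x)
        for _ in range(m - 1):
            y = rng.choice(n, p=q)          # candidate next state
            dy = d2_to_seeds(y)
            # Metropolis-Hastings acceptance: accept y with probability
            # min(1, (dy * q[x]) / (dx * q[y])), which makes the chain's
            # stationary distribution the k-Means++ (D^2) distribution.
            if dy * q[x] > rng.random() * dx * q[y]:
                x, dx = y, dy
        seeds.append(X[x])
    return np.array(seeds)
```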
The stationary distribution of the chain used by AFK-MC² is the same as the k-Means++ sampling distribution, given that enough steps of the chain are simulated. The reported results show speedups of 200 to 1000 times over k-Means++, with a relative error of 0 to 1%, on large datasets.
A Cython-based implementation of the algorithm is available on GitHub, with examples of how to use it together with scikit-learn.
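Assuming the package exposes a kmc2() entry point as its README suggests (an assumption on my part; check the repository for the exact signature), usage together with scikit-learn could look like this:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

import kmc2  # the author's Cython package; exact API assumed from its README

X = np.random.default_rng(0).normal(size=(10000, 5))  # toy data
seeds = kmc2.kmc2(X, 10)  # AFK-MC² seeding for k = 10

# Feed the seeds to scikit-learn as a fixed initialization
# (n_init=1 keeps sklearn from re-running its own seeding).
model = MiniBatchKMeans(n_clusters=10, init=seeds, n_init=1).fit(X)
print(model.cluster_centers_.shape)  # (10, 5)
```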