Iterative Clustering
K-means clustering can be performed iteratively using different embeddings of the data. For example, with high-dimensional time series data, it may be advantageous to:
* Down-sample the data via the Haar transform (a.k.a. averaging); a minimal sketch of this step appears below
* Solve the K-means clustering problem on the down-sampled data
* Assign the down-sampled points to clusters
* Create a new `KMeansModel` using the assignments on the original data
* Solve the K-means clustering problem on the original data, initialized with the `KMeansModel` so constructed
This technique has been named the "Anytime" Algorithm.
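To make the first step concrete, here is a minimal sketch of one level of the Haar (averaging) transform on a dense series; `haarDownSample` is a hypothetical helper name used only for illustration, not part of the library's API.

```scala
object HaarExample extends App {
  // One level of the Haar (averaging) transform: replace each adjacent
  // pair of samples with its mean, halving the dimension of the series.
  // `haarDownSample` is a hypothetical name used only for illustration.
  def haarDownSample(series: Array[Double]): Array[Double] =
    series.grouped(2).map(pair => pair.sum / pair.length).toArray

  // A 4-dimensional series becomes the 2-dimensional series (5.0, 6.0).
  println(haarDownSample(Array(4.0, 6.0, 10.0, 2.0)).mkString(", "))
}
```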
The `com.massivedatascience.clusterer.KMeans` helper object provides a method, `timeSeriesTrain`, that embeds the data iteratively.
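Since the exact `timeSeriesTrain` signature is not reproduced here, the following is a minimal sketch of the same five steps built from stock Spark MLlib primitives (`setInitialModel` is the standard MLlib seeding hook). It illustrates the scheme; it is not the library's implementation.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Illustrative sketch of the five steps above using plain Spark MLlib.
def anytimeTimeSeriesCluster(original: RDD[Vector], k: Int): KMeansModel = {
  // 1. Down-sample each series with one level of the Haar (averaging) transform.
  val downSampled = original.map { v =>
    Vectors.dense(v.toArray.grouped(2).map(p => p.sum / p.length).toArray)
  }

  // 2. Solve the K-means problem on the down-sampled data.
  val coarseModel = new KMeans().setK(k).setMaxIterations(20).run(downSampled)

  // 3. Assign the down-sampled points to clusters.
  val assignments = downSampled.map(coarseModel.predict)

  // 4. Build seed centers from the assignments applied to the original data:
  //    each center is the mean of the original points assigned to its cluster.
  val fullCenters = assignments.zip(original)
    .map { case (c, v) => (c, (v.toArray, 1L)) }
    .reduceByKey { case ((a, na), (b, nb)) =>
      (a.zip(b).map { case (x, y) => x + y }, na + nb)
    }
    .map { case (_, (sum, n)) => Vectors.dense(sum.map(_ / n)) }
    .collect()

  // 5. Solve K-means on the original data, seeded with the constructed model.
  //    setK uses fullCenters.length in case some coarse clusters were empty.
  new KMeans().setK(fullCenters.length)
    .setMaxIterations(20)
    .setInitialModel(new KMeansModel(fullCenters))
    .run(original)
}
```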
High-dimensional data can be clustered directly, but the cost is proportional to the dimension. If the divergence of interest is squared Euclidean distance, one can use Random Indexing to down-sample the data while preserving distances between clusters, with high probability.
The `com.massivedatascience.clusterer.KMeans` helper object provides a method, `sparseTrain`, that embeds the data into various dimensions using random indexing.
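As an illustration of the idea (not the library's implementation), here is a minimal sketch of random indexing: each original coordinate contributes a sparse, signed "index vector", derived here from a hash, and the embedding is the sum of those contributions. The function name and hashing choices are assumptions made for this sketch.

```scala
object RandomIndexingSketch extends App {
  import scala.util.hashing.MurmurHash3

  // Map a sparse high-dimensional vector, given as (index, value) pairs,
  // into `d` dense dimensions. Each input index contributes +/- its value
  // to a few hashed output coordinates. Illustrative only.
  def randomIndex(point: Seq[(Int, Double)], d: Int, seeds: Int = 4): Array[Double] = {
    val out = new Array[Double](d)
    for ((i, v) <- point; s <- 0 until seeds) {
      val h = MurmurHash3.productHash((i, s))
      val pos = ((h % d) + d) % d                        // target coordinate
      val sign = if (((h >> 16) & 1) == 0) 1.0 else -1.0 // pseudo-random sign
      out(pos) += sign * v
    }
    out
  }

  // Three nonzero entries of a million-dimensional vector land in 64 dims.
  val dense = randomIndex(Seq(7 -> 1.0, 123456 -> 2.5, 999999 -> -0.5), d = 64)
  println(dense.count(_ != 0.0) + " nonzero coordinates")
}
```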
If multiple embeddings are provided, the `KMeans.train` method performs the embeddings and trains on the embedded data sets iteratively.
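A call might look like the following; the parameter name `embeddingNames` and the embedding constants are assumptions based on this description rather than a verified signature, so treat it as a sketch of the call shape only.

```scala
// Hypothetical call shape -- parameter and constant names are assumptions.
// Each embedding is applied in turn, and each round of clustering seeds
// the next, from the coarsest embedding up to the finest.
val model = KMeans.train(
  data,            // RDD[Vector] of high-dimensional sparse points
  k = 20,
  embeddingNames = List(Embedding.LOW_DIMENSIONAL_RI,
                        Embedding.MEDIUM_DIMENSIONAL_RI,
                        Embedding.HIGH_DIMENSIONAL_RI))
```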
For example, with high-dimensional data, one may wish to embed the data into a lower-dimensional space before clustering to reduce running time.
For time series data, the Haar Transform has been used successfully to reduce running time while maintaining or improving quality.
For high-dimensional sparse data, random indexing can be used to map the data into a low-dimensional dense space.
One may also perform clustering recursively, using lower-dimensional clustering to derive initial conditions for higher-dimensional clustering.
Should you wish to train a model iteratively on data sets that are derived from a shared original data set by different mappings, you may use `KMeans.iterativelyTrain`.
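The underlying pattern generalizes the time-series sketch above: fold over a sequence of data sets, ordered from coarsest to finest, carrying each round's assignments forward as the seed for the next. Below is a minimal sketch with stock Spark MLlib; the names are illustrative, not the library's `iterativelyTrain`, and the data sets are assumed to be element-wise aligned maps of the same points.

```scala
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Illustrative sketch: train on the coarsest data set, then seed each
// subsequent round with centers averaged from the previous assignments.
def iterativelyTrainSketch(dataSets: Seq[RDD[Vector]], k: Int): KMeansModel =
  dataSets.zip(dataSets.tail).foldLeft(new KMeans().setK(k).run(dataSets.head)) {
    case (model, (prev, next)) =>
      // Assignments computed on the previous data set carry over to the
      // next, because both are maps of the same underlying points.
      val centers = prev.map(model.predict).zip(next)
        .map { case (c, v) => (c, (v.toArray, 1L)) }
        .reduceByKey { case ((a, na), (b, nb)) =>
          (a.zip(b).map { case (x, y) => x + y }, na + nb)
        }
        .map { case (_, (sum, n)) => Vectors.dense(sum.map(_ / n)) }
        .collect()
      new KMeans().setK(centers.length)
        .setInitialModel(new KMeansModel(centers))
        .run(next)
  }
```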