Scalable clustering with Bregman divergences on Apache Spark
View the Project on GitHub derrickburns/generalized-kmeans-clustering
A decision framework for choosing the right divergence and algorithm.
| Your Data Type | Use This Divergence |
|---|---|
| General numeric (measurements, coordinates) | squaredEuclidean |
| Probability distributions (sum to 1) | kl |
| Text vectors (TF-IDF, embeddings) | cosine |
| Power spectra, audio features | itakuraSaito |
| Raw counts (not normalized) | generalizedI |
| Binary probabilities (0-1) | logistic |
| Data with outliers | l1 |
| Your Situation | Use This Algorithm |
|---|---|
| Standard clustering, known k | GeneralizedKMeans |
| Don’t know optimal k | XMeans |
| Points belong to multiple clusters | SoftKMeans |
| Need hierarchical structure | BisectingKMeans |
| Data arrives in streams | StreamingKMeans |
| Very large dataset | MiniBatchKMeans |
| Need outlier resistance | KMedoids or RobustKMeans |
| Need equal-sized clusters | BalancedKMeans |
| Have must-link/cannot-link constraints | ConstrainedKMeans |
| Time series data | TimeSeriesKMeans |
| Non-convex clusters | SpectralClustering |
new GeneralizedKMeans()
.setK(5)
.setDivergence("squaredEuclidean")
new GeneralizedKMeans()
.setK(10)
.setDivergence("kl")
new XMeans()
.setMinK(2)
.setMaxK(20)
new RobustKMeans()
.setK(5)
.setRobustMode("trim")
See Choose the Right Divergence for detailed guidance.
| Back to Explanation | Home |