Scalable clustering with Bregman divergences on Apache Spark
View the Project on GitHub: derrickburns/generalized-kmeans-clustering

Time: 15 minutes
Goal: Select the best algorithm for your use case
Start
│
├─ Do you know how many clusters?
│ ├─ NO → XMeans (automatic k selection)
│ └─ YES ↓
│
├─ Is your data probability distributions?
│ └─ YES → GeneralizedKMeans with divergence="kl"
│
├─ Is your data time series / sequences?
│ └─ YES → TimeSeriesKMeans with DTW
│
├─ Do you have outliers / noise?
│ └─ YES → RobustKMeans or KMedoids
│
├─ Do you need equal-sized clusters?
│ └─ YES → BalancedKMeans
│
├─ Do you have must-link/cannot-link constraints?
│ └─ YES → ConstrainedKMeans
│
├─ Do you need soft/probabilistic assignments?
│ └─ YES → SoftKMeans
│
├─ Is data arriving in streams?
│ └─ YES → StreamingKMeans
│
├─ Do you have very high dimensions + sparse data?
│ └─ YES → SparseKMeans
│
├─ Do you have non-convex cluster shapes?
│ └─ YES → SpectralClustering
│
└─ DEFAULT → GeneralizedKMeans with squaredEuclidean
| Algorithm | When to Use | Key Parameters |
|---|---|---|
| GeneralizedKMeans | General-purpose clustering | k, divergence, maxIter |
| XMeans | Unknown number of clusters | minK, maxK, criterion |
| SoftKMeans | Overlapping/fuzzy clusters | k, beta |
| BisectingKMeans | Hierarchical structure needed | k, minDivisibleClusterSize |
| StreamingKMeans | Real-time data streams | k, decayFactor, halfLife |
| KMedoids | Outlier-resistant, interpretable centers | k, distanceFunction |
| BalancedKMeans | Equal-sized clusters required | k, balanceMode, maxClusterSize |
| ConstrainedKMeans | Semi-supervised with constraints | k, mustLinkCol, cannotLinkCol |
| RobustKMeans | Noisy data with outliers | k, robustMode, trimFraction |
| SparseKMeans | High-dimensional sparse data | k, sparseMode, sparseThreshold |
| MultiViewKMeans | Multiple feature representations | k, viewSpecs |
| TimeSeriesKMeans | Sequence/time-series data | k, distanceType |
| SpectralClustering | Non-convex shapes, graph data | k, affinityType, laplacianType |
| InformationBottleneck | Information compression | k, beta, relevanceCol |
| MiniBatchKMeans | Very large datasets | k, batchSize |
**GeneralizedKMeans**

Best for: Most clustering tasks with well-defined feature vectors.

```scala
val kmeans = new GeneralizedKMeans()
  .setK(5)
  .setDivergence("squaredEuclidean") // Standard k-means
  .setMaxIter(100)
  .setTol(1e-4)
```
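To make the divergence options concrete, here is a minimal, self-contained sketch of two of them in plain Scala. It is independent of this library's API; `DivergenceSketch` and its method names are illustrative only:

```scala
object DivergenceSketch {
  // Squared Euclidean distance: the default k-means geometry
  def squaredEuclidean(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (x, y) => (x - y) * (x - y) }.sum

  // KL divergence for probability vectors (assumes strictly positive entries)
  def kl(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (x, y) => x * math.log(x / y) }.sum
}
```

Unlike squared Euclidean, KL is asymmetric and only defined for strictly positive probability vectors, which is why it is the option suggested for distribution-valued data.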
Divergence options:
- `squaredEuclidean` — Standard k-means (default)
- `kl` — Probability distributions (topic modeling)
- `itakuraSaito` — Spectral/audio data
- `l1` — Manhattan distance (robust to outliers)
- `generalizedI` — Count data
- `logistic` — Binary probabilities
- `spherical` / `cosine` — Text/document vectors

**XMeans**

Best for: When you don't know how many clusters to use.
```scala
val xmeans = new XMeans()
  .setMinK(2)
  .setMaxK(20)
  .setCriterion("bic") // or "aic"
```

How it works: starts with minK clusters, repeatedly splits them, evaluates each candidate split with BIC/AIC, and stops when splitting no longer improves the criterion.
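The split test can be sketched with a common BIC approximation for spherical clusters (lower is better). This is an illustrative stand-in, not the library's implementation; all names are hypothetical:

```scala
object BicSketch {
  // BIC ≈ n·ln(sse/n) + params·ln(n), where params counts the free
  // parameters (one d-dimensional mean per center); smaller is better.
  def bic(sse: Double, n: Int, numCenters: Int, dim: Int): Double = {
    val params = numCenters * dim
    n * math.log(sse / n) + params * math.log(n)
  }

  // Split a cluster only if two children score better than one parent.
  def shouldSplit(sseParent: Double, sseChildren: Double, n: Int, dim: Int): Boolean =
    bic(sseChildren, n, 2, dim) < bic(sseParent, n, 1, dim)
}
```

A split that cuts the sum of squared errors sharply (e.g. 100 → 10) justifies the extra parameters; a marginal improvement (100 → 99) does not.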
**SoftKMeans**

Best for: Points that belong to multiple clusters with different degrees.

```scala
val soft = new SoftKMeans()
  .setK(3)
  .setBeta(2.0) // Higher = more deterministic
```

Output: adds a `probabilities` column with membership weights.
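The effect of beta can be seen in a minimal softmax-style membership sketch (illustrative only, not the library's code):

```scala
object SoftAssignSketch {
  // Membership weights as a beta-scaled softmax over negative distances;
  // the weights always sum to 1.
  def memberships(distances: Array[Double], beta: Double): Array[Double] = {
    val scores = distances.map(d => math.exp(-beta * d))
    val z = scores.sum
    scores.map(_ / z)
  }
}
```

For distances (0, 1), beta = 2 gives the closer center roughly 88% of the weight, while beta = 10 gives it nearly all of it, approaching a hard assignment.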
**KMedoids**

Best for: Interpretable centers (actual data points) and outlier resistance.

```scala
val kmedoids = new KMedoids()
  .setK(5)
  .setDistanceFunction("euclidean") // or "manhattan", "cosine"
```

Key difference: centers are actual data points, not computed means.
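The "centers are actual data points" property follows from the medoid definition: the center of a cluster is the member point minimizing total distance to all other members. A plain-Scala sketch (illustrative only):

```scala
object MedoidSketch {
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // The medoid is always one of the input points, never an averaged center.
  def medoid(points: Seq[Array[Double]]): Array[Double] =
    points.minBy(p => points.map(q => euclidean(p, q)).sum)
}
```

Because no averaging occurs, a single extreme outlier cannot drag the center off the data manifold the way it drags a mean.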
**RobustKMeans**

Best for: Datasets with noise, outliers, or contamination.

```scala
val robust = new RobustKMeans()
  .setK(5)
  .setRobustMode("trim") // or "noise_cluster", "m_estimator"
  .setTrimFraction(0.1) // Ignore worst 10%
  .setOutlierScoreCol("outlier") // Output outlier scores
```
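To make the trim idea concrete, here is a minimal sketch (plain Scala, hypothetical names, not the library's code) of keeping only the closest fraction of points before a center update:

```scala
object TrimSketch {
  // Return the indices of the (1 - trimFraction) points closest to a center;
  // the rest are ignored when the center is recomputed.
  def trim(distances: Array[Double], trimFraction: Double): Array[Int] = {
    val keep = math.max(1, math.round(distances.length * (1 - trimFraction)).toInt)
    distances.zipWithIndex.sortBy(_._1).take(keep).map(_._2)
  }
}
```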
Modes:
- `trim` — Ignore points far from centers
- `noise_cluster` — Assign outliers to a special cluster
- `m_estimator` — Down-weight outliers during updates

**TimeSeriesKMeans**

Best for: Sequences that may be misaligned in time.
```scala
val ts = new TimeSeriesKMeans()
  .setK(3)
  .setDistanceType("dtw") // or "softdtw", "gak", "derivative"
  .setBandWidth(0.1) // Sakoe-Chiba band
```
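What the dtw option computes can be shown with a minimal, self-contained DTW distance (classic dynamic programming, no banding; independent of the library's implementation):

```scala
object DtwSketch {
  // O(n·m) DTW with |x - y| local cost: each point of one series may align
  // to several points of the other, absorbing time shifts and stretches.
  def dtw(a: Array[Double], b: Array[Double]): Double = {
    val n = a.length; val m = b.length
    val cost = Array.fill(n + 1, m + 1)(Double.PositiveInfinity)
    cost(0)(0) = 0.0
    for (i <- 1 to n; j <- 1 to m) {
      val d = math.abs(a(i - 1) - b(j - 1))
      cost(i)(j) = d + math.min(cost(i - 1)(j), math.min(cost(i)(j - 1), cost(i - 1)(j - 1)))
    }
    cost(n)(m)
  }
}
```

A repeated final value costs nothing under DTW (the extra point aligns to the same position), whereas Euclidean distance would require equal lengths and penalize the shift.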
Distance types:
- `dtw` — Dynamic Time Warping
- `softdtw` — Differentiable DTW
- `gak` — Global Alignment Kernel
- `derivative` — Shape-based (offset/amplitude invariant)

**SpectralClustering**

Best for: Non-convex cluster shapes, graph/network data.
```scala
val spectral = new SpectralClustering()
  .setK(3)
  .setAffinityType("rbf") // or "knn", "epsilon"
  .setLaplacianType("normalized") // or "unnormalized", "randomWalk"
  .setSigma(1.0) // RBF kernel width
```

When to use: clusters that are non-spherical or connected by paths.
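The rbf affinity that spectral clustering builds its graph from is the standard Gaussian kernel, sketched here in plain Scala (illustrative names, not the library's code):

```scala
object RbfSketch {
  // A(i, j) = exp(-||xi - xj||² / (2·sigma²)): 1.0 for identical points,
  // decaying toward 0 as points move apart; sigma sets the decay scale.
  def affinity(xi: Array[Double], xj: Array[Double], sigma: Double): Double = {
    val sq = xi.zip(xj).map { case (x, y) => (x - y) * (x - y) }.sum
    math.exp(-sq / (2 * sigma * sigma))
  }
}
```

Clustering then happens on the eigenvectors of the graph Laplacian built from these affinities, which is what lets connected but non-spherical shapes stay together.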
| Algorithm | Complexity | Memory | Best Scale |
|---|---|---|---|
| GeneralizedKMeans | O(n·k·d·iter) | O(k·d) | Billions |
| XMeans | O(n·k·d·iter·splits) | O(k·d) | Millions |
| SoftKMeans | O(n·k·d·iter) | O(n·k) | Millions |
| KMedoids | O(n²·k·iter) | O(n²) | Thousands |
| StreamingKMeans | O(batch·k·d) | O(k·d) | Unlimited |
| SpectralClustering | O(n²) or O(n·m²) | O(n²) | Thousands |
```scala
// Use cosine similarity for TF-IDF vectors
new GeneralizedKMeans().setDivergence("cosine")

// KL divergence for word distributions
new GeneralizedKMeans().setDivergence("kl")

// Itakura-Saito for power spectra
new GeneralizedKMeans().setDivergence("itakuraSaito")

// Balance clusters for equal marketing spend
new BalancedKMeans().setBalanceMode("hard")

// Robust clustering with outlier scores
new RobustKMeans()
  .setRobustMode("noise_cluster")
  .setOutlierScoreCol("anomaly_score")
```