Scalable clustering with Bregman divergences on Apache Spark
Making k-means faster without sacrificing quality.
Standard Lloyd’s algorithm costs O(n × k × d × iterations). The techniques below cut this cost substantially: Elkan skips redundant distance computations, while mini-batch and coresets reduce the number of points processed per iteration.
| Technique | Speedup | Applicable To |
|---|---|---|
| Elkan | 2-10x | Squared Euclidean |
| Mini-batch | 5-50x | Any divergence |
| Coresets | 10-100x | Any divergence |
Elkan’s algorithm uses the triangle inequality to skip distance computations.

Suppose we know d(x, c₁) = 5 and d(c₁, c₂) = 8. Then by the triangle inequality (in its reverse form):

d(x, c₂) ≥ |d(x, c₁) - d(c₁, c₂)| = |5 - 8| = 3

The lower bound (3) is less than d(x, c₁) = 5, so c₂ might still be closer and d(x, c₂) must be computed. But if d(c₁, c₂) = 12:

d(x, c₂) ≥ |5 - 12| = 7 > 5 = d(x, c₁)

so c₂ cannot be the nearest center and d(x, c₂) is never computed.
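As a minimal sketch of this pruning test (the helper name `canSkip` and its arguments are illustrative, not the library’s internals):

```scala
// Reverse triangle inequality: d(x, c2) >= |d(x, c1) - d(c1, c2)|.
// If that lower bound already exceeds the known distance d(x, c1),
// center c2 cannot be closer, so d(x, c2) need not be computed.
def canSkip(distToAssigned: Double, centerToCenter: Double): Boolean =
  math.abs(distToAssigned - centerToCenter) > distToAssigned

canSkip(5.0, 8.0)  // false: lower bound is 3 < 5, must compute d(x, c2)
canSkip(5.0, 12.0) // true:  lower bound is 7 > 5, skip c2 entirely
```

When d(c₁, c₂) > d(x, c₁), this test is equivalent to Elkan’s classic condition d(c₁, c₂) > 2 × d(x, c₁).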
```scala
// Enabled automatically for squared Euclidean with k >= 5
new GeneralizedKMeans()
  .setDivergence("squaredEuclidean")
  .setK(20)
// Elkan is used automatically
```
Mini-batch k-means updates centers using small random samples instead of the full dataset on every iteration.
```
1. Initialize centers
2. For each iteration:
   a. Sample a mini-batch of b points
   b. Assign batch points to nearest centers
   c. Update centers using batch points (with momentum)
3. Return centers
```
```
center[j] = (1 - η) * center[j] + η * batch_mean[j]
```
Where η decreases over time (learning rate schedule).
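Here is a minimal sketch of this update for a single center (the `Center` class and names are illustrative, not this library’s internals; the schedule η = batch count / total count is one common choice that decays over time):

```scala
// One mini-batch update for a single center. The per-center learning
// rate eta decays as the center absorbs more batches, so the center
// converges to a running weighted mean of the batch means.
final case class Center(coords: Array[Double], var count: Long)

def updateCenter(center: Center, batchMean: Array[Double], batchCount: Long): Unit = {
  center.count += batchCount
  val eta = batchCount.toDouble / center.count // decaying learning rate
  var j = 0
  while (j < center.coords.length) {
    // center[j] = (1 - eta) * center[j] + eta * batch_mean[j]
    center.coords(j) = (1 - eta) * center.coords(j) + eta * batchMean(j)
    j += 1
  }
}
```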
| Aspect | Full K-Means | Mini-Batch |
|---|---|---|
| Per-iteration cost | O(n × k × d) | O(b × k × d) |
| Iterations needed | 10-50 | 100-500 |
| Total cost | ~30 × n × k × d | ~300 × b × k × d |
| Quality | Baseline (local optimum) | Near-baseline |

With b = n/100, total cost is ~300 × (n/100) × k × d = 3 × n × k × d, versus ~30 × n × k × d for full k-means: roughly a 10x speedup, typically with under 1% quality loss.
```scala
new MiniBatchKMeans()
  .setK(100)
  .setBatchSize(10000) // Points per iteration
  .setMaxIter(200)
```
Coresets compress the data to a small weighted sample that preserves clustering structure.
Instead of clustering all n points, cluster a coreset of m ≪ n weighted points that approximates the original data’s clustering cost.
For an ε-coreset S of the full data P:

```
(1 - ε) × cost(P, C) ≤ cost(S, C) ≤ (1 + ε) × cost(P, C)
```

That is, for any candidate center set C, the weighted cost on the coreset S approximates the true cost on P to within a factor of 1 ± ε.
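To make the guarantee concrete, here is a small verification sketch for squared Euclidean cost (all names are illustrative, not part of the library’s API):

```scala
// cost(P, C): sum over points of the squared distance to the nearest center.
// coresetCost(S, C): same, but each coreset point contributes weight * distance.
def sqDist(a: Array[Double], b: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < a.length) { val d = a(i) - b(i); s += d * d; i += 1 }
  s
}

def nearestCost(x: Array[Double], centers: Seq[Array[Double]]): Double =
  centers.map(sqDist(x, _)).min

def cost(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
  points.map(nearestCost(_, centers)).sum

def coresetCost(coreset: Seq[(Array[Double], Double)],
                centers: Seq[Array[Double]]): Double =
  coreset.map { case (x, w) => w * nearestCost(x, centers) }.sum

// An epsilon-coreset guarantees, for EVERY center set C:
//   (1 - eps) * cost(P, C) <= coresetCost(S, C) <= (1 + eps) * cost(P, C)
```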
```scala
new CoresetKMeans()
  .setK(20)
  .setCoresetSize(10000) // Compress to 10K points
  .setEpsilon(0.1)       // 10% approximation
```
For very large datasets, combine multiple techniques:
```scala
// 1B points → 10K coreset → mini-batch clustering
val coreset = new CoresetKMeans()
  .setCoresetSize(10000)
  .fit(massiveData)

// Or use streaming for continuous updates
val streaming = new StreamingKMeans()
  .setK(100)
  .setDecayFactor(0.9)
```
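The mini-batch stage of the comment above is not shown; as a hypothetical sketch (assuming the fitted coreset exposes its weighted points via an accessor such as `coreset.points`, which is not a confirmed API of this library), the chain could look like:

```scala
// Hypothetical: cluster the 10K weighted coreset points with mini-batch
// k-means. `coreset.points` is an assumed accessor, not confirmed API.
val model = new MiniBatchKMeans()
  .setK(100)
  .setBatchSize(1000)
  .fit(coreset.points)
```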
| Data Size | Recommendation |
|---|---|
| < 100K | Standard GeneralizedKMeans |
| 100K - 1M | Elkan (automatic for squared Euclidean) |
| 1M - 100M | Mini-batch |
| > 100M | Coreset + Mini-batch |
On 10M points × 100 dimensions, k=100:
| Method | Time | Quality (relative to standard) |
|---|---|---|
| Standard | 15 min | 100% |
| Elkan | 3 min | 100% |
| Mini-batch (b=10K) | 2 min | 99.5% |
| Coreset (m=100K) | 30 sec | 98% |