Scalable clustering with Bregman divergences on Apache Spark
View the project on GitHub: derrickburns/generalized-kmeans-clustering
The #1 question: Which divergence should I use for my data?
This guide helps you pick the right distance measure in under 2 minutes.
START HERE: What kind of data do you have?
├─ General numeric data (measurements, sensor readings, coordinates)
│ └─ Use: squaredEuclidean (default)
│
├─ Probability distributions (histograms, topic mixtures, normalized frequencies)
│ └─ Use: kl
│
├─ Power spectra, audio features, variance estimates
│ └─ Use: itakuraSaito
│
├─ Text vectors (TF-IDF, embeddings where direction matters)
│ └─ Use: cosine / spherical
│
├─ Count data (word counts, event frequencies - NOT normalized)
│ └─ Use: generalizedI
│
├─ Binary probabilities (click rates, conversion rates in [0,1])
│ └─ Use: logistic
│
└─ Data with outliers, need robustness
└─ Use: l1
In short:
→ squaredEuclidean — Standard numeric data
→ kl — These are probability distributions
→ cosine — You care about direction, not magnitude
→ itakuraSaito — Designed for spectral data, scale-invariant
→ generalizedI — Handles unnormalized count data
→ logistic — Perfect for probabilities in open interval (0,1)
→ l1 — More robust to outliers than squared Euclidean
| Divergence | Domain | Symmetric? | Outlier Robust? | Best For |
|---|---|---|---|---|
| squaredEuclidean | Any real | Yes | No | General purpose |
| kl | Positive | No | No | Distributions |
| itakuraSaito | Positive | No | No | Spectra, scale-invariant |
| cosine | Non-zero | Yes | Somewhat | Text, embeddings |
| l1 | Any real | Yes | Yes | Robust clustering |
| generalizedI | Non-negative | No | No | Count data |
| logistic | (0, 1) | No | No | Probabilities |
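If you're unsure whether your data actually satisfy a divergence's domain, a quick check before fitting can save a failed run. A minimal sketch using Spark SQL higher-order functions, assuming a DataFrame named data with an array-typed features column (both assumptions, matching the examples below):

import org.apache.spark.sql.functions._

// Count rows whose feature array contains a non-positive entry; these would
// violate the positive-domain requirement of kl and itakuraSaito.
val badRows = data
  .where(exists(col("features"), x => x <= 0))
  .count()
println(s"$badRows rows contain non-positive feature values")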
// Default — good for most numeric data
val kmeans = new GeneralizedKMeans()
.setK(5)
.setDivergence("squaredEuclidean") // This is the default
val model = kmeans.fit(data)
// For probability distributions (must sum to 1, all positive)
val kmeans = new GeneralizedKMeans()
.setK(10)
.setDivergence("kl")
.setSmoothing(1e-10) // Avoid division by zero
val model = kmeans.fit(topicDistributions)
// For TF-IDF or embedding vectors
val kmeans = new GeneralizedKMeans()
.setK(20)
.setDivergence("cosine")
val model = kmeans.fit(tfidfVectors)
// For power spectra, audio features
val kmeans = new GeneralizedKMeans()
.setK(8)
.setDivergence("itakuraSaito")
.setSmoothing(1e-10)
val model = kmeans.fit(spectralFeatures)
// When you have outliers — L1 is more robust
val kmeans = new GeneralizedKMeans()
.setK(5)
.setDivergence("l1")
val model = kmeans.fit(noisyData)
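The remaining two divergences from the table follow the same pattern. These are sketches only: the k values and the countVectors / clickRates dataset names are placeholders, not part of the library.

// For unnormalized count data (word counts, event frequencies)
val countKMeans = new GeneralizedKMeans()
  .setK(12)
  .setDivergence("generalizedI")
val countModel = countKMeans.fit(countVectors)

// For probabilities in the open interval (0, 1), e.g. click or conversion rates
val rateKMeans = new GeneralizedKMeans()
  .setK(6)
  .setDivergence("logistic")
val rateModel = rateKMeans.fit(clickRates)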
Don't use squaredEuclidean when:
- your data are probability distributions → use kl
- direction matters more than magnitude → use cosine
- you're clustering power spectra or variance-like features → use itakuraSaito
- your data contain heavy outliers → use l1

Don't use kl when:
- values can be zero or negative → use squaredEuclidean
- counts are not normalized into distributions → use generalizedI
- your data aren't distributions at all → use squaredEuclidean or cosine
// Ensure positive values with smoothing (needed for kl and itakuraSaito)
// Assumes "features" is an array column; see the UDF variant below for ML Vector columns
import org.apache.spark.sql.functions._

val smoothed = data.withColumn("features",
  transform(col("features"), x => x + 1e-10))
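If the features column holds Spark ML Vectors rather than plain arrays (the usual case after a feature pipeline), the higher-order transform function above won't apply. A minimal UDF-based sketch that applies the same smoothing and produces dense vectors:

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{col, udf}

// Add a small epsilon to every component of an ML Vector column
val smoothVector = udf((v: Vector) => Vectors.dense(v.toArray.map(_ + 1e-10)))
val smoothedVectors = data.withColumn("features", smoothVector(col("features")))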
// Normalize vectors before cosine / spherical clustering (optional, but common)
import org.apache.spark.ml.feature.Normalizer
val normalizer = new Normalizer()
.setInputCol("features")
.setOutputCol("normalized")
val normalized = normalizer.transform(data)
// Ensure values in (0, 1) for the logistic divergence - clip extremes
// (again assumes an array-typed features column)
import org.apache.spark.sql.functions._

val clipped = data.withColumn("features",
  transform(col("features"), x => greatest(lit(0.001), least(lit(0.999), x))))
Try multiple and compare:
val divergences = Seq("squaredEuclidean", "kl", "cosine", "l1")
val results = divergences.map { div =>
val model = new GeneralizedKMeans()
.setK(5)
.setDivergence(div)
.fit(data)
(div, model.summary.wcss, model.summary.silhouette)
}
results.foreach { case (div, wcss, sil) =>
println(f"$div%-20s WCSS: $wcss%.2f Silhouette: $sil%.3f")
}
WCSS values computed under different divergences measure different costs and aren't directly comparable, so choose the divergence with the best silhouette score for your use case.
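To make that choice programmatically, the tuples collected above can be reduced by their silhouette component, for example:

// Pick the divergence with the highest silhouette from the comparison above
val (bestDivergence, _, bestSilhouette) = results.maxBy { case (_, _, sil) => sil }
println(f"Best divergence: $bestDivergence (silhouette: $bestSilhouette%.3f)")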