Scalable clustering with Bregman divergences on Apache Spark
How to evaluate clustering quality.
Unlike supervised learning, there’s no “ground truth” to compare against. We need intrinsic measures of cluster quality.
Within-Cluster Sum of Squares (WCSS). What it measures: Compactness — how tight are the clusters?
WCSS = Σ_j Σ_{x ∈ cluster j} ||x - c_j||²
Interpretation: lower is better (tighter clusters), but WCSS always decreases as k grows, so it cannot choose k on its own; see the elbow method below.
val kmeans = new GeneralizedKMeans().setK(5)
val model = kmeans.fit(data)
println(s"WCSS: ${model.summary.trainingCost}")
Silhouette Score. What it measures: Separation vs cohesion — are clusters well-separated?
For each point:
a(x) = average distance to points in same cluster
b(x) = average distance to points in nearest other cluster
s(x) = (b(x) - a(x)) / max(a(x), b(x))
Interpretation: values near +1 mean a point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.
val silhouette = model.summary.silhouette
println(s"Silhouette: $silhouette")
Calinski-Harabasz Index. What it measures: Ratio of between-cluster to within-cluster variance.
CH = [BCSS / (k-1)] / [WCSS / (n-k)]
Where BCSS = between-cluster sum of squares.
Interpretation: higher is better; compact, well-separated clusters score high.
val ch = model.summary.calinskiHarabasz
println(s"Calinski-Harabasz: $ch")
Davies-Bouldin Index. What it measures: Average similarity between each cluster and its most similar other cluster.
DB = (1/k) Σ_i max_{j≠i} (σ_i + σ_j) / d(c_i, c_j)
Where σ_i = average distance of points in cluster i to center.
Interpretation: lower is better; 0 would mean perfectly separated clusters.
val db = model.summary.daviesBouldin
println(s"Davies-Bouldin: $db")
Elbow method: plot WCSS vs k and look for the “elbow”:
val metrics = (2 to 15).map { k =>
val model = new GeneralizedKMeans().setK(k).fit(data)
(k, model.summary.trainingCost)
}
// Plot and find elbow
metrics.foreach { case (k, wcss) =>
println(f"k=$k%2d WCSS=$wcss%.2f ${"█" * (wcss/1000).toInt}")
}
Output:
k= 2 WCSS=15234.00 ███████████████
k= 3 WCSS=8456.00 ████████
k= 4 WCSS=5123.00 █████ ← Elbow here
k= 5 WCSS=4567.00 ████
k= 6 WCSS=4234.00 ████
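To automate the eyeballing, one simple heuristic (an assumption of this guide, not a library feature) is to take the first k after which the relative WCSS improvement falls below a threshold:

// Hypothetical elbow heuristic: first k after which adding a cluster
// improves WCSS by less than 15% (the threshold is arbitrary).
val elbowK = metrics.sliding(2).collectFirst {
  case Seq((k1, w1), (_, w2)) if (w1 - w2) / w1 < 0.15 => k1
}.getOrElse(metrics.last._1)
println(s"Elbow heuristic suggests k = $elbowK")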
Silhouette method: pick the k with the highest silhouette:
val metrics = (2 to 10).map { k =>
val model = new GeneralizedKMeans().setK(k).fit(data)
(k, model.summary.silhouette)
}
val bestK = metrics.maxBy(_._2)._1
println(s"Best k by silhouette: $bestK")
X-Means: let the algorithm decide using BIC/AIC:
val xmeans = new XMeans()
.setMinK(2)
.setMaxK(20)
.setCriterion("bic")
val model = xmeans.fit(data)
println(s"Optimal k: ${model.k}")
| Metric | Range | Best Value | Pros | Cons |
|---|---|---|---|---|
| WCSS | [0, ∞) | Low | Simple, fast | Always decreases as k grows |
| Silhouette | [-1, 1] | High (+1) | Interpretable | O(n²) to compute naively |
| Calinski-Harabasz | [0, ∞) | High | Fast | Favors convex clusters |
| Davies-Bouldin | [0, ∞) | Low (0) | Intuitive | Sensitive to outliers |
import com.massivedatascience.clusterer.ml.{GeneralizedKMeans, ClusteringMetrics}
// Train model
val model = new GeneralizedKMeans()
.setK(5)
.fit(data)
// Get all metrics
val summary = model.summary
println(s"""
|Clustering Evaluation:
| WCSS: ${summary.trainingCost}
| Silhouette: ${summary.silhouette}
| Calinski-Harabasz: ${summary.calinskiHarabasz}
| Davies-Bouldin: ${summary.daviesBouldin}
| Cluster sizes: ${summary.clusterSizes.mkString(", ")}
""".stripMargin)
If you have ground truth labels:
// Adjusted Rand Index
val ari = ClusteringMetrics.adjustedRandIndex(predictions, labels)
// Normalized Mutual Information
val nmi = ClusteringMetrics.normalizedMutualInfo(predictions, labels)
| Metric | Range | Perfect |
|---|---|---|
| Adjusted Rand Index | [-1, 1] | 1 |
| Normalized Mutual Information | [0, 1] | 1 |
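For intuition about what these report, here is a tiny adjusted Rand index computed in plain Scala (illustrative only; on Spark data you would use the `ClusteringMetrics` call above):

// Toy adjusted Rand index from the contingency-table definition.
def comb2(x: Long): Double = x * (x - 1) / 2.0

def ari(pred: Seq[Int], truth: Seq[Int]): Double = {
  val n = pred.size.toLong
  val nij = pred.zip(truth).groupBy(identity).values.map(g => comb2(g.size))
  val ai  = pred.groupBy(identity).values.map(g => comb2(g.size))
  val bj  = truth.groupBy(identity).values.map(g => comb2(g.size))
  val index    = nij.sum
  val expected = ai.sum * bj.sum / comb2(n)
  val maxIndex = (ai.sum + bj.sum) / 2.0
  (index - expected) / (maxIndex - expected)
}

println(ari(Seq(0, 0, 1, 1), Seq(1, 1, 0, 0))) // 1.0: same partition, relabeled
println(ari(Seq(0, 1, 0, 1), Seq(0, 0, 1, 1))) // -0.5: structure disagrees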