Scalable clustering with Bregman divergences on Apache Spark
Project on GitHub: derrickburns/generalized-kmeans-clustering
When points can belong to multiple clusters.
Hard clustering assigns each point to exactly one cluster:
Point → Cluster 2
Output: Single integer prediction per point.
val model = new GeneralizedKMeans().setK(3).fit(data)
model.transform(data).select("prediction").show()
// +----------+
// |prediction|
// +----------+
// | 0|
// | 2|
// | 1|
// +----------+
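Under the hood, the hard prediction is just the index of the nearest center. A minimal plain-Scala sketch of that rule (with squared Euclidean distance; the library itself supports general Bregman divergences, and `assign` here is a hypothetical helper, not library API):

```scala
// Hard assignment: return the index of the nearest center.
// Sketch only; uses squared Euclidean distance for simplicity.
def assign(x: Seq[Double], centers: Seq[Seq[Double]]): Int = {
  def sqDist(a: Seq[Double], b: Seq[Double]): Double =
    a.zip(b).map { case (ai, bi) => (ai - bi) * (ai - bi) }.sum
  centers.indices.minBy(j => sqDist(x, centers(j)))
}

val centers = Seq(Seq(0.0, 0.0), Seq(5.0, 5.0), Seq(0.0, 5.0))
assign(Seq(4.0, 4.5), centers) // → 1 (closest to (5, 5))
```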
Soft clustering gives each point a probability of belonging to each cluster:
Point → [0.7, 0.2, 0.1] (70% cluster 0, 20% cluster 1, 10% cluster 2)
Output: Probability vector per point.
val model = new SoftKMeans().setK(3).setBeta(2.0).fit(data)
model.transform(data).select("prediction", "probabilities").show()
// +----------+--------------------+
// |prediction| probabilities|
// +----------+--------------------+
// | 0|[0.85, 0.10, 0.05]|
// | 2|[0.05, 0.15, 0.80]|
// | 1|[0.10, 0.75, 0.15]|
// +----------+--------------------+
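In the output above, each `prediction` matches the position of the largest entry in its probability vector. A one-line sketch of that relationship (plain Scala; `hardLabel` is a hypothetical helper, not library API):

```scala
// The hard label implied by a soft assignment is the argmax of its probabilities.
def hardLabel(probabilities: Seq[Double]): Int =
  probabilities.indices.maxBy(probabilities(_))

hardLabel(Seq(0.05, 0.15, 0.80)) // → 2
```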
Soft clustering is the right tool when:
- Customer segments aren’t mutually exclusive
- You want a confidence for each cluster assignment
- You want to feed membership probabilities into downstream models:
import org.apache.spark.ml.classification.LogisticRegression

// Use soft membership probabilities as features for a downstream model
// (sketch: assumes the probabilities column is usable as a feature column)
val withProbs = softModel.transform(data)
val downstream = new LogisticRegression()
  .setFeaturesCol("probabilities")
  .fit(withProbs)
It also fits when cluster boundaries are fuzzy rather than sharp.
The beta parameter controls the “softness” of assignments:
| Beta | Behavior |
|---|---|
| 1.0 | Very soft (nearly uniform) |
| 2.0 | Default (balanced) |
| 5.0 | Harder (more peaked) |
| ∞ | Hard clustering (one-hot) |
// Softer assignments
new SoftKMeans().setBeta(1.5)
// Harder assignments
new SoftKMeans().setBeta(5.0)
Worked example: a point at distance 1 from center A and distance 2 from center B:
| Beta | P(A) | P(B) |
|---|---|---|
| 1.5 | 0.59 | 0.41 |
| 2.0 | 0.67 | 0.33 |
| 3.0 | 0.80 | 0.20 |
| 5.0 | 0.94 | 0.06 |
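The numbers in this table are consistent with memberships proportional to d^(−(β−1)). A minimal plain-Scala sketch under that assumption (`memberships` is a hypothetical helper, not the library's internal code) reproduces them:

```scala
// Membership of each center, assuming P(j) ∝ d_j^-(beta - 1),
// which matches the worked example above. Sketch only, not library code.
def memberships(distances: Seq[Double], beta: Double): Seq[Double] = {
  val weights = distances.map(d => math.pow(d, -(beta - 1)))
  val total = weights.sum
  weights.map(_ / total)
}

// Point at distance 1 from A, distance 2 from B:
memberships(Seq(1.0, 2.0), beta = 2.0) // → ≈ (0.67, 0.33)
memberships(Seq(1.0, 2.0), beta = 5.0) // → ≈ (0.94, 0.06)
```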
Hard assignment picks the nearest center:

assign(x) = argmin_j d(x, c_j)

Soft assignment turns distances into membership probabilities:

P(cluster j | x) = d(x, c_j)^(-(β-1)) / Σ_k d(x, c_k)^(-(β-1))

This is a softmax over negative log-distances with inverse temperature (β − 1): at β = 1 memberships are uniform, and as β → ∞ they collapse to a one-hot (hard) assignment.
The hard centroid update averages the points assigned to each cluster:

c_j = (1/n_j) Σ_{x ∈ cluster j} x

The soft centroid update weights every point by its membership:

c_j = Σ_i P(j|x_i)^m × x_i / Σ_i P(j|x_i)^m

where m is the fuzziness exponent (typically 2).
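The soft update can be sketched in a few lines of plain Scala for 1-D points (`softCentroid` is a hypothetical helper under the m = 2 weighting above, not the library's implementation):

```scala
// One soft centroid update: every point contributes to the centroid,
// weighted by its membership raised to the fuzziness exponent m.
def softCentroid(points: Seq[Double],
                 memberships: Seq[Double],
                 m: Double = 2.0): Double = {
  val w = memberships.map(math.pow(_, m))
  w.zip(points).map { case (wi, xi) => wi * xi }.sum / w.sum
}

// Two points with memberships 0.9 and 0.1 in cluster j:
softCentroid(Seq(0.0, 10.0), Seq(0.9, 0.1))
// ≈ (0.81 × 0 + 0.01 × 10) / 0.82 ≈ 0.12
```

Note how the centroid stays close to the high-membership point: squaring the memberships (m = 2) further discounts points that only weakly belong to the cluster.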
import com.massivedatascience.clusterer.ml.SoftKMeans
import org.apache.spark.sql.functions.{array_max, col}
// Train soft clustering model
val soft = new SoftKMeans()
.setK(5)
.setBeta(2.0)
.setMaxIter(50)
val model = soft.fit(data)
// Get predictions with probabilities
val results = model.transform(data)
// Find points with high uncertainty (no dominant cluster)
val uncertain = results.filter(
array_max(col("probabilities")) < 0.5
)
// Find points strongly in one cluster
val confident = results.filter(
array_max(col("probabilities")) > 0.9
)
| Scenario | Hard or Soft? |
|---|---|
| Customer segmentation for targeted marketing | Hard |
| Topic modeling with mixed documents | Soft |
| Image segmentation with clear boundaries | Hard |
| Anomaly detection (low max probability) | Soft |
| Recommendation systems | Soft |
| Data compression | Hard |