Generalized K-Means Clustering

Scalable clustering with Bregman divergences on Apache Spark

View the Project on GitHub derrickburns/generalized-kmeans-clustering

Soft vs Hard Clustering

When points can belong to multiple clusters.


Hard Clustering (Standard)

Each point belongs to exactly one cluster:

Point → Cluster 2

Output: Single integer prediction per point.

val model = new GeneralizedKMeans().setK(3).fit(data)
model.transform(data).select("prediction").show()
// +----------+
// |prediction|
// +----------+
// |         0|
// |         2|
// |         1|
// +----------+

Soft Clustering (Fuzzy)

Each point has a probability of belonging to each cluster:

Point → [0.7, 0.2, 0.1]  (70% cluster 0, 20% cluster 1, 10% cluster 2)

Output: Probability vector per point.

val model = new SoftKMeans().setK(3).setBeta(2.0).fit(data)
model.transform(data).select("prediction", "probabilities").show()
// +----------+--------------------+
// |prediction|       probabilities|
// +----------+--------------------+
// |         0|[0.85, 0.10, 0.05]|
// |         2|[0.05, 0.15, 0.80]|
// |         1|[0.10, 0.75, 0.15]|
// +----------+--------------------+

When to Use Soft Clustering

1. Overlapping Groups

Customer segments that aren’t mutually exclusive:

2. Uncertainty Quantification

Know confidence in cluster assignment:

3. Downstream Probabilistic Models

Feed membership probabilities into other models:

// Use soft memberships as features
val withProbs = softModel.transform(data)
val downstream = logisticRegression.fit(withProbs)

4. Smooth Boundaries

When cluster boundaries are fuzzy, not sharp.


The Beta Parameter

Controls “softness” of assignments:

Beta Behavior
1.0 Very soft (nearly uniform)
2.0 Default (balanced)
5.0 Harder (more peaked)
Hard clustering (one-hot)
// Softer assignments
new SoftKMeans().setBeta(1.5)

// Harder assignments
new SoftKMeans().setBeta(5.0)

Visual Example

Point at distance 1 from center A, distance 2 from center B:

Beta P(A) P(B)
1.5 0.59 0.41
2.0 0.67 0.33
3.0 0.80 0.20
5.0 0.94 0.06

Mathematical Formulation

Hard Assignment

assign(x) = argmin_j d(x, c_j)

Soft Assignment

P(cluster j | x) = exp(-d(x, c_j) / β) / Σ_k exp(-d(x, c_k) / β)

This is a softmax over negative distances, with temperature β.


Center Updates

Hard K-Means

c_j = (1/n_j) Σ_{x ∈ cluster j} x

Soft K-Means

c_j = Σ_i P(j|x_i)^m × x_i / Σ_i P(j|x_i)^m

Where m is the fuzziness exponent (typically 2).


Code Example

import com.massivedatascience.clusterer.ml.SoftKMeans

// Train soft clustering model
val soft = new SoftKMeans()
  .setK(5)
  .setBeta(2.0)
  .setMaxIter(50)

val model = soft.fit(data)

// Get predictions with probabilities
val results = model.transform(data)

// Find points with high uncertainty (no dominant cluster)
val uncertain = results.filter(
  array_max(col("probabilities")) < 0.5
)

// Find points strongly in one cluster
val confident = results.filter(
  array_max(col("probabilities")) > 0.9
)

Use Cases

Scenario Hard or Soft?
Customer segmentation for targeted marketing Hard
Topic modeling with mixed documents Soft
Image segmentation with clear boundaries Hard
Anomaly detection (low max probability) Soft
Recommendation systems Soft
Data compression Hard

Back to Explanation Home