Scalable clustering with Bregman divergences on Apache Spark
Mathematical details of all supported Bregman divergences.
A Bregman divergence D_φ(x, y) is generated by a strictly convex, differentiable function φ:
D_φ(x, y) = φ(x) - φ(y) - ∇φ(y)·(x - y)
The library implements seven divergences (spherical and cosine are aliases for the same measure), each suited to a different kind of data.
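As a concrete illustration of the definition, here is a minimal Scala sketch (not the library's internals) that builds a Bregman divergence directly from a generator φ and its gradient, instantiated with the squared Euclidean generator from the first entry below:

```scala
// A sketch of the defining formula:
// D_phi(x, y) = phi(x) - phi(y) - gradPhi(y) . (x - y)
def bregman(phi: Array[Double] => Double,
            gradPhi: Array[Double] => Array[Double])
           (x: Array[Double], y: Array[Double]): Double = {
  val g = gradPhi(y)
  phi(x) - phi(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
}

// Squared Euclidean arises from phi(x) = 1/2 ||x||^2, whose gradient is x.
val squaredEuclidean = bregman(
  x => 0.5 * x.map(v => v * v).sum,
  y => y) _

val x = Array(1.0, 2.0, 3.0)
val y = Array(2.0, 0.0, 3.0)
println(squaredEuclidean(x, y)) // 2.5 = 1/2 * ((1-2)^2 + (2-0)^2 + (3-3)^2)
```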
Name: squaredEuclidean
Formula:
D(x, y) = ½ Σᵢ (xᵢ - yᵢ)²
Generator: φ(x) = ½||x||²
Properties:
Best for: General-purpose clustering, continuous features
Domain: All real numbers
```scala
new GeneralizedKMeans().setDivergence("squaredEuclidean")
```
Name: kl
Formula:
D(x, y) = Σᵢ xᵢ log(xᵢ/yᵢ) - Σᵢ xᵢ + Σᵢ yᵢ
Generator: φ(x) = Σᵢ xᵢ log(xᵢ)
Properties:
Best for: Probability distributions, topic models, document clustering
Domain: Strictly positive (use smoothing parameter)
```scala
new GeneralizedKMeans()
  .setDivergence("kl")
  .setSmoothing(1e-10)
```
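The smoothing parameter matters because log(xᵢ/yᵢ) is undefined when a component is zero. A minimal sketch of the idea (the library's exact smoothing semantics are an assumption here):

```scala
// Add a small epsilon to every component so log(x/y) stays finite
// even when a coordinate of x or y is exactly zero.
def klDivergence(x: Array[Double], y: Array[Double], eps: Double = 1e-10): Double =
  x.indices.map { i =>
    val xi = x(i) + eps
    val yi = y(i) + eps
    xi * math.log(xi / yi) - xi + yi
  }.sum

// Without smoothing, the zero in `p` would produce 0 * log(0) = NaN.
val p = Array(0.7, 0.3, 0.0)
val q = Array(0.5, 0.4, 0.1)
println(klDivergence(p, q))
```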
Name: itakuraSaito
Formula:
D(x, y) = Σᵢ (xᵢ/yᵢ - log(xᵢ/yᵢ) - 1)
Generator: φ(x) = -Σᵢ log(xᵢ)
Properties:
Best for: Power spectra, audio signals, spectral analysis
Domain: Strictly positive
```scala
new GeneralizedKMeans()
  .setDivergence("itakuraSaito")
  .setSmoothing(1e-10)
```
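A property worth seeing concretely: Itakura-Saito is scale-invariant, D(cx, cy) = D(x, y) for any c > 0, which is why it suits power spectra where absolute signal level should not matter. A standalone sketch:

```scala
// Itakura-Saito depends only on the component-wise ratios x_i / y_i,
// so a common scale factor cancels out entirely.
def itakuraSaito(x: Array[Double], y: Array[Double]): Double =
  x.indices.map { i =>
    val r = x(i) / y(i)
    r - math.log(r) - 1.0
  }.sum

val spectrumA = Array(1.0, 4.0, 2.0)
val spectrumB = Array(2.0, 3.0, 1.0)
val c = 100.0
println(itakuraSaito(spectrumA, spectrumB))
println(itakuraSaito(spectrumA.map(_ * c), spectrumB.map(_ * c))) // identical: c cancels
```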
Name: l1
Formula:
D(x, y) = Σᵢ |xᵢ - yᵢ|
Note: Not a true Bregman divergence, but supported for practicality.
Properties:
Best for: Robust clustering, sparse features, when outliers present
Domain: All real numbers
```scala
new GeneralizedKMeans().setDivergence("l1")
```
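The robustness comes from a k-medians-style fact: the scalar minimizing Σᵢ |xᵢ − c| is the median, which a single extreme outlier barely moves, while the mean is dragged far off. A standalone illustration (not necessarily the library's exact center-update rule):

```scala
// The L2-optimal center is the mean; the L1-optimal center is the median.
def mean(xs: Seq[Double]): Double = xs.sum / xs.size
def median(xs: Seq[Double]): Double = xs.sorted.apply(xs.size / 2)

val cluster = Seq(1.0, 2.0, 3.0, 4.0, 1000.0) // one extreme outlier
println(mean(cluster))   // 202.0 -- dragged toward the outlier
println(median(cluster)) // 3.0   -- essentially unaffected
```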
Name: generalizedI
Formula:
D(x, y) = Σᵢ (xᵢ log(xᵢ/yᵢ) - xᵢ + yᵢ)
Generator: φ(x) = Σᵢ xᵢ log(xᵢ) - xᵢ
Properties:
Best for: Count data, word frequencies (not normalized)
Domain: Non-negative
```scala
new GeneralizedKMeans().setDivergence("generalizedI")
```
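Unlike kl, zero components in x are harmless here under the convention 0·log 0 = 0, which is why smoothing is only optional. A standalone sketch (the centers y must still stay strictly positive):

```scala
// Generalized I-divergence on raw, unnormalized counts.
// Zero counts in x contribute only the linear terms -x_i + y_i.
def generalizedI(x: Array[Double], y: Array[Double]): Double =
  x.indices.map { i =>
    val logTerm = if (x(i) == 0.0) 0.0 else x(i) * math.log(x(i) / y(i))
    logTerm - x(i) + y(i)
  }.sum

val wordCountsA = Array(3.0, 0.0, 7.0) // zeros allowed, no smoothing needed
val wordCountsB = Array(2.0, 1.0, 5.0)
println(generalizedI(wordCountsA, wordCountsB))
```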
Name: logistic
Formula:
D(x, y) = Σᵢ (xᵢ log(xᵢ/yᵢ) + (1-xᵢ) log((1-xᵢ)/(1-yᵢ)))
Generator: φ(x) = Σᵢ xᵢ log(xᵢ) + (1-xᵢ) log(1-xᵢ)
Properties:
Best for: Binary/probability data, Bernoulli-distributed features
Domain: Open interval (0, 1)
```scala
new GeneralizedKMeans().setDivergence("logistic")
```
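The generator's log terms diverge at 0 and 1, hence the open-interval domain and the need for smoothing. A standalone sketch that clamps values away from the endpoints (the library's exact handling is an assumption):

```scala
// Logistic-loss divergence between vectors of Bernoulli parameters,
// with each component clamped into [eps, 1 - eps].
def logisticDivergence(x: Array[Double], y: Array[Double], eps: Double = 1e-10): Double =
  x.indices.map { i =>
    val xi = math.min(math.max(x(i), eps), 1 - eps)
    val yi = math.min(math.max(y(i), eps), 1 - eps)
    xi * math.log(xi / yi) + (1 - xi) * math.log((1 - xi) / (1 - yi))
  }.sum

val probsA = Array(0.9, 0.2, 0.5)
val probsB = Array(0.8, 0.3, 0.5)
println(logisticDivergence(probsA, probsB))
```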
Names: spherical, cosine
Formula:
D(x, y) = 1 - (x · y) / (||x|| ||y||)
Note: Like l1, not a true Bregman divergence; on unit-length vectors it coincides with the squared Euclidean divergence defined above.
Properties:
Best for: Text/document vectors (TF-IDF), when magnitude doesn’t matter
Domain: Non-zero vectors
```scala
new GeneralizedKMeans().setDivergence("cosine")
// or
new GeneralizedKMeans().setDivergence("spherical")
```
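Magnitude-invariance is easy to verify directly: rescaling either vector leaves the distance unchanged, so documents of different lengths but the same term profile land in the same cluster. A standalone sketch:

```scala
// Cosine distance: 1 minus the cosine of the angle between x and y.
def cosineDistance(x: Array[Double], y: Array[Double]): Double = {
  val dot   = x.indices.map(i => x(i) * y(i)).sum
  val normX = math.sqrt(x.map(v => v * v).sum)
  val normY = math.sqrt(y.map(v => v * v).sum)
  1.0 - dot / (normX * normY)
}

val doc       = Array(0.5, 0.1, 0.0, 0.3)
val longerDoc = doc.map(_ * 10.0) // same direction, 10x the magnitude
println(cosineDistance(doc, longerDoc)) // ~0.0: only direction matters
```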
| Data Type | Recommended | Why |
|---|---|---|
| General continuous | squaredEuclidean | Standard, well-understood |
| Probability distributions | kl | Information-theoretic optimal |
| Word frequencies (normalized) | kl | Natural for distributions |
| Word frequencies (counts) | generalizedI | Handles unnormalized counts |
| Power spectra / audio | itakuraSaito | Scale-invariant |
| TF-IDF / embeddings | cosine | Magnitude-invariant |
| Binary probabilities | logistic | Natural for (0, 1) data |
| Robust clustering | l1 | Less outlier-sensitive |
| Divergence | Requirement | Smoothing Needed |
|---|---|---|
| squaredEuclidean | Any real | No |
| kl | Strictly positive | Yes |
| itakuraSaito | Strictly positive | Yes |
| l1 | Any real | No |
| generalizedI | Non-negative | Optional |
| logistic | (0, 1) exclusive | Yes |
| spherical/cosine | Non-zero | No |
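Given these constraints, it can be worth validating inputs before clustering. A hypothetical pre-flight helper (not part of the library) that encodes the table above:

```scala
// Returns true when vector v satisfies the domain requirement
// of the named divergence.
def validateDomain(divergence: String, v: Array[Double]): Boolean =
  divergence match {
    case "squaredEuclidean" | "l1" => true
    case "kl" | "itakuraSaito"     => v.forall(_ > 0.0)
    case "generalizedI"            => v.forall(_ >= 0.0)
    case "logistic"                => v.forall(x => x > 0.0 && x < 1.0)
    case "spherical" | "cosine"    => v.exists(_ != 0.0)
    case other                     => sys.error(s"unknown divergence: $other")
  }

assert(validateDomain("kl", Array(0.2, 0.8)))
assert(!validateDomain("kl", Array(0.0, 1.0))) // zero violates strict positivity
```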