Scalable clustering with Bregman divergences on Apache Spark
Understanding the mathematical foundation of generalized k-means.
A Bregman divergence is a measure of “distance” between two points, defined by a strictly convex, differentiable function φ (called the generator):
D_φ(x, y) = φ(x) - φ(y) - ∇φ(y) · (x - y)
Intuition: The divergence measures the difference between φ(x) and its linear approximation at y.
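The definition translates directly into code. Below is a minimal sketch in plain Scala (not the library's API); `bregman`, `phi`, and `gradPhi` are names chosen for this example:

```scala
// Illustrative sketch (plain Scala, not the library's API): a Bregman divergence
// built from its generator φ and gradient ∇φ.
object BregmanSketch {
  type Vec = Array[Double]

  private def dot(a: Vec, b: Vec): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum

  // D_φ(x, y) = φ(x) - φ(y) - ∇φ(y) · (x - y)
  def bregman(phi: Vec => Double, gradPhi: Vec => Vec)(x: Vec, y: Vec): Double = {
    val diff = x.zip(y).map { case (xi, yi) => xi - yi }
    phi(x) - phi(y) - dot(gradPhi(y), diff)
  }

  // Squared Euclidean: φ(x) = ½‖x‖², ∇φ(x) = x, so D_φ(x, y) = ½‖x − y‖²
  def squaredEuclidean(x: Vec, y: Vec): Double =
    bregman(v => 0.5 * dot(v, v), (v: Vec) => v)(x, y)

  def main(args: Array[String]): Unit =
    println(squaredEuclidean(Array(1.0, 2.0), Array(3.0, 0.0))) // prints 4.0
}
```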
For any Bregman divergence, the point that minimizes the total divergence from a set of points (taken in the second argument) is their arithmetic mean:
argmin_c Σᵢ D_φ(xᵢ, c) = (1/n) Σᵢ xᵢ
This follows from the identity Σᵢ D_φ(xᵢ, c) = Σᵢ D_φ(xᵢ, x̄) + n·D_φ(x̄, c), where x̄ is the mean: the second term is nonnegative and vanishes only at c = x̄. It is why k-means with any Bregman divergence can update centers by simple averaging.
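A quick numeric check of this property, using the generalized KL divergence described later on this page; a standalone sketch, not library code:

```scala
// Numeric sanity check of the mean-minimizer property for generalized KL divergence.
object MeanMinimizerCheck {
  type Vec = Array[Double]

  // Generalized KL: Σᵢ (xᵢ log(xᵢ/yᵢ) − xᵢ + yᵢ), assuming positive coordinates
  def kl(x: Vec, y: Vec): Double =
    x.zip(y).map { case (xi, yi) => xi * math.log(xi / yi) - xi + yi }.sum

  def main(args: Array[String]): Unit = {
    val points = Seq(Array(1.0, 4.0), Array(2.0, 2.0), Array(3.0, 6.0))
    val n      = points.length
    val dim    = points.head.length
    val mean   = Array.tabulate(dim)(j => points.map(_(j)).sum / n) // arithmetic mean

    def cost(c: Vec): Double = points.map(p => kl(p, c)).sum

    val perturbed = mean.map(_ * 1.1)
    println(f"cost(mean)      = ${cost(mean)}%.4f")
    println(f"cost(perturbed) = ${cost(perturbed)}%.4f") // strictly larger
  }
}
```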
Each Bregman divergence corresponds to a member of the exponential family of distributions:
| Divergence | Distribution | Natural for |
|---|---|---|
| Squared Euclidean | Gaussian | Continuous data |
| KL | Multinomial/Poisson | Counts, probabilities |
| Itakura-Saito | Gamma | Power spectra |
| Logistic | Bernoulli | Binary data |
The k-means objective (minimize total within-cluster divergence) has the same form regardless of which Bregman divergence you use:
minimize Σᵢ Σⱼ wᵢⱼ D_φ(xᵢ, μⱼ)
where wᵢⱼ is the (possibly weighted) assignment of point xᵢ to the cluster with center μⱼ.
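The objective drives the usual two-step iteration: assign each point to its nearest center under D_φ, then replace each center by the mean of its cluster. A sketch of one such iteration over plain Scala collections (the library distributes the same two steps over Spark; the names `step`, `points`, and `centers` are illustrative):

```scala
// One Lloyd-style iteration with a pluggable Bregman divergence d.
object LloydSketch {
  type Vec = Array[Double]

  def step(points: Seq[Vec],
           centers: IndexedSeq[Vec],
           d: (Vec, Vec) => Double): IndexedSeq[Vec] = {
    // Assignment: each point goes to the center with smallest divergence D_φ(x, μⱼ).
    val assigned = points.groupBy(p => centers.indices.minBy(j => d(p, centers(j))))
    // Update: regardless of φ, the new center is the arithmetic mean of its cluster.
    centers.indices.map { j =>
      assigned.get(j) match {
        case Some(cluster) =>
          val dim = cluster.head.length
          Array.tabulate(dim)(i => cluster.map(_(i)).sum / cluster.size)
        case None => centers(j) // keep an empty cluster's center unchanged
      }
    }
  }
}
```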
Each divergence is fully specified by its generator φ:

**Squared Euclidean**
φ(x) = ½||x||² = ½ Σᵢ xᵢ²
∇φ(x) = x
D_φ(x,y) = ½||x - y||²

**KL divergence**
φ(x) = Σᵢ xᵢ log(xᵢ) (negative entropy)
∇φ(x) = log(x) + 1 (componentwise)
D_φ(x,y) = Σᵢ (xᵢ log(xᵢ/yᵢ) - xᵢ + yᵢ)

**Itakura-Saito**
φ(x) = -Σᵢ log(xᵢ)
∇φ(x) = -1/x (componentwise)
D_φ(x,y) = Σᵢ (xᵢ/yᵢ - log(xᵢ/yᵢ) - 1)
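For reference, the three divergences above written out as closed-form functions; a sketch assuming strictly positive coordinates wherever logs and ratios appear, not the library's API:

```scala
// Closed-form implementations of the three divergences above.
object Divergences {
  type Vec = Array[Double]

  // Squared Euclidean: ½‖x − y‖²
  def squaredEuclidean(x: Vec, y: Vec): Double =
    0.5 * x.zip(y).map { case (xi, yi) => (xi - yi) * (xi - yi) }.sum

  // Generalized KL: Σᵢ (xᵢ log(xᵢ/yᵢ) − xᵢ + yᵢ)
  def generalizedKL(x: Vec, y: Vec): Double =
    x.zip(y).map { case (xi, yi) => xi * math.log(xi / yi) - xi + yi }.sum

  // Itakura-Saito: Σᵢ (xᵢ/yᵢ − log(xᵢ/yᵢ) − 1)
  def itakuraSaito(x: Vec, y: Vec): Double =
    x.zip(y).map { case (xi, yi) => xi / yi - math.log(xi / yi) - 1.0 }.sum
}
```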
D_φ(x, y) ≥ 0 with equality iff x = y
D_φ(x, y) is convex in x (but not necessarily in y).
In general, D_φ(x, y) ≠ D_φ(y, x).
The only symmetric Bregman divergences are those generated by quadratic φ (squared Mahalanobis distances), with squared Euclidean as the canonical example.
Bregman divergences do not in general satisfy the triangle inequality; even squared Euclidean fails it (only its square root, the ordinary Euclidean distance, is a metric).
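A small numeric illustration of the asymmetry and triangle-inequality points above (standalone sketch):

```scala
// Demonstrates asymmetry of KL and failure of the triangle inequality for squared Euclidean.
object PropertiesDemo {
  def klScalar(x: Double, y: Double): Double = x * math.log(x / y) - x + y
  def sqEuclid(x: Double, y: Double): Double = 0.5 * (x - y) * (x - y)

  def main(args: Array[String]): Unit = {
    // Asymmetry: D(x, y) ≠ D(y, x) for KL.
    println(klScalar(1.0, 4.0)) // ≈ 1.6137
    println(klScalar(4.0, 1.0)) // ≈ 2.5452

    // Triangle inequality fails even for squared Euclidean:
    // D(0, 2) = 2.0 > D(0, 1) + D(1, 2) = 0.5 + 0.5
    println(sqEuclid(0.0, 2.0))
    println(sqEuclid(0.0, 1.0) + sqEuclid(1.0, 2.0))
  }
}
```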
The “right” divergence depends on your data (see the sketch after this list):

- Probability distributions? → KL divergence
- Power spectra or variances? → Itakura-Saito
- Absolute scale matters? → Squared Euclidean
- Only direction (angle) matters? → Cosine / Spherical
- Unnormalized counts? → Generalized I-divergence
- Binary probabilities? → Logistic loss
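The guide can be read as a simple lookup; a sketch with illustrative names, not the library's API:

```scala
// The decision guide above expressed as a lookup from data kind to divergence name.
object DivergenceGuide {
  sealed trait DataKind
  case object ProbabilityDistributions extends DataKind
  case object PowerSpectra             extends DataKind
  case object AbsoluteScale            extends DataKind
  case object DirectionalData          extends DataKind
  case object RawCounts                extends DataKind
  case object BinaryProbabilities      extends DataKind

  def suggestedDivergence(kind: DataKind): String = kind match {
    case ProbabilityDistributions => "KL divergence"
    case PowerSpectra             => "Itakura-Saito"
    case AbsoluteScale            => "Squared Euclidean"
    case DirectionalData          => "Cosine / Spherical"
    case RawCounts                => "Generalized I-divergence"
    case BinaryProbabilities      => "Logistic loss"
  }
}
```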
For an exponential family with natural parameter θ and log-partition A(θ):
p(x|θ) = h(x) exp(θ·x - A(θ))
The corresponding Bregman divergence uses:
φ = A* (convex conjugate of A)
This is why, for example, squared Euclidean matches the Gaussian: its log-partition is A(θ) = ½||θ||², whose convex conjugate φ(x) = ½||x||² is exactly the squared-Euclidean generator.
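A worked derivation of this Gaussian case (added here for illustration; unit variance, so the natural parameter θ equals the mean μ):

```latex
\begin{aligned}
A(\theta) &= \tfrac{1}{2}\|\theta\|^{2}, \qquad \nabla A(\theta) = \theta = \mu \\
\varphi(\mu) = A^{*}(\mu) &= \sup_{\theta}\bigl(\theta\cdot\mu - \tfrac{1}{2}\|\theta\|^{2}\bigr) = \tfrac{1}{2}\|\mu\|^{2} \\
D_{\varphi}(x,\mu) &= \tfrac{1}{2}\|x\|^{2} - \tfrac{1}{2}\|\mu\|^{2} - \mu\cdot(x-\mu) = \tfrac{1}{2}\|x-\mu\|^{2} \\
-\log p(x\mid\theta) &= D_{\varphi}(x,\mu) + \mathrm{const}
\end{aligned}
```

So maximizing Gaussian likelihood of points given a center is the same as minimizing their total squared-Euclidean divergence to it; the other rows of the table above follow the same pattern with their own log-partitions.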
Bregman divergences define a dually flat geometry on the parameter space. The primal coordinates (means) and dual coordinates (natural parameters) are connected by the gradient of φ.