Generalized K-Means Clustering

Scalable clustering with Bregman divergences on Apache Spark


Divergence Reference

Mathematical details of all supported Bregman divergences.


Overview

A Bregman divergence D_φ(x, y) is defined by a strictly convex, differentiable function φ:

D_φ(x, y) = φ(x) - φ(y) - ∇φ(y)·(x - y)

The library implements the eight named divergences below (spherical and cosine refer to the same measure), each suited to a different kind of data.
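
To make the definition concrete, the following standalone sketch (not the library's internals) assembles a Bregman divergence from a generator φ and its gradient, then instantiates it for the squared Euclidean case:

// Illustrative sketch only, not the library's implementation.
def bregman(phi: Array[Double] => Double,
            gradPhi: Array[Double] => Array[Double])
           (x: Array[Double], y: Array[Double]): Double = {
  val g = gradPhi(y)  // ∇φ(y)
  phi(x) - phi(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
}

// Squared Euclidean: φ(x) = ½||x||², so ∇φ(y) = y
val sqEuclidean = bregman(x => 0.5 * x.map(v => v * v).sum, y => y) _
sqEuclidean(Array(1.0, 2.0), Array(3.0, 0.0))  // ½((1-3)² + (2-0)²) = 4.0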


Squared Euclidean

Name: squaredEuclidean

Formula:

D(x, y) = ½ Σᵢ (xᵢ - yᵢ)²
Generator: φ(x) = ½ ||x||²
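
Worked example: with x = (1, 2) and y = (3, 0),

D(x, y) = ½((1-3)² + (2-0)²) = ½(4 + 4) = 4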

Properties:

Best for: General-purpose clustering, continuous features

Domain: All real numbers

new GeneralizedKMeans().setDivergence("squaredEuclidean")

Kullback-Leibler (KL)

Name: kl

Formula:

D(x, y) = Σᵢ xᵢ log(xᵢ/yᵢ) - Σᵢ xᵢ + Σᵢ yᵢ

Generator: φ(x) = Σᵢ xᵢ log(xᵢ)

Properties:

Best for: Probability distributions, topic models, document clustering

Domain: Strictly positive (use the smoothing parameter for data with zeros)

new GeneralizedKMeans()
  .setDivergence("kl")
  .setSmoothing(1e-10)
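
As a sanity check on the formula, here is a standalone sketch (not the library's code) that assumes smoothing is a small additive constant keeping every coordinate strictly positive (the library's exact scheme may differ). Note that for normalized inputs (Σᵢ xᵢ = Σᵢ yᵢ = 1) the -Σᵢ xᵢ + Σᵢ yᵢ terms cancel, recovering the classical Σᵢ xᵢ log(xᵢ/yᵢ):

// Sketch only; assumes additive smoothing, which may differ from the library's scheme.
def kl(x: Array[Double], y: Array[Double], eps: Double = 1e-10): Double =
  x.indices.map { i =>
    val xi = x(i) + eps  // smoothing keeps log and division well-defined at zero
    val yi = y(i) + eps
    xi * math.log(xi / yi) - xi + yi
  }.sum

kl(Array(0.7, 0.3, 0.0), Array(0.5, 0.4, 0.1))  // finite despite the zero entry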

Itakura-Saito

Name: itakuraSaito

Formula:

D(x, y) = Σᵢ (xᵢ/yᵢ - log(xᵢ/yᵢ) - 1)

Generator: φ(x) = -Σᵢ log(xᵢ)

Properties:

Best for: Power spectra, audio signals, spectral analysis

Domain: Strictly positive

new GeneralizedKMeans()
  .setDivergence("itakuraSaito")
  .setSmoothing(1e-10)
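
The table at the end of this page calls Itakura-Saito scale-invariant; a quick standalone sketch (illustrative only) shows why: the formula depends only on the ratios xᵢ/yᵢ, so rescaling both vectors leaves the divergence unchanged:

// Illustrative sketch, not the library's implementation.
def itakuraSaito(x: Array[Double], y: Array[Double]): Double =
  x.indices.map { i =>
    val r = x(i) / y(i)  // only the ratio matters
    r - math.log(r) - 1
  }.sum

val x = Array(1.0, 4.0); val y = Array(2.0, 3.0)
// Scale invariance: D(c·x, c·y) = D(x, y) for any c > 0
math.abs(itakuraSaito(x, y) - itakuraSaito(x.map(_ * 10), y.map(_ * 10))) < 1e-12  // true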

L1 (Manhattan)

Name: l1

Formula:

D(x, y) = Σᵢ |xᵢ - yᵢ|

Note: Not a true Bregman divergence (|t| is convex but neither strictly convex nor differentiable at 0, so no generator φ exists), but supported for practicality.

Properties:

Best for: Robust clustering, sparse features, when outliers present

Domain: All real numbers

new GeneralizedKMeans().setDivergence("l1")

Generalized I-Divergence

Name: generalizedI

Formula:

D(x, y) = Σᵢ (xᵢ log(xᵢ/yᵢ) - xᵢ + yᵢ)

Generator: φ(x) = Σᵢ (xᵢ log(xᵢ) - xᵢ)

Properties:

Best for: Count data, word frequencies (not normalized)

Domain: Non-negative

new GeneralizedKMeans().setDivergence("generalizedI")

Logistic Loss

Name: logistic

Formula:

D(x, y) = Σᵢ (xᵢ log(xᵢ/yᵢ) + (1-xᵢ) log((1-xᵢ)/(1-yᵢ)))

Generator: φ(x) = Σᵢ (xᵢ log(xᵢ) + (1-xᵢ) log(1-xᵢ))

Properties:

Best for: Binary/probability data, Bernoulli-distributed features

Domain: Open interval (0, 1)

new GeneralizedKMeans().setDivergence("logistic")

Spherical / Cosine

Names: spherical, cosine

Formula:

D(x, y) = 1 - (x · y) / (||x|| ||y||)

Properties:

Best for: Text/document vectors (TF-IDF), when magnitude doesn’t matter

Domain: Non-zero vectors

new GeneralizedKMeans().setDivergence("cosine")
// or
new GeneralizedKMeans().setDivergence("spherical")

Choosing a Divergence

Data Type                       Recommended       Why
General continuous              squaredEuclidean  Standard, well-understood
Probability distributions       kl                Information-theoretically optimal
Word frequencies (normalized)   kl                Natural for distributions
Word frequencies (counts)       generalizedI      Handles unnormalized data
Power spectra / audio           itakuraSaito      Scale-invariant
TF-IDF / embeddings             cosine            Magnitude-invariant
Binary probabilities            logistic          Natural for (0, 1) data
Robust clustering               l1                Less sensitive to outliers
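
In practice the choice is a one-line configuration change, using only the setter documented on this page:

new GeneralizedKMeans().setDivergence("cosine")        // TF-IDF: direction matters, magnitude doesn't
new GeneralizedKMeans().setDivergence("generalizedI")  // raw counts: unnormalized, non-negative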

Domain Requirements Summary

Divergence        Requirement         Smoothing Needed
squaredEuclidean  Any real            No
kl                Strictly positive   Yes
itakuraSaito      Strictly positive   Yes
l1                Any real            No
generalizedI      Non-negative        Optional
logistic          (0, 1) exclusive    Yes
spherical/cosine  Non-zero            No
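
The same rules can be encoded as a pre-flight check before clustering; the helper below is a hypothetical sketch, not part of the library's API:

// Hypothetical helper encoding the table above; illustrative only.
def inDomain(divergence: String, v: Array[Double]): Boolean = divergence match {
  case "squaredEuclidean" | "l1" => true                                // any real
  case "kl" | "itakuraSaito"     => v.forall(_ > 0.0)                   // strictly positive
  case "generalizedI"            => v.forall(_ >= 0.0)                  // non-negative
  case "logistic"                => v.forall(x => x > 0.0 && x < 1.0)   // open (0, 1)
  case "spherical" | "cosine"    => v.exists(_ != 0.0)                  // non-zero vector
  case other                     => sys.error(s"unknown divergence: $other")
}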
