Generalized K-Means Clustering

Scalable clustering with Bregman divergences on Apache Spark

View the Project on GitHub derrickburns/generalized-kmeans-clustering

Scala 2.13 · Spark 3.5


What is this library?

A production-ready Spark ML library that provides 15 clustering algorithms and supports multiple distance functions (Bregman divergences), such as KL divergence for probability distributions and Itakura-Saito divergence for spectral data.
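To make the term "Bregman divergence" concrete, here is a minimal plain-Scala sketch (no Spark required). A Bregman divergence is generated by a convex function F as D_F(p, q) = F(p) - F(q) - <∇F(q), p - q>. The object `BregmanSketch` and its members are hypothetical names chosen for illustration; they are not part of this library's API, and the library's own implementations may differ.

```scala
object BregmanSketch {
  type Vec = Array[Double]

  /** D_F(p, q) = F(p) - F(q) - <gradF(q), p - q> for a convex generator F. */
  def bregman(f: Vec => Double, gradF: Vec => Vec)(p: Vec, q: Vec): Double = {
    val g = gradF(q)
    val inner = p.indices.map(i => g(i) * (p(i) - q(i))).sum
    f(p) - f(q) - inner
  }

  // F(x) = sum of x_i^2 yields the squared Euclidean distance.
  val squaredEuclidean: (Vec, Vec) => Double =
    bregman(x => x.map(v => v * v).sum, x => x.map(_ * 2.0))

  // F(x) = sum of x_i * log(x_i) yields the generalized KL divergence.
  // Inputs must be strictly positive.
  val kl: (Vec, Vec) => Double =
    bregman(x => x.map(v => v * math.log(v)).sum, x => x.map(v => math.log(v) + 1.0))

  // F(x) = -sum of log(x_i) yields the Itakura-Saito divergence.
  // Inputs must be strictly positive.
  val itakuraSaito: (Vec, Vec) => Double =
    bregman(x => -x.map(math.log).sum, x => x.map(v => -1.0 / v))
}
```

For example, `BregmanSketch.squaredEuclidean(Array(1.0, 2.0), Array(3.0, 5.0))` evaluates the generic formula and agrees with the familiar sum of squared differences. Note that KL and Itakura-Saito are not symmetric, so argument order matters.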

Key features:


Quick Example

import com.massivedatascience.clusterer.ml.GeneralizedKMeans
import org.apache.spark.ml.linalg.Vectors

// Assumes an active SparkSession named `spark`, as provided by spark-shell or a notebook.
val data = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(1.0, 1.0)),
  Tuple1(Vectors.dense(9.0, 8.0)),
  Tuple1(Vectors.dense(8.0, 9.0))
)).toDF("features")

val kmeans = new GeneralizedKMeans()
  .setK(2)
  .setDivergence("squaredEuclidean")  // or "kl", "itakuraSaito", etc.
  .setMaxIter(20)

val model = kmeans.fit(data)
val predictions = model.transform(data)
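To illustrate what fitting a k-means model with a pluggable divergence involves conceptually, the following plain-Scala sketch runs Lloyd's iteration on the same four points. It is not the library's implementation: `LloydSketch`, its methods, and the seeding from the first k points are assumptions made for this example. It relies on the known property that, for any Bregman divergence D(x, c), the arithmetic mean of a cluster's points is the center c that minimizes the total divergence.

```scala
object LloydSketch {
  type Vec = Array[Double]
  type Divergence = (Vec, Vec) => Double // called as d(point, center)

  val squaredEuclidean: Divergence =
    (p, q) => p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum

  /** Index of the center with the smallest divergence from `point`. */
  def closest(point: Vec, centers: Seq[Vec], d: Divergence): Int =
    centers.indices.minBy(i => d(point, centers(i)))

  /** Component-wise arithmetic mean of a non-empty group of points. */
  def mean(points: Seq[Vec]): Vec = {
    val dim = points.head.length
    Array.tabulate(dim)(j => points.map(_(j)).sum / points.size)
  }

  /** Returns the final centers and the cluster index assigned to each point. */
  def fit(data: Seq[Vec], k: Int, maxIter: Int, d: Divergence): (Seq[Vec], Seq[Int]) = {
    // Seeding with the first k points keeps the example deterministic;
    // a real implementation would use a better initialization.
    var centers: Seq[Vec] = data.take(k)
    var labels: Seq[Int] = data.map(closest(_, centers, d))
    var iter = 0
    var changed = true
    while (iter < maxIter && changed) {
      // Update step: move each center to the mean of its members.
      centers = centers.indices.map { c =>
        val members = data.zip(labels).collect { case (x, l) if l == c => x }
        if (members.isEmpty) centers(c) else mean(members)
      }
      // Assignment step: reassign every point to its closest center.
      val next = data.map(closest(_, centers, d))
      changed = next != labels
      labels = next
      iter += 1
    }
    (centers, labels)
  }
}
```

Calling `LloydSketch.fit(points, 2, 20, LloydSketch.squaredEuclidean)` on the four example points groups the two points near the origin together and the two points near (8.5, 8.5) together. Swapping in a different `Divergence` changes only the assignment step; the mean update stays the same.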

Documentation

This documentation follows the Diátaxis framework:

Tutorials — Learning-oriented

Step-by-step guides to get you started:

How-To Guides — Task-oriented

Practical recipes for specific tasks:

Reference — Information-oriented

Technical specifications:

Explanation — Understanding-oriented

Conceptual guides:


Algorithms

| Algorithm | Use Case | Key Feature |
|---|---|---|
| GeneralizedKMeans | General clustering | 7 divergence functions |
| XMeans | Unknown k | Automatic cluster count |
| SoftKMeans | Overlapping clusters | Probabilistic assignments |
| BisectingKMeans | Hierarchical | Top-down divisive |
| StreamingKMeans | Real-time | Online updates |
| KMedoids | Outlier-resistant | Uses actual data points |
| BalancedKMeans | Equal-sized clusters | Size constraints |
| ConstrainedKMeans | Semi-supervised | Must-link/cannot-link |
| RobustKMeans | Noisy data | Outlier detection |
| SparseKMeans | High-dimensional | Sparse optimization |
| MultiViewKMeans | Multiple features | Per-view divergences |
| TimeSeriesKMeans | Sequences | DTW distance |
| SpectralClustering | Non-convex | Graph Laplacian |
| InformationBottleneck | Compression | Information theory |
| MiniBatchKMeans | Large scale | Stochastic updates |

Installation

SBT

libraryDependencies += "com.massivedatascience" %% "massivedatascience-clusterer" % "0.7.0"
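
For context, a fuller `build.sbt` might look like the sketch below. Only the clusterer coordinates come from this page. The Scala and Spark patch versions, and the choice to mark Spark as `provided` (typical when the cluster supplies Spark at runtime), are assumptions to adjust for your environment.

```scala
// build.sbt (illustrative sketch; versions other than the clusterer's are assumptions)
scalaVersion := "2.13.14"

libraryDependencies ++= Seq(
  // Spark is usually supplied by the cluster, hence "provided".
  "org.apache.spark" %% "spark-sql"   % "3.5.1" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.5.1" % "provided",
  "com.massivedatascience" %% "massivedatascience-clusterer" % "0.7.0"
)
```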

spark-submit

spark-submit --packages com.massivedatascience:massivedatascience-clusterer_2.13:0.7.0 your-app.jar

Databricks

%pip install massivedatascience-clusterer

See Installation Guide for detailed instructions.


Version Compatibility

Spark Scala 2.12 Scala 2.13
4.0.x
3.5.x
3.4.x

Getting Help


License

Apache License 2.0 — See LICENSE

Copyright © 2025 Massive Data Science, LLC