Scalable clustering with Bregman divergences on Apache Spark
A production-ready Spark ML library providing 15 clustering algorithms with support for multiple distance functions (Bregman divergences), including KL divergence for probability distributions, Itakura-Saito for spectral data, and more.
Quick start:
```scala
import com.massivedatascience.clusterer.ml.GeneralizedKMeans
import org.apache.spark.ml.linalg.Vectors

val data = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.0, 0.0)),
  Tuple1(Vectors.dense(1.0, 1.0)),
  Tuple1(Vectors.dense(9.0, 8.0)),
  Tuple1(Vectors.dense(8.0, 9.0))
)).toDF("features")

val kmeans = new GeneralizedKMeans()
  .setK(2)
  .setDivergence("squaredEuclidean") // or "kl", "itakuraSaito", etc.
  .setMaxIter(20)

val model = kmeans.fit(data)
val predictions = model.transform(data)
```
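Other divergences drop in through the same setter. KL divergence, for example, compares discrete probability distributions, so each feature vector should be non-negative and normalized to sum to 1. A minimal sketch, using the `"kl"` divergence name mentioned in the comment above (the normalization requirement is a property of KL divergence itself; check the reference docs for how this library handles unnormalized input):

```scala
// KL divergence clusters probability distributions: every row below is
// non-negative and sums to 1.
val histograms = spark.createDataFrame(Seq(
  Tuple1(Vectors.dense(0.7, 0.2, 0.1)),
  Tuple1(Vectors.dense(0.6, 0.3, 0.1)),
  Tuple1(Vectors.dense(0.1, 0.2, 0.7)),
  Tuple1(Vectors.dense(0.1, 0.3, 0.6))
)).toDF("features")

val klModel = new GeneralizedKMeans()
  .setK(2)
  .setDivergence("kl")
  .setMaxIter(20)
  .fit(histograms)
```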
This documentation follows the Diátaxis framework:

- **Tutorials**: step-by-step guides to get you started
- **How-to guides**: practical recipes for specific tasks
- **Reference**: technical specifications
- **Explanation**: conceptual guides
| Algorithm | Use Case | Key Feature |
|---|---|---|
| GeneralizedKMeans | General clustering | 7 divergence functions |
| XMeans | Unknown k | Automatic cluster count |
| SoftKMeans | Overlapping clusters | Probabilistic assignments |
| BisectingKMeans | Hierarchical | Top-down divisive |
| StreamingKMeans | Real-time | Online updates |
| KMedoids | Outlier-resistant | Uses actual data points |
| BalancedKMeans | Equal-sized clusters | Size constraints |
| ConstrainedKMeans | Semi-supervised | Must-link/cannot-link |
| RobustKMeans | Noisy data | Outlier detection |
| SparseKMeans | High-dimensional | Sparse optimization |
| MultiViewKMeans | Multiple features | Per-view divergences |
| TimeSeriesKMeans | Sequences | DTW distance |
| SpectralClustering | Non-convex | Graph Laplacian |
| InformationBottleneck | Compression | Information theory |
| MiniBatchKMeans | Large scale | Stochastic updates |
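Each of these is a Spark ML `Estimator` whose `fit` produces a `Model` usable in pipelines, so the quick-start pattern above carries over. A sketch of inspecting a fitted model (the `prediction` output column follows the usual Spark ML clustering convention; treat the column name as an assumption for this library and verify it in the reference docs):

```scala
// Continuing from the quick-start example: transform() appends a cluster
// index for each input row (assumed column name: "prediction").
val predictions = model.transform(data)
predictions.select("features", "prediction").show()
```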
Add the dependency with sbt:

```scala
libraryDependencies += "com.massivedatascience" %% "massivedatascience-clusterer" % "0.7.0"
```

Or pull it in at submit time:

```
spark-submit --packages com.massivedatascience:massivedatascience-clusterer_2.13:0.7.0 your-app.jar
```

Or, from a Python notebook:

```
%pip install massivedatascience-clusterer
```
See Installation Guide for detailed instructions.
| Spark | Scala 2.12 | Scala 2.13 |
|---|---|---|
| 4.0.x | — | ✓ |
| 3.5.x | ✓ | ✓ |
| 3.4.x | ✓ | ✓ |
Apache License 2.0 — See LICENSE
Copyright © 2025 Massive Data Science, LLC