# Scalable clustering with Bregman divergences on Apache Spark
Complete list of all 15 clustering algorithms.
## GeneralizedKMeans

Standard k-means with pluggable Bregman divergences.

- Class: `com.massivedatascience.clusterer.ml.GeneralizedKMeans`
- Use when: General-purpose clustering with known k
- Complexity: O(n × k × d × iterations)

```scala
val kmeans = new GeneralizedKMeans()
  .setK(5)
  .setDivergence("squaredEuclidean")
  .setMaxIter(100)
```
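A minimal end-to-end sketch, assuming the estimator follows the standard Spark ML Estimator/Model pattern with a `features` vector column in and a `prediction` column out; the `fit`/`transform` calls and column names are assumptions about the API, not confirmed by this page:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.clusterer.ml.GeneralizedKMeans

val spark = SparkSession.builder().appName("generalized-kmeans-demo").getOrCreate()
import spark.implicits._

// Toy input: a DataFrame with a "features" vector column (assumed input format).
val df = Seq(
  Vectors.dense(0.0, 0.0),
  Vectors.dense(0.1, 0.2),
  Vectors.dense(9.0, 9.1),
  Vectors.dense(9.2, 8.9)
).map(Tuple1.apply).toDF("features")

// Assumed Estimator/Model workflow: fit, then transform to get cluster assignments.
val kmeans = new GeneralizedKMeans()
  .setK(2)
  .setDivergence("squaredEuclidean")
  .setMaxIter(20)

val model = kmeans.fit(df)
model.transform(df).show() // expected to add a "prediction" column
```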
## XMeans

Automatic k selection using BIC/AIC.

- Class: `com.massivedatascience.clusterer.ml.XMeans`
- Use when: Unknown number of clusters
- Algorithm: Iteratively splits clusters and evaluates each split with an information criterion (BIC or AIC)

```scala
val xmeans = new XMeans()
  .setMinK(2)
  .setMaxK(20)
  .setCriterion("bic")
```
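A sketch of inspecting the k that X-Means settled on by counting distinct cluster labels; it reuses `df` from the sketch above and assumes the same fit/transform workflow and a `prediction` output column (assumptions, not confirmed API):

```scala
import com.massivedatascience.clusterer.ml.XMeans

// `df` is the "features" DataFrame from the GeneralizedKMeans sketch above.
val xmeansModel = new XMeans()
  .setMinK(2)
  .setMaxK(20)
  .setCriterion("bic")
  .fit(df)

// Count distinct cluster labels to see how many clusters the criterion kept.
val chosenK = xmeansModel.transform(df).select("prediction").distinct().count()
println(s"X-Means selected k = $chosenK")
```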
## SoftKMeans

Probabilistic/fuzzy cluster assignments.

- Class: `com.massivedatascience.clusterer.ml.SoftKMeans`
- Use when: Points may belong to multiple clusters
- Output: Adds a probabilities column with per-cluster membership weights

```scala
val soft = new SoftKMeans()
  .setK(3)
  .setBeta(2.0)
```
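Because SoftKMeans adds a probabilities column, a short sketch of reading the soft memberships (same assumptions as above about the fit/transform workflow and column names):

```scala
import com.massivedatascience.clusterer.ml.SoftKMeans

// `df` is the "features" DataFrame from the first sketch.
val softModel = new SoftKMeans()
  .setK(3)
  .setBeta(2.0)
  .fit(df)

// Each row carries a vector of membership weights, one per cluster.
softModel.transform(df)
  .select("features", "probabilities")
  .show(truncate = false)
```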
## BisectingKMeans

Hierarchical divisive clustering.

- Class: `com.massivedatascience.clusterer.ml.BisectingKMeans`
- Use when: Need a hierarchical structure, or more deterministic results than random initialization
- Algorithm: Recursively bisects the largest cluster

```scala
val bisecting = new BisectingKMeans()
  .setK(10)
  .setMinDivisibleClusterSize(5)
```
## StreamingKMeans

Online clustering with exponential decay.

- Class: `com.massivedatascience.clusterer.ml.StreamingKMeans`
- Use when: Data arrives in streams or the distribution drifts over time (concept drift)

```scala
val streaming = new StreamingKMeans()
  .setK(5)
  .setDecayFactor(0.9)
```
## MiniBatchKMeans

Stochastic updates on mini-batches.

- Class: `com.massivedatascience.clusterer.ml.MiniBatchKMeans`
- Use when: Very large datasets, faster convergence

```scala
val minibatch = new MiniBatchKMeans()
  .setK(10)
  .setBatchSize(1000)
```
## KMedoids

Uses actual data points as centers (PAM algorithm).

- Class: `com.massivedatascience.clusterer.ml.KMedoids`
- Use when: Need interpretable centers, outlier resistance
- Complexity: O(n² × k × iterations)

```scala
val kmedoids = new KMedoids()
  .setK(5)
  .setDistanceFunction("manhattan")
```
## RobustKMeans

Explicit outlier handling.

- Class: `com.massivedatascience.clusterer.ml.RobustKMeans`
- Use when: Noisy data, need outlier detection
- Modes: trim, noise_cluster, m_estimator

```scala
val robust = new RobustKMeans()
  .setK(5)
  .setRobustMode("trim")
  .setTrimFraction(0.1)
```
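A sketch of trim mode on data with injected outliers; the intent is that roughly the trim fraction of farthest points is ignored when re-estimating centers, so the two real clusters are not dragged toward the outliers (the exact trimming semantics and the fit call are assumptions):

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.clusterer.ml.RobustKMeans

// Two tight clusters plus two far-away outliers; reuses spark.implicits._ from the first sketch.
val noisy = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(-0.1, 0.0),
  Vectors.dense(5.0, 5.0), Vectors.dense(5.1, 4.9), Vectors.dense(4.9, 5.1),
  Vectors.dense(100.0, 100.0), Vectors.dense(-80.0, 90.0) // outliers
).map(Tuple1.apply).toDF("features")

// With trimFraction = 0.25, roughly the 25% farthest points should be excluded
// from center updates, keeping the centers near (0, 0) and (5, 5).
val robustModel = new RobustKMeans()
  .setK(2)
  .setRobustMode("trim")
  .setTrimFraction(0.25)
  .fit(noisy)

robustModel.transform(noisy).show()
```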
## BalancedKMeans

Equal-sized cluster constraints.

- Class: `com.massivedatascience.clusterer.ml.BalancedKMeans`
- Use when: Need approximately equal cluster sizes

```scala
val balanced = new BalancedKMeans()
  .setK(5)
  .setBalanceMode("hard")
```
## ConstrainedKMeans

Semi-supervised clustering with must-link/cannot-link constraints.

- Class: `com.massivedatascience.clusterer.ml.ConstrainedKMeans`
- Use when: You have prior knowledge about which points must (or must not) share a cluster

```scala
val constrained = new ConstrainedKMeans()
  .setK(3)
  .setMustLinkCol("must_link")
  .setCannotLinkCol("cannot_link")
```
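A sketch of wiring the constraint columns into a fit. The id-array encoding below is purely illustrative (hypothetical); check the API docs for the actual constraint format before relying on it:

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.clusterer.ml.ConstrainedKMeans

// Hypothetical layout: each row lists the ids of points it must / cannot share
// a cluster with. The real encoding may differ; this only illustrates the wiring.
// Reuses spark.implicits._ from the first sketch.
val constrainedDf = Seq(
  (0L, Vectors.dense(0.0, 0.0), Array(1L), Array.empty[Long]),
  (1L, Vectors.dense(0.2, 0.1), Array(0L), Array.empty[Long]),
  (2L, Vectors.dense(5.0, 5.0), Array.empty[Long], Array(0L))
).toDF("id", "features", "must_link", "cannot_link")

val constrainedModel = new ConstrainedKMeans()
  .setK(2)
  .setMustLinkCol("must_link")
  .setCannotLinkCol("cannot_link")
  .fit(constrainedDf)
```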
## SparseKMeans

Optimized for high-dimensional sparse data.

- Class: `com.massivedatascience.clusterer.ml.SparseKMeans`
- Use when: Highly sparse features (more than roughly 50% zero values)

```scala
val sparse = new SparseKMeans()
  .setK(10)
  .setSparseMode("auto")
```
## MultiViewKMeans

Clustering over multiple feature representations (views), each with its own divergence.

- Class: `com.massivedatascience.clusterer.ml.MultiViewKMeans`
- Use when: Data has multiple modalities/views

```scala
val multiview = new MultiViewKMeans()
  .setK(5)
  .setViewSpecs(Seq(
    ViewSpec("view1", "squaredEuclidean", 1.0),
    ViewSpec("view2", "cosine", 0.5)
  ))
```
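A sketch of the input shape this implies: one vector column per view, named to match the ViewSpec entries. The column-per-view layout, the ViewSpec import location, and the reading of the third ViewSpec argument as a per-view weight are assumptions:

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.clusterer.ml.{MultiViewKMeans, ViewSpec}

// Assumed layout: one vector column per view, named as in the ViewSpecs;
// reuses spark.implicits._ from the first sketch.
val views = Seq(
  (Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 0.0, 0.0)),
  (Vectors.dense(0.1, 0.2), Vectors.dense(0.9, 0.1, 0.0)),
  (Vectors.dense(9.0, 9.1), Vectors.dense(0.0, 0.1, 0.9))
).toDF("view1", "view2")

val multiviewModel = new MultiViewKMeans()
  .setK(2)
  .setViewSpecs(Seq(
    ViewSpec("view1", "squaredEuclidean", 1.0), // (column, divergence, weight?) per the snippet above
    ViewSpec("view2", "cosine", 0.5)
  ))
  .fit(views)
```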
## TimeSeriesKMeans

Sequence clustering with dynamic time warping (DTW).

- Class: `com.massivedatascience.clusterer.ml.TimeSeriesKMeans`
- Use when: Time series, sequences of varying length

```scala
val timeseries = new TimeSeriesKMeans()
  .setK(5)
  .setDistanceType("dtw")
```
## SpectralClustering

Graph-based clustering via Laplacian eigenvectors.

- Class: `com.massivedatascience.clusterer.ml.SpectralClustering`
- Use when: Non-convex clusters, graph/network data
- Complexity: O(n²), or O(n × m²) when using m Nyström landmarks

```scala
val spectral = new SpectralClustering()
  .setK(5)
  .setAffinityType("rbf")
  .setUseNystrom(true)
```
## InformationBottleneck

Information-theoretic clustering.

- Class: `com.massivedatascience.clusterer.ml.InformationBottleneck`
- Use when: Compressing the data while preserving information about a relevance variable (supplied via the relevance column)

```scala
val ib = new InformationBottleneck()
  .setK(5)
  .setBeta(1.0)
  .setRelevanceCol("labels")
```
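A sketch of supplying the relevance variable as an ordinary column next to the features; the column layout and fit call are assumptions, and beta is read here as the usual information-bottleneck trade-off between compression and relevance:

```scala
import org.apache.spark.ml.linalg.Vectors
import com.massivedatascience.clusterer.ml.InformationBottleneck

// Features plus a relevance variable ("labels") that the compressed clusters
// should remain informative about; reuses spark.implicits._ from the first sketch.
val labeled = Seq(
  (Vectors.dense(0.0, 0.0), 0.0),
  (Vectors.dense(0.2, 0.1), 0.0),
  (Vectors.dense(9.0, 9.0), 1.0),
  (Vectors.dense(8.8, 9.2), 1.0)
).toDF("features", "labels")

val ibModel = new InformationBottleneck()
  .setK(2)
  .setBeta(1.0) // larger beta favors preserving relevance over compression
  .setRelevanceCol("labels")
  .fit(labeled)
```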
## Algorithm comparison

| Algorithm | Time complexity | Memory | Outlier handling | Constraints | Streaming |
|---|---|---|---|---|---|
| GeneralizedKMeans | O(n × k × d) | O(k × d) | — | — | — |
| XMeans | O(n × k × d × splits) | O(k × d) | — | — | — |
| SoftKMeans | O(n × k × d) | O(n × k) | — | — | — |
| BisectingKMeans | O(n × k × d) | O(k × d) | — | — | — |
| StreamingKMeans | O(batch × k × d) | O(k × d) | — | — | ✓ |
| MiniBatchKMeans | O(batch × k × d) | O(k × d) | — | — | — |
| KMedoids | O(n² × k) | O(n²) | ✓ | — | — |
| RobustKMeans | O(n × k × d) | O(k × d) | ✓ | — | — |
| BalancedKMeans | O(n × k × d) | O(k × d) | — | ✓ | — |
| ConstrainedKMeans | O(n × k × d) | O(k × d) | — | ✓ | — |
| SparseKMeans | O(nnz × k × d) | O(k × d) | — | — | — |
| TimeSeriesKMeans | O(n × k × L²) | O(k × L) | — | — | — |
| SpectralClustering | O(n²), or O(n × m²) with Nyström | O(n²) | — | — | — |
| InformationBottleneck | O(n × k × d) | O(k × d) | — | — | — |

Notation: n = points, k = clusters, d = dimensions, batch = mini-batch size, nnz = non-zero feature entries, L = time-series length, splits = split rounds, m = Nyström landmarks.