Scalable clustering with Bregman divergences on Apache Spark
Complete documentation of all parameters across all algorithms.
The core k-means estimator with Bregman divergences.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | Int | 2 | Number of clusters (must be > 1) |
| `divergence` | String | `"squaredEuclidean"` | Distance function: `squaredEuclidean`, `kl`, `itakuraSaito`, `l1`, `generalizedI`, `logistic`, `spherical`, `cosine` |
| `maxIter` | Int | 20 | Maximum iterations (>= 0) |
| `tol` | Double | 1e-4 | Convergence tolerance for center movement |
| `seed` | Long | random | Random seed for reproducibility |
| `featuresCol` | String | `"features"` | Input features column |
| `predictionCol` | String | `"prediction"` | Output prediction column |
| `distanceCol` | String | none | Output distance column (optional) |
| `weightCol` | String | none | Point weights column (optional) |
| `smoothing` | Double | 1e-10 | Smoothing for KL/IS divergences |
| `assignmentStrategy` | String | `"auto"` | Strategy: `auto`, `crossJoin`, `broadcastUDF` |
| `emptyClusterStrategy` | String | `"reseedRandom"` | Empty-cluster handling: `reseedRandom`, `drop` |
| `initMode` | String | `"k-means\|\|"` | Initialization: `random`, `k-means\|\|` |
| `initSteps` | Int | 2 | Number of k-means\|\| initialization steps |
| `checkpointInterval` | Int | 10 | Checkpoint interval (0 = disabled) |
| `checkpointDir` | String | none | Checkpoint directory |

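The `smoothing` parameter matters for the `kl` and `itakuraSaito` divergences because both involve logarithms and ratios that are undefined at zero. The following is a minimal sketch in plain Python, not the library's implementation: the function names are hypothetical, and adding `smoothing` to every coordinate of both arguments is an assumption about how the value is applied.

```python
import math

def squared_euclidean(x, y):
    """Squared Euclidean distance, the default divergence."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def generalized_kl(x, y, smoothing=1e-10):
    """Generalized KL (I-divergence): sum of x*log(x/y) - x + y.

    Reduces to the ordinary KL divergence when x and y have the same total.
    Assumption: smoothing is added to every coordinate so log() stays finite.
    """
    total = 0.0
    for a, b in zip(x, y):
        a, b = a + smoothing, b + smoothing
        total += a * math.log(a / b) - a + b
    return total

def itakura_saito(x, y, smoothing=1e-10):
    """Itakura-Saito divergence: sum of x/y - log(x/y) - 1."""
    total = 0.0
    for a, b in zip(x, y):
        a, b = a + smoothing, b + smoothing
        total += a / b - math.log(a / b) - 1.0
    return total

# A zero coordinate would break log() without smoothing.
p = [0.5, 0.5, 0.0]
q = [0.25, 0.25, 0.5]
print(generalized_kl(p, q))
```

Without the smoothing term, the zero in `p` would raise a math domain error; with it, the result stays finite and close to the unsmoothed limit.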
Automatic k selection using information criteria.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `minK` | Int | 2 | Minimum number of clusters to consider |
| `maxK` | Int | 10 | Maximum number of clusters to consider |
| `criterion` | String | `"bic"` | Selection criterion: `bic`, `aic` |

Plus all `GeneralizedKMeans` parameters.

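To illustrate how the `criterion` choice can change the selected k, here is a small sketch using the textbook definitions of AIC and BIC. The helper names and the candidate log-likelihoods are made up for illustration; how the library computes the likelihood and parameter count for each k is not specified here.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion (lower is better)."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, n_points):
    """Bayesian information criterion (lower is better)."""
    return num_params * math.log(n_points) - 2 * log_likelihood

def select_k(candidates, n_points, criterion="bic"):
    """Pick the k with the lowest score.

    candidates maps k -> (log_likelihood, num_params).
    """
    def score(item):
        _, (ll, p) = item
        return bic(ll, p, n_points) if criterion == "bic" else aic(ll, p)
    return min(candidates.items(), key=score)[0]

# Hypothetical fits for k = 2, 3, 4 on 1000 points.
fits = {2: (-2500.0, 6), 3: (-2300.0, 9), 4: (-2295.0, 12)}
print(select_k(fits, 1000, "bic"))  # 3
print(select_k(fits, 1000, "aic"))  # 4
```

BIC penalizes extra parameters more heavily than AIC once the dataset is larger than a handful of points, so it tends to choose fewer clusters.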
Probabilistic/fuzzy cluster assignments.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `beta` | Double | 1.0 | Inverse temperature (higher = more deterministic) |
| `probabilitiesCol` | String | `"probabilities"` | Output probabilities column |

Plus all `GeneralizedKMeans` parameters.

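The effect of `beta` can be seen in a small sketch of a Boltzmann-style soft assignment, where each membership probability is proportional to `exp(-beta * distance)`. This form is an assumption chosen because it matches the "higher = more deterministic" description; the function name is hypothetical and this is not the library's code.

```python
import math

def soft_assignments(distances, beta=1.0):
    """Turn distances to each center into membership probabilities."""
    # Subtracting the minimum distance keeps exp() from underflowing
    # and does not change the normalized result.
    d_min = min(distances)
    weights = [math.exp(-beta * (d - d_min)) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]

distances = [1.0, 2.0, 4.0]
print(soft_assignments(distances, beta=0.1))   # close to uniform
print(soft_assignments(distances, beta=10.0))  # nearly all mass on the closest center
```

As `beta` grows the assignment approaches hard k-means; at `beta = 0` every cluster receives equal probability.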
Hierarchical divisive clustering.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `minDivisibleClusterSize` | Int | 1 | Minimum cluster size eligible for splitting |

Plus all `GeneralizedKMeans` parameters.

Online clustering with decay.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `decayFactor` | Double | 1.0 | Exponential decay (0.0 to 1.0) |
| `halfLife` | Double | none | Alternative to `decayFactor` |
| `timeUnit` | String | `"batches"` | Decay unit: `batches`, `points` |

Plus all `GeneralizedKMeans` parameters.

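The relationship between `halfLife` and `decayFactor` can be sketched as follows. The conversion and the update rule mirror the convention used by Spark MLlib's `StreamingKMeans`; that this library uses the identical rule is an assumption, and the function names are hypothetical.

```python
def decay_from_half_life(half_life):
    """Decay factor such that a weight halves after `half_life` time units."""
    return 0.5 ** (1.0 / half_life)

def update_center(center, weight, batch_mean, batch_count, decay_factor=1.0):
    """Blend an existing center with the mean of newly assigned points.

    The old weight is discounted by decay_factor before blending, so
    decay_factor=1.0 keeps all history and decay_factor=0.0 forgets it.
    """
    discounted = weight * decay_factor
    new_weight = discounted + batch_count
    new_center = [
        (c * discounted + m * batch_count) / new_weight
        for c, m in zip(center, batch_mean)
    ]
    return new_center, new_weight

print(decay_from_half_life(1.0))                       # 0.5
print(update_center([0.0], 10, [1.0], 10, 1.0))        # ([0.5], 20.0)
print(update_center([0.0], 10, [1.0], 10, 0.0))        # ([1.0], 10.0)
```

With the default `decayFactor` of 1.0 every past point counts equally, which makes the update equivalent to a running mean.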
Clustering with actual data points as centers.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `k` | Int | 2 | Number of clusters |
| `distanceFunction` | String | `"euclidean"` | Distance: `euclidean`, `manhattan`, `cosine` |
| `maxIter` | Int | 20 | Maximum iterations |
| `seed` | Long | random | Random seed |
| `featuresCol` | String | `"features"` | Features column |
| `predictionCol` | String | `"prediction"` | Prediction column |

Equal-sized cluster constraints.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `balanceMode` | String | `"soft"` | Mode: `soft`, `hard` |
| `maxClusterSize` | Int | auto | Maximum cluster size |
| `balancePenalty` | Double | 1.0 | Penalty weight in soft mode |

Plus all `GeneralizedKMeans` parameters.

Semi-supervised with constraints.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `mustLinkCol` | String | none | Must-link pairs column |
| `cannotLinkCol` | String | none | Cannot-link pairs column |
| `constraintMode` | String | `"soft"` | Mode: `soft`, `hard` |
| `violationPenalty` | Double | 1.0 | Penalty for constraint violations in soft mode |

Plus all `GeneralizedKMeans` parameters.

Outlier-resistant clustering.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `robustMode` | String | `"trim"` | Mode: `trim`, `noise_cluster`, `m_estimator` |
| `trimFraction` | Double | 0.1 | Fraction of points to trim (`trim` mode) |
| `noiseThreshold` | Double | 2.0 | Distance threshold (`noise_cluster` mode) |
| `mEstimatorType` | String | `"huber"` | Type: `huber`, `tukey`, `cauchy` |
| `outlierScoreCol` | String | none | Output outlier scores column |

Plus all `GeneralizedKMeans` parameters.

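As a rough picture of what `trimFraction` controls in `trim` mode, the sketch below discards the points farthest from their assigned centers before centers would be recomputed. The function name and the rounding rule (truncating to an integer count) are assumptions for illustration, not the library's behavior.

```python
def trim_outliers(points, distances, trim_fraction=0.1):
    """Keep all but the farthest `trim_fraction` of points.

    distances[i] is the distance from points[i] to its assigned center.
    Original ordering of the surviving points is preserved.
    """
    n_trim = int(len(points) * trim_fraction)
    # Indices sorted from closest to farthest.
    order = sorted(range(len(points)), key=lambda i: distances[i])
    keep = sorted(order[: len(points) - n_trim])
    return [points[i] for i in keep]

points = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"]
distances = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trim_outliers(points, distances, 0.1))  # drops "j"
```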
High-dimensional sparse data.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `sparseMode` | String | `"auto"` | Mode: `auto`, `force`, `dense` |
| `sparseThreshold` | Double | 0.5 | Sparsity threshold used in `auto` mode |

Plus all `GeneralizedKMeans` parameters.

Sequence clustering with DTW.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `distanceType` | String | `"dtw"` | Distance: `dtw`, `softdtw`, `gak`, `derivative` |
| `bandWidth` | Double | 0.1 | Sakoe-Chiba band width |
| `gamma` | Double | 1.0 | Soft-DTW smoothing |

Plus all `GeneralizedKMeans` parameters.

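The role of `bandWidth` can be illustrated with a minimal dynamic time warping routine constrained by a Sakoe-Chiba band. Interpreting `bandWidth` as a fraction of the longer sequence length is an assumption (the default of 0.1 suggests a fraction, but the reference does not say), and the function name is hypothetical.

```python
import math

def dtw_distance(a, b, band_width=0.1):
    """DTW between two numeric sequences with a Sakoe-Chiba band."""
    n, m = len(a), len(b)
    # Assumption: band_width is a fraction of the longer sequence.
    # The window is widened to |n - m| so the end cell stays reachable.
    window = max(int(math.ceil(band_width * max(n, m))), abs(n - m))
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells within `window` of the diagonal are filled in.
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

print(dtw_distance([0, 1, 2], [0, 1, 1, 2]))  # 0.0: the repeated 1 is absorbed by warping
print(dtw_distance([0, 0, 0], [1, 1, 1]))     # 3.0
```

A narrower band cuts the number of cells evaluated and prevents extreme warping, at the cost of possibly missing the unconstrained optimum.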
Graph-based clustering.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `affinityType` | String | `"rbf"` | Affinity: `rbf`, `knn`, `epsilon` |
| `laplacianType` | String | `"normalized"` | Laplacian: `unnormalized`, `normalized`, `randomWalk` |
| `sigma` | Double | 1.0 | RBF kernel width |
| `numNeighbors` | Int | 10 | Number of neighbors for k-NN affinity |
| `epsilon` | Double | 1.0 | Epsilon-neighborhood radius |
| `useNystrom` | Boolean | false | Use the Nyström approximation |
| `nystromSamples` | Int | 100 | Nyström sample size |

Plus `k`, `seed`, `featuresCol`, `predictionCol`.

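To show what `sigma` does for the `rbf` affinity, here is a small sketch of a Gaussian kernel affinity matrix. The `exp(-d^2 / (2 * sigma^2))` scaling is the common textbook form and is an assumption here; the library may use a different normalization. The function names are hypothetical.

```python
import math

def rbf_affinity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def affinity_matrix(points, sigma=1.0):
    """Dense pairwise affinity matrix; O(n^2), so only for small inputs."""
    return [[rbf_affinity(p, q, sigma) for q in points] for p in points]

points = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
for row in affinity_matrix(points, sigma=1.0):
    print([round(v, 4) for v in row])
```

A larger `sigma` makes distant points look more similar and blurs cluster boundaries; a smaller one makes the graph sparser in effect. The quadratic cost of the dense matrix is what the Nyström options aim to reduce.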
Information-theoretic clustering.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `beta` | Double | 1.0 | Compression-relevance trade-off |
| `relevanceCol` | String | none | Relevance variable column |
| `convergenceTol` | Double | 1e-6 | Blahut-Arimoto convergence tolerance |

Plus all `GeneralizedKMeans` parameters.

Stochastic mini-batch updates.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `batchSize` | Int | 100 | Points per batch |
| `reassignmentRatio` | Double | 0.01 | Reassignment threshold |

Plus all `GeneralizedKMeans` parameters.

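For intuition about what one batch of `batchSize` points does, the sketch below follows the well-known mini-batch k-means scheme in which each center moves toward its assigned points with a per-center learning rate of `1 / count`. It uses squared Euclidean distance, does not model `reassignmentRatio`, and is an illustration under those assumptions rather than the library's code.

```python
def nearest(x, centers):
    """Index of the center closest to x (squared Euclidean)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    return dists.index(min(dists))

def minibatch_update(centers, counts, batch):
    """Apply one mini-batch step in place and return (centers, counts)."""
    # Assign every point using the centers as they were before the update.
    assignments = [nearest(x, centers) for x in batch]
    for x, j in zip(batch, assignments):
        counts[j] += 1
        eta = 1.0 / counts[j]  # learning rate shrinks as the center sees more points
        centers[j] = [(1.0 - eta) * c + eta * v for c, v in zip(centers[j], x)]
    return centers, counts

centers = [[0.0], [10.0]]
counts = [0, 0]
print(minibatch_update(centers, counts, [[1.0], [3.0], [9.0]]))
```

Because the learning rate is `1 / count`, each center ends up as the running mean of all points it has been assigned so far.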