Algorithms Implemented

Most practical variants of K-means clustering are implemented or can be implemented with this package.

Clustering with Bregman Divergences - observes that Lloyd's algorithm converges for distance functions defined by Bregman divergences
Fast k-means algorithm clustering - uses a 2-step iterative algorithm to cluster a subset of the data and then the full set
A Random Indexing Approach for Web User Clustering and Web Prefetching - uses random indexing to lower the dimension of high-dimensional data
A Wavelet-Based Anytime Algorithm for K-Means Clustering of Time Series - uses the Haar Transform to embed time series data before clustering
Metrics Defined By Bregman Divergences - shows that metrics can make use of the triangle inequality to speed up clustering
On the performance of bisecting K-means and PDDP - a recursive subdivision algorithm
Scalable K-Means++ - finds a provably good initial set of cluster centers
Streaming k-means approximation - a mini-batch algorithm suitable for online data sets
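
To make the first entry concrete, here is a minimal sketch of Lloyd's algorithm with a pluggable Bregman divergence. It is illustrative only and is not this package's API: the names bregman_kmeans, squared_euclidean, generalized_kl and the init_centers parameter are hypothetical. The sketch relies on the property, shown in the Bregman clustering paper, that the arithmetic mean of a cluster minimizes the total divergence from its members, so the center update is the same for every Bregman divergence and only the assignment step changes.

```python
import math
import random


def squared_euclidean(x, c):
    """Bregman divergence generated by the squared Euclidean norm."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def generalized_kl(x, c):
    """Generalized KL (I-)divergence; assumes strictly positive coordinates."""
    return sum(xi * math.log(xi / ci) - xi + ci for xi, ci in zip(x, c))


def mean(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def bregman_kmeans(points, k, divergence, init_centers=None, max_iter=100, seed=0):
    """Lloyd's algorithm where `divergence(point, center)` is a Bregman divergence.

    Returns (centers, assignment). If init_centers is None, k distinct
    points are sampled at random as the starting centers.
    """
    rng = random.Random(seed)
    centers = list(init_centers) if init_centers is not None else rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the center with the smallest divergence.
        new_assignment = [
            min(range(k), key=lambda j: divergence(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: the arithmetic mean is optimal for every Bregman divergence.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    return centers, assignment


if __name__ == "__main__":
    data = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (10.0, 10.0), (10.5, 9.5), (9.5, 10.2)]
    for div in (squared_euclidean, generalized_kl):
        print(div.__name__, bregman_kmeans(data, 2, div))
```

Passing a different divergence function is the only change needed to switch the geometry. A production implementation would also need a seeding strategy and a policy for empty clusters, both of which this sketch leaves out.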

If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant! Here are some newer algorithms that are worth investigating:

Fast and Provably Good Seedings for k-Means - even better seeding
An efficient approximation to the K-means clustering for Massive Data - a recursive subdivision algorithm
Scalable K-Means by Ranked Retrieval - a novel inversion of the k-means algorithm with dramatic speedups on large data sets


Last updated 1 year ago
