Scalable clustering with Bregman divergences on Apache Spark
How to add generalized-kmeans-clustering to your project.

## sbt

Add to `build.sbt`:

```scala
libraryDependencies += "com.massivedatascience" %% "clusterer" % "0.7.0"
```
For specific Scala/Spark versions:

```scala
// Scala 2.13 + Spark 3.5
libraryDependencies += "com.massivedatascience" % "clusterer_2.13" % "0.7.0"

// Scala 2.12 + Spark 3.4
libraryDependencies += "com.massivedatascience" % "clusterer_2.12" % "0.7.0"
```
## Maven

```xml
<dependency>
  <groupId>com.massivedatascience</groupId>
  <artifactId>clusterer_2.13</artifactId>
  <version>0.7.0</version>
</dependency>
```
## spark-submit

```bash
spark-submit \
  --packages com.massivedatascience:clusterer_2.13:0.7.0 \
  --class com.example.MyApp \
  my-app.jar
```
## spark-shell / pyspark

```bash
# Scala
spark-shell --packages com.massivedatascience:clusterer_2.13:0.7.0

# Python
pyspark --packages com.massivedatascience:clusterer_2.13:0.7.0
```
## Databricks

Attach the library to your cluster using the Maven coordinate `com.massivedatascience:clusterer_2.13:0.7.0`, or install the Python package from a notebook:

```
%pip install massivedatascience-clusterer
```
Alternatively, create a cluster init script at `/dbfs/init-scripts/install-clusterer.sh`:

```bash
#!/bin/bash
pip install massivedatascience-clusterer
```
## Amazon EMR

Install the Python package with a bootstrap action:

```bash
#!/bin/bash
sudo pip3 install massivedatascience-clusterer
```

To pull in the JVM package, add a Spark step such as:

```json
{
  "Name": "Spark Submit with Clusterer",
  "ActionOnFailure": "CONTINUE",
  "HadoopJarStep": {
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--packages", "com.massivedatascience:clusterer_2.13:0.7.0",
      "s3://my-bucket/my-app.jar"
    ]
  }
}
```
## Kubernetes (Spark Operator)

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  deps:
    packages:
      - com.massivedatascience:clusterer_2.13:0.7.0
```
## Building from source

```bash
git clone https://github.com/derrickburns/generalized-kmeans-clustering.git
cd generalized-kmeans-clustering

# Build for Scala 2.13
sbt ++2.13.14 publishLocal

# Build for Scala 2.12
sbt ++2.12.18 publishLocal
```
## Compatibility

| Scala version | Spark 3.4 | Spark 3.5 | Spark 4.0 |
|---|---|---|---|
| 2.12 | ✓ | ✓ | — |
| 2.13 | ✓ | ✓ | ✓ |
## Verify the installation

```scala
import com.massivedatascience.clusterer.ml.GeneralizedKMeans

val kmeans = new GeneralizedKMeans()
println(s"Library loaded: ${kmeans.getClass.getName}")
```
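Once the import resolves, a small end-to-end run confirms that the library works against your Spark installation. The sketch below assumes the estimator follows the standard Spark ML `Estimator`/`Model` pattern; the parameter setters (`setK`, `setMaxIter`) are assumptions based on Spark ML conventions, not a verbatim copy of this library's API, so check the project docs for the exact names.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import com.massivedatascience.clusterer.ml.GeneralizedKMeans

object ClustererSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("clusterer-smoke-test")
      .master("local[2]")          // local mode is enough for a smoke test
      .getOrCreate()

    // Two well-separated groups of 2-D points.
    val data = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(0.0, 0.0)),
      Tuple1(Vectors.dense(0.1, 0.1)),
      Tuple1(Vectors.dense(9.0, 9.0)),
      Tuple1(Vectors.dense(9.1, 9.1))
    )).toDF("features")

    // Hypothetical parameter names, following Spark ML conventions.
    val model = new GeneralizedKMeans()
      .setK(2)
      .setMaxIter(20)
      .fit(data)

    // The fitted model should assign the two groups to different clusters.
    model.transform(data).show()
    spark.stop()
  }
}
```

Run it with `spark-submit --packages com.massivedatascience:clusterer_2.13:0.7.0 ...` as shown above; if the job completes and prints a prediction column, the installation is working.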