Installation Guide

How to add generalized-kmeans-clustering to your project.

SBT

Add to build.sbt:

libraryDependencies += "com.massivedatascience" %% "clusterer" % "0.7.0"

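The %% operator appends your project's Scala binary version to the artifact name. To pin the Scala version explicitly, use a single % with the full artifact name. The suffix selects the Scala version only; the Spark versions in the comments are example pairings (see Version Compatibility):

// Scala 2.13 (for example, with Spark 3.5)
libraryDependencies += "com.massivedatascience" % "clusterer_2.13" % "0.7.0"

// Scala 2.12 (for example, with Spark 3.4)
libraryDependencies += "com.massivedatascience" % "clusterer_2.12" % "0.7.0"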

Maven

Add to pom.xml:

<dependency>
    <groupId>com.massivedatascience</groupId>
    <artifactId>clusterer_2.13</artifactId>
    <version>0.7.0</version>
</dependency>

spark-submit

spark-submit \
  --packages com.massivedatascience:clusterer_2.13:0.7.0 \
  --class com.example.MyApp \
  my-app.jar

spark-shell / pyspark

# Scala
spark-shell --packages com.massivedatascience:clusterer_2.13:0.7.0

# Python
pyspark --packages com.massivedatascience:clusterer_2.13:0.7.0
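
The spark-submit, spark-shell, and pyspark commands above all take the same Maven coordinate. As a minimal sketch, the coordinate can be composed once in a shell variable and reused. The names SCALA_BINARY, CLUSTERER_VERSION, and clusterer_coordinate are illustrative assumptions, not part of the library or of Spark.

```shell
#!/bin/bash
# Illustrative variables (assumed names); override them in the environment if needed.
SCALA_BINARY="${SCALA_BINARY:-2.13}"
CLUSTERER_VERSION="${CLUSTERER_VERSION:-0.7.0}"

# Build groupId:artifactId_<scala binary>:version for use with --packages.
clusterer_coordinate() {
  printf 'com.massivedatascience:clusterer_%s:%s\n' "$1" "$2"
}

COORD="$(clusterer_coordinate "$SCALA_BINARY" "$CLUSTERER_VERSION")"
echo "$COORD"

# Example usage (not executed here):
#   spark-shell --packages "$COORD"
#   pyspark --packages "$COORD"
```

Keeping the Scala suffix in one variable makes it harder to mix a 2.12 artifact with a Spark build compiled for Scala 2.13.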

Databricks

Option 1: Cluster Library

1. Go to Compute → Select your cluster → Libraries
2. Click Install New → Maven
3. Enter coordinates: com.massivedatascience:clusterer_2.13:0.7.0
4. Click Install

Option 2: Notebook

%pip install massivedatascience-clusterer

Option 3: Init Script

Create /dbfs/init-scripts/install-clusterer.sh:

#!/bin/bash
pip install massivedatascience-clusterer

EMR

Bootstrap Action

#!/bin/bash
sudo pip3 install massivedatascience-clusterer

Step Configuration

{
  "Name": "Spark Submit with Clusterer",
  "ActionOnFailure": "CONTINUE",
  "HadoopJarStep": {
    "Jar": "command-runner.jar",
    "Args": [
      "spark-submit",
      "--packages", "com.massivedatascience:clusterer_2.13:0.7.0",
      "s3://my-bucket/my-app.jar"
    ]
  }
}

Kubernetes / Spark Operator

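# Excerpt: merge into your full SparkApplication manifest
# (metadata, image, driver, and executor settings are omitted here).
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  deps:
    packages:
      - com.massivedatascience:clusterer_2.13:0.7.0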

Build from Source

git clone https://github.com/derrickburns/generalized-kmeans-clustering.git
cd generalized-kmeans-clustering

# Build for Scala 2.13
sbt ++2.13.14 publishLocal

# Build for Scala 2.12
sbt ++2.12.18 publishLocal
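
To publish both Scala builds in one pass, the two commands above can be generated from a small mapping. This is a sketch only: scala_full_version and publish_commands are hypothetical helper names, and the script prints the sbt commands rather than running them, so it can be checked without sbt installed.

```shell
#!/bin/bash
# Map a Scala binary version to the full version used in the build commands above.
scala_full_version() {
  case "$1" in
    2.13) echo "2.13.14" ;;
    2.12) echo "2.12.18" ;;
    *) echo "unsupported Scala binary version: $1" >&2; return 1 ;;
  esac
}

# Print one sbt command per requested binary version (dry run).
publish_commands() {
  local binary full
  for binary in "$@"; do
    full="$(scala_full_version "$binary")" || return 1
    echo "sbt ++${full} publishLocal"
  done
}

publish_commands 2.13 2.12
```

Run from the repository root, piping the output to sh (for example, publish_commands 2.13 2.12 | sh) would execute the builds in order.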

Version Compatibility

Scala version	Spark 3.4	Spark 3.5	Spark 4.0
Scala 2.12	✓	✓	—
Scala 2.13	✓	✓	✓
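
The table above can be encoded as a small pre-flight check, for instance in a CI or deployment script. This is a minimal sketch; is_supported is a hypothetical helper name, and it reflects only the combinations listed in the table.

```shell
#!/bin/bash
# Return 0 if the Scala/Spark pair is marked as supported in the table, 1 otherwise.
is_supported() {
  local scala="$1" spark="$2"
  case "${scala}:${spark}" in
    2.12:3.4|2.12:3.5) return 0 ;;
    2.13:3.4|2.13:3.5|2.13:4.0) return 0 ;;
    *) return 1 ;;
  esac
}

# Scala 2.12 with Spark 4.0 is not listed as supported.
if is_supported 2.12 4.0; then
  echo "supported"
else
  echo "not supported"
fi
```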

Verify Installation

import com.massivedatascience.clusterer.ml.GeneralizedKMeans

val kmeans = new GeneralizedKMeans()
println(s"Library loaded: ${kmeans.getClass.getName}")
