RAPIDS Accelerator for Apache Spark ML

The RAPIDS Accelerator for Apache Spark ML provides a set of GPU-accelerated Spark ML algorithms.

API changes

We describe the main API changes for the GPU-accelerated algorithms:

1. PCA

Compared to the original PCA training API:

val pca = new org.apache.spark.ml.feature.PCA()
  .setInputCol("feature")
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(vectorDf)

We use a customized class and add some extra API switches:

val pca = new com.nvidia.spark.ml.feature.PCA()
...
  .useGemm(true) // or false: switch between cuBLAS gemm and the original BLAS bsr routine for computing the covariance matrix
  .useCuSolverSVD(true) // or false: switch to cuSolver for computing the SVD
  .meanCentering(true) // or false: switch mean centering on or off before computing the covariance matrix
...
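
These switches select implementations for the two main stages of covariance-based PCA: forming the covariance matrix of the (optionally mean-centered) input, then decomposing it with an SVD. As a minimal sketch of how they might be combined with the standard setters in one training call (assuming the customized class keeps the standard PCA parameters, as the elided lines above suggest):

val pcaModel = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("feature")           // standard Spark ML setters, as in the baseline example
  .setOutputCol("feature_value_3d")
  .setK(3)
  .useGemm(true)                    // cuBLAS gemm for the covariance matrix
  .useCuSolverSVD(true)             // cuSolver for the SVD
  .meanCentering(true)              // mean-center before the covariance step
  .fit(vectorDf)                    // vectorDf: a DataFrame with a vector column "feature"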

Build

Prerequisites:

  1. essential build tools: cmake and ninja (the Maven build invokes both)
  2. CUDA Toolkit (>= 11.0)
  3. conda: use miniconda to maintain header files and cmake dependencies
  4. RMM:
    • all of its header files and some extra cmake dependencies are needed; build instructions:
    $ git clone --recurse-submodules https://github.com/rapidsai/rmm.git
    $ cd rmm
    $ mkdir build                                       # make a build directory
    $ cd build                                          # enter the build directory
    $ cmake .. -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX     # configure cmake ... use $CONDA_PREFIX if you're using Anaconda
    $ make -j                                           # compile the library librmm.so ... '-j' will start a parallel job using the number of physical cores available on your system
    $ make install                                      # install the library librmm.so to the prefix configured above ($CONDA_PREFIX)
  5. RAFT:
    • RAFT provides only header files, so there is no build step; just clone the repository:
    $ git clone https://github.com/rapidsai/raft.git
  6. export RMM_PATH and RAFT_PATH:
    export RAFT_PATH=PATH_TO_YOUR_RAFT_FOLDER
    export RMM_PATH=PATH_TO_YOUR_RMM_FOLDER

Build target jar

You can build the jar directly in the project root with:

mvn clean package

Then rapids-4-spark-ml_2.12-21.10.0-SNAPSHOT.jar will be generated under the target folder.

Note: This module contains both native and Java/Scala code. The native library build instructions have been added to the pom.xml file, so the maven build command also builds the native library. Make sure all the prerequisites are met, or the build will fail with an error message such as "cmake not found" or "ninja not found".

How to use

Add the artifact jar to Spark, for example:

$SPARK_HOME/bin/spark-shell --master $SPARK_MASTER \
 --driver-memory 20G \
 --executor-memory 30G \
 --conf spark.driver.maxResultSize=8G \
 --jars target/rapids-4-spark-ml_2.12-21.10.0-SNAPSHOT.jar \
 --conf spark.task.resource.gpu.amount=0.08 \
 --conf spark.executor.resource.gpu.amount=1 \
 --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
 --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh
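
Once the shell is up with the jar on the classpath, the accelerated PCA can be used like the stock Spark ML version. A minimal sketch (the toy data and column names are illustrative only; spark-shell imports spark.implicits._ automatically, which toDF relies on):

import org.apache.spark.ml.linalg.Vectors

// illustrative toy data; any DataFrame with a vector column works
val vectorDf = Seq(
  Vectors.dense(1.0, 2.0, 3.0, 4.0, 5.0),
  Vectors.dense(2.0, 3.0, 4.0, 5.0, 6.0),
  Vectors.dense(3.0, 5.0, 4.0, 7.0, 6.0)
).map(Tuple1.apply).toDF("feature")

val model = new com.nvidia.spark.ml.feature.PCA()
  .setInputCol("feature")
  .setOutputCol("feature_value_3d")
  .setK(3)
  .fit(vectorDf)

// each input vector is projected onto the top 3 principal components
model.transform(vectorDf).show(false)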
