Commit 20e24b6: update

xwu99 committed May 31, 2023 (parent: 5826a9f)

Showing 1 changed file (README.md) with 45 additions and 39 deletions.

[Apache Spark MLlib](https://spark.apache.org/mllib) is a scalable machine learning library built on the unified Spark platform. It integrates seamlessly with Spark SQL, Spark Streaming, and other machine learning and deep learning frameworks, without additional glue code for the entire pipeline.

However, the JVM-based MLlib makes only limited use of BLAS acceleration, and Spark shuffle is slow for communication during distributed training. Spark's original design is also CPU-centric and cannot leverage GPU acceleration. As a result, it does not fully utilize modern CPU and GPU capabilities to achieve the best performance.

## OAP MLlib Solution

You can find all the OAP MLlib documents on the [project web page].

# Getting Started

## Python/PySpark Users Preferred

Use a pre-built JAR to get started. If you have finished the [OAP Installation Guide](./docs/OAP-Installation-Guide.md), you can find the compiled OAP MLlib JAR `oap-mllib-x.x.x.jar` in `$HOME/miniconda2/envs/oapenv/oap_jars/`.

Then refer to the [Running](#running) section below to try it out.
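
Once the JAR is configured as described in [Running](#running), existing PySpark ML code needs no changes. The snippet below is a minimal sketch using the stock `pyspark.ml` KMeans API with a placeholder input path; it is not taken from the repository's examples.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans

# Plain Spark MLlib code: OAP MLlib is picked up from the configured JAR,
# so no API changes are needed.
spark = SparkSession.builder.appName("oap-mllib-kmeans-sketch").getOrCreate()

# Placeholder dataset in libsvm format; replace with your own data.
data = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

model = KMeans(k=2, seed=1).fit(data)
for center in model.clusterCenters():
    print(center)

spark.stop()
```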

## Java/Scala Users Preferred

Use a pre-built OAP MLlib JAR to get started. You can download the OAP MLlib JAR from the [Release Page](https://github.com/oap-project/oap-mllib/releases/download/v1.5.0/oap-mllib-1.5.0.jar).

Then refer to the [Running](#running) section below to try it out.

You can also build the package from source code; please refer to the [Building Code] section below.

## Running

### Prerequisites

* CentOS 7.0+ or Ubuntu 18.04 LTS+
* Java JRE 8.0+ Runtime
* Generally, our common system requirements are the same as those of the Intel® oneAPI Toolkit; please refer to [Intel® oneAPI Base Toolkit System Requirements](https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-base-toolkit-system-requirements.html) for details.

* Please follow [this guide](https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-1/install-using-package-managers.html) to install the Intel® oneAPI Runtime Library Packages using package managers.
The following runtime packages should be installed on __all cluster nodes__:
```
intel-oneapi-ccl-<version>
intel-oneapi-mpi-<version>
intel-oneapi-compiler-dpcpp-cpp-runtime-<version>
intel-oneapi-dpcpp-cpp-<version>
```
* (Optional) If you plan to use an Intel GPU, [install the Intel GPU drivers](https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-1/install-intel-gpu-drivers.html). Otherwise only the CPU path is supported.

### Supported Spark Versions

OAP MLlib's latest version supports the following Spark versions (a quick way to check your cluster's version is sketched after this list):

* Apache Spark 3.1.1
* Apache Spark 3.1.2
* Apache Spark 3.1.3
* Apache Spark 3.2.0
* Apache Spark 3.2.1
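
To confirm that your cluster runs one of the supported versions, a trivial PySpark check such as the following can be used (a minimal sketch, not part of the project itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Should print one of the supported versions, e.g. 3.2.1
print(spark.version)
spark.stop()
```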

### Supported Intel® oneAPI Toolkits

* Intel® oneAPI 2023.1

Intel® oneAPI Toolkit components used by the project are already included in the OAP MLlib JAR package mentioned above; only the runtime packages listed under [Prerequisites](#prerequisites) still need to be installed on cluster nodes.

### Spark Configuration

#### General Configuration

##### Standalone Cluster Manager
For the standalone cluster manager, you need to upload the JAR to every node (or place it in a shared network folder) and then specify absolute paths for `extraClassPath`.

```
# absolute path of the jar for uploading
spark.files /path/to/oap-mllib-x.x.x.jar
# absolute path of the jar for driver class path
spark.driver.extraClassPath /path/to/oap-mllib-x.x.x.jar
# absolute path of the jar for executor class path
spark.executor.extraClassPath /path/to/oap-mllib-x.x.x.jar
```

##### YARN Cluster Manager
For users running Spark applications on __YARN__ in __client__ mode, you only need to add the following configurations to `spark-defaults.conf` or to the `spark-submit` command line before running.

```
# absolute path of the jar for uploading
spark.files /path/to/oap-mllib-x.x.x.jar
# absolute path of the jar for driver class path
spark.driver.extraClassPath /path/to/oap-mllib-x.x.x.jar
# relative path of the jar for executor class path
spark.executor.extraClassPath ./oap-mllib-x.x.x.jar
```
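
If you prefer to set the same properties programmatically for a PySpark job, the sketch below mirrors the `spark-defaults.conf` entries above. This is only a sketch: the JAR paths are placeholders, and `spark.driver.extraClassPath` only takes effect when it is set before the driver JVM starts (for example, when a plain `python` script creates the SparkSession itself); with `pyspark` or `spark-submit`, pass the same settings via `--conf` instead.

```python
from pyspark.sql import SparkSession

# Placeholder path; point this at the real OAP MLlib JAR on your system.
OAP_MLLIB_JAR = "/path/to/oap-mllib-x.x.x.jar"

# Mirrors the spark-defaults.conf entries shown above (YARN client mode).
spark = (
    SparkSession.builder
    .appName("oap-mllib-config-sketch")
    .config("spark.files", OAP_MLLIB_JAR)
    .config("spark.driver.extraClassPath", OAP_MLLIB_JAR)
    .config("spark.executor.extraClassPath", "./oap-mllib-x.x.x.jar")
    .getOrCreate()
)
```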

#### OAP MLlib Specific Configuration
## Building Code

We use [Apache Maven](https://maven.apache.org/) to manage and build the source code. The build requires:
* JDK 8.0+
* Apache Maven 3.6.2+
* GNU GCC 7+
* Intel® oneAPI Base Toolkit Components:
  - DPC++/C++ Compiler (icpx)
  - Data Analytics Library (oneDAL)
  - Threading Building Blocks (oneTBB)
  - MPI Library (MPI)
Generally you only need to install the __Intel® oneAPI Base Toolkit for Linux__ with the components listed above.

Scala and Java dependency descriptions are already included in the Maven POM file.

***Note:*** You can refer to [this script](dev/install-build-deps-ubuntu.sh) (or the CentOS variant, [install-build-deps-centos.sh](dev/install-build-deps-centos.sh)) to install the required dependencies.

### Build

Environment | Description
------------|------------
JAVA_HOME | Path to JDK home directory
DAALROOT | Path to oneDAL home directory
TBB_ROOT | Path to oneTBB home directory
I_MPI_ROOT | Path to Intel MPI home directory
CCL_ROOT | Path to oneCCL home directory

We suggest you source the `setvars.sh` script in your current shell to set up the build environment.
The built JAR package will be placed in the `target` directory with the name `oap-mllib-x.x.x.jar`.

## Examples

### Python Examples

Example | Description
-----------------------|---------------------------
kmeans-pyspark | K-means example for PySpark
pca-pyspark | PCA example for PySpark
als-pyspark | ALS example for PySpark
random-forest-pyspark | Random Forest example for PySpark
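
For instance, the `pca-pyspark` example exercises the standard `pyspark.ml.feature.PCA` estimator. The following is a minimal sketch of that pattern, not the repository's exact example code:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("pca-pyspark-sketch").getOrCreate()

# Tiny in-memory dataset; the real example reads its own input data.
df = spark.createDataFrame(
    [(Vectors.dense([2.0, 0.0, 3.0, 4.0, 5.0]),),
     (Vectors.dense([4.0, 0.0, 0.0, 6.0, 7.0]),),
     (Vectors.dense([6.0, 7.0, 0.0, 8.0, 0.0]),)],
    ["features"],
)

pca = PCA(k=3, inputCol="features", outputCol="pca_features")
pca.fit(df).transform(df).select("pca_features").show(truncate=False)

spark.stop()
```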

### Scala Examples

Example | Description
----------------------|---------------------------
linear-regression | Linear Regression example for Scala
correlation | Correlation example for Scala
summarizer | Summarizer example for Scala


## List of Accelerated Algorithms

Algorithm | CPU | GPU |
------------------|-----|-----|
K-Means | X | X |
PCA | X | X |
ALS | X | |
Naive Bayes | X | |
Linear Regression | X | X |
Ridge Regression | X | |
Random Forest | | X |
Correlation | X | X |
Summarizer | X | X |
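
The accelerated algorithms are reached through the standard Spark ML APIs. As an example, Correlation and Summarizer (both accelerated on CPU and GPU per the table above) are invoked just as in stock MLlib; a minimal sketch:

```python
from pyspark.sql import SparkSession
from pyspark.ml.stat import Correlation, Summarizer
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("oap-mllib-stat-sketch").getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([1.0, 0.5, -1.0]),),
     (Vectors.dense([2.0, 1.0, 1.0]),),
     (Vectors.dense([4.0, 10.0, 2.0]),)],
    ["features"],
)

# Pearson correlation matrix over the feature vectors.
print(Correlation.corr(df, "features").head()[0])

# Column-wise mean and variance via Summarizer.
df.select(Summarizer.metrics("mean", "variance").summary(df.features)).show(truncate=False)

spark.stop()
```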
