Skip to content

Commit

Permalink
[ML-133][Correlation] Add Correlation algorithm (#127)
Browse files Browse the repository at this point in the history
* 1. enable correlation on OAP MLlib
2. add Correlation example
3. enable spark 3.1.1 and spark 3.0.0

Signed-off-by: minmingzhu <minmingzhu@intel.com>

* 1. resolve comments

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. fix oap matrix result error.

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. add jni file

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. add test code

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. vanilla rdd add cache

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* remove rdd cache

* 1. change algorithmFPType float to double
2. modify test
3. enable spark 3.0.1 and spark 3.0.2
4. remove print

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1.remove mean vector

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. enable Correlation GPU

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. add GPU test script.

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. update correlation
2. add GPU test

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. resolve comments

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. used ccl::gatherv in OneCCL.h instead of ccl::allgatherv

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1.used ml.Matrix for Correlation, reduce a step to mllib.Matrix convert to ml.Matrix.

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. reformat code style
2. make-up code

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* check whether data was cached, if no cached, set data to be cached

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* update Correlation that check whether data was cached, if no cached, set data to be cached on spark 3.0.1, 3.0.2 and 3.1.1

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* 1. add Correlation to CI

Signed-off-by: minmingzhu <minming.zhu@intel.com>

* Update build-all.sh

build correlation

Co-authored-by: minmingzhu <minmingzhu@intel.com>
  • Loading branch information
minmingzhu and minmingzhu authored Oct 21, 2021
1 parent c7fbb98 commit 9871ab2
Show file tree
Hide file tree
Showing 23 changed files with 1,133 additions and 9 deletions.
2 changes: 1 addition & 1 deletion examples/build-all.sh
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env bash

exampleDirs=(kmeans pca als naive-bayes linear-regression)
exampleDirs=(kmeans pca als naive-bayes linear-regression correlation)

for dir in ${exampleDirs[*]}
do
Expand Down
1 change: 1 addition & 0 deletions examples/correlation/IntelGpuResourceFile.json
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
[{"id":{"componentName": "spark.worker","resourceName":"gpu"},"addresses":["0","1","2","3"]}]
3 changes: 3 additions & 0 deletions examples/correlation/build.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
#!/usr/bin/env bash

mvn clean package
94 changes: 94 additions & 0 deletions examples/correlation/pom.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,94 @@
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>

<groupId>com.intel.oap</groupId>
<artifactId>oap-mllib-examples</artifactId>
<version>1.2.0</version>
<packaging>jar</packaging>

<name>CorrelationExample</name>
<url>https://github.com/oap-project/oap-mllib.git</url>

<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<oap.version>1.2.0</oap.version>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.1.1</spark.version>
</properties>

<dependencies>

<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.12.10</version>
</dependency>

<dependency>
<groupId>com.github.scopt</groupId>
<artifactId>scopt_2.12</artifactId>
<version>3.7.0</version>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>

<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_2.12</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>

</dependencies>

<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
<args>
<arg>-target:jvm-1.8</arg>
</args>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.0.0</version>
<configuration>
<appendAssemblyId>false</appendAssemblyId>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>

</project>
42 changes: 42 additions & 0 deletions examples/correlation/run-gpu-standalone.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
#!/usr/bin/env bash

source ../../conf/env.sh

# Data file is from Spark Examples (data/mllib/sample_kmeans_data.txt) and put in examples/data
# The data file should be copied to $HDFS_ROOT before running examples
DATA_FILE=$HDFS_ROOT/data/sample_kmeans_data.txt

APP_JAR=target/oap-mllib-examples-$OAP_MLLIB_VERSION.jar
APP_CLASS=org.apache.spark.examples.ml.CorrelationExample

USE_GPU=true
RESOURCE_FILE=$PWD/IntelGpuResourceFile.json
WORKER_GPU_AMOUNT=4
EXECUTOR_GPU_AMOUNT=1
TASK_GPU_AMOUNT=1

# Should run in standalone mode
time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER -v \
--num-executors $SPARK_NUM_EXECUTORS \
--executor-cores $SPARK_EXECUTOR_CORES \
--total-executor-cores $SPARK_TOTAL_CORES \
--driver-memory $SPARK_DRIVER_MEMORY \
--executor-memory $SPARK_EXECUTOR_MEMORY \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.default.parallelism=$SPARK_DEFAULT_PARALLELISM" \
--conf "spark.sql.shuffle.partitions=$SPARK_DEFAULT_PARALLELISM" \
--conf "spark.driver.extraClassPath=$SPARK_DRIVER_CLASSPATH" \
--conf "spark.executor.extraClassPath=$SPARK_EXECUTOR_CLASSPATH" \
--conf "spark.oap.mllib.useGPU=$USE_GPU" \
--conf "spark.worker.resourcesFile=$RESOURCE_FILE" \
--conf "spark.worker.resource.gpu.amount=$WORKER_GPU_AMOUNT" \
--conf "spark.executor.resource.gpu.amount=$EXECUTOR_GPU_AMOUNT" \
--conf "spark.task.resource.gpu.amount=$TASK_GPU_AMOUNT" \
--conf "spark.shuffle.reduceLocality.enabled=false" \
--conf "spark.network.timeout=1200s" \
--conf "spark.task.maxFailures=1" \
--jars $OAP_MLLIB_JAR \
--class $APP_CLASS \
$APP_JAR $DATA_FILE \
2>&1 | tee Correlation-$(date +%m%d_%H_%M_%S).log

26 changes: 26 additions & 0 deletions examples/correlation/run.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
#!/usr/bin/env bash

source ../../conf/env.sh


APP_JAR=target/oap-mllib-examples-$OAP_MLLIB_VERSION.jar
APP_CLASS=org.apache.spark.examples.ml.CorrelationExample

time $SPARK_HOME/bin/spark-submit --master $SPARK_MASTER -v \
--num-executors $SPARK_NUM_EXECUTORS \
--executor-cores $SPARK_EXECUTOR_CORES \
--total-executor-cores $SPARK_TOTAL_CORES \
--driver-memory $SPARK_DRIVER_MEMORY \
--executor-memory $SPARK_EXECUTOR_MEMORY \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
--conf "spark.default.parallelism=$SPARK_DEFAULT_PARALLELISM" \
--conf "spark.sql.shuffle.partitions=$SPARK_DEFAULT_PARALLELISM" \
--conf "spark.driver.extraClassPath=$SPARK_DRIVER_CLASSPATH" \
--conf "spark.executor.extraClassPath=$SPARK_EXECUTOR_CLASSPATH" \
--conf "spark.shuffle.reduceLocality.enabled=false" \
--conf "spark.network.timeout=1200s" \
--conf "spark.task.maxFailures=1" \
--jars $OAP_MLLIB_JAR \
--class $APP_CLASS \
$APP_JAR $DATA_FILE \
2>&1 | tee Correlation-$(date +%m%d_%H_%M_%S).log
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

// scalastyle:off println
package org.apache.spark.examples.ml

// $example on$
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
// $example off$
import org.apache.spark.sql.SparkSession

/**
* An example for computing correlation matrix.
* Run with
* {{{
* bin/run-example ml.CorrelationExample
* }}}
*/
object CorrelationExample {

def main(args: Array[String]): Unit = {
val spark = SparkSession
.builder
.appName("CorrelationExample")
.getOrCreate()
import spark.implicits._

// $example on$
val data = Seq(
Vectors.dense(5.308206,9.869278,1.018934,4.292158,6.081011,6.585723,2.411094,4.767308,-3.256320,-6.029562),
Vectors.dense(7.279464,0.390664,-9.619284,3.435376,-4.769490,-4.873188,-0.118791,-5.117316,-0.418655,-0.475422),
Vectors.dense(-6.615791,-6.191542,0.402459,-9.743521,-9.990568,9.105346,1.691312,-2.605659,9.534952,-7.829027),
Vectors.dense(-4.792007,-2.491098,-2.939393,8.086467,3.773812,-9.997300,0.222378,8.995244,-5.753282,6.091060),
Vectors.dense(7.700725,-6.414918,1.684476,-8.983361,4.284580,-9.017608,0.552379,-7.705741,2.589852,0.411561),
Vectors.dense(6.991900,-1.063721,9.321163,-0.429719,-2.167696,-1.736425,-0.919139,6.980681,-0.711914,3.414347),
Vectors.dense(5.794488,-1.062261,0.955322,0.389642,3.012921,-9.953994,-3.197309,3.992421,-6.935902,8.147622),
Vectors.dense(-2.486670,6.973242,-4.047004,-5.655629,5.081786,5.533859,7.821403,2.763313,-0.454056,6.554309),
Vectors.dense(2.204855,7.839522,7.381886,1.618749,-6.566877,7.584285,-8.355983,-5.501410,-8.191205,-2.608499),
Vectors.dense(-9.948613,-8.941953,-8.106389,4.863542,5.852806,-1.659259,6.342504,-8.190106,-3.110330,-7.484658),
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")

spark.stop()
}
}
// scalastyle:on println
2 changes: 1 addition & 1 deletion examples/run-all-scala.sh
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#!/usr/bin/env bash

exampleDirs=(kmeans pca als naive-bayes linear-regression)
exampleDirs=(kmeans pca als naive-bayes linear-regression correlation)

for dir in ${exampleDirs[*]}
do
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
/*******************************************************************************
* Copyright 2020 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/

package org.apache.spark.ml.stat;

public class CorrelationResult {
public long correlationNumericTable;
}
Loading

0 comments on commit 9871ab2

Please sign in to comment.