Developer Guide

Building

Building SQL Index and Data Source Cache

Building with Apache Maven*.

Clone the OAP project:

git clone -b <tag-version> https://github.com/Intel-bigdata/OAP.git
cd OAP

Build the oap-cache package:

mvn clean -pl com.intel.oap:oap-cache -am package
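If the build succeeds, the packaged jar is produced under the module's target directory; the exact file name depends on the OAP and Spark versions, so the path below is an assumption to verify against your build output:

ls oap-cache/oap/target/*.jar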

Running Tests

Run all the tests:

mvn clean -pl com.intel.oap:oap-cache -am test

Run a specific test suite, for example OapDDLSuite:

mvn -DwildcardSuites=org.apache.spark.sql.execution.datasources.oap.OapDDLSuite test
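The wildcardSuites parameter of the scalatest-maven-plugin also accepts a comma-separated list, so several suites can be run in one invocation; the second suite name below is only a hypothetical example:

# the second suite name is a placeholder; substitute a real suite from the test sources
mvn -DwildcardSuites=org.apache.spark.sql.execution.datasources.oap.OapDDLSuite,org.apache.spark.sql.execution.datasources.oap.OapIndexSuite test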

NOTE: The log level of the unit tests currently defaults to ERROR. Override oap-cache/oap/src/test/resources/log4j.properties if you need more verbose output.
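For example, a minimal log4j.properties override that raises the log level to INFO might look like the following; these property values are illustrative, not the file's current contents:

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n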

Building with Intel® Optane™ DC Persistent Memory Module

Prerequisites for building with PMem support

Install the required packages on the build system:

memkind installation

The memkind library depends on libnuma at runtime, so libnuma must already be installed on each worker node. Build the latest memkind library from source:

git clone -b v1.10.1-rc2 https://github.com/memkind/memkind
cd memkind
./autogen.sh
./configure
make
make install
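The memkind build needs the libnuma development headers; if ./configure fails because they are missing, they can usually be installed from the distribution repositories first. On an RPM-based system the package names are typically the following (an assumption, verify for your distribution):

sudo yum install -y numactl numactl-devel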

vmemcache installation

To build the vmemcache library from source (RPM-based Linux shown as an example):

git clone https://github.com/pmem/vmemcache
cd vmemcache
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCPACK_GENERATOR=rpm
make package
sudo rpm -i libvmemcache*.rpm
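On dpkg-based distributions, the equivalent steps are expected to look like the following; the deb generator and package file name are assumptions, so check the vmemcache documentation for your distribution:

cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCPACK_GENERATOR=deb
make package
sudo dpkg -i libvmemcache*.deb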

Plasma installation

To use the optimized Plasma cache with OAP, you need the following components:

(1) libarrow.so, libplasma.so, libplasma_jni.so: dynamic libraries used by the Plasma client.
(2) plasma-store-server: an executable that provides the Plasma cache service.
(3) arrow-plasma-0.17.0.jar: required when compiling OAP; the Spark runtime also needs it.

  • Shared libraries and binary
    Clone the code from the Intel-bigdata arrow repository and run the following commands. This installs libplasma.so, libarrow.so, libplasma_jni.so, and plasma-store-server to your system path (/usr/lib64 by default). If you are running Spark in a cluster, you can copy these files to all nodes as long as the OS and distribution are the same; otherwise, compile them on each node.
cd /tmp
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout apache-arrow-0.17.0-intel-oap-0.9
cd cpp
mkdir release
cd release
#build libarrow, libplasma, libplasma_java
cmake -DCMAKE_INSTALL_PREFIX=/usr/ -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=on -DARROW_PLASMA_JAVA_CLIENT=on -DARROW_PLASMA=on -DARROW_DEPENDENCY_SOURCE=BUNDLED  ..
make -j$(nproc)
sudo make install -j$(nproc)
  • arrow-plasma-0.17.0.jar
    arrow-plasma-0.17.0.jar is available in the Maven Central repository; you can download it and copy it to the $SPARK_HOME/jars directory. Alternatively, install it manually: change to the java directory of the arrow repository and run the following command, which installs the arrow jars into your local Maven repository so that you can compile the oap-cache module. In either case, copy arrow-plasma-0.17.0.jar to $SPARK_HOME/jars/, because this jar is needed when using the external cache.
cd $ARROW_REPO_DIR/java
mvn clean -q -pl plasma -DskipTests install
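After the install completes, copy the jar into Spark's jars directory; the build output path below assumes the standard Maven layout and should be verified against your checkout:

cp $ARROW_REPO_DIR/java/plasma/target/arrow-plasma-0.17.0.jar $SPARK_HOME/jars/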

Building the package

You need to add -Ppersistent-memory to build with PMem support. The noevict cache strategy also requires the -Ppersistent-memory profile.

mvn clean -q -pl com.intel.oap:oap-cache -am  -Ppersistent-memory -DskipTests package

For the vmemcache cache strategy, build with:

mvn clean -q -pl com.intel.oap:oap-cache -am -Pvmemcache -DskipTests package

To enable both profiles, build with:

mvn clean -q -pl com.intel.oap:oap-cache -am  -Ppersistent-memory -Pvmemcache -DskipTests package

Integrating with Spark

Although SQL Index and Data Source Cache act as a plug-in JAR to Spark, there are still a few things to note when integrating with Spark. The OAP team explored using the Spark extension and data source APIs to deliver the core functionality. However, the limits of those APIs meant that some changes to Spark internals were still necessary. As a result, you must check whether your installation is an unmodified Community Spark or a customized Spark.

Integrating with Community Spark

If you are running a Community Spark, things will be much simpler. Refer to the User Guide to configure and set up Spark to work with OAP.

Integrating with Customized Spark

In this case check whether the OAP changes to Spark internals will conflict with or override your private changes.

  • If there are no conflicts or overrides, follow the same steps as for the unmodified version of Spark described above.
  • If there are conflicts or overrides, develop a merge plan for the source code to make sure the changes you made to the Spark source appear in the corresponding files included in the OAP project. Once merged, rebuild OAP.

The following files need to be checked/compared for changes; a diff sketch follows the list.

•	org/apache/spark/sql/execution/DataSourceScanExec.scala
	Add the metrics info to OapMetricsManager and schedule the task to read from the cached data.
•	org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
	Return the result of the write task to the driver.
•	org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
	Add the result of the write task.
•	org/apache/spark/sql/execution/datasources/OutputWriter.scala
	Add a new API to support returning the result of the write task to the driver.
•	org/apache/spark/status/api/v1/OneApplicationResource.scala
	Update the metric data shown in the Spark web UI.
•	org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
	Change the access of variables from private to protected.
•	org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
	Change the access of variables from private to protected.
•	org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
	Change the access of variables from private to protected.
•	org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
	Add get and set methods for the changed protected variables.
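One way to compare these files is to diff your Spark source tree against the copies shipped in the OAP source. The sketch below assumes the usual Spark module layout and an oap-cache source path; both are assumptions to adjust for your checkouts:

# SPARK_SRC_DIR and OAP_SRC_DIR are placeholders for the two source trees
diff -u $SPARK_SRC_DIR/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala \
        $OAP_SRC_DIR/oap-cache/oap/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala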

Enabling NUMA binding for PMem in Spark

Rebuilding Spark packages with NUMA binding patch

When using PMem as the cache medium, apply the NUMA binding patch numa-binding-spark-3.0.0.patch to the Spark source code for best performance.

  1. Download the source of Spark 3.0.0 or clone it from GitHub.

  2. Apply the patch and rebuild the Spark package:

git apply  numa-binding-spark-3.0.0.patch
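After the patch applies cleanly, the Spark distribution can be rebuilt with Spark's standard distribution script; the profiles below are assumptions and should match your Hadoop and resource manager setup:

./dev/make-distribution.sh --name numa-patched --tgz -Phadoop-2.7 -Pyarn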
  3. Add the following configuration item to the Spark configuration file $SPARK_HOME/conf/spark-defaults.conf to enable NUMA binding:
spark.yarn.numa.enabled true 

NOTE: If you are using a customized Spark, you will need to manually resolve the conflicts.

Using pre-built patched Spark packages

If applying the patch yourself is too cumbersome, a pre-built Spark package, spark-3.0.0-bin-hadoop2.7-intel-oap-0.9.0.tgz, is available with the patch already applied.

*Other names and brands may be claimed as the property of others.