- Building
- Integrating with Spark*
- Enabling NUMA binding for Intel® Optane™ DC Persistent Memory in Spark
Building with Apache Maven*.
Clone the OAP project:
git clone -b <tag-version> https://github.com/Intel-bigdata/OAP.git
cd OAP
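For example, to build a specific release, list the repository's tags and clone the one you need. The tag name below is illustrative only, not authoritative:

```
# List available tags, then clone the one you want ("v0.9.0" is an assumed example).
git ls-remote --tags https://github.com/Intel-bigdata/OAP.git
git clone -b v0.9.0 https://github.com/Intel-bigdata/OAP.git
cd OAP
```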
Build the oap-cache package:
mvn clean -pl com.intel.oap:oap-cache -am package
Run all the tests:
mvn clean -pl com.intel.oap:oap-cache -am test
Run a specific test suite, for example OapDDLSuite:
mvn -DwildcardSuites=org.apache.spark.sql.execution.datasources.oap.OapDDLSuite test
NOTE: The log level of the unit tests currently defaults to ERROR; override oap-cache/oap/src/test/resources/log4j.properties if needed.
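For example, one quick way to raise the test log level is to edit that file before running the tests. The sed command below assumes the file uses the standard log4j 1.x rootLogger key; adjust it to the file's actual contents:

```
# Assumption: the file contains a line starting with "log4j.rootLogger=ERROR".
sed -i 's/^log4j.rootLogger=ERROR/log4j.rootLogger=INFO/' \
    oap-cache/oap/src/test/resources/log4j.properties
```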
Install the required packages on the build system:
The memkind library depends on libnuma at runtime, so libnuma must already be installed on each worker node. Build the latest memkind library from source:
git clone -b v1.10.1-rc2 https://github.com/memkind/memkind
cd memkind
./autogen.sh
./configure
make
make install
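If libnuma is not already present on a node, it can usually be installed from the distribution's package manager before building memkind. The package names below are common defaults and may differ on your distribution:

```
# RHEL/CentOS-family (package names assumed)
sudo yum install -y numactl numactl-devel
# Debian/Ubuntu-family (package names assumed)
sudo apt-get install -y libnuma1 libnuma-dev
# Refresh the shared-library cache after installing memkind
sudo ldconfig
```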
To build the vmemcache library from source (using an RPM-based Linux distribution as an example):
git clone https://github.com/pmem/vmemcache
cd vmemcache
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCPACK_GENERATOR=rpm
make package
sudo rpm -i libvmemcache*.rpm
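On Debian-based systems, the same flow can generate a deb package instead. This sketch mirrors the RPM steps above; the generator name is an assumption, so check vmemcache's build documentation:

```
# Debian-based alternative (assumed to mirror the rpm flow above)
cmake .. -DCMAKE_INSTALL_PREFIX=/usr -DCPACK_GENERATOR=deb
make package
sudo dpkg -i libvmemcache*.deb
```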
To use the optimized Plasma cache with OAP, you need the following components:
(1) libarrow.so, libplasma.so, libplasma_jni.so: dynamic libraries used by the Plasma client.
(2) plasma-store-server: the executable that provides the Plasma cache service.
(3) arrow-plasma-0.17.0.jar: needed when compiling OAP; the Spark runtime also requires it.
- .so files and binary file
Clone the code from the Intel-bigdata arrow repo and run the following commands. This installs libplasma.so, libarrow.so, libplasma_jni.so, and plasma-store-server to your system path (/usr/lib64 by default). If you are using Spark in a cluster environment, you can copy these files to all nodes in your cluster as long as the OS and distribution are the same (a copy sketch follows the build commands below); otherwise, you need to compile them on each node.
cd /tmp
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout apache-arrow-0.17.0-intel-oap-0.9
cd cpp
mkdir release
cd release
#build libarrow, libplasma, libplasma_java
cmake -DCMAKE_INSTALL_PREFIX=/usr/ -DCMAKE_BUILD_TYPE=Release -DARROW_BUILD_TESTS=on -DARROW_PLASMA_JAVA_CLIENT=on -DARROW_PLASMA=on -DARROW_DEPENDENCY_SOURCE=BUNDLED ..
make -j$(nproc)
sudo make install -j$(nproc)
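If all worker nodes run the same OS and distribution, one simple way to distribute the installed artifacts is an scp loop like the hedged sketch below. The workers.txt host list and the /usr/bin location of plasma-store-server are assumptions for illustration:

```
# Hypothetical sketch: workers.txt lists one worker hostname per line.
# Library and binary paths assume the /usr install prefix used above.
while read -r node; do
  scp /usr/lib64/libarrow.so* /usr/lib64/libplasma.so* /usr/lib64/libplasma_jni.so* \
      "$node":/usr/lib64/
  scp /usr/bin/plasma-store-server "$node":/usr/bin/
done < workers.txt
```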
- arrow-plasma-0.17.0.jar
arrow-plasma-0.17.0.jar is available in the Maven Central repository; you can download it and copy it to the $SPARK_HOME/jars directory. Alternatively, you can install it manually: change to the java directory of the arrow repo and run the following command, which installs the arrow jars into your local Maven repository so that you can then compile the oap-cache module. Either way, you need to copy arrow-plasma-0.17.0.jar to the $SPARK_HOME/jars/ directory, because this jar is required when using the external cache.
cd $ARROW_REPO_DIR/java
mvn clean -q -pl plasma -DskipTests install
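Either way, a copy of the jar has to end up under $SPARK_HOME/jars. A hedged sketch of both options follows; the Maven Central URL and the local build output path are assumptions, so verify them before use:

```
# Option A: download from Maven Central (coordinates assumed: org.apache.arrow:arrow-plasma:0.17.0)
wget https://repo1.maven.org/maven2/org/apache/arrow/arrow-plasma/0.17.0/arrow-plasma-0.17.0.jar
cp arrow-plasma-0.17.0.jar "$SPARK_HOME"/jars/

# Option B: copy the jar produced by the local Maven build above (output path assumed)
cp "$ARROW_REPO_DIR"/java/plasma/target/arrow-plasma-0.17.0.jar "$SPARK_HOME"/jars/
```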
You need to add -Ppersistent-memory to build with PMem support. For the noevict cache strategy, you also need to build with the -Ppersistent-memory parameter:
mvn clean -q -pl com.intel.oap:oap-cache -am -Ppersistent-memory -DskipTests package
For the vmemcache cache strategy, build with this command:
mvn clean -q -pl com.intel.oap:oap-cache -am -Pvmemcache -DskipTests package
To enable both cache strategies, build with this command:
mvn clean -q -pl com.intel.oap:oap-cache -am -Ppersistent-memory -Pvmemcache -DskipTests package
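After any of the builds above completes, the packaged jar should be in the module's Maven target directory. The path below follows the standard Maven layout and is an assumption; the artifact name varies with the version you built:

```
# Build output location assumed from the standard Maven layout.
ls oap-cache/oap/target/*.jar
```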
Although SQL Index and Data Source Cache act as a plug-in JAR to Spark, there are still a few tricks to note when integrating with Spark. The OAP team explored using the Spark extension & data source API to deliver its core functionality. However, the limits of the Spark extension and data source API meant that we had to make some changes to Spark internals. As a result, you must check whether your installation is an unmodified Community Spark or a customized Spark.
If you are running a Community Spark, things will be much simpler. Refer to the User Guide to configure and set up Spark to work with OAP.
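For reference, the plug-in is usually enabled by putting the OAP jar on the driver and executor classpaths and registering the SQL extension in $SPARK_HOME/conf/spark-defaults.conf. The jar name and the extension class in the sketch below are assumptions for illustration only; the User Guide has the authoritative settings:

```
# Sketch only: jar path and extension class name are assumed, not authoritative.
cat >> "$SPARK_HOME"/conf/spark-defaults.conf <<'EOF'
spark.files                    /path/to/oap-cache-<version>.jar
spark.executor.extraClassPath  ./oap-cache-<version>.jar
spark.driver.extraClassPath    /path/to/oap-cache-<version>.jar
spark.sql.extensions           org.apache.spark.sql.OapExtensions
EOF
```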
If you are running a customized Spark, check whether the OAP changes to Spark internals conflict with or override your private changes.
- If there are no conflicts or overrides, the steps are the same as for the unmodified version of Spark described above.
- If there are conflicts or overrides, develop a merge plan for the source code to make sure the changes you made to the Spark source appear in the corresponding files included in the OAP project. Once merged, rebuild OAP.
The following files need to be checked/compared for changes:
- org/apache/spark/sql/execution/DataSourceScanExec.scala
  Add the metrics info to OapMetricsManager and schedule the task to read from the cache.
- org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala
  Return the result of the write task to the driver.
- org/apache/spark/sql/execution/datasources/FileFormatWriter.scala
  Add the result of the write task.
- org/apache/spark/sql/execution/datasources/OutputWriter.scala
  Add a new API to support returning the result of the write task to the driver.
- org/apache/spark/status/api/v1/OneApplicationResource.scala
  Update the metric data shown in the Spark web UI.
- org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java
  Change variable access from private to protected.
- org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
  Change variable access from private to protected.
- org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
  Change variable access from private to protected.
- org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
  Add get and set methods for the variables changed to protected.
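One way to spot conflicts is to diff each file listed above between your Spark source tree and the copy carried in the OAP project. The sketch below is hypothetical: the OAP-side path in particular is an assumption, so point it at wherever these files actually live in the OAP tree:

```
# Hypothetical paths: adjust SPARK_SRC and OAP_SRC to your checkouts.
SPARK_SRC=/path/to/spark/sql/core/src/main
OAP_SRC=/path/to/OAP/oap-cache/oap/src/main   # assumed location of OAP's copies
for f in scala/org/apache/spark/sql/execution/DataSourceScanExec.scala \
         java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java; do
  diff -u "$SPARK_SRC/$f" "$OAP_SRC/$f" || echo "manual review needed: $f"
done
```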
When using PMem as the cache medium, apply the NUMA binding patch numa-binding-spark-3.0.0.patch to the Spark source code for best performance.
- Download the Spark-3.0.0 source or clone it from GitHub.
- Apply this patch and rebuild the Spark package (see the sketch after this list):
  git apply numa-binding-spark-3.0.0.patch
- Add these configuration items to the Spark configuration file $SPARK_HOME/conf/spark-defaults.conf to enable NUMA binding.
spark.yarn.numa.enabled true
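A hedged sketch of the patch-and-rebuild flow follows. The download URL and build profiles are the standard ones for Spark 3.0.0, but verify them against your environment:

```
# Fetch the Spark 3.0.0 source (URL assumed; any Apache mirror works)
wget https://archive.apache.org/dist/spark/spark-3.0.0/spark-3.0.0.tgz
tar -xzf spark-3.0.0.tgz && cd spark-3.0.0

# Apply the NUMA binding patch and rebuild a distributable package
git apply /path/to/numa-binding-spark-3.0.0.patch
./dev/make-distribution.sh --name hadoop2.7 --tgz -Phadoop-2.7 -Pyarn
```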
NOTE: If you are using a customized Spark, you will need to manually resolve the conflicts.
If you find it cumbersome to apply the patch yourself, a pre-built Spark package spark-3.0.0-bin-hadoop2.7-intel-oap-0.9.0.tgz with the patch already applied is available.