[NSE-283] Pick S3/CSV supports to OAP 1.1 (#284)
* [NSE-237] Add ARROW_CSV=ON to default C++ build commands (#238)

* [NSE-261] ArrowDataSource: Add S3 Support (#270)

Closes #261

* [NSE-276] Add option to switch Hadoop version

* [NSE-119] clean up on comments (#288)

Signed-off-by: Yuan Zhou <yuan.zhou@intel.com>

* [NSE-206] Update installation guide and configuration guide. (#289)

* [NSE-206] Update installation guide and configuration guide.

* Fix numaBinding setting issue and update description for protobuf

* [NSE-206] Fix Prerequisite and Arrow Installation Steps. (#290)

Co-authored-by: Yuan <yuan.zhou@intel.com>
Co-authored-by: Wei-Ting Chen <weiting.chen@intel.com>
3 people authored Apr 27, 2021

Parent: 5b2bf6c · Commit: 230df38
Showing 17 changed files with 209 additions and 72 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tpch.yml
@@ -44,7 +44,7 @@ jobs:
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-3.0.0-oap-1.1 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF && make -j2
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DARROW_JEMALLOC=OFF && make -j2
sudo make install
cd ../../java
mvn clean install -B -Dorg.slf4j.simpleLogger.log.org.apache.maven.cli.transfer.Slf4jMavenTransferListener=warn -P arrow-jni -am -Darrow.cpp.build.dir=/tmp/arrow/cpp/build/release/ -DskipTests -Dcheckstyle.skip
2 changes: 1 addition & 1 deletion .github/workflows/unittests.yml
@@ -47,7 +47,7 @@ jobs:
git clone https://github.com/oap-project/arrow.git
cd arrow && git checkout arrow-3.0.0-oap-1.1 && cd cpp
mkdir build && cd build
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON -DGTEST_ROOT=/usr/src/gtest && make -j2
sudo make install
- name: Run unit tests
run: |
27 changes: 21 additions & 6 deletions README.md
@@ -40,7 +40,20 @@ We implemented columnar shuffle to improve the shuffle performance. With the col

Please check the operator supporting details [here](./docs/operators.md)

## Build the Plugin
## How to use OAP: Native SQL Engine

You can use OAP: Native SQL Engine in three ways:
1. Use precompiled jars
2. Build in a Conda environment
3. Build by yourself

### Use precompiled jars

Please go to [OAP's Maven Central Repository](https://repo1.maven.org/maven2/com/intel/oap/) to find Native SQL Engine jars.
For usage, you will need the two jar files below:
1. spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar is located in com/intel/oap/spark-arrow-datasource-standard/<version>/
2. spark-columnar-core-<version>-jar-with-dependencies.jar is located in com/intel/oap/spark-columnar-core/<version>/
Please note that these are fat jars shipped with our custom Arrow library and pre-compiled on our server (using GCC 9.3.0 and LLVM 7.0.1), which means you need GCC 9.3.0 and LLVM 7.0.1 pre-installed on your system for normal usage.
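For example, here is a minimal sketch of launching Spark shell with both precompiled jars on the driver and executor classpaths (the `/path/to` locations and `<version>` are placeholders to fill in):
```
${SPARK_HOME}/bin/spark-shell \
  --conf spark.driver.extraClassPath=/path/to/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:/path/to/spark-columnar-core-<version>-jar-with-dependencies.jar \
  --conf spark.executor.extraClassPath=/path/to/spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar:/path/to/spark-columnar-core-<version>-jar-with-dependencies.jar
```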

### Building by Conda

@@ -51,18 +64,18 @@ Then you can just skip below steps and jump to Getting Started [Get Started](#ge

If you prefer to build from the source code yourself, please follow the steps below to set up your environment.

### Prerequisite
#### Prerequisite

There are some requirements before you build the project.
Please check the document [Prerequisite](./docs/Prerequisite.md) and make sure you have already installed the required software on your system.
If you are running a Spark cluster, please make sure the software is installed on every node.

### Installation
Please check the document [Installation Guide](./docs/Installation.md)
#### Installation

### Configuration & Testing
Please check the document [Configuration Guide](./docs/Configuration.md)
Please check the document [Installation Guide](./docs/Installation.md)

## Get started

To enable the OAP Native SQL Engine, the previously built jar `spark-columnar-core-<version>-jar-with-dependencies.jar` should be added to the Spark configuration. We also recommend using `spark-arrow-datasource-standard-<version>-jar-with-dependencies.jar`. We will demonstrate an example using both jar files.
Spark-related options are:

@@ -75,6 +88,8 @@ SPARK related options are:
For Spark Standalone mode, please set the above value as a relative path to the jar file.
For Spark YARN cluster mode, please set the above value as an absolute path to the jar file.

For more configuration options, please check the document [Configuration Guide](./docs/Configuration.md).

An example of running Spark shell with the ArrowDataSource jar file:
```
${SPARK_HOME}/bin/spark-shell \
2 changes: 1 addition & 1 deletion arrow-data-source/.travis.yml
@@ -26,7 +26,7 @@ jobs:
- cd arrow && git checkout oap-master && cd cpp
- sed -i "s/\${Python3_EXECUTABLE}/\/opt\/pyenv\/shims\/python3/g" CMakeLists.txt
- mkdir build && cd build
- cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON && make
- cmake .. -DARROW_JNI=ON -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_FILESYSTEM=ON -DARROW_WITH_SNAPPY=ON -DARROW_JSON=ON -DARROW_DATASET=ON -DARROW_WITH_LZ4=ON && make
- sudo make install
- cd ../../java
- mvn clean install -q -P arrow-jni -am -Darrow.cpp.build.dir=/tmp/arrow/cpp/build/release/ -DskipTests -Dcheckstyle.skip
2 changes: 1 addition & 1 deletion arrow-data-source/README.md
git clone -b <version> https://github.com/Intel-bigdata/arrow.git
cd arrow/cpp
mkdir build
cd build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make
// build and install arrow jvm library
2 changes: 1 addition & 1 deletion arrow-data-source/docs/ApacheArrowInstallation.md
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout branch-0.17.0-oap-1.0
mkdir -p cpp/release-build
cd cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
make install

68 changes: 59 additions & 9 deletions arrow-data-source/pom.xml
@@ -3,7 +3,7 @@
<groupId>com.intel.oap</groupId>
<artifactId>native-sql-engine-parent</artifactId>
<version>1.1.0</version>
</parent>
</parent>

<modelVersion>4.0.0</modelVersion>
<groupId>com.intel.oap</groupId>
@@ -18,12 +18,6 @@
<module>parquet</module>
</modules>
<properties>
<scala.version>2.12.10</scala.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.0.0</spark.version>
<arrow.version>3.0.0</arrow.version>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
<script.dir>${arrow.script.dir}</script.dir>
<datasource.cpp_tests>${cpp_tests}</datasource.cpp_tests>
<datasource.build_arrow>${build_arrow}</datasource.build_arrow>
@@ -48,6 +42,50 @@
</pluginRepositories>

<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>${hadoop.version}</version>
<exclusions>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
</exclusion>
<exclusion>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</exclusion>
<exclusion>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-core</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-json</artifactId>
</exclusion>
<exclusion>
<groupId>com.sun.jersey</groupId>
<artifactId>jersey-server</artifactId>
</exclusion>
<exclusion>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpcore</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<version>4.2</version>
</dependency>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
@@ -61,7 +99,7 @@
<exclusions>
<exclusion>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-format</artifactId>
<artifactId>arrow-vector</artifactId>
</exclusion>
</exclusions>
<scope>provided</scope>
@@ -83,13 +121,25 @@
<groupId>org.apache.spark</groupId>
<artifactId>spark-catalyst_2.12</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
</exclusion>
</exclusions>
<type>test-jar</type>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.arrow</groupId>
<artifactId>arrow-vector</artifactId>
</exclusion>
</exclusions>
<type>test-jar</type>
<scope>test</scope>
</dependency>
@@ -118,7 +168,7 @@
<configuration>
<executable>bash</executable>
<arguments>
<argument>${script.dir}/build_arrow.sh</argument>
<argument>${script.dir}/build_arrow.sh</argument>
<argument>--tests=${datasource.cpp_tests}</argument>
<argument>--build_arrow=${datasource.build_arrow}</argument>
<argument>--static_arrow=${datasource.static_arrow}</argument>
ArrowUtils.scala
@@ -156,6 +156,11 @@ object ArrowUtils {

private def rewriteUri(uriStr: String): String = {
val uri = URI.create(uriStr)
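// S3 paths may arrive with either the "s3" or "s3a" scheme; normalize them
// to "s3" for Arrow, preserving authority, path, query and fragment.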
if (uri.getScheme == "s3" || uri.getScheme == "s3a") {
val s3Rewritten = new URI("s3", uri.getAuthority,
uri.getPath, uri.getQuery, uri.getFragment).toString
return s3Rewritten
}
val sch = uri.getScheme match {
case "hdfs" => "hdfs"
case "file" => "file"
ArrowDataSourceTest.scala
@@ -106,10 +106,18 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
verifyParquet(
spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path))
}

test("simple sql query on s3") {
val path = "s3a://mlp-spark-dataset-bucket/test_arrowds_s3_small"
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.arrow(path)
frame.createOrReplaceTempView("stab")
assert(spark.sql("select id from stab").count() === 1000)
}
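To run a query like this against your own bucket, one possible way to supply credentials is through Hadoop's s3a options, which the hadoop-aws dependency added above reads; the keys and endpoint below are placeholders, not part of this commit:
```
${SPARK_HOME}/bin/spark-shell \
  --conf spark.hadoop.fs.s3a.access.key=<access-key> \
  --conf spark.hadoop.fs.s3a.secret.key=<secret-key> \
  --conf spark.hadoop.fs.s3a.endpoint=s3.<region>.amazonaws.com
```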

test("create catalog table") {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
spark.catalog.createTable("ptab", path, "arrow")
@@ -130,7 +138,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
verifyParquet(spark.sql("select * from ptab"))
@@ -142,7 +149,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile3)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
val sqlFrame = spark.sql("select * from ptab")
@@ -163,7 +169,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
spark.sql("select col from ptab where col = 1").explain(true)
@@ -178,7 +183,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile2)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")
val rows = spark.sql("select * from ptab where col = 'b'").collect()
@@ -215,7 +219,6 @@ class ArrowDataSourceTest extends QueryTest with SharedSparkSession {
val path = ArrowDataSourceTest.locateResourcePath(parquetFile1)
val frame = spark.read
.option(ArrowOptions.KEY_ORIGINAL_FORMAT, "parquet")
.option(ArrowOptions.KEY_FILESYSTEM, "hdfs")
.arrow(path)
frame.createOrReplaceTempView("ptab")

26 changes: 5 additions & 21 deletions docs/ApacheArrowInstallation.md
@@ -24,25 +24,16 @@ make install
```

# cmake:
Arrow downloads packages during compilation, which requires SSL support in cmake; building cmake from source as shown below is optional if your installed cmake already supports SSL.
``` shell
wget https://github.com/Kitware/CMake/releases/download/v3.15.0-rc4/cmake-3.15.0-rc4.tar.gz
tar xf cmake-3.15.0-rc4.tar.gz
cd cmake-3.15.0-rc4/
./bootstrap --system-curl --parallel=64 # choose the parallelism based on your server's core count
make -j
make install
cmake --version
cmake version 3.15.0-rc4
```
Please make sure your cmake version satisfies the prerequisite.


# Apache Arrow
``` shell
git clone https://github.com/Intel-bigdata/arrow.git
cd arrow && git checkout branch-0.17.0-oap-1.0
cd arrow && git checkout <version>
mkdir -p cpp/release-build
cd cpp/release-build
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
cmake -DARROW_DEPENDENCY_SOURCE=BUNDLED -DARROW_GANDIVA_JAVA=ON -DARROW_GANDIVA=ON -DARROW_PARQUET=ON -DARROW_CSV=ON -DARROW_HDFS=ON -DARROW_BOOST_USE_SHARED=ON -DARROW_JNI=ON -DARROW_DATASET=ON -DARROW_WITH_PROTOBUF=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_LZ4=ON -DARROW_FILESYSTEM=ON -DARROW_JSON=ON ..
make -j
make install

@@ -60,11 +51,4 @@ mvn test -pl adapter/parquet -P arrow-jni
mvn test -pl gandiva -P arrow-jni
```

# Copy binary files to oap-native-sql resources directory
Because the oap-native-sql plugin builds a stand-alone jar file with the Arrow dependency, if you choose to build Arrow yourself, you have to copy the files below to replace the original ones.
You can find those files in the Apache Arrow installation or release directory. The example below assumes Apache Arrow has been installed in /usr/local/lib64
``` shell
cp /usr/local/lib64/libarrow.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libgandiva.so.17 $native-sql-engine-dir/cpp/src/resources
cp /usr/local/lib64/libparquet.so.17 $native-sql-engine-dir/cpp/src/resources
```
After Arrow is installed in the specified directory, please make sure to set -Dbuild_arrow=OFF -Darrow_root=/path/to/arrow when building Native SQL Engine.
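A minimal sketch of that Maven invocation, assuming Arrow was installed under the default `/usr/local` prefix:
``` shell
# Point the build at the pre-installed Arrow instead of rebuilding it.
mvn clean package -DskipTests -Dbuild_arrow=OFF -Darrow_root=/usr/local
```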