diff --git a/README.md b/README.md
index f0786099d..f48dfd932 100644
--- a/README.md
+++ b/README.md
@@ -58,3 +58,49 @@ Linux, Apple OSX (Intel and M1)
 - Apache Spark 3.2, 3.3, or 3.4
 - JDK 8 and up
 - GLIBC 2.17 (Centos 7) and up
+
+## Getting started
+
+Make sure the requirements above are met and the required software is installed on your machine.
+
+### Clone the repo
+```commandline
+git clone https://github.com/apache/arrow-datafusion-comet.git
+```
+
+### Specify the Spark version and build Comet
+Spark 3.4 is used in this example.
+```
+cd arrow-datafusion-comet
+make release PROFILES="-Pspark-3.4"
+```
+
+### Run Spark with Comet enabled
+Make sure `SPARK_HOME` points to the same Spark version that Comet was built for.
+
+```
+$SPARK_HOME/bin/spark-shell --jars spark/target/comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar \
+--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
+--conf spark.comet.enabled=true \
+--conf spark.comet.exec.enabled=true \
+--conf spark.comet.exec.all.enabled=true
+```
+
+### Verify that Comet is enabled for a Spark SQL query
+
+Create a test Parquet source:
+```scala
+scala> (0 until 10).toDF("a").write.mode("overwrite").parquet("/tmp/test")
+```
+
+Query the data from the test source and check that:
+- The INFO message shows that the native Comet library has been initialized.
+- The query plan shows Comet operators being used for this query instead of Spark ones.
+```scala
+scala> spark.read.parquet("/tmp/test").createOrReplaceTempView("t1"); spark.sql("select * from t1 where a > 5").explain
+INFO src/lib.rs: Comet native library initialized
+== Physical Plan ==
+*(1) ColumnarToRow
++- CometFilter [a#14], (isnotnull(a#14) AND (a#14 > 5))
+   +- CometScan parquet [a#14] Batched: true, DataFilters: [isnotnull(a#14), (a#14 > 5)], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/test], PartitionFilters: [], PushedFilters: [IsNotNull(a), GreaterThan(a,5)], ReadSchema: struct
+```
\ No newline at end of file
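
The launch and verification steps in this patch could be wrapped in a small shell helper. The sketch below is not part of the patch. Both function names (`comet_jar_path`, `plan_uses_comet`) are hypothetical. The jar naming pattern is inferred only from the single file name in the example (`comet-spark-spark3.4_2.12-0.1.0-SNAPSHOT.jar`), so treating it as valid for other Spark, Scala, or Comet versions is an assumption.

```shell
#!/bin/sh

# Hypothetical helper: build the Comet jar path for a given Spark version.
# The pattern is an assumption generalized from the one jar name in the README
# example; the Scala and Comet version defaults are copied from that name.
comet_jar_path() {
    spark_version="$1"                      # e.g. 3.4
    scala_version="${2:-2.12}"              # assumed default
    comet_version="${3:-0.1.0-SNAPSHOT}"    # assumed default
    printf 'spark/target/comet-spark-spark%s_%s-%s.jar\n' \
        "$spark_version" "$scala_version" "$comet_version"
}

# Hypothetical check: read `explain` output on stdin and succeed only if it
# contains both Comet operators that appear in the README's sample plan.
plan_uses_comet() {
    plan="$(cat)"
    printf '%s\n' "$plan" | grep -q 'CometScan' &&
        printf '%s\n' "$plan" | grep -q 'CometFilter'
}
```

With these definitions, `--jars "$(comet_jar_path 3.4)"` would expand to the jar path used in the example, and saved `explain` output could be piped into `plan_uses_comet` to flag a plan that fell back to plain Spark operators. Matching on operator names is a coarse text check and depends on the plan format staying as shown.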