(Abbreviated as NesQuEIT, pronounced [neskwit]
)
This project is meant to test various query engines (Spark, Flink, Presto, etc) against Nessie and Apache Iceberg.
Test runs are driven by either configuring the released versions of Nessie and Iceberg, or by using Git clones of the respective heads of Nessie's main branch and/or Iceberg's main branch.
Goals:
- Be and stay open source!
- Run the query engine integration tests "from your laptop" w/o having to provision external machinery.
- Ability to run tests "from your IDE" (in IntelliJ), although it might require tweaking this and/or this file to specify the required version(s).
- Run the integration tests rather quickly even on tiny-ish instances, like GitHub's default runners.
- Optionally run tests against external machines, like already configured Spark/Flink/Presto clusters or Nessie instances.
Possible tests:
- Test Spark against Iceberg + Nessie. All Spark + Scala versions that both Iceberg + Nessie support can be used.
- Test Flink against Iceberg + Nessie. All Flink versions That Iceberg supports can be used.
- Test Presto against Iceberg + Nessie. All Presto versions that support the currently used Iceberg version can be used.
- Cross-engine tests using Spark + Flink simultaneously against Iceberg + Nessie. All Spark + Scala + Flink versions that both Iceberg + Nessie support can be used.
Possible tests with a local Nessie Git clone:
- Run tests in Nessie that test against Apache Iceberg
Possible tests with a local Iceberg Git clone:
- Run tests in Apache Iceberg that test against Nessie
The NesQuEIT for Nessie leverages Gradle included builds for referenced projects, where possible. This means, that IDEs (at least IntelliJ) allows you to work on all three code bases (integrations tests, Nessie and Iceberg).
A major advantage of this project is that tests can be debugged in an IDE like IntelliJ.
IMPORTANT Setup the required included builds as described below before you open this project in your IDE!!
Tests may run against an ephemeral Nessie instance per (Gradle) test task, but tests may also run concurrently against a shared and external Nessie instance. Writing to the default branch will likely result in intermittent or reproducible test failures.
The Gradle build automatically generates one Gradle project for each configured configuration of
query engine versions in combination with Iceberg + Nessie. For example: For Nessie 0.40 and
Iceberg 0.14, 5 Gradle projects with Spark 3.1, 3.2, 3.3, and Scala 2.12 + 2.13 are created
(Spark 3.1 is only supported with Scala 2.12). Similar for Flink, 2 Gradle projects with Flink 1.14
and 1.15 are created.
Use ./gradlew projects -Dnessie.versionNessie=0.50.0 -Dnessie.versionIceberg=1.1.0
to see the
list of projects available for those released versions.
Tests for each query engine (Spark, Flink, Presto) use the same test code, so a lot of code duplication is prevented.
Tests also leverage JUnit 5 extensions to provide e.g. a Spark session or a JDBC connection for Presto.
The project can be imported into IntelliJ (see below).
Unlike Spark + Flink, which are integrated "from Iceberg", Presto integrates Iceberg (the other way around). While NesQuEIT can already use local Git clones of Nessie + Iceberg, it would be good to also include a local Git clone of Presto. However, Presto does not use Gradle but Maven. A Gradle plugin to include Maven builds in Gradle builds is in early development, but is meant to support using Presto's main branch directly in this project.
Ideally the cross-engine test should also include Presto, but dependency issues against Parquet prevent this.
By default, NesQuEIT includes source builds of Nessie Iceberg. If you want to use released versions instead of source builds, you can do so by using the following system properties.
System property | Meaning and default |
---|---|
nessie.versionNessie |
The version of Nessie to use. Defaults to the included build in included-builds/nessie . |
nessie.versionNessieServer |
The version of Nessie to use for the Nessie server being launched. Defaults to nessie.versionNessie . |
nessie.versionIceberg |
The version of Nessie to use. Defaults to the included build in included-builds/iceberg/ . |
nessie.externalNessieUrl |
Nessie REST API endpoint to use from tests instead of a Nessie server launched for each test. DO NOT USE THE nessie.client.uri PROPERTY! |
nessie.test.keepReferences |
Nessie related tests delete all references created by them. If it is helpful to keep the references, set this property to true . |
nessie.client.* |
All system properties starting with nessie.client. are passed as configuration options to the Iceberg NessieCatalog , and further down to the Nessie client. (The prefix nessie.client. is removed.) |
spark.master.url |
Spark URL to use from tests, defaults to local[2] . Example: see below. |
spark.log.level |
Spark log level. Defaults to WARN . |
flink.remote.host |
Hostname of the externally provided Flink cluster. Tests will use a JVM-local Flink, if this property is not set. |
flink.remote.port |
Port number of the externally provided Flink cluster. Only used, when flink.remote.host is specified. Defaults to 8081 . |
flink.config.* |
Additional Flink configuration settings added to a Flink Configuration object, with the flink.config. prefix removed. Only used, when flink.remote.host is specified. |
presto.session.* |
Additional Presto session properties, added with the presto.session. prefix removed. |
presto.jdbc.host-port |
The optional JDBC host and port to connect to an already provisioned Presto endpoint. |
nessie.inttest.location.* |
Supports specifying URIs for these types (replace the * ): iceberg.warehouse , hive.metastore.catalog , presto.local.path . |
withMavenLocal |
When set to true , the local maven repository will be added to the queried repositories. |
Gradle assumes that tests do not need to be run, if the inputs (aka source files, configuration including system properties, etc.) do not change. For integration tests however, it is often necessary to rerun tests. There are two options:
- Unconditionally run the tests. This can be achieved by setting the Gradle project property
testRerun
. Example:./gradlew :nqeit-iceberg-flink-1.15:intTest -PtestRerun
- Run tests multiple times. This unconditionally runs and repeats all matching tests. Specify
the number of repetitions using the Gradle project property
testRepetitions
. Example:./gradlew :nqeit-iceberg-flink-1.15:intTest -PtestRepetitions=3
./gradlew\
:nqeit-iceberg-spark-3.1:intTest\
:nqeit-iceberg-spark-3.2_2.13:intTest\
-Dnessie.versionNessie=0.50.0\
-Dnessie.versionIceberg=1.1.0
(See below now to set up a Git worktree for Nessie.)
./gradlew\
:nqeit-iceberg-spark-3.1:intTest\
:nqeit-iceberg-spark-3.2_2.13:intTest\
-Dnessie.versionIceberg=1.1.0
Local Spark cluster:
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh
Running the Spark/Iceberg tests:
# 1. Ensure your local Spark master + worker(s) are up and running
# 2. Get Spark master URL
./gradlew\
:nqeit-iceberg-spark-3.1:intTest\
:nqeit-iceberg-spark-3.2:intTest\
-Dspark.master.url=spark://localhost:7077
Local Spark cluster (Spark 3.2 for example):
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh
Running the Spark/Iceberg tests:
# 1. Ensure your local Spark master + worker(s) are up and running
# 2. Get Spark master URL
# 3. Start local Nessie server
# 4. Get Nessie REST API base URI
./gradlew\
:nqeit-iceberg-spark-3.1:intTest\
:nqeit-iceberg-spark-3.2:intTest\
-Dspark.master.url=spark://localhost:7077\
-Dnessie.externalNessieUrl=http://localhost:19120/api/v1
Local Flink cluster (Flink 1.15 for example):
$FLINK_HOME/bin/start-cluster.sh
Running the Flink/Iceberg tests:
# 1. Ensure your local Flink cluster is up and running
# 2. Get Flink host + port values
# 3. Start local Nessie server
# 4. Get Nessie REST API base URI
./gradlew\
:nqeit-iceberg-flink-1.14:intTest\
:nqeit-iceberg-flink-1.15:intTest\
-Dflink.remote.host=127.0.0.1\
-Dflink.remote.port=8081\
-Dnessie.externalNessieUrl=http://localhost:19120/api/v1
The matrix to list all possible project version combinations would be pretty huge. This project aims to reduce the amount of test and build code needed. The same test code is used to run against different versions of the tested query engine. This means that there is only one set of test classes and only one build script for Iceberg+Flink and for Iceberg-Spark. The build scripts take care of generating the "correct" test version matrix.
The integration tests "just use" the local (embedded) or provided (external) Spark cluster, but do not provide any libraries to it. This means, that there's no "spark-submit" or the like. Using or relying on the existence of for example Iceberg or Nessie classes inside the Spark workers will result in failures.
However, using "standard Spark SQL" or "standard Spark Dataset/DataFrames/RDDs" works as expected.
The dremio integration tests requires a Base-URL, PAT(Personal Access Token), Project-id and the Catalog-name. If the following arguments are not provided, the tests will be skipped.
./gradlew :nqeit-iceberg-dremio:intTest -PtestJvmArgs="\
-Ddremio.url=<dremio-url> \
-Ddremio.token=<token> \
-Ddremio.project-id=<project-id> \
-Ddremio.catalog-name=<catalog-name>"
./gradlew :nqeit-iceberg-dremio:intTest \
-PtestJvmArgs="\
-Ddremio.url=<dremio-url> \
-Ddremio.token=<token> \
-Ddremio.project-id=<project-id> \
-Ddremio.catalog-name=<catalog-name> \
-Dnessie.client.uri=<nessie-uri> \
-Dnessie.client.authentication.type=BEARER \
-Dnessie.client.authentication.token=<token>"
# 1. Ensure that the following env variables are set: AWS_SECRET_ACCESS_KEY, AWS_ACCESS_KEY_ID, AWS_REGION
# 2. AWS credentials are set in the Catalog Settings.
# 3. Privileges(CREATE TABLE, DROP, INSERT and SELECT) are provided in the Catalog Settings.
./gradlew :nqeit-cross-engine-3.2-2.12-1.14:intTest --tests "org.projectnessie.integtests.crossengine.ITCrossEngineDremio" -PtestJvmArgs="\
-Dnessie.client.uri=<nessie-uri> \
-Dnessie.client.authentication.type=BEARER \
-Dnessie.client.authentication.token=<token> \
-Dnessie.client.ref=<ref_name> \
-Ddremio.url=<dremio-url> \
-Ddremio.token=<token> \
-Ddremio.project-id=<project-id> \
-Ddremio.catalog-name=<catalog-name> \
-Dnessie.inttest.location.iceberg.warehouse==s3://my-bucket \
-Dnessie.client.io-impl=org.apache.iceberg.aws.s3.S3FileIO"
Note:
dremio-url
anddremio-token
must be set according to the following documentation: dremio-url and dremio-token- The
project-id
can be found on the Sonar Projects Setting page, under General Information named as "Project-ID". - The
nessie-uri
andcatalog-name
can be found on the Catalog Settings page, under General Information named as "Catalog Endpoint" and "Catalog Name".
The recommended way to "link" this project to "latest Nessie" and "latest Iceberg" is to put those
into the included-builds/
directory.
All "linked projects" (Nessie, Iceberg) must include the code changes (patches) that are necessary
to make the code bases work together. Think: Nessie requires code changes on top of the main
branch to let Nessie's Spark extensions work with the latest version of Iceberg. For this reason,
we maintain "integrations branches" with the necessary changes.
Notes:
- including Nessie source builds only works with Nessie built with Gradle.
- including Iceberg source builds with Git worktrees will NOT work anymore for recent versions of Iceberg >= 1.1.0. This is because the Iceberg build now requires a Gradle plugin that is not compatible with worktrees, see this issue for more details.
The easiest way to implement this locally is to use Git worktree.
- Clone this repository and save the path in
NESSIE_INTEGRATION_TESTS
git clone https://github.com/projectnessie/query-engine-integration-tests NESSIE_INTEGRATION_TESTS=$(realpath query-engine-integration-tests)
- Go to your local Nessie clone and create a Git worktree in the
included-builds/
directory.cd PATH_TO_YOUR_LOCAL_NESSIE_CLONE git worktree add --track -b integ-tests-main ${NESSIE_INTEGRATION_TESTS}/included-builds/nessie origin/main
- Go to your local Iceberg clone and create a Git worktree in the
included-builds/
directory.cd PATH_TO_YOUR_LOCAL_ICEBERG_CLONE git worktree add --track -b integ-tests-main ${NESSIE_INTEGRATION_TESTS}/included-builds/iceberg origin/main
As an alternative, you can also create symbolic links called nessie
and iceberg
to your local
clones/worktrees with the "right" code.
Symlinks are less convenient than worktrees, especially if code changes are necessary, but they are currently the only alternative that works with recent versions of Iceberg.
Example:
ln -s INSERT_PATH_TO_YOUR_LOCAL_NESSIE_CLONE included-builds/nessie
ln -s INSERT_PATH_TO_YOUR_LOCAL_ICEBERG_CLONE included-builds/iceberg
Canary build:
./gradlew :nessie:nessie-client:jar :iceberg:iceberg-core:jar :iceberg:iceberg-nessie:jar
Run Iceberg/Nessie tests:
./gradlew :iceberg:iceberg-nessie:test
Run Nessie Spark 3.2 Extensions tests:
./gradlew :nessie:clients:spark-32-extensions:intTest
Run the actual integrations tests:
./gradlew intTest
This project can also use Maven projects, currently Presto, which has support for Nessie.
While Gradle supports included builds and supports substitution of dependencies, Maven builds can only rely on local Maven repositories. This means, that before any Maven based project can be tested, Nessie and Iceberg need to be built and published to the local Maven repository.
This project has, in theory, everything that's needed to publish snapshot artifacts to your local
Maven repo directly by just running ./gradlew publisLocal
. Sadly doesn't work yet.
CI builds are triggered using GitHub actions. See main.yml.
CI will fetch the latest commit from any project's main/master branch and apply the necessary changes by merging the configured "patch branch".