We welcome and encourage contributions of all kinds, such as:
- Tickets with issue reports of feature requests
- Documentation improvements
- Code (PR or PR Review)
In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.
You can find a curated good-first-issue list to help you get started.
We welcome pull requests (PRs) from anyone from the community.
DataFusion is a very active fast-moving project and we try to review and merge PRs quickly to keep the review backlog down and the pace up. After review and approval, one of the many people with commit access will merge your PR.
Review bandwidth is currently our most limited resource, and we highly encourage reviews by the broader community. If you are waiting for your PR to be reviewed, consider helping review other PRs that are waiting. Such review both helps the reviewer to learn the codebase and become more expert, as well as helps identify issues in the PR (such as lack of test coverage), that can be addressed and make future reviews faster and more efficient.
Since we are a worldwide community, we have contributors in many timezones who review and comment. To ensure anyone who wishes has an opportunity to review a PR, our committers try to ensure that at least 24 hours passes between when a "major" PR is approved and when it is merged.
A "major" PR means there is a substantial change in design or a change in the API. Committers apply their best judgment to determine what constitutes a substantial change. A "minor" PR might be merged without a 24 hour delay, again subject to the judgment of the committer. Examples of potential "minor" PRs are:
- Documentation improvements/additions
- Small bug fixes
- Non-controversial build-related changes (clippy, version upgrades etc.)
- Smaller non-controversial feature additions
This section describes how you can get started at developing DataFusion.
wget https://az792536.vo.msecnd.net/vms/VMBuild_20190311/VirtualBox/MSEdge/MSEdge.Win10.VirtualBox.zip
choco install -y git rustup.install visualcpp-build-tools
git-bash.exe
cargo build
Compiling DataFusion from sources requires an installed version of the protobuf compiler, protoc
.
On most platforms this can be installed from your system's package manager
$ apt install -y protobuf-compiler
$ dnf install -y protobuf-compiler
$ pacman -S protobuf
$ brew install protobuf
You will want to verify the version installed is 3.12
or greater, which introduced support for explicit field presence. Older versions may fail to compile.
$ protoc --version
libprotoc 3.12.4
Alternatively a binary release can be downloaded from the Release Page or built from source.
DataFusion is written in Rust and it uses a standard rust toolkit:
cargo build
cargo fmt
to format the codecargo test
to test- etc.
Testing setup:
rustup update stable
DataFusion uses the latest stable release of rustgit submodule init
git submodule update
Formatting instructions:
or run them all at once:
DataFusion has several levels of tests in its Test Pyramid and tries to follow Testing Organization in the The Book.
This section highlights the most important test modules that exist
Tests for the code in an individual module are defined in the same source file with a test
module, following Rust convention
There are several tests of the public interface of the DataFusion library in the tests directory.
You can run these tests individually using a command such as
cargo test -p datafusion --tests sql_integration
One very important test is the sql_integration test which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setups.
The integration-tests directory contains a harness that runs certain queries against both postgres and datafusion and compares results
export POSTGRES_DB=postgres
export POSTGRES_USER=postgres
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
# Install dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r integration-tests/requirements.txt
# setup environment
POSTGRES_DB=postgres POSTGRES_USER=postgres POSTGRES_HOST=localhost POSTGRES_PORT=5432 python -m pytest -v integration-tests/test_psql_parity.py
# Create
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c 'CREATE TABLE IF NOT EXISTS test (
c1 character varying NOT NULL,
c2 integer NOT NULL,
c3 smallint NOT NULL,
c4 smallint NOT NULL,
c5 integer NOT NULL,
c6 bigint NOT NULL,
c7 smallint NOT NULL,
c8 integer NOT NULL,
c9 bigint NOT NULL,
c10 character varying NOT NULL,
c11 double precision NOT NULL,
c12 double precision NOT NULL,
c13 character varying NOT NULL
);'
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c "\copy test FROM '$(pwd)/testing/data/csv/aggregate_test_100.csv' WITH (FORMAT csv, HEADER true);"
python -m pytest -v integration-tests/test_psql_parity.py
Criterion is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion.
Criterion integrates with Cargo's built-in benchmark support and a given benchmark can be run with
cargo bench --bench BENCHMARK_NAME
A full list of benchmarks can be found here.
cargo-criterion may also be used for more advanced reporting.
The parquet SQL benchmarks can be run with
cargo bench --bench parquet_query_sql
These randomly generate a parquet file, and then benchmark queries sourced from parquet_query_sql.sql against it. This can therefore be a quick way to add coverage of particular query and/or data paths.
If the environment variable PARQUET_FILE
is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.
The benchmark will automatically remove any generated parquet file on exit, however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or preserving it to use with PARQUET_FILE
in subsequent runs.
Instructions and tooling for running upstream benchmark suites against DataFusion can be found in benchmarks.
These are valuable for comparative evaluation against alternative Arrow implementations and query engines.
Below is a checklist of what you need to do to add a new scalar function to DataFusion:
- Add the actual implementation of the function:
- In physical-expr/src, add:
- a new variant to
BuiltinScalarFunction
- a new entry to
FromStr
with the name of the function as called by SQL - a new line in
return_type
with the expected return type of the function, given an incoming type - a new line in
signature
with the signature of the function (number and types of its arguments) - a new line in
create_physical_expr
/create_physical_fun
mapping the built-in to the implementation - tests to the function.
- a new variant to
- In core/tests/sql, add a new test where the function is called through SQL against well known data and returns the expected result.
- In expr/src/expr_fn.rs, add:
- a new entry of the
unary_scalar_expr!
macro for the new function.
- a new entry of the
Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
- Add the actual implementation of an
Accumulator
andAggregateExpr
: - In datafusion/expr/src, add:
- a new variant to
AggregateFunction
- a new entry to
FromStr
with the name of the function as called by SQL - a new line in
return_type
with the expected return type of the function, given an incoming type - a new line in
signature
with the signature of the function (number and types of its arguments) - a new line in
create_aggregate_expr
mapping the built-in to the implementation - tests to the function.
- a new variant to
- In tests/sql, add a new test where the function is called through SQL against well known data and returns the expected result.
The query plans represented by LogicalPlan
nodes can be graphically
rendered using Graphviz.
To do so, save the output of the display_graphviz
function to a file.:
// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz());
Then, use the dot
command line tool to render it into a file that
can be displayed. For example, the following command creates a
/tmp/plan.pdf
file:
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
We formalize DataFusion semantics and behaviors through specification documents. These specifications are useful to be used as references to help resolve ambiguities during development or code reviews.
You are also welcome to propose changes to existing specifications or create new specifications as you see fit.
Here is the list current active specifications:
All specifications are stored in the docs/source/specification
folder.
We are using prettier
to format .md
files.
You can either use npm i -g prettier
to install it globally or use npx
to run it as a standalone binary. Using npx
required a working node environment. Upgrading to the latest prettier is recommended (by adding --upgrade
to the npm
command).
$ prettier --version
2.3.0
After you've confirmed your prettier version, you can format all the .md
files:
prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md