Update from upstream (#30)
* configure_me_codegen retroactively reserved on our `bind_host` parame… (apache#520)

* configure_me_codegen retroactively reserved on our `bind_host` parameter name

* Add label and pray

* Add more labels why not

* Prepare 0.10.0 Release (apache#522)

* bump version

* CHANGELOG

* Ballista gets a docker image!!! (apache#521)

* Ballista gets a docker image!!!

* Enable flight sql

* Allow executing startup script

* Allow executing executables

* Clippy

* Remove capture group (apache#527)

* fix python build in CI (apache#528)

* fix python build in CI

* save progress

* use same min rust version in all crates

* fix

* use image from pyo3

* use newer image from pyo3

* do not require protoc

* wheels now generated

* rat - exclude generated file

* Update docs for simplified instructions (apache#532)

* Update docs for simplified instructions

* Fix whoopsie

* Update docs/source/user-guide/flightsql.md

Co-authored-by: Andy Grove <andygrove73@gmail.com>

Co-authored-by: Andy Grove <andygrove73@gmail.com>

* remove --locked (apache#533)

* Bump actions/labeler from 4.0.2 to 4.1.0 (apache#525)

* Provide a memory StateBackendClient (apache#523)

* Rename StateBackend::Standalone to StateBackend:Sled

* Copy utility files from sled crate since they cannot be used directly

* Provide a memory StateBackendClient

* Fix dashmap deadlock issue

* Fix for the comments

Co-authored-by: yangzhong <yangzhong@ebay.com>

* only build docker images on rc tags (apache#535)

* docs: fix style in the Helm readme (apache#551)

* Fix Helm chart's image format (apache#550)

* Update datafusion requirement from 14.0.0 to 15.0.0 (apache#552)

* Update datafusion requirement from 14.0.0 to 15.0.0

* Fix UT

* Fix python

* Fix python

* Fix Python

Co-authored-by: yangzhong <yangzhong@ebay.com>

* Make it concurrently to launch tasks to executors (apache#557)

* Make it concurrently to launch tasks to executors

* Refine for comments

Co-authored-by: yangzhong <yangzhong@ebay.com>

* fix(ui): fix last seen (apache#562)

* Support Alibaba Cloud OSS with ObjectStore (apache#567)

* Fix cargo clippy (apache#571)

Co-authored-by: yangzhong <yangzhong@ebay.com>

* Super minor spelling error (apache#573)

* Update env_logger requirement from 0.9 to 0.10 (apache#539)

Updates the requirements on [env_logger](https://github.com/rust-cli/env_logger) to permit the latest version.
- [Release notes](https://github.com/rust-cli/env_logger/releases)
- [Changelog](https://github.com/rust-cli/env_logger/blob/main/CHANGELOG.md)
- [Commits](rust-cli/env_logger@v0.9.0...v0.10.0)

---
updated-dependencies:
- dependency-name: env_logger
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update graphviz-rust requirement from 0.4.0 to 0.5.0 (apache#574)

Updates the requirements on [graphviz-rust](https://github.com/besok/graphviz-rust) to permit the latest version.
- [Release notes](https://github.com/besok/graphviz-rust/releases)
- [Changelog](https://github.com/besok/graphviz-rust/blob/master/CHANGELOG.md)
- [Commits](https://github.com/besok/graphviz-rust/commits)

---
updated-dependencies:
- dependency-name: graphviz-rust
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* updated readme to contain correct versions of dependencies. (apache#580)

* Fix benchmark image link (apache#596)

* Add support for Azure (apache#599)

* Remove outdated script and use evergreen version of rust (apache#597)

* Remove outdated script and use evergreen version of rust

* Use debian protobuf

* feat: update script such that ballista-cli image is built as well (apache#601)

* Fix Cargo.toml format issue (apache#616)

* Refactor executor main (apache#614)

* Refactor executor main

* copy all configs

* toml fmt

* Refactor scheduler main (apache#615)

* refactor scheduler main

* toml fmt

* Python: add method to get explain output as a string (apache#593)

* Update contributor guide (apache#617)

* Cluster state refactor part 1 (apache#560)

* Customize session builder

* Add setter for executor slots policy

* Construct Executor with functions

* Add queued and completed timestamps to successful job status

* Add public methods to SchedulerServer

* Public method for getting execution graph

* Public method for stage metrics

* Use node-level local limit (#20)

* Use node-level local limit

* serialize limit in shuffle writer

* Revert "Merge pull request #19 from coralogix/sc-5792"

This reverts commit 08140ef, reversing
changes made to a7f1384.

* add log

* make sure we don't forget limit for shuffle writer

* update accum correctly and try to break early

* Check local limit accumulator before polling for more data

* fix build

Co-authored-by: Martins Purins <martins.purins@coralogix.com>

* configure_me_codegen retroactively reserved on our `bind_host` parame… (apache#520)

* configure_me_codegen retroactively reserved on our `bind_host` parameter name

* Add label and pray

* Add more labels why not

* Add ClusterState trait

* Refactor slightly for clarity

* Revert "Use node-level local limit (#20)"

This reverts commit ff96bcd.

* Revert "Public method for stage metrics"

This reverts commit a802315.

* Revert "Public method for getting execution graph"

This reverts commit 490bda5.

* Revert "Add public methods to SchedulerServer"

This reverts commit 5ad27c0.

* Revert "Add queued and completed timestamps to successful job status"

This reverts commit c615fce.

* Revert "Construct Executor with functions"

This reverts commit 24d4830.

* Always forget the apache header

Co-authored-by: Martins Purins <martins.purins@coralogix.com>
Co-authored-by: Brent Gardner <brent.gardner@spaceandtime.io>

* replace master with main (apache#621)

* implement new release process (apache#623)

* add docs on who can release (apache#632)

* Upgrade to DataFusion 16 (again) (apache#636)

* Update datafusion dependency to the latest version (apache#612)

* Update datafusion dependency to the latest version

* Fix python

* Skip ut of test_window_lead due to apache/datafusion-python#135

* Fix clippy

---------

Co-authored-by: yangzhong <yangzhong@ebay.com>

* Upgrade to DataFusion 17 (apache#639)

* Upgrade to DF 17

* Restore original error handling functionality

* Customize session builder

* Construct Executor with functions

* Add queued and completed timestamps to successful job status

* Add public methods to SchedulerServer

* Public method for getting execution graph

* Public method for stage metrics

* Use node-level local limit (#20)

* Use node-level local limit

* serialize limit in shuffle writer

* Revert "Merge pull request #19 from coralogix/sc-5792"

This reverts commit 08140ef, reversing
changes made to a7f1384.

* add log

* make sure we don't forget limit for shuffle writer

* update accum correctly and try to break early

* Check local limit accumulator before polling for more data

* fix build

Co-authored-by: Martins Purins <martins.purins@coralogix.com>

* Add ClusterState trait

* Expose active job count

* Remove println

* Resubmit jobs when no resources available for scheduling

* Make parse_physical_expr public

* Reduce log spam

* Fix job submitted metric by ignoring resubmissions

* Record when job is queued in scheduler metrics (#28)

* Record when job is queueud in scheduler metrics

* add additional buckets for exec times

* Upstream rebase (#29)

* fmt

* clippy

* tomlfmt

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Brent Gardner <brent.gardner@spaceandtime.io>
Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: yahoNanJing <90197956+yahoNanJing@users.noreply.github.com>
Co-authored-by: yangzhong <yangzhong@ebay.com>
Co-authored-by: Xin Hao <haoxinst@gmail.com>
Co-authored-by: Duyet Le <5009534+duyet@users.noreply.github.com>
Co-authored-by: r.4ntix <antix.blue@antix.blue>
Co-authored-by: Jeremy Dyer <jdye64@gmail.com>
Co-authored-by: Sai Krishna Reddy Lakkam <86965352+saikrishna1-bidgely@users.noreply.github.com>
Co-authored-by: Aidan Kovacic <95648995+aidankovacic-8451@users.noreply.github.com>
Co-authored-by: Dan Harris <dan@thinkharder.dev>
Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com>
Co-authored-by: Martins Purins <martins.purins@coralogix.com>
Co-authored-by: Dan Harris <dan@coralogix.com>

* Post merge update

* update message formatting

* post merge update

* another post-merge updates

* update github actions

* clippy

* update script

* fmt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Brent Gardner <brent.gardner@spaceandtime.io>
Co-authored-by: Andy Grove <andygrove73@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: yahoNanJing <90197956+yahoNanJing@users.noreply.github.com>
Co-authored-by: yangzhong <yangzhong@ebay.com>
Co-authored-by: Xin Hao <haoxinst@gmail.com>
Co-authored-by: Duyet Le <5009534+duyet@users.noreply.github.com>
Co-authored-by: r.4ntix <antix.blue@antix.blue>
Co-authored-by: Jeremy Dyer <jdye64@gmail.com>
Co-authored-by: Sai Krishna Reddy Lakkam <86965352+saikrishna1-bidgely@users.noreply.github.com>
Co-authored-by: Aidan Kovacic <95648995+aidankovacic-8451@users.noreply.github.com>
Co-authored-by: Tim Van Wassenhove <tim@timvw.be>
Co-authored-by: Dan Harris <1327726+thinkharderdev@users.noreply.github.com>
Co-authored-by: Martins Purins <martins.purins@coralogix.com>
Co-authored-by: Brent Gardner <bgardner@squarelabs.net>
Co-authored-by: Dan Harris <dan@thinkharder.dev>
Co-authored-by: Dan Harris <dan@coralogix.com>
18 people authored Feb 2, 2023
1 parent 763aa23 commit 8bc5234
Showing 93 changed files with 3,193 additions and 2,952 deletions.
1 change: 1 addition & 0 deletions .dockerignore
@@ -11,4 +11,5 @@ target/
**/data
!target/release/ballista-scheduler
!target/release/ballista-executor
!target/release/ballista-cli
!target/release/tpch
212 changes: 35 additions & 177 deletions CONTRIBUTING.md
@@ -25,22 +25,28 @@ We welcome and encourage contributions of all kinds, such as:
2. Documentation improvements
3. Code (PR or PR Review)

In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs. Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.
In addition to submitting new PRs, we have a healthy tradition of community members helping review each other's PRs.
Doing so is a great way to help the community as well as get more familiar with Rust and the relevant codebases.

You can find a curated
[good-first-issue](https://github.com/apache/arrow-ballista/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
list to help you get started.

# Developer's guide
# Developer's Guide

This section describes how you can get started at developing DataFusion.
This section describes how you can get started with Ballista development.

For information on developing with Ballista, see the
[Ballista developer documentation](docs/developer/README.md).
## Bootstrap Environment

### Bootstrap environment
Ballista contains components implemented in the following programming languages:

DataFusion is written in Rust and it uses a standard rust toolkit:
- Rust (Scheduler and Executor processes, Client library)
- Python (Python bindings)
- Javascript (Scheduler Web UI)

### Rust Environment

We use the standard Rust development tools.

- `cargo build`
- `cargo fmt` to format the code
@@ -50,8 +56,6 @@ DataFusion is written in Rust and it uses a standard rust toolkit:
Testing setup:

- `rustup update stable` DataFusion uses the latest stable release of rust
- `git submodule init`
- `git submodule update`

Formatting instructions:

@@ -63,192 +67,46 @@

- [dev/rust_lint.sh](dev/rust_lint.sh)

## Test Organization

DataFusion has several levels of tests in its [Test
Pyramid](https://martinfowler.com/articles/practical-test-pyramid.html)
and tries to follow [Testing Organization](https://doc.rust-lang.org/book/ch11-03-test-organization.html) in the The Book.

This section highlights the most important test modules that exist

### Unit tests

Tests for the code in an individual module are defined in the same source file with a `test` module, following Rust convention

### Rust Integration Tests

There are several tests of the public interface of the DataFusion library in the [tests](https://github.com/apache/arrow-datafusion/blob/master/datafusion/tests) directory.

You can run these tests individually using a command such as

```shell
cargo test -p datafusion --tests sql_integration
```

One very important test is the [sql_integraton](https://github.com/apache/arrow-datafusion/blob/master/datafusion/tests/sql_integration.rs) test which validates DataFusion's ability to run a large assortment of SQL queries against an assortment of data setsups.

### SQL / Postgres Integration Tests

The [integration-tests](https://github.com/apache/arrow-datafusion/blob/master/datafusion/integration-tests] directory contains a harness that runs certain queries against both postgres and datafusion and compares results

#### setup environment

```shell
export POSTGRES_DB=postgres
export POSTGRES_USER=postgres
export POSTGRES_HOST=localhost
export POSTGRES_PORT=5432
```

#### Install dependencies

```shell
# Install dependencies
python -m pip install --upgrade pip setuptools wheel
python -m pip install -r integration-tests/requirements.txt

# setup environment
POSTGRES_DB=postgres POSTGRES_USER=postgres POSTGRES_HOST=localhost POSTGRES_PORT=5432 python -m pytest -v integration-tests/test_psql_parity.py

# Create
psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c 'CREATE TABLE IF NOT EXISTS test (
c1 character varying NOT NULL,
c2 integer NOT NULL,
c3 smallint NOT NULL,
c4 smallint NOT NULL,
c5 integer NOT NULL,
c6 bigint NOT NULL,
c7 smallint NOT NULL,
c8 integer NOT NULL,
c9 bigint NOT NULL,
c10 character varying NOT NULL,
c11 double precision NOT NULL,
c12 double precision NOT NULL,
c13 character varying NOT NULL
);'

psql -d "$POSTGRES_DB" -h "$POSTGRES_HOST" -p "$POSTGRES_PORT" -U "$POSTGRES_USER" -c "\copy test FROM '$(pwd)/testing/data/csv/aggregate_test_100.csv' WITH (FORMAT csv, HEADER true);"
```

#### Invoke the test runner

```shell
python -m pytest -v integration-tests/test_psql_parity.py
```

## Benchmarks
### Rust Process Configuration

### Criterion Benchmarks
The scheduler and executor processes can be configured using toml files, environment variables and command-line
arguments. The specification for config options can be found here:

[Criterion](https://docs.rs/criterion/latest/criterion/index.html) is a statistics-driven micro-benchmarking framework used by DataFusion for evaluating the performance of specific code-paths. In particular, the criterion benchmarks help to both guide optimisation efforts, and prevent performance regressions within DataFusion.
- [ballista/scheduler/scheduler_config_spec.toml](ballista/scheduler/scheduler_config_spec.toml)
- [ballista/executor/executor_config_spec.toml](ballista/executor/executor_config_spec.toml)

Criterion integrates with Cargo's built-in [benchmark support](https://doc.rust-lang.org/cargo/commands/cargo-bench.html) and a given benchmark can be run with
Those files fully define Ballista's configuration. If there is a discrepancy between this documentation and the
files, assume those files are correct.

```
cargo bench --bench BENCHMARK_NAME
```

A full list of benchmarks can be found [here](./datafusion/benches).

_[cargo-criterion](https://github.com/bheisler/cargo-criterion) may also be used for more advanced reporting._

#### Parquet SQL Benchmarks

The parquet SQL benchmarks can be run with

```
cargo bench --bench parquet_query_sql
```

These randomly generate a parquet file, and then benchmark queries sourced from [parquet_query_sql.sql](./datafusion/core/benches/parquet_query_sql.sql) against it. This can therefore be a quick way to add coverage of particular query and/or data paths.

If the environment variable `PARQUET_FILE` is set, the benchmark will run queries against this file instead of a randomly generated one. This can be useful for performing multiple runs, potentially with different code, against the same source data, or for testing against a custom dataset.

The benchmark will automatically remove any generated parquet file on exit, however, if interrupted (e.g. by CTRL+C) it will not. This can be useful for analysing the particular file after the fact, or preserving it to use with `PARQUET_FILE` in subsequent runs.
To get a list of command-line arguments, run the binary with `--help`.

### Upstream Benchmark Suites
There is an example config file at [ballista/executor/examples/example_executor_config.toml](ballista/executor/examples/example_executor_config.toml)

Instructions and tooling for running upstream benchmark suites against DataFusion and/or Ballista can be found in [benchmarks](./benchmarks).
The order of precedence for arguments is: default config file < environment variables < specified config file < command line arguments.

These are valuable for comparative evaluation against alternative Arrow implementations and query engines.
The executor and scheduler will look for the default config file at `/etc/ballista/[executor|scheduler].toml`. To
specify a config file, use the `--config-file` argument.

## How to add a new scalar function
Environment variables are prefixed by `BALLISTA_EXECUTOR` or `BALLISTA_SCHEDULER` for the executor and scheduler
respectively. Hyphens in command line arguments become underscores. For example, the `--scheduler-host` argument
for the executor becomes `BALLISTA_EXECUTOR_SCHEDULER_HOST`
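As a non-authoritative sketch of that naming rule (the binary and flag names below are assumed from the description above; the config spec files remain authoritative), the same scheduler host could be supplied to an executor in three ways, lowest to highest precedence:

```shell
# Sketch only: flag, file, and variable names are illustrative assumptions.

# 1. A config file passed with --config-file (lowest precedence here):
#      ballista-executor --config-file ./executor.toml
# 2. An environment variable (prefix + hyphens mapped to underscores):
#      BALLISTA_EXECUTOR_SCHEDULER_HOST=localhost ballista-executor
# 3. A command-line argument (highest precedence):
#      ballista-executor --scheduler-host localhost

# The hyphen-to-underscore mapping itself can be expressed as:
to_env_var() {
  # --scheduler-host (an executor flag) -> BALLISTA_EXECUTOR_SCHEDULER_HOST
  printf 'BALLISTA_EXECUTOR_%s\n' "$(printf '%s' "${1#--}" | tr 'a-z-' 'A-Z_')"
}
to_env_var --scheduler-host   # prints BALLISTA_EXECUTOR_SCHEDULER_HOST
```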

Below is a checklist of what you need to do to add a new scalar function to DataFusion:
### Python Environment

- Add the actual implementation of the function:
- [here](datafusion/physical-expr/src/string_expressions.rs) for string functions
- [here](datafusion/physical-expr/src/math_expressions.rs) for math functions
- [here](datafusion/physical-expr/src/datetime_expressions.rs) for datetime functions
- create a new module [here](datafusion/physical-expr/src) for other functions
- In [core/src/physical_plan](datafusion/core/src/physical_plan/functions.rs), add:
- a new variant to `BuiltinScalarFunction`
- a new entry to `FromStr` with the name of the function as called by SQL
- a new line in `return_type` with the expected return type of the function, given an incoming type
- a new line in `signature` with the signature of the function (number and types of its arguments)
- a new line in `create_physical_expr`/`create_physical_fun` mapping the built-in to the implementation
- tests to the function.
- In [core/tests/sql](datafusion/core/tests/sql), add a new test where the function is called through SQL against well known data and returns the expected result.
- In [core/src/logical_plan/expr](datafusion/core/src/logical_plan/expr.rs), add:
- a new entry of the `unary_scalar_expr!` macro for the new function.
- In [core/src/logical_plan/mod](datafusion/core/src/logical_plan/mod.rs), add:
- a new entry in the `pub use expr::{}` set.
Refer to the instructions in the Python Bindings [README](./python/README.md)

## How to add a new aggregate function
### Javascript Environment

Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
Refer to the instructions in the Scheduler Web UI [README](./ballista/scheduler/ui/README.md)

- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
- [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
- [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
- [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
- create a new module [here](datafusion/src/physical_plan) for other functions
- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
- a new variant to `BuiltinAggregateFunction`
- a new entry to `FromStr` with the name of the function as called by SQL
- a new line in `return_type` with the expected return type of the function, given an incoming type
- a new line in `signature` with the signature of the function (number and types of its arguments)
- a new line in `create_aggregate_expr` mapping the built-in to the implementation
- tests to the function.
- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
## Integration Tests

## How to display plans graphically

The query plans represented by `LogicalPlan` nodes can be graphically
rendered using [Graphviz](http://www.graphviz.org/).

To do so, save the output of the `display_graphviz` function to a file.:

```rust
// Create plan somehow...
let mut output = File::create("/tmp/plan.dot")?;
write!(output, "{}", plan.display_graphviz());
```

Then, use the `dot` command line tool to render it into a file that
can be displayed. For example, the following command creates a
`/tmp/plan.pdf` file:
The integration tests can be executed by running the following command from the root of the repository.

```bash
dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
./dev/integration-tests.sh
```

## Specification

We formalize DataFusion semantics and behaviors through specification
documents. These specifications are useful to be used as references to help
resolve ambiguities during development or code reviews.

You are also welcome to propose changes to existing specifications or create
new specifications as you see fit.

Here is the list current active specifications:

- [Output field name semantic](https://arrow.apache.org/datafusion/specification/output-field-name-semantic.html)
- [Invariants](https://arrow.apache.org/datafusion/specification/invariants.html)

All specifications are stored in the `docs/source/specification` folder.

## How to format `.md` document

We are using `prettier` to format `.md` files.
4 changes: 2 additions & 2 deletions Cargo.toml
@@ -17,13 +17,13 @@

[workspace]
members = [
"benchmarks",
"ballista-cli",
"ballista/client",
"ballista/core",
"ballista/executor",
"ballista/scheduler",
"benchmarks",
"examples",
"ballista-cli",
]
exclude = ["python"]

4 changes: 2 additions & 2 deletions ballista-cli/Cargo.toml
@@ -33,8 +33,8 @@ ballista = { path = "../ballista/client", version = "0.10.0", features = [
"standalone",
] }
clap = { version = "3", features = ["derive", "cargo"] }
datafusion = "15.0.0"
datafusion-cli = "15.0.0"
datafusion = "17.0.0"
datafusion-cli = "17.0.0"
dirs = "4.0.0"
env_logger = "0.10"
mimalloc = { version = "0.1", default-features = false }
6 changes: 3 additions & 3 deletions ballista-cli/src/command.rs
@@ -67,7 +67,7 @@ impl Command {
.map_err(BallistaError::DataFusionError)
}
Self::DescribeTable(name) => {
let df = ctx.sql(&format!("SHOW COLUMNS FROM {}", name)).await?;
let df = ctx.sql(&format!("SHOW COLUMNS FROM {name}")).await?;
let batches = df.collect().await?;
print_options
.print_batches(&batches, now)
@@ -97,10 +97,10 @@ impl Command {
Self::SearchFunctions(function) => {
if let Ok(func) = function.parse::<Function>() {
let details = func.function_details()?;
println!("{}", details);
println!("{details}");
Ok(())
} else {
let msg = format!("{} is not a supported function", function);
let msg = format!("{function} is not a supported function");
Err(BallistaError::NotImplemented(msg))
}
}
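Most of the hunks in this file (and in `exec.rs` below) are the same mechanical change: positional format arguments replaced by Rust's inline captured identifiers in format strings, a feature stable since Rust 1.58. A minimal before/after sketch:

```rust
fn main() {
    let err = "connection refused";
    // Before: the value is passed as a positional argument
    println!("Unknown error happened {:?}", err);
    // After: the identifier is captured directly inside the format string
    println!("Unknown error happened {err:?}");
    // Both forms produce exactly the same text
    assert_eq!(format!("{:?}", err), format!("{err:?}"));
}
```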
12 changes: 6 additions & 6 deletions ballista-cli/src/exec.rs
@@ -51,7 +51,7 @@ pub async fn exec_from_lines(
if line.ends_with(';') {
match exec_and_print(ctx, print_options, query).await {
Ok(_) => {}
Err(err) => println!("{:?}", err),
Err(err) => println!("{err:?}"),
}
query = "".to_owned();
} else {
@@ -68,7 +68,7 @@
if !query.is_empty() {
match exec_and_print(ctx, print_options, query).await {
Ok(_) => {}
Err(err) => println!("{:?}", err),
Err(err) => println!("{err:?}"),
}
}
}
@@ -110,7 +110,7 @@ pub async fn exec_from_repl(ctx: &BallistaContext, print_options: &mut PrintOpti
if let Err(e) =
command.execute(&mut print_options).await
{
eprintln!("{}", e)
eprintln!("{e}")
}
} else {
eprintln!(
@@ -124,7 +124,7 @@
}
_ => {
if let Err(e) = cmd.execute(ctx, &mut print_options).await {
eprintln!("{}", e)
eprintln!("{e}")
}
}
}
@@ -136,7 +136,7 @@ pub async fn exec_from_repl(ctx: &BallistaContext, print_options: &mut PrintOpti
rl.add_history_entry(line.trim_end());
match exec_and_print(ctx, &print_options, line).await {
Ok(_) => {}
Err(err) => eprintln!("{:?}", err),
Err(err) => eprintln!("{err:?}"),
}
}
Err(ReadlineError::Interrupted) => {
@@ -148,7 +148,7 @@ pub async fn exec_from_repl(ctx: &BallistaContext, print_options: &mut PrintOpti
break;
}
Err(err) => {
eprintln!("Unknown error happened {:?}", err);
eprintln!("Unknown error happened {err:?}");
break;
}
}

