Skip to content

Commit

Permalink
update ballista
Browse files Browse the repository at this point in the history
  • Loading branch information
Jiayu Liu committed May 21, 2021
1 parent 1fecbdf commit 103c87c
Show file tree
Hide file tree
Showing 10 changed files with 44 additions and 34 deletions.
11 changes: 5 additions & 6 deletions ballista/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,14 +19,14 @@

# Ballista: Distributed Compute with Apache Arrow

Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as
Ballista is a distributed compute platform primarily implemented in Rust, and powered by Apache Arrow. It is built
on an architecture that allows other programming languages (such as Python, C++, and Java) to be supported as
first-class citizens without paying a penalty for serialization costs.

The foundational technologies in Ballista are:

- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient
data transfer between processes.
- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
Expand Down Expand Up @@ -57,7 +57,6 @@ April 2021 and should be considered experimental.

## Getting Started

The [Ballista Developer Documentation](docs/README.md) and the
[DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide) are currently the
The [Ballista Developer Documentation](docs/README.md) and the
[DataFusion User Guide](https://github.com/apache/arrow-datafusion/tree/master/docs/user-guide) are currently the
best sources of information for getting started with Ballista.

8 changes: 4 additions & 4 deletions ballista/docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,19 +16,19 @@
specific language governing permissions and limitations
under the License.
-->

# Ballista Developer Documentation

This directory contains documentation for developers that are contributing to Ballista. If you are looking for
end-user documentation for a published release, please start with the
This directory contains documentation for developers that are contributing to Ballista. If you are looking for
end-user documentation for a published release, please start with the
[DataFusion User Guide](../../docs/user-guide) instead.

## Architecture & Design

- Read the [Architecture Overview](architecture.md) to get an understanding of the scheduler and executor
- Read the [Architecture Overview](architecture.md) to get an understanding of the scheduler and executor
processes and how distributed query execution works.

## Build, Test, Release

- Setting up a [development environment](dev-env.md).
- [Integration Testing](integration-testing.md)

30 changes: 15 additions & 15 deletions ballista/docs/architecture.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,37 +16,38 @@
specific language governing permissions and limitations
under the License.
-->

# Ballista Architecture

## Overview

Ballista allows queries to be executed in a distributed cluster. A cluster consists of one or
Ballista allows queries to be executed in a distributed cluster. A cluster consists of one or
more scheduler processes and one or more executor processes. See the following sections in this document for more
details about these components.

The scheduler accepts logical query plans and translates them into physical query plans using DataFusion and then
runs a secondary planning/optimization process to translate the physical query plan into a distributed physical
query plan.
The scheduler accepts logical query plans and translates them into physical query plans using DataFusion and then
runs a secondary planning/optimization process to translate the physical query plan into a distributed physical
query plan.

This process breaks a query down into a number of query stages that can be executed independently. There are
dependencies between query stages and these dependencies form a directionally-acyclic graph (DAG) because a query
This process breaks a query down into a number of query stages that can be executed independently. There are
dependencies between query stages and these dependencies form a directionally-acyclic graph (DAG) because a query
stage cannot start until its child query stages have completed.

Each query stage has one or more partitions that can be processed in parallel by the available
Each query stage has one or more partitions that can be processed in parallel by the available
executors in the cluster. This is the basic unit of scalability in Ballista.

The following diagram shows the flow of requests and responses between the client, scheduler, and executor
processes.
The following diagram shows the flow of requests and responses between the client, scheduler, and executor
processes.

![Query Execution Flow](images/query-execution.png)

## Scheduler Process

The scheduler process implements a gRPC interface (defined in
The scheduler process implements a gRPC interface (defined in
[ballista.proto](../rust/ballista/proto/ballista.proto)). The interface provides the following methods:

| Method | Description |
|----------------------|----------------------------------------------------------------------|
| -------------------- | -------------------------------------------------------------------- |
| ExecuteQuery | Submit a logical query plan or SQL query for execution |
| GetExecutorsMetadata | Retrieves a list of executors that have registered with a scheduler |
| GetFileMetadata | Retrieve metadata about files available in the cluster file system |
Expand All @@ -60,7 +61,7 @@ The scheduler can run in standalone mode, or can be run in clustered mode using
The executor process implements the Apache Arrow Flight gRPC interface and is responsible for:

- Executing query stages and persisting the results to disk in Apache Arrow IPC Format
- Making query stage results available as Flights so that they can be retrieved by other executors as well as by
- Making query stage results available as Flights so that they can be retrieved by other executors as well as by
clients

## Rust Client
Expand All @@ -69,7 +70,6 @@ The Rust client provides a DataFrame API that is a thin wrapper around the DataF
the means for a client to build a query plan for execution.

The client executes the query plan by submitting an `ExecuteLogicalPlan` request to the scheduler and then calls
`GetJobStatus` to check for completion. On completion, the client receives a list of locations for the Flights
containing the results for the query and will then connect to the appropriate executor processes to retrieve
`GetJobStatus` to check for completion. On completion, the client receives a list of locations for the Flights
containing the results for the query and will then connect to the appropriate executor processes to retrieve
those results.

5 changes: 3 additions & 2 deletions ballista/docs/dev-env.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,13 +16,14 @@
specific language governing permissions and limitations
under the License.
-->

# Setting up a Rust development environment

You will need a standard Rust development environment. The easiest way to achieve this is by using rustup: https://rustup.rs/

## Install OpenSSL

Follow instructions for [setting up OpenSSL](https://docs.rs/openssl/0.10.28/openssl/). For Ubuntu users, the following
Follow instructions for [setting up OpenSSL](https://docs.rs/openssl/0.10.28/openssl/). For Ubuntu users, the following
command works.

```bash
Expand All @@ -35,4 +36,4 @@ You'll need cmake in order to compile some of ballista's dependencies. Ubuntu us

```bash
sudo apt-get install cmake
```
```
5 changes: 3 additions & 2 deletions ballista/docs/integration-testing.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,10 +16,11 @@
specific language governing permissions and limitations
under the License.
-->

# Integration Testing

We use the [DataFusion Benchmarks](https://github.com/apache/arrow-datafusion/tree/master/benchmarks) for integration
testing.
We use the [DataFusion Benchmarks](https://github.com/apache/arrow-datafusion/tree/master/benchmarks) for integration
testing.

The integration tests can be executed by running the following command from the root of the DataFusion repository.

Expand Down
2 changes: 1 addition & 1 deletion ballista/rust/client/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,5 +18,5 @@
-->

# Ballista - Rust
This crate contains the Ballista client library. For an example usage, please refer [here](../benchmarks/tpch/README.md).

This crate contains the Ballista client library. For an example usage, please refer [here](../benchmarks/tpch/README.md).
1 change: 1 addition & 0 deletions ballista/rust/core/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,4 +18,5 @@
-->

# Ballista - Rust

This crate contains the core Ballista types.
3 changes: 2 additions & 1 deletion ballista/rust/executor/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
-->

# Ballista Executor - Rust

This crate contains the Ballista Executor. It can be used both as a library or as a binary.

## Run
Expand All @@ -28,4 +29,4 @@ RUST_LOG=info cargo run --release
[2021-02-11T05:30:13Z INFO executor] Running with config: ExecutorConfig { host: "localhost", port: 50051, work_dir: "/var/folders/y8/fc61kyjd4n53tn444n72rjrm0000gn/T/.tmpv1LjN0", concurrent_tasks: 4 }
```
By default, the executor will bind to `localhost` and listen on port `50051`.
By default, the executor will bind to `localhost` and listen on port `50051`.
9 changes: 6 additions & 3 deletions ballista/rust/scheduler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,7 @@
-->

# Ballista Scheduler

This crate contains the Ballista Scheduler. It can be used both as a library or as a binary.

## Run
Expand All @@ -32,8 +33,9 @@ $ RUST_LOG=info cargo run --release
By default, the scheduler will bind to `localhost` and listen on port `50051`.
## Connecting to Scheduler
Scheduler supports REST model also using content negotiation.
For e.x if you want to get list of executors connected to the scheduler,
Scheduler supports REST model also using content negotiation.
For e.x if you want to get list of executors connected to the scheduler,
you can do (assuming you use default config)
```bash
Expand All @@ -43,7 +45,8 @@ curl --request GET \
```
## Scheduler UI
A basic ui for the scheduler is in `ui/scheduler` of the ballista repo.
A basic ui for the scheduler is in `ui/scheduler` of the ballista repo.
It can be started using the following [yarn](https://yarnpkg.com/) command
```bash
Expand Down
4 changes: 4 additions & 0 deletions ballista/ui/scheduler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,9 @@
## Start project from source

### Run scheduler/executor

First, run scheduler from project:

```shell
$ cd rust/scheduler
$ RUST_LOG=info cargo run --release
Expand All @@ -34,6 +36,7 @@ $ RUST_LOG=info cargo run --release
```

and run executor in new terminal:

```shell
$ cd rust/executor
$ RUST_LOG=info cargo run --release
Expand All @@ -44,6 +47,7 @@ $ RUST_LOG=info cargo run --release
```
### Run Client project
```shell
$ cd ui/scheduler
$ yarn
Expand Down

0 comments on commit 103c87c

Please sign in to comment.