
Clean up README.md in advance of the 5.0 release #536

Merged · 7 commits · Jul 13, 2021

Changes from all commits
145 changes: 103 additions & 42 deletions CONTRIBUTING.md
@@ -17,61 +17,122 @@
under the License.
-->

# How to contribute to Apache Arrow
Comment from the PR author (Contributor):
This file was still a copy of the main arrow repo's contributing guidelines. I have replaced the content.

It might be worth spending some time breaking the contributing guide out from the developer guide, but perhaps we can do that as a follow-on PR. At least after this PR the content is no longer inaccurate (e.g., referring to JIRA).

## Developer's guide to Arrow Rust

## Did you find a bug?
### How to compile

The Arrow project uses JIRA as a bug tracker. To report a bug, you'll have
to first create an account on the
[Apache Foundation JIRA](https://issues.apache.org/jira/). The JIRA server
hosts bugs and issues for multiple Apache projects. The JIRA project name
for Arrow is "ARROW".
This is a standard cargo project with workspaces. To build it, you need to have `rust` and `cargo`:

To be assigned to an issue, ask an Arrow JIRA admin to go to
[Arrow Roles](https://issues.apache.org/jira/plugins/servlet/project-config/ARROW/roles),
click "Add users to a role," and add you to the "Contributor" role. Most
committers are authorized to do this; if you're a committer and aren't
able to load that project admin page, have someone else add you to the
necessary role.
```bash
cargo build
```

Before you create a new bug entry, we recommend you first
[search](https://issues.apache.org/jira/projects/ARROW/issues/ARROW-5140?filter=allopenissues)
among existing Arrow issues.
You can also use rust's official docker image:

When you create a new JIRA entry, please don't forget to fill the "Component"
field. Arrow has many subcomponents and this helps triaging and filtering
tremendously. Also, we conventionally prefix the issue title with the component
name in brackets, such as "[C++] Crash in Array::Frobnicate()", so as to make
lists easier to navigate, and we'd be grateful if you did the same.
```bash
docker run --rm -v $(pwd):/arrow-rs -it rust /bin/bash -c "cd /arrow-rs && rustup component add rustfmt && cargo build"
```

## Did you write a patch that fixes a bug or brings an improvement?
The command above assumes that you are in the root directory of the project, not in the same
directory as this README.md.

First create a JIRA entry as described above. Then, submit your changes
as a GitHub Pull Request. We'll ask you to prefix the pull request title
with the JIRA issue number and the component name in brackets.
(for example: "ARROW-2345: [C++] Fix crash in Array::Frobnicate()").
Respecting this convention makes it easier for us to process the backlog
of submitted Pull Requests.
You can also compile specific workspaces:

### Minor Fixes
```bash
cd arrow && cargo build
```

Any functionality change should have a JIRA opened. For minor changes that
affect documentation, you do not need to open up a JIRA. Instead you can
prefix the title of your PR with "MINOR: " if it meets the following guidelines:
### Git Submodules

- Grammar, usage and spelling fixes that affect no more than 2 files
- Documentation updates affecting no more than 2 files and not more
than 500 words.
Before running tests and examples, it is necessary to set up the local development environment.

## Do you want to propose a significant new feature or an important refactoring?
The tests rely on test data that is contained in git submodules.

We ask that all discussions about major changes in the codebase happen
publicly on the [arrow-dev mailing-list](https://mail-archives.apache.org/mod_mbox/arrow-dev/).
To pull down this data run the following:

## Do you have questions about the source code, the build procedure or the development process?
```bash
git submodule update --init
```

You can also ask on the mailing list; see above.
This populates data in two git submodules:

## Further information
- `../parquet-testing/data` (sourced from https://github.com/apache/parquet-testing.git)
- `../testing` (sourced from https://github.com/apache/arrow-testing)

Please read our [development documentation](https://arrow.apache.org/docs/developers/contributing.html).
By default, `cargo test` will look for these directories at their
standard location. The following environment variables can be used to override the location:

```bash
# Optionally specify a different location for test data
export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd)
export ARROW_TEST_DATA=$(cd ../testing/data; pwd)
```

From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual.

### Running the tests

Run tests using the Rust standard `cargo test` command:

```bash
# run all tests.
cargo test


# run only tests for the arrow crate
cargo test -p arrow
```
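Tests follow the standard Rust conventions. As a purely illustrative sketch (not taken from the codebase), a unit test module might look like this:

```rust
// Hypothetical example of a unit test in the usual Rust style; real tests live
// alongside the code they cover in each crate and are run with `cargo test`.
#[cfg(test)]
mod tests {
    use arrow::array::{Array, StringArray};

    #[test]
    fn builds_a_string_array() {
        // Construct a StringArray from a vector of string slices.
        let array = StringArray::from(vec!["arrow", "parquet", "flight"]);
        assert_eq!(array.len(), 3);
        assert_eq!(array.value(1), "parquet");
    }
}
```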

## Code Formatting

Our CI uses `rustfmt` to check code formatting. Before submitting a
PR, be sure to run the following and fix any reported formatting issues:

```bash
cargo +stable fmt --all -- --check
```

## Clippy Lints

We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings.

Run the following to check for clippy lints.

```bash
cargo clippy
```

If you use Visual Studio Code with the `rust-analyzer` plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881.

One of the concerns with `clippy` is that it often produces a lot of false positives, or that some recommendations may hurt readability. We do not have a policy of which lints are ignored, but if you disagree with a `clippy` lint, you may disable the lint and briefly justify it.

Search for `allow(clippy::` in the codebase to identify lints that are ignored/allowed. We currently prefer ignoring lints at the lowest unit possible, as illustrated by the list and the sketch that follow.

- If you are introducing a line that triggers a lint warning or error, you may disable the lint on that line.
- If you have several lints on a function or module, you may disable the lint on the function or module.
- If a lint is pervasive across multiple modules, you may disable it at the crate level.
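As a hypothetical illustration (not taken from the codebase), a function-scoped allow with a short justification might look like this:

```rust
// Hypothetical example: suppressing a single clippy lint at function scope.
// The comment next to the attribute records the justification.
#[allow(clippy::too_many_arguments)] // signature mirrors the external API it wraps
fn describe_point(
    label: &str,
    x: f64,
    y: f64,
    z: f64,
    dx: f64,
    dy: f64,
    dz: f64,
    scale: f64,
) -> String {
    format!(
        "{} at ({}, {}, {}), delta ({}, {}, {}), scale {}",
        label, x, y, z, dx, dy, dz, scale
    )
}

fn main() {
    println!("{}", describe_point("p0", 1.0, 2.0, 3.0, 0.1, 0.2, 0.3, 1.0));
}
```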

## Git Pre-Commit Hook

We can use [git pre-commit hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) to automate various kinds of git pre-commit checking/formatting.

Suppose you are in the root directory of the project.

First check if the file already exists:

```bash
ls -l .git/hooks/pre-commit
```

If the file already exists, check its link target or contents before replacing it, so that you
do not accidentally **override** an existing hook. If it does not exist, soft link [pre-commit.sh](pre-commit.sh) to `.git/hooks/pre-commit`:

```bash
ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit
```

If you occasionally want to commit without running the checks, pass `--no-verify` to `git commit`:

```bash
git commit --no-verify -m "... commit message ..."
```
167 changes: 21 additions & 146 deletions README.md
@@ -17,167 +17,42 @@
under the License.
-->

# Native Rust implementation of Apache Arrow
# Native Rust implementation of Apache Arrow and Parquet

[![Coverage Status](https://codecov.io/gh/apache/arrow/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow?branch=master)

Welcome to the implementation of Arrow, the popular in-memory columnar format, in [Rust](https://www.rust-lang.org/).

This part of the Arrow project is divided into 4 main components:
This repo contains the following main components:

| Crate | Description | Documentation |
| ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- |
| Arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) |
| Parquet | Parquet support | [(README)](parquet/README.md) |
| Arrow-flight | Arrow data between processes | [(README)](arrow-flight/README.md) |
| DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) |
| Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) |

Independently, they support a vast array of functionality for in-memory computations.
| Crate | Description | Documentation |
| ------------ | ------------------------------------------------------------------ | ---------------------------------- |
| arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) |
| parquet | Support for Parquet columnar file format | [(README)](parquet/README.md) |
| arrow-flight | Support for Arrow-Flight IPC protocol | [(README)](arrow-flight/README.md) |

Together, they allow users to write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send it to another process (using the `arrow-flight` crate).
There are two related crates maintained in a different repository:
| Crate | Description | Documentation |
| ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- |
| DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) |
| Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) |

Generally speaking, the `arrow` crate offers functionality to develop code that uses Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions.
Collectively, these crates support a vast array of functionality for analytic computations in Rust.

There are too many features to enumerate here, but some notable mentions:
For example, you can write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send it to another process (using the `arrow-flight` crate).

- `Arrow` implements all formats in the specification except certain dictionaries
- `Arrow` supports SIMD for some of its vertical operations
- `DataFusion` supports `async` execution
- `DataFusion` supports user-defined functions, aggregates, and whole execution nodes
Generally speaking, the `arrow` crate offers functionality for using Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions.

You can find more details about each crate in their respective READMEs.
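For a flavor of the `arrow` crate's API, here is a minimal, illustrative sketch (assuming `arrow` is added as a dependency) that builds an array and runs an aggregate compute kernel over it:

```rust
use arrow::array::{Array, Int32Array};
use arrow::compute::kernels::aggregate::sum;

fn main() {
    // Build an Int32 array with a null in the middle.
    let array = Int32Array::from(vec![Some(1), None, Some(3)]);

    assert_eq!(array.len(), 3);
    assert!(array.is_null(1));

    // Sum the non-null values with a compute kernel.
    assert_eq!(sum(&array), Some(4));
}
```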

## Arrow Rust Community

We use the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. This is
a great place to meet other contributors and get guidance on where to contribute. Join us in the `arrow-rust` channel.

We use [ASF JIRA](https://issues.apache.org/jira/secure/Dashboard.jspa) as the system of record for new features
and bug fixes and this plays a critical role in the release process.

For design discussions we generally collaborate on Google documents and file a JIRA linking to the document.

There is also a bi-weekly Rust-specific sync call for the Arrow Rust community. This is hosted on Google Meet
at https://meet.google.com/ctp-yujs-aee on alternate Wednesdays at 09:00 US/Pacific, 12:00 US/Eastern. During
US daylight savings time this corresponds to 16:00 UTC and at other times this is 17:00 UTC.

## Developer's guide to Arrow Rust

### How to compile

This is a standard cargo project with workspaces. To build it, you need to have `rust` and `cargo`:

```bash
cargo build
```

You can also use rust's official docker image:

```bash
docker run --rm -v $(pwd):/arrow-rs -it rust /bin/bash -c "cd /arrow-rs && rustup component add rustfmt && cargo build"
```

The command above assumes that you are in the root directory of the project, not in the same
directory as this README.md.

You can also compile specific workspaces:

```bash
cd arrow && cargo build
```

### Git Submodules

Before running tests and examples, it is necessary to set up the local development environment.

The tests rely on test data that is contained in git submodules.

To pull down this data run the following:

```bash
git submodule update --init
```

This populates data in two git submodules:

- `../parquet-testing/data` (sourced from https://github.com/apache/parquet-testing.git)
- `../testing` (sourced from https://github.com/apache/arrow-testing)

By default, `cargo test` will look for these directories at their
standard location. The following environment variables can be used to override the location:

```bash
# Optionally specify a different location for test data
export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd)
export ARROW_TEST_DATA=$(cd ../testing/data; pwd)
```
The `dev@arrow.apache.org` mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found at the [Arrow Community](https://arrow.apache.org/community/) page. All major announcements and communications happen there.

From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual.
The Rust Arrow community also uses the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. This is
a great place to meet other contributors and get guidance on where to contribute. Join us in the `#arrow-rust` channel.

### Running the tests

Run tests using the Rust standard `cargo test` command:

```bash
# run all tests.
cargo test


# run only tests for the arrow crate
cargo test -p arrow
```

## Code Formatting

Our CI uses `rustfmt` to check code formatting. Before submitting a
PR, be sure to run the following and fix any reported formatting issues:

```bash
cargo +stable fmt --all -- --check
```

## Clippy Lints

We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings.

Run the following to check for clippy lints.

```bash
cargo clippy
```

If you use Visual Studio Code with the `rust-analyzer` plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881.

One of the concerns with `clippy` is that it often produces a lot of false positives, or that some recommendations may hurt readability. We do not have a policy of which lints are ignored, but if you disagree with a `clippy` lint, you may disable the lint and briefly justify it.

Search for `allow(clippy::` in the codebase to identify lints that are ignored/allowed. We currently prefer ignoring lints on the lowest unit possible.

- If you are introducing a line that triggers a lint warning or error, you may disable the lint on that line.
- If you have several lints on a function or module, you may disable the lint on the function or module.
- If a lint is pervasive across multiple modules, you may disable it at the crate level.

## Git Pre-Commit Hook

We can use [git pre-commit hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) to automate various kinds of git pre-commit checking/formatting.

Suppose you are in the root directory of the project.

First check if the file already exists:

```bash
ls -l .git/hooks/pre-commit
```

If the file already exists, check its link target or contents before replacing it, so that you
do not accidentally **override** an existing hook. If it does not exist, soft link [pre-commit.sh](pre-commit.sh) to `.git/hooks/pre-commit`:

```bash
ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit
```

If you occasionally want to commit without running the checks, pass `--no-verify` to `git commit`:
Unlike other parts of the Arrow ecosystem, the Rust implementation uses [GitHub issues](https://github.com/apache/arrow-rs/issues) as the system of record for new features
and bug fixes and this plays a critical role in the release process.

```bash
git commit --no-verify -m "... commit message ..."
```
For design discussions we generally collaborate on Google documents and file a github issue linking to the document.
13 changes: 10 additions & 3 deletions arrow-flight/README.md
@@ -21,8 +21,15 @@

[![Crates.io](https://img.shields.io/crates/v/arrow-flight.svg)](https://crates.io/crates/arrow-flight)

Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information.
## Usage

Add this to your Cargo.toml:

This crate simply provides the Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and provides an example that demonstrates how to build a Flight server implemented with Tonic.
```toml
[dependencies]
arrow-flight = "5.0"
```

Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information.

Note that building a Flight server also requires an implementation of Arrow IPC which is based on the Flatbuffers serialization framework. The Rust implementation of Arrow IPC is not yet complete although the generated Flatbuffers code is available as part of the core Arrow crate.
This crate provides a Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and includes an example that demonstrates how to build a Flight server with Tonic.
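As a rough, untested sketch only (not from this repository), assuming a Flight server is already listening on `localhost:50051` and that `tokio` and `tonic` are also added as dependencies, a client could list the available flights roughly like this:

```rust
use arrow_flight::flight_service_client::FlightServiceClient;
use arrow_flight::Criteria;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Connect to a Flight server assumed to be running locally (hypothetical address).
    let mut client = FlightServiceClient::connect("http://localhost:50051").await?;

    // Ask the server which flights (datasets) it can serve; an empty/default
    // Criteria requests everything.
    let response = client.list_flights(Criteria::default()).await?;

    // The reply is a gRPC stream of FlightInfo messages.
    let mut stream = response.into_inner();
    while let Some(info) = stream.message().await? {
        println!("flight: {:?}", info);
    }
    Ok(())
}
```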