From 0bb3cfbb058c4dbd247b58bb6af57ef94c9056b8 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jul 2021 06:54:01 -0400 Subject: [PATCH 1/7] Clean up parquet readme --- parquet/CONTRIBUTING.md | 42 ++++++++++++++++++++++++++++++ parquet/README.md | 57 ++--------------------------------------- 2 files changed, 44 insertions(+), 55 deletions(-) create mode 100644 parquet/CONTRIBUTING.md diff --git a/parquet/CONTRIBUTING.md b/parquet/CONTRIBUTING.md new file mode 100644 index 000000000000..67cd9288db1c --- /dev/null +++ b/parquet/CONTRIBUTING.md @@ -0,0 +1,42 @@ +## Build + +Run `cargo build` or `cargo build --release` to build in release mode. +Some features take advantage of SSE4.2 instructions, which can be +enabled by adding `RUSTFLAGS="-C target-feature=+sse4.2"` before the +`cargo build` command. + +## Test + +Run `cargo test` for unit tests. To also run tests related to the binaries, use `cargo test --features cli`. + +## Binaries + +The following binaries are provided (use `cargo install --features cli` to install them): + +- **parquet-schema** for printing Parquet file schema and metadata. + `Usage: parquet-schema <file-path>`, where `file-path` is the path to a Parquet file. Use `-v/--verbose` flag + to print full metadata or schema only (when not specified only schema will be printed). + +- **parquet-read** for reading records from a Parquet file. + `Usage: parquet-read <file-path> [num-records]`, where `file-path` is the path to a Parquet file, + and `num-records` is the number of records to read from a file (when not specified all records will + be printed). Use `-j/--json` to print records in JSON lines format. + +- **parquet-rowcount** for reporting the number of records in one or more Parquet files. + `Usage: parquet-rowcount ...`, where `...` is a space separated list of one or more + files to read.
+ +If you see `Library not loaded` error, please make sure `LD_LIBRARY_PATH` is set properly: + +``` +export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib +``` + +## Benchmarks + +Run `cargo bench` for benchmarks. + +## Docs + +To build documentation, run `cargo doc --no-deps`. +To compile and view in the browser, run `cargo doc --no-deps --open`. diff --git a/parquet/README.md b/parquet/README.md index b48e27e91010..621525e74f64 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -17,7 +17,7 @@ under the License. --> -# An Apache Parquet implementation in Rust +# Official Apache Parquet implementation in Rust [![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) @@ -30,11 +30,6 @@ Add this to your Cargo.toml: parquet = "^4" ``` -and this to your crate root: - -```rust -extern crate parquet; -``` Example usage of reading data: @@ -51,7 +46,7 @@ while let Some(record) = iter.next() { } ``` -See [crate documentation](https://docs.rs/crate/parquet/5.0.0-SNAPSHOT) on available API. +See [crate documentation](https://docs.rs/crate/parquet) on available API. ## Upgrading from versions prior to 4.0 @@ -90,54 +85,6 @@ version is available. Then simply update version of `parquet-format` crate in Ca - [ ] Predicate pushdown - [x] Parquet format 2.6.0 support -## Requirements - -Parquet requires LLVM. Our windows CI image includes LLVM but to build the libraries locally windows -users will have to install LLVM. Follow [this](https://github.com/appveyor/ci/issues/2651) link for info. - -## Build - -Run `cargo build` or `cargo build --release` to build in release mode. -Some features take advantage of SSE4.2 instructions, which can be -enabled by adding `RUSTFLAGS="-C target-feature=+sse4.2"` before the -`cargo build` command. - -## Test - -Run `cargo test` for unit tests. To also run tests related to the binaries, use `cargo test --features cli`. 
- -## Binaries - -The following binaries are provided (use `cargo install --features cli` to install them): - -- **parquet-schema** for printing Parquet file schema and metadata. - `Usage: parquet-schema <file-path>`, where `file-path` is the path to a Parquet file. Use `-v/--verbose` flag - to print full metadata or schema only (when not specified only schema will be printed). - -- **parquet-read** for reading records from a Parquet file. - `Usage: parquet-read <file-path> [num-records]`, where `file-path` is the path to a Parquet file, - and `num-records` is the number of records to read from a file (when not specified all records will - be printed). Use `-j/--json` to print records in JSON lines format. - -- **parquet-rowcount** for reporting the number of records in one or more Parquet files. - `Usage: parquet-rowcount ...`, where `...` is a space separated list of one or more - files to read. - -If you see `Library not loaded` error, please make sure `LD_LIBRARY_PATH` is set properly: - -``` -export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(rustc --print sysroot)/lib -``` - -## Benchmarks - -Run `cargo bench` for benchmarks. - -## Docs - -To build documentation, run `cargo doc --no-deps`. -To compile and view in the browser, run `cargo doc --no-deps --open`. - ## License Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0.
From 1836ee97cb2baa583556d1d610c46b965d59f527 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jul 2021 07:07:42 -0400 Subject: [PATCH 2/7] Fixup arrow readme content --- arrow/CONTRIBUTING.md | 144 ++++++++++++++++++++++++++++++++++++ arrow/README.md | 165 +++++------------------------------------- parquet/README.md | 2 +- 3 files changed, 164 insertions(+), 147 deletions(-) create mode 100644 arrow/CONTRIBUTING.md diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md new file mode 100644 index 000000000000..25c3770a9ff3 --- /dev/null +++ b/arrow/CONTRIBUTING.md @@ -0,0 +1,144 @@ +sfeatures +## Developer's guide + +Common information for all Rust libraries in this project, including +testing, code formatting, and lints, can be found in the main Arrow +Rust [README.md](../README.md). + +Please refer to [lib.rs](src/lib.rs) for an introduction to this +specific crate and its current functionality. + +### How to check memory allocations + +This crate heavily uses `unsafe` due to how memory is allocated in cache lines. +We have a small tool to verify that this crate does not leak memory (beyond what the compiler already does) + +Run it with + +```bash +cargo test --features memory-check --lib -- --test-threads 1 +``` + +This runs all unit-tests on a single thread and counts all allocations and de-allocations. + +## Examples + +The examples folder shows how to construct some different types of Arrow +arrays, including dynamic arrays created at runtime. + +Examples can be run using the `cargo run --example` command. For example: + +```bash +cargo run --example builders +cargo run --example dynamic_types +cargo run --example read_csv +``` + +## IPC + +The expected flatc version is 1.12.0+, built from [flatbuffers](https://github.com/google/flatbuffers) +master at fixed commit ID, by regen.sh. 
+ +The IPC flatbuffer code was generated by running this command from the root of the project: + +```bash +./regen.sh +``` + +The above script will run the `flatc` compiler and perform some adjustments to the source code: + +- Replace `type__` with `type_` +- Remove `org::apache::arrow::flatbuffers` namespace +- Add includes to each generated file + + +## Guidelines in usage of `unsafe` + +[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust. +For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829) +This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent. + +### When can `unsafe` be used? + +Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`: + +- alloc, dealloc and realloc of buffers along cache lines +- Interpreting bytes as certain rust types, for access, representation and compute +- Foreign interfaces (C data interface) +- Inter-process communication (IPC) +- SIMD +- Performance (e.g. omit bounds checks, use of pointers to avoid bound checks) + +#### cache-line aligned memory management + +The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior. 
+However, Rust's global allocator does not allocate memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`. + +#### Interpreting bytes + +The arrow format is specified in bytes (`u8`), which can be logically represented as certain types +depending on the `DataType`. +For many operations, such as access, representation, numerical computation and string manipulation, +it is often necessary to interpret bytes as other physical types (e.g. `i32`). + +Usage of `unsafe` for the purpose of interpreting bytes in their corresponding type (according to the arrow specification) is allowed. Specifically, the pointer to the byte slice must be aligned to the type that it intends to represent and the length of the slice is a multiple of the size of the target type of the transmutation. + +#### FFI + +The arrow format declares an ABI for zero-copy from and to libraries that implement the specification +(foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to +the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment) from the source code alone as they are part of the FFI contract. + +#### IPC + +The arrow format declares a IPC protocol, which this crate supports. IPC is equivalent to a FFI in that the rust compiler can't reason about the contract's invariants. + +#### SIMD + +The API provided by the `packed_simd` library is currently `unsafe`. However, SIMD offers a significant performance improvement over non-SIMD operations. + +#### Performance + +Some operations are significantly faster when `unsafe` is used. + +A common usage of `unsafe` is to offer an API to access the `i`th element of an array (e.g. `UInt32Array`). +This requires accessing the values buffer e.g. `array.buffers()[0]`, picking the slice +`[i * size_of<i32>(), (i + 1) * size_of<i32>()]`, and then transmuting it to `i32`.
In safe Rust, +this operation requires boundary checks that are detrimental to performance. + +Usage of `unsafe` for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%). + +### Considerations when introducing `unsafe` + +Usage of `unsafe` in this crate _must_: + +- not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior. +- have code documentation for why `safe` is not used / possible +- have code documentation about which invariant the user needs to enforce to ensure [soundness](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library), or which +- invariant is being preserved. +- if applicable, use `debug_assert`s to relevant invariants (e.g. bound checks) + +Example of code documentation: + +```rust +// JUSTIFICATION +// Benefit +// Describe the benefit of using unsafe. E.g. +// "30% performance degradation if the safe counterpart is used, see bench X." +// Soundness +// Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. +// "We bounded check these values at initialization and the array is immutable." +let ... = unsafe { ... }; +``` + +When adding this documentation to existing code that is not sound and cannot trivially be fixed, we should file +specific JIRA issues and reference them in these code comments. For example: + +```rust +// Soundness +// This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn +``` + +# Releases and publishing to crates.io + +Please see the [release](../dev/release/README.md) for details on how to create arrow releases diff --git a/arrow/README.md b/arrow/README.md index dfd5926281a0..0074d7e92026 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -17,165 +17,38 @@ under the License. 
--> -# Native Rust implementation of Apache Arrow +# Apache Arrow Official Native Rust Implementation [![Crates.io](https://img.shields.io/crates/v/arrow.svg)](https://crates.io/crates/arrow) -This crate contains a native Rust implementation of the [Arrow columnar format](https://arrow.apache.org/docs/format/Columnar.html). +This crate contains the official Native Rust implementation of [Apache Arrow](https://arrow.apache.org/) in memory format. Please see the API documents for additional details. -## Developer's guide +## Usage -Common information for all Rust libraries in this project, including -testing, code formatting, and lints, can be found in the main Arrow -Rust [README.md](../README.md). +Add this to your Cargo.toml: -Please refer to [lib.rs](src/lib.rs) for an introduction to this -specific crate and its current functionality. - -### How to check memory allocations - -This crate heavily uses `unsafe` due to how memory is allocated in cache lines. -We have a small tool to verify that this crate does not leak memory (beyond what the compiler already does) - -Run it with - -```bash -cargo test --features memory-check --lib -- --test-threads 1 -``` - -This runs all unit-tests on a single thread and counts all allocations and de-allocations. - -## Examples - -The examples folder shows how to construct some different types of Arrow -arrays, including dynamic arrays created at runtime. - -Examples can be run using the `cargo run --example` command. For example: - -```bash -cargo run --example builders -cargo run --example dynamic_types -cargo run --example read_csv -``` - -## IPC - -The expected flatc version is 1.12.0+, built from [flatbuffers](https://github.com/google/flatbuffers) -master at fixed commit ID, by regen.sh. 
- -The IPC flatbuffer code was generated by running this command from the root of the project: - -```bash -./regen.sh +```toml +[dependencies] +arrow = "5.0" ``` -The above script will run the `flatc` compiler and perform some adjustments to the source code: - -- Replace `type__` with `type_` -- Remove `org::apache::arrow::flatbuffers` namespace -- Add includes to each generated file - ## Features -Arrow uses the following features: - -- `simd` - Arrow uses the [packed_simd](https://crates.io/crates/packed_simd) crate to optimize many of the - implementations in the [compute](https://github.com/apache/arrow/tree/master/rust/arrow/src/compute) - module using SIMD intrinsics. These optimizations are turned _off_ by default. - If the `simd` feature is enabled, an unstable version of Rust is required (we test with `nightly-2021-03-24`) -- `flight` which contains useful functions to convert between the Flight wire format and Arrow data -- `prettyprint` which is a utility for printing record batches - -Other than `simd` all the other features are enabled by default. Disabling `prettyprint` might be necessary in order to -compile Arrow to the `wasm32-unknown-unknown` WASM target. - -## Guidelines in usage of `unsafe` - -[`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust. 
-For two real world examples of where `unsafe` has consumed time in the past in this project see [#8545](https://github.com/apache/arrow/pull/8645) and [8829](https://github.com/apache/arrow/pull/8829) -This crate only accepts the usage of `unsafe` code upon careful consideration, and strives to avoid it to the largest possible extent. - -### When can `unsafe` be used? -Generally, `unsafe` should only be used when a `safe` counterpart is not available and there is no `safe` way to achieve additional performance in that area. The following is a summary of the current components of the crate that require `unsafe`: +The arrow crate provides the following optional features: -- alloc, dealloc and realloc of buffers along cache lines -- Interpreting bytes as certain rust types, for access, representation and compute -- Foreign interfaces (C data interface) -- Inter-process communication (IPC) -- SIMD -- Performance (e.g. omit bounds checks, use of pointers to avoid bound checks) +- `csv` (default) - support for reading and writing Arrow arrays to/from csv files +- `ipc` (default) - support for the [arrow-flight]((https://crates.io/crates/arrow-flight) IPC and wire format +- `prettyprint` - support for formatting record batches as textual columns +- `simd` - (*Requires Nightly Rust*) alternate optimized + implementations of some [compute](https://github.com/apache/arrow/tree/master/rust/arrow/src/compute) + kernels using explicit SIMD processor intrinsics. -#### cache-line aligned memory management +## Building for WASM -The arrow format recommends storing buffers aligned with cache lines, and this crate adopts this behavior. -However, Rust's global allocator does not allocate memory aligned with cache-lines. As such, many of the low-level operations related to memory management require `unsafe`. +In order to compile Arrow for Web Assembly (the `wasm32-unknown-unknown` WASM target), you will likely need to turn off this crate's default features. 
-#### Interpreting bytes - -The arrow format is specified in bytes (`u8`), which can be logically represented as certain types -depending on the `DataType`. -For many operations, such as access, representation, numerical computation and string manipulation, -it is often necessary to interpret bytes as other physical types (e.g. `i32`). - -Usage of `unsafe` for the purpose of interpreting bytes in their corresponding type (according to the arrow specification) is allowed. Specifically, the pointer to the byte slice must be aligned to the type that it intends to represent and the length of the slice is a multiple of the size of the target type of the transmutation. - -#### FFI - -The arrow format declares an ABI for zero-copy from and to libraries that implement the specification -(foreign interfaces). In Rust, receiving and sending pointers via FFI requires usage of `unsafe` due to -the impossibility of the compiler to derive the invariants (such as lifetime, null pointers, and pointer alignment) from the source code alone as they are part of the FFI contract. - -#### IPC - -The arrow format declares a IPC protocol, which this crate supports. IPC is equivalent to a FFI in that the rust compiler can't reason about the contract's invariants. - -#### SIMD - -The API provided by the `packed_simd` library is currently `unsafe`. However, SIMD offers a significant performance improvement over non-SIMD operations. - -#### Performance - -Some operations are significantly faster when `unsafe` is used. - -A common usage of `unsafe` is to offer an API to access the `i`th element of an array (e.g. `UInt32Array`). -This requires accessing the values buffer e.g. `array.buffers()[0]`, picking the slice -`[i * size_of<i32>(), (i + 1) * size_of<i32>()]`, and then transmuting it to `i32`. In safe Rust, -this operation requires boundary checks that are detrimental to performance.
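The element-access trade-off described above can be sketched with the standard library alone. This is a minimal illustration, not code from the arrow crate: `value_checked` and `value_unchecked` are hypothetical names, the buffer is a plain `Vec<u8>` standing in for an Arrow values buffer, and the stated benefit in the comment is a placeholder rather than a measured result.

```rust
use std::convert::TryInto;
use std::mem::size_of;

/// Safe variant: slicing performs a bounds check on every call.
fn value_checked(bytes: &[u8], i: usize) -> i32 {
    let start = i * size_of::<i32>();
    let end = start + size_of::<i32>();
    i32::from_ne_bytes(bytes[start..end].try_into().unwrap())
}

/// Unsafe variant: no bounds check in release builds.
///
/// # Safety
/// The caller must ensure `(i + 1) * size_of::<i32>() <= bytes.len()`.
unsafe fn value_unchecked(bytes: &[u8], i: usize) -> i32 {
    // Checked in debug builds only, as the guidelines suggest.
    debug_assert!((i + 1) * size_of::<i32>() <= bytes.len());
    // JUSTIFICATION
    //  Benefit
    //      Placeholder: avoids a per-element bounds check in hot loops
    //      (a real change would cite a benchmark here).
    //  Soundness
    //      The caller guarantees the index is in bounds (see `# Safety`), and
    //      `read_unaligned` places no alignment requirement on the pointer.
    unsafe { (bytes.as_ptr().add(i * size_of::<i32>()) as *const i32).read_unaligned() }
}

fn main() {
    let values: [i32; 3] = [7, -1, 42];
    // Build a byte buffer holding the three values back to back.
    let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();
    for (i, expected) in values.iter().enumerate() {
        assert_eq!(value_checked(&bytes, i), *expected);
        // SAFETY: `i < values.len()`, so the read stays inside `bytes`.
        assert_eq!(unsafe { value_unchecked(&bytes, i) }, *expected);
    }
}
```

The sketch uses `read_unaligned` so that it stays sound for a `Vec<u8>`, which carries no `i32` alignment guarantee. A buffer allocated along cache lines could use an aligned read instead.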
- -Usage of `unsafe` for performance reasons is justified only when all other alternatives have been exhausted and the performance benefits are sufficiently large (e.g. >~10%). - -### Considerations when introducing `unsafe` - -Usage of `unsafe` in this crate _must_: - -- not expose a public API as `safe` when there are necessary invariants for that API to be defined behavior. -- have code documentation for why `safe` is not used / possible -- have code documentation about which invariant the user needs to enforce to ensure [soundness](https://rust-lang.github.io/unsafe-code-guidelines/glossary.html#soundness-of-code--of-a-library), or which -- invariant is being preserved. -- if applicable, use `debug_assert`s to relevant invariants (e.g. bound checks) - -Example of code documentation: - -```rust -// JUSTIFICATION -// Benefit -// Describe the benefit of using unsafe. E.g. -// "30% performance degradation if the safe counterpart is used, see bench X." -// Soundness -// Describe why the code remains sound (according to the definition of rust's unsafe code guidelines). E.g. -// "We bounded check these values at initialization and the array is immutable." -let ... = unsafe { ... }; +```toml +[dependencies] +arrow = {version = "5.0" default-features = false } ``` - -When adding this documentation to existing code that is not sound and cannot trivially be fixed, we should file -specific JIRA issues and reference them in these code comments. For example: - -```rust -// Soundness -// This is not sound because .... see https://issues.apache.org/jira/browse/ARROW-nnnnn -``` - -# Releases and publishing to crates.io - -Please see the [release](../dev/release/README.md) for details on how to create arrow releases diff --git a/parquet/README.md b/parquet/README.md index 621525e74f64..c0efa18430bc 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -17,7 +17,7 @@ under the License. 
--> -# Official Apache Parquet implementation in Rust +# Apache Parquet Official Native Rust Implementation [![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) From b3d39b44f221adf3d82b902587580e2cad068db4 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jul 2021 07:14:19 -0400 Subject: [PATCH 3/7] cleanups --- arrow-flight/README.md | 15 ++++++++++++--- arrow/CONTRIBUTING.md | 13 ------------- arrow/README.md | 22 +++++++++++++--------- parquet/README.md | 10 ++-------- 4 files changed, 27 insertions(+), 33 deletions(-) diff --git a/arrow-flight/README.md b/arrow-flight/README.md index 4205ebb2e26b..5920a22676b3 100644 --- a/arrow-flight/README.md +++ b/arrow-flight/README.md @@ -21,8 +21,17 @@ [![Crates.io](https://img.shields.io/crates/v/arrow-flight.svg)](https://crates.io/crates/arrow-flight) -Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information. -This crate simply provides the Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and provides an example that demonstrates how to build a Flight server implemented with Tonic. +## Usage + +Add this to your Cargo.toml: + +```toml +[dependencies] +arrow-flight = "5.0" +``` + + +Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information. -Note that building a Flight server also requires an implementation of Arrow IPC which is based on the Flatbuffers serialization framework. The Rust implementation of Arrow IPC is not yet complete although the generated Flatbuffers code is available as part of the core Arrow crate. 
+This crate provides a Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and provides an example that demonstrates how to build a Flight server implemented with Tonic. diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index 25c3770a9ff3..ed34c30b6c0a 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -21,19 +21,6 @@ cargo test --features memory-check --lib -- --test-threads 1 This runs all unit-tests on a single thread and counts all allocations and de-allocations. -## Examples - -The examples folder shows how to construct some different types of Arrow -arrays, including dynamic arrays created at runtime. - -Examples can be run using the `cargo run --example` command. For example: - -```bash -cargo run --example builders -cargo run --example dynamic_types -cargo run --example read_csv -``` - ## IPC The expected flatc version is 1.12.0+, built from [flatbuffers](https://github.com/google/flatbuffers) diff --git a/arrow/README.md b/arrow/README.md index 0074d7e92026..8d5f944fc358 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -23,15 +23,6 @@ This crate contains the official Native Rust implementation of [Apache Arrow](https://arrow.apache.org/) in memory format. Please see the API documents for additional details. -## Usage - -Add this to your Cargo.toml: - -```toml -[dependencies] -arrow = "5.0" -``` - ## Features @@ -52,3 +43,16 @@ In order to compile Arrow for Web Assembly (the `wasm32-unknown-unknown` WASM ta [dependencies] arrow = {version = "5.0" default-features = false } ``` + +## Examples + +The examples folder shows how to construct some different types of Arrow +arrays, including dynamic arrays: + +Examples can be run using the `cargo run --example` command. 
For example: + +```bash +cargo run --example builders +cargo run --example dynamic_types +cargo run --example read_csv +``` diff --git a/parquet/README.md b/parquet/README.md index c0efa18430bc..0600e123be78 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -21,15 +21,9 @@ [![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) -## Usage - -Add this to your Cargo.toml: - -```toml -[dependencies] -parquet = "^4" -``` +This crate contains the official Native Rust implementation of [Apache Parquet](https://parquet.apache.org/), which is part of the [Apache Arrow](https://arrow.apache.org/) project. +## Example Example usage of reading data: From 4742b48cb889890caca2715ada1aeb06850306eb Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jul 2021 07:29:32 -0400 Subject: [PATCH 4/7] update main readmen --- README.md | 160 ++++++------------------------------------------------ 1 file changed, 18 insertions(+), 142 deletions(-) diff --git a/README.md b/README.md index 021442897891..1fc52134366a 100644 --- a/README.md +++ b/README.md @@ -17,167 +17,43 @@ under the License. --> -# Native Rust implementation of Apache Arrow +# Native Rust implementation of Apache Arrow and Parquet [![Coverage Status](https://codecov.io/gh/apache/arrow/rust/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/arrow?branch=master) Welcome to the implementation of Arrow, the popular in-memory columnar format, in [Rust](https://www.rust-lang.org/). 
-This part of the Arrow project is divided in 4 main components: +This repo contains the following main components: | Crate | Description | Documentation | | ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | -| Arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) | -| Parquet | Parquet support | [(README)](parquet/README.md) | -| Arrow-flight | Arrow data between processes | [(README)](arrow-flight/README.md) | +| arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) | +| parquet | Support for Parquet columnar file format | [(README)](parquet/README.md) | +| arrow-flight | Support for Arrow-Flight IPC protocol | [(README)](arrow-flight/README.md) | + +There are two related crates in a different repository +| Crate | Description | Documentation | +| ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | | DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) | | Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) | -Independently, they support a vast array of functionality for in-memory computations. - -Together, they allow users to write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send to another process (using the `arrow-flight` crate). 
-Generally speaking, the `arrow` crate offers functionality to develop code that uses Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions. +Collectively, these crates support a vast array of functionality for analytic computations in Rust. -There are too many features to enumerate here, but some notable mentions: +For example, you can write an SQL query or a `DataFrame` (using the `datafusion` crate), run it against a parquet file (using the `parquet` crate), evaluate it in-memory using Arrow's columnar format (using the `arrow` crate), and send to another process (using the `arrow-flight` crate). -- `Arrow` implements all formats in the specification except certain dictionaries -- `Arrow` supports SIMD operations to some of its vertical operations -- `DataFusion` supports `async` execution -- `DataFusion` supports user-defined functions, aggregates, and whole execution nodes +Generally speaking, the `arrow` crate offers functionality for using Arrow arrays, and `datafusion` offers most operations typically found in SQL, including `join`s and window functions. You can find more details about each crate in their respective READMEs. ## Arrow Rust Community -We use the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. This is -a great place to meet other contributors and get guidance on where to contribute. Join us in the `arrow-rust` channel. - -We use [ASF JIRA](https://issues.apache.org/jira/secure/Dashboard.jspa) as the system of record for new features -and bug fixes and this plays a critical role in the release process. - -For design discussions we generally collaborate on Google documents and file a JIRA linking to the document. - -There is also a bi-weekly Rust-specific sync call for the Arrow Rust community. This is hosted on Google Meet -at https://meet.google.com/ctp-yujs-aee on alternate Wednesday's at 09:00 US/Pacific, 12:00 US/Eastern. 
During -US daylight savings time this corresponds to 16:00 UTC and at other times this is 17:00 UTC. - -## Developer's guide to Arrow Rust - -### How to compile - -This is a standard cargo project with workspaces. To build it, you need to have `rust` and `cargo`: - -```bash -cargo build -``` - -You can also use rust's official docker image: - -```bash -docker run --rm -v $(pwd):/arrow-rs -it rust /bin/bash -c "cd /arrow-rs && rustup component add rustfmt && cargo build" -``` - -The command above assumes that are in the root directory of the project, not in the same -directory as this README.md. - -You can also compile specific workspaces: - -```bash -cd arrow && cargo build -``` - -### Git Submodules - -Before running tests and examples, it is necessary to set up the local development environment. - -The tests rely on test data that is contained in git submodules. - -To pull down this data run the following: - -```bash -git submodule update --init -``` - -This populates data in two git submodules: - -- `../parquet_testing/data` (sourced from https://github.com/apache/parquet-testing.git) -- `../testing` (sourced from https://github.com/apache/arrow-testing) - -By default, `cargo test` will look for these directories at their -standard location. The following environment variables can be used to override the location: +The `dev@arrow.apache.org` mailing list serves as the core communication channel for the Arrow community. Instructions for signing up and links to the archives can be found at the [Arrow Community](https://arrow.apache.org/community/) page. All major announcements and communications happen there. -```bash -# Optionally specify a different location for test data -export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd) -export ARROW_TEST_DATA=$(cd ../testing/data; pwd) -``` +The Rust Arrow community also uses the official [ASF Slack](https://s.apache.org/slack-invite) for informal discussions and coordination. 
This is +a great place to meet other contributors and get guidance on where to contribute. Join us in the `#arrow-rust` channel. -From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual. - -### Running the tests - -Run tests using the Rust standard `cargo test` command: - -```bash -# run all tests. -cargo test - - -# run only tests for the arrow crate -cargo test -p arrow -``` - -## Code Formatting - -Our CI uses `rustfmt` to check code formatting. Before submitting a -PR be sure to run the following and check for lint issues: - -```bash -cargo +stable fmt --all -- --check -``` - -## Clippy Lints - -We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings. - -Run the following to check for clippy lints. - -```bash -cargo clippy -``` - -If you use Visual Studio Code with the `rust-analyzer` plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881. - -One of the concerns with `clippy` is that it often produces a lot of false positives, or that some recommendations may hurt readability. We do not have a policy of which lints are ignored, but if you disagree with a `clippy` lint, you may disable the lint and briefly justify it. - -Search for `allow(clippy::` in the codebase to identify lints that are ignored/allowed. We currently prefer ignoring lints on the lowest unit possible. - -- If you are introducing a line that returns a lint warning or error, you may disable the lint on that line. -- If you have several lints on a function or module, you may disable the lint on the function or module. -- If a lint is pervasive across multiple modules, you may disable it at the crate level. 
- -## Git Pre-Commit Hook - -We can use [git pre-commit hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) to automate various kinds of git pre-commit checking/formatting. - -Suppose you are in the root directory of the project. - -First check if the file already exists: - -```bash -ls -l .git/hooks/pre-commit -``` - -If the file already exists, to avoid mistakenly **overriding**, you MAY have to check -the link source or file content. Else if not exist, let's safely soft link [pre-commit.sh](pre-commit.sh) as file `.git/hooks/pre-commit`: - -```bash -ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit -``` - -If sometimes you want to commit without checking, just run `git commit` with `--no-verify`: +Unlike other parts of the Arrow ecosystem, the Rust implementation uses [github issues](https://github.com/apache/arrow-rs/issues) as the system of record for new features +and bug fixes and this plays a critical role in the release process. -```bash -git commit --no-verify -m "... commit message ..." -``` +For design discussions we generally collaborate on Google documents and file a github issue linking to the document. From 91f93c7ad7cedd5ee4e3f2b4cb2a394ce188baad Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jul 2021 07:31:37 -0400 Subject: [PATCH 5/7] update contributing --- CONTRIBUTING.md | 145 ++++++++++++++++++++++++++++++++++-------------- 1 file changed, 103 insertions(+), 42 deletions(-) diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 18d6a7be5abb..9fe6c48a243d 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -17,61 +17,122 @@ under the License. --> -# How to contribute to Apache Arrow +## Developer's guide to Arrow Rust -## Did you find a bug? +### How to compile -The Arrow project uses JIRA as a bug tracker. To report a bug, you'll have -to first create an account on the -[Apache Foundation JIRA](https://issues.apache.org/jira/). The JIRA server -hosts bugs and issues for multiple Apache projects. 
The JIRA project name -for Arrow is "ARROW". +This is a standard cargo project with workspaces. To build it, you need to have `rust` and `cargo`: -To be assigned to an issue, ask an Arrow JIRA admin to go to -[Arrow Roles](https://issues.apache.org/jira/plugins/servlet/project-config/ARROW/roles), -click "Add users to a role," and add you to the "Contributor" role. Most -committers are authorized to do this; if you're a committer and aren't -able to load that project admin page, have someone else add you to the -necessary role. +```bash +cargo build +``` -Before you create a new bug entry, we recommend you first -[search](https://issues.apache.org/jira/projects/ARROW/issues/ARROW-5140?filter=allopenissues) -among existing Arrow issues. +You can also use rust's official docker image: -When you create a new JIRA entry, please don't forget to fill the "Component" -field. Arrow has many subcomponents and this helps triaging and filtering -tremendously. Also, we conventionally prefix the issue title with the component -name in brackets, such as "[C++] Crash in Array::Frobnicate()", so as to make -lists more easy to navigate, and we'd be grateful if you did the same. +```bash +docker run --rm -v $(pwd):/arrow-rs -it rust /bin/bash -c "cd /arrow-rs && rustup component add rustfmt && cargo build" +``` -## Did you write a patch that fixes a bug or brings an improvement? +The command above assumes that you are in the root directory of the project, not in the same +directory as this README.md. -First create a JIRA entry as described above. Then, submit your changes -as a GitHub Pull Request. We'll ask you to prefix the pull request title -with the JIRA issue number and the component name in brackets. -(for example: "ARROW-2345: [C++] Fix crash in Array::Frobnicate()"). -Respecting this convention makes it easier for us to process the backlog -of submitted Pull Requests.
+You can also compile specific workspaces: -### Minor Fixes +```bash +cd arrow && cargo build +``` -Any functionality change should have a JIRA opened. For minor changes that -affect documentation, you do not need to open up a JIRA. Instead you can -prefix the title of your PR with "MINOR: " if meets the following guidelines: +### Git Submodules -- Grammar, usage and spelling fixes that affect no more than 2 files -- Documentation updates affecting no more than 2 files and not more - than 500 words. +Before running tests and examples, it is necessary to set up the local development environment. -## Do you want to propose a significant new feature or an important refactoring? +The tests rely on test data that is contained in git submodules. -We ask that all discussions about major changes in the codebase happen -publicly on the [arrow-dev mailing-list](https://mail-archives.apache.org/mod_mbox/arrow-dev/). +To pull down this data run the following: -## Do you have questions about the source code, the build procedure or the development process? +```bash +git submodule update --init +``` -You can also ask on the mailing-list, see above. +This populates data in two git submodules: -## Further information +- `../parquet-testing/data` (sourced from https://github.com/apache/parquet-testing.git) +- `../testing` (sourced from https://github.com/apache/arrow-testing) -Please read our [development documentation](https://arrow.apache.org/docs/developers/contributing.html). +By default, `cargo test` will look for these directories at their +standard location. The following environment variables can be used to override the location: + +```bash +# Optionally specify a different location for test data +export PARQUET_TEST_DATA=$(cd ../parquet-testing/data; pwd) +export ARROW_TEST_DATA=$(cd ../testing/data; pwd) +``` + +From here on, this is a pure Rust project and `cargo` can be used to run tests, benchmarks, docs and examples as usual.
+ +### Running the tests + +Run tests using the Rust standard `cargo test` command: + +```bash +# run all tests. +cargo test + + +# run only tests for the arrow crate +cargo test -p arrow +``` + +## Code Formatting + +Our CI uses `rustfmt` to check code formatting. Before submitting a +PR be sure to run the following and check for lint issues: + +```bash +cargo +stable fmt --all -- --check +``` + +## Clippy Lints + +We recommend using `clippy` for checking lints during development. While we do not yet enforce `clippy` checks, we recommend not introducing new `clippy` errors or warnings. + +Run the following to check for clippy lints. + +```bash +cargo clippy +``` + +If you use Visual Studio Code with the `rust-analyzer` plugin, you can enable `clippy` to run each time you save a file. See https://users.rust-lang.org/t/how-to-use-clippy-in-vs-code-with-rust-analyzer/41881. + +One of the concerns with `clippy` is that it often produces a lot of false positives, or that some recommendations may hurt readability. We do not have a policy of which lints are ignored, but if you disagree with a `clippy` lint, you may disable the lint and briefly justify it. + +Search for `allow(clippy::` in the codebase to identify lints that are ignored/allowed. We currently prefer ignoring lints on the lowest unit possible. + +- If you are introducing a line that returns a lint warning or error, you may disable the lint on that line. +- If you have several lints on a function or module, you may disable the lint on the function or module. +- If a lint is pervasive across multiple modules, you may disable it at the crate level. + +## Git Pre-Commit Hook + +We can use [git pre-commit hook](https://git-scm.com/book/en/v2/Customizing-Git-Git-Hooks) to automate various kinds of git pre-commit checking/formatting. + +Suppose you are in the root directory of the project. 
+ +First check if the file already exists: + +```bash +ls -l .git/hooks/pre-commit +``` + +If the file already exists, to avoid mistakenly **overriding**, you MAY have to check +the link source or file content. Else if not exist, let's safely soft link [pre-commit.sh](pre-commit.sh) as file `.git/hooks/pre-commit`: + +```bash +ln -s ../../rust/pre-commit.sh .git/hooks/pre-commit +``` + +If sometimes you want to commit without checking, just run `git commit` with `--no-verify`: + +```bash +git commit --no-verify -m "... commit message ..." +``` From 28ad2da25c8659e5f008ee5389383e316b7a525e Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Sat, 10 Jul 2021 07:32:33 -0400 Subject: [PATCH 6/7] Prettier --- README.md | 17 ++++++++--------- arrow-flight/README.md | 2 -- arrow/CONTRIBUTING.md | 2 +- arrow/README.md | 3 +-- 4 files changed, 10 insertions(+), 14 deletions(-) diff --git a/README.md b/README.md index 1fc52134366a..b1cac38de446 100644 --- a/README.md +++ b/README.md @@ -25,18 +25,17 @@ Welcome to the implementation of Arrow, the popular in-memory columnar format, i This repo contains the following main components: -| Crate | Description | Documentation | -| ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | -| arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) | -| parquet | Support for Parquet columnar file format | [(README)](parquet/README.md) | -| arrow-flight | Support for Arrow-Flight IPC protocol | [(README)](arrow-flight/README.md) | +| Crate | Description | Documentation | +| ------------ | ------------------------------------------------------------------ | ---------------------------------- | +| arrow | Core functionality (memory layout, arrays, low level computations) | [(README)](arrow/README.md) | +| parquet | Support for Parquet columnar file format | 
[(README)](parquet/README.md) | +| arrow-flight | Support for Arrow-Flight IPC protocol | [(README)](arrow-flight/README.md) | There are two related crates in a different repository -| Crate | Description | Documentation | +| Crate | Description | Documentation | | ------------ | ------------------------------------------------------------------ | ------------------------------------------------------------------------------------- | -| DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) | -| Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) | - +| DataFusion | In-memory query engine with SQL support | [(README)](https://github.com/apache/arrow-datafusion/blob/master/README.md) | +| Ballista | Distributed query execution | [(README)](https://github.com/apache/arrow-datafusion/blob/master/ballista/README.md) | Collectively, these crates support a vast array of functionality for analytic computations in Rust. diff --git a/arrow-flight/README.md b/arrow-flight/README.md index 5920a22676b3..b9bc466e205e 100644 --- a/arrow-flight/README.md +++ b/arrow-flight/README.md @@ -21,7 +21,6 @@ [![Crates.io](https://img.shields.io/crates/v/arrow-flight.svg)](https://crates.io/crates/arrow-flight) - ## Usage Add this to your Cargo.toml: @@ -31,7 +30,6 @@ Add this to your Cargo.toml: arrow-flight = "5.0" ``` - Apache Arrow Flight is a gRPC based protocol for exchanging Arrow data between processes. See the blog post [Introducing Apache Arrow Flight: A Framework for Fast Data Transport](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for more information. This crate provides a Rust implementation of the [Flight.proto](../../format/Flight.proto) gRPC protocol and provides an example that demonstrates how to build a Flight server implemented with Tonic. 
diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index ed34c30b6c0a..ff4389aa93af 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -1,4 +1,5 @@ sfeatures + ## Developer's guide Common information for all Rust libraries in this project, including @@ -38,7 +39,6 @@ The above script will run the `flatc` compiler and perform some adjustments to t - Remove `org::apache::arrow::flatbuffers` namespace - Add includes to each generated file - ## Guidelines in usage of `unsafe` [`unsafe`](https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html) has a high maintenance cost because debugging and testing it is difficult, time consuming, often requires external tools (e.g. `valgrind`), and requires a higher-than-usual attention to details. Undefined behavior is particularly difficult to identify and test, and usage of `unsafe` is the [primary cause of undefined behavior](https://doc.rust-lang.org/reference/behavior-considered-undefined.html) in a program written in Rust. diff --git a/arrow/README.md b/arrow/README.md index 8d5f944fc358..f9b73085d686 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -25,13 +25,12 @@ This crate contains the official Native Rust implementation of [Apache Arrow](ht ## Features - The arrow crate provides the following optional features: - `csv` (default) - support for reading and writing Arrow arrays to/from csv files - `ipc` (default) - support for the [arrow-flight]((https://crates.io/crates/arrow-flight) IPC and wire format - `prettyprint` - support for formatting record batches as textual columns -- `simd` - (*Requires Nightly Rust*) alternate optimized +- `simd` - (_Requires Nightly Rust_) alternate optimized implementations of some [compute](https://github.com/apache/arrow/tree/master/rust/arrow/src/compute) kernels using explicit SIMD processor intrinsics. 
From f850aca8fa5290ea7d3a93a486dbe1d014e33275 Mon Sep 17 00:00:00 2001 From: Andrew Lamb Date: Mon, 12 Jul 2021 15:33:17 -0400 Subject: [PATCH 7/7] RAT --- arrow/CONTRIBUTING.md | 19 ++++++++++++++++++- parquet/CONTRIBUTING.md | 19 +++++++++++++++++++ 2 files changed, 37 insertions(+), 1 deletion(-) diff --git a/arrow/CONTRIBUTING.md b/arrow/CONTRIBUTING.md index ff4389aa93af..843e1faf05e7 100644 --- a/arrow/CONTRIBUTING.md +++ b/arrow/CONTRIBUTING.md @@ -1,4 +1,21 @@ -sfeatures + ## Developer's guide diff --git a/parquet/CONTRIBUTING.md b/parquet/CONTRIBUTING.md index 67cd9288db1c..834b6af9d4ef 100644 --- a/parquet/CONTRIBUTING.md +++ b/parquet/CONTRIBUTING.md @@ -1,3 +1,22 @@ + + ## Build Run `cargo build` or `cargo build --release` to build in release mode.