diff --git a/arrow/README.md b/arrow/README.md index d26a4f410c23..6ff42110ef3e 100644 --- a/arrow/README.md +++ b/arrow/README.md @@ -22,7 +22,10 @@ [![crates.io](https://img.shields.io/crates/v/arrow.svg)](https://crates.io/crates/arrow) [![docs.rs](https://img.shields.io/docsrs/arrow.svg)](https://docs.rs/arrow/latest/arrow/) -This crate contains the official Native Rust implementation of [Apache Arrow][arrow] in memory format, governed by the Apache Software Foundation. Additional details can be found on [crates.io](https://crates.io/crates/arrow), [docs.rs](https://docs.rs/arrow/latest/arrow/) and [examples](https://github.com/apache/arrow-rs/tree/master/arrow/examples). +This crate contains the official Native Rust implementation of [Apache Arrow][arrow] in memory format, governed by the Apache Software Foundation. + +The [crate documentation](https://docs.rs/arrow/latest/arrow/) contains examples and full API. +There are several [examples](https://github.com/apache/arrow-rs/tree/master/arrow/examples) to start from as well. ## Rust Version Compatibility @@ -34,18 +37,24 @@ The arrow crate follows the [SemVer standard](https://doc.rust-lang.org/cargo/re However, for historical reasons, this crate uses versions with major numbers greater than `0.x` (e.g. `19.0.0`), unlike many other crates in the Rust ecosystem which spend extended time releasing versions `0.x` to signal planned ongoing API changes. Minor arrow releases contain only compatible changes, while major releases may contain breaking API changes. 
-## Features +## Feature Flags -The arrow crate provides the following features which may be enabled: +The `arrow` crate provides the following features which may be enabled in your `Cargo.toml`: - `csv` (default) - support for reading and writing Arrow arrays to/from csv files - `ipc` (default) - support for the [arrow-flight](https://crates.io/crates/arrow-flight) IPC and wire format - `prettyprint` - support for formatting record batches as textual columns - `js` - support for building arrow for WebAssembly / JavaScript -- `simd` - (_Requires Nightly Rust_) alternate optimized +- `simd` - (_Requires Nightly Rust_) Use alternate hand-optimized implementations of some [compute](https://github.com/apache/arrow-rs/tree/master/arrow/src/compute/kernels) - kernels using explicit SIMD instructions available through [packed_simd_2](https://docs.rs/packed_simd_2/latest/packed_simd_2/). + kernels using explicit SIMD instructions via [packed_simd_2](https://docs.rs/packed_simd_2/latest/packed_simd_2/). - `chrono-tz` - support of parsing timezone using [chrono-tz](https://docs.rs/chrono-tz/0.6.0/chrono_tz/) +- `ffi` - bindings for the Arrow [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html) +- `pyarrow` - bindings for pyo3 to call arrow-rs from Python + +## Arrow Feature Status + +The [Apache Arrow Status](https://arrow.apache.org/docs/status.html) page lists which features of Arrow this crate supports. 
## Safety @@ -55,25 +64,25 @@ Arrow seeks to uphold the Rust Soundness Pledge as articulated eloquently [here] Where soundness in turn is defined as: -> Code is unable to trigger undefined behaviour using safe APIs +> Code is unable to trigger undefined behavior using safe APIs -One way to ensure this would be to not use `unsafe`, however, as described in the opening chapter of the [Rustonomicon](https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html) this is not a requirement, and flexibility in this regard is actually one of Rust's great strengths. +One way to ensure this would be to not use `unsafe`, however, as described in the opening chapter of the [Rustonomicon](https://doc.rust-lang.org/nomicon/meet-safe-and-unsafe.html) this is not a requirement, and flexibility in this regard is one of Rust's great strengths. In particular there are a number of scenarios where `unsafe` is largely unavoidable: -* Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc... -* FFI -* SIMD +- Invariants that cannot be statically verified by the compiler and unlock non-trivial performance wins, e.g. values in a StringArray are UTF-8, [TrustedLen](https://doc.rust-lang.org/std/iter/trait.TrustedLen.html) iterators, etc... +- FFI +- SIMD -Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate. +Additionally, this crate exposes a number of `unsafe` APIs, allowing downstream crates to explicitly opt-out of potentially expensive invariant checking where appropriate. 
We have a number of strategies to help reduce this risk: -* Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays -* Extensive validation logic to safely construct `ArrayData` from untrusted sources -* All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour -* We provide a `force_validate` feature that enables additional validation checks for use in test/debug builds -* There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space +- Provide strongly-typed `Array` and `ArrayBuilder` APIs to safely and efficiently interact with arrays +- Extensive validation logic to safely construct `ArrayData` from untrusted sources +- All commits are verified using [MIRI](https://github.com/rust-lang/miri) to detect undefined behaviour +- Use a `force_validate` feature that enables additional validation checks for use in test/debug builds +- There is ongoing work to reduce and better document the use of unsafe, and we welcome contributions in this space ## Building for WASM @@ -101,16 +110,38 @@ cargo run --example read_csv [arrow]: https://arrow.apache.org/ +## Performance Tips -## Performance +Arrow aims to be as fast as possible out of the box, whilst not compromising on safety. However, +it relies heavily on LLVM auto-vectorisation to achieve this. Unfortunately the LLVM defaults, +particularly for x86_64, favour portability over performance, and LLVM will consequently avoid +using more recent instructions that would result in errors on older CPUs. -Most of the compute kernels benefit a lot from being optimized for a specific CPU target. -This is especially so on x86-64 since without specifying a target the compiler can only assume support for SSE2 vector instructions. 
-One of the following values as `-Ctarget-cpu=value` in `RUSTFLAGS` can therefore improve performance significantly: +To address this it is recommended that you override the LLVM defaults either +by setting the `RUSTFLAGS` environment variable, or by setting `rustflags` in your +[Cargo configuration](https://doc.rust-lang.org/cargo/reference/config.html). - - `native`: Target the exact features of the cpu that the build is running on. - This should give the best performance when building and running locally, but should be used carefully for example when building in a CI pipeline or when shipping pre-compiled software. - - `x86-64-v3`: Includes AVX2 support and is close to the intel `haswell` architecture released in 2013 and should be supported by any recent Intel or Amd cpu. - - `x86-64-v4`: Includes AVX512 support available on intel `skylake` server and `icelake`/`tigerlake`/`rocketlake` laptop and desktop processors. +Enable all features supported by the current CPU -These flags should be used in addition to the `simd` feature, since they will also affect the code generated by the simd library. 
\ No newline at end of file +```ignore +RUSTFLAGS="-C target-cpu=native" +``` + +Enable all features supported by the current CPU, and enable full use of AVX512 + +```ignore +RUSTFLAGS="-C target-cpu=native -C target-feature=-prefer-256-bit" +``` + +Enable all features supported by CPUs more recent than haswell (2013) + +```ignore +RUSTFLAGS="-C target-cpu=haswell" +``` + +For a full list of features and target CPUs use + +```shell +$ rustc --print target-cpus +$ rustc --print target-features +``` diff --git a/arrow/src/compute/README.md b/arrow/src/compute/README.md index 761713a531b4..a5d15a83046f 100644 --- a/arrow/src/compute/README.md +++ b/arrow/src/compute/README.md @@ -33,16 +33,16 @@ We use the term "kernel" to refer to particular general operation that contains Types of functions -* Scalar functions: elementwise functions that perform scalar operations in a +- Scalar functions: elementwise functions that perform scalar operations in a vectorized manner. These functions are generally valid for SQL-like context. These are called "scalar" in that the functions executed consider each value in an array independently, and the output array or arrays have the same length as the input arrays. The result for each array cell is generally independent of its position in the array. -* Vector functions, which produce a result whose output is generally dependent +- Vector functions, which produce a result whose output is generally dependent on the entire contents of the input arrays. These functions **are generally not valid** for SQL-like processing because the output size may be different than the input size, and the result may change based on the order of the values in the array. This includes things like array subselection, sorting, hashing, and more. 
-* Scalar aggregate functions of which can be used in a SQL-like context \ No newline at end of file +- Scalar aggregate functions, which can be used in a SQL-like context diff --git a/arrow/src/lib.rs b/arrow/src/lib.rs index e64a5361176d..04f495dc0819 100644 --- a/arrow/src/lib.rs +++ b/arrow/src/lib.rs @@ -18,41 +18,8 @@ //! A complete, safe, native Rust implementation of [Apache Arrow](https://arrow.apache.org), a cross-language //! development platform for in-memory data. //! -//! # Performance Tips -//! -//! Arrow aims to be as fast as possible out of the box, whilst not compromising on safety. However, -//! it relies heavily on LLVM auto-vectorisation to achieve this. Unfortunately the LLVM defaults, -//! particularly for x86_64, favour portability over performance, and LLVM will consequently avoid -//! using more recent instructions that would result in errors on older CPUs. -//! -//! To address this it is recommended that you specify the override the LLVM defaults either -//! by setting the `RUSTFLAGS` environment variable, or by setting `rustflags` in your -//! [Cargo configuration](https://doc.rust-lang.org/cargo/reference/config.html) -//! -//! Enable all features supported by the current CPU -//! -//! ```ignore -//! RUSTFLAGS="-C target-cpu=native" -//! ``` -//! -//! Enable all features supported by the current CPU, and enable full use of AVX512 -//! -//! ```ignore -//! RUSTFLAGS="-C target-cpu=native -C target-feature=-prefer-256-bit" -//! ``` -//! -//! Enable all features supported by CPUs more recent than haswell (2013) -//! -//! ```ignore -//! RUSTFLAGS="-C target-cpu=haswell" -//! ``` -//! -//! For a full list of features and target CPUs use -//! -//! ```ignore -//! $ rustc --print target-cpus -//! $ rustc --print target-features -//! ``` +//! Please see the [arrow crates.io](https://crates.io/crates/arrow) +//! page for feature flags and tips to improve performance. //! //! # Columnar Format //! 
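Note on the Performance Tips text moved into `arrow/README.md`: it mentions setting `rustflags` in the Cargo configuration, but every example uses the `RUSTFLAGS` environment variable. A minimal sketch of the configuration-file form follows. It assumes a project-local `.cargo/config.toml` and the `[build]` table; neither is named in the diff, and the snippet overwrites any existing file at that path.

```shell
# Sketch: persist the flag in a project-local Cargo configuration file
# instead of exporting RUSTFLAGS in every shell session.
# Assumption: run from the project root; replaces an existing .cargo/config.toml.
mkdir -p .cargo
cat > .cargo/config.toml <<'EOF'
[build]
# Config-file counterpart of RUSTFLAGS="-C target-cpu=native"
rustflags = ["-C", "target-cpu=native"]
EOF
```

If the `RUSTFLAGS` environment variable is also set, Cargo uses it instead of the `rustflags` value in this file, so the two forms should not be mixed.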
diff --git a/parquet/README.md b/parquet/README.md index fbb6e3e1b5d5..689a664b6326 100644 --- a/parquet/README.md +++ b/parquet/README.md @@ -19,17 +19,38 @@ # Apache Parquet Official Native Rust Implementation -[![Crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) +[![crates.io](https://img.shields.io/crates/v/parquet.svg)](https://crates.io/crates/parquet) +[![docs.rs](https://img.shields.io/docsrs/parquet.svg)](https://docs.rs/parquet/latest/parquet/) This crate contains the official Native Rust implementation of [Apache Parquet](https://parquet.apache.org/), which is part of the [Apache Arrow](https://arrow.apache.org/) project. See [crate documentation](https://docs.rs/parquet/latest/parquet/) for examples and the full API. -## Rust Version Compatbility +## Rust Version Compatibility This crate is tested with the latest stable version of Rust. We do not currently test against other, older versions of the Rust compiler. -## Features +## Versioning / Releases + +The `parquet` crate follows the [SemVer standard](https://doc.rust-lang.org/cargo/reference/semver.html) defined by Cargo and works well within the Rust crate ecosystem. + +However, for historical reasons, this crate uses versions with major numbers greater than `0.x` (e.g. `19.0.0`), unlike many other crates in the Rust ecosystem which spend extended time releasing versions `0.x` to signal planned ongoing API changes. Minor parquet releases contain only compatible changes, while major releases may contain breaking API changes. 
+ +## Feature Flags + +The `parquet` crate provides the following features which may be enabled in your `Cargo.toml`: + +- `arrow` (default) - support for reading / writing [`arrow`](https://crates.io/crates/arrow) arrays to / from parquet +- `async` - support `async` APIs for reading parquet +- `json` - support for reading / writing `json` data to / from parquet +- `brotli` (default) - support for parquet using `brotli` compression +- `flate2` (default) - support for parquet using `gzip` compression +- `lz4` (default) - support for parquet using `lz4` compression +- `zstd` (default) - support for parquet using `zstd` compression +- `cli` - parquet [CLI tools](https://github.com/apache/arrow-rs/tree/master/parquet/src/bin) +- `experimental` - Experimental APIs which may change, even between minor releases + +## Parquet Feature Status - [x] All encodings supported - [x] All compression codecs supported diff --git a/parquet/src/lib.rs b/parquet/src/lib.rs index d4eaaf41686a..5ee43f8ad6fb 100644 --- a/parquet/src/lib.rs +++ b/parquet/src/lib.rs @@ -19,6 +19,9 @@ //! [Apache Parquet](https://parquet.apache.org/), part of //! the [Apache Arrow](https://arrow.apache.org/) project. //! +//! Please see the [parquet crates.io](https://crates.io/crates/parquet) +//! page for feature flags and tips to improve performance. +//! //! # Getting Started //! Start with some examples: //!
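Note on the new Feature Flags sections: both READMEs list the flags but do not show how to select them in `Cargo.toml`. The sketch below writes a throwaway manifest that drops the default `parquet` features and opts back in to `arrow` and `zstd`, two flags taken from the list in this diff. The package name and scratch directory are hypothetical, and `19.0.0` is only the example version quoted in the README text, so substitute the release you actually use.

```shell
# Sketch: write a throwaway manifest showing explicit feature selection.
# Assumption: the scratch directory and package name are illustrative only.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/Cargo.toml" <<'EOF'
[package]
name = "parquet-features-demo"
version = "0.1.0"
edition = "2021"

[dependencies]
# Turn off the default features, then enable only the ones needed.
# "19.0.0" is the example version from the README, not a recommendation.
parquet = { version = "19.0.0", default-features = false, features = ["arrow", "zstd"] }
EOF
echo "$demo_dir/Cargo.toml"
```

Disabling defaults this way leaves out the other default codecs (`brotli`, `flate2`, `lz4`), which may shorten builds when only one compression format is read or written.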