Update arrow module docs #1840

Merged
merged 1 commit into from
Jun 10, 2022
113 changes: 88 additions & 25 deletions arrow/src/array/mod.rs
@@ -15,40 +15,66 @@
// specific language governing permissions and limitations
// under the License.

//! The central type in Apache Arrow are arrays, represented
//! by the [`Array` trait](crate::array::Array).
//! An array represents a known-length sequence of values all
//! having the same type.
//! The central types in Apache Arrow are arrays, which are known-length sequences of values
//! all having the same type. This module provides concrete implementations of each type, as
//! well as an [`Array`] trait that can be used for type-erasure.
//!
//! Internally, those values are represented by one or several
//! [buffers](crate::buffer::Buffer), the number and meaning
//! of which depend on the array’s data type, as documented in
//! [the Arrow data layout specification](https://arrow.apache.org/docs/format/Columnar.html).
//! For example, the type `Int16Array` represents an Apache
//! Arrow array of 16-bit integers.
//! # Downcasting an Array
Contributor Author

I feel it is important to highlight this from the outset, as it can be unclear for a new user given an Array how to actually do something with it 😆
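
To make the mechanism behind this section concrete, here is a minimal std-only sketch of the same `as_any().downcast_ref()` pattern. The `Column`, `IntColumn`, and `TextColumn` names are hypothetical stand-ins invented for illustration, not arrow types; the sketch only assumes that the real `Array` trait exposes an `as_any` hook in the same way the documented example uses it. Unlike the `unwrap()` in the docs, it returns `None` on a type mismatch.

```rust
use std::any::Any;

// Hypothetical stand-in for a type-erased array trait: only `as_any` is modelled.
trait Column {
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical concrete column; `None` plays the role of a null slot.
struct IntColumn(Vec<Option<i32>>);

impl Column for IntColumn {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// A second hypothetical column type, used to show a failed downcast.
#[allow(dead_code)]
struct TextColumn(Vec<String>);

impl Column for TextColumn {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Sums the non-null values, treating nulls as 0.
// Returns `None` instead of panicking when the column is not an `IntColumn`.
fn sum_ints(column: &dyn Column) -> Option<i32> {
    let ints = column.as_any().downcast_ref::<IntColumn>()?;
    Some(ints.0.iter().map(|v| v.unwrap_or_default()).sum())
}
```

Calling `sum_ints` on an `IntColumn` yields `Some(sum)`, while any other column type yields `None`, which may be preferable to `unwrap()` when the concrete type is not known up front.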

//!
//! Those buffers consist of the value data itself and an
//! optional [bitmap buffer](crate::bitmap::Bitmap) that
//! indicates which array entries are null values.
//! The bitmap buffer can be entirely omitted if the array is
//! known to have zero null values.
//! Arrays are often passed around as a dynamically typed [`&dyn Array`] or [`ArrayRef`].
//! For example, [`RecordBatch`](`crate::record_batch::RecordBatch`) stores columns as [`ArrayRef`].
//!
//! There are concrete implementations of this trait for each
//! data type, that help you access individual values of the
//! array.
//! While these arrays can be passed directly to the [`compute`](crate::compute),
//! [`csv`](crate::csv), [`json`](crate::json), and similar APIs, you will often want
//! to interact with the data directly. This requires downcasting to the concrete type of the array:
//!
//! ```
//! # use arrow::array::{Array, Float32Array, Int32Array};
//! #
//! fn sum_int32(array: &dyn Array) -> i32 {
//!     let integers: &Int32Array = array.as_any().downcast_ref().unwrap();
//!     integers.iter().map(|val| val.unwrap_or_default()).sum()
//! }
//!
//! // Note: the values for positions corresponding to nulls will be arbitrary
//! fn as_f32_slice(array: &dyn Array) -> &[f32] {
//!     array.as_any().downcast_ref::<Float32Array>().unwrap().values()
//! }
//! ```
//!
//! # Building an Array
//!
//! Arrow's `Arrays` are immutable, but there is the trait
//! [`ArrayBuilder`](crate::array::ArrayBuilder)
//! that helps you with constructing new `Arrays`. As with the
//! `Array` trait, there are builder implementations for all
//! concrete array types.
//! Most [`Array`] implementations can be constructed directly from iterators or [`Vec`].
//!
//! # Example
//! ```
//! use arrow::array::Int16Array;
//! # use arrow::array::Int32Array;
//! # use arrow::array::StringArray;
//! # use arrow::array::ListArray;
//! # use arrow::datatypes::Int32Type;
//! #
//! Int32Array::from(vec![1, 2]);
Contributor Author

These new APIs are 👌

Contributor

Agree -- this is very nice

//! Int32Array::from(vec![Some(1), None]);
//! Int32Array::from_iter([1, 2, 3, 4]);
//! Int32Array::from_iter([Some(1), Some(2), None, Some(4)]);
//!
//! StringArray::from(vec!["foo", "bar"]);
//! StringArray::from(vec![Some("foo"), None]);
//! StringArray::from_iter([Some("foo"), None]);
//! StringArray::from_iter_values(["foo", "bar"]);
//!
//! ListArray::from_iter_primitive::<Int32Type, _, _>([
//!     Some(vec![Some(1), None, Some(3)]),
//!     None,
//!     Some(vec![])
//! ]);
//! ```
//!
//! Additionally, [`ArrayBuilder`](crate::array::ArrayBuilder) implementations can be
//! used to construct arrays with a push-based interface:
//!
//! ```
//! # use arrow::array::Int16Array;
//! #
//! // Create a new builder with a capacity of 100
//! let mut builder = Int16Array::builder(100);
//!
@@ -78,6 +104,43 @@
//! "Get slice of len 2 starting at idx 3"
//! )
//! ```
//!
//! # Zero-Copy Slicing
//!
//! Given an [`Array`] of arbitrary length, it is possible to create an owned slice of this
//! data. Internally this only increments some reference counts rather than copying data,
//! so it is very cheap.
//!
//! ```rust
//! # use std::sync::Arc;
//! # use arrow::array::{Array, Int32Array, ArrayRef};
//! let array = Arc::new(Int32Array::from_iter([1, 2, 3])) as ArrayRef;
//!
//! // Slice with offset 1 and length 2
//! let sliced = array.slice(1, 2);
Contributor Author

It is perhaps a little bit unfortunate that this returns ArrayRef even when called on a concrete type, but then again I'm not sure of many use-cases for slicing concretely type arrays 😆
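
As background for why slicing is cheap whatever type it returns, the following std-only sketch models a slice as a shared buffer plus an offset and a length. `SharedInts` and its methods are hypothetical names chosen for illustration; this is an assumption-laden miniature of the idea described in the docs, not arrow's actual implementation.

```rust
use std::sync::Arc;

// Hypothetical miniature of a sliceable array: a shared buffer plus a window into it.
#[derive(Clone)]
struct SharedInts {
    buffer: Arc<[i32]>,
    offset: usize,
    len: usize,
}

impl SharedInts {
    // Moves the values into a reference-counted buffer covering the whole range.
    fn new(values: Vec<i32>) -> Self {
        let len = values.len();
        Self { buffer: values.into(), offset: 0, len }
    }

    // Creates an owned view: the buffer's reference count is bumped and
    // the offset and length are adjusted; no element is copied.
    fn slice(&self, offset: usize, len: usize) -> Self {
        assert!(offset + len <= self.len, "slice out of bounds");
        Self {
            buffer: Arc::clone(&self.buffer),
            offset: self.offset + offset,
            len,
        }
    }

    // Returns the visible window of the shared buffer.
    fn values(&self) -> &[i32] {
        &self.buffer[self.offset..self.offset + self.len]
    }
}
```

In this sketch, slicing `[1, 2, 3]` with offset 1 and length 2 exposes `[2, 3]` while both handles still point at the same allocation, mirroring the assertion in the documented example.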

//! let ints = sliced.as_any().downcast_ref::<Int32Array>().unwrap();
//! assert_eq!(ints.values(), &[2, 3]);
//! ```
//!
//! # Internal Representation
//!
//! Internally, arrays are represented by one or more [`Buffer`]s, the number and meaning of
//! which depend on the array’s data type, as documented in the [Arrow specification].
//!
//! For example, the type `Int16Array` represents an array of 16-bit integers and consists of:
//!
//! * An optional [`Bitmap`] identifying any null values
//! * A contiguous [`Buffer`] of 16-bit integers
//!
//! Similarly, the type `StringArray` represents an array of UTF-8 strings and consists of:
//!
//! * An optional [`Bitmap`] identifying any null values
//! * An offsets [`Buffer`] of 32-bit integers identifying valid UTF-8 sequences within the values buffer
//! * A values [`Buffer`] of UTF-8 encoded string data
//!
//! [Arrow specification]: https://arrow.apache.org/docs/format/Columnar.html
//! [`&dyn Array`]: Array
//! [`Bitmap`]: crate::bitmap::Bitmap
//! [`Buffer`]: crate::buffer::Buffer

#[allow(clippy::module_inception)]
mod array;
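
The three-part `StringArray` layout described in the new "Internal Representation" docs can be illustrated with a small std-only sketch. `Utf8Column` and its methods are hypothetical names, and a `Vec<bool>` stands in for the packed validity bitmap purely to keep the example short; treat it as a rough model of the layout, not as arrow's data structures.

```rust
// Hypothetical miniature of the validity + offsets + values layout.
struct Utf8Column {
    // Assumption: a Vec<bool> replaces the packed bitmap; None means "no nulls".
    validity: Option<Vec<bool>>,
    // len + 1 entries; value i spans offsets[i]..offsets[i + 1] in `values`.
    offsets: Vec<i32>,
    // All strings concatenated as UTF-8 bytes.
    values: Vec<u8>,
}

impl Utf8Column {
    fn from_options(items: &[Option<&str>]) -> Self {
        let mut offsets = vec![0i32];
        let mut values = Vec::new();
        let mut validity = Vec::with_capacity(items.len());
        for item in items {
            // A null contributes no bytes, so its start and end offsets coincide.
            if let Some(s) = item {
                values.extend_from_slice(s.as_bytes());
            }
            validity.push(item.is_some());
            offsets.push(values.len() as i32);
        }
        // Omit the validity data entirely when there are no nulls.
        let validity = if validity.iter().all(|v| *v) { None } else { Some(validity) };
        Self { validity, offsets, values }
    }

    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    // Looks up element i by consulting validity first, then the offsets.
    fn get(&self, i: usize) -> Option<&str> {
        if let Some(validity) = &self.validity {
            if !validity[i] {
                return None;
            }
        }
        let start = self.offsets[i] as usize;
        let end = self.offsets[i + 1] as usize;
        std::str::from_utf8(&self.values[start..end]).ok()
    }
}
```

Building the column from `[Some("foo"), None, Some("ba")]` produces offsets `[0, 3, 3, 5]` over the bytes of `"fooba"`, which shows how a null occupies an empty range while the validity data records that it is null rather than an empty string.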