This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Improved API consistency and docs (#833)
jorgecarleitao authored Feb 14, 2022
1 parent 9e6924b commit 9e3c3d0
Showing 6 changed files with 38 additions and 22 deletions.
18 changes: 9 additions & 9 deletions guide/src/high_level.md
@@ -6,7 +6,7 @@ Contrarily to `Arc<Vec<Option<T>>`, arrays in this crate are represented in such
that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).

Probably the simplest `Array` in this crate is the `PrimitiveArray<T>`. It can be
-constructed as from a slice of option values,
+constructed from a slice of option values,

```rust
# use arrow2::array::{Array, PrimitiveArray};
@@ -36,13 +36,13 @@ assert_eq!(array.len(), 3)
# }
```

-A `PrimitiveArray` has 3 components:
+A `PrimitiveArray` (and every `Array` implemented in this crate) has 3 components:

1. A physical type (e.g. `i32`)
2. A logical type (e.g. `DataType::Int32`)
3. Data

-The main differences from a `Vec<Option<T>>` are:
+The main differences from a `Arc<Vec<Option<T>>>` are:

* Its data is laid out in memory as a `Buffer<T>` and an `Option<Bitmap>` (see [../low_level.md])
* It has an associated logical type (`DataType`).
@@ -84,16 +84,16 @@ The following arrays are supported:
* `Utf8Array<i32>` and `Utf8Array<i64>` (for strings)
* `BinaryArray<i32>` and `BinaryArray<i64>` (for opaque binaries)
* `FixedSizeBinaryArray` (like `BinaryArray`, but fixed size)
-* `ListArray<i32>` and `ListArray<i64>` (nested arrays)
-* `FixedSizeListArray` (nested arrays of fixed size)
-* `StructArray` (every row has multiple logical types)
+* `ListArray<i32>` and `ListArray<i64>` (array of arrays)
+* `FixedSizeListArray` (array of arrays of a fixed size)
+* `StructArray` (multiple named arrays where each row has one element from each array)
* `UnionArray` (every row has a different logical type)
* `DictionaryArray<K>` (nested array with encoded values)

## Array as a trait object

`Array` is object safe, and all implementations of `Array` and can be casted
-to `&dyn Array`, which enables run-time nesting.
+to `&dyn Array`, which enables dynamic casting and run-time nesting.

```rust
# use arrow2::array::{Array, PrimitiveArray};
@@ -177,8 +177,8 @@ This crate's APIs are generally split into two patterns: whether an operation le
contiguous memory regions or whether it does not.

What this means is that certain operations can be performed irrespectively of whether a value
-is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values via SIMD and
-only copy the validity bitmap independently).
+is "null" or not (e.g. `PrimitiveArray<i32> + i32` can be applied to _all_ values
+via SIMD and only copy the validity bitmap independently).

When an operation benefits from such arrangement, it is advantageous to use

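The null-independent arithmetic described in the hunk above (`PrimitiveArray<i32> + i32`) can be illustrated with a minimal, standard-library-only sketch. `SimpleArray` and `add_scalar` are hypothetical names invented for this illustration and are not part of the crate's API; a `Vec<bool>` stands in for the validity bitmap, and wrapping addition is an assumption made to keep the sketch panic-free.

```rust
/// Hypothetical, simplified stand-in for a primitive array:
/// contiguous values plus an optional validity mask.
/// (The guide describes the real layout as a `Buffer<T>` and an `Option<Bitmap>`.)
#[derive(Debug, Clone, PartialEq)]
struct SimpleArray {
    values: Vec<i32>,
    validity: Option<Vec<bool>>,
}

/// Adds `rhs` to every slot, whether it is valid or null,
/// and carries the validity mask over unchanged.
fn add_scalar(array: &SimpleArray, rhs: i32) -> SimpleArray {
    // No per-element null check: a tight loop over contiguous values,
    // which is the shape of loop a compiler can vectorize.
    let values: Vec<i32> = array.values.iter().map(|v| v.wrapping_add(rhs)).collect();
    SimpleArray {
        values,
        // The validity is copied independently of the values.
        validity: array.validity.clone(),
    }
}
```

Under these assumptions, calling `add_scalar` with `10` on values `[1, 0, 3]` and validity `[true, false, true]` yields values `[11, 10, 13]` with the same validity; the slot behind the null is computed too, but it stays masked out.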
12 changes: 10 additions & 2 deletions guide/src/low_level.md
@@ -5,7 +5,7 @@ The starting point of this crate is the idea that data is stored in memory in a
The most important design aspect of this crate is that contiguous regions are shared via an
`Arc`. In this context, the operation of slicing a memory region is `O(1)` because it
corresponds to changing an offset and length. The tradeoff is that once under
-an `Arc`, memory regions are immutable.
+an `Arc`, memory regions are immutable. See note below on how to overcome this.

The second most important aspect is that Arrow has two main types of data buffers: bitmaps,
whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are
@@ -55,7 +55,8 @@ interoperable in-memory format.
## Bitmaps

Arrow's in-memory arrangement of boolean values is different from `Vec<bool>`. Specifically,
-arrow uses individual bits to represent a boolean, as opposed to the usual byte that `bool` holds.
+arrow uses individual bits to represent a boolean, as opposed to the usual byte
+that `bool` holds.
Besides the 8x compression, this makes the validity particularly useful for
[AVX512](https://en.wikipedia.org/wiki/AVX-512) masks.
One tradeoff is that an arrows' bitmap is not represented as a Rust slice, as Rust slices use
@@ -86,3 +87,10 @@ x.set(1, true);
assert_eq!(x.get(1), true);
# }
```
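To make the bit-versus-byte distinction in the Bitmaps section concrete, here is a minimal sketch of a bit-packed boolean container using only the standard library. `PackedBits` is a hypothetical type written for this illustration; it is not the crate's `MutableBitmap`, and only the least-significant-bit-first ordering is taken from the Arrow format.

```rust
/// Hypothetical minimal bit-packed boolean container (illustration only).
/// Bits are stored least-significant-bit first within each byte.
struct PackedBits {
    bytes: Vec<u8>,
    /// Number of bits; it need not be a multiple of 8.
    len: usize,
}

impl PackedBits {
    /// Creates `len` bits, all unset, using one byte per 8 bits (rounded up).
    fn new_zeroed(len: usize) -> Self {
        Self { bytes: vec![0u8; (len + 7) / 8], len }
    }

    /// Reads bit `i`: select the byte with `i / 8`, then the bit with `i % 8`.
    fn get(&self, i: usize) -> bool {
        assert!(i < self.len);
        (self.bytes[i / 8] & (1u8 << (i % 8))) != 0
    }

    /// Sets or clears bit `i` without touching its neighbours.
    fn set(&mut self, i: usize, value: bool) {
        assert!(i < self.len);
        let mask = 1u8 << (i % 8);
        if value {
            self.bytes[i / 8] |= mask;
        } else {
            self.bytes[i / 8] &= !mask;
        }
    }
}
```

Ten booleans fit in two bytes here, against ten bytes for a `Vec<bool>`. Because a single bit has no address of its own, access goes through `get` and `set` rather than through a slice.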

+## Copy on write (COW) semantics
+
+Both `Buffer` and `Bitmap` support copy on write semantics via `into_mut`, that may convert
+them to a `Vec` or `MutableBitmap` respectively.
+
+This allows re-using them to e.g. perform multiple operations without allocations.
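The following standard-library-only sketch models the `O(1)` slicing and the `into_mut` conversion described in this file. `SharedBuffer` and the local `Either` enum are hypothetical names written for this illustration, not the crate's implementation. The FFI condition is not modelled, and the extra length check is an assumption of this sketch.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the `Either` type returned by `into_mut`.
#[derive(Debug)]
enum Either<L, R> {
    Left(L),
    Right(R),
}

/// Hypothetical, simplified shared buffer: an `Arc`-backed region
/// plus an offset and a length.
#[derive(Debug, Clone)]
struct SharedBuffer {
    data: Arc<Vec<i32>>,
    offset: usize,
    length: usize,
}

impl SharedBuffer {
    fn from_vec(values: Vec<i32>) -> Self {
        let length = values.len();
        Self { data: Arc::new(values), offset: 0, length }
    }

    /// O(1) slice: only the offset and length change; the allocation is shared.
    fn slice(&self, offset: usize, length: usize) -> Self {
        assert!(offset + length <= self.length);
        Self { data: Arc::clone(&self.data), offset: self.offset + offset, length }
    }

    /// Returns the inner `Vec` when this buffer is unsliced and uniquely owned;
    /// otherwise hands the buffer back unchanged.
    fn into_mut(self) -> Either<Self, Vec<i32>> {
        // A sliced buffer cannot give up the whole allocation.
        if self.offset != 0 || self.length != self.data.len() {
            return Either::Left(self);
        }
        let SharedBuffer { data, offset, length } = self;
        // `try_unwrap` succeeds only if no other clone of the `Arc` is alive.
        match Arc::try_unwrap(data) {
            Ok(values) => Either::Right(values),
            Err(data) => Either::Left(SharedBuffer { data, offset, length }),
        }
    }
}
```

In this model, `into_mut` returns `Left` while a clone or a slice of the buffer is alive, and `Right` with the original `Vec` once the buffer is the sole owner. A caller can then mutate that `Vec` in place instead of allocating a new one.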
4 changes: 2 additions & 2 deletions src/array/primitive/mod.rs
@@ -196,7 +196,7 @@ impl<T: NativeType> PrimitiveArray<T> {
self.values,
Some(bitmap),
)),
-Right(mutable_bitmap) => match self.values.get_vec() {
+Right(mutable_bitmap) => match self.values.into_mut() {
Left(buffer) => Left(PrimitiveArray::from_data(
self.data_type,
buffer,
@@ -210,7 +210,7 @@ impl<T: NativeType> PrimitiveArray<T> {
},
}
} else {
-match self.values.get_vec() {
+match self.values.into_mut() {
Left(buffer) => Left(PrimitiveArray::from_data(self.data_type, buffer, None)),
Right(values) => Right(MutablePrimitiveArray::from_data(
self.data_type,
4 changes: 2 additions & 2 deletions src/array/utf8/mod.rs
@@ -218,7 +218,7 @@ impl<O: Offset> Utf8Array<O> {
self.values,
Some(bitmap),
)),
-Right(mutable_bitmap) => match (self.values.get_vec(), self.offsets.get_vec()) {
+Right(mutable_bitmap) => match (self.values.into_mut(), self.offsets.into_mut()) {
(Left(immutable_values), Left(immutable_offsets)) => {
Left(Utf8Array::from_data(
self.data_type,
@@ -250,7 +250,7 @@ impl<O: Offset> Utf8Array<O> {
},
}
} else {
-match (self.values.get_vec(), self.offsets.get_vec()) {
+match (self.values.into_mut(), self.offsets.into_mut()) {
(Left(immutable_values), Left(immutable_offsets)) => Left(Utf8Array::from_data(
self.data_type,
immutable_offsets,
8 changes: 7 additions & 1 deletion src/bitmap/immutable.rs
@@ -177,7 +177,13 @@ impl Bitmap {
self.offset
}

-/// Try to convert this `Bitmap` to a `MutableBitmap`
+/// Converts this [`Bitmap`] to [`MutableBitmap`], returning itself if the conversion
+/// is not possible
+///
+/// This operation returns a [`MutableBitmap`] iff:
+/// * this [`Bitmap`] is not an offsetted slice of another [`Bitmap`]
+/// * this [`Bitmap`] has not been cloned (i.e. [`Arc`]`::get_mut` yields [`Some`])
+/// * this [`Bitmap`] was not imported from the c data interface (FFI)
pub fn into_mut(mut self) -> Either<Self, MutableBitmap> {
match (
self.offset,
14 changes: 8 additions & 6 deletions src/buffer/immutable.rs
@@ -131,12 +131,14 @@ impl<T: NativeType> Buffer<T> {
self.offset
}

-/// Try to get the inner data as a mutable [`Vec<T>`].
-/// This succeeds iff:
-/// * This data was allocated by Rust (i.e. it does not come from the C data interface)
-/// * This region is not being shared any other struct.
-/// * This buffer has no offset
-pub fn get_vec(mut self) -> Either<Self, Vec<T>> {
+/// Converts this [`Buffer`] to [`Vec`], returning itself if the conversion
+/// is not possible
+///
+/// This operation returns a [`Vec`] iff this [`Buffer`]:
+/// * is not an offsetted slice of another [`Buffer`]
+/// * has not been cloned (i.e. [`Arc`]`::get_mut` yields [`Some`])
+/// * has not been imported from the c data interface (FFI)
+pub fn into_mut(mut self) -> Either<Self, Vec<T>> {
if self.offset != 0 {
Either::Left(self)
} else {