diff --git a/guide/src/high_level.md b/guide/src/high_level.md
index c5887e7c02d..80652038b0f 100644
--- a/guide/src/high_level.md
+++ b/guide/src/high_level.md
@@ -6,7 +6,7 @@ Contrarily to `Arc>`, arrays in this crate are represented in such
 that they can be zero-copied to any other Arrow implementation via foreign interfaces (FFI).
 
 Probably the simplest `Array` in this crate is the `PrimitiveArray`. It can be
-constructed as from a slice of option values,
+constructed from a slice of option values,
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -36,13 +36,13 @@ assert_eq!(array.len(), 3)
 # }
 ```
 
-A `PrimitiveArray` has 3 components:
+A `PrimitiveArray` (and every `Array` implemented in this crate) has 3 components:
 
 1. A physical type (e.g. `i32`)
 2. A logical type (e.g. `DataType::Int32`)
 3. Data
 
-The main differences from a `Vec>` are:
+The main differences from a `Arc>>` are:
 
 * Its data is laid out in memory as a `Buffer` and an `Option` (see [../low_level.md])
 * It has an associated logical type (`DataType`).
@@ -84,16 +84,16 @@ The following arrays are supported:
 * `Utf8Array` and `Utf8Array` (for strings)
 * `BinaryArray` and `BinaryArray` (for opaque binaries)
 * `FixedSizeBinaryArray` (like `BinaryArray`, but fixed size)
-* `ListArray` and `ListArray` (nested arrays)
-* `FixedSizeListArray` (nested arrays of fixed size)
-* `StructArray` (every row has multiple logical types)
+* `ListArray` and `ListArray` (array of arrays)
+* `FixedSizeListArray` (array of arrays of a fixed size)
+* `StructArray` (multiple named arrays where each row has one element from each array)
 * `UnionArray` (every row has a different logical type)
 * `DictionaryArray` (nested array with encoded values)
 
 ## Array as a trait object
 
 `Array` is object safe, and all implementations of `Array` and can be casted
-to `&dyn Array`, which enables run-time nesting.
+to `&dyn Array`, which enables dynamic casting and run-time nesting.
 
 ```rust
 # use arrow2::array::{Array, PrimitiveArray};
@@ -177,8 +177,8 @@ This crate's APIs are generally split into two patterns: whether an operation le
 contiguous memory regions or whether it does not.
 
 What this means is that certain operations can be performed irrespectively of whether a value
-is "null" or not (e.g. `PrimitiveArray + i32` can be applied to _all_ values via SIMD and
-only copy the validity bitmap independently).
+is "null" or not (e.g. `PrimitiveArray + i32` can be applied to _all_ values
+via SIMD and only copy the validity bitmap independently).
 
 When an operation benefits from such arrangement, it is advantageous to use
 
diff --git a/guide/src/low_level.md b/guide/src/low_level.md
index aa638a704b6..db29f077d71 100644
--- a/guide/src/low_level.md
+++ b/guide/src/low_level.md
@@ -5,7 +5,7 @@ The starting point of this crate is the idea that data is stored in memory in a
 The most important design aspect of this crate is that contiguous regions are shared via an
 `Arc`. In this context, the operation of slicing a memory region is `O(1)` because it
 corresponds to changing an offset and length. The tradeoff is that once under
-an `Arc`, memory regions are immutable.
+an `Arc`, memory regions are immutable. See note below on how to overcome this.
 
 The second most important aspect is that Arrow has two main types of data buffers: bitmaps,
 whose offsets are measured in bits, and byte types (such as `i32`), whose offsets are
@@ -55,7 +55,8 @@ interoperable in-memory format.
 ## Bitmaps
 
 Arrow's in-memory arrangement of boolean values is different from `Vec`. Specifically,
-arrow uses individual bits to represent a boolean, as opposed to the usual byte that `bool` holds.
+arrow uses individual bits to represent a boolean, as opposed to the usual byte
+that `bool` holds.
 Besides the 8x compression, this makes the validity particularly useful for
 [AVX512](https://en.wikipedia.org/wiki/AVX-512) masks.
 One tradeoff is that an arrows' bitmap is not represented as a Rust slice, as Rust slices use
@@ -86,3 +87,10 @@ x.set(1, true);
 assert_eq!(x.get(1), true);
 # }
 ```
+
+## Copy on write (COW) semantics
+
+Both `Buffer` and `Bitmap` support copy on write semantics via `into_mut`, that may convert
+them to a `Vec` or `MutableBitmap` respectively.
+
+This allows re-using them to e.g. perform multiple operations without allocations.
diff --git a/src/array/primitive/mod.rs b/src/array/primitive/mod.rs
index 1c2563e3e02..6496caf4131 100644
--- a/src/array/primitive/mod.rs
+++ b/src/array/primitive/mod.rs
@@ -196,7 +196,7 @@ impl PrimitiveArray {
                     self.values,
                     Some(bitmap),
                 )),
-                Right(mutable_bitmap) => match self.values.get_vec() {
+                Right(mutable_bitmap) => match self.values.into_mut() {
                     Left(buffer) => Left(PrimitiveArray::from_data(
                         self.data_type,
                         buffer,
@@ -210,7 +210,7 @@ impl PrimitiveArray {
                 },
             }
         } else {
-            match self.values.get_vec() {
+            match self.values.into_mut() {
                 Left(buffer) => Left(PrimitiveArray::from_data(self.data_type, buffer, None)),
                 Right(values) => Right(MutablePrimitiveArray::from_data(
                     self.data_type,
diff --git a/src/array/utf8/mod.rs b/src/array/utf8/mod.rs
index 015941325ef..d8e58170241 100644
--- a/src/array/utf8/mod.rs
+++ b/src/array/utf8/mod.rs
@@ -218,7 +218,7 @@ impl Utf8Array {
                     self.values,
                     Some(bitmap),
                 )),
-                Right(mutable_bitmap) => match (self.values.get_vec(), self.offsets.get_vec()) {
+                Right(mutable_bitmap) => match (self.values.into_mut(), self.offsets.into_mut()) {
                     (Left(immutable_values), Left(immutable_offsets)) => {
                         Left(Utf8Array::from_data(
                             self.data_type,
@@ -250,7 +250,7 @@ impl Utf8Array {
                 },
             }
         } else {
-            match (self.values.get_vec(), self.offsets.get_vec()) {
+            match (self.values.into_mut(), self.offsets.into_mut()) {
                 (Left(immutable_values), Left(immutable_offsets)) => Left(Utf8Array::from_data(
                     self.data_type,
                     immutable_offsets,
diff --git a/src/bitmap/immutable.rs b/src/bitmap/immutable.rs
index 25d7f244968..4037b9f8336 100644
--- a/src/bitmap/immutable.rs
+++ b/src/bitmap/immutable.rs
@@ -177,7 +177,13 @@ impl Bitmap {
         self.offset
     }
 
-    /// Try to convert this `Bitmap` to a `MutableBitmap`
+    /// Converts this [`Bitmap`] to [`MutableBitmap`], returning itself if the conversion
+    /// is not possible
+    ///
+    /// This operation returns a [`MutableBitmap`] iff:
+    /// * this [`Bitmap`] is not an offsetted slice of another [`Bitmap`]
+    /// * this [`Bitmap`] has not been cloned (i.e. [`Arc`]`::get_mut` yields [`Some`])
+    /// * this [`Bitmap`] was not imported from the c data interface (FFI)
     pub fn into_mut(mut self) -> Either {
         match (
             self.offset,
diff --git a/src/buffer/immutable.rs b/src/buffer/immutable.rs
index ec67febbe9f..fd98d562f8c 100644
--- a/src/buffer/immutable.rs
+++ b/src/buffer/immutable.rs
@@ -131,12 +131,14 @@ impl Buffer {
         self.offset
     }
 
-    /// Try to get the inner data as a mutable [`Vec`].
-    /// This succeeds iff:
-    /// * This data was allocated by Rust (i.e. it does not come from the C data interface)
-    /// * This region is not being shared any other struct.
-    /// * This buffer has no offset
-    pub fn get_vec(mut self) -> Either> {
+    /// Converts this [`Buffer`] to [`Vec`], returning itself if the conversion
+    /// is not possible
+    ///
+    /// This operation returns a [`Vec`] iff this [`Buffer`]:
+    /// * is not an offsetted slice of another [`Buffer`]
+    /// * has not been cloned (i.e. [`Arc`]`::get_mut` yields [`Some`])
+    /// * has not been imported from the c data interface (FFI)
+    pub fn into_mut(mut self) -> Either> {
         if self.offset != 0 {
             Either::Left(self)
         } else {