Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Commit

Permalink
Improved docs.
Browse files Browse the repository at this point in the history
  • Loading branch information
jorgecarleitao committed Sep 29, 2021
1 parent 3c6c075 commit 5a15d04
Show file tree
Hide file tree
Showing 16 changed files with 134 additions and 91 deletions.
2 changes: 1 addition & 1 deletion src/array/growable/list.rs
Original file line number Diff line number Diff line change
Expand Up @@ -65,7 +65,7 @@ pub struct GrowableList<'a, O: Offset> {
}

impl<'a, O: Offset> GrowableList<'a, O> {
/// Creates a new [`GrowableFixedSizeBinary`] bound to `arrays` with a pre-allocated `capacity`.
/// Creates a new [`GrowableList`] bound to `arrays` with a pre-allocated `capacity`.
/// # Panics
/// If `arrays` is empty.
pub fn new(arrays: Vec<&'a ListArray<O>>, mut use_validity: bool, capacity: usize) -> Self {
Expand Down
25 changes: 14 additions & 11 deletions src/array/mod.rs
Original file line number Diff line number Diff line change
@@ -1,18 +1,21 @@
//! fixed-length and immutable containers with optional values
//! Contains the [`Array`] and [`MutableArray`] trait objects declaring arrays,
//! as well as concrete arrays (such as [`Utf8Array`] and [`MutableUtf8Array`]).
//!
//! Fixed-length containers with optional values
//! that are layed in memory according to the Arrow specification.
//! Each array type has its own `struct`. The following are the main array types:
//! * [`PrimitiveArray`], an array of values with a fixed length such as integers, floats, etc.
//! * [`BooleanArray`], an array of boolean values (stored as a bitmap)
//! * [`Utf8Array`], an array of utf8 values
//! * [`BinaryArray`], an array of binary values
//! * [`ListArray`], an array of arrays (e.g. `[[1, 2], None, [], [None]]`)
//! * [`PrimitiveArray`] and [`MutablePrimitiveArray`], an array of values with a fixed length such as integers, floats, etc.
//! * [`BooleanArray`] and [`MutableBooleanArray`], an array of boolean values (stored as a bitmap)
//! * [`Utf8Array`] and [`MutableUtf8Array`], an array of variable length utf8 values
//! * [`BinaryArray`] and [`MutableBinaryArray`], an array of opaque variable length values
//! * [`ListArray`] and [`MutableListArray`], an array of arrays (e.g. `[[1, 2], None, [], [None]]`)
//! * [`StructArray`], an array of arrays identified by a string (e.g. `{"a": [1, 2], "b": [true, false]}`)
//! All arrays implement the trait [`Array`] and are often trait objects that can be downcasted
//! to a concrete struct based on [`DataType`] available from [`Array::data_type`].
//! Arrays share memory via [`crate::buffer::Buffer`] and thus cloning and slicing them `O(1)`.
//! All immutable arrays implement the trait object [`Array`] and that can be downcasted
//! to a concrete struct based on [`PhysicalType`](crate::datatypes::PhysicalType) available from [`Array::data_type`].
//! All immutable arrays are backed by [`Buffer`](crate::buffer::Buffer) and thus cloning and slicing them is `O(1)`.
//!
//! This module also contains the mutable counterparts of arrays, that are neither clonable nor slicable, but that
//! can be operated in-place, such as [`MutablePrimitiveArray`] and [`MutableUtf8Array`].
//! Most arrays contain a [`MutableArray`] counterpart that is neither clonable nor slicable, but
//! can be operated in-place.
use std::any::Any;
use std::fmt::Display;

Expand Down
4 changes: 1 addition & 3 deletions src/bitmap/mod.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,5 @@
#![deny(missing_docs)]
//! Contains efficient containers of booleans: [`Bitmap`] and [`MutableBitmap`].
//! The memory backing these containers is cache-aligned and optimized for both vertical
//! and horizontal operations over booleans.
//! contains [`Bitmap`] and [`MutableBitmap`], containers of `bool`.
mod immutable;
pub use immutable::*;

Expand Down
4 changes: 2 additions & 2 deletions src/buffer/mod.rs
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
#![deny(missing_docs)]
//! Contains containers for all Arrow sized types (e.g. `i32`),
//! [`Buffer`] and [`MutableBuffer`].
//! Contains [`Buffer`] and [`MutableBuffer`], containers for all Arrow
//! physical types (e.g. i32, f64).
mod immutable;
mod mutable;
Expand Down
2 changes: 1 addition & 1 deletion src/compute/comparison/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
//! inputs the two items for comparison and an [`Operator`] which specifies the
//! type of comparison that will be conducted, such as `<=` ([`Operator::LtEq`]).
//!
//! Much like the parent module [`crate::compute`](compute), the comparison functions
//! Much like the parent module [`compute`](crate::compute), the comparison functions
//! have two variants - a statically typed one ([`primitive_compare`])
//! which expects concrete types such as [`Int8Array`] and a dynamically typed
//! variant ([`compare`]) that compares values of type `&dyn Array` and errors
Expand Down
8 changes: 6 additions & 2 deletions src/compute/mod.rs
Original file line number Diff line number Diff line change
@@ -1,7 +1,11 @@
//! Contains operators over arrays. This module's general design is
//! contains a wide range of compute operations (e.g.
//! [`arithmetics`], [`aggregate`],
//! [`filter`], [`comparison`], and [`sort`])
//!
//! This module's general design is
//! that each operator has two interfaces, a statically-typed version and a dynamically-typed
//! version.
//! The statically-typed version expects concrete arrays (like `PrimitiveArray`);
//! The statically-typed version expects concrete arrays (such as [`PrimitiveArray`](crate::array::PrimitiveArray));
//! the dynamically-typed version expects `&dyn Array` and errors if the the type is not
//! supported.
//! Some dynamically-typed operators have an auxiliary function, `can_*`, that returns
Expand Down
2 changes: 1 addition & 1 deletion src/datatypes/mod.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
//! Metadata declarations such as [`DataType`], [`Field`] and [`Schema`].
//! Contains all metadata, such as [`PhysicalType`], [`DataType`], [`Field`] and [`Schema`].
mod field;
mod physical_type;
mod schema;
Expand Down
68 changes: 68 additions & 0 deletions src/doc/lib.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
Welcome to arrow2's documentation. Thanks for checking it out!

This is a library for efficient in-memory data operations using
[Arrow in-memory format](https://arrow.apache.org/docs/format/Columnar.html).
It is a re-write from the bottom up of the official `arrow` crate with soundness
and type safety in mind.

Check out [the guide](https://jorgecarleitao.github.io/arrow2/) for an introduction.
Below is an example of some of the things you can do with it:

```rust
use std::sync::Arc;

use arrow2::array::*;
use arrow2::compute::arithmetics;
use arrow2::error::Result;
use arrow2::io::parquet::write::*;
use arrow2::record_batch::RecordBatch;

fn main() -> Result<()> {
// declare arrays
let a = Int32Array::from(&[Some(1), None, Some(3)]);
let b = Int32Array::from(&[Some(2), None, Some(6)]);

// compute (probably the fastest implementation of a nullable op you can find out there)
let c = arithmetics::basic::mul_scalar(&a, &2);
assert_eq!(c, b);

// declare records
let batch = RecordBatch::try_from_iter([
("c1", Arc::new(a) as Arc<dyn Array>),
("c2", Arc::new(b) as Arc<dyn Array>),
])?;
// with metadata
println!("{:?}", batch.schema());

// write to parquet (probably the fastest implementation of writing to parquet out there)
let schema = batch.schema().clone();

let options = WriteOptions {
write_statistics: true,
compression: Compression::Snappy,
version: Version::V1,
};

let row_groups = RowGroupIterator::try_new(
vec![Ok(batch)].into_iter(),
&schema,
options,
vec![Encoding::Plain, Encoding::Plain],
)?;

// anything implementing `std::io::Write` works
let mut file = vec![];

let parquet_schema = row_groups.parquet_schema().clone();
let _ = write_file(
&mut file,
row_groups,
&schema,
parquet_schema,
options,
None,
)?;

Ok(())
}
```
10 changes: 5 additions & 5 deletions src/ffi/mod.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//! Contains interfaces to use the
//! [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html).
//! contains FFI bindings to import and export [`Array`](crate::array::Array) via
//! Arrow's [C Data Interface](https://arrow.apache.org/docs/format/CDataInterface.html)
mod array;
#[allow(clippy::module_inception)]
mod ffi;
Expand All @@ -19,16 +19,16 @@ pub use schema::Ffi_ArrowSchema;

use self::schema::to_field;

/// Exports an `Array` to the C data interface.
/// Exports an [`Arc<dyn Array>`] to the C data interface.
/// # Safety
/// The pointer must be allocated and valid
/// The pointer `ptr` must be allocated and valid
pub unsafe fn export_array_to_c(array: Arc<dyn Array>, ptr: *mut Ffi_ArrowArray) {
*ptr = Ffi_ArrowArray::new(array);
}

/// Exports a [`Field`] to the C data interface.
/// # Safety
/// The pointer must be allocated and valid
/// The pointer `ptr` must be allocated and valid
pub unsafe fn export_field_to_c(field: &Field, ptr: *mut Ffi_ArrowSchema) {
*ptr = Ffi_ArrowSchema::new(field)
}
Expand Down
3 changes: 2 additions & 1 deletion src/io/mod.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
//! Interact with different formats such as Arrow, CSV, parquet, etc.
//! Contains modules to interface with other formats such as [`csv`],
//! [`parquet`], [`json`], [`ipc`], [`mod@print`] and [`avro`].
#[cfg(any(feature = "io_csv_read", feature = "io_csv_write"))]
#[cfg_attr(docsrs, doc(cfg(feature = "io_csv")))]
pub mod csv;
Expand Down
2 changes: 1 addition & 1 deletion src/io/parquet/write/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -67,7 +67,7 @@ pub fn write_file<'a, W, I>(
key_value_metadata: Option<Vec<KeyValue>>,
) -> Result<u64>
where
W: std::io::Write + std::io::Seek,
W: std::io::Write,
I: Iterator<Item = Result<RowGroupIter<'a, ArrowError>>>,
{
let key_value_metadata = key_value_metadata
Expand Down
2 changes: 1 addition & 1 deletion src/io/parquet/write/stream.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ pub async fn write_stream<'a, W, I>(
key_value_metadata: Option<Vec<KeyValue>>,
) -> Result<u64>
where
W: std::io::Write + std::io::Seek,
W: std::io::Write,
I: Stream<Item = Result<RowGroupIter<'static, ArrowError>>>,
{
let key_value_metadata = key_value_metadata
Expand Down
3 changes: 1 addition & 2 deletions src/lib.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,4 @@
//! Doc provided by README
#![doc = include_str!("doc/lib.md")]
// So that we have more control over what is `unsafe` inside an `unsafe` block
#![allow(unused_unsafe)]
#![cfg_attr(docsrs, feature(doc_cfg))]
Expand Down
83 changes: 26 additions & 57 deletions src/record_batch.rs
Original file line number Diff line number Diff line change
@@ -1,58 +1,27 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! A two-dimensional batch of column-oriented data with a defined
//! [schema](crate::datatypes::Schema).
//! Contains [`RecordBatch`].
use std::sync::Arc;

use crate::array::*;
use crate::datatypes::*;
use crate::error::{ArrowError, Result};

type ArrayRef = Arc<dyn Array>;

/// A two-dimensional batch of column-oriented data with a defined
/// [schema](crate::datatypes::Schema).
///
/// A `RecordBatch` is a two-dimensional dataset of a number of
/// contiguous arrays, each the same length.
/// A record batch has a schema which must match its arrays'
/// datatypes.
///
/// Record batches are a convenient unit of work for various
/// serialization and computation functions, possibly incremental.
/// A two-dimensional dataset with a number of
/// columns ([`Array`]) and rows and defined [`Schema`](crate::datatypes::Schema).
/// # Implementation
/// Cloning is `O(C)` where `C` is the number of columns.
#[derive(Clone, Debug, PartialEq)]
pub struct RecordBatch {
schema: Arc<Schema>,
columns: Vec<ArrayRef>,
columns: Vec<Arc<dyn Array>>,
}

impl RecordBatch {
/// Creates a `RecordBatch` from a schema and columns.
///
/// Expects the following:
/// * the vec of columns to not be empty
/// * the schema and column data types to have equal lengths
/// and match
/// * each array in columns to have the same length
///
/// If the conditions are not met, an error is returned.
///
/// Creates a [`RecordBatch`] from a schema and columns.
/// # Errors
/// This function errors iff
/// * `columns` is empty
/// * the schema and column data types do not match
/// * `columns` have a different length
/// # Example
///
/// ```
Expand All @@ -73,22 +42,22 @@ impl RecordBatch {
/// # Ok(())
/// # }
/// ```
pub fn try_new(schema: Arc<Schema>, columns: Vec<ArrayRef>) -> Result<Self> {
pub fn try_new(schema: Arc<Schema>, columns: Vec<Arc<dyn Array>>) -> Result<Self> {
let options = RecordBatchOptions::default();
Self::validate_new_batch(&schema, columns.as_slice(), &options)?;
Ok(RecordBatch { schema, columns })
}

/// Creates a `RecordBatch` from a schema and columns, with additional options,
/// Creates a [`RecordBatch`] from a schema and columns, with additional options,
/// such as whether to strictly validate field names.
///
/// See [`RecordBatch::try_new`] for the expected conditions.
/// See [`fn@try_new`] for the expected conditions.
pub fn try_new_with_options(
schema: Arc<Schema>,
columns: Vec<ArrayRef>,
columns: Vec<Arc<dyn Array>>,
options: &RecordBatchOptions,
) -> Result<Self> {
Self::validate_new_batch(&schema, columns.as_slice(), options)?;
Self::validate_new_batch(&schema, &columns, options)?;
Ok(RecordBatch { schema, columns })
}

Expand All @@ -106,7 +75,7 @@ impl RecordBatch {
/// if any validation check fails.
fn validate_new_batch(
schema: &Schema,
columns: &[ArrayRef],
columns: &[Arc<dyn Array>],
options: &RecordBatchOptions,
) -> Result<()> {
// check that there are some columns
Expand Down Expand Up @@ -229,12 +198,12 @@ impl RecordBatch {
/// # Panics
///
/// Panics if `index` is outside of `0..num_columns`.
pub fn column(&self, index: usize) -> &ArrayRef {
pub fn column(&self, index: usize) -> &Arc<dyn Array> {
&self.columns[index]
}

/// Get a reference to all columns in the record batch.
pub fn columns(&self) -> &[ArrayRef] {
pub fn columns(&self) -> &[Arc<dyn Array>] {
&self.columns[..]
}

Expand All @@ -255,8 +224,8 @@ impl RecordBatch {
/// use arrow2::datatypes::DataType;
/// use arrow2::record_batch::RecordBatch;
///
/// let a: ArrayRef = Arc::new(Int32Array::from_slice(&[1, 2]));
/// let b: ArrayRef = Arc::new(Utf8Array::<i32>::from_slice(&["a", "b"]));
/// let a: Arc<dyn Array> = Arc::new(Int32Array::from_slice(&[1, 2]));
/// let b: Arc<dyn Array> = Arc::new(Utf8Array::<i32>::from_slice(&["a", "b"]));
///
/// let record_batch = RecordBatch::try_from_iter(vec![
/// ("a", a),
Expand All @@ -265,7 +234,7 @@ impl RecordBatch {
/// ```
pub fn try_from_iter<I, F>(value: I) -> Result<Self>
where
I: IntoIterator<Item = (F, ArrayRef)>,
I: IntoIterator<Item = (F, Arc<dyn Array>)>,
F: AsRef<str>,
{
// TODO: implement `TryFrom` trait, once
Expand All @@ -292,8 +261,8 @@ impl RecordBatch {
/// use arrow2::datatypes::DataType;
/// use arrow2::record_batch::RecordBatch;
///
/// let a: ArrayRef = Arc::new(Int32Array::from_slice(&[1, 2]));
/// let b: ArrayRef = Arc::new(Utf8Array::<i32>::from_slice(&["a", "b"]));
/// let a: Arc<dyn Array> = Arc::new(Int32Array::from_slice(&[1, 2]));
/// let b: Arc<dyn Array> = Arc::new(Utf8Array::<i32>::from_slice(&["a", "b"]));
///
/// // Note neither `a` nor `b` has any actual nulls, but we mark
/// // b an nullable
Expand All @@ -304,7 +273,7 @@ impl RecordBatch {
/// ```
pub fn try_from_iter_with_nullable<I, F>(value: I) -> Result<Self>
where
I: IntoIterator<Item = (F, ArrayRef, bool)>,
I: IntoIterator<Item = (F, Arc<dyn Array>, bool)>,
F: AsRef<str>,
{
// TODO: implement `TryFrom` trait, once
Expand Down
4 changes: 2 additions & 2 deletions src/scalar/mod.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
//! Declares the [`Scalar`] API, an optional, trait object representing
//! the zero-dimension of an [`crate::array::Array`].
//! contains the [`Scalar`] trait object representing individual items of [`Array`](crate::array::Array)s,
//! as well as concrete implementations such as [`BooleanScalar`].
use std::any::Any;

use crate::{array::*, datatypes::*};
Expand Down
3 changes: 2 additions & 1 deletion src/types/mod.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
//! traits to handle _all native types_ used in this crate.
//! Traits and implementations to handle _all types_ used in this crate.
//!
//! Most physical types used in this crate are native Rust types, like `i32`.
//! The most important trait is [`NativeType`], the generic trait of [`crate::array::PrimitiveArray`].
//!
Expand Down

0 comments on commit 5a15d04

Please sign in to comment.