
Commit 4f6446f

Add downcasting
1 parent d80d48c · commit 4f6446f

1 file changed (+60 -10)

docs/source/user-guide/arrow-introduction.md

Lines changed: 60 additions & 10 deletions
@@ -17,7 +17,7 @@
   under the License.
 -->

-# Introduction to Apache Arrow
+# Gentle Arrow Introduction

 ```{contents}
 :local:
@@ -46,7 +46,7 @@ Traditional Row Storage:           Arrow Columnar Storage:
 (read entire rows)                 (process entire columns at once)
 ```

-## `RecordBatch`
+## `RecordBatch`

 Arrow's standard unit for packaging data is the **[`RecordBatch`]**.

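As a minimal sketch of that packaging (the column name and values here are illustrative, using arrow-rs's `RecordBatch::try_from_iter`, which infers the schema from the arrays):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, RecordBatch};

// A RecordBatch is just equal-length columns plus a shared schema
let id_col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let batch = RecordBatch::try_from_iter(vec![("id", id_col)]).unwrap();

assert_eq!(batch.num_rows(), 3);
assert_eq!(batch.num_columns(), 1);
```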
@@ -65,10 +65,16 @@ This design allows DataFusion to process streams of row-based chunks while gaini

 DataFusion processes queries as pull-based pipelines where operators request batches from their inputs. This streaming approach enables early result production, bounds memory usage (spilling to disk only when necessary), and naturally supports parallel execution across multiple CPU cores.

+For example, given the following query:
+
+```sql
+SELECT name FROM 'data.parquet' WHERE id > 10
+```
+
+The DataFusion Pipeline looks like this:
+
 ```text
-A user's query: SELECT name FROM 'data.parquet' WHERE id > 10
-
-The DataFusion Pipeline:
 ┌─────────────┐    ┌──────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────┐
 │   Parquet   │───▶│     Scan     │───▶│     Filter     │───▶│    Projection    │───▶│ Results  │
 │    File     │    │   Operator   │    │    Operator    │    │     Operator     │    │          │
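To make the pull-based loop concrete, here is a hypothetical sketch of driving such a pipeline from user code. It assumes the `datafusion`, `tokio`, and `futures` crates, and registers `data.parquet` under the illustrative table name `data`; `execute_stream` yields one `RecordBatch` at a time:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the Parquet file so SQL can refer to it by name
    ctx.register_parquet("data", "data.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT name FROM data WHERE id > 10").await?;

    // Pull RecordBatches one at a time rather than collecting them all
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        println!("received {} rows", batch?.num_rows());
    }
    Ok(())
}
```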
@@ -81,7 +87,7 @@ In this pipeline, [`RecordBatch`]es are the "packages" of columnar data that flo

 ## Creating `ArrayRef` and `RecordBatch`es

-Sometimes you need to create Arrow data programmatically rather than reading from files. 
+Sometimes you need to create Arrow data programmatically rather than reading from files.

 The first thing needed is creating an Arrow Array for each column. [arrow-rs] provides array builders and `From` impls to create arrays from Rust vectors.

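The `From` impls appear in the hunk below; the builder path looks roughly like this (a minimal sketch, assuming arrow-rs's `Int32Builder` and `StringBuilder`):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Builder, StringBuilder};

// Builders grow an array value-by-value, including null slots
let mut ids = Int32Builder::new();
ids.append_value(1);
ids.append_value(2);

let mut names = StringBuilder::new();
names.append_value("alice");
names.append_null(); // NULL in the second slot

let id_col: ArrayRef = Arc::new(ids.finish());
let name_col: ArrayRef = Arc::new(names.finish());
assert_eq!(id_col.len(), name_col.len());
```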
@@ -126,21 +132,66 @@ use arrow_schema::{DataType, Field, Schema};
 // Create the columns as Arrow arrays
 let ids = Int32Array::from(vec![1, 2, 3]);
 let names = StringArray::from(vec![Some("alice"), None, Some("carol")]);
-
-// Create the schema
+// Create the schema
 let schema = Arc::new(Schema::new(vec![
     Field::new("id", DataType::Int32, false), // false means non-nullable
     Field::new("name", DataType::Utf8, true), // true means nullable
 ]));
-
 // Assemble the columns
 let cols: Vec<ArrayRef> = vec![
-    Arc::new(ids),
+    Arc::new(ids),
     Arc::new(names)
 ];
+// Finally, create the RecordBatch
 RecordBatch::try_new(schema, cols).expect("Failed to create RecordBatch");
 ```

+## Working with `ArrayRef` and `RecordBatch`
+
+Most DataFusion APIs are in terms of [`ArrayRef`] and [`RecordBatch`]. To work with the
+underlying data, you typically downcast the [`ArrayRef`] to its concrete type
+(e.g., [`Int32Array`]).
+
+To do so, either use the `as_any().downcast_ref::<T>()` method or the
+typed `as_` helper methods (e.g., `as_primitive::<T>()`) from the [AsArray] trait.
+
+[asarray]: https://docs.rs/arrow-array/latest/arrow_array/cast/trait.AsArray.html
+
+```rust
+# use std::sync::Arc;
+# use arrow::datatypes::{DataType, Int32Type};
+# use arrow::array::{AsArray, ArrayRef, Int32Array, RecordBatch};
+# let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+// First check the data type of the array
+match arr.data_type() {
+    &DataType::Int32 => {
+        // Downcast to Int32Array
+        let int_array = arr.as_primitive::<Int32Type>();
+        // Now you can access Int32Array methods
+        for i in 0..int_array.len() {
+            println!("Value at index {}: {}", i, int_array.value(i));
+        }
+    }
+    _ => {
+        println!("Array is not of type Int32");
+    }
+}
+```
+
+The following two downcasting methods are equivalent:
+
+```rust
+# use std::sync::Arc;
+# use arrow::datatypes::{DataType, Int32Type};
+# use arrow::array::{AsArray, ArrayRef, Int32Array, RecordBatch};
+# let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+// Downcast to Int32Array using as_any
+let int_array1 = arr.as_any().downcast_ref::<Int32Array>().unwrap();
+// This is the same as using the as_primitive::<T>() helper
+let int_array2 = arr.as_primitive::<Int32Type>();
+assert_eq!(int_array1, int_array2);
+```
+
 ## Common Pitfalls

 When working with Arrow and RecordBatches, watch out for these common issues:
@@ -168,7 +219,6 @@ When working with Arrow and RecordBatches, watch out for these common issues:
 - [DataType](https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html) - Enum of all supported Arrow data types (e.g., Int32, Utf8)
 - [Schema](https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html) - Describes the structure of a RecordBatch (column names and types)

-
 [apache arrow]: https://arrow.apache.org/docs/index.html
 [`arc`]: https://doc.rust-lang.org/std/sync/struct.Arc.html
 [`arrayref`]: https://docs.rs/arrow-array/latest/arrow_array/array/type.ArrayRef.html
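The downcasting pattern added in this commit applies equally to columns pulled out of a `RecordBatch` with `column(i)`. A minimal sketch (the `name` column and its values are illustrative; `as_string::<i32>()` is the `AsArray` helper for `Utf8` arrays):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, RecordBatch, StringArray};

// Build a one-column batch; column(0) hands back a dynamically typed ArrayRef
let names: ArrayRef = Arc::new(StringArray::from(vec![Some("alice"), None]));
let batch = RecordBatch::try_from_iter(vec![("name", names)]).unwrap();

// Downcast the column to a concrete StringArray to read values
let col = batch.column(0).as_string::<i32>();
assert_eq!(col.value(0), "alice");
assert!(col.is_null(1));
```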
