
Commit 4f6446f

Add downcasting
1 parent d80d48c · commit 4f6446f

1 file changed (+60 -10)

docs/source/user-guide/arrow-introduction.md

Lines changed: 60 additions & 10 deletions
@@ -17,7 +17,7 @@
   under the License.
 -->

-# Introduction to Apache Arrow
+# Gentle Arrow Introduction

 ```{contents}
 :local:
@@ -46,7 +46,7 @@ Traditional Row Storage:           Arrow Columnar Storage:
 (read entire rows)                 (process entire columns at once)
 ```

-## `RecordBatch`
+## `RecordBatch`

 Arrow's standard unit for packaging data is the **[`RecordBatch`]**.

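As a minimal sketch of that packaging (the column name and values here are illustrative, using arrow-rs's `RecordBatch::try_from_iter`, which infers the schema from the arrays):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, Int32Array, RecordBatch};

// A RecordBatch is just equal-length columns plus a shared schema
let id_col: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
let batch = RecordBatch::try_from_iter(vec![("id", id_col)]).unwrap();

assert_eq!(batch.num_rows(), 3);
assert_eq!(batch.num_columns(), 1);
```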
@@ -65,10 +65,16 @@ This design allows DataFusion to process streams of row-based chunks while gaini

 DataFusion processes queries as pull-based pipelines where operators request batches from their inputs. This streaming approach enables early result production, bounds memory usage (spilling to disk only when necessary), and naturally supports parallel execution across multiple CPU cores.

+For example, given the following query:
+
+```sql
+SELECT name FROM 'data.parquet' WHERE id > 10
+```
+
+The DataFusion Pipeline looks like this:
+
 ```text
-A user's query: SELECT name FROM 'data.parquet' WHERE id > 10
-
-The DataFusion Pipeline:
 ┌─────────────┐    ┌──────────────┐    ┌────────────────┐    ┌──────────────────┐    ┌──────────┐
 │   Parquet   │───▶│     Scan     │───▶│     Filter     │───▶│    Projection    │───▶│ Results  │
 │    File     │    │   Operator   │    │    Operator    │    │     Operator     │    │          │
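To make the pull-based loop concrete, here is a hypothetical sketch of driving such a pipeline from user code. It assumes the `datafusion`, `tokio`, and `futures` crates, and registers `data.parquet` under the illustrative table name `data`; `execute_stream` yields one `RecordBatch` at a time:

```rust
use datafusion::error::Result;
use datafusion::prelude::*;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // Register the Parquet file so SQL can refer to it by name
    ctx.register_parquet("data", "data.parquet", ParquetReadOptions::default())
        .await?;
    let df = ctx.sql("SELECT name FROM data WHERE id > 10").await?;

    // Pull RecordBatches one at a time rather than collecting them all
    let mut stream = df.execute_stream().await?;
    while let Some(batch) = stream.next().await {
        println!("received {} rows", batch?.num_rows());
    }
    Ok(())
}
```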
@@ -81,7 +87,7 @@ In this pipeline, [`RecordBatch`]es are the "packages" of columnar data that flo

 ## Creating `ArrayRef` and `RecordBatch`es

-Sometimes you need to create Arrow data programmatically rather than reading from files. 
+Sometimes you need to create Arrow data programmatically rather than reading from files.

 The first thing needed is creating an Arrow Array for each column. [arrow-rs] provides array builders and `From` impls to create arrays from Rust vectors.

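The `From` impls appear in the hunk below; the builder path looks roughly like this (a minimal sketch, assuming arrow-rs's `Int32Builder` and `StringBuilder`):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Builder, StringBuilder};

// Builders grow an array value-by-value, including null slots
let mut ids = Int32Builder::new();
ids.append_value(1);
ids.append_value(2);

let mut names = StringBuilder::new();
names.append_value("alice");
names.append_null(); // NULL in the second slot

let id_col: ArrayRef = Arc::new(ids.finish());
let name_col: ArrayRef = Arc::new(names.finish());
assert_eq!(id_col.len(), name_col.len());
```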
@@ -126,21 +132,66 @@ use arrow_schema::{DataType, Field, Schema};
 // Create the columns as Arrow arrays
 let ids = Int32Array::from(vec![1, 2, 3]);
 let names = StringArray::from(vec![Some("alice"), None, Some("carol")]);
-
-// Create the schema
+// Create the schema
 let schema = Arc::new(Schema::new(vec![
     Field::new("id", DataType::Int32, false), // false means non-nullable
     Field::new("name", DataType::Utf8, true), // true means nullable
 ]));
-
 // Assemble the columns
 let cols: Vec<ArrayRef> = vec![
-    Arc::new(ids),
+    Arc::new(ids),
     Arc::new(names)
 ];
+// Finally, create the RecordBatch
 RecordBatch::try_new(schema, cols).expect("Failed to create RecordBatch");
 ```

+## Working with `ArrayRef` and `RecordBatch`
+
+Most DataFusion APIs are in terms of [`ArrayRef`] and [`RecordBatch`]. To work with the
+underlying data, you typically downcast the [`ArrayRef`] to its concrete type
+(e.g., [`Int32Array`]).
+
+To do so, either use the `as_any().downcast_ref::<T>()` method or the
+typed `as_` helper methods (e.g., `as_primitive::<T>()`) from the [AsArray] trait.
+
+[asarray]: https://docs.rs/arrow-array/latest/arrow_array/cast/trait.AsArray.html
+
+```rust
+# use std::sync::Arc;
+# use arrow::datatypes::{DataType, Int32Type};
+# use arrow::array::{AsArray, ArrayRef, Int32Array, RecordBatch};
+# let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+// First check the data type of the array
+match arr.data_type() {
+    &DataType::Int32 => {
+        // Downcast to Int32Array
+        let int_array = arr.as_primitive::<Int32Type>();
+        // Now you can access Int32Array methods
+        for i in 0..int_array.len() {
+            println!("Value at index {}: {}", i, int_array.value(i));
+        }
+    }
+    _ => {
+        println!("Array is not of type Int32");
+    }
+}
+```
+
+The following two downcasting methods are equivalent:
+
+```rust
+# use std::sync::Arc;
+# use arrow::datatypes::{DataType, Int32Type};
+# use arrow::array::{AsArray, ArrayRef, Int32Array, RecordBatch};
+# let arr: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
+// Downcast to Int32Array using as_any
+let int_array1 = arr.as_any().downcast_ref::<Int32Array>().unwrap();
+// This is the same as using the as_primitive::<T>() helper
+let int_array2 = arr.as_primitive::<Int32Type>();
+assert_eq!(int_array1, int_array2);
+```
+
 ## Common Pitfalls

 When working with Arrow and RecordBatches, watch out for these common issues:
@@ -168,7 +219,6 @@ When working with Arrow and RecordBatches, watch out for these common issues:
 - [DataType](https://docs.rs/arrow-schema/latest/arrow_schema/enum.DataType.html) - Enum of all supported Arrow data types (e.g., Int32, Utf8)
 - [Schema](https://docs.rs/arrow-schema/latest/arrow_schema/struct.Schema.html) - Describes the structure of a RecordBatch (column names and types)

-
 [apache arrow]: https://arrow.apache.org/docs/index.html
 [`arc`]: https://doc.rust-lang.org/std/sync/struct.Arc.html
 [`arrayref`]: https://docs.rs/arrow-array/latest/arrow_array/array/type.ArrayRef.html
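The downcasting pattern added in this commit applies equally to columns pulled out of a `RecordBatch` with `column(i)`. A minimal sketch (the `name` column and its values are illustrative; `as_string::<i32>()` is the `AsArray` helper for `Utf8` arrays):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, AsArray, RecordBatch, StringArray};

// Build a one-column batch; column(0) hands back a dynamically typed ArrayRef
let names: ArrayRef = Arc::new(StringArray::from(vec![Some("alice"), None]));
let batch = RecordBatch::try_from_iter(vec![("name", names)]).unwrap();

// Downcast the column to a concrete StringArray to read values
let col = batch.column(0).as_string::<i32>();
assert_eq!(col.value(0), "alice");
assert!(col.is_null(1));
```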
