Update documentation for creating User Defined Aggregates (AggregateUDF) #6729

alamb · 2023-06-20T11:42:52Z

Which issue does this PR close?

related to #6611

Rationale for this change

@stuartcarnie had some questions about how this API should work, so I wanted to encode the answers in the documentation for others as well.

What changes are included in this PR?

Update docs for AggregateUDF
Update docs for Accumulator
Rename AccumulatorFunctionImplementation to AccumulatorFactoryFunction to better describe what it does

Are these changes tested?

Yes (existing tests + doc tests)

Are there any user-facing changes?

Better docs
Different type alias name (AccumulatorFunctionImplementation to AccumulatorFactoryFunction)

alamb · 2023-06-20T12:05:43Z

datafusion/core/src/lib.rs

-//! uses [Apache Arrow] as its in-memory format. DataFusion's [use
-//! cases] include building very fast database and analytic systems,
-//! customized to particular workloads.
+//! uses [Apache Arrow] as its in-memory format. DataFusion's many [use


This was a drive-by cleanup I made while pretending to be a new user navigating to the AggregateUDF page.

alamb · 2023-06-20T12:06:24Z

datafusion/expr/src/accumulator.rs

-    /// other partial states from different instances of this
-    /// accumulator (that ran on different partitions, for
-    /// example).
+    /// Updates the accumulator's state from its input.


The trait is the same. I reordered the methods so they are better grouped by use, but the actual methods are unchanged.

alamb · 2023-06-20T12:06:51Z

datafusion/expr/src/function.rs

@@ -42,7 +42,7 @@ pub type ReturnTypeFunction =

 /// Factory that returns an accumulator for the given aggregate, given
 /// its return datatype.
-pub type AccumulatorFunctionImplementation =
+pub type AccumulatorFactoryFunction =


This was just misleading, so I changed the name


stuartcarnie

These are great improvements, thank you!

stuartcarnie · 2023-06-20T22:55:01Z

datafusion/expr/src/accumulator.rs

+    ///
+    /// # Example
+    ///
+    /// For example, given the following input partition
+    ///
+    /// ```text
+    ///                     │      current      │
+    ///                            window
+    ///                     │                   │
+    ///                ┌────┬────┬────┬────┬────┬────┬────┬────┬────┐
+    ///     Input      │ A  │ B  │ C  │ D  │ E  │ F  │ G  │ H  │ I  │
+    ///   partition    └────┴────┴────┴────┼────┴────┴────┴────┼────┘
+    ///
+    ///                                    │         next      │
+    ///                                             window
+    /// ```
+    ///
+    /// First, [`Self::evaluate`] will be called to produce the output
+    /// for the current window.
+    ///
+    /// Then, to advance to the next window:
+    ///
+    /// First, [`Self::retract_batch`] will be called with the values
+    /// that are leaving the window, `[B, C, D]` and then
+    /// [`Self::update_batch`] will be called with the values that are
+    /// entering the window, `[F, G, H]`.


Very clear explanation 💯
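
To make the call sequence in the doc comment above concrete, here is a minimal sketch of a sliding sum. It assumes a simplified stand-in trait (`SlidingAccumulator`, hypothetical) that uses plain `f64` slices rather than the `ArrayRef` and `ScalarValue` types of the real `Accumulator` trait, and a hypothetical driver function `slide` that plays the role of the window operator. It is not DataFusion's implementation.

```rust
/// Simplified, hypothetical stand-in for the window-related methods of
/// DataFusion's `Accumulator` trait (plain `f64` keeps it dependency-free).
trait SlidingAccumulator {
    /// Add values that are entering the window.
    fn update_batch(&mut self, values: &[f64]);
    /// Remove values that are leaving the window.
    fn retract_batch(&mut self, values: &[f64]);
    /// Produce the output for the current window.
    fn evaluate(&self) -> f64;
}

#[derive(Default)]
struct SlidingSum {
    sum: f64,
}

impl SlidingAccumulator for SlidingSum {
    fn update_batch(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
    }

    fn retract_batch(&mut self, values: &[f64]) {
        self.sum -= values.iter().sum::<f64>();
    }

    fn evaluate(&self) -> f64 {
        self.sum
    }
}

/// Hypothetical driver: evaluates `acc` over windows of `width` rows,
/// beginning at `start` and advancing `step` rows each time.
fn slide(
    acc: &mut impl SlidingAccumulator,
    input: &[f64],
    start: usize,
    width: usize,
    step: usize,
) -> Vec<f64> {
    let mut results = Vec::new();
    if width == 0 || step == 0 || start + width > input.len() {
        return results;
    }
    let mut lo = start;
    // Fill the first window, then evaluate it.
    acc.update_batch(&input[lo..lo + width]);
    results.push(acc.evaluate());
    while lo + step + width <= input.len() {
        // Retract the values leaving the window ...
        let leaving_end = (lo + step).min(lo + width);
        acc.retract_batch(&input[lo..leaving_end]);
        // ... then add the values entering it.
        let entering_start = (lo + width).max(lo + step);
        acc.update_batch(&input[entering_start..lo + step + width]);
        lo += step;
        results.push(acc.evaluate());
    }
    results
}
```

With `A..I` mapped to `1.0..9.0`, calling `slide` with `start = 1`, `width = 4`, `step = 3` evaluates `[B, C, D, E]`, retracts `[B, C, D]`, adds `[F, G, H]`, and evaluates `[E, F, G, H]`, matching the diagram.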

stuartcarnie · 2023-06-20T22:56:11Z

datafusion/expr/src/accumulator.rs

-    /// time (e.g. median)
+    /// Note that [`ScalarValue::List`] can be used to pass multiple
+    /// values if the number of intermediate values is not known at
+    /// planning time (e.g. for `MEDIAN`)
    fn state(&self) -> Result<Vec<ScalarValue>>;


This updated documentation is fantastic, thank you @alamb
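
As an illustration of the note about `ScalarValue::List`, the sketch below shows a median-style accumulator whose intermediate state is a single list of unknown length. The `Scalar` enum is a hypothetical, much-reduced stand-in for `ScalarValue`, and the method signatures are simplified (no `Result`, no `ArrayRef`), so treat it as a sketch of the idea rather than the real API.

```rust
/// Hypothetical, reduced stand-in for DataFusion's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Float64(f64),
    List(Vec<f64>),
}

/// Median needs every input value, so the amount of intermediate state
/// is not known at planning time.
#[derive(Default)]
struct MedianAccumulator {
    values: Vec<f64>,
}

impl MedianAccumulator {
    /// Consume raw input values for this partition.
    fn update_batch(&mut self, values: &[f64]) {
        self.values.extend_from_slice(values);
    }

    /// Export the partial state: one element, a list holding all values.
    fn state(&self) -> Vec<Scalar> {
        vec![Scalar::List(self.values.clone())]
    }

    /// Merge partial states produced by other instances.
    fn merge_batch(&mut self, states: &[Scalar]) {
        for state in states {
            match state {
                Scalar::List(values) => self.values.extend_from_slice(values),
                Scalar::Float64(value) => self.values.push(*value),
            }
        }
    }

    /// Final result, or `None` if no values were seen.
    fn evaluate(&self) -> Option<Scalar> {
        if self.values.is_empty() {
            return None;
        }
        let mut sorted = self.values.clone();
        sorted.sort_by(|a, b| a.total_cmp(b));
        let mid = sorted.len() / 2;
        let median = if sorted.len() % 2 == 1 {
            sorted[mid]
        } else {
            (sorted[mid - 1] + sorted[mid]) / 2.0
        };
        Some(Scalar::Float64(median))
    }
}
```

Each partition calls `update_batch` and then `state`; a final instance calls `merge_batch` on those states before `evaluate`. The state vector always has length one, regardless of how many values each partition saw.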

stuartcarnie · 2023-06-20T22:57:13Z

datafusion/expr/src/function.rs

@@ -42,7 +42,7 @@ pub type ReturnTypeFunction =

 /// Factory that returns an accumulator for the given aggregate, given
 /// its return datatype.
-pub type AccumulatorFunctionImplementation =
+pub type AccumulatorFactoryFunction =


Totally agree – this name is much clearer
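
For readers new to the alias, a rough sketch of why "factory" fits: the function does not implement the aggregate itself, it is called to create a fresh accumulator each time one is needed. The types below (`DataType`, `Accumulator`, `AccumulatorFactory`, `sum_factory`) are hypothetical, simplified stand-ins defined locally, not the definitions in `datafusion/expr`.

```rust
use std::sync::Arc;

/// Hypothetical, reduced stand-in for Arrow's `DataType`.
#[derive(Debug)]
enum DataType {
    Int64,
    Float64,
}

/// Hypothetical, reduced stand-in for the `Accumulator` trait.
trait Accumulator {
    fn update_batch(&mut self, values: &[f64]);
    fn evaluate(&self) -> f64;
}

/// Simplified analogue of `AccumulatorFactoryFunction`: given the
/// aggregate's return type, produce a new accumulator.
type AccumulatorFactory =
    Arc<dyn Fn(&DataType) -> Result<Box<dyn Accumulator>, String> + Send + Sync>;

struct Sum {
    sum: f64,
}

impl Accumulator for Sum {
    fn update_batch(&mut self, values: &[f64]) {
        self.sum += values.iter().sum::<f64>();
    }

    fn evaluate(&self) -> f64 {
        self.sum
    }
}

/// Builds a factory for a float sum; other return types are rejected.
fn sum_factory() -> AccumulatorFactory {
    Arc::new(|return_type: &DataType| match return_type {
        DataType::Float64 => Ok(Box::new(Sum { sum: 0.0 }) as Box<dyn Accumulator>),
        other => Err(format!("unsupported return type: {other:?}")),
    })
}
```

Every call to the factory yields an independent accumulator, so state is never shared between the instances it creates.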

Update documentation for creating User Defined Aggregates (AggregateUDF)

ce142db

alamb added the "api change" label (Changes the API exposed to users of the crate) on Jun 20, 2023

github-actions bot added the "core" (Core DataFusion crate) and "logical-expr" (Logical plan and expressions) labels on Jun 20, 2023

Fix other references

69c2ea6

github-actions bot added the "optimizer" label (Optimizer rules) on Jun 20, 2023

alamb self-assigned this Jun 20, 2023

stuartcarnie approved these changes Jun 20, 2023

alamb merged commit eb290a0 into apache:main Jun 22, 2023

alamb deleted the alamb/accumulator_docs branch June 22, 2023 14:29
