
Commit 82a7788

Jefffrey authored and comphead committed

docs: Update HOWTOs for adding new functions (apache#18089)
## Which issue does this PR close?

- Closes apache#12220

## Rationale for this change

Updating documentation on adding new functions: the aggregate instructions were outdated, and instructions for other function types (window, table) were missing.

## What changes are included in this PR?

Updated instructions for adding new functions to DataFusion, plus some other touchups on the docs.

## Are these changes tested?

Doc changes only.

## Are there any user-facing changes?

Doc changes only.

---------

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
1 parent 72ce59e commit 82a7788

File tree

2 files changed: +101 −77 lines changed


docs/source/contributor-guide/howtos.md

Lines changed: 97 additions & 73 deletions
@@ -21,60 +21,86 @@
 
 ## How to update the version of Rust used in CI tests
 
-- Make a PR to update the [rust-toolchain] file in the root of the repository:
+Make a PR to update the [rust-toolchain] file in the root of the repository.
 
 [rust-toolchain]: https://github.com/apache/datafusion/blob/main/rust-toolchain.toml
 
-## How to add a new scalar function
-
-Below is a checklist of what you need to do to add a new scalar function to DataFusion:
-
-- Add the actual implementation of the function to a new module file within:
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions-nested) for arrays, maps and structs functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/crypto) for crypto functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/datetime) for datetime functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/encoding) for encoding functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/math) for math functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/regex) for regex functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/string) for string functions
-  - [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/unicode) for unicode functions
-  - create a new module [here](https://github.com/apache/datafusion/tree/main/datafusion/functions/src/) for other functions.
-  - New function modules - for example a `vector` module, should use a [rust feature](https://doc.rust-lang.org/cargo/reference/features.html) (for example `vector_expressions`) to allow DataFusion
-    users to enable or disable the new module as desired.
-- The implementation of the function is done via implementing `ScalarUDFImpl` trait for the function struct.
-  - See the [advanced_udf.rs] example for an example implementation
-  - Add tests for the new function
-- To connect the implementation of the function add to the mod.rs file:
-  - a `mod xyz;` where xyz is the new module file
-  - a call to `make_udf_function!(..);`
-  - an item in `export_functions!(..);`
-- In [sqllogictest/test_files], add new `sqllogictest` integration tests where the function is called through SQL against well known data and returns the expected result.
-  - Documentation for `sqllogictest` [here](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md)
-- Add SQL reference documentation [here](https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md)
-  - An example of this being done can be seen [here](https://github.com/apache/datafusion/pull/12775)
-  - Run `./dev/update_function_docs.sh` to update docs
-
-[advanced_udf.rs]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs
-[datafusion/expr/src]: https://github.com/apache/datafusion/tree/main/datafusion/expr/src
-[sqllogictest/test_files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files
-
-## How to add a new aggregate function
-
-Below is a checklist of what you need to do to add a new aggregate function to DataFusion:
-
-- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
-- In [datafusion/expr/src], add:
-  - a new variant to `AggregateFunction`
-  - a new entry to `FromStr` with the name of the function as called by SQL
-  - a new line in `return_type` with the expected return type of the function, given an incoming type
-  - a new line in `signature` with the signature of the function (number and types of its arguments)
-  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
-  - tests to the function.
-- In [sqllogictest/test_files], add new `sqllogictest` integration tests where the function is called through SQL against well known data and returns the expected result.
-  - Documentation for `sqllogictest` [here](https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md)
-- Add SQL reference documentation [here](https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md)
-  - An example of this being done can be seen [here](https://github.com/apache/datafusion/pull/12775)
-  - Run `./dev/update_function_docs.sh` to update docs
+## Adding new functions
+
+**Implementation**
+
+| Function type | Location to implement     | Trait to implement                             | Macros to use                                    | Example              |
+| ------------- | ------------------------- | ---------------------------------------------- | ------------------------------------------------ | -------------------- |
+| Scalar        | [functions][df-functions] | [`ScalarUDFImpl`]                              | `make_udf_function!()` and `export_functions!()` | [`advanced_udf.rs`]  |
+| Nested        | [functions-nested]        | [`ScalarUDFImpl`]                              | `make_udf_expr_and_func!()`                      |                      |
+| Aggregate     | [functions-aggregate]     | [`AggregateUDFImpl`] and an [`Accumulator`]    | `make_udaf_expr_and_func!()`                     | [`advanced_udaf.rs`] |
+| Window        | [functions-window]        | [`WindowUDFImpl`] and a [`PartitionEvaluator`] | `define_udwf_and_expr!()`                        | [`advanced_udwf.rs`] |
+| Table         | [functions-table]         | [`TableFunctionImpl`] and a [`TableProvider`]  | `create_udtf_function!()`                        | [`simple_udtf.rs`]   |
+
+- The macros simplify boilerplate, such as ensuring a DataFrame API compatible function is also created
+- Ensure new functions are properly exported through the subproject's `mod.rs` or `lib.rs`
+- Functions should preferably provide documentation via the `#[user_doc(...)]` attribute so their documentation
+  can be included in the SQL reference documentation (see the Documentation section below)
+- Scalar functions are further grouped into modules for families of functions (e.g. string, math, datetime).
+  Functions should be added to the relevant module; if a new module needs to be created, a new [Rust feature]
+  should also be added so DataFusion users can conditionally compile the module as needed
+- Aggregate functions can optionally implement a [`GroupsAccumulator`] for better performance
+
+Spark compatible functions are [located in a separate crate][df-spark] but otherwise follow the same steps, though all
+function types (e.g. scalar, nested, aggregate) are grouped together in a single location.
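As an illustrative aside, the traits above ultimately wrap pure compute kernels over Arrow arrays. A minimal std-only sketch of such a kernel for a hypothetical scalar `add_one` function (the name is an assumption, and `Option` stands in for Arrow's validity bitmap; a real implementation would put this logic behind `ScalarUDFImpl`):

```rust
// Hypothetical kernel for a scalar `add_one` function: the pure computation
// that a ScalarUDFImpl implementation would wrap. Nullable inputs are modeled
// with Option<i64>, mirroring Arrow's validity bitmap; nulls propagate.
fn add_one_kernel(values: &[Option<i64>]) -> Vec<Option<i64>> {
    values.iter().map(|v| v.map(|x| x + 1)).collect()
}

fn main() {
    let input = vec![Some(1), None, Some(41)];
    let output = add_one_kernel(&input);
    // Null stays null; non-null values are incremented.
    assert_eq!(output, vec![Some(2), None, Some(42)]);
}
```

In the real crate layout, this kernel would live in the relevant module and be registered via the macros listed in the table above.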
+
+[df-functions]: https://github.com/apache/datafusion/tree/main/datafusion/functions
+[functions-nested]: https://github.com/apache/datafusion/tree/main/datafusion/functions-nested
+[functions-aggregate]: https://github.com/apache/datafusion/tree/main/datafusion/functions-aggregate
+[functions-window]: https://github.com/apache/datafusion/tree/main/datafusion/functions-window
+[functions-table]: https://github.com/apache/datafusion/tree/main/datafusion/functions-table
+[df-spark]: https://github.com/apache/datafusion/tree/main/datafusion/spark
+[`scalarudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
+[`aggregateudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html
+[`accumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html
+[`groupsaccumulator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.GroupsAccumulator.html
+[`windowudfimpl`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.WindowUDFImpl.html
+[`partitionevaluator`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.PartitionEvaluator.html
+[`tablefunctionimpl`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableFunctionImpl.html
+[`tableprovider`]: https://docs.rs/datafusion/latest/datafusion/catalog/trait.TableProvider.html
+[`advanced_udf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udf.rs
+[`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs
+[`advanced_udwf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udwf.rs
+[`simple_udtf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udtf.rs
+[rust feature]: https://doc.rust-lang.org/cargo/reference/features.html
+
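The aggregate path follows an update/merge/evaluate state machine. A std-only sketch of the shape that DataFusion's `Accumulator` trait formalizes (a simplified `sum`; the real trait operates on Arrow arrays and `ScalarValue` state, and `GroupsAccumulator` vectorizes the same idea across groups):

```rust
// Conceptual sketch of an aggregate accumulator (hypothetical, std-only).
// The real Accumulator trait has the same update/merge/evaluate lifecycle.
struct SumAccumulator {
    sum: i64,
}

impl SumAccumulator {
    fn new() -> Self {
        Self { sum: 0 }
    }
    // update_batch in the real trait receives Arrow arrays; plain slices here.
    fn update_batch(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
    }
    // merge combines partial states produced by parallel partitions.
    fn merge(&mut self, other: &SumAccumulator) {
        self.sum += other.sum;
    }
    // evaluate produces the final aggregate value.
    fn evaluate(&self) -> i64 {
        self.sum
    }
}

fn main() {
    let mut a = SumAccumulator::new();
    a.update_batch(&[1, 2, 3]);
    let mut b = SumAccumulator::new();
    b.update_batch(&[4, 5]);
    a.merge(&b);
    assert_eq!(a.evaluate(), 15);
}
```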
+**Testing**
+
+Prefer adding `sqllogictest` integration tests where the function is called via SQL against
+well-known data and returns an expected result. Check the existing [test files][slt-test-files] for
+an appropriate file to add test cases to; otherwise create a new file. See the
+[`sqllogictest` documentation][slt-readme] for details on how to construct these tests.
+Ensure edge cases and `null` inputs are covered in these tests.
+
+If a behaviour cannot be tested via `sqllogictest` (e.g. testing `simplify()`, behaviour that must
+be tested in isolation from the optimizer, or input that is difficult to construct via `sqllogictest`),
+tests can be added as Rust unit tests in the implementation module, though these should be
+kept minimal where possible.
+
+[slt-test-files]: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files
+[slt-readme]: https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/README.md
+
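For example, a hypothetical `sqllogictest` case for an assumed scalar `add_one` function might look like this (illustrative only; the function name and values are assumptions, not part of the existing test files):

```
# add_one increments integers and propagates NULL
query II
SELECT add_one(41), add_one(NULL);
----
42 NULL
```

The `query II` directive declares two integer result columns, and the lines after `----` give the expected output.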
+**Documentation**
+
+Run the documentation update script `./dev/update_function_docs.sh`, which updates the relevant
+markdown documents [here][fn-doc-home] (see the documents for [scalar][fn-doc-scalar],
+[aggregate][fn-doc-aggregate] and [window][fn-doc-window] functions)
+
+- You _should not_ manually update the markdown documents after running the script, as those manual
+  changes would be overwritten on the next execution
+- See the [GitHub issue] which introduced this behaviour
+
+[fn-doc-home]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql
+[fn-doc-scalar]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/scalar_functions.md
+[fn-doc-aggregate]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/aggregate_functions.md
+[fn-doc-window]: https://github.com/apache/datafusion/blob/main/docs/source/user-guide/sql/window_functions.md
+[github issue]: https://github.com/apache/datafusion/issues/12740
 
 ## How to display plans graphically
 
@@ -97,11 +123,13 @@ can be displayed. For example, the following command creates a
 dot -Tpdf < /tmp/plan.dot > /tmp/plan.pdf
 ```
 
-## How to format `.md` document
+## How to format `.md` documents
 
-We are using `prettier` to format `.md` files.
+We use [`prettier`] to format `.md` files.
 
-You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary. Using `npx` required a working node environment. Upgrading to the latest prettier is recommended (by adding `--upgrade` to the `npm` command).
+You can either use `npm i -g prettier` to install it globally or use `npx` to run it as a standalone binary.
+Using `npx` requires a working node environment. Upgrading to the latest prettier is recommended (by adding
+`--upgrade` to the `npm` command).
 
 ```bash
 $ prettier --version
@@ -114,19 +142,19 @@ After you've confirmed your prettier version, you can format all the `.md` files
 prettier -w {datafusion,datafusion-cli,datafusion-examples,dev,docs}/**/*.md
 ```
 
+[`prettier`]: https://prettier.io/
+
 ## How to format `.toml` files
 
-We use `taplo` to format `.toml` files.
+We use [`taplo`] to format `.toml` files.
 
-For Rust developers, you can install it via:
+To install it via cargo:
 
 ```sh
 cargo install taplo-cli --locked
 ```
 
-> Refer to the [Installation section][doc] on other ways to install it.
->
-> [doc]: https://taplo.tamasfe.dev/cli/installation/binary.html
+> Refer to the [taplo installation documentation][taplo-install] for other ways to install it.
 
 ```bash
 $ taplo --version
@@ -139,28 +167,24 @@ After you've confirmed your `taplo` version, you can format all the `.toml` file
 taplo fmt
 ```
 
+[`taplo`]: https://taplo.tamasfe.dev/
+[taplo-install]: https://taplo.tamasfe.dev/cli/installation/binary.html
+
 ## How to update protobuf/gen dependencies
 
-The prost/tonic code can be generated by running `./regen.sh`, which in turn invokes the Rust binary located in `./gen`
+For the `proto` and `proto-common` crates, the prost/tonic code is generated by running their respective `./regen.sh` scripts,
+which in turn invoke the Rust binary located in `./gen`.
 
 This is necessary after modifying the protobuf definitions or altering the dependencies of `./gen`, and requires a
 valid installation of [protoc] (see [installation instructions] for details).
 
 ```bash
-./regen.sh
+# From repository root
+# proto-common
+./datafusion/proto-common/regen.sh
+# proto
+./datafusion/proto/regen.sh
 ```
 
 [protoc]: https://github.com/protocolbuffers/protobuf#protocol-compiler-installation
 [installation instructions]: https://datafusion.apache.org/contributor-guide/getting_started.html#protoc-installation
-
-## How to add/edit documentation for UDFs
-
-Documentations for the UDF documentations are generated from code (related [github issue]). To generate markdown run `./update_function_docs.sh`.
-
-This is necessary after adding new UDF implementation or modifying existing implementation which requires to update documentation.
-
-```bash
-./dev/update_function_docs.sh
-```
-
-[github issue]: https://github.com/apache/datafusion/issues/12740

docs/source/library-user-guide/functions/adding-udfs.md

Lines changed: 4 additions & 4 deletions
@@ -354,7 +354,7 @@ async fn main() {
 }
 ```
 
-## Adding a Async Scalar UDF
+## Adding an Async Scalar UDF
 
 An Async Scalar UDF allows you to implement user-defined functions that support
 asynchronous execution, such as performing network or I/O operations within the
@@ -1257,7 +1257,7 @@ async fn main() -> Result<()> {
 [`create_udaf`]: https://docs.rs/datafusion/latest/datafusion/logical_expr/fn.create_udaf.html
 [`advanced_udaf.rs`]: https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_udaf.rs
 
-## Adding a User-Defined Table Function
+## Adding a Table UDF
 
 A User-Defined Table Function (UDTF) is a function that takes parameters and returns a `TableProvider`.
 
@@ -1266,8 +1266,8 @@ This is a simple struct that holds a set of RecordBatches in memory and treats t
 be replaced with your own struct that implements `TableProvider`.
 
 While this is a simple example for illustrative purposes, UDTFs have a lot of potential use cases. And can be
-particularly useful for reading data from external sources and interactive analysis. For example, see the [example][4]
-for a working example that reads from a CSV file. As another example, you could use the built-in UDTF `parquet_metadata`
+particularly useful for reading data from external sources and interactive analysis. See the [working example][simple_udtf.rs]
+which reads from a CSV file. As another example, you could use the built-in UDTF `parquet_metadata`
 in the CLI to read the metadata from a Parquet file.
 
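The UDTF idea above can be sketched std-only: a table function maps call-time arguments to a table provider. All names and types here are hypothetical stand-ins; in DataFusion the equivalent is a `TableFunctionImpl::call` returning an `Arc<dyn TableProvider>`:

```rust
// Conceptual sketch (std-only, hypothetical types): a table function takes
// arguments and returns an in-memory "table", mirroring how a real UDTF's
// call() builds a TableProvider from its SQL arguments.
struct InMemoryTable {
    rows: Vec<Vec<i64>>,
}

// A real UDTF would parse a file path or other literals from its arguments;
// here we fabricate rows to show the argument -> provider mapping.
fn make_table(limit: usize) -> InMemoryTable {
    InMemoryTable {
        rows: (0..limit as i64).map(|i| vec![i, i * i]).collect(),
    }
}

fn main() {
    let t = make_table(3);
    assert_eq!(t.rows.len(), 3);
    assert_eq!(t.rows[2], vec![2, 4]);
}
```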
 ```console
