Skip to content

feat: add FFI support for user defined functions #1145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jul 2, 2025

Conversation

timsaucer
Copy link
Contributor

@timsaucer timsaucer commented Jun 9, 2025

Which issue does this PR close?

Closes #1017

Rationale for this change

Now that we have user defined scalar, aggregate, and window functions in the upstream datafusion 48.0.0, we can add support in datafusion-python. This allows for greater code reuse and extends the available options for Rust implementation of functions to be exposed to python.

What changes are included in this PR?

For all three types of functions, add pycapsule support in the datafusion-python libraries and add the supporting methods to initialize these as UDFs to register with the session context.

Integration tests are included.

Are there any user-facing changes?

This is only addition. All previous code is not impacted.

This was referenced Jun 13, 2025
@timsaucer timsaucer self-assigned this Jun 17, 2025
@timsaucer timsaucer marked this pull request as ready for review June 17, 2025 10:36
@timsaucer timsaucer changed the title Draft: feat: Add FFI support for user defined functions feat: add FFI support for user defined functions Jun 17, 2025
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

I checked the PyO3 docs to convince myself the unsafe blocks were good and everything checks out.

result = [r.column(0) for r in result]
expected = [
pa.array([3], type=pa.int64()),
pa.array([3], type=pa.int64()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completely unrelated to this PR, but why isn't this null instead of 3?

> select 1 + null;
+-----------------+
| Int64(1) + NULL |
+-----------------+
|                 |
+-----------------+
1 row(s) fetched.
Elapsed 0.004 seconds.

I'm more than open to the possibility that the group by has a subtly different behavior than normal addition, but it just caught me by surprise because I expected to see null there. Poking a bit, none of the other aggregates I thought of off the top of my head return null so I guess its at least consistent.

Anywho, just a curiosity while I was reviewing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just reusing the Sum aggregate function, not writing my own. And by default Sum will skip null values in an aggregation. Sum and addition are actually different in this way!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, absolutely! I was just wearing my review hat while reading the test so was predicting that it’d be null and then got surprised so I noted it.

And, “aha!” on sum skipping null values, so thanks for that! I just have the “math on nulls returns nulls” in my brain so knowing that behavior isn’t consistent is a TIL.

Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wat? That was supposed to be an approval.

+1 for real this time.

@davisp
Copy link
Member

davisp commented Jun 19, 2025

Ok, that's twice now so I assume that's a config thing on this repo.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @timsaucer and @davisp for the review

FYI @kosiew in case you have any time to review this code

Copy link
Contributor

@kosiew kosiew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the lock file pinned dependency, this looks good to me.

source = "registry+https://github.com/rust-lang/crates.io-index"
checksum = "ffe060b978f74ab446be722adb8a274e052e005bf6dfd171caadc3abaad10080"
version = "48.0.0"
source = "git+https://github.com/apache/datafusion?rev=33a32d4382bee7e3c705d0f55d05c24a115a2f98#33a32d4382bee7e3c705d0f55d05c24a115a2f98"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious why this lock file entry contains a revision commit point.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe regenerate the lock file to avoid curious questions like mine 😄 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call! Thank you!

@timsaucer timsaucer merged commit 98f4773 into apache:main Jul 2, 2025
25 of 26 checks passed
@timsaucer timsaucer deleted the feat/ffi-udfs branch July 2, 2025 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make it easier to use rust DataFusion UDFs in datafusion-python
4 participants