-
Notifications
You must be signed in to change notification settings - Fork 13
Add MoveArrowToDataChunk to allow support using arrow in table UDF #47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
872b075 to
6b01067
Compare
wmTJc9IK0Q
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests TODO, but opening for feedback in conjunction with the linked duckdb-go PR: duckdb/duckdb-go#75
darwin-amd64/bindings_arrow.go
Outdated
| // The returned DataChunk must be destroyed with DestroyDataChunk. | ||
| // The returned ErrorData must be checked for errors and destroyed with DestroyErrorData. | ||
| func DataChunkFromArrow(conn Connection, rec arrow.RecordBatch, schema ArrowConvertedSchema) (DataChunk, ErrorData) { | ||
| func NewDataChunkFromArrow(conn Connection, rec arrow.RecordBatch) (DataChunk, ErrorData) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was renamed for clarity
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the new name is more clear but unfortunately we need to stay consistent with the function names of the C API header we're wrapping (in case we ever want to generate this package). Also, the duckdb-go-bindings should be as-thin-as-possible of a wrapper of the C API functions - we're now changing the function signature by no longer passing the schema. 🤔
Maybe we can leave the current functions as they are + add one helper functions (MoveArrowToDataChunk)? Maybe two, if we really want to omit the schema, too.
And do we need the helpers on this package's level? I.e., they can't happen higher up, say in duckdb-go?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@taniabogatsch Ok, that design goal makes sense. I think in that case then the right path forward would be to just change the existing DataChunkFromArrow to not create a new data chunk and instead take one to be passed in as an arg (and add back the schema arg). I will move forward with this change.
I think we still want this in this package to avoid all of the C.* and alloc calls in the duckdb-go package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sounds good. 👍
darwin-amd64/bindings_arrow.go
Outdated
| // MoveArrowToDataChunk moves an Arrow RecordBatch into the provided DuckDB DataChunk using the provided Connection and ArrowConvertedSchema. | ||
| // The returned DataChunk must be destroyed with DestroyDataChunk. | ||
| // The returned ErrorData must be checked for errors and destroyed with DestroyErrorData. | ||
| func MoveArrowToDataChunk(conn Connection, rec arrow.RecordBatch, chunk DataChunk) ErrorData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why doesn't this also implicitly set the chunk size? Maybe this should happen on the duckdb side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure - it could be an oversight when exposing the new functionality? If so, then it should be patched on the DuckDB side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the Arrow-to-DataChunk conversion API to support table UDFs that need to populate existing DataChunks with Arrow data. It introduces MoveArrowToDataChunk as a new function and renames the existing DataChunkFromArrow to NewDataChunkFromArrow.
Key Changes
- Introduced
MoveArrowToDataChunkfunction that moves Arrow RecordBatch data into an existing DataChunk - Renamed
DataChunkFromArrowtoNewDataChunkFromArrowto better reflect its purpose - Refactored the implementation to separate chunk creation from data population
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 18 comments.
Show a summary per file
| File | Description |
|---|---|
| bindings_arrow.go | Adds MoveArrowToDataChunk function and refactors NewDataChunkFromArrow to use it |
| bindings_arrow_test.go | Updates test to call renamed function NewDataChunkFromArrow |
| darwin-amd64/bindings_arrow.go | Platform-specific variant with identical changes |
| darwin-amd64/bindings_arrow_test.go | Platform-specific test update |
| darwin-arm64/bindings_arrow.go | Platform-specific variant with identical changes |
| darwin-arm64/bindings_arrow_test.go | Platform-specific test update |
| linux-amd64/bindings_arrow.go | Platform-specific variant with identical changes |
| linux-amd64/bindings_arrow_test.go | Platform-specific test update |
| linux-arm64/bindings_arrow.go | Platform-specific variant with identical changes |
| linux-arm64/bindings_arrow_test.go | Platform-specific test update |
| windows-amd64/bindings_arrow.go | Platform-specific variant with identical changes |
| windows-amd64/bindings_arrow_test.go | Platform-specific test update |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Copilot lost it's mind and posted the same 2 comments 12 times. |
|
Thanks! Good point! |
62c2043 to
f36d684
Compare
|
For me it would be better to have here the two DuckDB functions: (New)DataChunkFromArrow (with chunk creation) and FillDataChunkFromArrow. @taniabogatsch In any case it should be your decision. |
|
@VGSML I think we can have these both in the duckdb-go layer and only have one here. |
|
This isn't working right yet with the duckdb-go PR |
taniabogatsch
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a small nit and then this is good to go from my side. 👍
|
Looks good from my side - could you just check the failing tests? :) |
This would allow a table UDF to use an arrow to directly fill a data chunk in the FillChunk callback.