support FFI query result streams that do not pre-collect #1011

Open
matko opened this issue Feb 3, 2025 · 1 comment
Labels
enhancement New feature or request

matko commented Feb 3, 2025

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
I'm trying to pass the result of a query into my Rust code. Some of my queries produce a lot of data, and I would like to process it in a streaming way, without first loading the entire query result into memory (where it might not even fit).

DataFrame has a method __arrow_c_stream__(), which can be used to cross the FFI boundary and get dataframe results into a native component. Unfortunately, it calls .collect() internally. This means I can't actually stream over the results while keeping the memory footprint low: the entire dataset has to fit in memory, and the rest of my processing logic has to wait for the collect to complete before it can start.

Describe the solution you'd like
I would like __arrow_c_stream__() or a similar function to produce a RecordBatchReader or even a RecordBatchStream (which also appears to be FFI-wrapped), which streams the query result without first collecting into memory.

Describe alternatives you've considered
The alternative is accepting that using results from Python in Rust will always require a collect on the Python side first. Given that the infrastructure to pass around readers and streams seems to be in place, this seems silly.

matko added the enhancement (New feature or request) label Feb 3, 2025
matko changed the title from "Make __arrow_c_stream__() not collect internally" to "support query result streams that do not pre-collect" Feb 3, 2025
matko changed the title from "support query result streams that do not pre-collect" to "support FFI query result streams that do not pre-collect" Feb 3, 2025
kylebarron (Contributor) commented Feb 5, 2025

The Arrow C Data Interface and the PyCapsule Interface (of which __arrow_c_stream__ is a part) are synchronous-only. You could in theory create a RecordBatchIterator in Rust that blocks on tokio under the hood in each next() call, but then every call blocks the calling thread until the next batch is ready.
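
To make the blocking-iterator idea concrete, here is a minimal sketch of the same pattern in Python rather than Rust, with asyncio standing in for tokio. Everything in it is hypothetical: `fake_batch_stream` and `BlockingBatchIterator` are illustrative names, and the "batches" are plain lists rather than Arrow record batches. It only shows the shape of a synchronous iterator that drives an async stream one item per `next()` call, not how DataFusion does or would implement it.

```python
import asyncio
from typing import AsyncIterator, List

_DONE = object()  # sentinel marking the end of the stream


async def fake_batch_stream(n_batches: int) -> AsyncIterator[List[int]]:
    """Hypothetical stand-in for an async record batch stream."""
    for i in range(n_batches):
        await asyncio.sleep(0)  # simulate waiting on the query engine
        yield [i * 10, i * 10 + 1]


class BlockingBatchIterator:
    """Synchronous iterator that blocks on an async stream in each __next__ call."""

    def __init__(self, stream: AsyncIterator[List[int]]) -> None:
        self._stream = stream
        self._loop = asyncio.new_event_loop()  # private loop, plays the role of the runtime
        self._finished = False

    def __iter__(self) -> "BlockingBatchIterator":
        return self

    async def _pull(self):
        # Fetch exactly one batch, or the sentinel once the stream is exhausted.
        try:
            return await self._stream.__anext__()
        except StopAsyncIteration:
            return _DONE

    def __next__(self) -> List[int]:
        if self._finished:
            raise StopIteration
        # Blocks the calling thread until the next batch is available.
        item = self._loop.run_until_complete(self._pull())
        if item is _DONE:
            self._finished = True
            self._loop.close()
            raise StopIteration
        return item


# Only one batch is pulled per step; nothing is collected up front.
batches = list(BlockingBatchIterator(fake_batch_stream(3)))
print(batches)
```

The consumer sees an ordinary iterator, which is what a synchronous stream interface needs, at the cost of the calling thread sitting idle while each batch is produced.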

Alternatively, you can use RecordBatchStream. It wraps a DataFusion RecordBatchStream, so it shouldn't pre-collect. However, it then can't be transferred via FFI to another library, because Arrow doesn't define an async FFI protocol. I added async iteration support to RecordBatchStream, which may help you.

All this means that you have to do the iteration in Python. You can iterate over the RecordBatchStream synchronously or asynchronously, and for each RecordBatch it produces you can use FFI to pass it to whatever library you're using.
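
As a rough sketch of that per-batch loop, the snippet below drives a stream from Python and hands each batch off as soon as it arrives, in both sync and async form. The streams and the `hand_off` callback are hypothetical stand-ins so the example runs without any dependencies. In real use the stream would presumably be the RecordBatchStream obtained from the DataFrame (I'd assume via something like `df.execute_stream()`, but check the API), and `hand_off` would pass the batch over FFI to the native library.

```python
import asyncio
from typing import Any, AsyncIterable, Callable, Iterable


def consume_sync(stream: Iterable[Any], hand_off: Callable[[Any], None]) -> int:
    """Pull batches one at a time and hand each one off; returns the batch count."""
    count = 0
    for batch in stream:
        hand_off(batch)  # in real use: pass this single batch across FFI
        count += 1
    return count


async def consume_async(stream: AsyncIterable[Any], hand_off: Callable[[Any], None]) -> int:
    """Same loop, using async iteration so the event loop stays free between batches."""
    count = 0
    async for batch in stream:
        hand_off(batch)
        count += 1
    return count


# Hypothetical stand-ins for a query result stream.
def fake_sync_stream(n_batches: int):
    for i in range(n_batches):
        yield {"batch": i, "rows": 1000}


async def fake_async_stream(n_batches: int):
    for i in range(n_batches):
        await asyncio.sleep(0)  # simulate waiting on the query engine
        yield {"batch": i, "rows": 1000}


received = []  # stands in for the native consumer
sync_count = consume_sync(fake_sync_stream(3), received.append)
async_count = asyncio.run(consume_async(fake_async_stream(2), received.append))
print(sync_count, async_count, len(received))
```

Either way, only the batch currently being handed off needs to be held by the loop, so the memory footprint is bounded by the batch size rather than the full result.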
