This repository has been archived by the owner on Jul 25, 2022. It is now read-only.

Draft PyArrow Dataset reader impl #21

Closed

@wjones127 wjones127 commented Jan 30, 2022

Work in progress. Working toward being able to stream record batches from a PyArrow dataset.

Fixes #10.

@wjones127
Author

FYI I am going to set this aside for now, since I think this really needs the Arrow C Stream Interface to be reliable. Right now it just wraps a Python iterator and holds onto the GIL while waiting for the PyArrow scanner to generate each batch. I'm running into some GIL deadlocks, so it would be nice to eliminate the GIL stuff from record batch streaming.
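For readers unfamiliar with the pattern, here is a minimal, stdlib-only sketch of what "wraps a Python iterator" means for batch streaming. It is not the code from this PR: `fake_scanner` and `BatchStream` are hypothetical names, and plain lists stand in for record batches. It only illustrates the pull-based shape, where the consumer asks for one batch at a time and waits while the producer builds it.

```python
from typing import Iterator, List, Optional

# Stand-in for a record batch; a real reader would yield Arrow batches.
Batch = List[int]


def fake_scanner(rows: List[int], batch_size: int) -> Iterator[Batch]:
    """Hypothetical producer: lazily yield fixed-size chunks of rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


class BatchStream:
    """Hypothetical pull-based wrapper around a Python iterator of batches."""

    def __init__(self, batches: Iterator[Batch]) -> None:
        self._batches = batches

    def next_batch(self) -> Optional[Batch]:
        # Each pull blocks until the producer has built the next batch.
        # For a native (non-Python) consumer, this call is the point where
        # the GIL would have to be held for the whole wait.
        return next(self._batches, None)


stream = BatchStream(fake_scanner(list(range(5)), batch_size=2))
collected = []
while (batch := stream.next_batch()) is not None:
    collected.append(batch)
```

Because every `next_batch()` call runs Python code, a consumer outside the interpreter cannot make progress without taking the GIL, which is the coupling the comment above wants to remove.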

@kylebrooks-8451

I believe I have a working solution for this that I developed for the company I work for. I will get a PR out there soon. Is there still a need for this?

@wjones127
Author

> I believe I have a working solution for this that I developed for the company I work for. I will get a PR out there soon. Is there still a need for this?

This was mostly an experimental curiosity, but a PR would be cool if you are willing :)

I probably won't get around to finishing this for a while.

@kylebrooks-8451

@wjones127 I've created a PR, #59

I couldn't add you as a reviewer after I made the PR but I'd love to have your feedback on it.

@wjones127 wjones127 closed this Jul 20, 2022
Successfully merging this pull request may close these issues.

Support reading from PyArrow datasets