-
Notifications
You must be signed in to change notification settings - Fork 12
Implement PyArrow Dataset TableProvider #59
base: main
Are you sure you want to change the base?
Implement PyArrow Dataset TableProvider #59
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great. You write really clean PyO3 code and I learned a lot going through this PR :)
I'm requesting changes, but the only thing I think you need to do is add the test cases I suggested. Everything else is a suggestion or FYI.
I'm not a contributor on this repo so can't approve the CI or merge, but happy to review again.
type Item = ArrowResult<RecordBatch>; | ||
|
||
fn next(&mut self) -> Option<Self::Item> { | ||
Python::with_gil(|py| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This requires an upgrade in the version of datafusion, but FYI we have a zero-copy conversion from RecordBatchReaders
into arrow-rs ArrowArrayStreamReader
. Requires arrow-rs 17.0 and higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very cool! I will plan on using that once Datafusion is upgraded in this project. At the time I authored this code for our internal project this did not exist.
fragments: fragments.into(), | ||
columns, | ||
filter_expr, | ||
projected_statistics: Default::default(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thing I'm not sure of is whether we should be doing anything to fill in these statistics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure on this either. I think supporting this is optional and we can add statistics collection on as a future improvement.
// Note that pyarrow.compute.{field,scalar} are put into Python globals() when evaluated | ||
// isin, is_null, and is_valid (~is_null) are methods of pyarrow.dataset.Expression | ||
// https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html#pyarrow-dataset-expression | ||
fn try_from(expr: &Expr) -> Result<Self, Self::Error> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note this will be limited to the max recursion depth of Rust. You might want to instead implement an ExpressionVisitor
. https://docs.rs/datafusion/10.0.0/datafusion/logical_plan/trait.ExpressionVisitor.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed that ExpressionVisitor would be a better approach due to no recursion. I noticed that your PR does this. I will give implementing it a shot for this PR.
} | ||
} | ||
|
||
impl TryFrom<&Expr> for PyArrowFilterExpression { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: I think the long-term solution for this will be going through Substrait https://issues.apache.org/jira/browse/ARROW-16844
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to replace this code with Substait once it's ready.
Also FYI we are moving this repo, so they might ask to re-open this PR at the new location soon. apache/datafusion-python#5 |
cc @andygrove |
Thanks @kylebrooks-8451 this is looking very cool. As @wjones127 mentioned we are just in the process of moving development to https://github.com/apache/arrow-datafusion-python so would you mind opening the PR there? |
Thanks @andygrove! This PR has been moved to apache/datafusion-python#9. We can close this PR if you wish. I'm unable to make sure this works with DataFusion 10.0.0 because of this error:
|
This implements a PyArrow Dataset TableProvider that allows for using Datasets as tables in Datafusion.
Fixes #10