Split out arrow-schema (#2594) #2711
Conversation
arrow-schema/Cargo.toml (outdated)
bench = false

[dependencies]
serde = { version = "1.0", default-features = false, features = ["derive"], optional = true }
This is somewhat annoying, but the orphan rule means serde needs to tag along for the ride
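To make the constraint concrete, here is a minimal sketch of the pattern (simplified, hypothetical types, not the actual arrow-schema definitions): the orphan rule forbids implementing a foreign trait such as `serde::Serialize` for a foreign type, so a downstream crate cannot add these impls itself, and the derives have to ship with the types behind an optional `serde` feature.

```rust
// Hypothetical, simplified sketch of arrow-schema's types. In a
// downstream crate, `impl serde::Serialize for arrow_schema::Schema`
// would fail with E0117 (only traits defined in the current crate can
// be implemented for types defined outside of the crate), so the
// derives are gated behind an optional `serde` feature here instead.

#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub nullable: bool,
}

#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub fields: Vec<Field>,
}
```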
@@ -179,8 +186,7 @@ impl Schema {

     /// Returns a vector with references to all fields (including nested fields)
     #[inline]
-    #[cfg(feature = "ipc")]
-    pub(crate) fn all_fields(&self) -> Vec<&Field> {
+    pub fn all_fields(&self) -> Vec<&Field> {
This needs to be made public so that the IPC reader can use it. I don't think this is a big problem
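For readers unfamiliar with the helper, a rough sketch of what it does (hypothetical simplified types; in the real crate a field's children live inside its `DataType`): it walks the top-level fields and recurses into nested children, which is the traversal the IPC code relies on.

```rust
// Simplified stand-ins for the real types, for illustration only.
pub struct Field {
    pub name: String,
    pub children: Vec<Field>,
}

pub struct Schema {
    pub fields: Vec<Field>,
}

impl Schema {
    /// Returns references to all fields, including nested ones,
    /// in depth-first order.
    pub fn all_fields(&self) -> Vec<&Field> {
        fn collect<'a>(field: &'a Field, out: &mut Vec<&'a Field>) {
            out.push(field);
            for child in &field.children {
                collect(child, out);
            }
        }
        let mut out = Vec::new();
        for field in &self.fields {
            collect(field, &mut out);
        }
        out
    }
}
```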
arrow/src/util/decimal.rs (outdated)
@@ -296,6 +293,791 @@ pub(crate) fn singed_cmp_le_bytes(left: &[u8], right: &[u8]) -> Ordering {
     Ordering::Equal
 }

+// MAX decimal256 value of little-endian format for each precision.
These were moved out of datatypes as they didn't really belong there, since they aren't related to the schema. I'm not entirely sure they belong here either, but they are pub(crate) so we can always move them later
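As context for what these constants enable, here is a toy version of the precision check they support (illustrative only: it uses `i128` and five precisions, whereas the actual code stores 32-byte little-endian constants for every decimal256 precision).

```rust
/// Toy table: the largest decimal value representable at each
/// precision 1..=5. The real code keeps equivalent constants for
/// decimal256 as little-endian byte arrays.
const MAX_DECIMAL_FOR_PRECISION: [i128; 5] = [9, 99, 999, 9_999, 99_999];

/// Returns true if `value` fits in a decimal of the given precision.
fn validate_decimal_precision(value: i128, precision: usize) -> bool {
    let max = MAX_DECIMAL_FOR_PRECISION[precision - 1];
    (-max..=max).contains(&value)
}

fn main() {
    assert!(validate_decimal_precision(123, 3));
    assert!(!validate_decimal_precision(1_234, 3));
}
```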
Working on trying to disentangle pyarrow 😭
        Self::from_pyarrow(value)
    }
}

/// A newtype wrapper around a `T: PyArrowConvert` that implements
This is a breaking change, but is necessary to get around the orphan rule
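For anyone unfamiliar with the trick, here is the same newtype pattern in miniature with std types (an analogy, not the actual pyarrow code): `PyArrowType<T>` plays the role of `Hex` below, turning an illegal foreign-trait-for-foreign-type impl into a legal one on a local wrapper.

```rust
use std::fmt;

// The orphan rule forbids `impl fmt::Display for Vec<u8>` here, since
// both the trait and the type are defined in other crates. Wrapping the
// foreign type in a local newtype makes the impl legal, at the cost of
// callers having to wrap and unwrap values.
pub struct Hex(pub Vec<u8>);

impl fmt::Display for Hex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for byte in &self.0 {
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}

fn main() {
    println!("{}", Hex(vec![0xde, 0xad])); // prints "dead"
}
```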
I think this is ready to go now. I've also created #2723, which would theoretically allow dropping the json flag from arrow-schema.
Marking as draft pending #2724
I like everything about this PR except for the change in Error types
Thanks for starting this conversation @tustvold
Force-pushed from b903441 to 4e1896c
@@ -28,9 +28,13 @@ use arrow::compute::kernels;
 use arrow::datatypes::{DataType, Field, Schema};
 use arrow::error::ArrowError;
 use arrow::ffi_stream::ArrowArrayStreamReader;
-use arrow::pyarrow::PyArrowConvert;
+use arrow::pyarrow::{PyArrowConvert, PyArrowException, PyArrowType};
The pyarrow bindings take a bit of a hit from this split, but I don't really see an obvious way around this, unless we push pyo3 into arrow-schema also. Thoughts?
Edit: This doesn't actually work, because the conversions for Schema require the FFI bindings, so I don't think there is a way around this
cc @kszucs / @andygrove
I don't use the python bindings so I don't understand the implications of this change
I will look into this. These are important in DataFusion/Ballista for executing Python UDFs. I will have time to review tomorrow.
I ran out of time today - the Ballista release work took longer than hoped. I will try and look at this over the weekend.
I think we need to wait until apache/datafusion#3483 is passing, and then we can create a PR in https://github.com/apache/arrow-datafusion-python/ to use that version and make sure the tests pass
apache/datafusion#3483 is now ready for review / merge.
Updates in apache/datafusion-python#54
@@ -75,7 +75,7 @@ default = ["csv", "ipc", "json"]
 ipc_compression = ["ipc", "zstd", "lz4"]
 csv = ["csv_crate"]
 ipc = ["flatbuffers"]
-json = ["serde", "serde_json"]
+json = ["serde_json"]
There is a subtle breaking change here: previously the json feature would enable the serde derives on Schema, etc., despite this not actually being part of the JSON API. This is now an explicit serde feature flag on arrow-schema.
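For downstream crates that relied on the old behaviour, opting back in might look something like this (a hypothetical Cargo.toml sketch; the version is a placeholder):

```toml
# Hypothetical downstream Cargo.toml: serde support for the schema
# types must now be requested explicitly rather than coming along
# with the `json` feature.
[dependencies]
arrow-schema = { version = "*", features = ["serde"] }
```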
This looks great to me -- thank you @tustvold
It would be nice if someone familiar with pyarrow / pyo3 could weigh in before we merge this, as I don't understand the implications of the changes to that interface
@@ -63,6 +63,8 @@ jobs:
         cargo run --example read_csv_infer_schema
       - name: Run non-archery based integration-tests
         run: cargo test -p arrow-integration-testing
+      - name: Test arrow-schema with all features
Should we do the same for arrow-buffer as added in #2693?
We can do, but it doesn't have any feature flags that explicitly need testing
Maybe we could add a comment for future readers like myself.
bench = false

[dependencies]
serde = { version = "1.0", default-features = false, features = ["derive", "std"], optional = true }
that is certainly a nice (very small) list of dependencies!
Force-pushed from bfba28e to 6f62bb6
I'm going to get this in; apache/datafusion-python#54 shows that the pyarrow changes don't necessitate major downstream changes. If there is further feedback I will be more than happy to address it in a follow-up PR, prior to the next arrow release.
Benchmark runs are scheduled for baseline = 74f639c and contender = 48cc8be. 48cc8be is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
* Update for changes in apache/arrow-rs#2711
* Clippy
* Use Vec conversion
* Update to DF 13 (#59)
* [DataFrame] - Add write_csv/write_parquet/write_json to DataFrame (#58)
* [SessionContext] - Add read_csv/read_parquet/read_avro functions to SessionContext (#57)

Co-authored-by: Francis Du <me@francis.run>

* remove patch from cargo toml
* add notes on git submodule for test data

Co-authored-by: Raphael Taylor-Davies <r.taylordavies@googlemail.com>
Co-authored-by: Francis Du <me@francis.run>
Which issue does this PR close?
Part of #2594
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?