-
Notifications
You must be signed in to change notification settings - Fork 236
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
chore: update datafusion to 42.0 and arrow to 53.2 #3176
Conversation
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
rust/lance-datafusion/Cargo.toml
Outdated
@@ -32,7 +32,8 @@ snafu.workspace = true | |||
tokio.workspace = true | |||
|
|||
[dev-dependencies] | |||
substrait-expr = { version = "0.2.1" } | |||
# TODO: This is too old |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe somebody needs to also update substrait-expr to use prost 0.13 to match the version used by datafusion-substrait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess that's on me. How do you feel about bumping all the way to datafusion 43? That release should include a change I made which will let us drop substrait-expr entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Version 0.2.2 of substrait-expr is now available and will include prost 0.13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about bumping all the way to datafusion 43
There were a couple more API surfaces that needed modifying so I was focusing on min-change possible, but I don't have a philosophiscal objection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Let's keep it minimal for now. I'll need to do some more work to get rid of substrait-expr anyways.
rust/lance-datafusion/Cargo.toml
Outdated
@@ -32,7 +32,8 @@ snafu.workspace = true | |||
tokio.workspace = true | |||
|
|||
[dev-dependencies] | |||
substrait-expr = { version = "0.2.1" } | |||
# TODO: This is too old |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess that's on me. How do you feel about bumping all the way to datafusion 43? That release should include a change I made which will let us drop substrait-expr entirely.
@@ -224,7 +224,7 @@ impl<'a, T: ByteArrayType> BinaryDecoder<'a, T> { | |||
.null_bit_buffer(null_buf); | |||
} | |||
|
|||
let buf = bytes.into(); | |||
let buf = Buffer::from_vec(bytes.to_vec()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This triggers a data copy (and could be the source of your alignment issues). What changed in arrow-rs to necessitate this change? It should still be possible to go from Bytes
to Buffer
with zero-copy somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this was the change:
Will try that approach instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the comment above I believe:
Buffer::from_vec(bytes.into())
is the intended pathway, though I also tried bytes.to_byte_slice().into()
, but both still result in the same error:
---- index::vector::ivf::tests::test_create_ivf_hnsw_with_empty_partition stdout ----
thread 'index::vector::ivf::tests::test_create_ivf_hnsw_with_empty_partition' panicked at /home/jleibs/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-buffer-53.3.0/src/buffer/scalar.rs:133:42:
Memory pointer is not aligned with the specified scalar type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's still going to trigger a copy it seems. I'll take a look real quick.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's still going to trigger a copy it seems.
Is it?
My reads of both:
impl From<Bytes> for Vec<u8>
Buffer::from_vec()
Look like they should both be zero-copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I saw bytes.to_byte_slice().into()
which I think is not zero-copy. You are right that Buffer::from_vec(bytes.into())
should be zero-copy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, no Buffer::from_vec(bytes.into())
will not be zero copy because bytes.into()
is not zero-copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, nevermind, it looks like bytes
does have a zero-copy conversion to Vec<u8>
. My bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm guessing your fix was probably ok here. The root cause was probably just the deep_copy_buffer
change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha. Those new routines are much more clear. Thanks.
e128098
to
08aa39b
Compare
With the changes to Buffer it looks like more tests are passing but now we get a failure in substrait:
|
63fec42
to
8eb87ea
Compare
It looks like the substrait error can be fixed by:
but I'm not sure if this hints at a bug or just a behavior change |
And the chain just keeps on going... looks like |
@jleibs this will work around the tfrecord while we wait for them to upgrade. It's a bit ugly but 🤷 |
…into jleibs/update_arrow_datafusion
@westonpace thanks! That's a helpful trick. Looks like the wheel is building now. There are some new deprecation warnings to be cleaned up related to bumping pyo3, but those should be straightforward. Any thoughts on the "x" vs "c0" business in substrait? |
If the tests in |
(we only use substrait for pushdown from duckdb / polars and that's what test_integration.py is checking) |
No luck. Integration tests show the same problem:
|
Ok. I'll take a look tomorrow AM |
@westonpace we can probably remove tfrecord dependency from rust |
Sorry, lost track of this over Thanksgiving. I'll get it wrapped up today. |
rerun-io#2 (let me know if you want me to just open a new PR from my branch so I can skip the intermediate PRs) |
Closing as I've taken this over in #3201 |
@westonpace sorry I was out for an extended thanksgiving. Thanks for taking over. |
Changes:
impl<T: AsRef<[u8]>> From<T> for Buffer
(see: explanation)BufferExt
to centralize these conversions and make behavior more explicit.