Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: update datafusion to 42.0 and arrow to 53.2 #3176

Closed
wants to merge 17 commits into from

Conversation

jleibs
Copy link
Contributor

@jleibs jleibs commented Nov 25, 2024

Changes:

  • The way in which zero-copy Bytes -> Arrow::Buffer conversion are done was necessitated by arrow-53 dropping support for impl<T: AsRef<[u8]>> From<T> for Buffer (see: explanation)
  • Introduces new BufferExt to centralize these conversions and make behavior more explicit.

Copy link

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@jleibs jleibs changed the title Attempt to update arrow/datafusion chore: update datafusion to 42.0 and arrow to 53.2 Nov 26, 2024
@github-actions github-actions bot added the chore label Nov 26, 2024
@jleibs jleibs marked this pull request as draft November 26, 2024 00:04
@@ -32,7 +32,8 @@ snafu.workspace = true
tokio.workspace = true

[dev-dependencies]
substrait-expr = { version = "0.2.1" }
# TODO: This is too old
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe somebody needs to also update substrait-expr to use prost 0.13 to match the version used by datafusion-substrait

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess that's on me. How do you feel about bumping all the way to datafusion 43? That release should include a change I made which will let us drop substrait-expr entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Version 0.2.2 of substrait-expr is now available and will include prost 0.13

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do you feel about bumping all the way to datafusion 43

There were a couple more API surfaces that needed modifying so I was focusing on min-change possible, but I don't have a philosophiscal objection.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Let's keep it minimal for now. I'll need to do some more work to get rid of substrait-expr anyways.

@@ -32,7 +32,8 @@ snafu.workspace = true
tokio.workspace = true

[dev-dependencies]
substrait-expr = { version = "0.2.1" }
# TODO: This is too old
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guess that's on me. How do you feel about bumping all the way to datafusion 43? That release should include a change I made which will let us drop substrait-expr entirely.

@@ -224,7 +224,7 @@ impl<'a, T: ByteArrayType> BinaryDecoder<'a, T> {
.null_bit_buffer(null_buf);
}

let buf = bytes.into();
let buf = Buffer::from_vec(bytes.to_vec());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This triggers a data copy (and could be the source of your alignment issues). What changed in arrow-rs to necessitate this change? It should still be possible to go from Bytes to Buffer with zero-copy somehow.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@jleibs jleibs Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the comment above I believe:
Buffer::from_vec(bytes.into()) is the intended pathway, though I also tried bytes.to_byte_slice().into(), but both still result in the same error:

---- index::vector::ivf::tests::test_create_ivf_hnsw_with_empty_partition stdout ----
thread 'index::vector::ivf::tests::test_create_ivf_hnsw_with_empty_partition' panicked at /home/jleibs/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-buffer-53.3.0/src/buffer/scalar.rs:133:42:
Memory pointer is not aligned with the specified scalar type

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still going to trigger a copy it seems. I'll take a look real quick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's still going to trigger a copy it seems.

Is it?

My reads of both:

  • impl From<Bytes> for Vec<u8>
  • Buffer::from_vec()

Look like they should both be zero-copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I saw bytes.to_byte_slice().into() which I think is not zero-copy. You are right that Buffer::from_vec(bytes.into()) should be zero-copy.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, no Buffer::from_vec(bytes.into()) will not be zero copy because bytes.into() is not zero-copy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, nevermind, it looks like bytes does have a zero-copy conversion to Vec<u8>. My bad.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing your fix was probably ok here. The root cause was probably just the deep_copy_buffer change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha. Those new routines are much more clear. Thanks.

rust/lance/src/dataset/scanner.rs Show resolved Hide resolved
@jleibs jleibs force-pushed the jleibs/update_arrow_datafusion branch from e128098 to 08aa39b Compare November 26, 2024 17:18
@jleibs
Copy link
Contributor Author

jleibs commented Nov 26, 2024

With the changes to Buffer it looks like more tests are passing but now we get a failure in substrait:

thread 'substrait::tests::test_substrait_conversion' panicked at rust/lance-datafusion/src/substrait.rs:430:14:
called `Result::unwrap()` on an `Err` value: Schema { message: "Schema error: No field named c0. Valid fields are x.", location: Location { file: "rust/lance-datafusion/src/substrait.rs", line: 346, column: 9 } }

@jleibs jleibs force-pushed the jleibs/update_arrow_datafusion branch from 63fec42 to 8eb87ea Compare November 26, 2024 18:56
@jleibs
Copy link
Contributor Author

jleibs commented Nov 26, 2024

It looks like the substrait error can be fixed by:

diff --git a/rust/lance-datafusion/src/substrait.rs b/rust/lance-datafusion/src/substrait.rs
index 57cffb12..c7b05e23 100644
--- a/rust/lance-datafusion/src/substrait.rs
+++ b/rust/lance-datafusion/src/substrait.rs
@@ -423,7 +423,7 @@ mod tests {
         let expr = expr_builder.build();
         let expr_bytes = expr.encode_to_vec();
 
-        let schema = Arc::new(Schema::new(vec![Field::new("x", DataType::Int32, true)]));
+        let schema = Arc::new(Schema::new(vec![Field::new("c0", DataType::Int32, true)]));
 
         let df_expr = parse_substrait(expr_bytes.as_slice(), schema)
             .await
@@ -432,7 +432,7 @@ mod tests {
         let expected = Expr::BinaryExpr(BinaryExpr {
             left: Box::new(Expr::Column(Column {
                 relation: None,
-                name: "x".to_string(),
+                name: "c0".to_string(),
             })),
             op: Operator::Lt,
             right: Box::new(Expr::Literal(ScalarValue::Int32(Some(0)))),

but I'm not sure if this hints at a bug or just a behavior change

@jleibs
Copy link
Contributor Author

jleibs commented Nov 26, 2024

And the chain just keeps on going... looks like tfrecord depends on prost v12 which makes it incompatible with prost v13 as necessitated by datafusion-substrait...

jerry73204/rust-tfrecord#8

@westonpace
Copy link
Contributor

@jleibs this will work around the tfrecord while we wait for them to upgrade. It's a bit ugly but 🤷

westonpace@ddcf53c

@jleibs
Copy link
Contributor Author

jleibs commented Nov 26, 2024

@westonpace thanks! That's a helpful trick.

Looks like the wheel is building now. There are some new deprecation warnings to be cleaned up related to bumping pyo3, but those should be straightforward.

Any thoughts on the "x" vs "c0" business in substrait?

@westonpace
Copy link
Contributor

Any thoughts on the "x" vs "c0" business in substrait?

If the tests in test_integration.py are passing then we are good. If not, then I'll take a closer look tomorrow morning.

@westonpace
Copy link
Contributor

(we only use substrait for pushdown from duckdb / polars and that's what test_integration.py is checking)

@jleibs
Copy link
Contributor Author

jleibs commented Nov 26, 2024

Any thoughts on the "x" vs "c0" business in substrait?

If the tests in test_integration.py are passing then we are good. If not, then I'll take a closer look tomorrow morning.

No luck. Integration tests show the same problem:

       actual = duckdb.query("SELECT * FROM ds WHERE filterme = 2").fetch_arrow_table()
E       duckdb.duckdb.Error: ValueError: LanceError(Schema): Schema error: No field named c0. Valid fields are filterme, othercol., /home/jleibs/lance/rust/lance-datafusion/src/substrait.rs:346:9

@westonpace
Copy link
Contributor

Ok. I'll take a look tomorrow AM

@eddyxu
Copy link
Contributor

eddyxu commented Nov 28, 2024

@westonpace we can probably remove tfrecord dependency from rust

@westonpace
Copy link
Contributor

Sorry, lost track of this over Thanksgiving. I'll get it wrapped up today.

@westonpace
Copy link
Contributor

rerun-io#2 (let me know if you want me to just open a new PR from my branch so I can skip the intermediate PRs)

@westonpace
Copy link
Contributor

Closing as I've taken this over in #3201

@westonpace westonpace closed this Dec 4, 2024
@jleibs
Copy link
Contributor Author

jleibs commented Dec 5, 2024

@westonpace sorry I was out for an extended thanksgiving. Thanks for taking over.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants