Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Substrait serializer clippy error: not calling truncate #14723

Merged
merged 9 commits into from
Feb 20, 2025

Conversation

niebayes
Copy link
Contributor

@niebayes niebayes commented Feb 17, 2025

Which issue does this PR close?

Rationale for this change

It's reasonable to always truncate an existing file before serializing a plan.
On the other hand, the following deserialize method always read the file to end which does not consider the case that the file is not truncated before serialization.

What changes are included in this PR?

Applied a clippy suggestion by adding a truncate(true).

Are these changes tested?

Tested by clippy

Are there any user-facing changes?

No

@github-actions github-actions bot added the substrait Changes to the substrait crate label Feb 17, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @niebayes -- I think this is an improvment, but it would be better if:

  1. Return an error of the file already exists
  2. Added doccomments comments to serialize

match std::fs::metadata(path.as_ref()) {
Ok(meta) if meta.len() > 0 => {
return Err(DataFusionError::Substrait(format!(
"Failed to encode substrait plan: the file {} already exists and is not empty",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could remove substrait here because it's already part of the error prefix

DataFusionError::Substrait(_) => "Substrait error: ",

Suggested change
"Failed to encode substrait plan: the file {} already exists and is not empty",
"Failed to encode plan: the file {} already exists and is not empty",

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a substrait is because there already had an error message "Failed to encode substrait plan" in the fuction serialize_bytes. I was following that error message to make error messages consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the word "substrait" from all error messages.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @niebayes and @mbrobbel -- looks great to me

I think @mbrobbel 's recent comments and suggestions are worth considering too as they would improve the code substantially, but we can also do it as a follow on PR too

niebayes and others added 2 commits February 19, 2025 00:05
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
@niebayes
Copy link
Contributor Author

@alamb Applied suggestions from @mbrobbel. Please check it.

niebayes and others added 2 commits February 19, 2025 00:26
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
Copy link
Contributor

@mbrobbel mbrobbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I added some follow-up suggestions.

@@ -56,8 +67,9 @@ pub async fn deserialize(path: &str) -> Result<Box<Plan>> {
deserialize_bytes(protobuf_in).await
}

/// Deserializes a plan from the bytes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up suggestion (breaking change): this function is marked async but it doesn't have to be.

Comment on lines +60 to +61
/// Reads the file at `path` and deserializes a plan from the bytes.
pub async fn deserialize(path: impl AsRef<Path>) -> Result<Box<Plan>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up suggestion (breaking change): this function is marked async but it doesn't have to be (see comment on deserialize_bytes).

Suggested change
/// Reads the file at `path` and deserializes a plan from the bytes.
pub async fn deserialize(path: impl AsRef<Path>) -> Result<Box<Plan>> {
/// Reads the file at `path` and deserializes a plan from the bytes.
pub fn deserialize(path: impl AsRef<Path>) -> Result<Box<Plan>> {

Copy link
Contributor Author

@niebayes niebayes Feb 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I plan to contribute more to the datafusion-substrait crate. Personally, I would like to reserve the breaking change in one of the following PR.

What do you think? @alamb

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree -- let's discuss API changes in follow on PRs / issues

Comment on lines 42 to 43
let mut file = OpenOptions::new().write(true).create_new(true).open(path)?;
file.write_all(&protobuf_out?)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up suggestion: use tokio::io::AsyncWriteExt::write_all instead of the blocking std::io::Write::write_all in this async function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replaced the sync OpenOptions with the async counterpart backed by tokio.

Comment on lines +47 to 48
/// Plans a sql and serializes the generated logical plan to bytes.
pub async fn serialize_bytes(sql: &str, ctx: &SessionContext) -> Result<Vec<u8>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we replace the sql arg with a DataFrame arg, this function (and serialize) can be non-async.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure which one is preferable for users.
@alamb cc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should leave this one as a str in this PR and we can potentially add a DataFrame specific one as a follow on

However, given there is already an API to serialize LogicalPlans directly I am not sure how much more value a DataFrame one would add

https://docs.rs/datafusion-substrait/latest/datafusion_substrait/#example-serializing-logicalplans

niebayes and others added 2 commits February 19, 2025 17:14
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @niebayes and @mbrobbel

@alamb alamb merged commit 2b39b84 into apache:main Feb 20, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Substrait serializer clippy error: not calling truncate
3 participants