-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Substrait serializer clippy error: not calling truncate #14723
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @niebayes -- I think this is an improvment, but it would be better if:
- Return an error of the file already exists
- Added doccomments comments to
serialize
match std::fs::metadata(path.as_ref()) { | ||
Ok(meta) if meta.len() > 0 => { | ||
return Err(DataFusionError::Substrait(format!( | ||
"Failed to encode substrait plan: the file {} already exists and is not empty", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could remove substrait
here because it's already part of the error prefix
datafusion/datafusion/common/src/error.rs
Line 517 in 19fe44c
DataFusionError::Substrait(_) => "Substrait error: ", |
"Failed to encode substrait plan: the file {} already exists and is not empty", | |
"Failed to encode plan: the file {} already exists and is not empty", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding a substrait
is because there already had an error message "Failed to encode substrait plan" in the fuction serialize_bytes
. I was following that error message to make error messages consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the word "substrait" from all error messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I added some follow-up suggestions.
@@ -56,8 +67,9 @@ pub async fn deserialize(path: &str) -> Result<Box<Plan>> { | |||
deserialize_bytes(protobuf_in).await | |||
} | |||
|
|||
/// Deserializes a plan from the bytes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up suggestion (breaking change): this function is marked async
but it doesn't have to be.
/// Reads the file at `path` and deserializes a plan from the bytes. | ||
pub async fn deserialize(path: impl AsRef<Path>) -> Result<Box<Plan>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up suggestion (breaking change): this function is marked async
but it doesn't have to be (see comment on deserialize_bytes
).
/// Reads the file at `path` and deserializes a plan from the bytes. | |
pub async fn deserialize(path: impl AsRef<Path>) -> Result<Box<Plan>> { | |
/// Reads the file at `path` and deserializes a plan from the bytes. | |
pub fn deserialize(path: impl AsRef<Path>) -> Result<Box<Plan>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plan to contribute more to the datafusion-substrait crate. Personally, I would like to reserve the breaking change in one of the following PR.
What do you think? @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree -- let's discuss API changes in follow on PRs / issues
let mut file = OpenOptions::new().write(true).create_new(true).open(path)?; | ||
file.write_all(&protobuf_out?)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up suggestion: use tokio::io::AsyncWriteExt::write_all
instead of the blocking std::io::Write::write_all
in this async function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have replaced the sync OpenOptions
with the async counterpart backed by tokio.
/// Plans a sql and serializes the generated logical plan to bytes. | ||
pub async fn serialize_bytes(sql: &str, ctx: &SessionContext) -> Result<Vec<u8>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we replace the sql
arg with a DataFrame
arg, this function (and serialize
) can be non-async.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure which one is preferable for users.
@alamb cc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should leave this one as a str
in this PR and we can potentially add a DataFrame
specific one as a follow on
However, given there is already an API to serialize LogicalPlans
directly I am not sure how much more value a DataFrame one would add
https://docs.rs/datafusion-substrait/latest/datafusion_substrait/#example-serializing-logicalplans
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which issue does this PR close?
truncate
#9727Rationale for this change
It's reasonable to always truncate an existing file before serializing a plan.
On the other hand, the following
deserialize
method always read the file to end which does not consider the case that the file is not truncated before serialization.What changes are included in this PR?
Applied a clippy suggestion by adding a
truncate(true)
.Are these changes tested?
Tested by clippy
Are there any user-facing changes?
No