-
Notifications
You must be signed in to change notification settings - Fork 739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
stop panic in MetadataLoader
on invalid data
#6367
Conversation
@@ -71,7 +71,10 @@ impl<F: MetadataFetch> MetadataLoader<F> { | |||
let suffix_len = suffix.len(); | |||
|
|||
let mut footer = [0; 8]; | |||
footer.copy_from_slice(&suffix[suffix_len - 8..suffix_len]); | |||
let Some(footer_start) = suffix_len.checked_sub(8) else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks good to me initially, but I'm wondering about the logic behind it.
First of all, it should fail at file_size < 8
, right? We need to clarify this in our API documentation regarding the validation of file_size
.
Additionally, it's possible that users provide an incorrect file size. The result of fetch.fetch(footer_start..file_size).await?;
seems incorrect. We should specify that fetch
must handle this correctly, returning an error if the range of footer_start..file_size
fails to read. We can add a check here: if suffix_len != file_size - footer_start
, we can raise an error like EOF
.
What do you think? My motivation is that we should return the error at the correct place instead of here, where it is just a side effect.
I believe this error could be reproduced by an invalid file size or a malfunctioning fetch
that returns a smaller size than expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is also possible to hit this error if size_hint
is smaller than 8? A minimum value for size_hint
could make sense, or returning an error if it is too small.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ye I think @jhorstmann was right, the error almost certainly happened because the prefetch
value was invalid, not because fetch
returned the wrong size of slice.
I've updated to check prefetch
size and added a test.
5879611
to
0d82132
Compare
0d82132
to
d041b13
Compare
} else if let Some(size_hint) = prefetch { | ||
if size_hint < 8 { | ||
return Err(ParquetError::EOF(format!( | ||
"prefetch size of {size_hint} is less than footer size" | ||
))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that it's a "hint", I think it's ok to ignore an invalid size and instead set size_hint
to 8 if necessary (although I suppose the docs for fetch_parquet_metadata
would need to reflect this). I'd make a similar argument for a prefetch
value larger than file_size
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can just set it to None
in both cases if that's agreed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to not penalize a user for over specifying the prefetch. I think of it as a budget... you can use up to 2MB of prefetch. I can see angry AWS customers complaining that a metadata fetch that should have been a single GET is now 3 😅.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'm lost on what you would suggest. Feel free to take this, or let me know concretely what you'd suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking along the lines of
// If a size hint is provided, read more than the minimum size
// to try and avoid a second fetch.
let footer_start = if let Some(size_hint) = prefetch {
// check for hint smaller than footer
let size_hint = std::cmp::max(size_hint, FOOTER_SIZE);
// check for hint larger than the file
let size_hint = std::cmp::min(size_hint, file_size);
file_size.saturating_sub(size_hint)
} else {
file_size - FOOTER_SIZE
};
This guards against the error you're seeing, but allows providing a hint larger than the file without triggering extra I/O (as would happen if the hint were simply set to None
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, except we don't need the let size_hint = std::cmp::min(size_hint, file_size);
since we're using file_size.saturating_sub(size_hint)
directly afterwards.
Changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last nit. Love the reduction in magic numbers! Thanks @samuelcolvin.
@@ -48,12 +48,14 @@ pub struct MetadataLoader<F> { | |||
remainder: Option<(usize, Bytes)>, | |||
} | |||
|
|||
const FOOTER_SIZE: usize = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is defined elsewhere in the parquet crate.
use crate::file::FOOTER_SIZE;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Which issue does this PR close?
None
Rationale for this change
While messing with our
object_store
implementation I sent invalid data toMetadataEncoder
and got a panic:Looking at the code I'm pretty sure it's because
suffix_len
is <8 causing the subtract to fail.What changes are included in this PR?
return an error instead of panicing. I don't know how to reproduce this in a test, hence just the change. LMK if this requires a test?
Are there any user-facing changes?
no.