Default parquet reader to reading 64K footer #4459
Any thoughts, @tustvold or @Ted-Jiang?
Makes sense to me. I think we should also advise users, as a best practice, to keep the footer size below 64K.
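#!/bin/bash
# Print the footer (metadata) length of the parquet file given as $1.
# The file ends with a 4-byte little-endian length followed by the magic "PAR1".
le=$(xxd -p -s -8 -l 4 "$1")
# Reverse the byte order to get big-endian hex for the arithmetic expansion.
be=${le:6:2}${le:4:2}${le:2:2}${le:0:2}
printf 'Footer has size %s=%d\n' "$le" "$((16#$be))"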
😄
While the bash script is quite compelling from a dependencies point of view, I have been dreaming (though haven't found time) of contributing to @manojkarthick's https://github.com/manojkarthick/pqrs -- I think with some more contributions that tool could become "the parquet-tools I actually want to use".
As of #4427 it is easier to see that the DataFusion parquet reader still defaults to reading only the last 8 bytes of a parquet file (the 4-byte metadata length followed by the PAR1 magic) and then does a second read to fetch the footer.
Doing two IO operations is likely not ideal, especially for object storage, where an additional read is very expensive relative to reading a bit more data in the first read.
The suggestion is to default to reading the last 64K of a parquet file, to try to capture the entire footer in a single read.
Originally posted by @thinkharderdev in #3885 (comment)
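For illustration, the single-read idea can be sketched in shell against a local file. This is not DataFusion's implementation: the function names footer_len and reads_needed are hypothetical, a local file stands in for the object store, and the 65536 default only mirrors the 64K suggested above. It uses od rather than xxd so that it runs where xxd is not installed.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not part of any real tool.

# Decode the 4-byte little-endian footer length that precedes the trailing
# "PAR1" magic. $1 is a file holding (at least) the last 8 bytes of a
# parquet file.
footer_len() {
  # od prints the four bytes as unsigned decimals; set -- splits them
  # into $1..$4 (least significant byte first).
  set -- $(tail -c 8 "$1" | head -c 4 | od -An -tu1)
  echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Print how many reads are needed to obtain the whole footer when the first
# read speculatively fetches the last $2 bytes (assumed default: 65536).
reads_needed() {
  local file=$1 prefetch=${2:-65536} tail_file len
  tail_file=$(mktemp)
  tail -c "$prefetch" "$file" > "$tail_file"   # the one speculative read
  len=$(footer_len "$tail_file")
  rm -f "$tail_file"
  # The metadata plus the 8-byte trailer must fit inside the prefetched tail;
  # otherwise a second read is required to fetch the rest of the footer.
  if [ $((len + 8)) -le "$prefetch" ]; then echo 1; else echo 2; fi
}
```

Calling reads_needed some_file.parquet prints 1 when the footer fits in the prefetched tail and 2 when it does not, which is the case the 64K best-practice comment above is about. The sketch assumes the prefetch size is at least 8 bytes.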