Default parquet reader to reading 64K footer #4459

Open
alamb opened this issue Dec 1, 2022 · 4 comments

@alamb
Contributor

alamb commented Dec 1, 2022

As of #4427 it is easier to see that the DataFusion parquet reader still defaults to reading only the last 8 bytes of a parquet file (the 4-byte metadata length followed by the 4-byte PAR1 magic) and then does a second read to fetch the footer.

Doing two IO operations is likely not ideal, especially for object storage, where an additional read is very expensive relative to reading a bit more data in the first read.

The suggestion is to default to reading the last 64K of a parquet file, to try to capture the entire footer in a single read.
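
To make the trade-off concrete, here is a rough shell sketch of the single-read idea. It is not DataFusion's implementation: the function name footer_fits_in_prefetch, its optional second argument, and the use of tail and od are assumptions made for this example. It fetches the last 64K once, parses the trailer (4-byte little-endian metadata length plus the PAR1 magic) from that buffer, and reports whether a second read would still be needed.

```bash
#!/bin/bash
# Hypothetical helper (not part of DataFusion): reports whether a single
# tail read of PREFETCH bytes would cover the whole parquet footer,
# i.e. the metadata plus the 8-byte trailer.
footer_fits_in_prefetch() {
    local file=$1 prefetch=${2:-65536}
    local hex len_le len_be meta_len

    # The one "IO": fetch the last $prefetch bytes and hex-encode them.
    # -v stops od from collapsing repeated lines into "*".
    hex=$(tail -c "$prefetch" "$file" | od -An -v -tx1 | tr -d ' \n')

    # The trailer is the final 8 bytes: 4-byte little-endian length + "PAR1".
    if [ "${#hex}" -lt 16 ] || [ "${hex: -8}" != "50415231" ]; then
        echo "not a parquet file" >&2
        return 2
    fi

    # Reverse the byte order of the length so it parses as a hex number.
    len_le=${hex: -16:8}
    len_be=${len_le:6:2}${len_le:4:2}${len_le:2:2}${len_le:0:2}
    meta_len=$((16#$len_be))

    if [ $((meta_len + 8)) -le "$prefetch" ]; then
        echo "footer of $meta_len bytes fits: 1 read"
    else
        echo "footer of $meta_len bytes exceeds prefetch: 2 reads"
    fi
}
```

After sourcing the function, footer_fits_in_prefetch file.parquet checks against the proposed 64K default, and a second argument such as 8 mimics the current two-read behaviour.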

Originally posted by @thinkharderdev in #3885 (comment)

@alamb
Contributor Author

alamb commented Dec 1, 2022

Any thoughts, @tustvold or @Ted-Jiang?

@Ted-Jiang
Member

Ted-Jiang commented Dec 2, 2022

Makes sense to me. I think we should tell users that the best practice is to keep the footer size under 64K.
And I cannot find a tool to read the parquet footer size 😂

@thinkharderdev
Contributor

thinkharderdev commented Dec 2, 2022

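#!/bin/bash
# Print the size of a parquet file's footer metadata.

# Read the 4 bytes that start 8 bytes before the end of the file:
# the metadata length, stored little-endian, printed as hex.
le=$(xxd -p -s -8 -l 4 "$1")
# Reverse the byte order so the hex string parses as a number.
be=${le:6:2}${le:4:2}${le:2:2}${le:0:2}
printf 'Footer has size %s=%d\n' "$le" "$((16#$be))"

./parquet_size.sh file.parquet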

😄

@alamb
Contributor Author

alamb commented Dec 2, 2022

> And I cannot find a tool to read the parquet footer size 😂

While the bash script is quite compelling from a dependencies point of view, I have been dreaming of contributing (though I haven't found the time) to @manojkarthick's https://github.com/manojkarthick/pqrs. I think with some more contributions that tool could become "the parquet-tools I actually want to use".
