Default parquet reader to reading 64K footer #4459

alamb · 2022-12-01T15:55:03Z

As of #4427 it is easier to see that the DataFusion parquet reader still defaults to reading the last 8 bytes of a parquet file (the 4-byte metadata length followed by the 4-byte magic) and then does a second read to fetch the footer.

Doing two IO operations is likely not ideal, especially for object storage, where an additional read costs far more than reading a bit more data in the first read.

The suggestion is to default to reading the last 64k of a parquet file to try to capture the entire footer in a single read.

Originally posted by @thinkharderdev in #3885 (comment)
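
To make the proposal concrete, here is a minimal sketch of the "prefetch the tail, fall back to a second read" idea. It is written in Python for brevity and is not DataFusion's actual implementation; `read_footer`, `fake_parquet`, and the `prefetch` parameter are hypothetical names used only for illustration. The only format detail it relies on is that a parquet file ends with a 4-byte little-endian metadata length followed by the `PAR1` magic.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
FOOTER_TAIL_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


def read_footer(f, file_size, prefetch=64 * 1024):
    """Return (metadata_bytes, number_of_reads) for a seekable binary file.

    Hypothetical helper: `prefetch` is how many trailing bytes to fetch
    in the first read.
    """
    if file_size < FOOTER_TAIL_LEN:
        raise ValueError("file too small to be parquet")
    # First (and hopefully only) read: the last `prefetch` bytes of the file.
    read_len = min(max(prefetch, FOOTER_TAIL_LEN), file_size)
    f.seek(file_size - read_len)
    tail = f.read(read_len)
    if tail[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic")
    (metadata_len,) = struct.unpack("<I", tail[-8:-4])
    needed = metadata_len + FOOTER_TAIL_LEN
    if needed > file_size:
        raise ValueError("corrupt footer length")
    if needed <= read_len:
        # The whole footer was captured by the first read.
        return tail[read_len - needed : read_len - FOOTER_TAIL_LEN], 1
    # Footer is larger than the prefetch: fall back to a second read.
    f.seek(file_size - needed)
    return f.read(metadata_len), 2


def fake_parquet(metadata):
    # Not a valid parquet file: only the trailing layout is reproduced.
    return b"PAR1" + b"x" * 100 + metadata + struct.pack("<I", len(metadata)) + b"PAR1"


small = fake_parquet(b"m" * 1000)
meta, reads = read_footer(io.BytesIO(small), len(small))
print(len(meta), reads)  # 1000 1
```

With a footer under the prefetch size the sketch needs one read; a larger footer costs the same two reads as today, so the larger default only adds a few extra bytes in the common case.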

alamb · 2022-12-01T15:57:30Z

Any thoughts, @tustvold or @Ted-Jiang?

Ted-Jiang · 2022-12-02T02:58:50Z

Makes sense to me. I think we should document, as a best practice, that users keep the footer size below 64k.
And I cannot find a tool to read the parquet footer size 😂

thinkharderdev · 2022-12-02T13:13:11Z

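#!/bin/bash
# Print the footer (metadata) length of the parquet file passed as $1.

# The 4 bytes before the trailing "PAR1" magic hold the length, little-endian.
le=$(xxd -p -s -8 -l 4 "$1")
# Reverse the byte order so the hex string can be parsed as a number.
be=${le:6:2}${le:4:2}${le:2:2}${le:0:2}
printf "Footer has size %s=%d\n" "$le" "$((16#$be))"

./parquet_size.sh file.parquet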

😄

alamb · 2022-12-02T13:41:51Z

> And i can not find a tool to read parquet footer size 😂

While the bash script is quite compelling from a dependencies point of view, I have been dreaming of contributing (though I haven't found the time) to @manojkarthick's https://github.com/manojkarthick/pqrs -- I think that with some more contributions that tool could become "the parquet-tools I actually want to use"

alamb mentioned this issue Dec 1, 2022

Expose remaining parquet config options into ConfigOptions (try 2) #4427

Merged
