Add a benchmark for large file scans versus smaller files #41

jonkeane · 2021-09-29T18:57:33Z

When scanning and operating on files, it would be nice to know whether it's more efficient to have one large file (Parquet, for example) per partition or many smaller files.

lidavidm · 2021-09-29T19:00:09Z

This would be good to have. For Parquet specifically, row group size may also be worth benchmarking. Smaller row groups can potentially give us more parallelism, but if you're reading only a few sparse columns out of many, and you're on something like S3, small row groups also mean lots of small reads, which is not an ideal I/O pattern.
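To make the row group trade-off concrete, here is a rough back-of-envelope sketch. It is not Arrow's actual I/O logic. It assumes one ranged read per column chunk (one chunk per selected column per row group), no coalescing of adjacent byte ranges, and it ignores the footer read. The function name and the row counts are hypothetical, chosen only for illustration.

```python
import math


def estimate_column_chunk_reads(num_rows, row_group_size, columns_read):
    """Estimate ranged reads needed to fetch some columns of one Parquet file.

    Assumption: each selected column costs one read per row group, and the
    reader does not merge adjacent ranges. Real readers may coalesce reads,
    so treat this as an upper-bound style estimate.
    """
    num_row_groups = math.ceil(num_rows / row_group_size)
    return num_row_groups * columns_read


# Hypothetical file: 10 million rows, reading 3 columns out of many.
NUM_ROWS = 10_000_000
COLUMNS_READ = 3

for row_group_size in (10_000, 100_000, 1_000_000):
    reads = estimate_column_chunk_reads(NUM_ROWS, row_group_size, COLUMNS_READ)
    print(f"row_group_size={row_group_size:>9,} -> ~{reads:,} reads")
```

Under these assumptions the read count grows in inverse proportion to the row group size, which is the pattern a benchmark on S3 would need to confirm or refute against real latencies.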
