Faster parquet streaming + filters with predicate pushdown #7309

lhoestq · 2024-12-06T18:01:54Z

ParquetFileFragment.to_batches uses a buffered stream to read Parquet data, which makes streaming faster (about 2x on my laptop).

I also added the filters config parameter to support filtering with predicate pushdown, e.g.

from datasets import load_dataset

filters = [('problem_source', '==', 'math')]
ds = load_dataset("nvidia/OpenMathInstruct-2", streaming=True, filters=filters)
first_example = next(iter(ds["train"]))
print(first_example["problem_source"])
# 'math'

cc @allisonwang-db: this is a nice plus for usage in Spark

HuggingFaceDocBuilderDev · 2024-12-06T18:04:23Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq added 2 commits December 6, 2024 18:46

faster parquet streaming + add filters config param

013ee45

add test

98266a9

lhoestq merged commit 661d7ba into main Dec 7, 2024
15 checks passed

lhoestq deleted the faster-parquet-streaming-and-filters branch December 7, 2024 23:32
