feat: improve read performance by 7x with prebuffer #1709

ion-elgreco · 2023-10-08T21:26:06Z

Description

Enable pre_buffer in pyarrow.dataset.ParquetFragmentScanOptions. A related PR in the Arrow repo made this the default behavior. Older versions of PyArrow do not have that default, however, so we need to set it to True explicitly.

It improves read speed by 6-7x on Azure in one dataset that I have.

Before:
1min 4s ± 3.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:
8.99 s ± 786 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related Issue(s)

Closes #1569

rtyler

There is a memory tradeoff, but I think upstream defaulting indicates it's well worth the tradeoff for a default behavior.

Going to approve, thanks for another solid improvement @ion-elgreco

Enable prebuffer

94b41b7

ion-elgreco requested review from wjones127, fvaleye and roeap as code owners October 8, 2023 21:26

github-actions bot added the binding/python Issues for the Python package label Oct 8, 2023

rtyler enabled auto-merge October 9, 2023 15:16

rtyler approved these changes Oct 9, 2023

rtyler merged commit ab6b0cf into delta-io:main Oct 9, 2023
