feat: improve read performance by 7x with prebuffer #1709

Merged
merged 1 commit into from
Oct 9, 2023

Conversation

ion-elgreco
Collaborator

@ion-elgreco ion-elgreco commented Oct 8, 2023

Description

Enable prebuffer in `pyarrow.dataset.ParquetFragmentScanOptions`. Relevant PR in Arrow repo, where they changed it to be the default behavior. However, older versions of PyArrow do not have that default, so we need to set it to True explicitly.

It improves read speed by 6-7x on Azure in one dataset that I have.

Before:
1min 4s ± 3.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

After:
8.99 s ± 786 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Related Issue(s)

Closes #1569

@github-actions bot added the binding/python (Issues for the Python package) label Oct 8, 2023
@rtyler rtyler enabled auto-merge October 9, 2023 15:16
Member

@rtyler rtyler left a comment

There is a memory tradeoff, but I think upstream defaulting indicates it's well worth the tradeoff for a default behavior.

Going to approve, thanks for another solid improvement @ion-elgreco

@rtyler rtyler merged commit ab6b0cf into delta-io:main Oct 9, 2023