Read the footers in parallel when reading multiple Parquet files #17957

Merged: 6 commits into rapidsai:branch-25.04 from opt-parallel-metadata-ctors, Feb 24, 2025

Conversation

@vuule (Contributor) commented Feb 7, 2025

Description

Depends on #18018

When reading multiple files, all data (i.e., page) IO is performed in the same "batch", which allows parallel IO operations (provided by KvikIO). However, footers are read serially, leading to poor performance when reading many files. This is especially pronounced for IO that benefits from a high level of parallelism.

This PR performs footer reading and parsing asynchronously using an internal thread pool. The pool size can be controlled with the environment variable LIBCUDF_NUM_HOST_WORKERS.
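
For illustration only, here is a minimal sketch of the idea, not the code in this PR. It assumes each file is already in memory as a byte string and only extracts the footer length from the 8-byte Parquet trailer (a 4-byte little-endian length followed by the magic "PAR1"). The names num_host_workers, parse_footer_length and read_footer_lengths_parallel are hypothetical, as is the fallback worker count of 4. A fixed number of std::async tasks stands in for the internal thread pool.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <future>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: reads the worker count from the environment variable
// named in the description. The fallback value is an assumption.
inline std::size_t num_host_workers(std::size_t fallback = 4)
{
  char const* value = std::getenv("LIBCUDF_NUM_HOST_WORKERS");
  if (value == nullptr) { return fallback; }
  long const parsed = std::strtol(value, nullptr, 10);
  return parsed > 0 ? static_cast<std::size_t>(parsed) : fallback;
}

// A Parquet file ends with a 4-byte little-endian footer length followed by
// the magic bytes "PAR1". Returns that footer length.
inline std::uint32_t parse_footer_length(std::string const& file_bytes)
{
  auto const size = file_bytes.size();
  if (size < 8 || file_bytes.compare(size - 4, 4, "PAR1") != 0) {
    throw std::runtime_error("missing Parquet magic bytes");
  }
  auto const* p = reinterpret_cast<unsigned char const*>(file_bytes.data()) + size - 8;
  return static_cast<std::uint32_t>(p[0]) | (static_cast<std::uint32_t>(p[1]) << 8) |
         (static_cast<std::uint32_t>(p[2]) << 16) | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Parses all footers with a fixed number of worker tasks. Each worker claims
// the next unprocessed file index from a shared atomic counter.
inline std::vector<std::uint32_t> read_footer_lengths_parallel(
  std::vector<std::string> const& files, std::size_t num_workers)
{
  assert(num_workers > 0);
  std::vector<std::uint32_t> lengths(files.size());
  std::atomic<std::size_t> next{0};

  auto worker = [&] {
    for (std::size_t i = next.fetch_add(1); i < files.size(); i = next.fetch_add(1)) {
      lengths[i] = parse_footer_length(files[i]);  // each index is written by one worker only
    }
  };

  std::vector<std::future<void>> tasks;
  std::size_t const count = std::min(num_workers, files.size());
  for (std::size_t w = 0; w < count; ++w) {
    tasks.push_back(std::async(std::launch::async, worker));
  }
  for (auto& task : tasks) { task.get(); }  // rethrows any parsing error
  return lengths;
}
```

A caller would pass num_host_workers() as the second argument, so that setting LIBCUDF_NUM_HOST_WORKERS=8 before launching the process yields up to eight concurrent footer parses. With real file or remote IO, the per-file work is dominated by read latency, which is where this kind of overlap helps most.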

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot bot commented Feb 7, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions bot added the libcudf and CMake labels Feb 7, 2025
@vuule added the Performance, improvement, and non-breaking labels Feb 8, 2025
@mhaseeb123 self-requested a review February 8, 2025 01:01
@vuule marked this pull request as ready for review February 10, 2025 21:04
@vuule requested review from a team as code owners February 10, 2025 21:04
@vuule added the DO NOT MERGE label Feb 11, 2025
@github-actions bot removed the CMake label Feb 18, 2025
@vuule force-pushed the opt-parallel-metadata-ctors branch from 058482b to e321f66 February 24, 2025 20:46
@github-actions bot added the CMake label Feb 24, 2025

@bdice (Contributor) left a comment

Approving CMake.

@vuule added the 5 - Ready to Merge label Feb 24, 2025

@vuule (Contributor, Author) commented Feb 24, 2025

/merge

@rapids-bot bot merged commit 8c7eecf into rapidsai:branch-25.04 Feb 24, 2025
118 checks passed
@vuule deleted the opt-parallel-metadata-ctors branch February 24, 2025 23:59
Labels
  • 5 - Ready to Merge: Testing and reviews complete, ready to merge
  • CMake: CMake build issue
  • improvement: Improvement / enhancement to an existing function
  • libcudf: Affects libcudf (C++/CUDA) code.
  • non-breaking: Non-breaking change
  • Performance: Performance related issue
5 participants