[ntuple] Improve RPagePool #16859

Merged: 13 commits, Nov 15, 2024

jblomer
Contributor

@jblomer jblomer commented Nov 7, 2024

Improve the lookup complexity for pages in the page pool from linear to constant in well-behaved cases, i.e. if there is a small number of pages per column and cluster. Some smaller cleanups around the RPage/RPagePool logic.

github-actions bot commented Nov 7, 2024

Test Results

    18 files      18 suites   4d 4h 10m 19s ⏱️
 2 678 tests  2 678 ✅ 0 💤 0 ❌
46 342 runs  46 342 ✅ 0 💤 0 ❌

Results for commit 15b33c0.

♻️ This comment has been updated with latest results.

Member

@hahnjo hahnjo left a comment

This is great! Some stylistic comments inline.

It would be interesting to run the limits test, especially Limits_ManyFields, Limits_ManyPages, and Limits_ManyPagesOneEntry. It's possible that this PR addresses the quadratic complexity seen in there.

tree/ntuple/v7/src/RPagePool.cxx (three inline comments, outdated and resolved)
jblomer and others added 13 commits November 14, 2024 22:55
Instead of mapping all synthesized zero pages to the same memory buffer,
use real allocated and zeroed-out pages. That makes sure no special
logic is required when adding and removing pages to and from the page
pool.

Allows for O(1) page lookup when a page is returned to the page pool,
instead of the O(n) linear search.

Use a hash map to filter the pages in the page pool by column ID and
on-disk type on access.
Co-authored-by: Jonas Hahnfeld <hahnjo@hahnjo.de>
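
As a rough illustration of the hash-map filtering described in the commit messages above, here is a minimal sketch. It is not the actual RPagePool code: `PageKey`, `PageKeyHash`, `Page`, and `ToyPagePool` are hypothetical names, and the on-disk type is assumed to be an enum-like integer. The map lookup by (column ID, on-disk type) is constant time on average, and the remaining linear scan only touches the pages of that one column, which stays cheap when there are few pages per column and cluster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

// Hypothetical key: a column ID plus an (assumed integer) on-disk type.
struct PageKey {
   std::uint64_t columnId;
   std::uint32_t onDiskType;
   bool operator==(const PageKey &other) const
   {
      return columnId == other.columnId && onDiskType == other.onDiskType;
   }
};

// Combines the two member hashes so that PageKey can be used in unordered_map.
struct PageKeyHash {
   std::size_t operator()(const PageKey &key) const noexcept
   {
      std::size_t h = std::hash<std::uint64_t>{}(key.columnId);
      return h ^ (std::hash<std::uint32_t>{}(key.onDiskType) + 0x9e3779b9 + (h << 6) + (h >> 2));
   }
};

// Hypothetical page descriptor: a half-open range of element indices.
struct Page {
   std::uint64_t firstElement;
   std::uint64_t nElements;
   bool Contains(std::uint64_t idx) const { return idx >= firstElement && idx < firstElement + nElements; }
};

class ToyPagePool {
   // One small page list per (column, type) instead of one global list.
   std::unordered_map<PageKey, std::vector<Page>, PageKeyHash> fPages;

public:
   void Register(const PageKey &key, const Page &page)
   {
      assert(page.nElements > 0);
      fPages[key].push_back(page);
   }

   // Returns the page that holds element idx, or nullptr if none is registered.
   // The pointer is only valid until the next Register() call for the same key.
   const Page *Find(const PageKey &key, std::uint64_t idx) const
   {
      auto it = fPages.find(key); // average O(1)
      if (it == fPages.end())
         return nullptr;
      for (const auto &page : it->second) { // linear only in the pages of this column
         if (page.Contains(idx))
            return &page;
      }
      return nullptr;
   }
};
```

With many fields, each key maps to a short list, so the scan inside `Find` is short even when the pool holds a very large number of pages overall.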
Contributor Author

jblomer commented Nov 14, 2024

No changes to the "many pages" test. The "many fields" unit test got significantly faster. The overall complexity is still super-linear but much more benign.

Member

hahnjo commented Nov 15, 2024

> No changes to the "many pages" test. The "many fields" unit test got significantly faster. The overall complexity is still super-linear but much more benign.

Ah right, now I remember that I had already profiled this before: The "many pages" tests are actually bound by RPageRange::Find, which has a TODO to use binary search. The case we are speeding up here is many pages distributed over many fields, in which case performance was bound by the page pool.

Edit: Hm, the "many pages" tests will also hit the linear loop over the page set in RPagePool::GetPage... For a future PR though.
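
For reference, the binary search mentioned in the RPageRange::Find TODO could look roughly like the sketch below. This is not the ROOT implementation: `PageInfo` and `FindPage` are hypothetical names, and the sketch assumes the page list is sorted by first element index with non-overlapping ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical page descriptor: a half-open range of element indices.
struct PageInfo {
   std::uint64_t firstElement;
   std::uint64_t nElements;
};

// Returns the position of the page containing idx, or -1 if no page covers it.
// Assumes pages are sorted by firstElement and do not overlap.
std::ptrdiff_t FindPage(const std::vector<PageInfo> &pages, std::uint64_t idx)
{
   // Debug-only sanity check of the sortedness assumption (compiled out with NDEBUG).
   assert(std::is_sorted(pages.begin(), pages.end(), [](const PageInfo &a, const PageInfo &b) {
      return a.firstElement < b.firstElement;
   }));

   // First page that starts strictly after idx; the candidate is the one before it.
   auto it = std::upper_bound(pages.begin(), pages.end(), idx,
                              [](std::uint64_t value, const PageInfo &page) { return value < page.firstElement; });
   if (it == pages.begin())
      return -1; // idx lies before the first page (or the list is empty)
   --it;
   if (idx >= it->firstElement + it->nElements)
      return -1; // idx falls into a gap or past the last page
   return it - pages.begin();
}
```

This replaces an O(n) scan with O(log n) comparisons per lookup, which is the case that matters when a single column has many pages.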

Member

@hahnjo hahnjo left a comment

LGTM, thanks!

FileRaii fileGuard("test_ntuple_limits_manyFields.root");

-static constexpr int NumFields = 40'000;
+static constexpr int NumFields = 100'000;
Member

This is fantastic that we can now process models with 100k fields in reasonable time!

@jblomer jblomer merged commit ca0d725 into root-project:master Nov 15, 2024
21 checks passed
@jblomer jblomer deleted the ntuple-fix-page-pool branch November 15, 2024 10:47