
[ntuple] Merger: fix handling of compression-related options #16949

Open · wants to merge 6 commits into master
Conversation

@silverweed (Contributor) commented Nov 14, 2024

Currently the RNTupleMerger has a rather confusing and unintuitive handling of hadd's compression options.
To simplify the situation, this PR implements these simpler rules:

  1. the output RNTuple and its container file always have the same compression;
  2. if the user explicitly passes a compression setting (-f[0-9]), use that compression;
  3. if the user doesn't pass any compression flag, use 505 as the output compression;
  4. if the user passes -ff or -fk, open the first source file, grab the RNTuple inside it, peek at the first column range we can find and use its compression setting as the output compression. This differs from the current behavior of using the same compression as the first source file, which is probably not what the user wants.

This requires passing more information from hadd to the RNTupleMerger. I added a couple of TString merge options to do so, which won't impact the existing merging code as they only get interpreted by the RNTupleMerger.
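The rules above can be sketched as a small, self-contained decision function. This is only an illustration of the described semantics: the function name, signature, and `std::optional` plumbing are hypothetical and not the PR's actual TString-based merge-option API; only the 505 default and the -f[0-9]/-ff/-fk behavior come from the description.

```cpp
#include <optional>

// Illustrative sketch of the compression-resolution rules described above.
// 505 is the default output compression named in rule 3; everything else
// (names, signature) is hypothetical.
constexpr int kDefaultCompression = 505;

int ResolveOutputCompression(std::optional<int> explicitSetting,      // from -f[0-9]
                             bool matchFirstSource,                   // -ff or -fk given
                             std::optional<int> firstSourceSetting)   // peeked from the first
                                                                      // column range of the
                                                                      // first source RNTuple
{
   if (explicitSetting) // rule 2: an explicit -f[0-9] flag wins
      return *explicitSetting;
   if (matchFirstSource && firstSourceSetting) // rule 4: mirror the first source's columns
      return *firstSourceSetting;
   return kDefaultCompression; // rule 3: no compression flag given
}
```

Per rule 1, whatever value this resolves to would then be applied to both the output RNTuple and its container file.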

Checklist:

  • tested changes locally
  • updated the docs (if necessary)

@jblomer (Contributor) left a comment

In principle looks good but we should add tests.

mergeInfo->fOptions.Contains("fast") ? kUnknownCompressionSettings : outFile->GetCompressionSettings();

RNTupleWriteOptions writeOpts;
writeOpts.SetUseBufferedWrite(false);
Contributor:

This option got lost, didn't it?

Contributor Author (@silverweed):

Yes, a victim of the rebase 😅


github-actions bot commented Nov 14, 2024

Test Results

18 files, 18 suites — 4d 9h 48m 48s ⏱️
 2 680 tests:  2 677 ✅, 0 💤,  3 ❌
46 378 runs:  46 358 ✅, 0 💤, 20 ❌

For more details on these failures, see this check.

Results for commit 3aaf05a.

♻️ This comment has been updated with latest results.

@silverweed (Contributor, Author) commented:

@jblomer added a couple of tests for the new rules: root-project/roottest#1220

Member:

The commit message seems to have two title lines, is this intentional?

Contributor Author (@silverweed):

It's a squash of two commits; I could remove one of the lines, as it's not really that relevant.

Comment on lines +97 to +99
// user passed no compression-related options: use default
compression = RCompressionSetting::EDefaults::kUseGeneralPurpose;
Info("RNTuple::Merge", "Using the default compression: %d", compression);
Member:

(cross-posting from #16944 (comment))
Can you remind me what the default is if I just do hadd out.root in1.root in2.root? From a user's perspective, I would not expect this to change the compression / recompress, but the code seems to suggest that I have to pass -ff or -fk to get "fast" merging?

Comment on lines 147 to 148
// Always write the RNTuple and the file with the same compression.
outFile->SetCompressionSettings(compression);
Member:

Should we "always" do this, or only when there is exactly one RNTuple? What about a file that has one RNTuple (505) and one histogram (101)?

Contributor:

I think I agree. At this point we have simply diverged, and the default TFile compression is different from the default RNTuple compression. We can discuss if/how we want to address that, but I don't think we need extra code in the merger. We have the same situation (different compression algorithms) when you write a new RNTuple.

RNTupleWriteOptions writeOpts;
assert(compression != kUnknownCompressionSettings);
writeOpts.SetUseBufferedWrite(false);
Member:

This should go into the previous commit... On the other hand, it probably doesn't do anything, since we are not using RPageSinkBuf in the first place but construct an RPageSinkFile manually, so we might as well just drop it (in a separate commit).

This requires passing more information from hadd to the RNTupleMerger.
I added a couple of TString merge options to do so, which won't impact
the existing merging code as they only get interpreted by the
RNTupleMerger.
@silverweed silverweed force-pushed the ntuple_merge_compression_fix branch 2 times, most recently from c46dd28 to 2bddca6 Compare November 15, 2024 08:59
the option got lost in a rebase
If a compression setting different from the one used by the sink is
given to RNTupleMerger::Merge, the resulting RNTuple would currently
be wrong, as the merger cannot handle this situation correctly right now.
Therefore, for now, we refuse to do the merging if the compression passed
via the merge options differs from the one used by the sink.
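The refusal described in this commit message might look roughly like the following sketch. The function name and plain-int parameters are hypothetical; the real check would live inside RNTupleMerger::Merge and use ROOT's types.

```cpp
#include <stdexcept>

// Sketch of the guard described above: until the merger can re-compress,
// bail out when the compression requested via the merge options differs
// from the one the sink was configured with. Names are illustrative.
void CheckRequestedCompression(int requestedSettings, int sinkSettings)
{
   if (requestedSettings != sinkSettings)
      throw std::runtime_error("output compression requested via merge options "
                               "differs from the sink's; re-compression is not "
                               "supported yet");
}
```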