Skip to content

[WIP] Speedup retrieval of branch names #18358

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

vepadulano
Copy link
Member

@vepadulano vepadulano commented Apr 10, 2025

When the input TTree has a large dataset schema O(10k) and when the branch types are non trivial (i.e. many of the branches store some kind of nested data structure with possibly multiple levels of sub-branches representing the data members), retrieving the full list of available branches in the tree may become a costly operation. Anecdotally, a TTree used by an ATLAS analysis with >30k total leaves results in a retrieval time of around 2 minutes on my machine. With the changes in this PR, this time goes down to around 20 seconds. Main two contributors:

  • a new transient map data member in TTree. When the TTree is being read, it already needs to traverse all its branch hierarchy. Take this opportunity to fill a map that connects the string with the full name of the leaf to the TBranch pointer. Then use this map when retrieving the branch.
  • A fix in the logic in RDF that retrieves the full list of branches, such that in cases where the parent branch name is overlapping with the name of one of its sub branches, the concatenated string is first deduplicated before being passed to TTree::GetBranch. This ensures that a proper match in the aforementioned map can be found.

WIP for now, local testing indicates there is a speedup but it needs to be checked with the CI.

Copy link

github-actions bot commented Apr 11, 2025

Test Results

    19 files      19 suites   4d 19h 55m 19s ⏱️
 2 722 tests  2 719 ✅ 0 💤  3 ❌
49 977 runs  49 924 ✅ 0 💤 53 ❌

For more details on these failures, see this check.

Results for commit 07a421d.

♻️ This comment has been updated with latest results.

@vepadulano vepadulano force-pushed the rdf-speedup-retrieval-branch-names branch from 5e990a6 to 07a421d Compare April 11, 2025 07:42
@@ -139,7 +141,16 @@ void ExploreBranch(TTree &t, std::set<std::string> &bNamesReg, ColumnNames_t &bN
TBranch *subBranch = static_cast<TBranch *>(sb);
auto subBranchName = std::string(subBranch->GetName());
auto fullName = prefix + subBranchName;

auto prefixTokens = ROOT::Split(prefix, ".", true /*skipEmpty*/);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be done once outside of the for

fullName = std::string{};
for (const auto &s : prefixTokens)
fullName += s + ".";
for (decltype(subBranchNameTokens.size()) i = 1; i < subBranchNameTokens.size(); i++)
Copy link
Contributor

@silverweed silverweed Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this code correctly you don't really need to Split the strings; can't you instead find the last occurrence of . in prefix, the first occurrence of . in subBranchName and then, after matching the substrings, do

fullName = prefix + subBranchName.substr(firstDotIdx);

?

(sorry, I know this is a draft and maybe you are already thinking of changing this code, but just in case..)

@TomasDado
Copy link
Contributor

Just to add that with these changes I see about a factor of 5 improvement in the runnig speed for the initialisation part of an RDF setup with a file with 30k branches, thanks!

Comment on lines +5284 to +5286
const std::string prefix{b->GetName()};
if (prefix.compare(0, prefix.size(), name, prefix.size()) != 0)
continue;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is seem odd. Which cases does it support? Does it work correctly if the top level branch is elided from the name search or the branch name (i.e. 'no trailing dot' case)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a smaller optimization I was playing with before the unordered_map. It probably won't be necessary so I think I will remove it.

@@ -226,6 +229,8 @@ class TTree : public TNamed, public TAttLine, public TAttFill, public TAttMarker
};

public:
void InsertNameAndBranch(std::pair<std::string, TBranch *> &&kv) { fNamesToBranches.insert(kv); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably should not be public. We may need to have TBranch__SetTree become a friend of TTree.

Suggested change
void InsertNameAndBranch(std::pair<std::string, TBranch *> &&kv) { fNamesToBranches.insert(kv); }
void RegisterBranchFullName(std::pair<std::string, TBranch *> &&kv) { fNamesToBranches.insert(kv); }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants