[WIP] Speedup retrieval of branch names #18358

vepadulano · 2025-04-10T23:55:51Z

When the input TTree has a large dataset schema (O(10k)) and the branch types are non-trivial (i.e. many of the branches store some kind of nested data structure, possibly with multiple levels of sub-branches representing the data members), retrieving the full list of available branches in the tree may become a costly operation. Anecdotally, for a TTree used by an ATLAS analysis with >30k total leaves, the retrieval takes around 2 minutes on my machine. With the changes in this PR, this time goes down to around 20 seconds. The two main contributors:

- A new transient map data member in TTree. When the TTree is being read, it already needs to traverse its whole branch hierarchy. This traversal is used to fill a map that connects the full name of each leaf (as a string) to the corresponding TBranch pointer. The map is then used when retrieving a branch.
- A fix in the RDF logic that retrieves the full list of branches: when the parent branch name overlaps with the name of one of its sub-branches, the concatenated string is deduplicated before being passed to TTree::GetBranch. This ensures that a proper match can be found in the aforementioned map.
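
To illustrate the first point, here is a minimal, ROOT-independent sketch of a name-to-branch index filled during a single traversal of the hierarchy. `Branch`, `Tree`, `BuildIndex` and `namesToBranches` are hypothetical stand-ins and not the actual TTree/TBranch code in this PR. The sketch also assumes that sub-branch names are unqualified, which is not always true for real TTrees (hence the deduplication described in the second point).

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for TBranch: a name plus owned sub-branches.
struct Branch {
   std::string name;
   std::vector<std::unique_ptr<Branch>> subBranches;
};

// Hypothetical stand-in for TTree.
struct Tree {
   std::vector<std::unique_ptr<Branch>> branches;
   // Transient index: full dotted name -> non-owning branch pointer.
   std::unordered_map<std::string, Branch *> namesToBranches;

   // Register one branch and recurse into its sub-branches.
   void Register(Branch &b, const std::string &prefix)
   {
      const std::string fullName = prefix + b.name;
      namesToBranches.emplace(fullName, &b);
      for (auto &sb : b.subBranches)
         Register(*sb, fullName + ".");
   }

   // Fill the index with one pass over the whole hierarchy.
   void BuildIndex()
   {
      namesToBranches.clear();
      for (auto &b : branches)
         Register(*b, "");
   }

   // Average constant-time lookup instead of a recursive search.
   Branch *GetBranch(const std::string &fullName) const
   {
      const auto it = namesToBranches.find(fullName);
      return it == namesToBranches.end() ? nullptr : it->second;
   }
};
```

With such an index, each lookup by full name is a single hash-map query, whereas a search that walks the hierarchy for every requested name repeats the traversal cost once per branch.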

WIP for now: local testing indicates there is a speedup, but it needs to be checked with the CI.

github-actions · 2025-04-11T04:30:32Z

Test Results

19 files 19 suites 4d 19h 55m 19s ⏱️
2 722 tests 2 719 ✅ 0 💤 3 ❌
49 977 runs 49 924 ✅ 0 💤 53 ❌

For more details on these failures, see this check.

Results for commit 07a421d.

♻️ This comment has been updated with latest results.

silverweed · 2025-04-11T08:36:20Z

tree/dataframe/src/RLoopManager.cxx

@@ -139,7 +141,16 @@ void ExploreBranch(TTree &t, std::set<std::string> &bNamesReg, ColumnNames_t &bN
      TBranch *subBranch = static_cast<TBranch *>(sb);
      auto subBranchName = std::string(subBranch->GetName());
      auto fullName = prefix + subBranchName;
-
+      auto prefixTokens = ROOT::Split(prefix, ".", true /*skipEmpty*/);


This may be done once, outside of the for loop.

silverweed · 2025-04-11T08:41:22Z

tree/dataframe/src/RLoopManager.cxx

+         fullName = std::string{};
+         for (const auto &s : prefixTokens)
+            fullName += s + ".";
+         for (decltype(subBranchNameTokens.size()) i = 1; i < subBranchNameTokens.size(); i++)


If I understand this code correctly you don't really need to Split the strings; can't you instead find the last occurrence of . in prefix, the first occurrence of . in subBranchName and then, after matching the substrings, do

fullName = prefix + subBranchName.substr(firstDotIdx);

?

(sorry, I know this is a draft and maybe you are already thinking of changing this code, but just in case..)
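
A minimal sketch of the substring-based idea from this comment, without splitting either string. `JoinWithoutDuplicate` is a hypothetical helper, not code from the PR. It assumes that `prefix` is either empty or ends with a trailing `.`, and it only handles the case where the last token of `prefix` equals the first token of `subBranchName`. Because of the assumed trailing dot, the sketch skips the dot in `subBranchName` (`firstDotIdx + 1`).

```cpp
#include <string>

// Hypothetical helper: concatenate prefix and subBranchName, dropping the
// first token of subBranchName when it repeats the last token of prefix.
// Assumption: prefix is empty or ends with '.', e.g. "a.b.".
std::string JoinWithoutDuplicate(const std::string &prefix, const std::string &subBranchName)
{
   if (prefix.empty() || prefix.back() != '.')
      return prefix + subBranchName;

   const auto firstDotIdx = subBranchName.find('.');
   if (firstDotIdx == std::string::npos)
      return prefix + subBranchName; // single token, nothing to deduplicate

   // Locate the last token of prefix: it sits between the previous dot
   // (or the start of the string) and the trailing dot.
   const auto trailingDotIdx = prefix.size() - 1;
   const auto prevDotIdx =
      trailingDotIdx == 0 ? std::string::npos : prefix.rfind('.', trailingDotIdx - 1);
   const auto tokenStart = prevDotIdx == std::string::npos ? 0 : prevDotIdx + 1;
   const auto tokenLen = trailingDotIdx - tokenStart;

   // Compare the two tokens in place, without allocating substrings.
   if (tokenLen == firstDotIdx && prefix.compare(tokenStart, tokenLen, subBranchName, 0, firstDotIdx) == 0)
      return prefix + subBranchName.substr(firstDotIdx + 1);

   return prefix + subBranchName;
}
```

For example, under these assumptions `JoinWithoutDuplicate("a.b.", "b.c")` yields `"a.b.c"`, while `JoinWithoutDuplicate("a.b.", "x.c")` keeps both parts and yields `"a.b.x.c"`.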

TomasDado · 2025-04-11T14:18:38Z

Just to add that with these changes I see about a factor of 5 improvement in the running speed of the initialisation part of an RDF setup with a file with 30k branches, thanks!

pcanal · 2025-04-16T18:11:48Z

tree/tree/src/TTree.cxx

+      const std::string prefix{b->GetName()};
+      if (prefix.compare(0, prefix.size(), name, prefix.size()) != 0)
+         continue;


This seems odd. Which cases does it support? Does it work correctly if the top-level branch is elided from the searched name or from the branch name (i.e. the 'no trailing dot' case)?

vepadulano · 2025-04-17

It was a smaller optimization I was playing with before the unordered_map. It probably won't be necessary, so I think I will remove it.

pcanal · 2025-04-16T18:16:29Z

tree/tree/inc/TTree.h

@@ -226,6 +229,8 @@ class TTree : public TNamed, public TAttLine, public TAttFill, public TAttMarker
   };

 public:
+   void InsertNameAndBranch(std::pair<std::string, TBranch *> &&kv) { fNamesToBranches.insert(kv); }


This probably should not be public. We may need to have TBranch__SetTree become a friend of TTree.

Suggested change

-   void InsertNameAndBranch(std::pair<std::string, TBranch *> &&kv) { fNamesToBranches.insert(kv); }
+   void RegisterBranchFullName(std::pair<std::string, TBranch *> &&kv) { fNamesToBranches.insert(kv); }
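
A minimal sketch of the non-public variant, using a friend free function as the only caller. `Tree`, `Branch`, `SetTreeHelper` and `FindByFullName` are hypothetical stand-ins; the real TTree, TBranch and `TBranch__SetTree` declarations are not reproduced here. The sketch also forwards the pair with `std::move`, since a named rvalue-reference parameter such as `kv` is an lvalue inside the function body and `insert(kv)` would copy it.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

struct Branch {}; // hypothetical stand-in for TBranch

class Tree; // hypothetical stand-in for TTree

// Hypothetical free function playing the role of the friend helper.
void SetTreeHelper(Tree &tree, Branch &branch, const std::string &fullName);

class Tree {
   std::unordered_map<std::string, Branch *> fNamesToBranches;

   // Private: only friends of Tree may register names.
   void RegisterBranchFullName(std::pair<std::string, Branch *> &&kv)
   {
      // kv is an lvalue here; std::move is needed to actually move from it.
      fNamesToBranches.insert(std::move(kv));
   }

   friend void SetTreeHelper(Tree &, Branch &, const std::string &);

public:
   // Read-only lookup, returning nullptr when the name is unknown.
   Branch *FindByFullName(const std::string &name) const
   {
      const auto it = fNamesToBranches.find(name);
      return it == fNamesToBranches.end() ? nullptr : it->second;
   }
};

void SetTreeHelper(Tree &tree, Branch &branch, const std::string &fullName)
{
   tree.RegisterBranchFullName({fullName, &branch});
}
```

Keeping the registration method private means user code cannot insert arbitrary entries into the map; only the designated helper can.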

vepadulano self-assigned this Apr 10, 2025

vepadulano added in:RDataFrame in:TTree labels Apr 10, 2025

[WIP] Speedup retrieval of branch names

07a421d

vepadulano force-pushed the rdf-speedup-retrieval-branch-names branch from 5e990a6 to 07a421d Compare April 11, 2025 07:42

silverweed reviewed Apr 11, 2025

pcanal reviewed Apr 16, 2025
