ROOT DIFF

Currently in development. Plans:

Compare with and without decompression (skipping decompression should hopefully be faster when the compression level and algorithm have not changed)
Load all baskets at once and compare (would hopefully resolve limitation of changing basketsize)
Looking to generalize from TTrees to all objects between two ROOT files
Implement some testing using ctest package
Maybe auto-test using GitHub actions?
Option to install alongside the ROOT used to build it

Quick Start

Need ROOT (version 6.16+) installed and initialized on your system.
Need cmake (version 3.12+) installed on your system.
Configure build: mkdir build; cd build; cmake -DCMAKE_INSTALL_PREFIX=../install ..
Build: make
Install: make install

Why?

I want to be able to check that my developments only change things I want them to change. One of the most direct ways to do this is to compare the output files to make sure that the final output has not been modified after some code changes. This method allows me to do this comparison while leaving the contents of the different branches of our trees in their serialized form. This "lower-level" comparison has the following benefits.

Speed - Only looking at decompressed buffers instead of having to create all of the hierarchical objects really speeds up the comparison.
Robustness - Comparing the decompressed buffers directly is less prone to bugs than having to define a comparison method for each of our event object types.
Flexibility - We won't have to change the comparison code when adding/moving/changing event objects. The comparison code will notice that objects have changed, but it won't break.

Explanation of Method

(a.k.a How?) Buckle up. This is quite the ride.

ROOT Serialization Primer

To understand what I'm doing here, you first need to understand how ROOT serializes a TTree.

Splitting

Each TTree has TBranches created through the TTree::Branch method. If allowed using a non-zero "split level", ROOT will "split" TBranches of complicated objects into several parallel TBranches of less complicated objects. For example, a TBranch of a struct MyObj { int my_int_; float my_float_; }; would be split into two sub-branches: one for my_int_ and one for my_float_. The splitting process is recursive. If a TBranch has a sub-branch that is a complicated object itself, the sub-branch can also split into less complicated sub-branches.

Only the lowest-level branches (branches with no sub-branches) follow data and serialize it into the output file. The higher level branches (branches with sub-branches) are only useful for interfacing between our complicated, hierarchical C++ objects and the simple, serialized ones and zeros in the file. This is crucial. For our purposes here, we don't care about the higher-level branches because we only want to look at the simple, serialized data that is easy to compare. Since the splitlevel changes what the lowest-level branches are, we will need to assume that the splitlevel input is the same for branches of the same name.

From now on, when I say "branch", assume I'm talking about only these lowest-level branches.

Baskets

Branches whose data is actually being serialized into or out of the file often contain large amounts of data that cannot be loaded into memory all at once. In order to get around this difficulty, ROOT "chunks" branches into baskets (TBaskets), which are the objects serialized into the file. The size of these baskets is configurable and is called buffsize at the TTree::Branch level. Since the number of baskets and which data is in which basket change depending on the size of these baskets, we will need to assume that the buffsize input is the same for branches of the same name.
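
To make the buffsize dependence concrete, here is a minimal sketch of the chunking idea. It is a simplified model and not ROOT's actual basket-filling logic: it assumes a greedy rule where a new basket starts whenever the next entry would not fit, and the function name partition_into_baskets is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Simplified model (an assumption, not ROOT's real algorithm):
// fill a basket until the next entry would exceed buffsize, then start a new one.
// Returns, for each basket, the indices of the entries it holds.
std::vector<std::vector<std::size_t>> partition_into_baskets(
    const std::vector<std::size_t>& entry_sizes, std::size_t buffsize) {
  std::vector<std::vector<std::size_t>> baskets;
  std::size_t used = 0;  // bytes already placed in the current basket
  for (std::size_t i = 0; i < entry_sizes.size(); ++i) {
    // Every basket holds at least one entry, so an oversized entry gets its own basket.
    if (baskets.empty() || used + entry_sizes[i] > buffsize) {
      baskets.emplace_back();
      used = 0;
    }
    baskets.back().push_back(i);
    used += entry_sizes[i];
  }
  return baskets;
}
```

In this model, four 4-byte entries with a buffsize of 8 land in two baskets of two entries each, while a buffsize of 12 puts three entries in the first basket and one in the second. The data is identical, but the basket boundaries are not, which is why a basket-by-basket comparison needs matching buffsize.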

The TBasket is where the data from its corresponding TBranch is compressed (or decompressed), so the TBasket level is where we want to be. Note that the TBranch serializes the object before giving the data to the TBasket, so the TBasket doesn't need to know the type of object that the TBranch is following.

Summary

In summary, each TTree has several TBranches. Each TBranch may be split into several child TBranches (recursively) depending on the splitlevel input. The bottom TBranches have several TBaskets. Each TBasket holds one or more entries of the corresponding TBranch, depending on the memory size of the TBranch entries. How the entries in the TBranch are partitioned into TBaskets is controlled by the buffsize input.

Objects in General

Finally, I need to make a comment about how ROOT writes objects to files. This applies to any object that ROOT writes, and TBaskets are a special case. ROOT writes objects in two stages. First, ROOT writes a "header" which contains object details such as the name of the object, its class, the size of the object, its location in the file, and other information we won't use. This "header" is also called a "key" in ROOT terminology, which is why you see TKeys floating around. The second stage, immediately after this header, is the serialized (usually also compressed) data. At the end of the day, once we have this "key", we can read the serialized data from the file directly. TBasket is actually a specialization of TKey for interfacing with TBranches, so you won't see TKey in the code above; however, you will see me calling TKey methods from the derived class TBasket.
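
The "header, then data" layout can be sketched as follows. This is a toy model under stated assumptions: the struct KeyModel, its field names, and read_payload are hypothetical and do not reflect the real TKey layout or API. It only illustrates that once the header tells us where the record sits and how long each part is, the serialized bytes can be sliced out without knowing the object's type.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;

// Hypothetical, simplified "key": only the details needed to find the payload.
struct KeyModel {
  std::string name;          // name of the object
  std::string class_name;    // class of the object
  std::size_t position;      // offset in the file where the header starts
  std::size_t header_size;   // length of the header in bytes
  std::size_t payload_size;  // length of the serialized data after the header
};

// Return the serialized bytes that immediately follow the header.
Bytes read_payload(const Bytes& file, const KeyModel& key) {
  const std::size_t begin = key.position + key.header_size;
  if (begin > file.size() || key.payload_size > file.size() - begin) {
    throw std::out_of_range("key points outside of the file");
  }
  const auto first = file.begin() + static_cast<std::ptrdiff_t>(begin);
  return Bytes(first, first + static_cast<std::ptrdiff_t>(key.payload_size));
}
```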

The Method

Now we can actually talk about how this method of finding the difference between two TTrees works. I've called the TTree wrapper and TBranch wrapper "Bare" to reflect the fact that they only access the already-serialized data and have no knowledge of the types of objects they are comparing.

bare::Tree

This tree wrapper makes sure we can access the requested tree in the open file and then generates the "flat" list of branches by recursively calling GetListOfBranches() until we obtain a list of all the lowest-level branches that are part of this tree. This is where we deal with the splitting. This implies that two branches with the same data must have the same splitlevel to pass the comparison.
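
The recursion described above can be sketched without ROOT. In this sketch, BranchNode is a hypothetical stand-in for a TBranch (a name plus its sub-branches), and flat_branch_names is an illustrative name, not the function used in this repository.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a TBranch: a full name plus zero or more sub-branches.
struct BranchNode {
  std::string name;
  std::vector<BranchNode> children;
};

// Append the names of the lowest-level branches under `node` to `out`.
void collect_leaves(const BranchNode& node, std::vector<std::string>& out) {
  if (node.children.empty()) {
    out.push_back(node.name);  // no sub-branches: this branch holds serialized data
    return;
  }
  for (const BranchNode& child : node.children) {
    collect_leaves(child, out);  // recurse, mirroring repeated GetListOfBranches() calls
  }
}

// Build the "flat" list of lowest-level branch names for a whole tree.
std::vector<std::string> flat_branch_names(const std::vector<BranchNode>& top_level) {
  std::vector<std::string> names;
  for (const BranchNode& branch : top_level) {
    collect_leaves(branch, names);
  }
  return names;
}
```

With the MyObj example from the primer, a split branch contributes two entries to the flat list (one per member), while the same object written unsplit contributes a single entry. The flat lists then differ even though the data is the same, which is the splitlevel limitation in action.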

During the comparison, we store branch names that only exist on our tree, branch names that only exist on the other tree, and branch names that exist on both trees but have different content. These three lists of branch names are then used to determine if the two trees match (all three lists are empty if the trees match). If there isn't a match, the lists are printed (outside this class).
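
A minimal sketch of this three-list bookkeeping, assuming each tree has already been reduced to a map from full branch name to serialized content. The names FlatTree, TreeDiff, and compare_trees are hypothetical and chosen for illustration only.

```cpp
#include <map>
#include <string>
#include <vector>

using Bytes = std::vector<unsigned char>;
// Hypothetical flattened tree: full branch name -> serialized content.
using FlatTree = std::map<std::string, Bytes>;

struct TreeDiff {
  std::vector<std::string> only_in_ours;   // branch names missing from the other tree
  std::vector<std::string> only_in_other;  // branch names missing from our tree
  std::vector<std::string> different;      // same name on both trees, different content
  // The trees match only if all three lists are empty.
  bool match() const {
    return only_in_ours.empty() && only_in_other.empty() && different.empty();
  }
};

TreeDiff compare_trees(const FlatTree& ours, const FlatTree& other) {
  TreeDiff diff;
  for (const auto& entry : ours) {
    const auto it = other.find(entry.first);
    if (it == other.end()) {
      diff.only_in_ours.push_back(entry.first);
    } else if (it->second != entry.second) {
      diff.different.push_back(entry.first);
    }
  }
  for (const auto& entry : other) {
    if (ours.find(entry.first) == ours.end()) {
      diff.only_in_other.push_back(entry.first);
    }
  }
  return diff;
}
```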

bare::Branch

This branch wrapper is where the main comparison process takes place. There are two levels of comparison.

First, we check if two branches have the same name. We use the full name because (as mentioned above) these branches are the "lowest-level" branches, so we need the name all the way to the root. Two branches only have the same name if their full names are exactly equal. This requires that the event files have the same pass name to pass the test. We could look into allowing for different pass names, but I like requiring the same pass name. It emphasizes that this comparison tool isn't meant for two different configs.

Second, we check that the contents of the two branches are the same. Again, since this tool is focused on comparing files expected to be mostly similar, we don't do fancy things like trying to figure out what the differing content is. If any part of the two branches is different, the whole branch fails the comparison. We do this comparison by loading the baskets of both branches into memory (this reads in the "header" of all the TBaskets associated with both TBranches). Then we can loop through these baskets and compare their content one-by-one. Reading the content of the baskets into memory from the file is where we handle decompression of the data. However, since the number of baskets and what data is in which basket depend on the buffer size of the branches, two branches with the same data must have the same buffsize to pass the comparison.
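
The basket-by-basket content check can be sketched as follows. This assumes the baskets of each branch have already been read and decompressed into plain byte buffers; the Baskets alias and same_content are hypothetical names, and the real code works on TBasket objects instead.

```cpp
#include <cstddef>
#include <vector>

using Bytes = std::vector<unsigned char>;
// Hypothetical view of a branch: one decompressed byte buffer per basket, in order.
using Baskets = std::vector<Bytes>;

// Two branches have the same content only if they have the same number of
// baskets and every pair of corresponding baskets is byte-for-byte equal.
bool same_content(const Baskets& ours, const Baskets& other) {
  if (ours.size() != other.size()) {
    return false;
  }
  for (std::size_t i = 0; i < ours.size(); ++i) {
    if (ours[i] != other[i]) {
      return false;  // any differing basket fails the whole branch
    }
  }
  return true;
}
```

Note that the bytes 1, 2, 3, 4 stored as baskets {1, 2} and {3, 4} do not compare equal to the same bytes stored as {1, 2, 3} and {4}. The sketch fails on the differing basket boundaries alone, which is the buffsize limitation described above.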

Known Limitations

As highlighted in my comments, this method has the following known limitations.

Two event trees need to have the same pass name to be compared effectively. This limitation is actually a feature because it emphasizes that this tool is for comparing files generated by the same configuration script.
Two branches need to have the same splitlevel and buffsize to be compared effectively. This limitation means that this tool cannot be used to validate changes to the code where the splitlevel or buffsize changes. Luckily, the splitlevel and buffsize are currently hardcoded in our code (or left as the default values in the case of the NtupleManager), so hopefully this limitation won't come up frequently enough to matter.

tomeichlersmith/root-diff
