Efficiently writing many histograms to a root file #1242

Teddy-Curtis · 2024-07-03T10:05:50Z

Hey Experts!

I have a quick question about how best to write many histograms (boost-histogram objects) to a ROOT file using uproot v5.3.7.
The documentation [1] says that one way to write a boost-histogram to a ROOT file is simply to do, e.g.:

import boost_histogram as bh
import uproot
import numpy as np

# get random data
s = np.random.normal(50, 20, 100000)

# Make the histogram
bins = np.linspace(0, 100, 100)
hist, _ = np.histogram(s, bins=bins)
err = np.sqrt(hist)

# put into boost hist; Weight storage holds (value, variance), so square the error
boost_hist = bh.Histogram(bh.axis.Variable(bins), storage=bh.storage.Weight())
boost_hist[...] = np.stack([hist, err**2], axis=-1)

# Now save to root file 
with uproot.recreate("test.root") as f:
    f["hist"] = boost_hist

and this seems to work completely fine.
However, in my case I need to save a very large number of histograms to the ROOT file (tens of thousands or more), and this method slows down considerably as more histograms are written to the file.
For example:

import boost_histogram as bh
import uproot
import numpy as np
from tqdm import tqdm

# List of histograms that I want to save
data_list = []

for _ in range(10000):
    # get random data
    s = np.random.normal(50, 20, 1000)

    # Make the histogram
    bins = np.linspace(0, 100, 100)
    hist, _ = np.histogram(s, bins=bins)
    err = np.sqrt(hist)

    # put into boost hist; Weight storage holds (value, variance), so square the error
    boost_hist = bh.Histogram(bh.axis.Variable(bins), storage=bh.storage.Weight())
    boost_hist[...] = np.stack([hist, err**2], axis=-1)

    data_list.append(boost_hist)


# Now save to root file (the context manager closes the file when the loop ends)
with uproot.recreate("test.root") as file:
    for i, h in tqdm(enumerate(data_list), total=len(data_list)):
        file[f"hist{i}"] = h

tqdm shows that the saving loop slows down considerably even for this small example, going from roughly 300 iterations per second at the start to roughly 90 by the end. My actual case is even worse because I have more histograms to save, and the loop almost grinds to a halt halfway through.

Alternatively, I thought it might be possible to save all of these histograms in one go by putting them in a dictionary and saving them as a tree, following the example in [1]. But I believe this is only possible if the data are NumPy/Awkward arrays or pandas DataFrames, so the attempt below fails with a TypeError:

import boost_histogram as bh
import uproot
import numpy as np
from tqdm import tqdm

# dict of histograms that I want to save
hist_dict = {}

for i in range(10000):
    # get random data
    s = np.random.normal(50, 20, 1000)

    # Make the histogram
    bins = np.linspace(0, 100, 100)
    hist, _ = np.histogram(s, bins=bins)
    err = np.sqrt(hist)

    # put into boost hist; Weight storage holds (value, variance), so square the error
    boost_hist = bh.Histogram(bh.axis.Variable(bins), storage=bh.storage.Weight())
    boost_hist[...] = np.stack([hist, err**2], axis=-1)

    hist_dict[f'hist{i}'] = boost_hist


# Now save to root file 
with uproot.recreate("test.root") as f:
    f["hists"] = hist_dict


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[79], line 27
     25 # Now save to root file 
     26 with uproot.recreate("test.root") as f:
---> 27     f["hists"] = hist_dict

File ~/micromamba/envs/pnn-study/lib/python3.9/site-packages/uproot/writing/writable.py:984, in WritableDirectory.__setitem__(self, where, what)
    982 if self._file.sink.closed:
    983     raise ValueError("cannot write data to a closed file")
--> 984 self.update({where: what})

File ~/micromamba/envs/pnn-study/lib/python3.9/site-packages/uproot/writing/writable.py:1553, in WritableDirectory.update(self, pairs, **more_pairs)
   1550     for item in path:
   1551         directory = directory[item]
-> 1553     uproot.writing.identify.add_to_directory(v, name, directory, streamers)
   1555 self._file._cascading.streamers.update_streamers(self._file.sink, streamers)

File ~/micromamba/envs/pnn-study/lib/python3.9/site-packages/uproot/writing/identify.py:132, in add_to_directory(obj, name, directory, streamers)
    128     branch_array = awkward.from_iter(  # noqa: PLW2901 (overwriting branch_array)
    129         branch_array
    130     )
    131 except Exception:
--> 132     raise TypeError(
    133         f"unrecognizable array type {type(branch_array)} associated with {branch_name!r}"
    134     ) from None
    135 else:
    136     data[branch_name] = branch_array

TypeError: unrecognizable array type <class 'boost_histogram.Histogram'> associated with 'hist0'

If anyone could help me fix this issue, it would be greatly appreciated!
I should add that I am using boost-histogram because it lets me store the associated bin errors as well, which I wasn't sure how to do when saving NumPy histograms. If there is a way to do that, then I believe I could use the dictionary approach from the final example above.

Thanks!

jpivarski · 2024-07-03T12:37:26Z

Maintainer

What would be ideal would be to have a serialization that writes the shared metadata (axis ranges, numbers of bins, labels, etc.) only once and the bin contents of all the histograms as a single, large multidimensional array. I've been trying to get developers interested in this idea for a while (most recently here).
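To make that layout concrete, here is a minimal sketch, assuming every histogram shares one set of bin edges. It uses NumPy's `.npz` container purely as a stand-in: this is not a ROOT or boost-histogram format, and `pack_histograms` and `unpack_histograms` are hypothetical helper names chosen for illustration.

```python
import io

import numpy as np


def pack_histograms(edges, values, variances):
    """Store the bin edges once and all bin contents as (n_hist, n_bins) arrays."""
    edges = np.asarray(edges, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    variances = np.asarray(variances, dtype=np.float64)
    if values.ndim != 2 or values.shape != variances.shape:
        raise ValueError("values and variances must both have shape (n_hist, n_bins)")
    if values.shape[1] != len(edges) - 1:
        raise ValueError("number of bins does not match the shared edges")
    buffer = io.BytesIO()
    np.savez_compressed(buffer, edges=edges, values=values, variances=variances)
    return buffer.getvalue()


def unpack_histograms(blob):
    """Inverse of pack_histograms: returns (edges, values, variances)."""
    with np.load(io.BytesIO(blob)) as data:
        return data["edges"], data["values"], data["variances"]


rng = np.random.default_rng(0)
edges = np.linspace(0, 100, 100)  # 100 edges, i.e. 99 bins shared by all histograms
samples = rng.normal(50, 20, size=(1000, 1000))
values = np.stack([np.histogram(row, bins=edges)[0] for row in samples])
variances = values.astype(np.float64)  # unweighted fills: variance equals the count

blob = pack_histograms(edges, values, variances)
edges2, values2, variances2 = unpack_histograms(blob)
```

The point of the sketch is only the shape of the data: one small array of metadata plus one large contiguous array of contents, instead of one self-describing object per histogram.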

Uproot takes the boost-histogram, converts it into a ROOT histogram class (TH1, TH2, or TH3), which carries a lot more metadata than the boost-histogram itself, and saves that. We've had some discussions of native boost-histogram serialization; there's a draft schema and (I think) an implementation for HDF5, but (as far as I know) no serialization for ROOT yet. Boost-histograms can also be written with boost-serialization.

Uproot's algorithm for determining what ROOT type to write on assignment considers any dict or DataFrame to be a TTree. It's technically possible for a ROOT TTree to contain histograms, instead of event data, but that's an odd (and not efficient) thing to do, and Uproot doesn't do it.

There isn't a faster way to write large numbers of histograms to a ROOT file with Uproot, so you have a few options:

- Use PyROOT, which would involve manual conversion from boost-histogram to ROOT histogram types. It might be a little faster because all of the metadata handling is in C++ code rather than Python, but the end result is the same (no sharing of metadata).
- Stack histograms with the same axes into as many as 3 dimensions, so that a large set of histograms can be written as a single TH3.
- Write the histograms to HDF5 or boost-serialization instead of ROOT.
- Advocate for or develop improvements in boost-histogram serialization. This is an area that's lacking in the Scikit-HEP ecosystem and is low-hanging fruit for improvement, if you're up for it.
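As a rough illustration of the stacking option, the sketch below packs many 1D histograms with identical binning into one 2D histogram (so it is written as a single TH2 rather than thousands of TH1 objects). It assumes unweighted fills, so the variance equals the count; `stack_counts` and `write_stacked` are hypothetical helper names, and the index axis is an arbitrary choice made for this example.

```python
import numpy as np


def stack_counts(samples, edges):
    """Histogram each row of `samples` with the shared `edges`.

    Returns (values, variances), each of shape (n_hist, n_bins).
    """
    values = np.stack([np.histogram(row, bins=edges)[0] for row in samples])
    values = values.astype(np.float64)
    return values, values.copy()  # assumption: unweighted fills, variance == count


def write_stacked(path, name, edges, values, variances):
    """Write all histograms as one 2D histogram: x = histogram index, y = bins."""
    # Imported here so that stack_counts needs only NumPy.
    import boost_histogram as bh
    import uproot

    n_hist = values.shape[0]
    h2 = bh.Histogram(
        bh.axis.Regular(n_hist, 0, n_hist),  # bin i holds histogram number i
        bh.axis.Variable(edges),
        storage=bh.storage.Weight(),
    )
    view = h2.view()  # view of the non-flow bins, fields "value" and "variance"
    view["value"][...] = values
    view["variance"][...] = variances

    with uproot.recreate(path) as f:
        f[name] = h2  # a single object is written to the file


rng = np.random.default_rng(0)
edges = np.linspace(0, 100, 100)
values, variances = stack_counts(rng.normal(50, 20, size=(200, 1000)), edges)
# write_stacked("test.root", "hists", edges, values, variances)
```

When reading the file back, row `i` of the 2D histogram's values corresponds to histogram `i`. This only applies to histograms that share the same binning, and it reduces the number of objects written rather than making each individual write faster.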
