Extremely large memory consumption when reading from an arbitrary location of a root file #319

kandrosov · 2019-08-21T15:48:13Z

Hello,
when trying to read a chunk of data from an arbitrary location in a large ROOT file, uproot consumes all available RAM (> 64 GB) and eventually crashes. If the same file is converted to HDF5 (using root_pandas.read_root and pandas.DataFrame.to_hdf) and the same chunk of data is then read with pandas.read_hdf, it works fast and consumes less than 1 GB of RAM.
Is this related to some intrinsic limitation of the ROOT file format, or is there a way to overcome this problem?
Here is a minimal example to reproduce the issue:

import uproot
import pandas
f = uproot.open("data.root")
tree = f['outer_cells']
df = tree.arrays('*', outputtype=pandas.DataFrame, namedecode='utf-8',
                 entrystart=3422904, entrystop=3429655)
# the program crashes before arriving to this point

System information: uproot version 3.9.0, python version 3.6.7, OS CERN CentOS 7.
The data file can be found here: https://cernbox.cern.ch/index.php/s/QOUBLxRUXpek7tz (or /eos/home-k/kandroso/share-tmp/data.root requires lxplus access)


jpivarski · 2019-08-21T18:17:20Z

I reproduced your bug and found a cure, though this definitely goes under the "One Weird Trick" category. Do this:

tree._recover()

before attempting tree.arrays.
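
A minimal sketch of this workaround wrapped in a helper, so the recovery step always runs before the read. The function name read_entries is hypothetical (it is not part of uproot), and tree is assumed to be an uproot 3 TTree such as f['outer_cells'] from the report above. Because _recover is a private method, the sketch checks that it exists before calling it.

```python
def read_entries(tree, entrystart, entrystop, **kwargs):
    """Recover embedded baskets for all branches, then read one entry range.

    `tree` is assumed to be an uproot 3 TTree; `read_entries` itself is a
    hypothetical helper, not an uproot function.
    """
    # _recover() is the private method named in this thread. Guard the call
    # in case the installed version does not provide it.
    recover = getattr(tree, "_recover", None)
    if callable(recover):
        recover()
    # Only after every branch has been recovered, read the requested range.
    return tree.arrays("*", entrystart=entrystart, entrystop=entrystop, **kwargs)
```

With the file from the report, this would be called as read_entries(tree, 3422904, 3429655, outputtype=pandas.DataFrame, namedecode='utf-8').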

The long explanation: when you write a ROOT file and close it without first calling TTree::Write and TFile::Close, the resulting file is in a different state than it would be if you had closed it "properly". This different state is declared to be valid, and readers must read it. The unflushed data (usually the last basket) is embedded in the TTree object, rather than being written in individually readable objects throughout the file. This is the case for your file, and for many others, since ROOT doesn't tell you that the file is in this different state.

At first, uproot didn't support this, but I've since added code to read embedded baskets the first time you try to read anything from a branch. You also have many branches, and each one was trying to recover its baskets in between reading the data you were actually interested in. It's not an excessive amount of data, but the order was bad: it couldn't let go of previously read branches while each branch went and recovered its baskets, and the garbage collector couldn't do its job.

By recovering all branches up-front (which doesn't take very long or very much memory), it doesn't have this problem when it goes and tries to read the data you're interested in (which also doesn't take very long or very much memory). That's why this feels like "One Weird Trick": you get the same amount of work done, but doing it in a different order makes the difference between a few MB in under a second and crashing your computer with 64 GB.
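
To compare the two orderings on your own file, one option is to measure peak memory around the read. The sketch below uses the standard-library tracemalloc module; the helper name peak_python_memory is hypothetical. Note the assumption: tracemalloc only sees allocations made through Python's memory allocators, so memory allocated directly by C extensions may not be counted, and the number should be treated as a rough indicator.

```python
import tracemalloc


def peak_python_memory(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, peak traced bytes).

    Hypothetical helper: the peak covers only memory that tracemalloc can
    trace, i.e. allocations made through Python's allocators.
    """
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

For example, peak_python_memory(tree.arrays, '*', entrystart=3422904, entrystop=3429655) could be run once in a fresh process with tree._recover() called first and once without it. Without the recovery step the read may still exhaust memory before returning, as in the original report.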

I'm also putting in a fix for all methods that read multiple branches, such as tree.arrays. From now on, if uproot knows it will be reading multiple branches (eventually), it will recover them all up-front. When that fix is in, calling tree._recover() will be unnecessary, but harmless.

jpivarski · 2019-08-21T18:18:32Z

See PR #320.

kandrosov · 2019-08-21T18:47:24Z

Thank you for the fast fix and the detailed explanation! I confirm that after calling tree._recover() the readout is very fast and takes only a few MB of memory.
FYI: during the creation of this file, directory->WriteTObject(tree, tree->GetName(), "Overwrite"); was called at the end, but file->Close(); was not (because I thought that it would be done by the destructor when calling delete file;). I'll try to do both in the next production round.


jpivarski added a commit that referenced this issue Aug 21, 2019

Fix memory issue (#319) by recovering all interesting branches before reading any.

b03edae

jpivarski closed this as completed Aug 21, 2019

jpivarski added a commit that referenced this issue Aug 21, 2019

Merge pull request #320 from scikit-hep/issue-319

31fa124

Fix memory issue (#319) by recovering all interesting branches before reading any.
