This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Extremely large memory consumption when reading from an arbitrary location of a root file #319

Closed
kandrosov opened this issue Aug 21, 2019 · 3 comments

@kandrosov

kandrosov commented Aug 21, 2019

Hello,
when trying to read a chunk of data from an arbitrary location in a large ROOT file, uproot consumes all available RAM (> 64 GB) and eventually crashes. If the same file is converted to HDF5 (using root_pandas.read_root and pandas.DataFrame.to_hdf) and the same chunk is then read with pandas.read_hdf, the read is fast and uses less than 1 GB of RAM.
Is this related to some intrinsic limitation of the ROOT file format, or is there a way to overcome this problem?
Here is minimal code to reproduce the issue:

import uproot
import pandas
f = uproot.open("data.root")
tree = f['outer_cells']
df = tree.arrays('*', outputtype=pandas.DataFrame, namedecode='utf-8',
                 entrystart=3422904, entrystop=3429655)
# the program crashes before reaching this point

System information: uproot version 3.9.0, python version 3.6.7, OS CERN CentOS 7.
The data file can be found here: https://cernbox.cern.ch/index.php/s/QOUBLxRUXpek7tz (or /eos/home-k/kandroso/share-tmp/data.root requires lxplus access)

@jpivarski
Member

I reproduced your bug and found a cure, though this definitely goes under the "One Weird Trick" category. Do this:

tree._recover()

before attempting tree.arrays.
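A minimal sketch of this workaround wrapped in a helper, assuming an uproot 3.x TTree object. The helper name read_chunk is hypothetical and not part of uproot. Because _recover is a private method that may be absent in other versions, the sketch looks it up defensively and then forwards all arguments to tree.arrays unchanged.

```python
def read_chunk(tree, branches="*", **kwargs):
    """Recover embedded baskets for all branches, then read the data.

    `tree` is assumed to be an uproot 3.x TTree. `read_chunk` itself is a
    hypothetical helper, not an uproot function.
    """
    # `_recover` is private, so only call it if this object provides it.
    recover = getattr(tree, "_recover", None)
    if callable(recover):
        recover()  # recover every branch up-front, before reading any data
    # Forward everything else (entrystart, entrystop, outputtype, ...) as-is.
    return tree.arrays(branches, **kwargs)
```

With the tree from the report above, the call would look like read_chunk(tree, outputtype=pandas.DataFrame, namedecode='utf-8', entrystart=3422904, entrystop=3429655).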

The long explanation: when you write a ROOT file and finish without first calling TTree::Write and TFile::Close, the resulting file is in a different state than it would be if you had closed it "properly." This different state is declared to be valid, and readers must be able to read it. The unflushed data (usually the last basket) is embedded in the TTree object, rather than being written as individually readable objects throughout the file. This is the case for your file, and for many others, since ROOT doesn't tell you that the file is in this different state.

At first, uproot didn't support this, but I've since added code to read embedded baskets the first time you try to read anything from a branch. You also have many branches, and each one was trying to recover its baskets in between reading the data you were actually interested in. It's not an excessive amount of data, but the order was bad: it couldn't let go of previously read branches while each branch went and recovered its baskets, and the garbage collector couldn't do its job.

By recovering all branches up-front (which doesn't take very long or very much memory), it doesn't have this problem when it goes and tries to read the data you're interested in (which also doesn't take very long or very much memory). That's why this feels like "One Weird Trick": you get the same amount of work done, but doing it in a different order makes the difference between using a few MB in under a second and crashing your computer with 64 GB.

I'm also putting in a fix for all methods that read multiple branches, such as tree.arrays. From now on, if uproot knows it will be reading multiple branches (eventually), it will recover them all up-front. When that fix is in, calling tree._recover() will be unnecessary, but harmless.
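The change in ordering can be illustrated with a toy model. This is only a sketch: ToyBranch, read_interleaved, and read_recover_first are hypothetical names, not uproot's classes or the code in the actual fix, and the model shows only the order of operations, not the memory behavior.

```python
class ToyBranch:
    """Hypothetical stand-in for a branch; not uproot's real class."""

    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.recovered = False

    def recover(self):
        # Idempotent: a second call does nothing, so an extra call is harmless.
        if not self.recovered:
            self.recovered = True
            self.log.append(("recover", self.name))

    def read(self):
        self.recover()  # lazy recovery on first read, if not done already
        self.log.append(("read", self.name))
        return self.name


def read_interleaved(branches):
    # Old order: each branch recovers itself just before it is read.
    return [b.read() for b in branches]


def read_recover_first(branches):
    # New order: recover every branch first, then read them all.
    for b in branches:
        b.recover()
    return [b.read() for b in branches]
```

Both functions perform the same recover and read steps for each branch; only the sequence differs, which is the point made above about doing the same work in a different order.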

@jpivarski
Member

See PR #320.

@kandrosov
Author

Thank you for the fast fix and the detailed explanation! I confirm that after calling tree._recover() the readout is very fast and takes only a few MB of memory.
FYI: during the creation of this file, directory->WriteTObject(tree, tree->GetName(), "Overwrite"); was called at the end, but file->Close(); was not (because I thought that it would be done by the destructor when calling delete file;). I'll try to do both in the next production round.

jpivarski added a commit that referenced this issue Aug 21, 2019
Fix memory issue (#319) by recovering all interesting branches before reading any.