MultiIndex pandas dataframe from uproot.iterate #263

afrankenthal · 2019-03-27T07:56:57Z

Hi! First of all thank you very much for this awesome package!

I have a question regarding MultiIndex pandas dataframes and uproot.iterate. When I open a ROOT file via uproot.open, I am able to select branches which contain JaggedArrays with the same dimensionality, and make them into a pandas dataframe with MultiIndex. For example:

import uproot
mytree = uproot.open(myfile)["mytree"]
mytree.pandas.df(['muonPt', 'muonEta', 'muonPhi'])

Depending on the event, I can have (say) 0, 1, or 2 muons, so I can have accordingly 0, 1, or 2 subentries, and the resulting pandas dataframe reflects that.
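To make the (entry, subentry) layout concrete, here is a minimal sketch built with plain pandas rather than uproot. The muon pT values are invented for illustration: event 0 has two muons, event 1 has none, and event 2 has one.

```python
import pandas as pd

# Hypothetical per-event muon pT values (not taken from any real file):
# event 0 has two muons, event 1 has none, event 2 has one.
muon_pt = [[41.2, 17.5], [], [33.0]]

# One (entry, subentry) pair per muon; events with no muons contribute no rows.
index = pd.MultiIndex.from_tuples(
    [(entry, sub) for entry, muons in enumerate(muon_pt) for sub in range(len(muons))],
    names=["entry", "subentry"],
)

# Flatten the values in the same order as the index pairs.
df = pd.DataFrame({"muonPt": [pt for muons in muon_pt for pt in muons]}, index=index)
print(df)
```

Event 1 does not appear in the result at all, which is why the flattened row count generally differs from the number of entries.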

But I would like to process several files with the same structure using uproot.iterate. I haven't found a way to make the pandas dataframe with MultiIndex by selecting the right branches from the iterate, e.g.:

listofdataframes = []
for arrays in uproot.iterate(listoffiles, "mytree", ['muonPt', 'muonEta', 'muonPhi'], outputtype=pd.DataFrame, executor=executor):
    listofdataframes.append(arrays)
pd.concat(listofdataframes)

Without "flatten=True" in the iterate command above, the dataframes come out containing JaggedArrays, and I'm not sure how to turn those into a MultiIndex structure. If I do include "flatten=True", however, I get an error about incompatible dimensionalities:

ValueError: Shape of passed values is (1, 1382), indices imply (1, 1466)

(I think this is because of the variable number of muons per entry). Is there a way to get the same behavior from uproot.iterate on many files, as I would from tree.pandas.df() on a single file?
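For anyone who ends up with per-event sequences inside DataFrame cells, one possible manual conversion to an (entry, subentry) MultiIndex is sketched below. It uses only pandas and NumPy and is not how uproot does it internally. The helper name `flatten_jagged`, the sample values, and the file labels `a.root` and `b.root` are all hypothetical, and the cells are assumed to hold numeric sequences of equal length across columns within each event.

```python
import numpy as np
import pandas as pd


def flatten_jagged(df):
    """Expand cells holding per-event sequences into (entry, subentry) rows.

    Hypothetical helper for illustration; not part of uproot.
    """
    names = list(df.columns)
    # Number of objects (e.g. muons) per event, taken from the first column.
    counts = np.array([len(cell) for cell in df[names[0]]], dtype=np.int64)
    for name in names[1:]:
        other = np.array([len(cell) for cell in df[name]], dtype=np.int64)
        if not np.array_equal(counts, other):
            raise ValueError(f"column {name!r} has different per-event counts")

    # Repeat each entry label once per object; empty events produce no rows.
    entries = np.repeat(df.index.to_numpy(), counts)
    # Subentry runs 0..n-1 within each event (the leading empty array keeps
    # np.concatenate valid even when the frame has no rows).
    empty_int = np.empty(0, dtype=np.int64)
    subentries = np.concatenate(
        [empty_int] + [np.arange(n, dtype=np.int64) for n in counts]
    )
    index = pd.MultiIndex.from_arrays(
        [entries, subentries], names=["entry", "subentry"]
    )

    empty = np.empty(0, dtype=float)
    data = {
        name: np.concatenate(
            [empty] + [np.asarray(cell, dtype=float) for cell in df[name]]
        )
        for name in names
    }
    return pd.DataFrame(data, index=index)


# Two invented chunks standing in for DataFrames collected from two files.
chunk_a = pd.DataFrame(
    {"muonPt": [[41.2, 17.5], [], [33.0]], "muonEta": [[0.1, -1.2], [], [2.0]]}
)
chunk_b = pd.DataFrame({"muonPt": [[25.0]], "muonEta": [[-0.4]]})

flat_a = flatten_jagged(chunk_a)
flat_b = flatten_jagged(chunk_b)

# An extra "file" level keeps entry numbers from different chunks distinct.
combined = pd.concat(
    [flat_a, flat_b],
    keys=["a.root", "b.root"],
    names=["file", "entry", "subentry"],
)
print(combined)
```

The extra `file` level is a design choice for this sketch: each chunk here numbers its entries from zero, so without it the concatenated index would contain repeated (entry, subentry) pairs.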

Thank you!
Andre

jpivarski · 2019-03-27T10:21:29Z

This is a bug—you're using iterate the way it's supposed to be used. In fact, tree.pandas.df is just an alias to tree.arrays with some different options, and tree.iterate shares code paths with tree.arrays. Something minor must be mismatched—I'll look into it.

jpivarski · 2019-03-27T14:21:59Z

I fixed this in PR #264, where I found and fixed more issues than the one you found.

I was wrong when I thought it might be a minor mismatch: so many (good) updates have gone into DataFrame handling, tested in tree.arrays, that the DataFrame handling in tree.iterate was out of date. Then uproot.iterate (which works by simply calling tree.iterate on each tree) also had some out-of-date assumptions.

See that PR for updates. This will be a new version of uproot when it's done.
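The relationship described above, a multi-file iterator that simply delegates to a per-tree iterator, can be illustrated with a toy sketch. `ToyTree` and `iterate_files` are hypothetical stand-ins written for this example and are not uproot's classes or code. The point is only that whatever the per-tree method yields is passed through unchanged, so stale handling in the per-tree path shows up in the multi-file path too.

```python
from typing import Dict, Iterator, List


class ToyTree:
    """Hypothetical stand-in for an opened tree; not an uproot class."""

    def __init__(self, columns: Dict[str, List[float]]):
        self.columns = columns

    def iterate(self, branches: List[str], entrysteps: int) -> Iterator[Dict[str, List[float]]]:
        # Yield the requested branches in chunks of `entrysteps` entries.
        n = len(self.columns[branches[0]])
        for start in range(0, n, entrysteps):
            yield {name: self.columns[name][start:start + entrysteps] for name in branches}


def iterate_files(trees: List[ToyTree], branches: List[str], entrysteps: int):
    # Delegate to each tree in turn; chunks are passed through unchanged.
    for tree in trees:
        yield from tree.iterate(branches, entrysteps)


trees = [ToyTree({"x": [1.0, 2.0, 3.0]}), ToyTree({"x": [4.0]})]
chunks = list(iterate_files(trees, ["x"], entrysteps=2))
print(chunks)
```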

jpivarski · 2019-03-27T17:04:18Z

The fix is in master, but Travis is having issues and it won't get pushed to PyPI until that gets resolved. If you need this fix, git clone it or use pip's install-from-git feature.

afrankenthal · 2019-03-27T19:45:24Z

Hello, thank you for the incredibly speedy response! I will try to set up a new uproot install using pip's git install feature (I'm currently using conda install which only pulls from binaries, I believe). Or else I'll just wait for the Travis issues to go away.

jpivarski · 2019-03-27T19:54:33Z

Sure. :) Based on a Google talk, I'm trying to encourage a "live at head" lifestyle, but that only works if changes to head are small (and therefore frequent).

I just checked into Travis again, and they're apparently having serious issues. Only a few jobs have started and those that need to install dependencies from conda time-out at 10 minutes. I guess it won't happen today.

The normal order is that Travis runs the continuous integration; if that succeeds, I tag a release, and Travis runs again, this time deploying to PyPI at the end of its tests. The new version on PyPI notifies the conda package maintainer, who presses the button to deploy to conda. We're stuck at step one.

afrankenthal · 2019-03-27T21:20:50Z

That makes a lot of sense! Actually, if this Google talk is available publicly, it would be awesome to watch it, if you can share the link here! :)

jpivarski · 2019-03-27T21:35:01Z

I thought it was at the last ROOT Workshop, but I can't find anything that looks like it. Even if I did manage to find slides, it actually wasn't what the speaker intended to talk about: he thought he was referencing a discipline we were familiar with, but it ended up being the most interesting thing in his talk. Apparently that phrase, "live at head," is the common way of describing it.

afrankenthal · 2019-03-28T05:59:07Z

Very interesting! Someone needs to make this phrase into a t-shirt...

jpivarski mentioned this issue Mar 27, 2019

Fix iteration over DataFrames and provide more interfaces #264

Merged

jpivarski closed this as completed Mar 27, 2019
