This repository has been archived by the owner on Jun 21, 2022. It is now read-only.

Duplicate scalar columns (or custom index) in Pandas DF with flatten=True #179

Closed
beojan opened this issue Oct 31, 2018 · 8 comments

Comments

@beojan

beojan commented Oct 31, 2018

In my case, my tree contains runNumber and eventNumber columns that I would like to use as an index, but these columns are NaN for subentry != 0.

@jpivarski
Member

My thinking on this was that anyone could use Pandas's fillna in the forward direction. Alternatively, I could call that function just before returning the DataFrame, but leaving the NaNs in place gives the user more information.

@beojan
Author

beojan commented Oct 31, 2018

That would cause issues if you have multiple jagged-array columns with different lengths. I was suggesting duplicating only the scalar columns.
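
To illustrate the concern, here is a minimal sketch of a blanket forward fill on a flattened frame. The `(entry, subentry)` index layout and the `jet_pt` and `mu_pt` column names are assumptions made up for this example, not taken from a real tree.

```python
import numpy as np
import pandas as pd

# Hypothetical flattened frame: two entries, indexed by (entry, subentry).
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["entry", "subentry"]
)
df = pd.DataFrame(
    {
        "runNumber": [1.0, np.nan, np.nan, 2.0, np.nan],  # scalar branch
        "jet_pt": [10.0, 20.0, 30.0, 40.0, 50.0],  # jagged, lengths 3 and 2
        "mu_pt": [5.0, np.nan, np.nan, 6.0, 7.0],  # jagged, lengths 1 and 2
    },
    index=index,
)

# A forward fill over the whole frame also fills the shorter jagged column,
# so entry 0 now looks as if it had three muons instead of one.
filled = df.ffill()
print(filled)
```

After the fill, the padding rows of `mu_pt` can no longer be told apart from real values, which is why only the scalar columns should be duplicated.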

@jpivarski
Member

That is doable. I'll use fillna per column because in the arrays function, I know which columns are scalar. It does lose information, but that information is available in the original TTree object as the branch.interpretation (asdtype vs asjagged).
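
A rough sketch of the per-column idea, using plain Pandas on a hand-built frame. This is not the uproot implementation: the `scalar_columns` list is written by hand here as a stand-in for what the arrays function would know from the branch interpretations, and `jet_pt` and `mu_pt` are hypothetical jagged columns.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["entry", "subentry"]
)
df = pd.DataFrame(
    {
        "runNumber": [1.0, np.nan, np.nan, 2.0, np.nan],
        "eventNumber": [100.0, np.nan, np.nan, 101.0, np.nan],
        "jet_pt": [10.0, 20.0, 30.0, 40.0, 50.0],
        "mu_pt": [5.0, np.nan, np.nan, 6.0, 7.0],
    },
    index=index,
)

# Assumption: these are the columns known to be scalar (one value per entry).
scalar_columns = ["runNumber", "eventNumber"]

# Forward-fill only the scalar columns; jagged columns keep their NaN padding.
df[scalar_columns] = df[scalar_columns].ffill()
print(df)
```

This relies on every entry having a row at subentry 0 that holds the scalar value, so the forward fill never carries a value across an entry boundary.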

@jpivarski
Member

In uproot 3.2.9, scalar columns get duplicated down, but jagged columns of different lengths do not.

@beojan
Author

beojan commented Nov 2, 2018

Turns out there's a problem. The integer columns have turned into floats.

@jpivarski
Member

That's something that Pandas does when it consolidates Numpy arrays internally. I don't know how to control it: I add columns to the DataFrame and it sometimes converts them. Do you know the mechanism behind that? It seems like something they really want to be hidden/transparent.

@beojan
Author

beojan commented Nov 2, 2018

It's probably because you used NaN which is only available with floats.
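
A small sketch of that effect in plain Pandas, assuming an `(entry, subentry)` index like the one in the flattened DataFrame. The values are made up.

```python
import pandas as pd

# Scalar column: one integer per entry, stored at subentry 0.
scalar_index = pd.MultiIndex.from_tuples(
    [(0, 0), (1, 0)], names=["entry", "subentry"]
)
run_number = pd.Series([1, 2], index=scalar_index, dtype="int64")

# Larger index that also contains a nonzero subentry.
full_index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0)], names=["entry", "subentry"]
)

# The new row has no value, so Pandas inserts NaN and upcasts to float64.
expanded = run_number.reindex(full_index)
print(run_number.dtype, expanded.dtype)
```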

@jpivarski
Member

That makes sense. However, I didn't put NaN in myself: that's what Pandas does when you merge a dataset into one with a larger index, namely the one with nonzero subentries. That's intrinsic to the process. I suppose I could afterward determine if any fillna'ed scalar columns used to be integers and change them back...
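
One possible shape for that fix, as a sketch only. The `original_dtypes` mapping is a hypothetical record of what each scalar column was before the merge; nothing here reflects how uproot actually tracks it.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1)], names=["entry", "subentry"]
)
# State after the merge: the integer column has been upcast to float64.
df = pd.DataFrame(
    {
        "runNumber": [1.0, np.nan, np.nan, 2.0, np.nan],
        "jet_pt": [10.0, 20.0, 30.0, 40.0, 50.0],
    },
    index=index,
)

# Hypothetical bookkeeping: dtypes of the scalar columns before the merge.
original_dtypes = {"runNumber": np.dtype("int64")}

for name, dtype in original_dtypes.items():
    # Fill first so no NaN remains, then cast back to the recorded dtype.
    df[name] = df[name].ffill().astype(dtype)

print(df.dtypes)
```

Two caveats with this approach: `astype` raises if any NaN survives the fill, and the round trip through float64 is exact only for integers up to 2**53 in magnitude, so very large event numbers could be altered.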
