Replies: 5 comments
-
This is the conceptual shift that Awkward Array represents: you don't iterate over events, and you don't accumulate results by appending inside a loop. Both would be inefficient in Python, and NumPy-like idioms are generally more concise, too. For instance, if you have some data (say, data that came out of Uproot 4) like this:

>>> import awkward1 as ak
>>> events = ak.Array([{"x": 1.1, "energy": 10},
...                    {"x": 2.2, "energy": 15},
...                    {"x": 3.3, "energy": 20},
...                    {"x": 4.4, "energy": 12},
...                    {"x": 5.5, "energy": 17}])
...
>>> events
<Array [{x: 1.1, energy: 10, ... energy: 17}] type='5 * {"x": float64, "energy":...'>

then extracting just one field is a projection:

>>> events.x
<Array [1.1, 2.2, 3.3, 4.4, 5.5] type='5 * float64'>
>>> events.energy
<Array [10, 15, 20, 12, 17] type='5 * int64'>
>>> events["energy"] # if the name of the field is not a legal identifier
<Array [10, 15, 20, 12, 17] type='5 * int64'>

The performance considerations are all different in this framework: you should consider projections like this to be essentially "free" because they only rearrange metadata. For instance, if a projection takes ~30 μs on an array with 5 elements, it will take ~30 μs on an array with 5 billion elements, because it's not actually doing anything to the underlying data buffers. (The 30 μs is fixed overhead, such as parsing the field-name string.)

This is also true if the objects are structured, like nested lists.

>>> events = ak.Array([[{"x": 1.1, "energy": 10}, {"x": 2.2, "energy": 15}, {"x": 3.3, "energy": 20}],
...                    [],
...                    [{"x": 4.4, "energy": 12}, {"x": 5.5, "energy": 17}]])
...
>>> events.x
<Array [[1.1, 2.2, 3.3], [], [4.4, 5.5]] type='3 * var * float64'>
>>> events.energy
<Array [[10, 15, 20], [], [12, 17]] type='3 * var * int64'>

That said, it is possible to do this with for loops, and it can be efficient if the loop is compiled by Numba.

>>> import numba as nb
>>> @nb.jit
... def this_will_be_compiled(builder, events):
...     for event in events:
...         builder.begin_list()
...         for particle in event:
...             builder.append(particle.energy)
...         builder.end_list()
...     return builder
...
>>> this_will_be_compiled(ak.ArrayBuilder(), events).snapshot()
<Array [[10, 15, 20], [], [12, 17]] type='3 * var * int64'>

ArrayBuilder is described in the Awkward Array documentation. Even though the loop is compiled, ArrayBuilder can be a performance bottleneck because it has to discover the data type as it goes (e.g. how many levels of nested lists do you have?), but it is the most general way to make an Awkward Array from a for loop. Preallocating a NumPy array and filling it, if you can do that, is faster.

But I think you'll agree that if your task is as simple as projecting out
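
To make the "projection is free" point concrete, here is a toy sketch in plain Python. It is not how Awkward Array is implemented; `ToyRecordArray` and its methods are hypothetical names invented for illustration. The only idea it borrows is that a columnar array keeps one buffer per field, so picking a field is a lookup rather than a pass over the data.

```python
class ToyRecordArray:
    """Toy columnar record array: one Python list per field.

    Hypothetical illustration only, not Awkward Array's real layout.
    """

    def __init__(self, columns):
        self.columns = columns  # dict mapping field name -> list of values

    @classmethod
    def from_records(cls, records):
        # Building the columns from rowwise records touches every value once,
        # so this step scales with the number of records.
        fields = list(records[0]) if records else []
        return cls({name: [rec[name] for rec in records] for name in fields})

    def __getitem__(self, field):
        # Projection: a dictionary lookup that hands back the existing column.
        # No values are copied, so the cost does not grow with array length.
        return self.columns[field]


events = ToyRecordArray.from_records([
    {"x": 1.1, "energy": 10},
    {"x": 2.2, "energy": 15},
    {"x": 3.3, "energy": 20},
    {"x": 4.4, "energy": 12},
    {"x": 5.5, "energy": 17},
])

energy = events["energy"]
print(energy)                               # [10, 15, 20, 12, 17]
print(energy is events.columns["energy"])   # True: same object, nothing copied
```

The expensive part is `from_records`, which is the rowwise-to-columnar conversion; once the data are columnar, `events["energy"]` costs the same for 5 values or 5 billion.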
-
Hi Jim, thank you very much for your detailed answer. It was helpful. The reason I need a loop is that we have a big maybe
Cheers,
-
If you have to convert a rowwise source (all the data for one event is stored together and the data for the next event is elsewhere, as in JSON) into a columnar one (all data for a field, across all events, is stored together; ROOT TTree, Awkward Array, Apache Arrow, and Parquet are all examples), then you have to loop. This is what

So if your data source is like that, you'll have to use a for loop and either fill a preallocated NumPy array, if your data are flat or regular, or use ArrayBuilder.

Outside of Numba (because of limitations in Numba), you can write

with builder.list():
    builder.append(x)
    builder.append(y)
    builder.append(z)

instead of

builder.begin_list()
builder.append(x)
builder.append(y)
builder.append(z)
builder.end_list()

and that's nice. (Imbalanced begin/end can be disastrous: you'll likely run out of memory while building the wrong result.)
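
As a rough picture of what the rowwise-to-columnar loop has to do, here is a plain-Python sketch that flattens one field of variable-length events into a flat content list plus an offsets list. The function names `rowwise_to_columnar` and `get_event` are made up for this example, and the offsets-plus-content layout is a simplified assumption about how a jagged column can be stored, not a description of Awkward Array's internals.

```python
def rowwise_to_columnar(events, field):
    """Flatten one field of variable-length events into (offsets, content).

    Hypothetical helper: event i owns content[offsets[i]:offsets[i + 1]].
    """
    offsets = [0]
    content = []
    for event in events:              # the unavoidable loop over the rowwise source
        for particle in event:
            content.append(particle[field])
        offsets.append(len(content))  # record where this event's list ends
    return offsets, content


def get_event(offsets, content, i):
    """Recover the list of values belonging to event i."""
    return content[offsets[i]:offsets[i + 1]]


rowwise = [
    [{"x": 1.1, "energy": 10}, {"x": 2.2, "energy": 15}, {"x": 3.3, "energy": 20}],
    [],
    [{"x": 4.4, "energy": 12}, {"x": 5.5, "energy": 17}],
]

offsets, content = rowwise_to_columnar(rowwise, "energy")
print(offsets)                          # [0, 3, 3, 5]
print(content)                          # [10, 15, 20, 12, 17]
print(get_event(offsets, content, 2))   # [12, 17]
```

Appending to `offsets` exactly once per event plays the role of a balanced begin/end pair: every event, including the empty one, gets its own boundary.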
-
I think this is done. If not, let me know!
-
Dear Jim,
Sorry for the late response. I was trying to digest what you suggested. I think I've managed to do it with
thanks for the help,
-
Dear colleagues,
I apologize in advance if this question has already been asked or if the answer is hidden in the tutorials. I tried my best to find an answer but could not.
Imagine you have an event loop and are getting information one by one.
Instead of appending the information to Python lists, I would like to append it to jagged / NumPy arrays inside the loop.
Cheers,
Engin