-
Notifications
You must be signed in to change notification settings - Fork 89
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
awkward to rdataframe #1374
awkward to rdataframe #1374
Conversation
Codecov Report
|
@jpivarski - I think, this is pretty much done. I'm still not quite sure how to select the range for the different length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I'm trying this out (using the .py files in the ROOT RDF tutorials as a guide), and it hasn't been designed in such a way that RDataFrame users can use it as they would any other RDataFrame.
For context, this is basic usage:
>>> import ROOT
>>> ROOT.RDF.MakeTrivialDataFrame(50).AsNumpy(["col0"])
{'col0': ndarray([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49], dtype=uint64)}
where MakeTrivialDataFrame
is essentially np.arange
(with dtype=np.uint64
) and AsNumpy
lets you select a subset of columns and convert them into NumPy arrays. If the columns have primitive types, these are regular NumPy arrays, otherwise they'll be dtype=object
arrays with PyROOT objects.
Here's another one that takes a NumPy array as input:
>>> import numpy as np
>>> rdf = ROOT.RDF.MakeNumpyDataFrame({"x": np.arange(10) * 1.1})
>>> rdf.AsNumpy(["x"])
{'x': ndarray([0. , 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])}
Here's an example of defining a new column from an old column:
>>> rdf.Define("y", "10 * x").AsNumpy(["y"])
{'y': ndarray([ 0., 11., 22., 33., 44., 55., 66., 77., 88., 99.])}
At any point, we can look at the head of the RDataFrame, for all defined columns. That's a good diagnostic.
>>> rdf.Display().Print()
x |
0.0000000 |
1.1000000 |
2.2000000 |
3.3000000 |
4.4000000 |
>>> rdf.Define("y", "10 * x").Display().Print()
y | x |
0.0000000 | 0.0000000 |
11.000000 | 1.1000000 |
22.000000 | 2.2000000 |
33.000000 | 3.3000000 |
44.000000 | 4.4000000 |
The first Awkward Array that I tried had primitive type, and it didn't work. I know, this only becomes useful when the types are non-primitive, but I'm starting with the basics.
>>> import awkward as ak
>>> array = ak._v2.Array([1.1, 2.2, 3.3, 4.4, 5.5])
>>> rdf = ak._v2.to_rdataframe({"some_array": array})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/operations/convert/ak_to_rdataframe.py", line 24, in to_rdataframe
return _impl(
File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/operations/convert/ak_to_rdataframe.py", line 36, in _impl
rdf = ak._v2._connect.rdataframe.to_rdataframe.to_rdataframe(
File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_connect/rdataframe/to_rdataframe.py", line 86, in to_rdataframe
type(rdf_array_view_entries[key]).__cpp_name__,
AttributeError: type object 'float' has no attribute '__cpp_name__'
Okay, it's easy to see how that slipped by if all of the arrays used in tests happen to be record arrays. Line 86, type(rdf_array_view_entries[key]).__cpp_name__
presumes that rdf_array_view_entries[key]
is a PyROOT object, but PyROOT only gives you one of its objects (lines 80‒82) if it can't be resolved to a Python builtin. Actually, relying on that is dangerous because many C++ types convert to a single Python type (float
and double
going to a Python float or all the int*_t
going to a Python int). Fortunately, you don't have to create an entry (which might not work in a dataset with zero entries, anyway) to find out what its type is going to be; you can use the generator's entry_type
method:
While we're at it, you also don't need to create a get_entry_{generated_type}_{key}_{flatlist_as_rvec}
function just to use it in the AwkwardArrayColumnReader_{generated_type}_{key}_{flatlist_as_rvec}
class. You can create the entry directly in GetColumnReadersImpl
.
(I also don't understand the need for either of the two levels of wrapping: ArrayWrapper
and AwkwardArrayColumnReader_{generated_type}_{key}_{flatlist_as_rvec}
. All you really need is the one {rdf_array_data_source_class_name}
class, which supplies a batch of entries from GetColumnReadersImpl
as a vector of void*
pointers that the RDataSource superclass will cast based on the return value of GetTypeName
.)
So I made another simple example that has record structure so that it will pass the __cpp_name__
check.
>>> array = ak._v2.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}, {"x": 4.4}, {"x": 5.5}])
>>> rdf = ak._v2.to_rdataframe({"some_array": array})
So far, so good.
>>> list(rdf.GetColumnNames())
['awkward:some_array']
Okay—if the column name has a colon in it, we can't use it as an argument of any functions, for instance
>>> rdf.Define("y", "some_array.x()")
input_line_122:2:28: error: use of undeclared identifier 'some_array'
auto lambda1 = [](){return some_array.x()
doesn't work because some_array
is not a column name, and
>>> rdf.Define("y", "awkward:some_array.x()")
input_line_130:2:35: error: unexpected ':' in nested name specifier; did you mean '::'?
auto lambda1 = [](){return awkward:some_array.x()
^
::
input_line_130:2:36: error: no member named 'some_array' in namespace 'awkward'
RDataFrame generates code from the string passed to its functors (Define
, Filter
, etc.) that uses identifiers to look up columns and know what to put in the arguments of the functions.
The current tests build functions manually, and presumably they work because arguments are passed in the right order, but that's not how users would use RDataFrame—they'd generate code from strings like the examples in the tutorials directory.
I'm going to see if I can write up a minimally working example.
I managed to get a fully working example by modifying the "WonkyDS" from #588 (comment). It's emulating a two-column Awkward RDataSource, in which one column contains a primitive type and the other contains records. import json
import ROOT
import awkward as ak
import awkward._v2._connect.cling
example1 = ak._v2.Array([1.1, 2.2, 3.3, 4.4, 5.5])
example2 = ak._v2.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}, {"x": 4.4}, {"x": 5.5}])
assert len(example1) == len(example2) It's essential that they have the same lengths, something that First, I'll generate the ArrayViews and RecordViews and get the data types for them as strings. RDataFrame is not involved yet. generator1 = ak._v2._connect.cling.togenerator(example1.layout.form)
lookup1 = ak._v2._lookup.Lookup(example1.layout)
generator1.generate(ROOT.gInterpreter.Declare, flatlist_as_rvec=True)
generator2 = ak._v2._connect.cling.togenerator(example2.layout.form)
lookup2 = ak._v2._lookup.Lookup(example2.layout)
generator2.generate(ROOT.gInterpreter.Declare, flatlist_as_rvec=True)
dataset_type_one = generator1.class_type((True,))
entry_type_one = generator1.entry_type(flatlist_as_rvec=True)
dataset_type_two = generator2.class_type((True,))
entry_type_two = generator2.entry_type(flatlist_as_rvec=True) The following code is going to be generated in the >>> dataset_type_one, dataset_type_two
('NumpyArray_float64_S3W7Jysru0Y', 'RecordArray_TdrfhV0YO8') and here's what the entry type names look like: >>> entry_type_one, entry_type_two
('double', 'Record_SlxdvzA8mjY') When providing RDataFrame with these type names (i.e. the string that (Also, note that the hashes aren't the same in different Python processes.) The following is almost entirely hard-coded, but it should be clear how to generate the parts that depend on "one" and "two". ROOT.gInterpreter.Declare(f"""
namespace awkward {{
class RWonkyDS final : public ROOT::RDF::RDataSource {{
private:
unsigned int fNSlots = 0U;
std::vector<std::pair<ULong64_t, ULong64_t>> fEntryRanges;
ULong64_t fSize = 0ULL;
ULong64_t fPtrs_one = 0;
ULong64_t fPtrs_two = 0;
std::vector<std::string> fColNames{{"one", "two"}};
std::vector<{entry_type_one}> slots_one;
std::vector<{entry_type_one}*> addrs_one;
std::vector<{entry_type_two}> slots_two;
std::vector<{entry_type_two}*> addrs_two;
std::vector<void *> GetColumnReadersImpl(std::string_view name, const std::type_info &);
protected:
std::string AsString() {{ return "trivial data source"; }};
public:
RWonkyDS(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two);
RWonkyDS();
~RWonkyDS();
const std::vector<std::string> &GetColumnNames() const;
bool HasColumn(std::string_view colName) const;
std::string GetTypeName(std::string_view) const;
std::vector<std::pair<ULong64_t, ULong64_t>> GetEntryRanges();
bool SetEntry(unsigned int slot, ULong64_t entry);
void SetNSlots(unsigned int nSlots);
void Initialise();
std::string GetLabel();
}};
// Make a RDF wrapping a RWonkyDS with the specified amount of entries
ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS> MakeWonkyDataFrame(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two);
// Make a RDF wrapping a broken RWonkyDS... because we need a zero-argument constructor?
ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS> MakeWonkyDataFrame();
std::vector<void *> RWonkyDS::GetColumnReadersImpl(std::string_view colName, const std::type_info &ti) {{
std::vector<void *> ret;
if (colName == "one") {{
for (auto i : ROOT::TSeqU(fNSlots)) {{
addrs_one[i] = &slots_one[i];
ret.emplace_back((void *)(&addrs_one[i]));
}}
}}
else if (colName == "two") {{
for (auto i : ROOT::TSeqU(fNSlots)) {{
addrs_two[i] = &slots_two[i];
ret.emplace_back((void *)(&addrs_two[i]));
}}
}}
else {{
for (auto i : ROOT::TSeqU(fNSlots)) {{
ret.emplace_back(nullptr);
}}
}}
return ret;
}}
RWonkyDS::RWonkyDS(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two) : fSize(size), fPtrs_one(ptrs_one), fPtrs_two(ptrs_two) {{
}}
RWonkyDS::RWonkyDS() : fSize(0), fPtrs_one(0), fPtrs_two(0) {{
}}
RWonkyDS::~RWonkyDS() {{
}}
const std::vector<std::string> &RWonkyDS::GetColumnNames() const {{
return fColNames;
}}
bool RWonkyDS::HasColumn(std::string_view colName) const {{
for (auto name : fColNames) {{
if (colName == name) {{
return true;
}}
}}
return false;
}}
std::string RWonkyDS::GetTypeName(std::string_view colName) const {{
if (colName == "one") {{
return {json.dumps(entry_type_one)};
}}
else if (colName == "two") {{
return {json.dumps("awkward::" + entry_type_two)};
}}
else {{
return "no such column"; // should break whatever tries to use it as a type
}}
}}
std::vector<std::pair<ULong64_t, ULong64_t>> RWonkyDS::GetEntryRanges() {{
// empty fEntryRanges so we'll return an empty vector on subsequent calls
auto ranges = std::move(fEntryRanges);
return ranges;
}}
bool RWonkyDS::SetEntry(unsigned int slot, ULong64_t entry) {{
slots_one[slot] = {dataset_type_one}(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_one))[entry];
slots_two[slot] = {dataset_type_two}(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_two))[entry];
return true;
}}
void RWonkyDS::SetNSlots(unsigned int nSlots) {{
R__ASSERT(0U == fNSlots && "Setting the number of slots even if the number of slots is different from zero.");
fNSlots = nSlots;
slots_one.resize(fNSlots);
addrs_one.resize(fNSlots);
slots_two.resize(fNSlots);
addrs_two.resize(fNSlots);
}}
void RWonkyDS::Initialise() {{
// initialize fEntryRanges
const auto chunkSize = fSize / fNSlots;
auto start = 0UL;
auto end = 0UL;
for (auto i : ROOT::TSeqUL(fNSlots)) {{
start = end;
end += chunkSize;
fEntryRanges.emplace_back(start, end);
(void)i;
}}
// TODO: redistribute reminder to all slots
fEntryRanges.back().second += fSize % fNSlots;
}}
std::string RWonkyDS::GetLabel() {{
return "WonkyDS";
}}
ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS> MakeWonkyDataFrame(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two) {{
auto lm = std::make_unique<ROOT::RDF::RDFDetail::RLoopManager>(std::make_unique<RWonkyDS>(size, ptrs_one, ptrs_two), ROOT::RDF::RDFInternal::ColumnNames_t{{}});
return ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS>(std::move(lm));
}}
}} // namespace awkward
""")
rdf = ROOT.awkward.MakeWonkyDataFrame(
len(example1), lookup1.arrayptrs.ctypes.data, lookup2.arrayptrs.ctypes.data
) Now we can use this >>> rdf.Display().Print()
one | two |
1.1000000 | @0x564bafb68100 |
2.2000000 | @0x564bafb68100 |
3.3000000 | @0x564bafb68100 |
4.4000000 | @0x564bafb68100 |
5.5000000 | @0x564bafb68100 | The "one" column has readable numbers. It seems that RDataFrame always prints objects as pointers. It's the same pointer in each entry because RDataFrame gets an entry by overwriting memory. The above implementation should work with multiple slots (parallel processing), but I haven't tested it. With two ROOT threads, the entries should get divided among the threads with one pointer for half the entries and another pointer for the other half of the entries, and the two threads only modify their own data in place. Now let's try calculating some new variables on these columns. Column "one" really ought to work because we can see that the numbers are properly represented in the display above, but let's do it anyway: >>> rdf.Define("newvar1", "one * 10").AsNumpy(["newvar1"])
{'newvar1': ndarray([11., 22., 33., 44., 55.])} Okay, good. Now for column "two", we can't pass the record-valued data into NumPy (unless it makes PyROOT objects? I'll try that next), but we can get numbers from it through its >>> rdf.Define("newvar2", "two.x()").AsNumpy(["newvar2"])
{'newvar2': ndarray([1.1, 2.2, 3.3, 4.4, 5.5])} Let's be adventurous and try actually sending the record objects to NumPy. In principle, it should know how to make PyROOT objects for our ArrayViews and RecordViews because they were compiled by Cling. >>> will_it_crash = rdf.AsNumpy(["two"])
>>> will_it_crash
{'two': ndarray([<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac690>,
<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6a8>,
<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6c0>,
<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6d8>,
<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6f0>],
dtype=object)}
>>> will_it_crash["two"][2]
<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6c0>
>>> will_it_crash["two"][2].x()
3.3 It works! However, we'll probably need some significant disclaimers on these objects because they're based on borrowed references. If the lookups ever go out of scope, In fact, that reminds me, the outgoing RDataFrame that rdf._lookups = (lookup1, lookup2) As long as this Anyway, I think the above example should help. As you can see, there was no need to define any wrappers, only the single RWonkyDS class and its "Make" function. There are no templates anywhere. The class and function names need to be generated with hashes so that a different set of column names and types don't try to reuse this one. The hash should be based on a tuple of name, generator pairs: >>> hash((("one", generator1), ("two", generator2)))
6386967581103326234 which is unchanged when we recreate them with the same Forms (but not different Forms, and not in a different Python process): >>> generator1 = ak._v2._connect.cling.togenerator(example1.layout.form)
>>> generator2 = ak._v2._connect.cling.togenerator(example2.layout.form)
>>> hash((("one", generator1), ("two", generator2)))
6386967581103326234 Be sure to use the helper functions to generate the code, so that the code-generator remains readable: >>> generator1.entry(length="fSize", ptrs="fPtrs_one", entry="entry", flatlist_as_rvec=True)
'awkward::NumpyArray_float64_uXYDMncQqaw(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_one))[entry]'
>>> generator2.entry(length="fSize", ptrs="fPtrs_two", entry="entry", flatlist_as_rvec=True)
'awkward::RecordArray_nd3xGJcxmAg(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_two))[entry]' Preparing generated code parts before the big f-string and inserting them with single-curly brackets will help to keep the code-generator readable. Inline expressions when they're simple enough is better than following names. A good process would be to reproduce the example described in this comment, with all its hard-codedness, and slowly replace the hard-coded parts with auto-generated parts, testing between every invocation. Segfaults can usually be debugged by inserting Let me know if there's anything else I can help with. |
b6ce7c8
to
1f7fb24
Compare
a289334
to
3e1467d
Compare
@jpivarski - please, have a look when you have time. The |
…od argument (to ensure consistency).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is much better!
There are still some rough edges, so this will need more testing. I saw some cases in which nonsense values came out of an integer field and repeated calculations had history dependence, but couldn't reproduce them. Many of the tests here are print-outs, rather than automated checks to ensure that the output is right. We should at least test the full suite of structures, the way that the test for #1300 does, using RDataFrame's AsNumpy
to get data back out.
I completed the flatlist_as_rvec
refactoring (which someday should become "options", maybe a namedtuple to keep it hashable). I also made the inputs a bit more robust (use ak._v2.to_layout
rather than just assuming ak._v2.Arrays
). The default (for ROOT) should be flatlist_as_rvec=True
.
The testing/hardening can be in a new PR, in which case, you can merge this PR now.
@jpivarski - thanks! I have automated the test checks and will add the changes to this PR. |
The user side looks as follows:
The array passed as an "x" column is:
An
AwkwardArrayColumnReader
iterates over 3 ranges to applyForeach
functor: