Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

awkward to rdataframe #1374

Merged
merged 62 commits into from
Apr 27, 2022
Merged

awkward to rdataframe #1374

merged 62 commits into from
Apr 27, 2022

Conversation

ianna
Copy link
Collaborator

@ianna ianna commented Mar 17, 2022

The user side looks as follows:

    array = ak._v2.Array(
        [
            [{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.0, 0.2]}],
            [],
            [{"x": 3, "y": [3.0, 0.3, 3.3]}],
        ]
    )
    ak_array_1 = array["x"]
    ak_array_2 = array["y"]

    # An 'awkward' namespace will be added to the column name
    data_frame = ak._v2.operations.convert.to_rdataframe(
        {"x": ak_array_1, "y": ak_array_2}
    )

    done = compiler(
        """
        template <typename Array>
        struct MyFunctor {
            void operator()(const Array& a) {
                cout << "#6. Call user function: " << a << endl;
            }
        };
        """
    )
    assert done is True

    f = ROOT.MyFunctor[data_frame.GetColumnType("awkward:x")]()
    data_frame.Foreach(f, ["awkward:x"])

The array passed as an "x" column is:

ak_array_1 = <Array [[1, 2], [], [3]] type='3 * var * int64'>

An AwkwardArrayColumnReader iterates over 3 ranges to apply Foreach functor:

CONSTRUCTED AwkwardArrayColumnReader_NumpyArray_int64_Nt5aOZLMgX4_x of a ROOT::VecOps::RVec<Long64_t> at 0x7fefaec0ef48
#5. SetEntry
#6. Call user function: { 1, 2 }
#5. SetEntry
#6. Call user function: {  }
#5. SetEntry
#6. Call user function: { 3 }
#4. GetEntryRanges
DESTRUCTED AwkwardArrayColumnReader_NumpyArray_int64_Nt5aOZLMgX4_x of a ROOT::VecOps::RVec<Long64_t> at 0x7fefaec0ef48

@ianna ianna marked this pull request as draft March 17, 2022 17:16
@ianna ianna changed the title awkward to rdaframe awkward to rdataframe Mar 17, 2022
@codecov
Copy link

codecov bot commented Mar 18, 2022

Codecov Report

Merging #1374 (57ede36) into main (edfce38) will increase coverage by 0.65%.
The diff coverage is 52.66%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cuda/__init__.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/jax/__init__.py 0.00% <0.00%> (ø)
...c/awkward/_v2/_connect/rdataframe/to_rdataframe.py 0.00% <0.00%> (ø)
src/awkward/_v2/contents/recordarray.py 82.26% <ø> (+0.76%) ⬆️
src/awkward/_v2/contents/unionarray.py 86.50% <ø> (ø)
src/awkward/_v2/operations/convert/ak_from_cupy.py 50.00% <0.00%> (+23.33%) ⬆️
src/awkward/_v2/operations/convert/ak_from_iter.py 93.75% <ø> (ø)
src/awkward/_v2/operations/convert/ak_from_jax.py 50.00% <0.00%> (-25.00%) ⬇️
...wkward/_v2/operations/convert/ak_to_arrow_table.py 100.00% <ø> (ø)
src/awkward/_v2/operations/convert/ak_to_cupy.py 33.33% <0.00%> (+23.95%) ⬆️
... and 77 more

@ianna
Copy link
Collaborator Author

ianna commented Apr 20, 2022

@jpivarski - I think, this is pretty much done. I'm still not quite sure how to select the range for the different length ak.Arrays. So far it's defined by the length of the first column's ak.Array:
https://github.com/scikit-hep/awkward-1.0/pull/1374/files#diff-2e47d1ea25ddbe4f77fe23a2e0cc5897d74b5389c659e527d7c6b07a8166588bR254-R256

@ianna ianna marked this pull request as ready for review April 20, 2022 15:30
@ianna ianna requested a review from jpivarski April 20, 2022 15:30
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm trying this out (using the .py files in the ROOT RDF tutorials as a guide), and it hasn't been designed in such a way that RDataFrame users can use it as they would any other RDataFrame.

For context, this is basic usage:

>>> import ROOT
>>> ROOT.RDF.MakeTrivialDataFrame(50).AsNumpy(["col0"])
{'col0': ndarray([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
         16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
         32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
         48, 49], dtype=uint64)}

where MakeTrivialDataFrame is essentially np.arange (with dtype=np.uint64) and AsNumpy lets you select a subset of columns and convert them into NumPy arrays. If the columns have primitive types, these are regular NumPy arrays, otherwise they'll be dtype=object arrays with PyROOT objects.

Here's another one that takes a NumPy array as input:

>>> import numpy as np
>>> rdf = ROOT.RDF.MakeNumpyDataFrame({"x": np.arange(10) * 1.1})
>>> rdf.AsNumpy(["x"])
{'x': ndarray([0. , 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8, 9.9])}

Here's an example of defining a new column from an old column:

>>> rdf.Define("y", "10 * x").AsNumpy(["y"])
{'y': ndarray([ 0., 11., 22., 33., 44., 55., 66., 77., 88., 99.])}

At any point, we can look at the head of the RDataFrame, for all defined columns. That's a good diagnostic.

>>> rdf.Display().Print()
x         | 
0.0000000 | 
1.1000000 | 
2.2000000 | 
3.3000000 | 
4.4000000 | 
>>> rdf.Define("y", "10 * x").Display().Print()
y         | x         | 
0.0000000 | 0.0000000 | 
11.000000 | 1.1000000 | 
22.000000 | 2.2000000 | 
33.000000 | 3.3000000 | 
44.000000 | 4.4000000 | 

The first Awkward Array that I tried had primitive type, and it didn't work. I know, this only becomes useful when the types are non-primitive, but I'm starting with the basics.

>>> import awkward as ak
>>> array = ak._v2.Array([1.1, 2.2, 3.3, 4.4, 5.5])
>>> rdf = ak._v2.to_rdataframe({"some_array": array})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/operations/convert/ak_to_rdataframe.py", line 24, in to_rdataframe
    return _impl(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/operations/convert/ak_to_rdataframe.py", line 36, in _impl
    rdf = ak._v2._connect.rdataframe.to_rdataframe.to_rdataframe(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/_v2/_connect/rdataframe/to_rdataframe.py", line 86, in to_rdataframe
    type(rdf_array_view_entries[key]).__cpp_name__,
AttributeError: type object 'float' has no attribute '__cpp_name__'

Okay, it's easy to see how that slipped by if all of the arrays used in tests happen to be record arrays. Line 86, type(rdf_array_view_entries[key]).__cpp_name__ presumes that rdf_array_view_entries[key] is a PyROOT object, but PyROOT only gives you one of its objects (lines 80‒82) if it can't be resolved to a Python builtin. Actually, relying on that is dangerous because many C++ types convert to a single Python type (float and double going to a Python float or all the int*_t going to a Python int). Fortunately, you don't have to create an entry (which might not work in a dataset with zero entries, anyway) to find out what its type is going to be; you can use the generator's entry_type method:

https://github.com/scikit-hep/awkward-1.0/blob/ec9eec9aa5fae80dac44c300c6a48e3766afd1cc/src/awkward/_v2/_connect/cling.py#L524-L526

While we're at it, you also don't need to create a get_entry_{generated_type}_{key}_{flatlist_as_rvec} function just to use it in the AwkwardArrayColumnReader_{generated_type}_{key}_{flatlist_as_rvec} class. You can create the entry directly in GetColumnReadersImpl.

(I also don't understand the need for either of the two levels of wrapping: ArrayWrapper and AwkwardArrayColumnReader_{generated_type}_{key}_{flatlist_as_rvec}. All you really need is the one {rdf_array_data_source_class_name} class, which supplies a batch of entries from GetColumnReadersImpl as a vector of void* pointers that the RDataSource superclass will cast based on the return value of GetTypeName.)

So I made another simple example that has record structure so that it will pass the __cpp_name__ check.

>>> array = ak._v2.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}, {"x": 4.4}, {"x": 5.5}])
>>> rdf = ak._v2.to_rdataframe({"some_array": array})

So far, so good.

>>> list(rdf.GetColumnNames())
['awkward:some_array']

Okay—if the column name has a colon in it, we can't use it as an argument of any functions, for instance

>>> rdf.Define("y", "some_array.x()")
input_line_122:2:28: error: use of undeclared identifier 'some_array'
auto lambda1 = [](){return some_array.x()

doesn't work because some_array is not a column name, and

>>> rdf.Define("y", "awkward:some_array.x()")
input_line_130:2:35: error: unexpected ':' in nested name specifier; did you mean '::'?
auto lambda1 = [](){return awkward:some_array.x()
                                  ^
                                  ::
input_line_130:2:36: error: no member named 'some_array' in namespace 'awkward'

RDataFrame generates code from the string passed to its functors (Define, Filter, etc.) that uses identifiers to look up columns and know what to put in the arguments of the functions.

The current tests build functions manually, and presumably they work because arguments are passed in the right order, but that's not how users would use RDataFrame—they'd generate code from strings like the examples in the tutorials directory.

I'm going to see if I can write up a minimally working example.

@jpivarski
Copy link
Member

I managed to get a fully working example by modifying the "WonkyDS" from #588 (comment). It's emulating a two-column Awkward RDataSource, in which one column contains a primitive type and the other contains records.

import json

import ROOT
import awkward as ak
import awkward._v2._connect.cling

example1 = ak._v2.Array([1.1, 2.2, 3.3, 4.4, 5.5])
example2 = ak._v2.Array([{"x": 1.1}, {"x": 2.2}, {"x": 3.3}, {"x": 4.4}, {"x": 5.5}])

assert len(example1) == len(example2)

It's essential that they have the same lengths, something that ak.to_rdataframe should check.

First, I'll generate the ArrayViews and RecordViews and get the data types for them as strings. RDataFrame is not involved yet.

generator1 = ak._v2._connect.cling.togenerator(example1.layout.form)
lookup1 = ak._v2._lookup.Lookup(example1.layout)
generator1.generate(ROOT.gInterpreter.Declare, flatlist_as_rvec=True)

generator2 = ak._v2._connect.cling.togenerator(example2.layout.form)
lookup2 = ak._v2._lookup.Lookup(example2.layout)
generator2.generate(ROOT.gInterpreter.Declare, flatlist_as_rvec=True)

dataset_type_one = generator1.class_type((True,))
entry_type_one = generator1.entry_type(flatlist_as_rvec=True)
dataset_type_two = generator2.class_type((True,))
entry_type_two = generator2.entry_type(flatlist_as_rvec=True)

The following code is going to be generated in the awkward namespace to ensure that it doesn't collide with anything else. The type strings above are unqualified, so they'll work in the namespace, but they won't work in the RDataFrame code, when it tries to cast the column values. To be concrete, here's what the dataset type names look like:

>>> dataset_type_one, dataset_type_two
('NumpyArray_float64_S3W7Jysru0Y', 'RecordArray_TdrfhV0YO8')

and here's what the entry type names look like:

>>> entry_type_one, entry_type_two
('double', 'Record_SlxdvzA8mjY')

When providing RDataFrame with these type names (i.e. the string that GetTypeName returns), the Record will need to be qualified with "awkward::" but the double must not. You can tell which to qualify by checking the Python type of the generators: generator1 is a NumpyArrayGenerator, so its entry type will always be a primitive (must not qualify) and generator2 is a RecordArrayGenerator, so its entry type will be something in the awkward namespace (must qualify). Since we're using flatlist_as_rvec=True everywhere (and it's important to do that consistently), a ListArrayGenerator whose content is a NumpyArrayGenerator also must not be qualified because it will be an RVec, rather than an ArrayView. There should be tests in which the top-level type is a list of primitive and in which the top-level type is a list of records.

(Also, note that the hashes aren't the same in different Python processes.)

The following is almost entirely hard-coded, but it should be clear how to generate the parts that depend on "one" and "two".

ROOT.gInterpreter.Declare(f"""
namespace awkward {{

class RWonkyDS final : public ROOT::RDF::RDataSource {{
private:
   unsigned int fNSlots = 0U;
   std::vector<std::pair<ULong64_t, ULong64_t>> fEntryRanges;

   ULong64_t fSize = 0ULL;
   ULong64_t fPtrs_one = 0;
   ULong64_t fPtrs_two = 0;

   std::vector<std::string> fColNames{{"one", "two"}};
   std::vector<{entry_type_one}>  slots_one;
   std::vector<{entry_type_one}*> addrs_one;
   std::vector<{entry_type_two}>  slots_two;
   std::vector<{entry_type_two}*> addrs_two;

   std::vector<void *> GetColumnReadersImpl(std::string_view name, const std::type_info &);

protected:
   std::string AsString() {{ return "trivial data source"; }};

public:
   RWonkyDS(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two);
   RWonkyDS();
   ~RWonkyDS();
   const std::vector<std::string> &GetColumnNames() const;
   bool HasColumn(std::string_view colName) const;
   std::string GetTypeName(std::string_view) const;
   std::vector<std::pair<ULong64_t, ULong64_t>> GetEntryRanges();
   bool SetEntry(unsigned int slot, ULong64_t entry);
   void SetNSlots(unsigned int nSlots);
   void Initialise();
   std::string GetLabel();
}};

// Make a RDF wrapping a RWonkyDS with the specified amount of entries
ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS> MakeWonkyDataFrame(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two);
// Make a RDF wrapping a broken RWonkyDS... because we need a zero-argument constructor?
ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS> MakeWonkyDataFrame();

std::vector<void *> RWonkyDS::GetColumnReadersImpl(std::string_view colName, const std::type_info &ti) {{
   std::vector<void *> ret;

   if (colName == "one") {{
      for (auto i : ROOT::TSeqU(fNSlots)) {{
         addrs_one[i] = &slots_one[i];
         ret.emplace_back((void *)(&addrs_one[i]));
      }}
   }}
   else if (colName == "two") {{
      for (auto i : ROOT::TSeqU(fNSlots)) {{
         addrs_two[i] = &slots_two[i];
         ret.emplace_back((void *)(&addrs_two[i]));
      }}
   }}
   else {{
      for (auto i : ROOT::TSeqU(fNSlots)) {{
         ret.emplace_back(nullptr);
      }}
   }}

   return ret;
}}

RWonkyDS::RWonkyDS(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two) : fSize(size), fPtrs_one(ptrs_one), fPtrs_two(ptrs_two) {{
}}

RWonkyDS::RWonkyDS() : fSize(0), fPtrs_one(0), fPtrs_two(0) {{
}}

RWonkyDS::~RWonkyDS() {{
}}

const std::vector<std::string> &RWonkyDS::GetColumnNames() const {{
   return fColNames;
}}

bool RWonkyDS::HasColumn(std::string_view colName) const {{
   for (auto name : fColNames) {{
      if (colName == name) {{
         return true;
      }}
   }}
   return false;
}}

std::string RWonkyDS::GetTypeName(std::string_view colName) const {{
   if (colName == "one") {{
      return {json.dumps(entry_type_one)};
   }}
   else if (colName == "two") {{
      return {json.dumps("awkward::" + entry_type_two)};
   }}
   else {{
      return "no such column";   // should break whatever tries to use it as a type
   }}
}}

std::vector<std::pair<ULong64_t, ULong64_t>> RWonkyDS::GetEntryRanges() {{
   // empty fEntryRanges so we'll return an empty vector on subsequent calls
   auto ranges = std::move(fEntryRanges);
   return ranges;
}}

bool RWonkyDS::SetEntry(unsigned int slot, ULong64_t entry) {{

   slots_one[slot] = {dataset_type_one}(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_one))[entry];
   slots_two[slot] = {dataset_type_two}(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_two))[entry];

   return true;
}}

void RWonkyDS::SetNSlots(unsigned int nSlots) {{
   R__ASSERT(0U == fNSlots && "Setting the number of slots even if the number of slots is different from zero.");
   fNSlots = nSlots;

   slots_one.resize(fNSlots);
   addrs_one.resize(fNSlots);
   slots_two.resize(fNSlots);
   addrs_two.resize(fNSlots);

}}

void RWonkyDS::Initialise() {{
   // initialize fEntryRanges
   const auto chunkSize = fSize / fNSlots;
   auto start = 0UL;
   auto end = 0UL;
   for (auto i : ROOT::TSeqUL(fNSlots)) {{
      start = end;
      end += chunkSize;
      fEntryRanges.emplace_back(start, end);
      (void)i;
   }}
   // TODO: redistribute reminder to all slots
   fEntryRanges.back().second += fSize % fNSlots;
}}

std::string RWonkyDS::GetLabel() {{
   return "WonkyDS";
}}

ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS> MakeWonkyDataFrame(ULong64_t size, ULong64_t ptrs_one, ULong64_t ptrs_two) {{
   auto lm = std::make_unique<ROOT::RDF::RDFDetail::RLoopManager>(std::make_unique<RWonkyDS>(size, ptrs_one, ptrs_two), ROOT::RDF::RDFInternal::ColumnNames_t{{}});
   return ROOT::RDF::RInterface<ROOT::RDF::RDFDetail::RLoopManager, RWonkyDS>(std::move(lm));
}}

}} // namespace awkward
""")

rdf = ROOT.awkward.MakeWonkyDataFrame(
    len(example1), lookup1.arrayptrs.ctypes.data, lookup2.arrayptrs.ctypes.data
)

Now we can use this rdf as an RDataFrame.

>>> rdf.Display().Print()
one       | two             | 
1.1000000 | @0x564bafb68100 | 
2.2000000 | @0x564bafb68100 | 
3.3000000 | @0x564bafb68100 | 
4.4000000 | @0x564bafb68100 | 
5.5000000 | @0x564bafb68100 | 

The "one" column has readable numbers. It seems that RDataFrame always prints objects as pointers. It's the same pointer in each entry because RDataFrame gets an entry by overwriting memory. The above implementation should work with multiple slots (parallel processing), but I haven't tested it. With two ROOT threads, the entries should get divided among the threads with one pointer for half the entries and another pointer for the other half of the entries, and the two threads only modify their own data in place.

Now let's try calculating some new variables on these columns. Column "one" really ought to work because we can see that the numbers are properly represented in the display above, but let's do it anyway:

>>> rdf.Define("newvar1", "one * 10").AsNumpy(["newvar1"])
{'newvar1': ndarray([11., 22., 33., 44., 55.])}

Okay, good. Now for column "two", we can't pass the record-valued data into NumPy (unless it makes PyROOT objects? I'll try that next), but we can get numbers from it through its x method:

>>> rdf.Define("newvar2", "two.x()").AsNumpy(["newvar2"])
{'newvar2': ndarray([1.1, 2.2, 3.3, 4.4, 5.5])}

Let's be adventurous and try actually sending the record objects to NumPy. In principle, it should know how to make PyROOT objects for our ArrayViews and RecordViews because they were compiled by Cling.

>>> will_it_crash = rdf.AsNumpy(["two"])
>>> will_it_crash
{'two': ndarray([<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac690>,
         <cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6a8>,
         <cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6c0>,
         <cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6d8>,
         <cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6f0>],
        dtype=object)}
>>> will_it_crash["two"][2]
<cppyy.gbl.awkward.Record_dlJfZCDGSwY object at 0x564bb4fac6c0>
>>> will_it_crash["two"][2].x()
3.3

It works! However, we'll probably need some significant disclaimers on these objects because they're based on borrowed references. If the lookups ever go out of scope, will_it_crash["two"][2].x() will definitely crash.

In fact, that reminds me, the outgoing RDataFrame that ak.to_rdataframe returns will have to maintain a reference to the lookups. Fortunately, this is Python and we can just stash them in the object:

rdf._lookups = (lookup1, lookup2)

As long as this rdf object (which ak.to_rdataframe returns) is alive, the ArrayViews and RecordViews it makes in internal calculations and PyROOT objects in dtype=object NumPy arrays will be valid. As soon as the rdf goes out of scope, the lookups will be deleted and there are no memory leaks. That covers all of the cases in which a user interacts with it through the rdf methods (which is the proper way, and the only way that runs at compiled speed), but it doesn't cover cases in which the user extracted a dtype=object NumPy array, then deleted the rdf, then interacts with the data in the NumPy array. If that ever comes up, we'll tell them not to do that.

Anyway, I think the above example should help. As you can see, there was no need to define any wrappers, only the single RWonkyDS class and its "Make" function. There are no templates anywhere. The class and function names need to be generated with hashes so that a different set of column names and types don't try to reuse this one. The hash should be based on a tuple of name, generator pairs:

>>> hash((("one", generator1), ("two", generator2)))
6386967581103326234

which is unchanged when we recreate them with the same Forms (but not different Forms, and not in a different Python process):

>>> generator1 = ak._v2._connect.cling.togenerator(example1.layout.form)
>>> generator2 = ak._v2._connect.cling.togenerator(example2.layout.form)
>>> hash((("one", generator1), ("two", generator2)))
6386967581103326234

Be sure to use the helper functions to generate the code, so that the code-generator remains readable:

>>> generator1.entry(length="fSize", ptrs="fPtrs_one", entry="entry", flatlist_as_rvec=True)
'awkward::NumpyArray_float64_uXYDMncQqaw(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_one))[entry]'
>>> generator2.entry(length="fSize", ptrs="fPtrs_two", entry="entry", flatlist_as_rvec=True)
'awkward::RecordArray_nd3xGJcxmAg(0, fSize, 0, reinterpret_cast<ssize_t*>(fPtrs_two))[entry]'

Preparing generated code parts before the big f-string and inserting them with single-curly brackets will help to keep the code-generator readable. Inline expressions when they're simple enough is better than following names.

A good process would be to reproduce the example described in this comment, with all its hard-codedness, and slowly replace the hard-coded parts with auto-generated parts, testing between every invocation. Segfaults can usually be debugged by inserting std::cout in the right places, including the code generators in cling.py, since the code generated by cling.py is used in the RDataFrame loop, and you can get closer to the actual error.

Let me know if there's anything else I can help with.

@ianna ianna force-pushed the ianna/awkward_to_rdataframe branch from b6ce7c8 to 1f7fb24 Compare April 22, 2022 09:26
@ianna ianna force-pushed the ianna/awkward_to_rdataframe branch from a289334 to 3e1467d Compare April 22, 2022 12:30
@ianna
Copy link
Collaborator Author

ianna commented Apr 25, 2022

@jpivarski - please, have a look when you have time. The flatlist_as_rvec is a property of the generators now, so there should not be possible to mix RVecs and vectors. I did not test this change with numba, but I did not see explicit use of the flag there. The default value of it is still the same.
I've removed the ArrayWrapper and ArrayReader classes. The column dependent code is generated in a Python loop. These extra fragments can be implemented as functions, though, I hope, it is more or less readable.
I've finished for today, please, feel free to modify the PR. Thanks.

@ianna ianna requested a review from jpivarski April 25, 2022 16:19
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better!

There are still some rough edges, so this will need more testing. I saw some cases in which nonsense values came out of an integer field and repeated calculations had history dependence, but couldn't reproduce them. Many of the tests here are print-outs, rather than automated checks to ensure that the output is right. We should at least test the full suite of structures, the way that the test for #1300 does, using RDataFrame's AsNumpy to get data back out.

I completed the flatlist_as_rvec refactoring (which someday should become "options", maybe a namedtuple to keep it hashable). I also made the inputs a bit more robust (use ak._v2.to_layout rather than just assuming ak._v2.Arrays). The default (for ROOT) should be flatlist_as_rvec=True.

The testing/hardening can be in a new PR, in which case, you can merge this PR now.

@ianna
Copy link
Collaborator Author

ianna commented Apr 26, 2022

@jpivarski - thanks! I have automated the test checks and will add the changes to this PR.

@ianna ianna merged commit 73c24fa into main Apr 27, 2022
@ianna ianna deleted the ianna/awkward_to_rdataframe branch April 27, 2022 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants