Awkward to RDataFrame: to start a discussion #1295

Closed
wants to merge 31 commits into from


ianna
Collaborator

@ianna ianna commented Feb 17, 2022

@jpivarski and @miranov25 - it would be nice to move the discussion here. This PR is empty for the moment - I've just started on it, but it will be addressing the issue #588.

Any thoughts, ideas, and early requests are welcome!

@ianna ianna marked this pull request as draft February 17, 2022 17:21
@miranov25

@NJManganelli

I can do some alpha testing on this as well. Regarding whether it should be presented as std::vector, RVec, or a subclass of one of these, what are the current pros and cons of each?

@jpivarski
Member

On the user side, the pros and cons are determined by what you expect. If all list-like data in ROOT (dynamic arrays like Muon_pt[nMuon], as well as std::vector, TClonesArray, etc.) are presented in RDataFrame as rvec, then that's a strong argument for presenting Awkward lists (ListArray, ListOffsetArray, RegularArray) as rvec, too. If the type in RDataFrame depends on the type in ROOT I/O, then we're more free to say what our lists are, because this is a different source.

On the technical side, we want the list and record types to be views/proxies as much as possible. In the Numba implementation, we were able to take this all the way: the runtime objects representing lists and records in Numba JIT-compiled code are all nothing more than pointers into the original array data: the same kind of 48-byte ArrayView (40-bytes in version 2) object can be allocated on the stack, regardless of how many elements the list has or how many fields the record has. It works because we can inject any code in the equivalent of __getitem__ or __getattr__.

For C++, this means that we'd have to be able to override operator[] and have a lot of fieldname() const methods for records to yield the data upon request, rather than upon construction. I know that we'll be able to make record proxies in C++: we can define a new class for each RecordArray with a new set of (field name, field type) pairs and that class can be full of fieldname() const methods that generate the appropriate proxy for each field type. For the lists, we could have this much freedom if we created our own collection type (with const_iterator and all that) because then we could define operator[] however we like. If it has to be std::vector or rvec, or even a subclass of one of these, that might not be possible.

The reason that views/proxies are ideal is because user code might, for example, extract a list (by calling fieldname() on the record that contains it) for the sole purpose of getting its length, or for getting only the first item, or maybe just iterating through it until finding something, then breaking out of the loop. We don't want list construction to also have to construct all the elements that don't get looked at. We especially don't want the runtime list objects to live on the heap, to call malloc and free every time one gets created or destroyed. If the std::vector or rvec types require the data they contain to exist as allocated C++ objects when the std::vector or rvec is constructed with full size, then we get all of these undesirable performance constraints that we won't be able to relax later.

So from a technical point of view, a ranking from best to worst is:

  1. C++ runtime objects for Awkward lists and records are proxies that generate sub-proxies or numerical data when operator[] or fieldname() is called. All of these proxies would have the same size, a handful of bytes, that can be stack-allocated.
  2. C++ runtime objects for Awkward lists are concrete list objects that need to contain N real objects in memory to have size N. This comes with a runtime cost of constructing N objects when the C++ view of the Awkward list is created with size N, though those objects might themselves be small proxies. Only the current level in an Awkward node hierarchy has to be instantiated—not all the way down.
  3. No proxies at all; the whole structure has to be constructed before it appears as an entity in user code. This is tantamount to a full columnar → rowwise conversion before any computation potentially happens.

For anyone who is totally confused at this point, I probably should have mentioned that data in an Awkward Array is organized in a completely different way from std::vector<struct> in C++. Data with type like

image

but, say, length 1 billion, is represented by a single 5-node tree with 4 attached, 1-dimensional arrays whose lengths are O(1 billion). See https://arxiv.org/abs/2001.06307

image

The worst case, number 3 above, would be to turn an entire entry into C++ std::vector<struct {int x; std::vector<double> y;}> before any calculations are done.

The best case, number 1 above, would be to turn it into some ListProxy, and then when operator[] is called, produce the RecordProxy, and then when x() is called, fetch an int from the original Awkward Array. Or if y() is called, produce another ListProxy that has its own operator[] overload. If it's all fixed-size proxies, there's no reason to ever (implicitly) call malloc.
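
A rough Python sketch of the best case may make it more concrete. It is only an illustration under stated assumptions: the class names (RecordListProxy, RecordProxy, FloatListProxy) and the module-level buffers are hypothetical stand-ins, not anything in Awkward or ROOT, and in C++ each proxy would be a small fixed-size struct rather than a Python object.

```python
from dataclasses import dataclass

# Columnar buffers for the data
# [[{x: 1, y: [1.1]}, {x: 2, y: [2.0, 0.2]}], [], [{x: 3, y: [3.0, 0.3, 3.3]}]]
OUTER_OFFSETS = [0, 2, 2, 3]            # entry i covers records OUTER_OFFSETS[i]:OUTER_OFFSETS[i+1]
X = [1, 2, 3]                           # one x per record
Y_OFFSETS = [0, 1, 3, 6]                # record j covers Y[Y_OFFSETS[j]:Y_OFFSETS[j+1]]
Y = [1.1, 2.0, 0.2, 3.0, 0.3, 3.3]


@dataclass(frozen=True)
class FloatListProxy:
    start: int
    stop: int

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return Y[self.start + i]        # the value is fetched only now


@dataclass(frozen=True)
class RecordProxy:
    at: int

    def x(self):
        return X[self.at]

    def y(self):
        return FloatListProxy(Y_OFFSETS[self.at], Y_OFFSETS[self.at + 1])


@dataclass(frozen=True)
class RecordListProxy:
    start: int
    stop: int

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        return RecordProxy(self.start + i)


def entry(i):
    """Proxy for one entry; nothing below it is constructed yet."""
    return RecordListProxy(OUTER_OFFSETS[i], OUTER_OFFSETS[i + 1])
```

Asking for len(entry(0)) reads two offsets and nothing else; entry(0)[1].y()[1] creates three small proxies and reads a single float.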


But if the above is incompatible with user expectations because creating our own C++ collection type goes against the grain of expectation, then we'll have to think about priorities.

(If you're thinking that thinking about this is premature optimization, Awkward Arrays in Numba started out with a straightforward, non-proxy implementation, and the result was so bad that it had to be laboriously rewritten, PR #118 (87 commits, 32 kLoC). This needs to be an up-front design decision.)

@ianna
Collaborator Author

ianna commented Feb 18, 2022

Here is a NumPy RDataFrame example that uses this RDataSource implementation.
It takes a collection of RVecs that adopt the NumPy array data:

import ROOT
import numpy as np
x = np.array([1, 2, 3], dtype=np.int32)
y = np.array([4, 5, 6], dtype=np.float64)
# Read the data with RDataFrame
# The column names in the RDataFrame are defined by the keys of the dictionary.
# Please note that only fundamental types (int, float, ...) are supported and
# the arrays must have the same length.
df = ROOT.RDF.MakeNumpyDataFrame({'x': x, 'y': y})
df.Display().Print()
x | y         | 
1 | 4.0000000 | 
2 | 5.0000000 | 
3 | 6.0000000 | 
>>> x
array([1, 2, 3], dtype=int32)
>>> df.__data__['x'][0]
1
>>> df.__data__['x'][2]
3
>>> df.__data__['x'][5]
32650
>>> df.__data__['x']
<cppyy.gbl.ROOT.VecOps.RVec<int> object at 0x7f8a7b498420>
>>> x[0]
1
>>> x[0]=222
>>> df.__data__['x'][0]
222

@eguiraud

eguiraud commented Feb 18, 2022

Hi, thank you for working on this! Just a runaway comment that:

  • RVec is indeed the catch-all type with which all collections are represented in RDF (C-style arrays, std::vectors etc. are all represented as RVecs in RDF)
  • RVec can act as a view on an existing contiguous memory buffer, it's sufficient to construct the object as RVec<T>(pointer, size) and you get a view instead of a copy. The data will be copied if the RVec ever reallocates, e.g. because of a push_back, and performing selections of elements creates a copy of the original RVec with the elements selected. However a small vector optimization guarantees that small-enough RVecs use fast stack memory rather than slow heap allocations, so copying small-enough RVecs is cheap the same way that copying small-enough std::strings is cheap

I hope this helps you in making the decision.
Cheers,
Enrico

@codecov

codecov bot commented Feb 18, 2022

Codecov Report

Merging #1295 (2d5cff8) into main (b2fd2be) will decrease coverage by 1.36%.
The diff coverage is 50.70%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cling.py 0.00% <0.00%> (ø)
...c/awkward/_v2/_connect/rdataframe/to_rdataframe.py 0.00% <0.00%> (ø)
src/awkward/_v2/_lookup.py 97.50% <0.00%> (ø)
src/awkward/_v2/_prettyprint.py 66.09% <0.00%> (+2.29%) ⬆️
src/awkward/_v2/_typetracer.py 69.14% <0.00%> (ø)
src/awkward/_v2/forms/form.py 81.87% <0.00%> (-8.20%) ⬇️
src/awkward/_v2/identifier.py 55.69% <0.00%> (ø)
src/awkward/_v2/index.py 81.95% <0.00%> (-1.64%) ⬇️
src/awkward/_v2/operations/convert/ak_from_jax.py 75.00% <0.00%> (ø)
...kward/_v2/operations/convert/ak_from_rdataframe.py 0.00% <0.00%> (ø)
... and 169 more

@jpivarski
Member

  • RVec is indeed the catch-all type with which all collections are represented in RDF (C-style arrays, std::vectors etc. are all represented as RVecs in RDF)

That's good to know! I just asked the same question in a ROOT I/O meeting (because I'm reading things out of order and didn't see your message until now). So users will expect RVec from any source.

  • RVec can act as a view on an existing contiguous memory buffer, it's sufficient to construct the object as RVec<T>(pointer, size) and you get a view instead of a copy.

The trouble is that we don't have preexisting data in a contiguous memory buffer unless it happens to be a list of a primitive type, such as numbers, booleans, or dates. If it's a list of lists or a list of records—i.e. at least two levels deep—then the proxies representing the nested lists or nested records are things that have to be created. If we have a data type like

list<list<list<float>>>

what we want to be able to do is leave our three offsets buffers where they are, in the original Awkward Array, and make some C++ instance with a type like ListProxy<ListProxy<ListProxy<float>>> (better name TBD). The ListProxy object is a fixed-size struct, probably 32 bytes (8-byte start, stop, which_array, array_pointers). There is no preexisting buffer of ListProxy<ListProxy<float>> instances anywhere. Then the overloaded operator[] (and const_iterator, etc.) make each ListProxy<ListProxy<float>> on demand. Each one of these is just another 32 bytes; everything can easily be stack-allocated, no malloc. (Everything here applies equally to RecordProxy, OptionProxy, etc., but these will have named methods, like field names, and that's not as problematic as overloading operator[] and I guess also size.)

By the time we've navigated down to the ListProxy<float>, that one can be a wrapped buffer because we do have a contiguous buffer of numeric contents that we can just point to. Maybe only that one should be an RVec, because most of those VecOps operations assume that you have a collection of numbers or booleans.
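
To illustrate the on-demand construction, here is a minimal Python sketch for list<list<list<float>>> data. The names (ListProxy, OFFSETS, CONTENT, entry) are hypothetical, and the depth field is a Python convenience: in C++ the nesting level would be encoded in the proxy's type instead of being stored at runtime.

```python
from dataclasses import dataclass

# Two entries: [[[1.1, 2.2], []], [[3.3]]] and []
OFFSETS = [
    [0, 2, 2],       # entries      -> ranges of level-1 lists
    [0, 2, 3],       # level-1 list -> ranges of level-2 lists
    [0, 2, 2, 3],    # level-2 list -> ranges of floats
]
CONTENT = [1.1, 2.2, 3.3]


@dataclass(frozen=True)
class ListProxy:
    start: int
    stop: int
    depth: int   # which OFFSETS buffer describes this list's elements

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        at = self.start + i
        if self.depth == len(OFFSETS):
            return CONTENT[at]                      # innermost level: a number
        offsets = OFFSETS[self.depth]
        return ListProxy(offsets[at], offsets[at + 1], self.depth + 1)


def entry(i):
    """Outermost proxy for entry i; no sub-proxy exists until it is asked for."""
    return ListProxy(OFFSETS[0][i], OFFSETS[0][i + 1], 1)
```

The three offsets buffers are never copied; each indexing step only reads two integers and builds one more fixed-size proxy.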

On the one hand, we want our ListProxy to be an RVec so that it's not surprising to users. Regarding your first point, users are expecting to get RVec as a container for any kind of list. Getting a different thing when the source is an Awkward Array would be a pain point. On the other hand, if making it an RVec means that we need to point to a contiguous buffer of its contents, then

  1. constructing the ListProxy<ListProxy<ListProxy<float>>> means that we need to allocate a variable-sized memory buffer (with malloc) to fill with all of the ListProxy<ListProxy<float>> instances,
  2. constructing each ListProxy<ListProxy<float>> means allocating a variable-sized memory buffer to fill with all the ListProxy<float> instances,
  3. but each ListProxy<float> instance can just be a pointer to the numeric data in the Awkward content.

That's a lot of instantiation at the beginning of an entry, it requires heap memory management, and it really slowed down the first implementation of Awkward-in-Numba. (The stack-only, late instantiation approach is why some early performance plots showed iteration over Awkward Arrays outperforming std::vector-based code, because constructing and populating the std::vectors were counted as part of the cost.) In this approach, there is an opportunity to consolidate all of the mallocs into one big malloc at the beginning of an entry, since that is a special point in the workflow that can have specialized code. But then it would involve running over all of the nested lists to find out how many of them there will be, to put them all in the one big malloc, and they do have different types, so reinterpret_casting would be necessary.
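
For a sense of what the "one big malloc" would have to be sized for, here is a small Python sketch that counts the nested lists in one entry. It assumes ListOffsetArray-style contiguous offsets, in which case the counts fall out of the offsets without visiting each list; with separate starts/stops (ListArray) that shortcut would not hold. The names OFFSETS and nested_list_counts are hypothetical.

```python
# Same layout idea as list<list<list<float>>>:
# two entries, [[[1.1, 2.2], []], [[3.3]]] and []
OFFSETS = [
    [0, 2, 2],       # entries      -> ranges of level-1 lists
    [0, 2, 3],       # level-1 list -> ranges of level-2 lists
    [0, 2, 2, 3],    # level-2 list -> ranges of floats
]


def nested_list_counts(entry):
    """Return [n level-1 lists, n level-2 lists, n floats] inside one entry."""
    counts = []
    start, stop = OFFSETS[0][entry], OFFSETS[0][entry + 1]
    for offsets in OFFSETS[1:]:
        counts.append(stop - start)                  # lists at this level
        start, stop = offsets[start], offsets[stop]  # range covered one level down
    counts.append(stop - start)                      # numbers at the bottom
    return counts
```

For the first entry this gives 2 + 3 = 5 list objects that an eager approach would construct up front, whether or not user code ever looks at them.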

It just occurred to me that RNTuple-to-RDataFrame would have the same problem. Its memory representation is a bunch of offsets and content buffers, just like Awkward Array. RNTuple doesn't have the historical constraint of going through TTreeReader, so how is RNTuple-to-RDataFrame implemented? Does it create in-memory entities for all lists in a list<list<list<float>>> at the beginning of each entry? If so, it could benefit from this approach, too.

Maybe the thing we could do, from Awkward, is to create ListProxy<ListProxy<ROOT::RVec<float>>> when we have list<list<list<float>>> data? The ROOT::VecOps would only be available for the lists of numeric and boolean types, but those are the only ones that make sense, right? This solution would involve zero mallocs, all stack-based, and also provide users with familiar types in the very common list<float> case. The downside is that it's introducing a distinction between two different kinds of lists.

Or/also we could provide a ListProxy::to_RVec method. That makes the performance cost opt-in.

@jpivarski
Member

This problem comes apart into four pieces. These are the responsibilities @ianna and I agreed on earlier today.

                        | from Awkward | to Awkward
pure data translation   | @jpivarski   | @ianna
fitting into RDataFrame | @ianna       | @ianna

For my piece, I've opened PR #1300. It would be a blocker for @ianna's work on the from Awkward part, but not the to Awkward part, so she's not currently blocked. The to Awkward part would involve the LayoutBuilder because the type is known before iteration starts.

If it's only ever used in a context that has a JIT compiler available (like this one), then perhaps LayoutBuilder could itself be JIT'ed instead of going through AwkwardForth. But even if that work is done someday, it would be with the same API that exists now, so we can ratchet up to that in multiple steps. For the entirety of this PR, LayoutBuilder would be used as-is.

@ianna ianna force-pushed the ianna/awkward-to-rdf branch 2 times, most recently from f2dc051 to fbe0e49 Compare February 23, 2022 09:01
@jpivarski
Member

The pure data translation from Awkward to C++ is done. It doesn't have a great interface, but that's something that can be fixed. See tests/v2/test_1300-awkward-to-cpp-converter-with-cling.py for usage. At the moment, awkward._v2._connect.cling doesn't actually import ROOT, but that will probably change.

The current, clunky interface takes a compiler function as an argument. Normally, this would be ROOT.gInterpreter.Declare, but in the first examples below, it will be print. I did that for debugging, and to isolate the external dependency so that we know exactly when it's needed.
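
The "compiler function as an argument" pattern can be sketched in a few lines of Python. The generate function and ExampleView class below are hypothetical and much simpler than the real generator; the point is only that the caller decides what happens to the C++ text.

```python
def generate(compiler, class_name="ExampleView"):
    """Build a (trivial) C++ declaration and hand it to the injected compiler."""
    code = "namespace awkward {\n  class " + class_name + " { };\n}"
    compiler(code)
    return code


declared = []
generate(declared.append)   # capture the text instead of compiling it
generate(print)             # or just look at it
```

With ROOT available, the same call would presumably be given ROOT.gInterpreter.Declare, which is the only place the external dependency would enter.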

Here's an example array from https://arxiv.org/abs/2001.06307:

>>> import awkward as ak
>>> array = ak._v2.Array(
...     [[{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.0, 0.2]}], [], [{"x": 3, "y": [3.0, 0.3, 3.3]}]]
... )
>>> array.show()
[[{x: 1, y: [1.1]}, {x: 2, y: [2, 0.2]}],
 [],
 [{x: 3, y: [3, 0.3, 3.3]}]]

I picked that one so that I could show a figure for its structure:

image

To turn this into a C++ iterable, we need to make a Generator and a Lookup. (The fact that this is multiple steps is the "clunkiness." That would be easy to wrap up in a single function call, but all of this is probably going to be hidden in @ianna's code anyway, so there's not much need to make it user-friendly.)

>>> import awkward._v2._connect.cling
>>> generator = ak._v2._connect.cling.togenerator(array.layout.form)
>>> lookup = ak._v2._lookup.Lookup(array.layout)

Here's what its code looks like. There's a C++ class for each node in the figure.

>>> generator.generate(print)
namespace awkward {
  class ArrayView {
  public:
    ArrayView(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : start_(start), stop_(stop), which_(which), ptrs_(ptrs) { }

    size_t size() const noexcept {{
      return stop_ - start_;
    }}

    bool empty() const noexcept {{
      return start_ == stop_;
    }}

  protected:
    ssize_t start_;
    ssize_t stop_;
    ssize_t which_;
    ssize_t* ptrs_;
  };
}
namespace awkward {
  class RecordView {
  public:
    RecordView(ssize_t at, ssize_t which, ssize_t* ptrs)
      : at_(at), which_(which), ptrs_(ptrs) { }

  protected:
    ssize_t at_;
    ssize_t which_;
    ssize_t* ptrs_;
  };
}
namespace awkward {
  class NumpyArray_int64_9vlCxRnT3oc: public ArrayView {
  public:
    NumpyArray_int64_9vlCxRnT3oc(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef int64_t value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      return reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
    }
  };
}
namespace awkward {
  class NumpyArray_float64_O1I50DFDJTY: public ArrayView {
  public:
    NumpyArray_float64_O1I50DFDJTY(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef double value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      return reinterpret_cast<double*>(ptrs_[which_ + 1])[start_ + at];
    }
  };
}
namespace awkward {
  class ListArray_BgI9cDJVCAw: public ArrayView {
  public:
    ListArray_BgI9cDJVCAw(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef NumpyArray_float64_O1I50DFDJTY value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
      ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
      return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
    }
  };
}
namespace awkward {
  class Record_gGZVr7BbK4: public RecordView {
  public:
    Record_gGZVr7BbK4(ssize_t at, ssize_t which, ssize_t* ptrs)
      : RecordView(at, which, ptrs) { }

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    int64_t x() const noexcept {
      return NumpyArray_int64_9vlCxRnT3oc(at_, at_ + 1, ptrs_[which_ + 2], ptrs_)[0];
    }
    NumpyArray_float64_O1I50DFDJTY y() const noexcept {
      return ListArray_BgI9cDJVCAw(at_, at_ + 1, ptrs_[which_ + 3], ptrs_)[0];
    }
  };
}
namespace awkward {
  class RecordArray_AUBUNrlJjX8: public ArrayView {
  public:
    RecordArray_AUBUNrlJjX8(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef Record_gGZVr7BbK4 value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      return value_type(start_ + at, which_, ptrs_);
    }
  };
}
namespace awkward {
  class ListArray_HTIOlVcPIAU: public ArrayView {
  public:
    ListArray_HTIOlVcPIAU(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef RecordArray_AUBUNrlJjX8 value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
      ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
      return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
    }
  };
}

The only classes with data members are ArrayView and RecordView. They are superclasses for all the specialized classes for each node, and only those specialized classes should ever get instantiated. Methods like at are copy-pasted into each class (by _generate_common() in cling.py), which could have been avoided by templating ArrayView on value_type, but it hardly matters: either C++'s templating language copy-pastes those methods or the Python that generates these classes does. Templating would only improve the readability of the above strings, which are something someone would rarely look at.

The class names all include a base64 hash to distinguish, say, a ListArray of NumpyArray from a ListArray of RecordArray. This hash depends on the deep contents of the Awkward node and any parameters they have, as well as any options used during the generation, such as flatlist_as_rvec, which changes the above to:

>>> generator.generate(print, flatlist_as_rvec=True)
namespace awkward {
  class ArrayView {
  public:
    ArrayView(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : start_(start), stop_(stop), which_(which), ptrs_(ptrs) { }

    size_t size() const noexcept {{
      return stop_ - start_;
    }}

    bool empty() const noexcept {{
      return start_ == stop_;
    }}

  protected:
    ssize_t start_;
    ssize_t stop_;
    ssize_t which_;
    ssize_t* ptrs_;
  };
}
namespace awkward {
  class RecordView {
  public:
    RecordView(ssize_t at, ssize_t which, ssize_t* ptrs)
      : at_(at), which_(which), ptrs_(ptrs) { }

  protected:
    ssize_t at_;
    ssize_t which_;
    ssize_t* ptrs_;
  };
}
namespace awkward {
  class NumpyArray_int64_cRhHHLKAiXY: public ArrayView {
  public:
    NumpyArray_int64_cRhHHLKAiXY(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef int64_t value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      return reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
    }
  };
}
namespace awkward {
  class ListArray_EhxjPFyWKf8: public ArrayView {
  public:
    ListArray_EhxjPFyWKf8(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef ROOT::RVec<double> value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
      ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
      ssize_t which = ptrs_[which_ + 3];
      double* content = reinterpret_cast<double*>(ptrs_[which + 1]) + start;
      return value_type(content, stop - start);
    }
  };
}
namespace awkward {
  class Record_gtaz2QTTPs: public RecordView {
  public:
    Record_gtaz2QTTPs(ssize_t at, ssize_t which, ssize_t* ptrs)
      : RecordView(at, which, ptrs) { }

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    int64_t x() const noexcept {
      return NumpyArray_int64_cRhHHLKAiXY(at_, at_ + 1, ptrs_[which_ + 2], ptrs_)[0];
    }
    NumpyArray_float64_Jw2edUDvrA y() const noexcept {
      return ListArray_EhxjPFyWKf8(at_, at_ + 1, ptrs_[which_ + 3], ptrs_)[0];
    }
  };
}
namespace awkward {
  class RecordArray_39Ik7hb1TXs: public ArrayView {
  public:
    RecordArray_39Ik7hb1TXs(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef Record_gtaz2QTTPs value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      return value_type(start_ + at, which_, ptrs_);
    }
  };
}
namespace awkward {
  class ListArray_zdlkE7xbFoY: public ArrayView {
  public:
    ListArray_zdlkE7xbFoY(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef RecordArray_39Ik7hb1TXs value_type;

    const std::string parameter(const std::string& parameter) const noexcept {
      return "null";
    }

    value_type at(size_t at) const {
      if (at >= stop_ - start_) {
        throw std::out_of_range(std::to_string(at) + " is out of range");
      }
      else {
        return (*this)[at];
      }
    }

    value_type operator[](size_t at) const noexcept {
      ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
      ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
      return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
    }
  };
}

Now the flat lists have types like ROOT::RVec<double> instead of NumpyArray_float64_O1I50DFDJTY.

Those hashes are not good for naming types, because although they're stable for a given set of nested types and options, a little change in the Awkward code could produce radically different hashes. Users should (as in "ought to") rely on auto.
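
As an illustration of why the names are stable yet fragile, here is a hedged Python sketch of one way such a name could be derived. This is not the scheme used in cling.py (the hash function, the serialization, and the helper name class_name are all assumptions); it only shows that any change to the form or the options changes the suffix.

```python
import base64
import hashlib
import json


def class_name(kind, form, **options):
    """Hypothetical naming scheme: node kind plus a hash of form and options."""
    payload = json.dumps({"form": form, "options": options}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    # Replace '+' and '/' with '_' so the suffix is a valid C++ identifier part.
    suffix = base64.b64encode(digest, altchars=b"__").decode("ascii")[:11]
    return kind + "_" + suffix


form = {"class": "ListArray", "content": {"class": "NumpyArray", "primitive": "float64"}}
plain = class_name("ListArray", form)
as_rvec = class_name("ListArray", form, flatlist_as_rvec=True)
```

The same inputs always give the same name, but plain and as_rvec differ, which is why spelling the names out in user code is brittle and auto is the safer choice.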

To turn an Awkward Array into one of these types, use the C++ generated by

>>> generator.entry(flatlist_as_rvec=True)
'awkward::ListArray_zdlkE7xbFoY(0, length, 0, ptrs)'

It is necessary to pass the same options to generator.entry as generator.generate. Somehow, the length of the original array and ptrs, an array that serves as navigation in C++, must be passed in. It's generally pretty easy to pass a ssize_t and a ssize_t* from Python into C++.

>>> length = len(array)
>>> length
3

>>> ptrs = lookup.arrayptrs
>>> ptrs
array([            -1, 94864433929152, 94864433929160,              4,
                   -1,              3,              8,             10,
                   -1, 94864441259584,             -1, 94864440548752,
       94864440548760,             14,             -1, 94864440292080])
>>> ptrs.ctypes.data
94864440446608

In this, C++ is always seeing borrowed references, so the lookup object must be kept in scope while the C++ is running. The lookup holds Python references to everything else, keeping it all alive.
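
To show how ptrs serves as navigation, here is a Python emulation of the generated operator[] methods. It is a sketch: Python lists keyed by made-up integers stand in for the raw buffers and their addresses, and the meaning given to each ptrs slot (in particular the 3 in the RecordArray block being a length) is my reading of the generated code above, not a documented layout.

```python
# Python lists stand in for raw buffers; the integer keys play the role of addresses.
BUFFERS = {
    1000: [0, 2, 2],                        # outer list starts
    1008: [2, 2, 3],                        # outer list stops
    2000: [1, 2, 3],                        # x values
    3000: [0, 1, 3],                        # y list starts
    3008: [1, 3, 6],                        # y list stops
    4000: [1.1, 2.0, 0.2, 3.0, 0.3, 3.3],   # y values
}

# Same shape as the ptrs array printed above, with fake addresses.
PTRS = [
    -1, 1000, 1008, 4,    # 0: outer ListArray -> starts, stops, position of content
    -1, 3, 8, 10,         # 4: RecordArray -> (assumed) length, position of x, of y
    -1, 2000,             # 8: NumpyArray holding x
    -1, 3000, 3008, 14,   # 10: ListArray for y -> starts, stops, position of content
    -1, 4000,             # 14: NumpyArray holding the y values
]


def list_getitem(start_, which_, at):
    """Mimic ListArray operator[]: return (start, stop, which) of the element view."""
    start = BUFFERS[PTRS[which_ + 1]][start_ + at]
    stop = BUFFERS[PTRS[which_ + 2]][start_ + at]
    return start, stop, PTRS[which_ + 3]


def numpy_getitem(start_, which_, at):
    """Mimic NumpyArray operator[]: fetch one value from the buffer."""
    return BUFFERS[PTRS[which_ + 1]][start_ + at]


def record_x(at_, which_):
    """Mimic Record x(): a one-element NumpyArray view, then element 0."""
    return numpy_getitem(at_, PTRS[which_ + 2], 0)


def record_y(at_, which_):
    """Mimic Record y(): a one-element ListArray view, then element 0."""
    return list_getitem(at_, PTRS[which_ + 3], 0)


# array[0][1]: first entry, second record
rec_start, rec_stop, rec_which = list_getitem(0, 0, 0)
record_at = rec_start + 1    # RecordArray operator[] only offsets the position
y_start, y_stop, y_which = record_y(record_at, rec_which)
```

Every step is integer arithmetic on ptrs plus one or two buffer reads, which is what lets the C++ views stay fixed-size.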

The real magic of what's going on happens in the operator[] methods, which are const noexcept. They either fetch single values from the original array or create more ArrayView/RecordView instances, which are 32 and 24 bytes, respectively. All of these can live on the stack—no malloc anywhere. It might even be a little better if the return values were value_type&, rather than value_type (saving a 32-byte copy), but I'm not very familiar/comfortable with these aspects of C++ and I'll let others make those performance adjustments.

Users will probably want all of the const_iterator stuff, which should be easy to add to _generate_common() or (if ArrayView gets templated with value_type) directly in ArrayView. These can be defined in terms of the operator[], which should be where the optimization effort goes.
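
As a loose Python analogy for defining iteration in terms of indexing, the hypothetical CountingView below implements __iter__ purely through __getitem__ and counts reads, so an early exit visibly touches only the elements that were needed. It is a sketch of the idea, not of the eventual C++ const_iterator.

```python
class CountingView:
    """A view over data[start:stop] whose iterator is built on __getitem__."""

    def __init__(self, data, start, stop):
        self._data = data
        self._start = start
        self._stop = stop
        self.reads = 0                       # how many elements were actually fetched

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        self.reads += 1
        return self._data[self._start + i]

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]                    # all element access goes through indexing


view = CountingView([5, 7, 9, 11], 1, 4)     # views the elements 7, 9, 11
first_big = next(v for v in view if v > 8)   # stops as soon as 9 is seen
reads_after_search = view.reads
```

Any optimization of the indexing path is inherited by iteration automatically.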

That's all I can think of to say. Comments? Questions?

@jpivarski
Member

Oh, I forgot to mention that Awkward's option-type goes to C++ std::optional, and union-type goes to std::variant, so in general, this requires C++17. The required headers are in a list in cling.py. I didn't do any tests of complex numbers, and dates would probably have to be explicitly handled; they probably don't work yet.

Every Awkward type has a C++ iterable. (Implementing this even revealed a bug in the Numba iterators, which are very similar, apart from generating LLVM IR instead of C++.)
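
For readers less familiar with how option and union data look in columnar form, here is a minimal Python sketch. It assumes an IndexedOptionArray-style index (negative means missing) and a UnionArray-style pair of tags and index; the helper names option_at and union_at are hypothetical, and Python's Optional and Union merely play the role that std::optional and std::variant play in C++.

```python
from typing import List, Optional, Sequence, Union


def option_at(index: Sequence[int], content: Sequence[float], i: int) -> Optional[float]:
    """Option type: a negative index marks a missing value."""
    j = index[i]
    return None if j < 0 else content[j]


def union_at(
    tags: Sequence[int],
    index: Sequence[int],
    contents: List[Sequence[Union[int, str]]],
    i: int,
) -> Union[int, str]:
    """Union type: tags picks which content, index picks the position in it."""
    return contents[tags[i]][index[i]]


option_values = [option_at([0, -1, 1], [1.5, 2.5], i) for i in range(3)]
union_values = [union_at([0, 1, 0], [0, 0, 1], [[10, 20], ["a"]], i) for i in range(3)]
```

In both cases the value is produced at access time from the original buffers, in the same spirit as the list and record views.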

@ianna
Collaborator Author

ianna commented Feb 23, 2022

@jpivarski - I ran into a RuntimeError: could not load cppyy_backend library with the following command:

python localbuild.py --pytest tests/v2/test_1300-awkward-to-cpp-converter-with-cling.py

while this one works just fine:

python -m pytest tests/v2/test_1300-awkward-to-cpp-converter-with-cling.py

I think it's related to @rpath:

% otool -L /Users/yana/Projects/ROOT/ROOT-master/23.02.2022/root_install/lib/libcppyy_backend3_10.so 
/Users/yana/Projects/ROOT/ROOT-master/23.02.2022/root_install/lib/libcppyy_backend3_10.so:
	@rpath/libcppyy_backend3_10.so (compatibility version 0.0.0, current version 0.0.0)
	@rpath/libCore.so (compatibility version 0.0.0, current version 0.0.0)
	/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 1200.3.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)

@miranov25

@jpivarski That's all I can think of to say. Comments? Questions?

Reading the text above, I understand that you are implementing point 1 from the original comment:
#1295 (comment)

When do you expect the first prototype (of Awkward to C++) to be ready for testing? Based on what was written above, I can not estimate that.

@jpivarski
Member

@ianna, do you know if localbuild.py vs pytest is using a different compilation of Awkward? All that the test does is call ROOT.gInterpreter.Declare and then call the function that ROOT has created. If you're having problems with rpath and error messages like "could not load cppyy_backend library", I would suspect your ROOT installation.

While working on this, I discovered that ROOT installed from source does not have C++17 enabled unless you explicitly pass -DCMAKE_CXX_STANDARD=17. If you get ROOT from conda-forge, you should be okay for C++17.

Although if the problem was that ROOT is compiled without C++17, then the error messages would be different: it would be saying that optional is not in namespace std...
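
A quick way to check this before running anything is to look at the compiler flags ROOT reports. The Python sketch below only parses a flags string; the helper names are hypothetical, and feeding it the output of root-config --cflags (for example through subprocess) is left to the reader. It deliberately recognizes only the plain spellings c++17, c++20, and c++23 (or gnu++ equivalents), not aliases such as c++1z.

```python
import re
from typing import Optional


def cxx_standard(cflags: str) -> Optional[str]:
    """Return the value of -std=c++NN (or -std=gnu++NN), or None if absent."""
    match = re.search(r"-std=(?:c|gnu)\+\+(\w+)", cflags)
    return match.group(1) if match else None


def has_cxx17(cflags: str) -> bool:
    """True if the flags name a standard that provides std::optional and std::variant."""
    return cxx_standard(cflags) in {"17", "20", "23"}


flags_17 = "-pthread -std=c++17 -m64 -I/some/include"
flags_14 = "-pthread -std=c++14 -m64 -I/some/include"
```

If has_cxx17 returns False for your installation, the std::optional and std::variant parts of the generated code should be expected to fail to compile.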

@jpivarski
Member

@miranov25, it depends on what you want to test. If you want to do an initial test outside of RDataFrame (even if you plan on using RDataFrame in the future), you could use the examples I gave above right now. They're in Awkward's main branch, and they use v2 arrays (that's what the examples construct). This covers the Awkward → C++ direction, not the other way, and it should permit you to iterate over the data using operator[] syntax.

If you want to test RDataFrame access, that will depend on @ianna's developments, and I would guess that it would take until next week or so to have estimates. @ianna, based on what you've seen of the RDataSource infrastructure so far, what would it take to build something that @miranov25 can at least test?

@miranov25

@jpivarski, @ianna, thank you. O(weeks) is enough, so I will not try a temporary solution.
For the ROOT C++, I assume the default CVMFS version should be fine:

Singularity> root-config --cflag
-pthread -std=c++17 -m64 -I/cvmfs/alice.cern.ch/el7-x86_64/Packages/ROOT/v6-24-06-62/include
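
The flags above can also be checked programmatically. The following is a minimal sketch, not part of Awkward or ROOT: the helper names `cxx_standard_from_cflags` and `supports_cxx17` are hypothetical, and the sketch only recognizes numeric standards such as `c++17` (draft names like `c++1z` are deliberately ignored).

```python
import re
from typing import Optional


def cxx_standard_from_cflags(cflags: str) -> Optional[int]:
    """Return the number in a -std=c++NN (or gnu++NN) flag, or None if absent."""
    match = re.search(r"-std=(?:c|gnu)\+\+(\d+)\b", cflags)
    return int(match.group(1)) if match else None


def supports_cxx17(cflags: str) -> bool:
    """True if the flags name C++17 or a later numeric standard."""
    # An explicit set avoids treating c++98 as "newer" than c++17.
    return cxx_standard_from_cflags(cflags) in {17, 20, 23, 26}


# Flags copied from the root-config output quoted in this thread.
cflags = (
    "-pthread -std=c++17 -m64 "
    "-I/cvmfs/alice.cern.ch/el7-x86_64/Packages/ROOT/v6-24-06-62/include"
)
print(supports_cxx17(cflags))
```

In practice the string would come from running `root-config` in the environment being checked; here it is hard-coded so the sketch runs without ROOT.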

@ianna
Collaborator Author

ianna commented Feb 24, 2022

> @jpivarski, @ianna, thank you. O(weeks) is enough, so I will not try a temporary solution. For the ROOT C++, I assume the default CVMFS version should be fine:
>
> Singularity> root-config --cflag
> -pthread -std=c++17 -m64 -I/cvmfs/alice.cern.ch/el7-x86_64/Packages/ROOT/v6-24-06-62/include

Yes, I'm basing the code on ROOT 6.24/06 and C++17, though I have a local build that follows this recipe:

git clone --branch latest-stable https://github.com/root-project/root.git root_src
mkdir root_build root_install && cd root_build
cmake -DCMAKE_INSTALL_PREFIX=../root_install -Dbuiltin_glew=ON -Dclad=OFF -Dtmva-pymva=OFF -DCMAKE_CXX_STANDARD=17 ../root_src
cmake --build . -- install -j4

source ../root_install/bin/thisroot.sh

@ianna
Collaborator Author

ianna commented Feb 24, 2022

> @ianna, do you know if localbuild.py vs pytest is using a different compilation of Awkward? All that the test does is call ROOT.gInterpreter.Declare and then call the function that ROOT has created. If you're having problems with rpath and error messages like "could not load cppyy_backend library", I would suspect your ROOT installation.
>
> While working on this, I discovered that ROOT installed from source does not have C++17 enabled unless you explicitly pass -DCMAKE_CXX_STANDARD=17. If you get ROOT from conda-forge, you should be okay for C++17.
>
> Although if the problem was that ROOT is compiled without C++17, then the error messages would be different: it would be saying that optional is not in namespace std...

@jpivarski - it looks like localbuild.py sets LD_LIBRARY_PATH to awkward: and does not pick up the existing value, which is defined by the ROOT install script but is not in the env virtual environment. I think I need to fix it on my side or use the following patch:

diff --git a/localbuild.py b/localbuild.py
index d3626ae5..615bae7e 100755
--- a/localbuild.py
+++ b/localbuild.py
@@ -163,6 +163,8 @@ if args.buildpython:
     # localbuild must be in the library path for some operations.
     env = dict(os.environ)
     reminder = False
+    if env.get("ROOTSYS") is not None and env.get("SHLIB_PATH") is not None:
+        env["LD_LIBRARY_PATH"] = env.get("SHLIB_PATH", "") + ":" + env.get("LD_LIBRARY_PATH", "")
     if "awkward" not in env.get("LD_LIBRARY_PATH", ""):
         env["LD_LIBRARY_PATH"] = "awkward:" + env.get("LD_LIBRARY_PATH", "")
         reminder = True

@jpivarski
Member

@ianna The localbuild.py script is adding the awkward directory to LD_LIBRARY_PATH, but it is neither adding nor removing ROOT from it. I'd rather not add the ROOTSYS check to it because it has nothing to do with installing ROOT.

Anyway, this is probably moot because if @henryiii fixes editable installation (pip install -e .) and compilation times get much shorter when we drop v1, we probably won't need localbuild.py anymore.
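
The LD_LIBRARY_PATH handling discussed above can be sketched in isolation. This is an illustrative helper, not the code in localbuild.py: the name `prepend_library_path` is hypothetical, and unlike the script it compares whole path entries rather than substrings and avoids a trailing colon when the variable was previously unset.

```python
import os
from typing import Dict, Mapping


def prepend_library_path(
    env: Mapping[str, str], directory: str, var: str = "LD_LIBRARY_PATH"
) -> Dict[str, str]:
    """Return a copy of env with directory first in var, keeping existing entries."""
    new_env = dict(env)
    # Split on ":" (the POSIX separator) and drop empty entries.
    entries = [entry for entry in new_env.get(var, "").split(":") if entry]
    if directory not in entries:
        entries.insert(0, directory)
    new_env[var] = ":".join(entries)
    return new_env


# Example: build an environment for a subprocess without touching os.environ.
env = prepend_library_path(os.environ, "awkward")
```

Because the helper only prepends, any ROOT library directories already present in the variable are left in place, which matches the behavior described above of neither adding nor removing ROOT.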

@ianna ianna force-pushed the ianna/awkward-to-rdf branch from 1485511 to 1f904f1 Compare March 2, 2022 11:43
@ianna ianna force-pushed the ianna/awkward-to-rdf branch from d739670 to 63500c0 Compare March 14, 2022 11:12
jpivarski added a commit that referenced this pull request Mar 14, 2022
jpivarski added a commit that referenced this pull request Mar 15, 2022
* Starting a pure Cling DEMO (do not merge).

* A working version of the demo, assuming you have InterpreterUtils.cpp.

* Port API changes from ffda5a3 (PR #1295).

* Cleanup.

* Demonstration of ArrayBuilder casting without std::invoke.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fully implemented/tested ArrayBuilder, tests not included.

* Using ArrayBuilder in C++.

* Defined iterators so that for-each loops work.

* And reverse iterators (they're all const).

* Also give ArrayBuilder its 'append' methods.

* Iterators only need 32 bytes and track their value in start_.

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@ianna ianna force-pushed the ianna/awkward-to-rdf branch from 5f37041 to 2dfbb6c Compare March 15, 2022 17:42
@ianna ianna force-pushed the ianna/awkward-to-rdf branch from e9726fb to 9e9a3e0 Compare March 15, 2022 19:59
@ianna ianna force-pushed the ianna/awkward-to-rdf branch from c686f1e to 2d5cff8 Compare March 17, 2022 09:43
@jpivarski
Member

The discussion started here informed quite a few PRs: Awkward → C++ generation in #1300, #1359, #1372, #1376, #1383, #1398, and now Awkward → RDataFrame in #1374, which is just about ready to be merged.

There's still the RDataFrame → Awkward direction, but that will be new PRs. I can close this one now.

@jpivarski jpivarski closed this Apr 26, 2022
@ianna
Collaborator Author

ianna commented Jun 23, 2022

FYI, both ak._v2.to_rdataframe and ak._v2.from_rdataframe are now in a pre-release. You can pick it up with

pip install --pre awkward

Here is an example of how to use it:

import awkward as ak


def test_data_frame_vec_of_real():
    ak_array_in = ak._v2.Array([[1.1, 2.2], [3.3], [4.4, 5.5]])

    data_frame = ak._v2.to_rdataframe({"x": ak_array_in})

    assert data_frame.GetColumnType("x") == "ROOT::VecOps::RVec<double>"

    ak_array_out = ak._v2.from_rdataframe(
        data_frame,
        column="x",
    )
    assert ak_array_in.to_list() == ak_array_out["x"].to_list()

Please let me know if there are any issues or requests. Thanks!

@miranov25

miranov25 commented Jun 23, 2022

Dear @ianna and @jpivarski

Thank you. I will try to use it. During the first blind test I got some errors that could be due to my ROOT configuration.
The first part of the test is fine, but the second part crashes. Is there a test I can use to check ROOT/Awkward compatibility?

defining test function

In [1]: import awkward as ak                                                                                                                                                                                
In [2]: def test_data_frame_vec_of_real(): 
   ...:     ak_array_in = ak._v2.Array([[1.1, 2.2], [3.3], [4.4, 5.5]]) 
   ...:     data_frame = ak._v2.to_rdataframe({"x": ak_array_in}) 
   ...:  
   ...:     assert data_frame.GetColumnType("x") == "ROOT::VecOps::RVec<double>" 
   ...:  
   ...:     ak_array_out = ak._v2.from_rdataframe( 
   ...:         data_frame, 
   ...:         column="x", 
   ...:     ) 
   ...:     assert ak_array_in.to_list() == ak_array_out["x"].to_list() 

-->

error - stacktrace


In [3]: test_data_frame_vec_of_real()                                                                                                                                                                       
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-3-cb35a82c7e49> in <module>
----> 1 test_data_frame_vec_of_real()

<ipython-input-2-b6ba8aa26f13> in test_data_frame_vec_of_real()
      6     assert data_frame.GetColumnType("x") == "ROOT::VecOps::RVec<double>"
      7 
----> 8     ak_array_out = ak._v2.from_rdataframe(
      9         data_frame,
     10         column="x",

/alicesw/sw/ubuntu2004_x86-64/Python/v3.8.10-local1/lib/python/site-packages/awkward/_v2/operations/ak_from_rdataframe.py in from_rdataframe(data_frame, column)
     22         ),
     23     ):
---> 24         return _impl(
     25             data_frame,
     26             column,

/alicesw/sw/ubuntu2004_x86-64/Python/v3.8.10-local1/lib/python/site-packages/awkward/_v2/operations/ak_from_rdataframe.py in _impl(data_frame, column)
     32     column,
     33 ):
---> 34     import awkward._v2._connect.rdataframe.from_rdataframe  # noqa: F401
     35 
     36     return ak._v2._connect.rdataframe.from_rdataframe.from_rdataframe(

/alicesw/sw/ubuntu2004_x86-64/ROOT/v6-26-04-patches-alice1-local1/lib/ROOT/_facade.py in _importhook(name, *args, **kwds)
    151                 except Exception:
    152                     pass
--> 153             return _orig_ihook(name, *args, **kwds)
    154         __builtin__.__import__ = _importhook
    155 

/alicesw/sw/ubuntu2004_x86-64/Python/v3.8.10-local1/lib/python/site-packages/awkward/_v2/_connect/rdataframe/from_rdataframe.py in <module>
     23 
     24 
---> 25 cppyy.add_include_path(
     26     os.path.abspath(
     27         os.path.join(

/alicesw/sw/ubuntu2004_x86-64/ROOT/v6-26-04-patches-alice1-local1/lib/cppyy/__init__.py in add_include_path(path)
    219     """Add a path to the include paths available to Cling."""
    220     if not os.path.isdir(path):
--> 221         raise OSError("no such directory: %s" % path)
    222     gbl.gInterpreter.AddIncludePath(path)
    223 

OSError: no such directory: /alicesw/sw/ubuntu2004_x86-64/Python/v3.8.10-local1/lib/python3.8/site-packages/awkward/_v2/cpp-headers

The directory mentioned in the error indeed does not exist:

Singularity> ls -a /alicesw/sw/ubuntu2004_x86-64/Python/v3.8.10-local1/lib/python3.8/site-packages/awkward/_v2/
.   __init__.py  _broadcasting.py  _lookup.py       _reducers.py  _typetracer.py  behaviors  forms         identifier.py  numba.py    record.py           types
..  __pycache__  _connect          _prettyprint.py  _slicing.py   _util.py        contents   highlevel.py  index.py       operations  tmp_for_testing.py
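
The same check can be made from Python before importing the RDataFrame connector. This is a small diagnostic sketch under the assumption that the headers are expected in `_v2/cpp-headers` inside the installed package, as the error message indicates; the function name `missing_header_dir` is hypothetical.

```python
from pathlib import Path
from typing import Optional, Union


def missing_header_dir(package_dir: Union[str, Path]) -> Optional[Path]:
    """Return the expected cpp-headers path if it is absent, otherwise None."""
    header_dir = Path(package_dir) / "_v2" / "cpp-headers"
    return None if header_dir.is_dir() else header_dir


# Usage with an installed package (not executed here):
#     import awkward as ak
#     print(missing_header_dir(Path(ak.__file__).parent))
```

A non-None result points at the directory that cppyy's `add_include_path` would reject with the OSError shown in the traceback.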

awkward description

Singularity> pip show awkward
Name: awkward
Version: 1.9.0rc6
Summary: Manipulate JSON-like data with NumPy-like idioms.
Home-page: https://github.com/scikit-hep/awkward-1.0
Author: Jim Pivarski
Author-email: pivarski@princeton.edu
License: BSD-3-Clause
Location: /alicesw/sw/ubuntu2004_x86-64/Python/v3.8.10-local1/lib/python3.8/site-packages
Requires: numpy, setuptools
Required-by: 

@agoose77
Collaborator

@miranov25 I took a cursory glance at the RC wheel - it looks like we're not packaging the headers for consumption. I'll look into this.

@ianna
Collaborator Author

ianna commented Jun 23, 2022

> @miranov25 I took a cursory glance at the RC wheel - it looks like we're not packaging the headers for consumption. I'll look into this

Thanks! Indeed, the C++ header-only directory is something new and was not needed before.

@ianna
Collaborator Author

ianna commented Jun 23, 2022

@miranov25 - thanks for the quick feedback! Indeed, the header file https://github.com/scikit-hep/awkward/blob/main/src/awkward/_v2/cpp-headers/rdataframe_jagged_builders.h is needed. I think you could try copying it by hand into the directory mentioned in the error, or wait for the next pre-release. Thanks again!
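
As a rough sketch of the manual workaround, assuming the header has already been downloaded to a local file: the helper name `install_header` is hypothetical, and the `_v2/cpp-headers` location is taken from the error message above.

```python
import shutil
from pathlib import Path
from typing import Union


def install_header(
    header_file: Union[str, Path], package_dir: Union[str, Path]
) -> Path:
    """Copy a downloaded header into <package_dir>/_v2/cpp-headers, creating it if needed."""
    target_dir = Path(package_dir) / "_v2" / "cpp-headers"
    target_dir.mkdir(parents=True, exist_ok=True)
    # shutil.copy2 returns the path of the newly created file.
    return Path(shutil.copy2(header_file, target_dir))


# Usage with an installed package (not executed here):
#     import awkward as ak
#     install_header("rdataframe_jagged_builders.h", Path(ak.__file__).parent)
```

Writing into site-packages may require appropriate permissions, and a later pre-release that ships the headers would make this step unnecessary.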

@miranov25

@ianna

I tried it earlier via GitHub master and it failed (I do not know why). Now I tried it via your link https://github.com/scikit-hep/awkward/blob/main/src/awkward/_v2/cpp-headers/rdataframe_jagged_builders.h and the test ran to the end without problems.

@ianna
Collaborator Author

ianna commented Jun 24, 2022

@miranov25 - thanks! A new pre-release is available.

6 participants