Awkward to RDataFrame: to start a discussion #1295
Adding a link to the Gitter and the ROOT forum: |
I can do some alpha testing on this as well. Regarding whether it should be presented as std::vector, RVec, or a subclass of one of these, what are the current pros and cons of each? |
On the user side, the pros and cons are determined by what you expect. If all list-like data in ROOT (dynamic arrays like
On the technical side, we want the list and record types to be views/proxies as much as possible. In the Numba implementation, we were able to take this all the way: the runtime objects representing lists and records in Numba JIT-compiled code are all nothing more than pointers into the original array data: the same kind of 48-byte ArrayView (40 bytes in version 2) object can be allocated on the stack, regardless of how many elements the list has or how many fields the record has. It works because we can inject any code in the equivalent of
For C++, this means that we'd have to be able to override
The reason that views/proxies are ideal is that user code might, for example, extract a list (by calling
So from a technical point of view, a ranking from best to worst is:
For anyone who is totally confused at this point, I probably should have mentioned that data in an Awkward Array is organized in a completely different way from but, say, length 1 billion, is represented by a single 5-node tree with 4 attached, 1-dimensional arrays whose lengths are O(1 billion). See https://arxiv.org/abs/2001.06307
The worst case, number 3 above, would be to turn an entire entry into C++
The best case, number 1 above, would be to turn it into some
But if the above is incompatible with user expectations because creating our own C++ collection type goes against the grain of expectation, then we'll have to think about priorities.
(If you're thinking that thinking about this is premature optimization, Awkward Arrays in Numba started out with a straightforward, non-proxy implementation, and the result was so bad that it had to be laboriously rewritten, PR #118 (87 commits, 32 kLoC). This needs to be an up-front design decision.) |
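For readers new to the columnar layout, a minimal sketch may help. It uses plain Python lists in place of Awkward's buffers, and the names `offsets`, `content`, `get_range`, and `materialize` are illustrative assumptions, not Awkward's API. It shows how one shared content buffer plus one offsets buffer can represent any number of variable-length lists, and why a view of one list only needs a (start, stop) pair.

```python
# Columnar layout for a list of variable-length lists.
# Plain Python lists stand in for contiguous buffers (an assumption
# made for illustration; Awkward itself uses typed array buffers).
jagged = [[1.1, 2.2], [], [3.3, 4.4, 5.5]]

# Flatten once: all values go into one buffer, list boundaries into another.
content = [value for sublist in jagged for value in sublist]
offsets = [0]
for sublist in jagged:
    offsets.append(offsets[-1] + len(sublist))


def get_range(i):
    """Return the (start, stop) range of list i: a view, nothing is copied."""
    return offsets[i], offsets[i + 1]


def materialize(i):
    """Copy list i out of the shared buffer (the work a view avoids)."""
    start, stop = get_range(i)
    return content[start:stop]


print(offsets)         # [0, 2, 2, 5]
print(get_range(2))    # (2, 5)
print(materialize(2))  # [3.3, 4.4, 5.5]
```

The number of buffers depends only on the data type, not on the number of entries, which is the property the ranking above is trying to preserve on the C++ side.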
Here is a NumPy RDataFrame example that uses this RDataSource implementation:
import ROOT
import numpy as np
x = np.array([1, 2, 3], dtype=np.int32)
y = np.array([4, 5, 6], dtype=np.float64)
# Read the data with RDataFrame
# The column names in the RDataFrame are defined by the keys of the dictionary.
# Please note that only fundamental types (int, float, ...) are supported and
# the arrays must have the same length.
df = ROOT.RDF.MakeNumpyDataFrame({'x': x, 'y': y})
df.Display().Print()
>>> x
array([1, 2, 3], dtype=int32)
>>> df.__data__['x'][0]
1
>>> df.__data__['x'][2]
3
>>> df.__data__['x'][5]
32650
>>> df.__data__['x']
<cppyy.gbl.ROOT.VecOps.RVec<int> object at 0x7f8a7b498420>
>>> x[0]
1
>>> x[0]=222
>>> df.__data__['x'][0]
222 |
Hi, thank you for working on this! Just a runaway comment that:
I hope this helps you in making the decision. |
That's good to know! I just asked the same question in a ROOT I/O meeting (because I'm out of order and didn't see your message until now). So users will expect RVec from any source.
The trouble is that we don't have preexisting data in a contiguous memory buffer unless it happens to be a list of a primitive type, such as numbers, booleans, or dates. If it's a list of lists or a list of records—i.e. at least two levels deep—then the proxies representing the nested lists or nested records are things that have to be created. If we have a data type like list<list<list<float>>>, what we want to be able to do is leave our three
By the time we've navigated down to the
On the one hand, we want our
That's a lot of instantiation at the beginning of an entry, it requires heap memory management, and it really slowed down the first implementation of Awkward-in-Numba. (The stack-only, late instantiation approach is why some early performance plots showed iteration over Awkward Arrays outperforming
It just occurred to me that RNTuple-to-RDataFrame would have the same problem. Its memory representation is a bunch of
Maybe the thing we could do, from Awkward, is to create
Or/also we could provide a |
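The late-instantiation idea described here can be sketched in pure Python. This is only an illustration under stated assumptions: `ListView` is a hypothetical class, plain lists stand in for buffers, and the three offsets levels model a column of type list<list<list<float>>>. Each indexing step creates one small fixed-size proxy; nothing below the requested element is instantiated.

```python
class ListView:
    """Hypothetical fixed-size proxy: a (start, stop) range at one nesting depth."""

    __slots__ = ("start", "stop", "depth", "offsets", "content")

    def __init__(self, start, stop, depth, offsets, content):
        self.start, self.stop, self.depth = start, stop, depth
        self.offsets, self.content = offsets, content

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        at = self.start + i
        if self.depth == len(self.offsets):
            # Innermost level: read the number straight from the buffer.
            return self.content[at]
        level = self.offsets[self.depth]
        # One level down: a new proxy, created only when asked for.
        return ListView(level[at], level[at + 1], self.depth + 1,
                        self.offsets, self.content)

    def to_list(self):
        """Eagerly materialize everything (the cost that views defer)."""
        items = (self[i] for i in range(len(self)))
        return [x.to_list() if isinstance(x, ListView) else x for x in items]


# Three entries of type list<list<list<float>>>, stored as columns.
data = [[[[1.0, 2.0], [3.0]], [[4.0]]], [], [[[5.0, 6.0, 7.0]]]]
offsets = [[0, 2, 2, 3], [0, 2, 3, 4], [0, 2, 3, 4, 7]]
content = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]

top = ListView(0, len(data), 0, offsets, content)
print(top[0][1][0][0])  # 4.0
print(top[2][0][0][2])  # 7.0
```

Calling `to_list()` corresponds to the eager strategy that requires instantiating every nested container up front; indexing through `ListView` objects corresponds to the proxy strategy.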
This problem comes apart into four pieces. These are the responsibilities @ianna and I agreed on earlier today.
For my piece, I've opened PR #1300. It would be a blocker for @ianna's work on the from Awkward part, but not the to Awkward part, so she's not currently blocked.
The to Awkward part would involve the
If it's only ever used in a context that has a JIT compiler available (like this one), then perhaps |
The pure data translation from Awkward to C++ is done. It doesn't have a great interface, but that's something that can be fixed. See tests/v2/test_1300-awkward-to-cpp-converter-with-cling.py for usage.
At the moment, awkward._v2._connect.cling doesn't actually
The current, clunky interface takes a
Here's an example array from https://arxiv.org/abs/2001.06307:
>>> import awkward as ak
>>> array = ak._v2.Array(
... [[{"x": 1, "y": [1.1]}, {"x": 2, "y": [2.0, 0.2]}], [], [{"x": 3, "y": [3.0, 0.3, 3.3]}]]
... )
>>> array.show()
[[{x: 1, y: [1.1]}, {x: 2, y: [2, 0.2]}],
 [],
 [{x: 3, y: [3, 0.3, 3.3]}]]
I picked that one so that I could show a figure for its structure.
To turn this into a C++ iterable, we need to make a Generator and a Lookup. (The fact that this is multiple steps is the "clunkiness." That would be easy to wrap up in a single function call, but all of this is probably going to be hidden in @ianna's code anyway, so there's not much need to make it user-friendly.)
>>> import awkward._v2._connect.cling
>>> generator = ak._v2._connect.cling.togenerator(array.layout.form)
>>> lookup = ak._v2._lookup.Lookup(array.layout)
Here's what its code looks like. There's a C++ class for each node in the figure.
>>> generator.generate(print)
namespace awkward {
class ArrayView {
public:
ArrayView(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: start_(start), stop_(stop), which_(which), ptrs_(ptrs) { }
size_t size() const noexcept {
return stop_ - start_;
}
bool empty() const noexcept {
return start_ == stop_;
}
protected:
ssize_t start_;
ssize_t stop_;
ssize_t which_;
ssize_t* ptrs_;
};
}
namespace awkward {
class RecordView {
public:
RecordView(ssize_t at, ssize_t which, ssize_t* ptrs)
: at_(at), which_(which), ptrs_(ptrs) { }
protected:
ssize_t at_;
ssize_t which_;
ssize_t* ptrs_;
};
}
namespace awkward {
class NumpyArray_int64_9vlCxRnT3oc: public ArrayView {
public:
NumpyArray_int64_9vlCxRnT3oc(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef int64_t value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
return reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
}
};
}
namespace awkward {
class NumpyArray_float64_O1I50DFDJTY: public ArrayView {
public:
NumpyArray_float64_O1I50DFDJTY(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef double value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
return reinterpret_cast<double*>(ptrs_[which_ + 1])[start_ + at];
}
};
}
namespace awkward {
class ListArray_BgI9cDJVCAw: public ArrayView {
public:
ListArray_BgI9cDJVCAw(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef NumpyArray_float64_O1I50DFDJTY value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
}
};
}
namespace awkward {
class Record_gGZVr7BbK4: public RecordView {
public:
Record_gGZVr7BbK4(ssize_t at, ssize_t which, ssize_t* ptrs)
: RecordView(at, which, ptrs) { }
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
int64_t x() const noexcept {
return NumpyArray_int64_9vlCxRnT3oc(at_, at_ + 1, ptrs_[which_ + 2], ptrs_)[0];
}
NumpyArray_float64_O1I50DFDJTY y() const noexcept {
return ListArray_BgI9cDJVCAw(at_, at_ + 1, ptrs_[which_ + 3], ptrs_)[0];
}
};
}
namespace awkward {
class RecordArray_AUBUNrlJjX8: public ArrayView {
public:
RecordArray_AUBUNrlJjX8(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef Record_gGZVr7BbK4 value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
return value_type(start_ + at, which_, ptrs_);
}
};
}
namespace awkward {
class ListArray_HTIOlVcPIAU: public ArrayView {
public:
ListArray_HTIOlVcPIAU(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef RecordArray_AUBUNrlJjX8 value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
}
};
}
The only classes with data members are
The class names all include a base64 hash to distinguish, say, a ListArray of NumpyArray from a ListArray of RecordArray. This hash depends on the deep contents of the Awkward node and any parameters they have, as well as any options used during the generation, such as
>>> generator.generate(print, flatlist_as_rvec=True)
namespace awkward {
class ArrayView {
public:
ArrayView(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: start_(start), stop_(stop), which_(which), ptrs_(ptrs) { }
size_t size() const noexcept {
return stop_ - start_;
}
bool empty() const noexcept {
return start_ == stop_;
}
protected:
ssize_t start_;
ssize_t stop_;
ssize_t which_;
ssize_t* ptrs_;
};
}
namespace awkward {
class RecordView {
public:
RecordView(ssize_t at, ssize_t which, ssize_t* ptrs)
: at_(at), which_(which), ptrs_(ptrs) { }
protected:
ssize_t at_;
ssize_t which_;
ssize_t* ptrs_;
};
}
namespace awkward {
class NumpyArray_int64_cRhHHLKAiXY: public ArrayView {
public:
NumpyArray_int64_cRhHHLKAiXY(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef int64_t value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
return reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
}
};
}
namespace awkward {
class ListArray_EhxjPFyWKf8: public ArrayView {
public:
ListArray_EhxjPFyWKf8(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef ROOT::RVec<double> value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
ssize_t which = ptrs_[which_ + 3];
double* content = reinterpret_cast<double*>(ptrs_[which + 1]) + start;
return value_type(content, stop - start);
}
};
}
namespace awkward {
class Record_gtaz2QTTPs: public RecordView {
public:
Record_gtaz2QTTPs(ssize_t at, ssize_t which, ssize_t* ptrs)
: RecordView(at, which, ptrs) { }
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
int64_t x() const noexcept {
return NumpyArray_int64_cRhHHLKAiXY(at_, at_ + 1, ptrs_[which_ + 2], ptrs_)[0];
}
ROOT::RVec<double> y() const noexcept {
return ListArray_EhxjPFyWKf8(at_, at_ + 1, ptrs_[which_ + 3], ptrs_)[0];
}
};
}
namespace awkward {
class RecordArray_39Ik7hb1TXs: public ArrayView {
public:
RecordArray_39Ik7hb1TXs(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef Record_gtaz2QTTPs value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
return value_type(start_ + at, which_, ptrs_);
}
};
}
namespace awkward {
class ListArray_zdlkE7xbFoY: public ArrayView {
public:
ListArray_zdlkE7xbFoY(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef RecordArray_39Ik7hb1TXs value_type;
const std::string parameter(const std::string& parameter) const noexcept {
return "null";
}
value_type at(size_t at) const {
if (at >= stop_ - start_) {
throw std::out_of_range(std::to_string(at) + " is out of range");
}
else {
return (*this)[at];
}
}
value_type operator[](size_t at) const noexcept {
ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
}
};
}
Now the flat lists have types like
Those hashes are not good for naming types, because although they're stable for a given set of nested types and options, a little change in the Awkward code could produce radically different hashes. Users should (as in "ought to") rely on
To turn an Awkward Array into one of these types, use the C++ generated by
>>> generator.entry(flatlist_as_rvec=True)
'awkward::ListArray_zdlkE7xbFoY(0, length, 0, ptrs)'
It is necessary to pass the same options to
>>> length = len(array)
>>> length
3
>>> ptrs = lookup.arrayptrs
>>> ptrs
array([ -1, 94864433929152, 94864433929160, 4,
-1, 3, 8, 10,
-1, 94864441259584, -1, 94864440548752,
94864440548760, 14, -1, 94864440292080])
>>> ptrs.ctypes.data
94864440446608
In this, C++ is always seeing borrowed references, so the
The real magic of what's going on happens in the
Users will probably want all of the
That's all I can think of to say. Comments? Questions? |
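To make the `ptrs_[which_ + k]` indexing in the generated classes easier to follow, here is a pure-Python emulation. It is a sketch under assumptions: the slot meanings are inferred from the generated code and the printed `ptrs` table, raw addresses are replaced by hypothetical `(buffer name, shift)` handles, and `read`, `list_getitem`, `record_x`, `record_y`, and `to_list` are illustrative helpers, not part of Awkward.

```python
# Buffers for [[{x:1,y:[1.1]},{x:2,y:[2.0,0.2]}], [], [{x:3,y:[3.0,0.3,3.3]}]]
buffers = {
    "outer_offsets": [0, 2, 2, 3],
    "x": [1, 2, 3],
    "y_offsets": [0, 1, 3, 6],
    "y": [1.1, 2.0, 0.2, 3.0, 0.3, 3.3],
}

# Same shape as the printed table; a handle (name, shift) replaces an address.
# "stops" is the offsets buffer shifted by one element (address + 8 bytes).
ptrs = [
    -1, ("outer_offsets", 0), ("outer_offsets", 1), 4,  # 0: outer ListArray
    -1, 3, 8, 10,  # 4: RecordArray (slot 5 copied as printed; meaning assumed)
    -1, ("x", 0),                                       # 8: NumpyArray int64
    -1, ("y_offsets", 0), ("y_offsets", 1), 14,         # 10: inner ListArray
    -1, ("y", 0),                                       # 14: NumpyArray float64
]


def read(handle, i):
    """Stand-in for reinterpret_cast<T*>(ptr)[i]."""
    name, shift = handle
    return buffers[name][shift + i]


def list_getitem(start_, which_, at):
    """Mirror of ListArray::operator[]: returns (start, stop, content node)."""
    start = read(ptrs[which_ + 1], start_ + at)
    stop = read(ptrs[which_ + 2], start_ + at)
    return start, stop, ptrs[which_ + 3]


def record_x(at_, which_):
    """Mirror of Record::x(): element at_ of the int64 node."""
    return read(ptrs[ptrs[which_ + 2] + 1], at_)


def record_y(at_, which_):
    """Mirror of Record::y(): the inner list at position at_."""
    start, stop, content = list_getitem(at_, ptrs[which_ + 3], 0)
    return [read(ptrs[content + 1], j) for j in range(start, stop)]


def to_list(length):
    """Walk the entry point ListArray(0, length, 0, ptrs)."""
    out = []
    for i in range(length):
        start, stop, rec = list_getitem(0, 0, i)
        out.append([{"x": record_x(j, rec), "y": record_y(j, rec)}
                    for j in range(start, stop)])
    return out


print(to_list(3))
```

Every step is integer arithmetic on one flat table, which is why the views can stay small and stack-allocated regardless of how deep the type is.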
Oh, I forgot to mention that Awkward's option-type goes to C++ Every Awkward type has a C++ iterable. (Implementing this even revealed a bug in the Numba iterators, which are very similar, apart from generating LLVM IR instead of C++.) |
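On the hashed class names described in the previous comment: the following sketch shows one way such a name could be derived so that it is stable for a given nested type and set of options but changes when either changes. It is not Awkward's actual hashing code; `class_name`, the dictionary-shaped forms, and the choice of SHA-256 are all assumptions made for illustration.

```python
import base64
import hashlib
import json


def class_name(kind, form, **options):
    """Hypothetical helper: derive a C++-safe class name from a form and options."""
    # Serialize deterministically so equal inputs always hash equally.
    payload = json.dumps({"form": form, "options": options}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).digest()
    encoded = base64.b64encode(digest).decode("ascii")
    # Drop '+', '/', and '=' so the token is valid inside an identifier.
    token = "".join(c for c in encoded if c.isalnum())[:11]
    return f"{kind}_{token}"


list_of_numbers = {"class": "ListArray",
                   "content": {"class": "NumpyArray", "primitive": "float64"}}
list_of_records = {"class": "ListArray",
                   "content": {"class": "RecordArray", "fields": ["x", "y"]}}

print(class_name("ListArray", list_of_numbers))
print(class_name("ListArray", list_of_records))
print(class_name("ListArray", list_of_numbers, flatlist_as_rvec=True))
```

Because any change to the serialized form or options changes the hash, names produced this way are only reliable within one version of the generating code, which matches the caution above about not hard-coding them.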
@jpivarski - I run into a
while this one works just fine:
I think, it's related to
|
Reading the text above, it seems you are implementing point 1 from the original comment. When do you expect the first prototype (of Awkward to C++) to be ready for testing? Based on what was written above, I cannot estimate that. |
@ianna, do you know if
While working on this, I discovered that ROOT installed from source does not have C++17 enabled unless you explicitly pass
Although if the problem was that ROOT is compiled without C++17, then the error messages would be different: it would be saying that |
@miranov25, it depends on what you want to test. If you want to do an initial test outside of RDataFrame (even if you plan on using RDataFrame in the future), you could use the examples I gave above right now. They're in Awkward's
If you want to test RDataFrame access, that will depend on @ianna's developments, and I would guess that it would take until next week or so to have estimates.
@ianna, based on what you've seen of the RDataSource infrastructure so far, what would it take to build something that @miranov25 can at least test? |
@jpivarski, @ianna , thank you. O(weeks) is enough, so I will not try a temporary solution
|
Yes, I'm basing the code on ROOT 6.24/06 and C++17. Though, I have a local build following the recipe:
|
@jpivarski - it looks like
diff --git a/localbuild.py b/localbuild.py
index d3626ae5..615bae7e 100755
--- a/localbuild.py
+++ b/localbuild.py
@@ -163,6 +163,8 @@ if args.buildpython:
# localbuild must be in the library path for some operations.
env = dict(os.environ)
reminder = False
+ if env.get("ROOTSYS") is not None and env.get("SHLIB_PATH") is not None:
+ env["LD_LIBRARY_PATH"] = env.get("SHLIB_PATH", "") + ":" + env.get("LD_LIBRARY_PATH", "")
if "awkward" not in env.get("LD_LIBRARY_PATH", ""):
env["LD_LIBRARY_PATH"] = "awkward:" + env.get("LD_LIBRARY_PATH", "")
reminder = True |
@ianna The
Anyway, this is probably moot because if @henryiii fixes editable installation ( |
* Starting a pure Cling DEMO (do not merge).
* A working version of the demo, assuming you have InterpreterUtils.cpp.
* Port API changes from ffda5a3 (PR #1295).
* Cleanup.
* Demonstration of ArrayBuilder casting without std::invoke.
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Fully implemented/tested ArrayBuilder, tests not included.
* Using ArrayBuilder in C++.
* Defined iterators so that for-each loops work.
* And reverse iterators (they're all const).
* Also give ArrayBuilder its 'append' methods.
* Iterators only need 32 bytes and track their value in start_.
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
The discussion started here informed quite a few PRs: Awkward → C++ generation in #1300, #1359, #1372, #1376, #1383, #1398, and now Awkward → RDataFrame in #1374, which is just about ready to be merged. There's still the RDataFrame → Awkward direction, but that will be new PRs. I can close this one now. |
FYI, both
Here is an example of how to use it:
import awkward as ak

def test_data_frame_vec_of_real():
    # Round trip: Awkward Array -> RDataFrame column -> Awkward Array.
    ak_array_in = ak._v2.Array([[1.1, 2.2], [3.3], [4.4, 5.5]])
    data_frame = ak._v2.to_rdataframe({"x": ak_array_in})
    # A flat list of doubles is exposed to RDataFrame as an RVec<double>.
    assert data_frame.GetColumnType("x") == "ROOT::VecOps::RVec<double>"
    ak_array_out = ak._v2.from_rdataframe(
        data_frame,
        column="x",
    )
    assert ak_array_in.to_list() == ak_array_out["x"].to_list()
Please let me know if there are any issues or requests. Thanks! |
Dear @ianna and @jpivarski, thank you. I will try to use it. During the first blind test I got some errors that could be due to my ROOT configuration. Defining the test function:
--> error - stacktrace
The directory mentioned in the error indeed does not exist.
awkward description
|
@miranov25 I took a cursory glance at the RC wheel - it looks like we're not packaging the headers for consumption. I'll look into this |
Thanks! Indeed, the C++ header-only directory is something new and was not needed before. |
@miranov25 - thanks for the quick feedback! Indeed, the header file https://github.com/scikit-hep/awkward/blob/main/src/awkward/_v2/cpp-headers/rdataframe_jagged_builders.h is needed. I think you could try to copy it by hand to the mentioned directory or wait for the next pre-release. Thanks again! |
I tried it before via Github master and it failed (I do not know why). Now I tried it via your link https://github.com/scikit-hep/awkward/blob/main/src/awkward/_v2/cpp-headers/rdataframe_jagged_builders.h and the test ran until the end without problems. |
@miranov25 - thanks! A new pre-release is available. |
@jpivarski and @miranov25 - it would be nice to move the discussion here. This PR is empty for the moment - I've just started on it, but it will be addressing the issue #588.
Any thoughts, ideas, and early requests are welcome!