Pure Cling demo and improvements to C++ JIT infrastructure #1359

Conversation
@lukasheinrich, I should include you here, as this will be relevant for my talk to ATLAS, just as #1295 is. |
@jpivarski, is the C++ code available somewhere? |
The code you gave me is not included in this PR. I haven't figured out how to distribute it, yet, or even how much will be in clangdev itself. The C++ samples I showed you yesterday (in Slack) were generated. I can copy them here. |
Although you can't test it without @vgvassilev's InterpreterUtils.cpp (and maybe I can integrate that into Awkward's compiled code? no, not without depending on Cling...), this version is able to do the following:

>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>>
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>>
>>> f = CppStatements("""
...
... for (int i = 0; i < array.size(); i++) {
... printf("[\\n");
... for (int j = 0; j < array[i].size(); j++) {
... printf(" %g\\n", array[i][j]);
... }
... printf("]\\n");
... }
... """, array=a)
>>>
>>> f(array=a)
[
0
1.1
2.2
]
[
]
[
3.3
4.4
]

Behind the scenes, this is the C++ code that was generated and JIT-compiled:

#include <sys/types.h>
extern "C" int printf(const char*, ...);
namespace awkward {
class ArrayView {
public:
ArrayView(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: start_(start), stop_(stop), which_(which), ptrs_(ptrs) { }
size_t size() const noexcept {
return stop_ - start_;
}
bool empty() const noexcept {
return start_ == stop_;
}
protected:
ssize_t start_;
ssize_t stop_;
ssize_t which_;
ssize_t* ptrs_;
};
}
namespace awkward {
class NumpyArray_float64_bdYFlthWcck: public ArrayView {
public:
NumpyArray_float64_bdYFlthWcck(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef double value_type;
value_type operator[](size_t at) const noexcept {
return reinterpret_cast<double*>(ptrs_[which_ + 1])[start_ + at];
}
};
}
namespace awkward {
class ListArray_ms7qTHCb7Ik: public ArrayView {
public:
ListArray_ms7qTHCb7Ik(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
: ArrayView(start, stop, which, ptrs) { }
typedef NumpyArray_float64_bdYFlthWcck value_type;
value_type operator[](size_t at) const noexcept {
ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
}
};
}
void awkward_function_0(ssize_t awkward_argument_1_length, ssize_t awkward_argument_1_ptrs) {
auto array = awkward::ListArray_ms7qTHCb7Ik(0, awkward_argument_1_length, 0, reinterpret_cast<ssize_t*>(awkward_argument_1_ptrs));
for (int i = 0; i < array.size(); i++) {
printf("[\n");
for (int j = 0; j < array[i].size(); j++) {
printf(" %g\n", array[i][j]);
}
printf("]\n");
}
}

The user code is the last part, in the function with the generated name. Still thinking about the interface... Thanks to @vgvassilev for all the help in getting this to work! (It involved quite a lot of compiler details I didn't know about.) |
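As an aside on the handoff mechanism in the generated code: the `ptrs` table is just a flat array of integers, some of which are raw buffer addresses that get `reinterpret_cast` back to typed pointers. A minimal Python sketch of the same address round-trip, using only the stdlib (hypothetical illustration, not Awkward's actual code):

```python
# Hypothetical illustration of the pointer-table idea: pass a raw buffer
# address as a plain integer, then cast it back to a typed pointer and
# read through it, mimicking reinterpret_cast<double*>(ptrs_[...])[...].
import array
import ctypes

data = array.array("d", [0.0, 1.1, 2.2, 3.3, 4.4])  # contiguous float64 buffer
address, length = data.buffer_info()                 # raw address, element count

ptrs = [address]  # "ptrs" table: plain integers, as an ssize_t* table would carry

view = (ctypes.c_double * length).from_address(ptrs[0])
assert list(view) == [0.0, 1.1, 2.2, 3.3, 4.4]
```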
For future reference (so that the information exists somewhere other than my hard drive), we needed

std::vector<const char *> ClangArgv = {"-Xclang", "-emit-llvm-only",
                                       "-fPIC", "-fno-rtti", "-fno-exceptions"};

to get this to work because the clangdev package that ships in conda-forge has exceptions turned off. Also, it seems to me that this is the place to put |
This is the sign of someone who has been burned by this before 😄 |
Maybe we do not need |
Now we can use ArrayBuilder in C++ (pure Cling; adapted from @ianna's implementation in RDataFrame). For example,

>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>>
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> b = ak._v2.ArrayBuilder()
>>>
>>> f = CppStatements("""
... for (int i = 0; i < array.size(); i++) {
... builder.begin_list();
... for (int j = 0; j < array[i].size(); j++) {
... builder.real(array[i][j]);
... builder.real(array[i][j]);
... }
... builder.end_list();
... }
... """, builder=b, array=a)
>>>
>>> f(array=a, builder=b)
>>>
>>> b.snapshot().show() # all of the numbers have been inserted twice
[0, 0, 1.1, 1.1, 2.2, 2.2],
[],
[3.3, 3.3, 4.4, 4.4]] |
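To make the builder semantics concrete, here is a toy, pure-Python model of the `begin_list`/`real`/`end_list` calls used above (hypothetical; the real ArrayBuilder appends to typed GrowableBuffers, not Python lists):

```python
# Toy model of ArrayBuilder's list-building calls (not the real implementation).
class ToyBuilder:
    def __init__(self):
        self._out = []        # completed top-level entries
        self._current = None  # list being filled, if any

    def begin_list(self):
        self._current = []

    def real(self, x):
        self._current.append(float(x))

    def end_list(self):
        self._out.append(self._current)
        self._current = None

    def snapshot(self):
        return self._out

b = ToyBuilder()
for inner in [[0.0, 1.1, 2.2], [], [3.3, 4.4]]:
    b.begin_list()
    for x in inner:
        b.real(x)  # insert each value twice, as in the demo above
        b.real(x)
    b.end_list()

assert b.snapshot() == [[0.0, 0.0, 1.1, 1.1, 2.2, 2.2], [], [3.3, 3.3, 4.4, 4.4]]
```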
With iterators and ArrayBuilder:

>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>>
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> b = ak._v2.ArrayBuilder()
>>>
>>> f = CppStatements("""
... for (auto inner : array) {
... builder.begin_list();
... for (auto x : inner) {
... builder.append(x);
... builder.append(x);
... }
... builder.end_list();
... }
... """, builder=b, array=a)
>>>
>>> f(array=a, builder=b)
>>>
>>> b.snapshot().show() # all of the numbers have been inserted twice
[0, 0, 1.1, 1.1, 2.2, 2.2],
[],
[3.3, 3.3, 4.4, 4.4]]

@ianna, we should get together to talk about converging our PRs. I've taken your ArrayBuilder and reimplemented it with all the methods and pure C-style function invocation (i.e. without bringing in

From https://stackoverflow.com/a/36606852/1623645, I saw that the minimum we need for a class to be usable in a for-each loop is

These iterators make a stack-bound copy of the ArrayView (32 bytes, better than trying to ensure that an iterator doesn't outlive the array it was made from) and an ...

Although I did just think of a way to make it a bit more efficient. One moment... |
This last commit puts the data that used to be in

I'm not sure how much of the

But more importantly, this last commit establishes an API for the Iterator that we can later optimize. It has exactly the same member content as the ArrayViews themselves (but it has to be separate, because ArrayViews are immutable and Iterators walk). |
Actually, we could also drop the

I won't look at that, though, until I have a large-scale performance test set up. |
@jpivarski - It looks great! I really like it. If it’s ready to be merged, please, go ahead. I’ll either cherry pick from this PR or rebase mine. |
I wonder if dropping an ArrayBuilder shim class makes any difference. I haven't dropped this code from the PR yet. There is a test:

rdf = ROOT.RDataFrame(10).Define("x", "gRandom->Rndm()")
builder = ak._v2.highlevel.ArrayBuilder()
func = ak._v2._connect.rdataframe.from_rdataframe.connect_ArrayBuilder(
compiler, builder
)
compiler(
f"""
uint8_t
my_x_record(double x) {{
{func["beginrecord"]}();
{func["field_fast"]}("one");
{func["real"]}(x);
return {func["endrecord"]}();
}}
"""
)
rdf.Foreach["std::function<uint8_t(double)>"](ROOT.my_x_record, ["x"])
array = builder.snapshot() |
You mean just go through functions, without having a class instance to organize it? I would have thought that generates the same bytecode. Actually, perhaps I ought to play with godbolt and find out what bytecode the function pointer generates.

Meanwhile, I've been thinking about adding an interface in which arrays and |
What if we extend an

The following loop:

@nb.njit
def q_numba(array, builder):
for inner1 in array:
for inner2 in inner1:
for inner3 in inner2:
for inner4 in inner3:
builder.real(inner4)

would become:

@nb.njit
def q_numba(array, builder):
for inner1 in array:
for inner2 in inner1:
for inner3 in inner2:
builder.append_range_real(inner3.begin(), inner3.end()) |
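The gain from a range-append call can be sketched in plain Python by counting boundary crossings (the function names here, `external_real` and `external_append_range`, are stand-ins for the hypothetical `append_range_real` API, not real Awkward functions):

```python
# Compare one "external call" per element vs one per contiguous range.
# The external call is modeled as a counted function.
calls = {"n": 0}
out = []

def external_real(x):
    calls["n"] += 1
    out.append(x)

def external_append_range(values):  # stand-in for the proposed append_range_real
    calls["n"] += 1
    out.extend(values)

data = [[1.0, 2.0, 3.0], [4.0], [], [5.0, 6.0]]

# Per-element: one crossing per number.
calls["n"] = 0
out.clear()
for inner in data:
    for x in inner:
        external_real(x)
per_element_calls = calls["n"]  # 6 crossings

# Per-range: one crossing per innermost list.
calls["n"] = 0
out.clear()
for inner in data:
    external_append_range(inner)
per_range_calls = calls["n"]  # 4 crossings, same output

assert per_element_calls == 6 and per_range_calls == 4
assert out == [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```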
That would reduce the number of external function pointer calls, but it's only applicable to data of primitive type (like

But rather than just work around the slow external function pointer, I'd like to understand why it's so slow in C++ and not in Numba. Numba provides an existence proof that external functions can be called a lot faster than they are with the

FYI: I've also put |
Cc: @sudo-panda, @wlav. I think both of them would be interested to follow this work here. |
Just a guess, but a straight-up function pointer is considerably faster than a relocation through the PLT (I saw

I also saw |
The above doesn't use

There are two approaches being developed in parallel: @ianna is going through RDataFrame, which JIT-compiles strings through ROOT and Cling, in PR #1295. This PR is about Clang Incremental, using a ctypes-enabled library (a prototype, not for production). The performance plots above are all Clang Incremental vs Numba, and both of them access ArrayBuilder through external function pointers. The fact that the performance is so different is a clue that it's being done in a different way: there's some source of overhead in the Clang Incremental setup that isn't present in Numba. |
Hm, I suppose we can dump the LLVM IR for both and compare them to see what gets optimized and how? @sudo-panda, do you want to take a look? |
Yeah! |
There must be more to it than just the IR. A single function pointer dispatch (of type
So Numba is using absolute addressing for the call through the function pointer. I don't have "clang incremental" handy to try, but

What about the loops themselves? As seen in this disasm, Numba enables |
The loops themselves are (equally) fast: #1359 (comment). |
Not convinced that that is comparable ... there the operation " |
Okay, so here's a loop that fills output (assigning to an output array; pure side-effect), and the iteration through the whole dataset takes 2.04 seconds (standard deviation of 0.01 seconds in 10 trials). Iterating over the input without writing output (just summing the values with

To do that, I added another interface to CppStatements, so that it accepts NumPy arrays as well as Awkward Arrays (PR #1372). Quick demo:

>>> import numpy as np
>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>>
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> b = np.zeros(3, dtype=np.float64)
>>>
>>> f = CppStatements("""
...
... for (int i = 0; i < input.size(); i++) {
... output[i] = 0.0;
... for (auto x : input[i]) {
... output[i] += x;
... }
... }
...
... """, input=a.layout.form, output=b.dtype)
>>>
>>> f(input=a, output=b)
>>>
>>> b
array([3.3, 0. , 7.7])

NumPy arrays can be used as output because they're mutable (unlike Awkward Arrays). Unfortunately, they're also unstructured, but I can use them to make a structured object with the ak.from_buffers function. Using the same input array,

array = ak._v2.from_parquet("zlib9-jagged3.parquet")

I make all the buffers I'll need to pass to ak.from_buffers by copying the shape and dtype of the original (not the contents):

offsets1 = np.zeros_like(array.layout.offsets)
offsets2 = np.zeros_like(array.layout.content.offsets)
offsets3 = np.zeros_like(array.layout.content.content.offsets)
data = np.zeros_like(array.layout.content.content.content.data)

Here's a sequence of statements that puts the right values into these buffers:

f = CppStatements("""
offsets1[0] = 0;
offsets2[0] = 0;
offsets3[0] = 0;
int i1 = 0;
int i2 = 0;
int i3 = 0;
int i4 = 0;
for (auto inner1 : array) {
offsets1[i1 + 1] = offsets1[i1] + inner1.size();
i1++;
for (auto inner2 : inner1) {
offsets2[i2 + 1] = offsets2[i2] + inner2.size();
i2++;
for (auto inner3 : inner2) {
offsets3[i3 + 1] = offsets3[i3] + inner3.size();
i3++;
for (auto inner4 : inner3) {
data[i4] = inner4;
i4++;
}
}
}
}
""", array=array, offsets1=offsets1, offsets2=offsets2, offsets3=offsets3, data=data)

The following line runs in 2.04 seconds, consistently:

f(array=array, offsets1=offsets1, offsets2=offsets2, offsets3=offsets3, data=data)

To put the NumPy arrays together into an Awkward Array with the original structure, we need a Form, labeled by form_keys:

output = ak._v2.from_buffers(
ak._v2.forms.from_json(
"""
{
"class": "ListOffsetArray",
"offsets": "i32",
"content": {
"class": "ListOffsetArray",
"offsets": "i32",
"content": {
"class": "ListOffsetArray",
"offsets": "i32",
"content": {
"class": "NumpyArray",
"primitive": "float32",
"form_key": "data"
},
"form_key": "offsets3"
},
"form_key": "offsets2"
},
"form_key": "offsets1"
}
"""
),
len(array),
{
"offsets1": offsets1,
"offsets2": offsets2,
"offsets3": offsets3,
"data": data,
},
buffer_key="{form_key}",
)

And now

The time for this sequence of C++ statements, 2.04 seconds, is close to a run without output of 1.78 seconds, and it's very different from the 218 seconds of writing to the ArrayBuilder through external pointers. Behind those external pointers, the ArrayBuilder does some dynamically typed stuff, so we could also compare it to the 69 seconds of only filling the inmost

And if you're thinking it's the ArrayBuilder itself, not the external pointers in Clang Incremental, filling just the inmost |
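The offsets-filling loop above can be mirrored in pure Python: in a ListOffsetArray layout, each nesting level stores an offsets buffer where entry i+1 minus entry i is the length of list i. A minimal sketch for one level of nesting, using plain lists instead of NumPy buffers (an illustration of the layout, not Awkward's code):

```python
def to_offsets(nested):
    """Flatten one level of nesting into (offsets, content),
    mirroring a ListOffsetArray layout."""
    offsets = [0]
    content = []
    for inner in nested:
        content.extend(inner)
        offsets.append(offsets[-1] + len(inner))
    return offsets, content

def from_offsets(offsets, content):
    """Inverse: rebuild the nested lists from offsets + content,
    the way ak.from_buffers interprets a ListOffsetArray."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

nested = [[0.0, 1.1, 2.2], [], [3.3, 4.4]]
offsets, content = to_offsets(nested)
assert offsets == [0, 3, 3, 5]          # list i has length offsets[i+1] - offsets[i]
assert from_offsets(offsets, content) == nested
```

Applying `to_offsets` recursively at each level produces exactly the `offsets1`/`offsets2`/`offsets3`/`data` buffers that the C++ statements above fill in place.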
Well, color me puzzled. I'll need to get my hands on "clang incremental" then to try it out, or someone should post the disasm. Just as some more data points, I compared

Fastest by far is passing a function pointer on the stack (i.e. as an argument; this outperforms Numba, too, but obviously no surprise, as this removes the load); slowest is when JITted in but through a relocation. But even then, the spread is only ~16%. |
@vgvassilev, I hope it's alright: I've posted the code you sent me on https://gist.github.com/jpivarski/aad015ac893a0d9cca6c6f42a90a9505 with instructions for reproducing the above. I think all the warnings that I put there (and the fact that it's a gist, not a regular repo) should prevent anyone from taking it too seriously as production code. |
I am fine posting that code. |
How to get the parquet file? Closest I seem to get when following the white rabbit is this script, but it still needs inputs that don't seem to be available? |
Ah, found it in the thread above. :) |
I can't get the Numba example to work yet:
Might be obvious, but I'm still new to Numba. However, when checking with top (yay for simple tools :) ), I find that both slow C++ examples are massive memory hogs: the 6.9s one runs up to 8GB, then jumps to ~17GB at the end. The other increases slowly up to ~14GB, then jumps to ~19GB at the end. I still see a rough 3x difference just like you do (more like 2.6x, from 57s and 22s, but I don't think I restricted the data set), and it can't be swapping, so it smells to me like superfluous copies are being made, with the slower one, which uses more memory, making more copies? Just putting that out there in case it rings a bell. Will check in more detail once I get the Numba code to work ... |
Sorry about the issue with the Numba examples. Run this:

ak._v2.numba.register()

It's because everything in the

With that, Numba should recognize these

I would not expect any changes in memory when going through the external function pointer, other than the fact that the ArrayBuilder contains GrowableBuffers that fill up like

If there's a large memory use when |
Thanks! With that, the Numba example works! It doesn't spike in memory like the other ones. Yes, growing with 1.5x will cause lots of memory copying all around and is the most likely culprit. Yes, the number of jumps slows down logarithmically with size, but the amount of copying needed on each jump grows exponentially (okay

A q&d profiling with perf checks the whole program, not just the timed portion, but total time scaling is pretty close to the load/store count scaling. A q&d profiling with operf, also whole program, unequivocally puts the blame on

Is there any way you can use the semantic equivalent of

Aside, mixing Numba and the clang interpreter in the same file results in a segfault in |
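The 1.5x growth pattern under discussion can be sketched in a few lines: each capacity jump copies the whole buffer, so jumps become rarer but each one moves more data, and the total number of copied elements stays proportional to n. A toy model (not the real GrowableBuffer):

```python
# Toy 1.5x-growth buffer: counts how many elements get moved by reallocations.
class Growable:
    def __init__(self, initial=8):
        self.capacity = initial
        self.length = 0
        self.data = [0.0] * initial
        self.copied = 0  # total elements moved during reallocations
        self.jumps = 0   # number of reallocations

    def append(self, x):
        if self.length == self.capacity:
            self.capacity = int(self.capacity * 1.5)
            new = [0.0] * self.capacity
            new[:self.length] = self.data
            self.copied += self.length
            self.jumps += 1
            self.data = new
        self.data[self.length] = x
        self.length += 1

g = Growable()
n = 100_000
for i in range(n):
    g.append(float(i))

assert g.length == n
assert g.data[:5] == [0.0, 1.0, 2.0, 3.0, 4.0]
# The geometric series of copy sizes sums to a constant factor times n
# (about 3n for a 1.5x growth factor), not n log n.
assert g.copied < 4 * n
```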
Since the ArrayBuilder is used in both the Numba and C++ demos, they should show the same memory growth and copying, though the C++ demo still takes longer than the Numba one. As for the copying, we're slowly moving to a model that requires less copying: since

#1359 (comment) is the semantic equivalent of "reserve": the arrays that get filled in this example are preallocated with a given size. In general use-cases, we won't know the size, so we can't always use this technique. That's the one that's 2.04 seconds instead of 218 seconds (with 1.78 seconds to iterate without producing output).

To really zero in on the external function pointer, instead of ArrayBuilder's memory usage, how about

import numpy as np
import awkward as ak
from awkward._v2._connect.cling import CppStatements
builder = ak._v2.ArrayBuilder()
f = CppStatements("""
for (int i = 0; i < 1000000000; i++) {
int64_t result;
builder.length(&result);
}
""", builder=builder)
f(builder=builder)

versus

dummy = np.array([0])
g = CppStatements("""
for (int i = 0; i < 1000000000; i++) {
dummy[0] = dummy[0] * 2 - 1; // complicated enough that the optimizer can't remove it?
}
""", dummy=dummy)
g(dummy=dummy)

because

The Numba equivalent of

import numba as nb
import awkward as ak
ak._v2.numba.register()
@nb.njit
def f(builder):
for i in range(1000000000):
len(builder)
f(builder)

takes 3.0 seconds, which is noticeably less than 8.1 seconds. That's the odd thing: they're both external function calls, but Numba is faster. I would have thought they'd be the same. It's an indication that something is slowing down Clang Incremental's use of external function pointers.
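The same isolate-the-call-mechanism measurement can be sketched self-containedly from Python: call a libc function through a ctypes function pointer in a tight loop and time it against a pure-Python equivalent. This is only a rough analogy to the Clang Incremental and Numba setups (and assumes a Unix-like system where `CDLL(None)` resolves libc):

```python
# Crude illustration of measuring external-call overhead in isolation.
import ctypes
import time

libc = ctypes.CDLL(None)            # assumes Unix-like: resolves symbols in-process
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

def native(x):
    return libc.abs(x)              # crosses into C through a function pointer

def pure(x):
    return -x if x < 0 else x       # stays in the interpreter

assert native(-5) == 5 and pure(-5) == 5

n = 100_000
t0 = time.perf_counter()
for i in range(n):
    native(-i)
t_native = time.perf_counter() - t0

t0 = time.perf_counter()
for i in range(n):
    pure(-i)
t_pure = time.perf_counter() - t0
# Absolute numbers vary by machine; the point is that the loop body does
# nothing but the call, as in the one-billion-iteration loops above.
```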
I noticed that, too, and forgot to report it to @vgvassilev! (I didn't narrow it down, though, just noticed that I needed separate conda environments for Numba and clangdev.) Thanks for doing so. |
Not sure what

As for the first and last, on my system, they are basically the same speed, both ~4.7s. |
Aside, I just took the examples as-is, but if I add a warmup call, then Clang easily outperfoms by ~20%. |
I forgot to copy in

dummy = np.array([0])

so you were right (integers, though). I fixed the comment above. What's the "warmup"? Repeated application? Because with

>>> s = time.time(); f(builder=builder); time.time() - s
8.109642505645752
>>> s = time.time(); f(builder=builder); time.time() - s
8.104896306991577
>>> s = time.time(); f(builder=builder); time.time() - s
8.123145818710327
>>> s = time.time(); f(builder=builder); time.time() - s
8.121228694915771
>>> s = time.time(); f(builder=builder); time.time() - s
8.182328701019287

and with

>>> s = time.time(); f(builder=builder); time.time() - s
3.039454460144043
>>> s = time.time(); f(builder=builder); time.time() - s
3.0041792392730713
>>> s = time.time(); f(builder=builder); time.time() - s
3.0177931785583496
>>> s = time.time(); f(builder=builder); time.time() - s
3.0152335166931152
>>> s = time.time(); f(builder=builder); time.time() - s
3.0572359561920166

This is hitting the same external pointer one billion times per call but not doing anything beyond that. If you're getting different results on repeated application, then we might have different hardware, and that would be a clue to the origin of this effect. On the other hand, if this effect is something that even changes direction with hardware (Clang being eventually faster on your computer by 20%), then it's probably not something to deep-dive into, since we wouldn't be able to control for it in a product. Regardless of Clang or Numba, external pointers should be avoided if possible. |
Yes, is what I meant. It's not directly clear to me what Clang is still doing on lookup, although the code as written doesn't look free, parsing is already done at that point. Numba, however, doesn't JIT until first use (where types become known). As for hardware, I'm using a Linux box with an AMD EPYC 7702P and 1TB of RAM.
Yes, absolutely agree with that generally speaking. For cppyy/numba, that is what I want to achieve, too: place the IR from Cling C++ into the IR from Numba Python, effectively inlining it. Not only gets rid of the indirections, but also adds further optimization opportunities. |
One more quick note: yes, with an

Edit: no, I messed up. My example code didn't take
Can you set me up with a minimal reproducer? |
@vgvassilev This seems to be the most minimal example (note that there's another bug with the arguments of the C++ function). The basic point is to do "something" with Numba which creates its setup (doesn't need to run the full JIT by having the
This is based on the Cling standalone you gave me, @vgvassilev.