Pure Cling demo and improvements to C++ JIT infrastructure #1359

Merged
merged 12 commits into from
Mar 15, 2022


jpivarski
Member

This is based on the Cling standalone you gave me, @vgvassilev.

@codecov

codecov bot commented Mar 10, 2022

Codecov Report

Merging #1359 (b8fe8e6) into main (b2fd2be) will increase coverage by 0.04%.
The diff coverage is 51.21%.

Impacted Files Coverage Δ
src/awkward/_v2/_connect/cling.py 0.00% <0.00%> (ø)
src/awkward/_v2/_connect/pyarrow.py 85.74% <0.00%> (ø)
src/awkward/_v2/_lookup.py 97.50% <0.00%> (ø)
src/awkward/_v2/_prettyprint.py 66.09% <0.00%> (+2.29%) ⬆️
src/awkward/_v2/_typetracer.py 69.14% <0.00%> (ø)
src/awkward/_v2/forms/form.py 90.06% <0.00%> (ø)
src/awkward/_v2/identifier.py 55.69% <0.00%> (ø)
src/awkward/_v2/index.py 83.59% <0.00%> (ø)
src/awkward/_v2/operations/convert/ak_from_jax.py 75.00% <0.00%> (ø)
src/awkward/_v2/operations/convert/ak_to_jax.py 75.00% <0.00%> (ø)
... and 145 more

@jpivarski
Member Author

@lukasheinrich, I should include you here, as this will be relevant for my talk to ATLAS, just as #1295 is.

@vgvassilev

@jpivarski, is the C++ code available somewhere?

@jpivarski
Member Author

The code you gave me is not included in this PR. I haven't figured out how to distribute it yet, or even how much of it will be in clangdev itself.

The C++ samples I showed you yesterday (in Slack) were generated. I can copy them here.

@jpivarski
Member Author

Although you can't test it without @vgvassilev's InterpreterUtils.cpp (and maybe I can integrate that into Awkward's compiled code? no, not without depending on Cling...), this version is able to do the following:

>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>> 
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> 
>>> f = CppStatements("""
... 
... for (int i = 0; i < array.size(); i++) {
...   printf("[\\n");
...   for (int j = 0; j < array[i].size(); j++) {
...     printf("  %g\\n", array[i][j]);
...   }
...   printf("]\\n");
... }
... """, array=a)
>>> 
>>> f(array=a)
[
  0
  1.1
  2.2
]
[
]
[
  3.3
  4.4
]

The array=a passed to the CppStatements constructor could instead be a Form, like array=a.layout.form (or more likely, something derived from a Parquet schema).

Behind the scenes, this is the C++ code that was generated and JIT-compiled:

#include<sys/types.h>
extern "C" int printf(const char*, ...);

namespace awkward {
  class ArrayView {
  public:
    ArrayView(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : start_(start), stop_(stop), which_(which), ptrs_(ptrs) { }

    size_t size() const noexcept {{
      return stop_ - start_;
    }}

    bool empty() const noexcept {{
      return start_ == stop_;
    }}

  protected:
    ssize_t start_;
    ssize_t stop_;
    ssize_t which_;
    ssize_t* ptrs_;
  };
}
namespace awkward {
  class NumpyArray_float64_bdYFlthWcck: public ArrayView {
  public:
    NumpyArray_float64_bdYFlthWcck(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef double value_type;

    

    value_type operator[](size_t at) const noexcept {
      return reinterpret_cast<double*>(ptrs_[which_ + 1])[start_ + at];
    }
  };
}
namespace awkward {
  class ListArray_ms7qTHCb7Ik: public ArrayView {
  public:
    ListArray_ms7qTHCb7Ik(ssize_t start, ssize_t stop, ssize_t which, ssize_t* ptrs)
      : ArrayView(start, stop, which, ptrs) { }

    typedef NumpyArray_float64_bdYFlthWcck value_type;

    

    value_type operator[](size_t at) const noexcept {
      ssize_t start = reinterpret_cast<int64_t*>(ptrs_[which_ + 1])[start_ + at];
      ssize_t stop = reinterpret_cast<int64_t*>(ptrs_[which_ + 2])[start_ + at];
      return value_type(start, stop, ptrs_[which_ + 3], ptrs_);
    }
  };
}

void awkward_function_0(ssize_t awkward_argument_1_length, ssize_t awkward_argument_1_ptrs) {
auto array = awkward::ListArray_ms7qTHCb7Ik(0, awkward_argument_1_length, 0, reinterpret_cast<ssize_t*>(awkward_argument_1_ptrs));


for (int i = 0; i < array.size(); i++) {
  printf("[\n");
  for (int j = 0; j < array[i].size(); j++) {
    printf("  %g\n", array[i][j]);
  }
  printf("]\n");
}

}

The user code is the last part, in the function with the generated name. Still thinking about interface...
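As a rough illustration of what the generated view classes do, here is a pure-Python model. It is only a sketch: plain lists stand in for the raw buffers reached through `ptrs_`, the class names `NumpyView` and `ListView` are made up for this example, and the bounds checks are an addition that the generated C++ does not have.

```python
# Illustrative model only: Python lists stand in for the raw buffers that the
# generated C++ reaches through ptrs_. Class names here are hypothetical.

class NumpyView:
    """Window [start, stop) over a flat data buffer."""

    def __init__(self, start, stop, data):
        self.start, self.stop, self.data = start, stop, data

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, at):
        if not 0 <= at < len(self):  # added for Python; the C++ does no check
            raise IndexError(at)
        return self.data[self.start + at]


class ListView:
    """Window [start, stop) over starts/stops buffers that delimit sublists."""

    def __init__(self, start, stop, starts, stops, data):
        self.start, self.stop = start, stop
        self.starts, self.stops, self.data = starts, stops, data

    def __len__(self):
        return self.stop - self.start

    def __getitem__(self, at):
        if not 0 <= at < len(self):
            raise IndexError(at)
        # Look up where the sublist begins and ends, then hand back a view.
        inner_start = self.starts[self.start + at]
        inner_stop = self.stops[self.start + at]
        return NumpyView(inner_start, inner_stop, self.data)


# Buffers describing [[0.0, 1.1, 2.2], [], [3.3, 4.4]]
data = [0.0, 1.1, 2.2, 3.3, 4.4]
starts = [0, 3, 3]
stops = [3, 3, 5]

array = ListView(0, 3, starts, stops, data)
nested = [[array[i][j] for j in range(len(array[i]))] for i in range(len(array))]
print(nested)
```

No list data is copied when indexing: `array[i]` only creates a small view object holding two integers and references to the shared buffers.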

Thanks to @vgvassilev for all the help in getting this to work! (It involved quite a lot of compiler details I didn't know about.)

@jpivarski
Member Author

For future reference (so that the information exists somewhere other than my hard drive), we needed

  std::vector<const char *> ClangArgv = {"-Xclang", "-emit-llvm-only",
                                         "-fPIC", "-fno-rtti", "-fno-exceptions"};

to get this to work because the clangdev package that ships in conda-forge has exceptions turned off.

Also, it seems to me that this is the place to put "-O3", unless it can also be controlled by a #pragma.

@agoose77
Collaborator

agoose77 commented Mar 10, 2022

> For future reference (so that the information exists somewhere other than my hard drive)

This is the sign of someone who has been burned by this before 😄

@vgvassilev

> For future reference (so that the information exists somewhere other than my hard drive), we needed
>
>     std::vector<const char *> ClangArgv = {"-Xclang", "-emit-llvm-only",
>                                            "-fPIC", "-fno-rtti", "-fno-exceptions"};
>
> to get this to work because the clangdev package that ships in conda-forge has exceptions turned off.
>
> Also, it seems to me that this is the place to put "-O3", unless it can also be controlled by a #pragma.

Maybe we do not need -fPIC. And sure, we can add -O3, though I think -O2 is safer.

@jpivarski
Member Author

Now we can use ArrayBuilder in C++ (pure Cling; adopted from @ianna's implementation in RDataFrame).

For example,

>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>> 
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> b = ak._v2.ArrayBuilder()
>>> 
>>> f = CppStatements("""
... for (int i = 0; i < array.size(); i++) {
...   builder.begin_list();
...   for (int j = 0; j < array[i].size(); j++) {
...     builder.real(array[i][j]);
...     builder.real(array[i][j]);
...   }
...   builder.end_list();
... }
... """, builder=b, array=a)
>>> 
>>> f(array=a, builder=b)
>>> 
>>> b.snapshot().show()   # all of the numbers have been inserted twice
[[0, 0, 1.1, 1.1, 2.2, 2.2],
 [],
 [3.3, 3.3, 4.4, 4.4]]

@jpivarski
Member Author

With iterators and ArrayBuilder append methods, now the following works:

>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>> 
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> b = ak._v2.ArrayBuilder()
>>> 
>>> f = CppStatements("""
... for (auto inner : array) {
...   builder.begin_list();
...   for (auto x : inner) {
...     builder.append(x);
...     builder.append(x);
...   }
...   builder.end_list();
... }
... """, builder=b, array=a)
>>> 
>>> f(array=a, builder=b)
>>> 
>>> b.snapshot().show()   # all of the numbers have been inserted twice
[[0, 0, 1.1, 1.1, 2.2, 2.2],
 [],
 [3.3, 3.3, 4.4, 4.4]]

@ianna, we should get together to talk about converging our PRs. I've taken your ArrayBuilder and reimplemented it with all the methods and pure C style function invocation (i.e. without bringing in std::invoke). The iterators I've implemented are rather different from yours—they seem to be a lot more minimal. I don't know if there will be any unexpected effects from calling ROOT's ProcessLine, but if we're going to point at free-standing header files like src/awkward/_v2/_connect/_cling/is_iterable.h, we'll have to use Python resources to find them, or else it likely won't work in the package we distribute (they'll be in all sorts of different locations, and I don't know if ROOT will know what to take as the right local path). I'm not sure why we need an is_iterable compile-time function.

From https://stackoverflow.com/a/36606852/1623645, I saw that the minimum we need for a class to be usable in a for-each loop is begin and end methods that return a type with operator*, operator++, and operator!=, so that's what the Iterator<ARRAY, VALUE> is. (Yeah, I used templates to define it when I could have generated it with Python, but in this case, the template is easier because it's just setting the array type, nothing more. Also, if somebody looks at the generated C++, they won't have to scroll past a bunch of Python-generated iterator classes; the C++ templates are written once, though the effective code-generation is the same.)

These iterators make a stack-bound copy of the ArrayView (32 bytes, better than trying to ensure that an iterator doesn't outlive the array it was made from) and an at integer (8 more bytes, total of 40), which is the only thing that gets incremented in the loop. We can't, in general, represent the iterator with a single pointer and increment it because not all Awkward types have a single buffer of data to point to.

...

Although I did just think of a way to make it a bit more efficient. One moment...

@jpivarski
Member Author

This last commit puts the data that used to be in at_ into start_ and constructs ArrayViews on the fly to produce values. So now Iterators are 32 bytes, rather than 40. The operator!= should be faster, too, because most of the comparisons will be in the start_.

I'm not sure how much of the operator[](size_t at) with at=0 the compiler will optimize, but it wouldn't be too ridiculous to add a code-path for operator_at_zero() (no arguments). This is auto-generated code, after all.

But more importantly, this last commit establishes an API for the Iterator that we can later optimize. It has exactly the same member content as the ArrayViews themselves (but it has to be separate because ArrayViews are immutable and Iterators walk).
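To make the iterator design concrete, here is a small Python sketch of the idea described above: an immutable view plus a mutable cursor with the same fields, where the cursor's `start` is the only thing that moves. The names `View`, `Cursor`, `deref`, `advance`, and `differs` are invented for this illustration (they stand in for `operator*`, `operator++`, and `operator!=`), and comparing only `start` in `differs` is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """Immutable window [start, stop) over a flat buffer (stand-in for ArrayView)."""
    start: int
    stop: int
    data: tuple

    def __getitem__(self, at):
        return self.data[self.start + at]

    def begin(self):
        return Cursor(self.start, self.stop, self.data)

    def end(self):
        return Cursor(self.stop, self.stop, self.data)


class Cursor:
    """Mutable counterpart of View: same fields, but start walks forward."""

    def __init__(self, start, stop, data):
        self.start, self.stop, self.data = start, stop, data

    def deref(self):
        # Build a view on the fly and read its element 0 (role of operator*).
        return View(self.start, self.stop, self.data)[0]

    def advance(self):
        self.start += 1  # role of operator++

    def differs(self, other):
        return self.start != other.start  # role of operator!=


def for_each(view):
    """Spell out what a C++ range-based for loop does with begin() and end()."""
    out = []
    it, end = view.begin(), view.end()
    while it.differs(end):
        out.append(it.deref())
        it.advance()
    return out


values = for_each(View(1, 4, (10.0, 11.0, 12.0, 13.0, 14.0)))
print(values)
```

Because the cursor carries no separate `at` counter, it has exactly as many fields as the view itself, which is the point of the change described in this comment.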

@jpivarski
Member Author

Actually, we could also drop the which from all the ArrayViews, RecordViews, and Iterators by moving the ptrs when we descend into nested structure. ArrayViews and Iterators could be 24 bytes and RecordViews could be 16 bytes.

I won't look at that, though, until I have a large-scale performance test set up.
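The byte counts quoted in these comments follow from counting 8-byte fields. The check below is a sketch that assumes a 64-bit platform where `ssize_t` and pointers are both 8 bytes; the field list for RecordView after dropping `which` (`at`, `ptrs`) is an assumption made for this example.

```python
import struct

# "=q" is a standard-size 8-byte signed integer. Assumption: on the target
# 64-bit platform, ssize_t and pointers both occupy 8 bytes.
WORD = "q"


def size_of(n_fields):
    """Size in bytes of a struct made of n_fields 8-byte members."""
    return struct.calcsize("=" + WORD * n_fields)


sizes = {
    "ArrayView (start, stop, which, ptrs)": size_of(4),
    "Iterator, first version (ArrayView + at)": size_of(5),
    "ArrayView without which (start, stop, ptrs)": size_of(3),
    "RecordView without which (assumed: at, ptrs)": size_of(2),
}

for name, nbytes in sizes.items():
    print(f"{name}: {nbytes} bytes")
```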

@jpivarski changed the title from "Starting a pure Cling DEMO (do not merge)." to "Pure Cling demo and improvements to C++ JIT infrastructure" on Mar 15, 2022
@jpivarski
Member Author

Thinking about optimization derailed me. The main thing is, we have to figure out a merge strategy between this and #1295. If this one merges with main first, #1295 will need to adjust before it can merge.

@agoose77 marked this pull request as ready for review March 15, 2022 06:53
@agoose77 marked this pull request as draft March 15, 2022 06:54
@ianna
Collaborator

ianna commented Mar 15, 2022

> Thinking about optimization derailed me. The main thing is, we have to figure out a merge strategy between this and #1295. If this one merges with main first, #1295 will need to adjust before it can merge.

@jpivarski - It looks great! I really like it. If it’s ready to be merged, please, go ahead. I’ll either cherry pick from this PR or rebase mine.

@jpivarski marked this pull request as ready for review March 15, 2022 13:10
@ianna
Collaborator

ianna commented Mar 16, 2022

I wonder whether dropping the ArrayBuilder shim class makes any difference:
https://github.com/scikit-hep/awkward-1.0/pull/1295/files#diff-a48bdafeba779feab0a7581cdabf4d98ad62952bb52eec7ae76870ddda2f6641R131-R246

I haven't dropped this code from the PR yet. There is a test:

    rdf = ROOT.RDataFrame(10).Define("x", "gRandom->Rndm()")

    builder = ak._v2.highlevel.ArrayBuilder()
    func = ak._v2._connect.rdataframe.from_rdataframe.connect_ArrayBuilder(
        compiler, builder
    )
    compiler(
        f"""
    uint8_t
    my_x_record(double x) {{
        {func["beginrecord"]}();
        {func["field_fast"]}("one");
        {func["real"]}(x);
        return {func["endrecord"]}();
    }}
    """
    )

    rdf.Foreach["std::function<uint8_t(double)>"](ROOT.my_x_record, ["x"])

    array = builder.snapshot()

@jpivarski
Member Author

You mean just go through functions, without having a class instance to organize it? I would have thought that generates the same bytecode.

Actually, perhaps I ought to play with godbolt and find out what bytecode the function pointer generates.

Meanwhile, I've been thinking about adding an interface in which arrays and std::vector in the user code can be exported as a container for ak._v2.from_buffers. Using that would require more expertise from the user, but it's the fastest way to generate an array. (I need to demonstrate on Friday that it's possible to get data into and out of ATLAS at a reasonable rate, even if only by experts.)

@ianna
Collaborator

ianna commented Mar 16, 2022

What if we extend an ArrayBuilder API with an append_range<T>? We'd not need to update its builder, just copy the range of the n values (or append a contiguous buffer) to its GrowableBuffer.

The following loop:

@nb.njit
def q_numba(array, builder):
    for inner1 in array:
        for inner2 in inner1:
            for inner3 in inner2:
                for inner4 in inner3:
                    builder.real(inner4)

would become:

@nb.njit
def q_numba(array, builder):
    for inner1 in array:
        for inner2 in inner1:
            for inner3 in inner2:
                builder.append_range_real(inner3.begin(), inner3.end())

@jpivarski
Member Author

That would reduce the number of external function pointer calls, but it's only applicable to data of primitive type (like append_range_real). Lists in HEP tend to be small, such as leptons with an average number of 1.0 or even 0.1 per event: calling a range-fill for a lot of empty events can actually be worse than a per-item fill in a loop, which makes no calls at all for an empty list. But then again, a user could add an if statement to avoid calling it on an empty list, so yeah: having that range-fill function and users who know how to use it would be a net positive.
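The trade-off can be seen by counting calls. The sketch below uses a toy `CountingBuilder` that is purely hypothetical (it is not the ArrayBuilder API, and `append_range_real` is only the name proposed in this thread); each method call stands in for one trip through an external function pointer.

```python
class CountingBuilder:
    """Toy builder: every method call models one external function call."""

    def __init__(self):
        self.data = []
        self.calls = 0

    def real(self, x):
        self.calls += 1
        self.data.append(x)

    def append_range_real(self, values):  # hypothetical range-fill
        self.calls += 1
        self.data.extend(values)


def fill_per_item(lists, builder):
    for inner in lists:
        for x in inner:
            builder.real(x)


def fill_per_range(lists, builder):
    for inner in lists:
        builder.append_range_real(inner)


def fill_per_range_guarded(lists, builder):
    for inner in lists:
        if inner:  # skip the call entirely for empty lists
            builder.append_range_real(inner)


def count_calls(fill, lists):
    builder = CountingBuilder()
    fill(lists, builder)
    return builder.calls, builder.data


# Mostly empty lists (about 0.1 items per list) versus a few longer lists.
sparse = [[] for _ in range(9)] + [[1.5]]
dense = [[0.0, 1.1, 2.2, 3.3], [4.4, 5.5, 6.6, 7.7]]

for name, lists in [("sparse", sparse), ("dense", dense)]:
    for fill in (fill_per_item, fill_per_range, fill_per_range_guarded):
        print(name, fill.__name__, count_calls(fill, lists)[0])
```

On the sparse input, the unguarded range-fill makes ten calls where the per-item loop makes one; the guard brings it back down to one. On the dense input, the range-fill makes two calls instead of eight.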

But rather than just work around the slow external function pointer, I'd like to understand why it's so slow in C++ and not in Numba. Numba provides an existence proof that external functions can be called a lot faster than they are with the reinterpret_cast. Something is a bottleneck in the C++. And for all we know at present, it might only be the incremental parser I'm using here, not the one you get in ROOT.

FYI: I've also put zlib9-jagged3.parquet in a publicly accessible place, for repeatability: https://pivarski-princeton.s3.amazonaws.com/chep-2021-jagged-jagged-jagged/zlib9-jagged3.parquet

@vgvassilev

Cc: @sudo-panda, @wlav. I think both of them would be interested to follow this work here.

@wlav

wlav commented Mar 16, 2022

> But rather than just work around the slow external function pointer, I'd like to understand why it's so slow in C++ and not in Numba.

Just a guess, but a straight-up function pointer is considerably faster than a relocation through the PLT (I saw -fPIC way up in this thread) in a tight loop, especially if you have lots of relocatable functions and two or more of them used in the inner loop are some distance in the table: the function pointer allows prediction, the jump through the PLT does not.

I also saw ProcessLine mentioned, which I think enables the null checker pass. I actually don't know whether it applies to function pointers (Vassil?) but it's trouble in tight loops for two reasons: 1) it introduces a branch, and 2) it kills pattern-based optimizations (more of a problem for arrays than single pointers, though).

@jpivarski
Member Author

The above doesn't use ProcessLine: this is directly through Clang Incremental, not ROOT.

There are two approaches being developed in parallel: @ianna is going through RDataFrame, which JIT-compiles strings through ROOT and Cling, in PR #1295.

This PR is about Clang Incremental, using a ctypes-enabled library (a prototype, not for production). The performance plots above are all Clang Incremental vs Numba, and both of them access ArrayBuilder through external function pointers. The fact that the performance is so different is a clue that it's being done in a different way: there's some source of overhead in the Clang Incremental set-up that isn't present in Numba.

@vgvassilev

Hm, I suppose we can dump the llvm IR for both and compare them to see what gets optimized and how? @sudo-panda, do you want to take a look?

@sudo-panda

> Hm, I suppose we can dump the llvm IR for both and compare them to see what gets optimized and how? @sudo-panda, do you want to take a look?

Yeah!

@wlav

wlav commented Mar 16, 2022

There must be more to it than just the IR. A single function pointer dispatch (of type double(*)(double)) from Numba looks like this after disasm:

    vmovsd   6(%rsp), %xmm0
    movabsq  $_ZN8__main__11go_fast_241B42c8tJTIeFCjyCbUFRqqOAK_2f6h0kCng1maAA_3d_3dEd, %rax
    leaq     8(%rsp), %rdi
    movq     $0, 8(%rsp)
    callq    *%rax
    vmovsd   8(%rsp), %xmm0

So Numba is using absolute addressing for the call through the function pointer. I don't have "clang incremental" handy to try, but clang++ actually generates relative addressing starting from -O1 (i.e. no load and using a single instruction, albeit with %rax pushed on the stack for some reason, which is gone by -O2). In other words, Numba is actually not being optimally efficient here ...

What about the loops themselves? As seen in this disasm, Numba enables -mavx by default. I don't see that argument listed in the set above in this thread for Clang?

@jpivarski
Member Author

jpivarski commented Mar 16, 2022

The loops themselves are (equally) fast: #1359 (comment).

@wlav

wlav commented Mar 16, 2022

Not convinced that that is comparable ... there the operation "+= on native type double" is fully visible (i.e. the compiler can deduce it has no side-effects). It can thus e.g. re-order the loops if that improves the memory layout (needs -O3, though). With the function pointer, it has to assume a worst case of the call having side-effects. I'd figure that with -mavx in both cases the inner loop could be unrolled and if so, I'd be curious to see whether the compiler reloads the function pointer in between invocations (I presume Numba can declare it const, in which case it would not need to).

@jpivarski
Member Author

jpivarski commented Mar 16, 2022

Okay, so here's a loop that fills output (assigning to an output array; pure side-effect) and the iteration through the whole dataset takes 2.04 seconds (standard deviation of 0.01 seconds in 10 trials). Iterating over the input without writing output (just summing the values with +=) took 1.78 seconds. We can see the difference, but it's small.

To do that, I added another interface to CppStatements, so that it accepts NumPy arrays as well as Awkward Arrays (PR #1372). Quick demo:

>>> import numpy as np
>>> import awkward as ak
>>> from awkward._v2._connect.cling import CppStatements
>>> 
>>> a = ak._v2.Array([[0.0, 1.1, 2.2], [], [3.3, 4.4]])
>>> b = np.zeros(3, dtype=np.float64)
>>> 
>>> f = CppStatements("""
... 
... for (int i = 0;  i < input.size();  i++) {
...   output[i] = 0.0;
...   for (auto x : input[i]) {
...     output[i] += x;
...   }
... }
... 
... """, input=a.layout.form, output=b.dtype)
>>> 
>>> f(input=a, output=b)
>>> 
>>> b
array([3.3, 0. , 7.7])

NumPy arrays can be used as output because they're mutable (unlike Awkward Arrays). Unfortunately, they're also unstructured, but I can use them to make a structured object with the ak.from_buffers function.

Using the same input array,

array = ak._v2.from_parquet("zlib9-jagged3.parquet")

I make all the buffers I'll need to pass to ak.from_buffers by copying the shape and dtype of the original (not the contents).

offsets1 = np.zeros_like(array.layout.offsets)
offsets2 = np.zeros_like(array.layout.content.offsets)
offsets3 = np.zeros_like(array.layout.content.content.offsets)
data = np.zeros_like(array.layout.content.content.content.data)

Here's a sequence of statements that puts the right values into offsets1, offsets2, offsets3, and data to copy the original array.

f = CppStatements("""

offsets1[0] = 0;
offsets2[0] = 0;
offsets3[0] = 0;

int i1 = 0;
int i2 = 0;
int i3 = 0;
int i4 = 0;

for (auto inner1 : array) {
  offsets1[i1 + 1] = offsets1[i1] + inner1.size();
  i1++;
  for (auto inner2 : inner1) {
    offsets2[i2 + 1] = offsets2[i2] + inner2.size();
    i2++;
    for (auto inner3 : inner2) {
      offsets3[i3 + 1] = offsets3[i3] + inner3.size();
      i3++;
      for (auto inner4 : inner3) {
        data[i4] = inner4;
        i4++;
      }
    }
  }
}

""", array=array, offsets1=offsets1, offsets2=offsets2, offsets3=offsets3, data=data)

The following line runs in 2.04 seconds, consistently:

f(array=array, offsets1=offsets1, offsets2=offsets2, offsets3=offsets3, data=data)

To put the NumPy arrays together into an Awkward Array with the original structure, we need a Form, labeled by "form_key" at each node, the len(array), and a dict mapping "form_key" strings to the NumPy arrays. This operation is a view, not a copy.

output = ak._v2.from_buffers(
  ak._v2.forms.from_json(
    """
    {
        "class": "ListOffsetArray",
        "offsets": "i32",
        "content": {
            "class": "ListOffsetArray",
            "offsets": "i32",
            "content": {
                "class": "ListOffsetArray",
                "offsets": "i32",
                "content": {
                    "class": "NumpyArray",
                    "primitive": "float32",
                    "form_key": "data"
                },
                "form_key": "offsets3"
            },
            "form_key": "offsets2"
        },
        "form_key": "offsets1"
    }
    """
  ),
  len(array),
  {
    "offsets1": offsets1,
    "offsets2": offsets2,
    "offsets3": offsets3,
    "data": data,
  },
  buffer_key="{form_key}",
)

And now output is identical to the input array.
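For readers unfamiliar with the offsets representation, here is a pure-Python sketch of the same round trip on a small, doubly nested list. It is only an illustration: it appends to lists instead of filling preallocated arrays, the helper names `to_buffers` and `from_offsets` are invented here, and rebuilding by slicing copies data, whereas `from_buffers` is described above as a view.

```python
def to_buffers(array):
    """Flatten a list of lists of lists into two offsets buffers plus data."""
    offsets1, offsets2, data = [0], [0], []
    for inner1 in array:
        # Each offsets entry is the running total of lengths at that level.
        offsets1.append(offsets1[-1] + len(inner1))
        for inner2 in inner1:
            offsets2.append(offsets2[-1] + len(inner2))
            for x in inner2:
                data.append(x)
    return offsets1, offsets2, data


def from_offsets(offsets, content):
    """Cut content into sublists at consecutive offsets."""
    return [content[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


original = [[[1.1, 2.2], []], [], [[3.3], [4.4, 5.5]]]
offsets1, offsets2, data = to_buffers(original)

# Rebuild from the inside out: innermost lists first, then the outer level.
rebuilt = from_offsets(offsets1, from_offsets(offsets2, data))

print(offsets1)
print(offsets2)
print(data)
```

The number of entries in each offsets buffer is one more than the number of lists at that level, which is why the C++ statements above write `offsets[i + 1]` from `offsets[i]`.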

The time for this sequence of C++ statements, 2.04 seconds, is close to a run without output of 1.78 seconds, and it's very different from the 218 seconds of writing to the ArrayBuilder through external pointers. Behind those external pointers, the ArrayBuilder does some dynamically typed stuff, so we could also compare it to the 69 seconds of only filling the inmost data through a single external pointer per datum.

And if you're thinking it's the ArrayBuilder itself, not the external pointers in Clang Incremental, filling just the inmost data with an ArrayBuilder through external pointers in Numba takes 20 seconds. It could be that ArrayBuilder's overhead accounts for 20 ‒ 2 = 18 seconds, but then external pointers in Clang Incremental take 69 ‒ 20 = 49 seconds. I wouldn't be surprised by ArrayBuilder having this 18 seconds of overhead, but I am surprised that Clang Incremental has this 49 seconds of overhead over Numba: they're doing very similar things.
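The decomposition in the last two paragraphs is simple subtraction; spelled out with the rounded timings quoted in this comment (no new measurements, and the variable names are just labels for this sketch):

```python
# Timings quoted in this comment, in seconds, rounded as in the text.
preallocated_numpy = 2     # writing into preallocated NumPy arrays (2.04 s)
numba_builder_inner = 20   # ArrayBuilder, inmost data only, through Numba
clang_builder_inner = 69   # ArrayBuilder, inmost data only, through Clang Incremental

# Cost attributed to ArrayBuilder itself, if Numba's external call were free.
builder_overhead = numba_builder_inner - preallocated_numpy

# Extra cost seen only when the same calls are made from Clang Incremental.
clang_extra = clang_builder_inner - numba_builder_inner

ratio = clang_builder_inner / numba_builder_inner
print(builder_overhead, clang_extra, round(ratio, 2))
```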

@wlav

wlav commented Mar 16, 2022

Well, color me puzzled. I'll need to get my hands on "clang incremental" then to try it out, or someone should post the disasm.

Just as some more data points, I compared cppyy.cppdef() which is (from PyPI, not the ancient stuff that ROOT distributes!) my optimized equivalent of Cling's declare(). (Aside, it JITs about 10x faster than Numba, as it does not run the expensive loop optimizations, of all things!)

Fastest by far is passing a function pointer on the stack (i.e. as an argument; this outperforms Numba, too, but obviously no surprise as this removes the load), slowest is when JITted in but through a relocation. But even then, the spread is only ~16%.

@jpivarski
Member Author

@vgvassilev, I hope it's alright: I've posted the code you sent me on https://gist.github.com/jpivarski/aad015ac893a0d9cca6c6f42a90a9505 with instructions for reproducing the above. I think all the warnings that I put there (and the fact that it's a gist, not a regular repo) should prevent anyone from taking it too seriously as production code.

@vgvassilev

I am fine posting that code.

@wlav

wlav commented Mar 17, 2022

How to get the parquet file? Closest I seem to get when following the white rabbit is this script, but it still needs inputs that don't seem to be available?

@wlav

wlav commented Mar 17, 2022

Ah, found it in the thread above. :)

@wlav

wlav commented Mar 17, 2022

I can't get the Numba example to work yet:

    out = 0.0
    ^ 

This error may have been caused by the following argument(s):
- argument 0: Cannot determine Numba type of <class 'awkward._v2.highlevel.Array'>
- argument 1: Cannot determine Numba type of <class 'awkward._v2.highlevel.ArrayBuilder'>

might be obvious, but I'm still new to Numba.

However, when checking with top (yay for simple tools :) ), I find that both slow Cpp examples are massive memory hogs: the 6.9s one runs up to 8GB, then jumps to ~17GB at the end. The other increases slowly up to ~14GB, then jumps to ~19GB at the end.

I still see a rough 3x difference just like you do (more like 2.6x, from 57s and 22s, but I don't think I restricted the data set), and it can't be swapping, so smells to me that there are superfluous copies being made? With the slower one, which uses more memory, making more copies?

Just putting that out there in case it rings a bell? Will check in more detail once I get the Numba code to work ...

@jpivarski
Member Author

Sorry about the issue with the Numba examples. Run this:

ak._v2.numba.register()

It's because everything in the ak._v2 submodule is a second implementation of the library, with the intention of replacing it when Awkward 2.0 is released (somewhat like the ROOT::Experimental namespace). The Numba extensions for Awkward v1 are automatically registered, but v2 is opt-in to make sure that there are no conflicts that break it for existing Numba users. (There shouldn't be—the type names are different—but I'm paranoid about breaking things for production users.)

With that, Numba should recognize these _v2 arrays as things that can be in an @nb.njit function.


I would not expect any changes in memory when going through the external function pointer, other than the fact that the ArrayBuilder contains GrowableBuffers that fill up like push_back on std::vector—i.e. replaces its buffer with another 1.5× larger every time the limit is reached. The memory use should jump exponentially larger, logarithmically less often. That effect would be clearest to see in the example without builder.begin_list()/end_list(). If it only has one builder.real(...) call, then it only has one GrowableBuffer, and you wouldn't see different GrowableBuffers expanding at different times.

If there's a large memory use when snapshot() happens, that can be the final transfer from ArrayBuilder to the new Array.
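The growth pattern can be simulated without touching any real buffers. This is a sketch under assumptions: the initial capacity of 1024 is chosen arbitrarily for illustration, and the function `simulate_growth` is not part of Awkward Array; only the 1.5× factor comes from the description above.

```python
def simulate_growth(n_items, initial=1024, factor=1.5):
    """Count reallocations and copied items when filling a geometrically growing buffer.

    Returns (number of resizes, total items copied, final capacity).
    """
    capacity, resizes, copied = initial, 0, 0
    while capacity < n_items:
        copied += capacity  # a full buffer is copied into its replacement
        # max(...) guarantees progress even for very small capacities
        capacity = max(capacity + 1, int(capacity * factor))
        resizes += 1
    return resizes, copied, capacity


for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, simulate_growth(n))
```

The number of resizes grows with the logarithm of the item count, while each individual resize copies a buffer 1.5 times larger than the one before it.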

@wlav

wlav commented Mar 17, 2022

Thanks! With that the Numba example works! It doesn't spike in memory like the other ones. Yes, growing with 1.5x will cause lots of memory copying all around and is the most likely culprit. Yes, the number of jumps slows down logarithmically with size, but the amount of copying needed on each jump grows exponentially (okay exp(n-1)).

Quick-and-dirty profiling with perf covers the whole program, not just the timed portion, but total time scaling is pretty close to the load/store count scaling.

Quick-and-dirty profiling with operf, also over the whole program, unequivocally puts the blame on _ext.cpython-38-x86_64-linux-gnu.so. It doesn't show symbols (I guess -g wasn't used), but that library is indeed the home of GrowableBuffer, per nm -D.

Is there any way you can use the semantic equivalent of reserve() somewhere?

Aside, mixing Numba and AArrayclangInterpreter in the same file results in a segfault in llvm::TargetPassConfig::addPass(llvm::Pass*).

@jpivarski
Member Author

jpivarski commented Mar 17, 2022

Since the ArrayBuilder is used in both the Numba and C++ demos, they should show the same memory growth and copying, though the C++ demo still takes longer than the Numba one. As for the copying, we're slowly moving to a model that requires less copying: since snapshot() is rare, we no longer require that to be a view, and so the accumulated data no longer needs to be contiguous. It could be a linked list of large buffers (or similar), with the discontiguous pieces concatenated in a single copy in snapshot().
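A minimal sketch of the "list of large buffers" idea follows. The class `PanelBuffer` and its tiny panel size are invented for illustration and do not reflect the actual GrowableBuffer implementation; Python lists stand in for fixed-size allocations.

```python
class PanelBuffer:
    """Append-only buffer made of fixed-size panels; growth never copies old data."""

    def __init__(self, panel_size=4):
        self.panel_size = panel_size
        self.panels = [[]]

    def append(self, x):
        if len(self.panels[-1]) == self.panel_size:
            self.panels.append([])  # start a new panel; earlier ones are untouched
        self.panels[-1].append(x)

    def __len__(self):
        return (len(self.panels) - 1) * self.panel_size + len(self.panels[-1])

    def snapshot(self):
        """Concatenate all panels in one pass (the only copy that is made)."""
        out = []
        for panel in self.panels:
            out.extend(panel)
        return out


buf = PanelBuffer(panel_size=4)
for i in range(10):
    buf.append(i)

print(len(buf.panels), len(buf), buf.snapshot())
```

The cost of contiguity is paid once, in `snapshot()`, rather than at every capacity limit during filling.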

#1359 (comment) is the semantic equivalent of "reserve": the arrays that get filled in this example are preallocated with a given size. In general use-cases, we won't know the size, so we can't always use this technique. That's the one that's 2.04 seconds instead of 218 seconds (with 1.78 seconds to iterate without producing output).

To really zero in on the external function pointer, instead of ArrayBuilder's memory usage, how about

import numpy as np
import awkward as ak
from awkward._v2._connect.cling import CppStatements

builder = ak._v2.ArrayBuilder()

f = CppStatements("""
for (int i = 0;  i < 1000000000;  i++) {
  int64_t result;
  builder.length(&result);
}
""", builder=builder)

f(builder=builder)

versus

dummy = np.array([0])

g = CppStatements("""
for (int i = 0;  i < 1000000000;  i++) {
  dummy[0] = dummy[0] * 2 - 1;   // complicated enough that the optimizer can't remove it?
}
""", dummy=dummy)

g(dummy=dummy)

because builder.length has to go through an external pointer, but does not fill anything. f takes 8.1 seconds and g takes 0.06 seconds for me.

The Numba equivalent of f:

import numba as nb
import awkward as ak
ak._v2.numba.register()

@nb.njit
def f(builder):
    for i in range(1000000000):
        len(builder)

f(builder)

takes 3.0 seconds, which is noticeably less than 8.1 seconds. That's the odd thing: that they're both external function calls, but Numba is faster. I would have thought they'd be the same. It's an indication that something is slowing down Clang Incremental's use of external function pointers.

> Aside, mixing Numba and AArrayclangInterpreter in the same file results in a segfault in llvm::TargetPassConfig::addPass(llvm::Pass*).

I noticed that, too, and forgot to report it to @vgvassilev! (I didn't narrow it down, though, just noticed that I needed separate conda environments for Numba and clangdev.) Thanks for doing so.

@wlav

wlav commented Mar 17, 2022

Not sure what dummy is above, but if an array of double, then that 2nd example's loop gets unrolled (at -O3) using AVX (16x) and FMA (2x), giving a 32x speedup. That's not from 8.1s -> 0.06, though, but then, there's also zero memory access (not even the cache) until final write as both constants 2 and 1 are powers of two ...

As for the first and last, on my system, they are basically the same speed, both ~4.7s.

@wlav

wlav commented Mar 17, 2022

Aside, I just took the examples as-is, but if I add a warmup call, then Clang easily outperforms Numba by ~20%.

@jpivarski
Member Author

I forgot to copy in

dummy = np.array([0])

so you were right (integers, though). I fixed the comment above.


What's the "warmup"? Repeated application? Because with f defined using Clang:

>>> s = time.time(); f(builder=builder); time.time() - s
8.109642505645752
>>> s = time.time(); f(builder=builder); time.time() - s
8.104896306991577
>>> s = time.time(); f(builder=builder); time.time() - s
8.123145818710327
>>> s = time.time(); f(builder=builder); time.time() - s
8.121228694915771
>>> s = time.time(); f(builder=builder); time.time() - s
8.182328701019287

and with f defined using Numba:

>>> s = time.time(); f(builder=builder); time.time() - s
3.039454460144043
>>> s = time.time(); f(builder=builder); time.time() - s
3.0041792392730713
>>> s = time.time(); f(builder=builder); time.time() - s
3.0177931785583496
>>> s = time.time(); f(builder=builder); time.time() - s
3.0152335166931152
>>> s = time.time(); f(builder=builder); time.time() - s
3.0572359561920166

This is hitting the same external pointer one billion times per call but not doing anything beyond that.

If you're getting different results on repeated application, then we might have different hardware and that would be a clue to the origin of this effect. On the other hand, if this effect is something that even changes direction with hardware (Clang being eventually faster on your computer by 20%), then it's probably not something to deep-dive into, since we wouldn't be able to control for it in a product. Regardless of Clang or Numba, external pointers should be avoided if possible.

@wlav

wlav commented Mar 17, 2022

Yes, repeated application is what I meant. It's not immediately clear to me what Clang is still doing on lookup: the code as written doesn't look free, but parsing is already done at that point. Numba, however, doesn't JIT until first use (when the types become known).
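A timing harness that separates one-off costs from steady-state cost might look like the sketch below. The helper `time_calls` and the stand-in `workload` are hypothetical; in the experiments discussed here, `func` would be something like `lambda: f(builder=builder)`.

```python
import time


def time_calls(func, repeats=5, warmup=1):
    """Run func a few times untimed, then return a list of timed durations in seconds."""
    for _ in range(warmup):
        func()  # untimed: absorbs one-off costs such as JIT compilation on first use
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


calls = []


def workload():
    """Stand-in for the function under test; records that it was called."""
    calls.append(1)
    return sum(range(10_000))


timings = time_calls(workload, repeats=5, warmup=1)
print(len(calls), min(timings), max(timings))
```

Reporting the minimum and the spread over repeats, rather than a single run, makes it easier to tell a real difference from first-call effects.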

As for hardware, I'm using a Linux box with an AMD EPYC 7702P and 1TB of RAM.

> Regardless of Clang or Numba, external pointers should be avoided if possible.

Yes, absolutely agree with that generally speaking. For cppyy/numba, that is what I want to achieve, too: place the IR from Cling C++ into the IR from Numba Python, effectively inlining it. Not only gets rid of the indirections, but also adds further optimization opportunities.

@wlav

wlav commented Mar 17, 2022

One more quick note: yes, with an int array for dummy, that whole loop gets optimized out of existence already at -O1.

Edit: no, I messed up. My example code didn't take dummy as an argument so there was no side-effect. Still, for the actual example as written, the loop body does not access (cache) memory, which should easily explain the difference in overall performance:

  10:	48 c1 e1 08          	shl    $0x8,%rcx
  14:	48 81 c1 01 ff ff ff 	add    $0xffffffffffffff01,%rcx
  1b:	83 c2 f8             	add    $0xfffffff8,%edx
  1e:	75 f0                	jne    10 <_Z3fffPl+0x10>

@vgvassilev

> > Aside, mixing Numba and AArrayclangInterpreter in the same file results in a segfault in llvm::TargetPassConfig::addPass(llvm::Pass*).
>
> I noticed that, too, and forgot to report it to @vgvassilev! (I didn't narrow it down, though, just noticed that I needed separate conda environments for Numba and clangdev.) Thanks for doing so.

Can you set me up with a minimal reproducer?

@wlav

wlav commented Mar 18, 2022

@vgvassilev This seems to be the most minimal example (note that there's another bug with the arguments of the C++ function). The basic point is to do "something" with Numba that creates its setup (it doesn't need to run the full JIT by having the nb_noop() call execute), and then you do have to perform the function call through Clang incremental (which goes back to my earlier question about warmup: how much is deferred to the actual call).

from numba import jit
from all_call_fn import InteractiveCppEnv

@jit(nopython=True)
def nb_noop():
    pass

CppEnv = InteractiveCppEnv()
CppEnv.cpp_compile("void c_noop() {}")
CppEnv.Cpp.c_noop(0)   # <- arg required? (bug)
