Experimental Metal backend #396

Closed · k-ye opened this issue Jan 20, 2020 · 36 comments

Labels: feature request (Suggest an idea on this project), mac (Mac OS X platform)

@k-ye (Member) commented Jan 20, 2020

Is your feature request related to a problem? Please describe.

I'd like to add a Metal backend to taichi, so that Mac users can enjoy GPU acceleration, too.

Describe the solution you'd like

Halide already supports a Metal backend, and its codebase has a lot we can learn from. Specifically, it uses source-to-source codegen to translate Halide into Metal compute kernels (not LLVM IR). It also wraps the Metal APIs in C++ via the objc-runtime APIs (taichi already uses this approach for its GUI).

After talking to @yuanming-hu, I think we can start by supporting dense SNodes first. This requires two things:

  • scattered read/write (access arbitrary memory address in the buffer)
  • atomic operations (e.g. atomic_add)

Metal supports both fairly well.
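
For concreteness, here is a minimal Metal kernel sketch of these two requirements (editor's illustration with made-up buffer and kernel names, not taichi-generated code):

#include <metal_stdlib>
using namespace metal;

// Each thread writes to a data-dependent location (scattered write) and
// accumulates into a bin via an atomic add; one index entry per thread.
kernel void scatter_and_atomic_demo(
    device int32_t *data          [[buffer(0)]],
    device atomic_int *bins       [[buffer(1)]],
    const device int32_t *indices [[buffer(2)]],
    const uint tid [[thread_position_in_grid]]) {
  // Scattered write: the destination is not contiguous in tid.
  data[indices[tid]] = (int32_t)tid;
  // Atomic accumulation, the building block for taichi's atomic_add.
  atomic_fetch_add_explicit(&bins[indices[tid] % 16], 1, memory_order_relaxed);
}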

I also need to figure out if the memory returned by the existing/next-gen memory allocator can be used by Metal kernels.

Describe alternatives you've considered

Supporting OpenGL compute shaders may appeal to a broader audience. However, given my dev environment setup and prior working experience, I feel more comfortable working on Metal. If this works out, it should also be useful for designing the OpenGL backend.

Additional context

Some references

  • Halide
  • mtlpp: This is another library that exposes Metal APIs to C++, but it's implemented in Obj-C.
@k-ye added the feature request (Suggest an idea on this project) label on Jan 20, 2020
@k-ye (Member, Author) commented Jan 25, 2020

Some progress so far.

I'm trying to keep all the codegen on the Metal kernel side, without generating host-side C++ code. One thing I found is that this makes it hard for a Metal struct to have embedded arrays. For example,

s1 = root.dense(ti.i, n)
s1.place(x)
s1.dense(ti.i, m).place(y)

This logically maps to a struct like this:

struct s1 {
  float x;
  float y[m];
};

Here, scalar x and array y are embedded in a contiguous chunk of memory.

However, my current approach can only produce something like

struct s1 {
  float& x;
  float* y;
};

where x and y are sub-ranges within the Metal buffers they belong to, respectively. That is, we have a metal buffer X of size n and Y of size n * m. A struct s1 has x = X[i] and y = Y[i * m, (i + 1) * m). In such cases, we only have SoA...

For variables placed at the same level, I believe I can still have either SoA or AoS, depending on how it's laid out. E.g.

root.dense(ti.i, n).place(x, y)

I can generate a struct s1 { float x; float y; }; and use just one Metal buffer to hold s1.

However, this is just me doing the compilation by hand. Let me see if I run into more problems when writing the compiler for ti.layout...
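
To make the two layouts concrete, here is a hand-written sketch of the address computations (editor's illustration with assumed names; it also assumes Y is stored right after X in a single buffer, which is not exactly the two-buffer setup above):

// AoS: root.dense(ti.i, n).place(x, y) -> one buffer of interleaved {x, y}.
device float *aos_x(device uchar *buf, int i) {
  return (device float *)(buf + i * 2 * sizeof(float));
}
device float *aos_y(device uchar *buf, int i) {
  return (device float *)(buf + i * 2 * sizeof(float) + sizeof(float));
}

// SoA (the nested case): x lives in X[0, n), y in Y[0, n * m).
device float *soa_x(device uchar *buf, int i) {
  return (device float *)buf + i;              // X[i]
}
device float *soa_y(device uchar *buf, int n, int m, int i) {
  return (device float *)buf + n + i * m;      // start of Y[i*m, (i+1)*m)
}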

@yuanming-hu (Member) commented:

Would that be easier if we simply allocate a huge buffer for the whole data structure tree, and then implement functions that map tensor + indices into a pointer (i.e. manually generate addressing computations instead of relying on structs)?

Also, it's sad that although Metal relies heavily on LLVM, Apple doesn't release a Metal backend for LLVM... https://worthdoingbadly.com/metalbitcode/ https://twitter.com/icculus/status/721893213452312576

@yuanming-hu (Member) commented Jan 25, 2020

If we only have (non-nested) dense data structures, then every tensor element, say, x[i, j, k] will have an address in the form of base + i * stride_i + j * stride_j + k * stride_k. I can help generate the base and strides (as well as the total data structure size), if this sounds easier than creating complex structs. AOS/SOA can still be supported, but we won't have fancy dense structures such as nested ones.
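
As a rough sketch of that addressing scheme (illustrative names only, not taichi's actual code):

// Address of x[i, j, k] for a non-nested dense tensor; base and the strides
// would be generated on the host.
device float *element_ptr(device uchar *base, int i, int j, int k,
                          int stride_i, int stride_j, int stride_k) {
  return (device float *)(base + i * stride_i + j * stride_j + k * stride_k);
}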

@k-ye (Member, Author) commented Jan 26, 2020

Would that be easier if we simply allocate a huge buffer for the whole data structure tree, and then implement functions that map tensor + indices into a pointer

Thanks! I am trying to follow this approach as well :) The structs I created didn't own the memory; they were basically used to flatten the logical index into a buffer position (base + stride).

If we only have (non-nested) dense data structures, then every tensor element, say, x[i, j, k] will have an address in the form of base + i * stride_i + j * stride_j + k * stride_k. I can help generate the base and strides (as well as the total data structure size), if this sounds easier than creating complex structs. AOS/SOA can still be supported, but we won't have fancy dense structures such as nested ones.

I see. Yeah, I believe that if you give Metal a buffer of the correct size, Metal can treat it as a byte array and reinterpret it freely. One thing that blocked me was how to calculate this size without generating code on the client side and then doing a sizeof(). For example, for a layout of float x[n], float y[m*n], is the memory really just (n + (m * n)) * sizeof(float), or are there additional requirements? Now I think that if I follow your suggestion and don't create all those intermediate structs, this is an accurate estimate. I'll give it a try :)


Yeah, I'd really hope Apple would release the Metal backend as well (given the current quality of Metal's documentation, maybe this isn't their top priority lol).

@yuanming-hu (Member) commented Jan 26, 2020

Currently size calculation depends on LLVM:

auto root_size =

I think you can simply modify struct_llvm.cpp and take that result.

For example, for a layout of float x[n], float y[m*n], is the memory really just (n + (m * n)) * sizeof(float), or are there additional requirements?

You are right that compilers (and sometimes the OS) have additional padding constraints on the actual data layout (which can get pretty complex: https://llvm.org/docs/LangRef.html#langref-datalayout). For now, we can just ignore it.

@yuanming-hu (Member) commented:

After more thought, I feel like assuming

x[i, j, k] will have an address in the form of base + i * stride_i + j * stride_j + k * stride_k

is not really making things easier, given that x[i, j, k] will be broken down into micro-access ops... How about this: I'll help with the data access part (LinearizeStmt, IntegerOffsetStmt, GetChStmt, SNodeLookupStmt) of the Metal backend, which is mainly address computation.

@yuanming-hu (Member) commented:

A quick update: enum Arch is now extended to support several potential new backends: https://github.com/taichi-dev/taichi/blob/75992afe5b50067d5a70f7712387a70967e656cb/taichi/inc/archs.inc.h

@k-ye (Member, Author) commented Jan 28, 2020

Ah nice, thanks!

@kazimuth commented Jan 29, 2020

Supporting OpenGL compute shaders may appeal to a broader audience. However, given my dev environment setup and prior working experience, I feel more comfortable working on Metal. If this works out, it should also be useful for designing the OpenGL backend.

I've also wondered about implementing a Vulkan backend, which would give portability to pretty much every platform, at least in theory. There may be some similar issues with using pointers in Vulkan shader code, but I think they can be resolved in a similar way, by sticking everything in (a few) chunks of global memory and doing pointer arithmetic by hand. Excited to see how this implementation goes :)

@yuanming-hu (Member) commented:

@kazimuth Vulkan sounds like a nice backend choice as well. My perception of real-time rendering APIs is probably outdated, and I'm not sure how well Vulkan is supported on modern GPUs. We need to investigate this more.

@k-ye (Member, Author) commented Jan 29, 2020

How about this: I'll help with the data access part (LinearizeStmt, IntegerOffsetStmt, GetChStmt, SNodeLookupStmt) of the Metal backend, which is mainly address computation.

I think I've figured out the basics of SNode (at least for dense...). What I did is basically generating the SNode and SNode_ch structs into the Metal kernels (struct_metal.h); instead of actually owning the memory, they just do pointer arithmetic on the amalgamated global memory. Then the set of micro-access ops works on them just fine (e.g. codegen_metal.h).

I'm now working on the part that transfers data between the host and Metal. Once that's finished, I'll post an update on how it goes :)

@yuanming-hu (Member) commented:

Fantastic! One note on struct-fors:
when using dense data structures only, struct-fors can be demoted into dense range-fors.

Implementing struct-fors following the old x86/CUDA strategies on Metal can be pretty tricky - a series of element list generations has to be done, which is a lot of work and will harm performance. I will implement #378, which can also be used in the Metal backend, so no worries about implementing struct-fors.

@k-ye (Member, Author) commented Jan 29, 2020

So at https://github.com/k-ye/taichi/tree/ed1971f97ffa055ed3a7e4ee60111bd1cd8a2742, I've got the Metal kernel to work on a toy example :)

def test_basic():
  ti.cfg.arch = ti.metal
  ti.cfg.print_ir = True
  
  x = ti.var(ti.i32)
  n = 128

  @ti.layout
  def place():
    ti.root.dense(ti.i, n).place(x)

  @ti.kernel
  def func():
    for i in range(n):
      x[i] = i + 123
  
  func()

  xnp = x.to_numpy()
  for i in range(n):
    assert xnp[i] == i + 123
  print('passed')

A few things I'd like to point out:

  • I've disabled the task-offloading pass for now. It seems simpler to me to generate kernels directly from for-loops...
  • operator[] is not working yet. For now, to_numpy() can be used as a workaround. For to_numpy(), I saw that the data copy between host and device happens here. But I don't really know where that memory copy happens in get_snode_reader(). Is this purely a host-side kernel?

Below are the generated Metal kernels. Note that I had to copy the S* node definitions into every Metal kernel, because I'm using newLibraryWithSource:options:error:, which doesn't allow the generated kernel to include any headers other than <metal_stdlib>.

  • func:
#include <metal_stdlib>
using namespace metal;

namespace {
using byte = uchar;

struct S2 {
  // place
  constant static constexpr int stride = sizeof(int32_t);
  S2(device byte* v) : val((device int32_t*)v) {}
  device int32_t* val;
};

class S1_ch {
 private:
  device byte* addr_;
 public:
  S1_ch(device byte* a) : addr_(a) {}
  S2 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S2::stride;
};

struct S1 {
  // dense
  constant static constexpr int n = 128;
  constant static constexpr int stride = S1_ch::stride * n;
  S1(device byte* a) : addr_(a) {}
  S1_ch children(int i) {
    return {addr_ + i * S1_ch::stride};
  }
 private:
  device byte* addr_;
};

class S0_ch {
 private:
  device byte* addr_;
 public:
  S0_ch(device byte* a) : addr_(a) {}
  S1 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S1::stride;
};

struct S0 {
  // root
  constant static constexpr int n = 1;
  constant static constexpr int stride = S0_ch::stride * n;
  S0(device byte* a) : addr_(a) {}
  S0_ch children(int i) {
    return {addr_ + i * S0_ch::stride};
  }
 private:
  device byte* addr_;
};

}  // namespace

kernel void k0001_func_c4_0__0(
    device byte* addr [[buffer(0)]],
    const uint utid_ [[thread_position_in_grid]]) {
  if (!(0 <= utid_ && utid_ < 128)) return;
  const int tmp0 = (int)utid_;
  const int32_t tmp4(tmp0);
  const int32_t tmp5 = 123;
  const int32_t tmp6((tmp4) + (tmp5));
  S0 tmp7(addr);
  auto tmp8 = 0;
  S0_ch tmp9 = tmp7.children(tmp8);
  S1 tmp10 = tmp9.get0();
  auto tmp11 = (((0 + tmp4) >> 0) & ((1 << 7) - 1));
  auto tmp12 = (0) * 128 + tmp11;
  S1_ch tmp13 = tmp10.children(tmp12);
  device int32_t* tmp14 = tmp13.get0().val;
  *tmp14 = tmp6;
}
  • to_numpy():
#include <metal_stdlib>
using namespace metal;

namespace {
using byte = uchar;

struct S2 {
  // place
  constant static constexpr int stride = sizeof(int32_t);
  S2(device byte* v) : val((device int32_t*)v) {}
  device int32_t* val;
};

class S1_ch {
 private:
  device byte* addr_;
 public:
  S1_ch(device byte* a) : addr_(a) {}
  S2 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S2::stride;
};

struct S1 {
  // dense
  constant static constexpr int n = 128;
  constant static constexpr int stride = S1_ch::stride * n;
  S1(device byte* a) : addr_(a) {}
  S1_ch children(int i) {
    return {addr_ + i * S1_ch::stride};
  }
 private:
  device byte* addr_;
};

class S0_ch {
 private:
  device byte* addr_;
 public:
  S0_ch(device byte* a) : addr_(a) {}
  S1 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S1::stride;
};

struct S0 {
  // root
  constant static constexpr int n = 1;
  constant static constexpr int stride = S0_ch::stride * n;
  S0(device byte* a) : addr_(a) {}
  S0_ch children(int i) {
    return {addr_ + i * S0_ch::stride};
  }
 private:
  device byte* addr_;
};

}  // namespace

namespace {
class k0002_tensor_to_ext_arr_c8_0__args {
 public:
  explicit k0002_tensor_to_ext_arr_c8_0__args(device byte* addr) : addr_(addr) {}
  device int32_t* arg0() {
    // array, size=512 B
    return (device int32_t*)(addr_ + 0);
  }
 private:
  device byte* addr_;
};
}  // namespace

kernel void k0002_tensor_to_ext_arr_c8_0__0(
    device byte* addr [[buffer(0)]],
    device byte* args_addr [[buffer(1)]],
    const uint utid_ [[thread_position_in_grid]]) {
  k0002_tensor_to_ext_arr_c8_0__args args_ctx_(args_addr);
  if (utid_ >= 128) return;
  int tid_ = (int)utid_;
  const int tmp0 = (tid_ / 1);
  tid_ = (tid_ % 1);
  const int32_t tmp2(tmp0);
  S0 tmp3(addr);
  auto tmp4 = 0;
  S0_ch tmp5 = tmp3.children(tmp4);
  S1 tmp6 = tmp5.get0();
  auto tmp7 = (((0 + tmp2) >> 0) & ((1 << 7) - 1));
  auto tmp8 = (0) * 128 + tmp7;
  S1_ch tmp9 = tmp6.children(tmp8);
  device int32_t* tmp10 = tmp9.get0().val;
  int32_t tmp11 = *tmp10;
  device int32_t *tmp12 = args_ctx_.arg0();
  device int32_t *tmp13 = (tmp12 + tmp2);
  *tmp13 = tmp11;
}

@k-ye (Member, Author) commented Jan 29, 2020

Implementing struct-fors following the old x86/CUDA strategies on Metal can be pretty tricky - a series of element list generations has to be done, which is a lot of work and will harm performance. I will implement #378, which can also be used in the Metal backend, so no worries about implementing struct-fors.

That will be fantastic! Yeah, I had to look into SNode's extractors to figure out the size of each dimension for now...

@yuanming-hu (Member) commented:

Amazing!!!

I've disabled the task-offloading pass for now. It seems simpler to me to generate kernels directly from for-loops...

Task offloading ensures that kernels of the forms

@ti.kernel
def test():
  for i in range(10):
    ...
  for j in range(30):
    ...

and

@ti.kernel
def test():
  a = 0
  for i in range(10):
    a += i

are correctly generated. For now, if we assume all kernels are pure single for-loops, then we are good.

But I don't really know where that memory copy happens in get_snode_reader(). Is this purely a host-side kernel?

Yes, that's a host-side kernel. We make use of CUDA unified memory, so a host-side read of GPU memory results in a page fault, and the OS then automatically copies the GPU page to CPU memory. This design may not work on every platform, though. For Metal, if there's no unified memory support, maybe you can generate a device-side kernel for now. (However, GPU kernel launches can be rather slow, so maybe it's better to batch all the read/write requests, which needs a refactoring of the system...)

Note that I had to copy the S* node definitions into every Metal kernel

Cool! You are doing the preprocessor's job and we don't have to worry about specifying the include directory :-)

I had to look into SNode's extractors to figure out the size of each dimension for now

Actually you only need to calculate two things:

  • In SNodeLookupStmt, return stmt->input_snode_pointer + x * sizeof(stmt->snode_element_size), where sizeof(stmt->snode_element_size) is the size of the SNode's element.
  • In GetChStmt, return stmt->input_element_pointer + stmt->ch_component_offset. Each element of a SNode can have multiple components, and stmt->ch_component_offset is the offset of that component.

For example, if you have ti.root.dense(ti.i, 1024).place(x:i32, y:i32, z:i32), then sizeof(snode_element_size) for the dense node should be 12 bytes, and ch_component_offset for x, y, and z within a dense node element (which is something like struct {i32, i32, i32}) should be 0, 4, 8 bytes.
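
In Metal-style pseudocode, the two steps for that example would look roughly like this (an editor's sketch under the assumptions above, not the actual generated code):

// ti.root.dense(ti.i, 1024).place(x: i32, y: i32, z: i32)
// Each dense element is laid out like struct { i32 x; i32 y; i32 z; } (12 bytes).
device uchar *snode_lookup(device uchar *dense_base, int i) {
  // SNodeLookupStmt: jump to element i of the dense node.
  return dense_base + i * 12;
}
device int32_t *get_ch(device uchar *element, int ch_component_offset) {
  // GetChStmt: ch_component_offset is 0 / 4 / 8 bytes for x / y / z.
  return (device int32_t *)(element + ch_component_offset);
}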

Your implementation of OffsetAndExtractBitsStmt, IntegerOffsetStmt, and SNodeLookupStmt codegen looks good to me!

@k-ye (Member, Author) commented Jan 29, 2020

For now, if we assume all kernels are pure single for-loops, then we are good.

Ah right, I do have a check to see whether the kernel consists purely of for loops...

Actually you only need to calculate two things:

Thanks for the explanation :) I think that's pretty much what I did (I named it stride). For example, below is a generated kernel struct.

class S1_ch {
 private:
  device byte* addr_;
 public:
  S1_ch(device byte* a) : addr_(a) {}
  S2 get0() {
    return {addr_};
  }
  S3 get1() {
    return {addr_ + (S2::stride)};
  }
  S4 get2() {
    return {addr_ + (S2::stride + S3::stride)};
  }
  S5 get3() {
    return {addr_ + (S2::stride + S3::stride + S4::stride)};
  }
  constant static constexpr int stride = S2::stride + S3::stride + S4::stride + S5::stride;
};

struct S1 {
  // dense
  constant static constexpr int n = 2048;
  constant static constexpr int stride = S1_ch::stride * n;
  S1(device byte* a) : addr_(a) {}
  S1_ch children(int i) {
    return {addr_ + i * S1_ch::stride};
  }
 private:
  device byte* addr_;
};

The extractors' info is used to deal with multi-dimensional struct-fors. For example:

for i, j in x:
  # ...

At the Metal level, this is still compiled to a 1D kernel. I then use the extractors' info for each index to un-linearize the thread ID into i, j. I guess this is probably what #378 plans to do?
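
Roughly, the un-linearization looks like this (editor's sketch with assumed sizes, not the generated code):

// A struct-for over an n-by-m dense tensor, launched as a 1D grid of n * m threads.
kernel void struct_for_2d_demo(const uint utid_ [[thread_position_in_grid]]) {
  const int n = 128, m = 64;          // would come from the SNode extractors
  if (utid_ >= uint(n * m)) return;
  const int i = int(utid_) / m;       // un-linearize the 1D thread id
  const int j = int(utid_) % m;
  // ... loop body uses (i, j), just like `for i, j in x` ...
}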

Your implementation of OffsetAndExtractBitsStmt, IntegerOffsetStmt, and SNodeLookupStmt codegen looks good to me!

That's because I copied them from the legacy codegen XD.


AT LAST, I CAN RUN mpm88.py VIA METAL NOW!! (With some initialization moved to the host side, because operator[] is not supported yet.) The sad thing is I saw absolutely no performance gain. That said, my code currently compiles the kernels (at the Metal level, not the taichi level) on every invocation! I'll try to compile them once and cache the result...
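
The caching I have in mind is nothing fancy; roughly something like this on the host side (editor's sketch with hypothetical names; CompiledMtlKernel stands in for whatever wraps the compiled pipeline state):

#include <memory>
#include <string>
#include <unordered_map>

struct CompiledMtlKernel { /* would hold the compiled Metal pipeline state */ };

class KernelCache {
 public:
  CompiledMtlKernel *lookup_or_compile(const std::string &kernel_name,
                                       const std::string &mtl_source) {
    auto it = cache_.find(kernel_name);
    if (it != cache_.end()) return it->second.get();  // reuse the compiled kernel
    auto compiled = compile_from_source(mtl_source);  // expensive; done only once
    auto *ptr = compiled.get();
    cache_.emplace(kernel_name, std::move(compiled));
    return ptr;
  }

 private:
  std::unique_ptr<CompiledMtlKernel> compile_from_source(const std::string &src) {
    // Would go through newLibraryWithSource:options:error: etc. via the objc runtime.
    (void)src;
    return std::make_unique<CompiledMtlKernel>();
  }
  std::unordered_map<std::string, std::unique_ptr<CompiledMtlKernel>> cache_;
};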

@yuanming-hu (Member) commented:

I see. That makes a lot of sense. Yes, #378 will compile multi-dimensional struct-for loops into range-based ones, so you only have to implement range-fors for now.

It's wonderful that you can already run a modified version of mpm88!

The GUI system may be the bottleneck of mpm88, where there are a lot of pybind11 function calls...
#411

@k-ye (Member, Author) commented Jan 30, 2020

The GUI system may be the bottleneck of mpm88, where there are a lot of pybind11 function calls...

I see. So if you're running this on CUDA, is the FPS similar to what you get on CPU...?

Yes, that's a host-side kernel. We make use of CUDA unified memory, so a host-side read of GPU memory results in a page fault, and the OS then automatically copies the GPU page to CPU memory.

Metal does have unified memory. IIUC, load_if_ptr((snode->expr)[indices]) returns the correct memory address? Since I didn't generate SNodes on the host side, that won't work for now. I can try doing the same pointer arithmetic on the host side.

@yuanming-hu (Member) commented:

So if you're running this on CUDA, is the FPS similar to what you get on CPU...?

That depends on the computation/visualization workload ratio... For mpm99, yes: since visualization, which includes a call through pybind11, is too slow, on my end the CPU gets 16 FPS and the GPU gets 20, although my GPU has 20x more FLOPs than my CPUs. This will be resolved once we batch the draw calls.

It's great that Metal also has unified memory! load_if_ptr will return you a pointer to the tensor element, yet I'm not sure how helpful that is if you are not implementing a corresponding x64 codegen. One solution that will work, although probably not very efficiently, is to do everything on the GPU and invoke the Metal equivalent of cudaMemcpy(device_to_host).

@k-ye (Member, Author) commented Jan 30, 2020

For mpm99, yes: since visualization, which includes a call through pybind11, is too slow, on my end the CPU gets 16 FPS and the GPU gets 20

Ah OK :( For mpm88, I got around 4 FPS on both CPU and Metal 😂 ... I'm modifying the code so that the Metal kernels are only compiled once, but I guess that won't make much of a difference now...

One solution that will work, although probably not very efficiently, is to do everything on the GPU and invoke the Metal equivalent of cudaMemcpy(device_to_host).

Yeah, same feeling here. But I think we have all the information on the host side to do pointer arithmetic. I'll give that a shot...

@yuanming-hu (Member) commented:

I'll try to do #411 after #378 tonight.

@k-ye (Member, Author) commented Jan 30, 2020

After caching the compiled kernels, mpm88 goes up to 7~8 FPS. Really looking forward to the batched GUI!

@yuanming-hu (Member) commented:

Awesome! I need a couple more hours, but you can also tweak the quality parameter (say, to 4) so that there is more work to do and you can compare CPU/GPU performance better.

@yuanming-hu (Member) commented:

#378 is now implemented. You need to add a pass for this demotion:

irpass::demote_dense_struct_fors(ir);

@yuanming-hu (Member) commented:

#411 done. GUI 30x faster. Now your bottleneck should be mostly simulation instead of rendering.

@k-ye (Member, Author) commented Jan 30, 2020

Great! They are both faster now, but Metal has become a bit slower than the CPU kernels (12 FPS vs 13 FPS) 😂 Hmm, I need to do some more profiling to see what's going on here...

@k-ye (Member, Author) commented Jan 30, 2020

Profiling data when simulating 81,920 particles in MPM88

CPU
[ 89.93%]                  substep_c4_0_     min   3.821 ms   avg   4.232 ms    max  49.045 ms   total  21.442 s [   5066x]
[  6.17%]                circles_batched     min  11.288 ms   avg  14.564 ms    max 117.464 ms   total   1.471 s [    101x]
[ ... ]

Metal
[ 78.61%]                  substep_c4_0_     min   1.991 ms   avg   2.943 ms    max   5.390 ms   total  14.909 s [   5066x]
[  9.71%]                circles_batched     min  12.897 ms   avg  18.228 ms    max 129.527 ms   total   1.841 s [    101x]
[ ... ]

We can see that Metal improved the avg performance by about 30%, and its variance is much smaller.

@yuanming-hu (Member) commented Jan 30, 2020

On my end

  • CPU 335ms (i7 7700K, multithreaded)
  • GPU 13 ms (GTX 1080 Ti, CUDA)

Profiling using

quality=3

...

for frame in range(20000):
  import time
  t = time.time()
  for s in range(int(2e-3 // dt)):
    substep()
  ti.sync()
  print((time.time() - t) * 1000)

@k-ye (Member, Author) commented Jan 30, 2020

Thanks, this is my number using the same code:

  • CPU 174ms (i9-9880H)
  • Metal 87ms (AMD Radeon Pro 5500M). Interestingly, a few times this number went down to 23ms without me changing anything...

I guess ti.sync translates to cudaDeviceSynchronize()? Right now the Metal kernels sync automatically on every ti.kernel call. Maybe I should also have users call sync manually and see if there's further improvement.

@k-ye (Member, Author) commented Feb 7, 2020

I guess one way to support SNodes beyond dense is to have a mixed CPU + Metal runtime... My feeling is that if we get the memory layout right and inject synchronization properly, this should be doable?

On the other hand, if we use the CPU for kernels that use non-dense SNodes, then I'm afraid that almost all kernels would fall back to the CPU in practice :-/

@k-ye (Member, Author) commented Feb 8, 2020

TODO:

@k-ye (Member, Author) commented Feb 10, 2020

A small update: if I use Runtime::root as Metal's root memory, then a host-side snode_reader kernel works out of the box! No additional Metal kernel launches are required.

I believe the memory layouts of the Metal structs and the LLVM structs are the same, iff only dense SNodes are used.

@k-ye (Member, Author) commented Mar 14, 2020

Superseded by #593
