Experimental Metal backend #396
Some progress so far. I'm trying to keep all the codegen on the Metal kernel side, without generating host-side C++ code. One thing I found is that this makes it hard for a Metal struct to have embedded arrays. For example,

s1 = root.dense(ti.i, n)
s1.place(x)
s1.dense(ti.i, m).place(y)

logically maps to a struct

struct s1 {
  float x;
  float y[m];
};

where the scalar x and the array y are embedded directly inside the struct. However, my current approach can only produce something like

struct s1 {
  float& x;
  float* y;
};

where x and y merely point into the underlying buffer. For variables placed at the same level, I believe I can still have either SoA or AoS, depending on how it's laid out. E.g. for

root.dense(ti.i, n).place(x, y)

I can generate either one. However, these are just me manually doing the compilation. Let me see if I will encounter more problems when writing the compiler for dense.
Would that be easier if we simply allocate a huge buffer for the whole data structure tree, and then implement functions that map tensor + indices into a pointer (i.e. manually generate addressing computations instead of relying on structs)?

Also, it's sad that although Metal relies heavily on LLVM, Apple doesn't release a Metal backend for LLVM...
https://worthdoingbadly.com/metalbitcode/
https://twitter.com/icculus/status/721893213452312576
If we only have (non-nested) dense data structures, then the address of every tensor element can be computed directly from the base pointer and its indices.
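As a sketch of that idea (hand-written, with made-up sizes), a purely dense 2D int32 tensor stored row-major in one flat buffer can be addressed like this:

// Sketch only: addressing a dense 128 x 64 tensor of int32 in a flat buffer.
device int32_t* x_ptr(device uchar* root, int i, int j) {
  const int x_begin = 0;          // byte offset of x's region inside the buffer
  const int linear = i * 64 + j;  // row-major linearization of (i, j)
  return (device int32_t*)(root + x_begin) + linear;
}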
Thanks! I am trying to follow this approach as well :)
I see. Yeah, I believe if you just offer Metal the correct size of memory, Metal can treat it as a byte array and reinterpret it freely. One thing that blocked me is how to calculate this size without first generating code on the client side and then doing a sizeof.

Yeah, I'd really hope Apple releases the Metal backend as well (given the current quality of Metal's documentation, maybe this isn't their top priority lol).
Currently size calculation depends on LLVM (see taichi/backends/struct_llvm.cpp, line 232, commit d28ea5e). I think you can simply modify that part.
You are right that compilers (and sometimes the OS) have additional padding constraints on the actual data layout (which can get pretty complex: https://llvm.org/docs/LangRef.html#langref-datalayout). For now, we can just ignore it.
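For the size calculation itself, here is a rough host-side sketch of the recursion being discussed, using a simplified stand-in for the SNode tree (field names are invented; padding is ignored as suggested above):

#include <cstddef>
#include <vector>

// Simplified stand-in for an SNode; the real class has many more fields.
struct SNodeLite {
  enum class Type { root, dense, place } type;
  int n = 1;                        // number of cells (root/dense)
  std::size_t elem_size = 0;        // bytes of the placed scalar (place only)
  std::vector<SNodeLite*> children;
};

// stride() = bytes of one cell; the whole buffer needs stride(root) bytes.
std::size_t stride(const SNodeLite& s) {
  if (s.type == SNodeLite::Type::place) return s.elem_size;
  std::size_t cell = 0;
  for (const auto* ch : s.children) cell += stride(*ch);
  return cell * s.n;
}

This mirrors the stride constants in the generated Metal structs shown later in the thread, so host and device can agree on the layout without asking LLVM.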
After more thought, I feel like assuming non-nested dense structures is not really making things easier.
A quick update:
Ah nice, thanks!
I've also wondered about implementing a Vulkan backend, which would give portability to pretty much every platform, at least in theory. There may be some similar issues with using pointers in Vulkan shader code, but I think they can be resolved in a similar way, by sticking everything in (a few) chunks of global memory and doing pointer arithmetic by hand. Excited to see how this implementation goes :)
@kazimuth Vulkan sounds like a nice backend choice as well. My perception of real-time rendering APIs is probably outdated, and I'm not sure how well supported Vulkan is on modern GPUs. We need more investigation on this.
I think I've figured out the basics. I'm now putting together the part that transfers data between the host and Metal. Once that's finished, I'll give an update to see how it goes :)
Fantastic! One note on struct-fors: implementing struct-fors following the old x86/CUDA strategies on Metal can be pretty tricky - a series of element list generations has to be done, which is a lot of work and will harm performance. I will implement #378, which can also be used in the Metal backend, so no worries about implementing struct-fors.
So at https://github.com/k-ye/taichi/tree/ed1971f97ffa055ed3a7e4ee60111bd1cd8a2742, I've got the Metal kernel to work on a toy example :)

def test_basic():
    ti.cfg.arch = ti.metal
    ti.cfg.print_ir = True
    x = ti.var(ti.i32)
    n = 128

    @ti.layout
    def place():
        ti.root.dense(ti.i, n).place(x)

    @ti.kernel
    def func():
        for i in range(n):
            x[i] = i + 123

    func()
    xnp = x.to_numpy()
    for i in range(n):
        assert xnp[i] == i + 123
    print('passed')

A few things I'd like to point out:
Below are the generated Metal kernels. Note that I had to copy the generated SNode structs into each kernel's source.
#include <metal_stdlib>
using namespace metal;
namespace {
using byte = uchar;
struct S2 {
// place
constant static constexpr int stride = sizeof(int32_t);
S2(device byte* v) : val((device int32_t*)v) {}
device int32_t* val;
};
class S1_ch {
private:
device byte* addr_;
public:
S1_ch(device byte* a) : addr_(a) {}
S2 get0() {
return {addr_};
}
constant static constexpr int stride = S2::stride;
};
struct S1 {
// dense
constant static constexpr int n = 128;
constant static constexpr int stride = S1_ch::stride * n;
S1(device byte* a) : addr_(a) {}
S1_ch children(int i) {
return {addr_ + i * S1_ch::stride};
}
private:
device byte* addr_;
};
class S0_ch {
private:
device byte* addr_;
public:
S0_ch(device byte* a) : addr_(a) {}
S1 get0() {
return {addr_};
}
constant static constexpr int stride = S1::stride;
};
struct S0 {
// root
constant static constexpr int n = 1;
constant static constexpr int stride = S0_ch::stride * n;
S0(device byte* a) : addr_(a) {}
S0_ch children(int i) {
return {addr_ + i * S0_ch::stride};
}
private:
device byte* addr_;
};
} // namespace
kernel void k0001_func_c4_0__0(
device byte* addr [[buffer(0)]],
const uint utid_ [[thread_position_in_grid]]) {
if (!(0 <= utid_ && utid_ < 128)) return;
const int tmp0 = (int)utid_;
const int32_t tmp4(tmp0);
const int32_t tmp5 = 123;
const int32_t tmp6((tmp4) + (tmp5));
S0 tmp7(addr);
auto tmp8 = 0;
S0_ch tmp9 = tmp7.children(tmp8);
S1 tmp10 = tmp9.get0();
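  // (explanatory note, not emitted by the codegen) The next two lines recover
  // the dense index inside S1: mask off the low 7 bits since n = 128 = 2^7,
  // then turn it into a child index.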
auto tmp11 = (((0 + tmp4) >> 0) & ((1 << 7) - 1));
auto tmp12 = (0) * 128 + tmp11;
S1_ch tmp13 = tmp10.children(tmp12);
device int32_t* tmp14 = tmp13.get0().val;
*tmp14 = tmp6;
}
#include <metal_stdlib>
using namespace metal;
namespace {
using byte = uchar;
struct S2 {
// place
constant static constexpr int stride = sizeof(int32_t);
S2(device byte* v) : val((device int32_t*)v) {}
device int32_t* val;
};
class S1_ch {
private:
device byte* addr_;
public:
S1_ch(device byte* a) : addr_(a) {}
S2 get0() {
return {addr_};
}
constant static constexpr int stride = S2::stride;
};
struct S1 {
// dense
constant static constexpr int n = 128;
constant static constexpr int stride = S1_ch::stride * n;
S1(device byte* a) : addr_(a) {}
S1_ch children(int i) {
return {addr_ + i * S1_ch::stride};
}
private:
device byte* addr_;
};
class S0_ch {
private:
device byte* addr_;
public:
S0_ch(device byte* a) : addr_(a) {}
S1 get0() {
return {addr_};
}
constant static constexpr int stride = S1::stride;
};
struct S0 {
// root
constant static constexpr int n = 1;
constant static constexpr int stride = S0_ch::stride * n;
S0(device byte* a) : addr_(a) {}
S0_ch children(int i) {
return {addr_ + i * S0_ch::stride};
}
private:
device byte* addr_;
};
} // namespace
namespace {
class k0002_tensor_to_ext_arr_c8_0__args {
public:
explicit k0002_tensor_to_ext_arr_c8_0__args(device byte* addr) : addr_(addr) {}
device int32_t* arg0() {
// array, size=512 B
return (device int32_t*)(addr_ + 0);
}
private:
device byte* addr_;
};
} // namespace
kernel void k0002_tensor_to_ext_arr_c8_0__0(
device byte* addr [[buffer(0)]],
device byte* args_addr [[buffer(1)]],
const uint utid_ [[thread_position_in_grid]]) {
k0002_tensor_to_ext_arr_c8_0__args args_ctx_(args_addr);
if (utid_ >= 128) return;
int tid_ = (int)utid_;
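  // (explanatory note, not emitted by the codegen) tid_ is decomposed into the
  // tensor indices; for this 1D tensor the divisor is 1, so tmp0 is simply the
  // element index.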
const int tmp0 = (tid_ / 1);
tid_ = (tid_ % 1);
const int32_t tmp2(tmp0);
S0 tmp3(addr);
auto tmp4 = 0;
S0_ch tmp5 = tmp3.children(tmp4);
S1 tmp6 = tmp5.get0();
auto tmp7 = (((0 + tmp2) >> 0) & ((1 << 7) - 1));
auto tmp8 = (0) * 128 + tmp7;
S1_ch tmp9 = tmp6.children(tmp8);
device int32_t* tmp10 = tmp9.get0().val;
int32_t tmp11 = *tmp10;
device int32_t *tmp12 = args_ctx_.arg0();
device int32_t *tmp13 = (tmp12 + tmp2);
*tmp13 = tmp11;
}
That will be fantastic! Yeah, I had to look into SNode's internals for this.
Amazing!!!
Task offloading is to ensure that kernels which are not a single for-loop (e.g. several loops, or serial statements around a loop) are correctly generated. For now, if we assume all kernels are pure single for-loops, then we are good.
Yes, that's a host-side kernel. We make use of CUDA unified memory, so a host-side read of GPU memory results in a page fault, and then the OS automatically copies the GPU page to CPU memory. This design may not work on every platform, though. For Metal, if there's no unified memory support, maybe you can generate a device-side kernel for now. (However, GPU kernel launches can be rather slow, so maybe it's better to batch all the read/write requests, which needs a refactoring of the system...)
Cool! You are doing the preprocessor's job and we don't have to worry about specifying the include directory :-)
Actually you only need to calculate two things: the stride (cell size) of each SNode, and the byte offset of each child within its parent cell. Your implementation of the generated structs already computes both.
Ah right, I do have a check to see if the kernel consists purely of for loops...
Thanks for the explanation :) I think that's pretty much what I did (I named it stride):

class S1_ch {
private:
device byte* addr_;
public:
S1_ch(device byte* a) : addr_(a) {}
S2 get0() {
return {addr_};
}
S3 get1() {
return {addr_ + (S2::stride)};
}
S4 get2() {
return {addr_ + (S2::stride + S3::stride)};
}
S5 get3() {
return {addr_ + (S2::stride + S3::stride + S4::stride)};
}
constant static constexpr int stride = S2::stride + S3::stride + S4::stride + S5::stride;
};
struct S1 {
// dense
constant static constexpr int n = 2048;
constant static constexpr int stride = S1_ch::stride * n;
S1(device byte* a) : addr_(a) {}
S1_ch children(int i) {
return {addr_ + i * S1_ch::stride};
}
private:
device byte* addr_;
};

The struct-for over a 2D tensor looks like

for i, j in x:
  # ...

At the Metal level, this is still compiled to a 1D kernel. I then used the SNode extractors to recover i and j from the linear thread index.

That's because I copied them from the legacy codegen XD. AT LAST, I CAN RUN mpm88!
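For reference, a minimal sketch (illustrative names and sizes, not the actual generated code) of how a 2D struct-for over a dense 128 x 64 tensor can be launched as a 1D Metal kernel, with the indices recovered per thread:

kernel void k_struct_for_2d(device uchar* root [[buffer(0)]],
                            const uint utid_ [[thread_position_in_grid]]) {
  const int total = 128 * 64;
  if (utid_ >= (uint)total) return;
  const int linear = (int)utid_;
  const int i = linear / 64;  // slower-varying index
  const int j = linear % 64;  // faster-varying index
  // ... then index into the SNode structs with (i, j), as in the kernels above ...
}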
I see. That makes a lot of sense. Yes, #378 will compile multi-dimensional struct-for loops into range-based ones, so you only have to implement range-fors for now. It's wonderful that you can already run a modified version of mpm88! The GUI system may be the bottleneck of mpm88, where there are a lot of pybind11 function calls...
I see. So if you're running this on CUDA, is the FPS similar to what you got on CPU...?
Metal does have unified memory. IIUC, buffers created in the shared storage mode are visible to both the CPU and the GPU without explicit copies.
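For what it's worth, a sketch of the shared-storage idea using the metal-cpp wrapper (this branch wraps the Objective-C runtime instead, so this is only illustrative):

#include <Metal/Metal.hpp>  // metal-cpp; one TU must also define the *_PRIVATE_IMPLEMENTATION macros
#include <cstddef>
#include <cstdint>

int main() {
  MTL::Device* device = MTL::CreateSystemDefaultDevice();
  const std::size_t size = 128 * sizeof(int32_t);
  // Shared storage: CPU and GPU address the same memory, so no copy-back kernel is needed.
  MTL::Buffer* buf = device->newBuffer(size, MTL::ResourceStorageModeShared);
  // ... bind buf as [[buffer(0)]], encode and commit the compute kernel ...
  int32_t* host_view = static_cast<int32_t*>(buf->contents());
  (void)host_view;  // host-side reads/writes go straight to the same storage
  return 0;
}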
That depends on the computation/visualization workload ratio... For mpm99, yes: since visualization, which includes calls through pybind11, is too slow, on my end the CPU gets 16 FPS and the GPU gets 20, although my GPU has 20x more FLOPs than my CPUs. This will be resolved if we batch the GUI drawing calls. It's great that Metal also has unified memory!
Ah OK :(
Yeah, same feeling here. But I think we have all the information on the host side to do pointer arithmetic. I'll give that a shot...
After caching the compiled kernels, the FPS for mpm88 improved.
Awesome! I need a couple more hours, but you can also tweak the
#378 is now implemented. You need to add a pass for this demotion (see taichi/backends/codegen_cuda.cpp, line 1006, commit f2423d1).
#411 done. GUI 30x faster. Now your bottleneck should be mostly simulation instead of rendering.
Great! They are both faster now, but Metal became a bit slower than the CPU kernels (12 FPS vs 13 FPS) 😂 Hmm, I need some more profiling to see what's going on here...
Profiling data when simulating 81,920 particles in MPM88
We can see that Metal improved the average performance by about 30%, and its variance is much smaller.
On my end
Profiling using
Thanks, this is my number using the same code:
I guess
I guess there is probably a way to support SNodes beyond dense on Metal. On the other hand, if we use the CPU for kernels that use non-dense SNodes, then I'm afraid that almost all kernels would fall back to the CPU in practice :-/
TODO:
A small update: I believe the memory layout between Metal's structs and LLVM's is the same, iff dense SNodes are used.
Superseded by #593.
Is your feature request related to a problem? Please describe.
I'd like to add a Metal backend to taichi, so as to allow Mac users to enjoy the GPU acceleration, too.
Describe the solution you'd like
Halide already supports a Metal backend, and there is a lot to learn from its codebase. Specifically, they used source-to-source codegen to translate Halide to Metal compute kernels (not LLVM IR). They also wrapped the Metal APIs into C++ via the objc-runtime APIs (taichi is also using this approach for its GUI).
After talking to @yuanming-hu, I think we can start by supporting dense first. This requires two things, one of which is atomic operations (e.g. atomic_add); Metal supports both fairly well.
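As a sanity check of the atomics part, a minimal hand-written Metal kernel (names are made up) that accumulates values with an atomic add looks like:

#include <metal_stdlib>
using namespace metal;

// Each thread atomically adds its input value into a single accumulator.
kernel void k_sum(device const int32_t* values [[buffer(0)]],
                  device atomic_int* result [[buffer(1)]],
                  const uint tid [[thread_position_in_grid]]) {
  // Relaxed ordering is enough for a plain sum.
  atomic_fetch_add_explicit(result, values[tid], memory_order_relaxed);
}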
I also need to figure out if the memory returned by the existing/next-gen memory allocator can be used by Metal kernels.
Describe alternatives you've considered
Supporting OpenGL compute shaders may be appealing to a broader audience. However, due to my dev environment setup and working experience, I feel more comfortable working on Metal. If this works, it should be useful for helping design the OpenGL backend as well.
Additional context
Some references