Experimental Metal backend #396

Closed · k-ye opened this issue Jan 20, 2020 · 36 comments

Labels: feature request (Suggest an idea on this project), mac (Mac OS X platform)

@k-ye (Member) commented Jan 20, 2020

Is your feature request related to a problem? Please describe.

I'd like to add a Metal backend to taichi, so that Mac users can enjoy GPU acceleration, too.

Describe the solution you'd like

Halide already supports a Metal backend, and its codebase has a lot we can learn from. Specifically, it uses source-to-source codegen to translate Halide into Metal compute kernels (not LLVM IR). It also wraps the Metal APIs in C++ via the objc-runtime APIs (taichi already uses this approach for its GUI).

After talking to @yuanming-hu, I think we can start by supporting dense SNodes first. This requires two things:

  • scattered read/write (access arbitrary memory address in the buffer)
  • atomic operations (e.g. atomic_add)

Metal supports both fairly well.
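
For concreteness, here is a minimal Metal kernel sketch of these two requirements (editor's illustration with made-up buffer and kernel names, not taichi-generated code):

#include <metal_stdlib>
using namespace metal;

// Each thread writes to a data-dependent location (scattered write) and
// accumulates into a bin via an atomic add; one index entry per thread.
kernel void scatter_and_atomic_demo(
    device int32_t *data          [[buffer(0)]],
    device atomic_int *bins       [[buffer(1)]],
    const device int32_t *indices [[buffer(2)]],
    const uint tid [[thread_position_in_grid]]) {
  // Scattered write: the destination is not contiguous in tid.
  data[indices[tid]] = (int32_t)tid;
  // Atomic accumulation, the building block for taichi's atomic_add.
  atomic_fetch_add_explicit(&bins[indices[tid] % 16], 1, memory_order_relaxed);
}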

I also need to figure out if the memory returned by the existing/next-gen memory allocator can be used by Metal kernels.

Describe alternatives you've considered

Supporting OpenGL compute shaders may appeal to a broader audience. However, given my dev environment setup and prior working experience, I feel more comfortable working on Metal. If this works out, it should also be useful for designing the OpenGL backend.

Additional context

Some references

  • Halide
  • mtlpp: This is another library that exposes Metal APIs to C++, but it's implemented in Obj-C.
@k-ye added the feature request (Suggest an idea on this project) label on Jan 20, 2020
@k-ye (Member, Author) commented Jan 25, 2020

Some progress so far.

I'm trying to keep all the codegen on the Metal kernel side, without generating host-side C++ code. One thing I found is that this makes it hard for a Metal struct to have embedded arrays. For example,

s1 = root.dense(ti.i, n)
s1.place(x)
s1.dense(ti.i, m).place(y)

This logically maps to a struct like this:

struct s1 {
  float x;
  float y[m];
};

Here, scalar x and array y are embedded in a contiguous chunk of memory.

However, my current approach can only produce something like

struct s1 {
  float& x;
  float* y;
};

where x and y are sub-ranges within the Metal buffers they belong to, respectively. That is, we have a metal buffer X of size n and Y of size n * m. A struct s1 has x = X[i] and y = Y[i * m, (i + 1) * m). In such cases, we only have SoA...

For variables placed at the same level, I believe I can still have either SoA or AoS, depending on how it's laid out. E.g.

root.dense(ti.i, n).place(x, y)

I can generate a struct s1 { float x; float y; }; and use just one Metal buffer to hold s1.

However, this is just me doing the compilation by hand. Let me see if I run into more problems when writing the compiler for ti.layout...
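
To make the two layouts concrete, here is a hand-written sketch of the address computations (editor's illustration with assumed names; it also assumes Y is stored right after X in a single buffer, which is not exactly the two-buffer setup above):

// AoS: root.dense(ti.i, n).place(x, y) -> one buffer of interleaved {x, y}.
device float *aos_x(device uchar *buf, int i) {
  return (device float *)(buf + i * 2 * sizeof(float));
}
device float *aos_y(device uchar *buf, int i) {
  return (device float *)(buf + i * 2 * sizeof(float) + sizeof(float));
}

// SoA (the nested case): x lives in X[0, n), y in Y[0, n * m).
device float *soa_x(device uchar *buf, int i) {
  return (device float *)buf + i;              // X[i]
}
device float *soa_y(device uchar *buf, int n, int m, int i) {
  return (device float *)buf + n + i * m;      // start of Y[i*m, (i+1)*m)
}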

@yuanming-hu (Member) commented:

Would that be easier if we simply allocate a huge buffer for the whole data structure tree, and then implement functions that map tensor + indices into a pointer (i.e. manually generate addressing computations instead of relying on structs)?

Also, it's sad that although Metal relies heavily on LLVM, Apple doesn't release a Metal backend for LLVM... https://worthdoingbadly.com/metalbitcode/ https://twitter.com/icculus/status/721893213452312576

@yuanming-hu (Member) commented Jan 25, 2020

If we only have (non-nested) dense data structures, then every tensor element, say, x[i, j, k] will have an address in the form of base + i * stride_i + j * stride_j + k * stride_k. I can help generate the base and strides (as well as the total data structure size), if this sounds easier than creating complex structs. AOS/SOA can still be supported, but we won't have fancy dense structures such as nested ones.
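
As a rough sketch of that addressing scheme (illustrative names only, not taichi's actual code):

// Address of x[i, j, k] for a non-nested dense tensor; base and the strides
// would be generated on the host.
device float *element_ptr(device uchar *base, int i, int j, int k,
                          int stride_i, int stride_j, int stride_k) {
  return (device float *)(base + i * stride_i + j * stride_j + k * stride_k);
}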

@k-ye (Member, Author) commented Jan 26, 2020

Would that be easier if we simply allocate a huge buffer for the whole data structure tree, and then implement functions that map tensor + indices into a pointer

Thanks! I am trying to follow this approach as well :) The structs I created didn't own the memory; they were basically used to flatten the logical index into a buffer position (base + stride).

If we only have (non-nested) dense data structures, then every tensor element, say, x[i, j, k] will have an address in the form of base + i * stride_i + j * stride_j + k * stride_k. I can help generate the base and strides (as well as the total data structure size), if this sounds easier than creating complex structs. AOS/SOA can still be supported, but we won't have fancy dense structures such as nested ones.

I see. Yeah, I believe that if you give Metal a buffer of the correct size, Metal can treat it as a byte array and reinterpret it freely. One thing that blocked me was how to calculate this size without generating code on the client side and then doing a sizeof(). For example, for a layout of float x[n], float y[m*n], is the memory really just (n + (m * n)) * sizeof(float), or are there additional requirements? Now I think that if I follow your suggestion and don't create all those intermediate structs, this is an accurate estimate. I'll give it a try :)


Yeah, I'd really hope Apple would release the Metal backend as well (given the current quality of Metal's documentation, maybe this isn't their top priority lol).

@yuanming-hu (Member) commented Jan 26, 2020

Currently size calculation depends on LLVM:

auto root_size =

I think you can simply modify struct_llvm.cpp and take that result.

For example, for a layout of float x[n], float y[m*n], is the memory really just (n + (m * n)) * sizeof(float), or are there additional requirements?

You are right that compilers (and sometimes the OS) have additional padding constraints on the actual data layout (which can get pretty complex: https://llvm.org/docs/LangRef.html#langref-datalayout). For now, we can just ignore it.

@yuanming-hu (Member) commented:

After more thought, I feel like assuming

x[i, j, k] will have an address in the form of base + i * stride_i + j * stride_j + k * stride_k

is not really making things easier, given that x[i, j, k] will be broken down into micro-access ops... How about this: I'll help with the data access part (LinearizeStmt, IntegerOffsetStmt, GetChStmt, SNodeLookupStmt) of the Metal backend, which is mainly address computation.

@yuanming-hu (Member) commented:

A quick update: enum Arch is now extended to support several potential new backends: https://github.com/taichi-dev/taichi/blob/75992afe5b50067d5a70f7712387a70967e656cb/taichi/inc/archs.inc.h

@k-ye (Member, Author) commented Jan 28, 2020

Ah nice, thanks!

@kazimuth commented Jan 29, 2020

Supporting OpenGL compute shaders may appeal to a broader audience. However, given my dev environment setup and prior working experience, I feel more comfortable working on Metal. If this works out, it should also be useful for designing the OpenGL backend.

I've also wondered about implementing a Vulkan backend, which would give portability to pretty much every platform, at least in theory. There may be some similar issues with using pointers in Vulkan shader code, but I think they can be resolved in a similar way, by sticking everything in (a few) chunks of global memory and doing pointer arithmetic by hand. Excited to see how this implementation goes :)

@yuanming-hu (Member) commented:

@kazimuth Vulkan sounds like a nice backend choice as well. My perception of real-time rendering APIs is probably outdated, and I'm not sure how well Vulkan is supported on modern GPUs. We need to investigate this more.

@k-ye (Member, Author) commented Jan 29, 2020

How about this: I'll help with the data access part (LinearizeStmt, IntegerOffsetStmt, GetChStmt, SNodeLookupStmt) of the Metal backend, which is mainly address computation.

I think I've figured out the basics of SNode (at least for dense...). What I did is basically generating the SNode and SNode_ch structs into the Metal kernels (struct_metal.h); instead of actually owning the memory, they just do pointer arithmetic on the amalgamated global memory. Then the set of micro-access ops works on them just fine (e.g. codegen_metal.h).

I'm now working on the part that transfers data between the host and Metal. Once that's finished, I'll post an update on how it goes :)

@yuanming-hu (Member) commented:

Fantastic! One note on struct-fors:
when using dense data structures only, struct-fors can be demoted into dense range-fors.

Implementing struct-fors following the old x86/CUDA strategies on Metal can be pretty tricky - a series of element list generations has to be done, which is a lot of work and will harm performance. I will implement #378, which can also be used in the Metal backend, so no worries about implementing struct-fors.

@k-ye (Member, Author) commented Jan 29, 2020

So at https://github.com/k-ye/taichi/tree/ed1971f97ffa055ed3a7e4ee60111bd1cd8a2742, I've got the Metal kernel to work on a toy example :)

def test_basic():
  ti.cfg.arch = ti.metal
  ti.cfg.print_ir = True
  
  x = ti.var(ti.i32)
  n = 128

  @ti.layout
  def place():
    ti.root.dense(ti.i, n).place(x)

  @ti.kernel
  def func():
    for i in range(n):
      x[i] = i + 123
  
  func()

  xnp = x.to_numpy()
  for i in range(n):
    assert xnp[i] == i + 123
  print('passed')

A few things I'd like to point out:

  • I've disabled the task-offloading pass for now. It seems simpler to me to generate kernels directly from for-loops...
  • operator[] is not working yet. For now, to_numpy() can be used as a workaround. For to_numpy(), I saw that the data copy between host and device happens here. But I don't really know where that memory copy happens in get_snode_reader(). Is this purely a host-side kernel?

Below are the generated Metal kernels. Note that I had to copy the S* node definitions into every Metal kernel, because I'm using newLibraryWithSource:options:error:, which doesn't allow the generated kernel to include any headers other than <metal_stdlib>.

  • func:
#include <metal_stdlib>
using namespace metal;

namespace {
using byte = uchar;

struct S2 {
  // place
  constant static constexpr int stride = sizeof(int32_t);
  S2(device byte* v) : val((device int32_t*)v) {}
  device int32_t* val;
};

class S1_ch {
 private:
  device byte* addr_;
 public:
  S1_ch(device byte* a) : addr_(a) {}
  S2 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S2::stride;
};

struct S1 {
  // dense
  constant static constexpr int n = 128;
  constant static constexpr int stride = S1_ch::stride * n;
  S1(device byte* a) : addr_(a) {}
  S1_ch children(int i) {
    return {addr_ + i * S1_ch::stride};
  }
 private:
  device byte* addr_;
};

class S0_ch {
 private:
  device byte* addr_;
 public:
  S0_ch(device byte* a) : addr_(a) {}
  S1 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S1::stride;
};

struct S0 {
  // root
  constant static constexpr int n = 1;
  constant static constexpr int stride = S0_ch::stride * n;
  S0(device byte* a) : addr_(a) {}
  S0_ch children(int i) {
    return {addr_ + i * S0_ch::stride};
  }
 private:
  device byte* addr_;
};

}  // namespace

kernel void k0001_func_c4_0__0(
    device byte* addr [[buffer(0)]],
    const uint utid_ [[thread_position_in_grid]]) {
  if (!(0 <= utid_ && utid_ < 128)) return;
  const int tmp0 = (int)utid_;
  const int32_t tmp4(tmp0);
  const int32_t tmp5 = 123;
  const int32_t tmp6((tmp4) + (tmp5));
  S0 tmp7(addr);
  auto tmp8 = 0;
  S0_ch tmp9 = tmp7.children(tmp8);
  S1 tmp10 = tmp9.get0();
  auto tmp11 = (((0 + tmp4) >> 0) & ((1 << 7) - 1));
  auto tmp12 = (0) * 128 + tmp11;
  S1_ch tmp13 = tmp10.children(tmp12);
  device int32_t* tmp14 = tmp13.get0().val;
  *tmp14 = tmp6;
}
  • to_numpy():
#include <metal_stdlib>
using namespace metal;

namespace {
using byte = uchar;

struct S2 {
  // place
  constant static constexpr int stride = sizeof(int32_t);
  S2(device byte* v) : val((device int32_t*)v) {}
  device int32_t* val;
};

class S1_ch {
 private:
  device byte* addr_;
 public:
  S1_ch(device byte* a) : addr_(a) {}
  S2 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S2::stride;
};

struct S1 {
  // dense
  constant static constexpr int n = 128;
  constant static constexpr int stride = S1_ch::stride * n;
  S1(device byte* a) : addr_(a) {}
  S1_ch children(int i) {
    return {addr_ + i * S1_ch::stride};
  }
 private:
  device byte* addr_;
};

class S0_ch {
 private:
  device byte* addr_;
 public:
  S0_ch(device byte* a) : addr_(a) {}
  S1 get0() {
    return {addr_};
  }
  constant static constexpr int stride = S1::stride;
};

struct S0 {
  // root
  constant static constexpr int n = 1;
  constant static constexpr int stride = S0_ch::stride * n;
  S0(device byte* a) : addr_(a) {}
  S0_ch children(int i) {
    return {addr_ + i * S0_ch::stride};
  }
 private:
  device byte* addr_;
};

}  // namespace

namespace {
class k0002_tensor_to_ext_arr_c8_0__args {
 public:
  explicit k0002_tensor_to_ext_arr_c8_0__args(device byte* addr) : addr_(addr) {}
  device int32_t* arg0() {
    // array, size=512 B
    return (device int32_t*)(addr_ + 0);
  }
 private:
  device byte* addr_;
};
}  // namespace

kernel void k0002_tensor_to_ext_arr_c8_0__0(
    device byte* addr [[buffer(0)]],
    device byte* args_addr [[buffer(1)]],
    const uint utid_ [[thread_position_in_grid]]) {
  k0002_tensor_to_ext_arr_c8_0__args args_ctx_(args_addr);
  if (utid_ >= 128) return;
  int tid_ = (int)utid_;
  const int tmp0 = (tid_ / 1);
  tid_ = (tid_ % 1);
  const int32_t tmp2(tmp0);
  S0 tmp3(addr);
  auto tmp4 = 0;
  S0_ch tmp5 = tmp3.children(tmp4);
  S1 tmp6 = tmp5.get0();
  auto tmp7 = (((0 + tmp2) >> 0) & ((1 << 7) - 1));
  auto tmp8 = (0) * 128 + tmp7;
  S1_ch tmp9 = tmp6.children(tmp8);
  device int32_t* tmp10 = tmp9.get0().val;
  int32_t tmp11 = *tmp10;
  device int32_t *tmp12 = args_ctx_.arg0();
  device int32_t *tmp13 = (tmp12 + tmp2);
  *tmp13 = tmp11;
}

@k-ye (Member, Author) commented Jan 29, 2020

Implementing struct-fors following the old x86/CUDA strategies on Metal can be pretty tricky - a series of element list generations has to be done, which is a lot of work and will harm performance. I will implement #378, which can also be used in the Metal backend, so no worries about implementing struct-fors.

That will be fantastic! Yeah, I had to look into SNode's extractors to figure out the size of each dimension for now...

@yuanming-hu (Member) commented:

Amazing!!!

I've disabled the task-offloading pass for now. It seems simpler to me to generate kernels directly from for-loops...

Task offloading ensures that kernels of the forms

@ti.kernel
def test():
  for i in range(10):
    ...
  for j in range(30):
    ...

and

@ti.kernel
def test():
  a = 0
  for i in range(10):
    a += i

are correctly generated. For now, if we assume all kernels are pure single for-loops, then we are good.

But I don't really know where that memory copy happens in get_snode_reader(). Is this purely a host-side kernel?

Yes, that's a host-side kernel. We make use of CUDA unified memory, so a host-side read of GPU memory results in a page fault, and the OS then automatically copies the GPU page to CPU memory. This design may not work on every platform, though. For Metal, if there's no unified memory support, maybe you can generate a device-side kernel for now. (However, GPU kernel launches can be rather slow, so maybe it's better to batch all the read/write requests, which needs a refactoring of the system...)

Note that I had to copy the S* node definitions into every Metal kernel

Cool! You are doing the preprocessor's job and we don't have to worry about specifying the include directory :-)

I had to look into SNode's extractors to figure out the size of each dimension for now

Actually you only need to calculate two things:

  • In SNodeLookupStmt, return stmt->input_snode_pointer + x * sizeof(stmt->snode_element_size), where sizeof(stmt->snode_element_size) is the size of the SNode's element.
  • In GetChStmt, return stmt->input_element_pointer + stmt->ch_component_offset. Each element of a SNode can have multiple components, and stmt->ch_component_offset is the offset of that component.

For example, if you have ti.root.dense(ti.i, 1024).place(x:i32, y:i32, z:i32), then sizeof(snode_element_size) for the dense node should be 12 bytes, and ch_component_offset for x, y, and z within a dense node element (which is something like struct {i32, i32, i32}) should be 0, 4, 8 bytes.
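
In Metal-style pseudocode, the two steps for that example would look roughly like this (an editor's sketch under the assumptions above, not the actual generated code):

// ti.root.dense(ti.i, 1024).place(x: i32, y: i32, z: i32)
// Each dense element is laid out like struct { i32 x; i32 y; i32 z; } (12 bytes).
device uchar *snode_lookup(device uchar *dense_base, int i) {
  // SNodeLookupStmt: jump to element i of the dense node.
  return dense_base + i * 12;
}
device int32_t *get_ch(device uchar *element, int ch_component_offset) {
  // GetChStmt: ch_component_offset is 0 / 4 / 8 bytes for x / y / z.
  return (device int32_t *)(element + ch_component_offset);
}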

Your implementation of OffsetAndExtractBitsStmt, IntegerOffsetStmt, and SNodeLookupStmt codegen looks good to me!

@k-ye (Member, Author) commented Jan 29, 2020

For now, if we assume all kernels are pure single for-loops, then we are good.

Ah right, I do have a check to see whether the kernel consists purely of for loops...

Actually you only need to calculate two things:

Thanks for the explanation :) I think that's pretty much what I did (I named it stride). For example, below is a generated kernel struct.

class S1_ch {
 private:
  device byte* addr_;
 public:
  S1_ch(device byte* a) : addr_(a) {}
  S2 get0() {
    return {addr_};
  }
  S3 get1() {
    return {addr_ + (S2::stride)};
  }
  S4 get2() {
    return {addr_ + (S2::stride + S3::stride)};
  }
  S5 get3() {
    return {addr_ + (S2::stride + S3::stride + S4::stride)};
  }
  constant static constexpr int stride = S2::stride + S3::stride + S4::stride + S5::stride;
};

struct S1 {
  // dense
  constant static constexpr int n = 2048;
  constant static constexpr int stride = S1_ch::stride * n;
  S1(device byte* a) : addr_(a) {}
  S1_ch children(int i) {
    return {addr_ + i * S1_ch::stride};
  }
 private:
  device byte* addr_;
};

The extractors' info is used to deal with multi-dimensional struct-fors. For example:

for i, j in x:
  # ...

At the Metal level, this is still compiled to a 1D kernel. I then use the extractors' info for each index to un-linearize the thread ID into i, j. I guess this is probably what #378 plans to do?
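
Roughly, the un-linearization looks like this (editor's sketch with assumed sizes, not the generated code):

// A struct-for over an n-by-m dense tensor, launched as a 1D grid of n * m threads.
kernel void struct_for_2d_demo(const uint utid_ [[thread_position_in_grid]]) {
  const int n = 128, m = 64;          // would come from the SNode extractors
  if (utid_ >= uint(n * m)) return;
  const int i = int(utid_) / m;       // un-linearize the 1D thread id
  const int j = int(utid_) % m;
  // ... loop body uses (i, j), just like `for i, j in x` ...
}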

Your implementation of OffsetAndExtractBitsStmt, IntegerOffsetStmt, and SNodeLookupStmt codegen looks good to me!

That's because I copied them from the legacy codegen XD.


AT LAST, I CAN RUN mpm88.py VIA METAL NOW!! (With some initialization moved to the host side, because operator[] is not supported yet.) The sad thing is I saw absolutely no performance gain. That said, my code currently compiles the kernels (at the Metal level, not the taichi level) on every invocation! I'll try to compile them once and cache the result...
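
The caching I have in mind is nothing fancy; roughly something like this on the host side (editor's sketch with hypothetical names; CompiledMtlKernel stands in for whatever wraps the compiled pipeline state):

#include <memory>
#include <string>
#include <unordered_map>

struct CompiledMtlKernel { /* would hold the compiled Metal pipeline state */ };

class KernelCache {
 public:
  CompiledMtlKernel *lookup_or_compile(const std::string &kernel_name,
                                       const std::string &mtl_source) {
    auto it = cache_.find(kernel_name);
    if (it != cache_.end()) return it->second.get();  // reuse the compiled kernel
    auto compiled = compile_from_source(mtl_source);  // expensive; done only once
    auto *ptr = compiled.get();
    cache_.emplace(kernel_name, std::move(compiled));
    return ptr;
  }

 private:
  std::unique_ptr<CompiledMtlKernel> compile_from_source(const std::string &src) {
    // Would go through newLibraryWithSource:options:error: etc. via the objc runtime.
    (void)src;
    return std::make_unique<CompiledMtlKernel>();
  }
  std::unordered_map<std::string, std::unique_ptr<CompiledMtlKernel>> cache_;
};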

@yuanming-hu (Member) commented:

I see. That makes a lot of sense. Yes, #378 will compile multi-dimensional struct-for loops into range-based ones, so you only have to implement range-fors for now.

It's wonderful that you can already run a modified version of mpm88!

The GUI system may be the bottleneck of mpm88, where there are a lot of pybind11 function calls...
#411

@k-ye (Member, Author) commented Jan 30, 2020

The GUI system may be the bottleneck of mpm88, where there are a lot of pybind11 function calls...

I see. So if you're running this on CUDA, is the FPS similar to what you get on CPU...?

Yes, that's a host-side kernel. We make use of CUDA unified memory, so a host-side read of GPU memory results in a page fault, and the OS then automatically copies the GPU page to CPU memory.

Metal does have unified memory. IIUC, load_if_ptr((snode->expr)[indices]) returns the correct memory address? Since I didn't generate SNodes on the host side, that won't work for now. I can try doing the same pointer arithmetic on the host side.

@yuanming-hu (Member) commented:

So if you're running this on CUDA, is the FPS similar to what you get on CPU...?

That depends on the computation/visualization workload ratio... For mpm99, yes: since visualization, which includes a call through pybind11, is too slow, on my end the CPU gets 16 FPS and the GPU gets 20, although my GPU has 20x more FLOPs than my CPUs. This will be resolved once we batch the draw calls.

It's great that Metal also has unified memory! load_if_ptr will return you a pointer to the tensor element, yet I'm not sure how helpful that is if you are not implementing a corresponding x64 codegen. One solution that will work, although probably not very efficiently, is to do everything on the GPU and invoke the Metal equivalent of cudaMemcpy(device_to_host).

@k-ye (Member, Author) commented Jan 30, 2020

For mpm99, yes: since visualization, which includes a call through pybind11, is too slow, on my end the CPU gets 16 FPS and the GPU gets 20

Ah OK :( For mpm88, I got around 4 FPS on both CPU and Metal 😂 ... I'm modifying the code so that the Metal kernels are only compiled once, but I guess that won't make much of a difference now...

One solution that will work, although probably not very efficiently, is to do everything on the GPU and invoke the Metal equivalent of cudaMemcpy(device_to_host).

Yeah, same feeling here. But I think we have all the information on the host side to do pointer arithmetic. I'll give that a shot...

@yuanming-hu (Member) commented:

I'll try to do #411 after #378 tonight.

@k-ye (Member, Author) commented Jan 30, 2020

After caching the compiled kernels, mpm88 goes up to 7~8 FPS. Really looking forward to the batched GUI!

@yuanming-hu (Member) commented:

Awesome! I need a couple more hours, but you can also tweak the quality parameter (say, to 4) so that there is more work to do and you can compare CPU/GPU performance better.

@yuanming-hu (Member) commented:

#378 is now implemented. You need to add a pass for this demotion:

irpass::demote_dense_struct_fors(ir);

@yuanming-hu (Member) commented:

#411 done. GUI 30x faster. Now your bottleneck should be mostly simulation instead of rendering.

@k-ye (Member, Author) commented Jan 30, 2020

Great! They are both faster now, but Metal has become a bit slower than the CPU kernels (12 FPS vs 13 FPS) 😂 Hmm, I need to do some more profiling to see what's going on here...

@k-ye (Member, Author) commented Jan 30, 2020

Profiling data when simulating 81,920 particles in MPM88

CPU
[ 89.93%]                  substep_c4_0_     min   3.821 ms   avg   4.232 ms    max  49.045 ms   total  21.442 s [   5066x]
[  6.17%]                circles_batched     min  11.288 ms   avg  14.564 ms    max 117.464 ms   total   1.471 s [    101x]
[ ... ]

Metal
[ 78.61%]                  substep_c4_0_     min   1.991 ms   avg   2.943 ms    max   5.390 ms   total  14.909 s [   5066x]
[  9.71%]                circles_batched     min  12.897 ms   avg  18.228 ms    max 129.527 ms   total   1.841 s [    101x]
[ ... ]

We can see that Metal improved the avg performance by about 30%, and its variance is much smaller.

@yuanming-hu (Member) commented Jan 30, 2020

On my end

  • CPU 335ms (i7 7700K, multithreaded)
  • GPU 13 ms (GTX 1080 Ti, CUDA)

Profiling using

quality=3

...

for frame in range(20000):
  import time
  t = time.time()
  for s in range(int(2e-3 // dt)):
    substep()
  ti.sync()
  print((time.time() - t) * 1000)

@k-ye (Member, Author) commented Jan 30, 2020

Thanks, this is my number using the same code:

  • CPU 174ms (i9-9880H)
  • Metal 87ms (AMD Radeon Pro 5500M). Interestingly, a few times this number went down to 23ms without me changing anything...

I guess ti.sync translates to cudaDeviceSynchronize()? Right now the Metal kernels sync automatically on every ti.kernel call. Maybe I should also have users call sync manually and see if there's further improvement.

@k-ye (Member, Author) commented Feb 7, 2020

I guess one way to support SNodes beyond dense is to have a mixed CPU + Metal runtime... My feeling is that if we get the memory layout right and inject synchronization properly, this should be doable?

On the other hand, if we use the CPU for kernels that use non-dense SNodes, then I'm afraid that almost all kernels would fall back to the CPU in practice :-/

@k-ye (Member, Author) commented Feb 8, 2020

TODO:

@k-ye (Member, Author) commented Feb 10, 2020

A small update: if I use Runtime::root as Metal's root memory, then a host-side snode_reader kernel works out of the box! No additional Metal kernel launches are required.

I believe the memory layouts of the Metal structs and the LLVM structs are the same, iff only dense SNodes are used.

@k-ye (Member, Author) commented Mar 14, 2020

Superseded by #593
