
Major Metal compiler/runtime refactors #457

Merged
yuanming-hu merged 2 commits into taichi-dev:master from k-ye:mtlcompile on Feb 14, 2020

Conversation

k-ye
Member

@k-ye k-ye commented Feb 11, 2020

Issue #396

Sorry about the large size; I'm mostly moving MetalRuntime's code to the cpp file.

  • Use Pimpl for MetalRuntime so that we don't have to check TC_PLATFORM_OSX
    wherever it's used (see the sketch after this list).
  • Correctly detect the Metal API availability at runtime
  • If the Metal API is not available, Program automatically falls back to x86_64
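
(For reference, a minimal sketch of the Pimpl shape this describes; the method and member names are illustrative, not the exact PR code:)

// metal_runtime.h: no Apple/Metal headers are needed here, so every
// platform can include this without TC_PLATFORM_OSX guards.
#include <memory>
#include <string>

class MetalRuntime {
 public:
  MetalRuntime();
  ~MetalRuntime();  // defined in the .cpp, where Impl is a complete type
  void launch_kernel(const std::string &name);  // illustrative API

 private:
  struct Impl;                  // defined only inside metal_runtime.cpp
  std::unique_ptr<Impl> impl_;  // non-OSX builds can hold a stub Impl
};

All Metal-specific state lives inside Impl in the .cpp file, so the platform check happens in exactly one translation unit.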

@k-ye
Member Author

k-ye commented Feb 11, 2020

Tests passed but failed at build.py try_upload ..? https://travis-ci.com/taichi-dev/taichi/builds/148330581

======================= 201 passed in 369.10s (0:06:09) ========================
Running C++ tests...
===============================================================================
All tests passed (25 assertions in 3 test cases)
The command "ti test_verbose && cd python && $PYTHON build.py try_upload" exited with 1.
Done. Your build exited with 1.

@k-ye
Member Author

k-ye commented Feb 11, 2020

I tested mpm99.py and pbf2d.py on my old 2015 MBP13 (Intel Core i5 + Intel Iris Graphics 6100, 1536 MB), and they ran just fine with 512 threads per block.

What's both funny and ironic is that I got a huge performance gain on this old laptop. The FPS is much better than what I got with my 2019 MBP16... Benchmark numbers for mpm99.py:

2015 MBP13:

  • cpu: ~18fps
  • metal: 28~30fps

2019 MBP16:

  • cpu: ~15fps
  • metal: ~16fps

I don't believe the numbers on the new MBP are anywhere near its capability; otherwise that was a real waste of my money 😠

@yuanming-hu
Member

Maybe you can enable the profiler to see exactly which kernel leads to the unsatisfactory performance boost :-)

@k-ye
Member Author

k-ye commented Feb 12, 2020

Maybe you can enable the profiler to see exactly which kernel leads to the unsatisfactory performance boost :-)

Yep, I'll give it a shot later today or so :) Meanwhile, I wonder if the Metal tests would still crash your laptop... If so, could you try a smaller kThreadsPerGroup, e.g. 64, and see if that helps?

constexpr int kThreadsPerGroup = 512;


FYI, my old MBP can only use a maximum of 256 threads for test_linalg.py::test_polar_decomp(). I have another PR to dynamically set this number for each Metal kernel...

@yuanming-hu
Member

After rebooting my old 2013 MBP five times, I realized it crashes at

def test_zero_outer_loop():

(I also changed kThreadsPerGroup to 256, but it's probably caused by a zero outer loop size? Maybe the Metal driver has a bug when the number of blocks is zero?) :-)

@k-ye k-ye force-pushed the mtlcompile branch 2 times, most recently from 4738969 to 4e25f25 on February 13, 2020 01:32
@k-ye
Member Author

k-ye commented Feb 13, 2020

After rebooting my old 2013 MBP five times, I realized it crashes at

def test_zero_outer_loop():

(I also changed kThreadsPerGroup to 256, but it's probably caused by a zero outer loop size? Maybe the Metal driver has a bug when the number of blocks is zero?) :-)

😂 So sorry about that! I've made another fix in this PR to do the following:

  • skip the Metal kernel launch if num_threads == 0
  • dynamically decide the number of threads per group on a per-kernel basis (see the sketch below)
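
A sketch of the fixed launch path, reusing the names that appear in the logging patch quoted later in this thread (launch_if_not_empty, get_max_total_threads_per_threadgroup, dispatch_threadgroups); pipeline_state_ and encoder belong to the enclosing runtime in the real code, so this is simplified, not the verbatim PR code:

#include <algorithm>  // std::min

void launch_if_not_empty(int num_threads) {
  if (num_threads == 0) {
    return;  // fix 1: dispatching zero threadgroups crashed some Metal drivers
  }
  // fix 2: ask the compiled pipeline for its limit instead of assuming a
  // fixed kThreadsPerGroup = 512.
  const int num_threads_per_group =
      get_max_total_threads_per_threadgroup(pipeline_state_.get());
  const int num_groups =
      (num_threads + num_threads_per_group - 1) / num_threads_per_group;  // ceil
  dispatch_threadgroups(encoder.get(), num_groups,
                        std::min(num_threads, num_threads_per_group));
}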

@yuanming-hu
Member

Cool, thanks! Running tests now. Fingers crossed! :-)

@k-ye
Member Author

k-ye commented Feb 13, 2020

FYI I think Travis CI failed at the cpp tests step. On my local machine, if I do

ti test_python
echo $?

I got 0, whereas

ti test_cpp
echo $?

I got 1...

@yuanming-hu
Member

Now all tests passed on my MBP 2013, except this one:

assert x[k, i] == 2**i * k

@yuanming-hu
Member

yuanming-hu commented Feb 13, 2020

FYI I think Travis CI failed at the cpp tests step.

Nice catch... test_cpp seems to always return an empty string. Fixing that... Thanks for pointing this out!

@k-ye
Member Author

k-ye commented Feb 13, 2020

Ah ok, I will take a look at the test after work...

@k-ye
Member Author

k-ye commented Feb 13, 2020

Here are the kernel params for the failed test on my 2015 MBP:

kernel=mtl_k0002_func_c4_0__0 num_threads=16 num_threads_per_group=1024 num_groups=1

I also printed out x.to_numpy():

x=[[0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00]
 [1.000e+00 2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01
  1.280e+02]
 [2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02
  2.560e+02]
 [3.000e+00 6.000e+00 1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02
  3.840e+02]
 [4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02 2.560e+02
  5.120e+02]
 [5.000e+00 1.000e+01 2.000e+01 4.000e+01 8.000e+01 1.600e+02 3.200e+02
  6.400e+02]
 [6.000e+00 1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02 3.840e+02
  7.680e+02]
 [7.000e+00 1.400e+01 2.800e+01 5.600e+01 1.120e+02 2.240e+02 4.480e+02
  8.960e+02]
 [8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02 2.560e+02 5.120e+02
  1.024e+03]
 [9.000e+00 1.800e+01 3.600e+01 7.200e+01 1.440e+02 2.880e+02 5.760e+02
  1.152e+03]
 [1.000e+01 2.000e+01 4.000e+01 8.000e+01 1.600e+02 3.200e+02 6.400e+02
  1.280e+03]
 [1.100e+01 2.200e+01 4.400e+01 8.800e+01 1.760e+02 3.520e+02 7.040e+02
  1.408e+03]
 [1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02 3.840e+02 7.680e+02
  1.536e+03]
 [1.300e+01 2.600e+01 5.200e+01 1.040e+02 2.080e+02 4.160e+02 8.320e+02
  1.664e+03]
 [1.400e+01 2.800e+01 5.600e+01 1.120e+02 2.240e+02 4.480e+02 8.960e+02
  1.792e+03]
 [1.500e+01 3.000e+01 6.000e+01 1.200e+02 2.400e+02 4.800e+02 9.600e+02
  1.920e+03]]

Could you:

  • tell me which specific coord (i, j) failed?
  • share the kernel params on your machine? (But I don't think this is the root cause this time, as the kernel was a trivial one.)

(Or a quick and robust fix: if Taichi detects that the Mac hardware is older than the 2015 model, print "Please buy new hardware" and exit(1) :-p ...)

@k-ye k-ye force-pushed the mtlcompile branch 2 times, most recently from 589d605 to 4d32245 on February 13, 2020 11:47
@yuanming-hu
Member

I minimized the test into

import taichi as ti

ti.init(arch=ti.metal, print_ir=True)

n = 2
m = 4

x = ti.var(ti.i32, shape=(n, m))

@ti.kernel
def func():
  for k in range(2):
    for i in range(m - 1):
      x[k, i + 1] = x[k, i] * 2

x[0, 0] = 1
func()

for i in range(m):
  print(f'i = {i} x[0, i] = {x[0, i]} (should be {2**i})')

which gives

i = 0 x[0, i] = 1 (should be 1)
i = 1 x[0, i] = 2 (should be 2)
i = 2 x[0, i] = 4 (should be 4)
i = 3 x[0, i] = 0 (should be 8)

@yuanming-hu
Member

Added a logging patch here:

    const int num_threads_per_group =
        get_max_total_threads_per_threadgroup(pipeline_state_.get());
    const int num_groups =
        ((num_threads + num_threads_per_group - 1) / num_threads_per_group);
    TC_INFO("num_threads {}, num_threads_per_group {}, num_groups {}", num_threads, num_threads_per_group, num_groups); // logging here..
    dispatch_threadgroups(encoder.get(), num_groups,
                          std::min(num_threads, num_threads_per_group));
    end_encoding(encoder.get());
    profiler_->stop();

This gives me num_threads 2, num_threads_per_group 1024, num_groups 1 for the main kernel:

[Taichi version 0.4.5, cpu only, commit d0a71abd]
[T 02/13/20 12:00:15.920] [memory_pool.cpp:MemoryPool@15] Memory pool created. Default buffer size per allocator = 1024 MB
[T 02/13/20 12:00:15.928] [taichi_llvm_context.cpp:TaichiLLVMContext@57] Creating llvm context for arch: x86_64
[T 02/13/20 12:00:16.030] [/Users/yuanming/repos/taichi/python/taichi/lang/impl.py:materialize@116] Materializing layout...
[T 02/13/20 12:00:16.086] [taichi_llvm_context.cpp:compile_runtime_bitcode@110] Compiling runtime module bitcode...
[T 02/13/20 12:00:17.123] [struct_llvm.cpp:operator()@287] Allocating data structure of size 32 B
[T 02/13/20 12:00:17.169] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 02/13/20 12:00:17.169] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.39e-05 s
[runtime.cpp: Initializing runtime with 3 snode(s)...]
[runtime.cpp: Runtime initialized.]
[D 02/13/20 12:00:17.186] [memory_pool.cpp:daemon@103] Processing memory alloc request 0
[D 02/13/20 12:00:17.186] [memory_pool.cpp:daemon@110]   Allocating memory 8232 B (alignment 4096B)
[D 02/13/20 12:00:17.186] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253cc000
[D 02/13/20 12:00:17.187] [memory_pool.cpp:daemon@103] Processing memory alloc request 1
[D 02/13/20 12:00:17.187] [memory_pool.cpp:daemon@110]   Allocating memory 8232 B (alignment 4096B)
[D 02/13/20 12:00:17.187] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253cf000
[D 02/13/20 12:00:17.188] [memory_pool.cpp:daemon@103] Processing memory alloc request 2
[D 02/13/20 12:00:17.188] [memory_pool.cpp:daemon@110]   Allocating memory 8232 B (alignment 4096B)
[D 02/13/20 12:00:17.188] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253d2000
[D 02/13/20 12:00:17.190] [memory_pool.cpp:daemon@103] Processing memory alloc request 3
[D 02/13/20 12:00:17.190] [memory_pool.cpp:daemon@110]   Allocating memory 3145728 B (alignment 4096B)
[D 02/13/20 12:00:17.190] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253d5000
[I 02/13/20 12:00:17.221] [program.cpp:materialize_layout@121] Metal root buffer size: 32 B
[I 02/13/20 12:00:17.245] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0001_snode_writer_2>
[I 02/13/20 12:00:17.245] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
[T 02/13/20 12:00:17.248] [/Users/yuanming/repos/taichi/python/taichi/lang/kernel.py:materialize@180] Compiling kernel func_c4_0_...
[D 02/13/20 12:00:17.261] [/Users/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@350] Launching Taichi kernel func...
[I 02/13/20 12:00:17.263] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0002_func_c4_0_>
[I 02/13/20 12:00:17.263] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 2, num_threads_per_group 1024, num_groups 1
[I 02/13/20 12:00:17.265] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0003_snode_reader_2>
[I 02/13/20 12:00:17.265] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 0 x[0, i] = 1 (should be 1)
[I 02/13/20 12:00:17.266] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 1 x[0, i] = 2 (should be 2)
[I 02/13/20 12:00:17.266] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 2 x[0, i] = 4 (should be 4)
[I 02/13/20 12:00:17.266] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 3 x[0, i] = 0 (should be 8)

@yuanming-hu
Member

I constructed an even more magical test:

ti.init(arch=ti.metal)

n = 2
m = 8

x = ti.var(ti.i32, shape=(n, m))

@ti.kernel
def func():
  for k in range(n):
    for i in range(m - 1):
      x[k, i + 1] = x[k, i] + 1

func()

for i in range(m):
  print(f'i = {i} x[0, i] = {x[0, i]} (should be {i})')

Yielding

[I 02/13/20 12:18:20.713] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0001_func_c4_0_>
[I 02/13/20 12:18:20.713] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 2, num_threads_per_group 1024, num_groups 1
[I 02/13/20 12:18:20.715] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0002_snode_reader_2>
[I 02/13/20 12:18:20.715] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 0 x[0, i] = 0 (should be 0)
[I 02/13/20 12:18:20.716] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 1 x[0, i] = 1 (should be 1)
[I 02/13/20 12:18:20.717] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 2 x[0, i] = 2 (should be 2)
[I 02/13/20 12:18:20.726] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 3 x[0, i] = 1 (should be 3)
[I 02/13/20 12:18:20.726] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 4 x[0, i] = 1 (should be 4)
[I 02/13/20 12:18:20.727] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 5 x[0, i] = 1 (should be 5)
[I 02/13/20 12:18:20.727] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 6 x[0, i] = 2 (should be 6)
[I 02/13/20 12:18:20.727] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 7 x[0, i] = 1 (should be 7)

It seems to me that some writes to x are not visible to the next read...

@yuanming-hu
Member

(Or a quick and robust fix: if Taichi detects that the Mac hardware is older than the 2015 model, print "Please buy new hardware" and exit(1) :-p ...)

Let's seriously take this option: if the Mac is pre-2015, just warn the user that the Metal backend may have undefined behavior (which is not that serious in most cases). And in the next few years, nobody will be using an MBP 2013...

@k-ye
Member Author

k-ye commented Feb 13, 2020

It seems to me that some writes to x are not visible to the next read...

That's a very enlightening point. I was about to say that we should add some memory fences, but then realized that Metal is breaking the memory order on a single thread... I assume no memory-order sync is generated for CUDA here?

@yuanming-hu
Member

I searched for Metal memory ordering but found no results. CUDA has weak memory ordering, which means that, unless fences are added, memory reads/writes have no ordering guarantee across threads. Within a thread, memory accesses are still ordered. Metal seems to be doing something crazy here :-)
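
(For reference, here is what "no ordering guarantee across threads unless fences are added" looks like in portable C++ terms; this is an analogy to the CUDA semantics above, not actual Metal or CUDA code:)

#include <atomic>
#include <thread>

int data = 0;
std::atomic<int> flag{0};

void writer() {
  data = 42;                                 // plain write
  flag.store(1, std::memory_order_release);  // publishes the write above
}

void reader() {
  while (flag.load(std::memory_order_acquire) != 1) {
  }
  // The acquire pairs with the release, so data == 42 is now guaranteed.
  // With memory_order_relaxed, flag could read 1 while data is still 0.
}

int main() {
  std::thread t1(writer), t2(reader);
  t1.join();
  t2.join();
}

Within a single thread, though, program order is respected in all of these models, which is what makes the result above so strange.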

Anyway, if it works on MBP 2015, it's not too bad that this special case fails on an especially old GPU...

I'm good with this PR. Are you ready to merge this in? :-)

@k-ye
Member Author

k-ye commented Feb 14, 2020

Metal seems to be doing something crazy here :-)

Yeah, I feel like Apple is just too lazy to fix bugs on old machines... Weak memory ordering on a single thread is a bit too crazy... XD

Are you ready to merge this in? :-)

Ah great, let me do a rebase first :)

k-ye added 2 commits February 14, 2020 09:08
* Use Pimpl for MetalRuntime so that we don't have to check TC_PLATFORM_OSX
  wherever it's used.
* Correctly detect the Metal API availability at runtime
* If the Metal API is not available, Program automatically falls back to x86_64
* Dynamically determine the number of threads per group for a given compute kernel pipeline
* Skip the kernel launch if its num_threads is zero
* Check CUDA macro to fix the build
@k-ye
Member Author

k-ye commented Feb 14, 2020

Rebased! I'm trying to port the failed test's Metal kernel over to [cpp-host-metal]. Once that's done, could you run it again and verify that it still fails? If so, I can post on Stack Overflow and ask whether anyone is aware of this issue...

@yuanming-hu
Member

Sure, happy to test this for you!!

@yuanming-hu yuanming-hu merged commit 4ecbf3c into taichi-dev:master Feb 14, 2020
@k-ye k-ye deleted the mtlcompile branch February 14, 2020 00:14