
Major Metal compiler/runtime refactors #457

Merged
yuanming-hu merged 2 commits into taichi-dev:master from k-ye:mtlcompile on Feb 14, 2020

Conversation

k-ye
Member

@k-ye k-ye commented Feb 11, 2020

Issue #396

Sorry about the large size; I'm mostly moving MetalRuntime's code to the cpp file.

  • Use Pimpl for MetalRuntime so that we don't have to check TC_PLATFORM_OSX
    wherever it's used (see the sketch after this list).
  • Correctly detect the Metal API availability at runtime
  • If the Metal API is not available, Program automatically falls back to x86_64
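
(For reference, a minimal sketch of the Pimpl shape this describes; the method and member names are illustrative, not the exact PR code:)

// metal_runtime.h: no Apple/Metal headers are needed here, so every
// platform can include this without TC_PLATFORM_OSX guards.
#include <memory>
#include <string>

class MetalRuntime {
 public:
  MetalRuntime();
  ~MetalRuntime();  // defined in the .cpp, where Impl is a complete type
  void launch_kernel(const std::string &name);  // illustrative API

 private:
  struct Impl;                  // defined only inside metal_runtime.cpp
  std::unique_ptr<Impl> impl_;  // non-OSX builds can hold a stub Impl
};

All Metal-specific state lives inside Impl in the .cpp file, so the platform check happens in exactly one translation unit.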

@k-ye
Member Author

k-ye commented Feb 11, 2020

Tests passed but failed at build.py try_upload ..? https://travis-ci.com/taichi-dev/taichi/builds/148330581

======================= 201 passed in 369.10s (0:06:09) ========================
Running C++ tests...
===============================================================================
All tests passed (25 assertions in 3 test cases)
The command "ti test_verbose && cd python && $PYTHON build.py try_upload" exited with 1.
Done. Your build exited with 1.

@k-ye
Member Author

k-ye commented Feb 11, 2020

I tested mpm99.py and pbf2d.py on my old 2015 MBP13 (Intel Core i5 + Intel Iris Graphics 6100, 1536 MB), and they ran just fine with 512 threads per block.

What's both funny and ironic is that I got a huge performance gain on this old laptop. The FPS is much better than what I got with my 2019 MBP16... Benchmark numbers for mpm99.py:

2015 MBP13:

  • cpu: ~18fps
  • metal: 28~30fps

2019 MBP16:

  • cpu: ~15fps
  • metal: ~16fps

I don't believe the numbers on the new MBP are anywhere near its capability; otherwise that was a real waste of my money 😠

@yuanming-hu
Member

Maybe you can enable the profiler to see exactly which kernel leads to the unsatisfactory performance boost :-)

@k-ye
Member Author

k-ye commented Feb 12, 2020

Maybe you can enable the profiler to see exactly which kernel leads to the unsatisfactory performance boost :-)

Yep, I'll give it a shot later today or so :) Meanwhile, I wonder if the Metal tests would still crash your laptop... If so, could you try a smaller kThreadsPerGroup, e.g. 64, and see if that helps?

constexpr int kThreadsPerGroup = 512;


FYI, my old MBP can only use a maximum of 256 threads for test_linalg.py::test_polar_decomp(). I have another PR to dynamically set this number for each Metal kernel...

@yuanming-hu
Member

After rebooting my old 2013 MBP five times, I realized it crashes at

def test_zero_outer_loop():

(I also changed kThreadsPerGroup to 256, but it's probably caused by a zero outer loop size? Maybe the Metal driver has a bug when the number of blocks is zero?) :-)

@k-ye k-ye force-pushed the mtlcompile branch 2 times, most recently from 4738969 to 4e25f25 on February 13, 2020 01:32
@k-ye
Member Author

k-ye commented Feb 13, 2020

After rebooting my old 2013 MBP five times, I realized it crashes at

def test_zero_outer_loop():

(I also changed kThreadsPerGroup to 256, but it's probably caused by a zero outer loop size? Maybe the Metal driver has a bug when the number of blocks is zero?) :-)

😂 So sorry about that! I've made another fix in this PR to do the following:

  • skip the Metal kernel launch if num_threads == 0
  • dynamically decide the number of threads per group on a per-kernel basis (see the sketch below)
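
A sketch of the fixed launch path, reusing the names that appear in the logging patch quoted later in this thread (launch_if_not_empty, get_max_total_threads_per_threadgroup, dispatch_threadgroups); pipeline_state_ and encoder belong to the enclosing runtime in the real code, so this is simplified, not the verbatim PR code:

#include <algorithm>  // std::min

void launch_if_not_empty(int num_threads) {
  if (num_threads == 0) {
    return;  // fix 1: dispatching zero threadgroups crashed some Metal drivers
  }
  // fix 2: ask the compiled pipeline for its limit instead of assuming a
  // fixed kThreadsPerGroup = 512.
  const int num_threads_per_group =
      get_max_total_threads_per_threadgroup(pipeline_state_.get());
  const int num_groups =
      (num_threads + num_threads_per_group - 1) / num_threads_per_group;  // ceil
  dispatch_threadgroups(encoder.get(), num_groups,
                        std::min(num_threads, num_threads_per_group));
}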

@yuanming-hu
Member

Cool, thanks! Running tests now. Fingers crossed! :-)

@k-ye
Member Author

k-ye commented Feb 13, 2020

FYI I think Travis CI failed at the cpp tests step. On my local machine, if I do

ti test_python
echo $?

I got 0, whereas

ti test_cpp
echo $?

I got 1...

@yuanming-hu
Member

Now all tests passed on my MBP 2013, except this one:

assert x[k, i] == 2**i * k

@yuanming-hu
Member

yuanming-hu commented Feb 13, 2020

FYI I think Travis CI failed at the cpp tests step.

Nice catch... test_cpp seems to always return an empty string. Fixing that... Thanks for pointing this out!

@k-ye
Member Author

k-ye commented Feb 13, 2020

Ah ok, I will take a look at the test after work...

@k-ye
Member Author

k-ye commented Feb 13, 2020

Here are the kernel params for the failed test on my 2015 MBP:

kernel=mtl_k0002_func_c4_0__0 num_threads=16 num_threads_per_group=1024 num_groups=1

I also printed out x.to_numpy():

x=[[0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00]
 [1.000e+00 2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01
  1.280e+02]
 [2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02
  2.560e+02]
 [3.000e+00 6.000e+00 1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02
  3.840e+02]
 [4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02 2.560e+02
  5.120e+02]
 [5.000e+00 1.000e+01 2.000e+01 4.000e+01 8.000e+01 1.600e+02 3.200e+02
  6.400e+02]
 [6.000e+00 1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02 3.840e+02
  7.680e+02]
 [7.000e+00 1.400e+01 2.800e+01 5.600e+01 1.120e+02 2.240e+02 4.480e+02
  8.960e+02]
 [8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02 2.560e+02 5.120e+02
  1.024e+03]
 [9.000e+00 1.800e+01 3.600e+01 7.200e+01 1.440e+02 2.880e+02 5.760e+02
  1.152e+03]
 [1.000e+01 2.000e+01 4.000e+01 8.000e+01 1.600e+02 3.200e+02 6.400e+02
  1.280e+03]
 [1.100e+01 2.200e+01 4.400e+01 8.800e+01 1.760e+02 3.520e+02 7.040e+02
  1.408e+03]
 [1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02 3.840e+02 7.680e+02
  1.536e+03]
 [1.300e+01 2.600e+01 5.200e+01 1.040e+02 2.080e+02 4.160e+02 8.320e+02
  1.664e+03]
 [1.400e+01 2.800e+01 5.600e+01 1.120e+02 2.240e+02 4.480e+02 8.960e+02
  1.792e+03]
 [1.500e+01 3.000e+01 6.000e+01 1.200e+02 2.400e+02 4.800e+02 9.600e+02
  1.920e+03]]

Could you:

  • tell me which specific coord (i, j) failed?
  • share the kernel params on your machine? (But I don't think this is the root cause this time, as the kernel was a trivial one.)

(Or a quick and robust fix: if Taichi detects that the Mac hardware is older than the 2015 model, print "Please buy new hardware" and exit(1) :-p ...)

@k-ye k-ye force-pushed the mtlcompile branch 2 times, most recently from 589d605 to 4d32245 on February 13, 2020 11:47
@yuanming-hu
Member

I minimized the test into

import taichi as ti

ti.init(arch=ti.metal, print_ir=True)

n = 2
m = 4

x = ti.var(ti.i32, shape=(n, m))

@ti.kernel
def func():
  for k in range(2):
    for i in range(m - 1):
      x[k, i + 1] = x[k, i] * 2

x[0, 0] = 1
func()

for i in range(m):
  print(f'i = {i} x[0, i] = {x[0, i]} (should be {2**i})')

which gives

i = 0 x[0, i] = 1 (should be 1)
i = 1 x[0, i] = 2 (should be 2)
i = 2 x[0, i] = 4 (should be 4)
i = 3 x[0, i] = 0 (should be 8)

@yuanming-hu
Member

Added a logging patch here:

    const int num_threads_per_group =
        get_max_total_threads_per_threadgroup(pipeline_state_.get());
    const int num_groups =
        ((num_threads + num_threads_per_group - 1) / num_threads_per_group);
    TC_INFO("num_threads {}, num_threads_per_group {}, num_groups {}", num_threads, num_threads_per_group, num_groups); // logging here..
    dispatch_threadgroups(encoder.get(), num_groups,
                          std::min(num_threads, num_threads_per_group));
    end_encoding(encoder.get());
    profiler_->stop();

This gives me num_threads 2, num_threads_per_group 1024, num_groups 1 for the main kernel:

[Taichi version 0.4.5, cpu only, commit d0a71abd]
[T 02/13/20 12:00:15.920] [memory_pool.cpp:MemoryPool@15] Memory pool created. Default buffer size per allocator = 1024 MB
[T 02/13/20 12:00:15.928] [taichi_llvm_context.cpp:TaichiLLVMContext@57] Creating llvm context for arch: x86_64
[T 02/13/20 12:00:16.030] [/Users/yuanming/repos/taichi/python/taichi/lang/impl.py:materialize@116] Materializing layout...
[T 02/13/20 12:00:16.086] [taichi_llvm_context.cpp:compile_runtime_bitcode@110] Compiling runtime module bitcode...
[T 02/13/20 12:00:17.123] [struct_llvm.cpp:operator()@287] Allocating data structure of size 32 B
[T 02/13/20 12:00:17.169] [unified_allocator.cpp:UnifiedAllocator@53] Allocating virtual address space of size 1024 MB
[T 02/13/20 12:00:17.169] [unified_allocator.cpp:UnifiedAllocator@62] Memory allocated. Allocation time = 3.39e-05 s
[runtime.cpp: Initializing runtime with 3 snode(s)...]
[runtime.cpp: Runtime initialized.]
[D 02/13/20 12:00:17.186] [memory_pool.cpp:daemon@103] Processing memory alloc request 0
[D 02/13/20 12:00:17.186] [memory_pool.cpp:daemon@110]   Allocating memory 8232 B (alignment 4096B)
[D 02/13/20 12:00:17.186] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253cc000
[D 02/13/20 12:00:17.187] [memory_pool.cpp:daemon@103] Processing memory alloc request 1
[D 02/13/20 12:00:17.187] [memory_pool.cpp:daemon@110]   Allocating memory 8232 B (alignment 4096B)
[D 02/13/20 12:00:17.187] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253cf000
[D 02/13/20 12:00:17.188] [memory_pool.cpp:daemon@103] Processing memory alloc request 2
[D 02/13/20 12:00:17.188] [memory_pool.cpp:daemon@110]   Allocating memory 8232 B (alignment 4096B)
[D 02/13/20 12:00:17.188] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253d2000
[D 02/13/20 12:00:17.190] [memory_pool.cpp:daemon@103] Processing memory alloc request 3
[D 02/13/20 12:00:17.190] [memory_pool.cpp:daemon@110]   Allocating memory 3145728 B (alignment 4096B)
[D 02/13/20 12:00:17.190] [memory_pool.cpp:daemon@112]   Allocated. Ptr = 0x1253d5000
[I 02/13/20 12:00:17.221] [program.cpp:materialize_layout@121] Metal root buffer size: 32 B
[I 02/13/20 12:00:17.245] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0001_snode_writer_2>
[I 02/13/20 12:00:17.245] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
[T 02/13/20 12:00:17.248] [/Users/yuanming/repos/taichi/python/taichi/lang/kernel.py:materialize@180] Compiling kernel func_c4_0_...
[D 02/13/20 12:00:17.261] [/Users/yuanming/repos/taichi/python/taichi/lang/kernel.py:__call__@350] Launching Taichi kernel func...
[I 02/13/20 12:00:17.263] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0002_func_c4_0_>
[I 02/13/20 12:00:17.263] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 2, num_threads_per_group 1024, num_groups 1
[I 02/13/20 12:00:17.265] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0003_snode_reader_2>
[I 02/13/20 12:00:17.265] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 0 x[0, i] = 1 (should be 1)
[I 02/13/20 12:00:17.266] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 1 x[0, i] = 2 (should be 2)
[I 02/13/20 12:00:17.266] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 2 x[0, i] = 4 (should be 4)
[I 02/13/20 12:00:17.266] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 3 x[0, i] = 0 (should be 8)

@yuanming-hu
Member

I constructed an even more magical test:

ti.init(arch=ti.metal)

n = 2
m = 8

x = ti.var(ti.i32, shape=(n, m))

@ti.kernel
def func():
  for k in range(n):
    for i in range(m - 1):
      x[k, i + 1] = x[k, i] + 1

func()

for i in range(m):
  print(f'i = {i} x[0, i] = {x[0, i]} (should be {i})')

Yielding

[I 02/13/20 12:18:20.713] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0001_func_c4_0_>
[I 02/13/20 12:18:20.713] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 2, num_threads_per_group 1024, num_groups 1
[I 02/13/20 12:18:20.715] [metal_runtime.cpp:register_taichi_kernel@286] Registered Taichi kernel <mtl_k0002_snode_reader_2>
[I 02/13/20 12:18:20.715] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 0 x[0, i] = 0 (should be 0)
[I 02/13/20 12:18:20.716] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 1 x[0, i] = 1 (should be 1)
[I 02/13/20 12:18:20.717] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 2 x[0, i] = 2 (should be 2)
[I 02/13/20 12:18:20.726] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 3 x[0, i] = 1 (should be 3)
[I 02/13/20 12:18:20.726] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 4 x[0, i] = 1 (should be 4)
[I 02/13/20 12:18:20.727] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 5 x[0, i] = 1 (should be 5)
[I 02/13/20 12:18:20.727] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 6 x[0, i] = 2 (should be 6)
[I 02/13/20 12:18:20.727] [metal_runtime.cpp:launch_if_not_empty@101] num_threads 1, num_threads_per_group 1024, num_groups 1
i = 7 x[0, i] = 1 (should be 7)

It seems to me that some writes to x are not visible to the next read...

@yuanming-hu
Member

(Or a quick and robust fix: if Taichi detects that the Mac hardware is older than the 2015 model, print "Please buy new hardware" and exit(1) :-p ...)

Let's seriously take this option: if the Mac is pre-2015, just warn the user that the Metal backend may have undefined behavior (which is not that serious in most cases). And in the next few years, nobody will be using an MBP 2013...

@k-ye
Member Author

k-ye commented Feb 13, 2020

It seems to me that some writes to x are not visible to the next read...

That's a very enlightening point. I was about to say that we should add some memory fences, but then realized that Metal is breaking the memory order on a single thread... I assume no memory-order sync is generated for CUDA here?

@yuanming-hu
Member

I searched for Metal memory ordering but found no results. CUDA has weak memory ordering, which means that, unless fences are added, memory reads/writes have no ordering guarantee across threads. Within a thread, memory accesses are still ordered. Metal seems to be doing something crazy here :-)
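
(For reference, here is what "no ordering guarantee across threads unless fences are added" looks like in portable C++ terms; this is an analogy to the CUDA semantics above, not actual Metal or CUDA code:)

#include <atomic>
#include <thread>

int data = 0;
std::atomic<int> flag{0};

void writer() {
  data = 42;                                 // plain write
  flag.store(1, std::memory_order_release);  // publishes the write above
}

void reader() {
  while (flag.load(std::memory_order_acquire) != 1) {
  }
  // The acquire pairs with the release, so data == 42 is now guaranteed.
  // With memory_order_relaxed, flag could read 1 while data is still 0.
}

int main() {
  std::thread t1(writer), t2(reader);
  t1.join();
  t2.join();
}

Within a single thread, though, program order is respected in all of these models, which is what makes the result above so strange.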

Anyway, if it works on MBP 2015, it's not too bad that this special case fails on an especially old GPU...

I'm good with this PR. Are you ready to merge this in? :-)

@k-ye
Member Author

k-ye commented Feb 14, 2020

Metal seems to be doing something crazy here :-)

Yeah, I feel like Apple is just too lazy to fix bugs on old machines... Weak memory ordering on a single thread is a bit too crazy... XD

Are you ready to merge this in? :-)

Ah great, let me do a rebase first :)

k-ye added 2 commits February 14, 2020 09:08
* Use Pimpl for MetalRuntime so that we don't have to check TC_PLATFORM_OSX
  wherever it's used.
* Correctly detect the Metal API availability at runtime
* If the Metal API is not available, Program automatically falls back to x86_64
* Dynamically determine the number of threads per group for a given compute kernel pipeline
* Skip the kernel launch if its num_threads is zero
* Check CUDA macro to fix the build
@k-ye
Member Author

k-ye commented Feb 14, 2020

Rebased! I'm trying to port the failed test's Metal kernel over to [cpp-host-metal]. Once that's done, could you run it again and verify that it still fails? If so, I can post on Stack Overflow and ask whether anyone is aware of this issue...

@yuanming-hu
Member

Sure, happy to test this for you!!

@yuanming-hu yuanming-hu merged commit 4ecbf3c into taichi-dev:master Feb 14, 2020
@k-ye k-ye deleted the mtlcompile branch February 14, 2020 00:14