-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Major Metal compiler/runtime refactors #457
Conversation
Test passed but fialed at
|
I tested What's both funny and ironic is that, I got a huge performance gain on this old laptop. The FPS is much much better than what I got with my 2019 MBP16.... Benchmark numbers for 2015 MBP13:
2019 MBP16:
I don't believe the numbers on the new MBP is anywhere near its capability, otherwise that's a good waste of my money 😠 |
Maybe you can enable the profiler to see exactly which kernel leads to unsatisfactory performance boost :-) |
Yep, I'll give it a shot later today or so :) Meanwhile, I wonder if the metal tests would still crash your laptop... If so, could you try a small number of
FYI my old MBP can only use a maximum number of |
After rebooting my old 2013 MBP five times, I realized it crashes at taichi/tests/python/test_loops.py Line 88 in be66730
(I also changed |
4738969
to
4e25f25
Compare
😂 so sorry about that! I've made another fix in this PR to do the followings:
|
Cool, thanks! Running tests now. Fingers crossed! :-) |
FYI I think Travis CI failed at the cpp tests step. On my local machine, if i do
I got
I got |
Now all tests passed on my MBP 2013, except this one: taichi/tests/python/test_loop_grad.py Line 32 in 5d63cad
|
Nice catch... |
Ah ok, I will take a look at the test after work... |
Here's the kernel params for the failed test on my 2015 MBP:
I also printed out x=[[0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
0.000e+00]
[1.000e+00 2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01
1.280e+02]
[2.000e+00 4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02
2.560e+02]
[3.000e+00 6.000e+00 1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02
3.840e+02]
[4.000e+00 8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02 2.560e+02
5.120e+02]
[5.000e+00 1.000e+01 2.000e+01 4.000e+01 8.000e+01 1.600e+02 3.200e+02
6.400e+02]
[6.000e+00 1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02 3.840e+02
7.680e+02]
[7.000e+00 1.400e+01 2.800e+01 5.600e+01 1.120e+02 2.240e+02 4.480e+02
8.960e+02]
[8.000e+00 1.600e+01 3.200e+01 6.400e+01 1.280e+02 2.560e+02 5.120e+02
1.024e+03]
[9.000e+00 1.800e+01 3.600e+01 7.200e+01 1.440e+02 2.880e+02 5.760e+02
1.152e+03]
[1.000e+01 2.000e+01 4.000e+01 8.000e+01 1.600e+02 3.200e+02 6.400e+02
1.280e+03]
[1.100e+01 2.200e+01 4.400e+01 8.800e+01 1.760e+02 3.520e+02 7.040e+02
1.408e+03]
[1.200e+01 2.400e+01 4.800e+01 9.600e+01 1.920e+02 3.840e+02 7.680e+02
1.536e+03]
[1.300e+01 2.600e+01 5.200e+01 1.040e+02 2.080e+02 4.160e+02 8.320e+02
1.664e+03]
[1.400e+01 2.800e+01 5.600e+01 1.120e+02 2.240e+02 4.480e+02 8.960e+02
1.792e+03]
[1.500e+01 3.000e+01 6.000e+01 1.200e+02 2.400e+02 4.800e+02 9.600e+02
1.920e+03]] Could you:
(Or a quick and robust fix: if Taichi detected the mac hardware is older than the 2015 model, print out "Please buy a new hardware" and |
589d605
to
4d32245
Compare
I minimized the test into
which gives
|
Added a logging patch here:
Gives me
|
Constructed an even more magical test:
Yielding
It seems to me that some writes to |
Let's seriously take this option: if the mac is pre-2015, just warn the user that the Metal backend may have undefined behavior (which is not that serious in most cases). And in the next few years, nobody will be using MBP 2013... |
That's a very enlightening point. I was about to say that we should add some memory fence, but then realized that Metal was breaking the memory order on a single thread... I don't assume any memory order sync is generated for CUDA here? |
I searched for Anyway, if it works on MBP 2015, it's not too bad that this special case fails on an especially old GPU... I'm good with this PR. Are you ready to merge this in? :-) |
Yeah, i feel like Apple is just too lazy fixing bugs on old machines... Weak memory order on a single thread is a bit too crazy... XD
Ah great, let me do a rebase first :) |
* Use Pimpl for MetalRuntime so that we don't have to check TC_PLATFORM_OSX whereever it's been used. * Correctly detects the Metal API availability at runtime * If Metal API is not available, Program will automatically falls back to x86_64
* Dynamically determine the number of threads per group for a given compute kernel pipeline * Skip the kernel launch if its num_thread is zero * Check CUDA macro to fix the build
Rebased! I'm trying to port the failed test's Metal kernel over to [cpp-host-metal]. Once that's done, could you run it again and verify it still fails? If so, I can post on stackoverflow and ask if anyone is aware of this issue... |
Sure, happy to test this for you!! |
Issue #396
Sorry about the large size, I'm mostly moving
MetalRuntime
's code tocpp
file.whereever it's been used.