initial gfx1036 and gfx1103 support #111
A fresh build of this is currently broken, ROCm-CompilerSupport complains about missing |
Hmm, something seems to be going wrong initializing repos. A Edit: ah, it's probably this: 27a9480 . It's a shame there's no easy way to see what requires reinitializing source. Complete reinit did not fix the |
Building rocBLAS Tensile fails for the gfx1036, same issue I mentioned in #103 (comment). At least this confirms my own initial attempt at adding support wasn't any worse. :P The fix is adding |
I'll just leave this potentially useful oneliner here:
To be executed in An even neater idea would be to write some kind of auto-patcher that, given the gfx, adds the necessary changes, since it's basically the same things over and over, save for the core definitions of HW capabilities that we do need to supply. Then again, this may be one of those things that ultimately takes more time to develop than it actually saves. |
Interesting stuff so far. Building for the gfx1036 is a disastrous performance regression compared to using As an aside, |
Sorry for responding a little late.
|
I will test all the changes with a fresh build to see if any problems remain. My perf test is fairly simple so it should be easy to repro. I'm using llama.cpp because it's particularly easy to compile standalone; all you need is hipBLAS and
The There's probably even simpler tests that don't require even a model by just poking BLAS directly with matrix multiplications (llama actually includes a matrix multiplication benchmark, but it only tests CPU perf). |
Thanks for confirming, I will try to run llama.cpp. I did some research yesterday on existing rocBLAS tests and found these two in addition to the tests in rocBLAS itself. I launched two clean builds last night, and the one for gfx1102 and gfx1103 has just finished on Fedora 40. |
I may have been able to reproduce and fix the rocBLAS slowness that you are seeing when not using the HSA override, using my gfx1103. I did the following steps to test it:
|
That's promising. I wonder if any specific gains could be made by specializing things for the cache size/batch size/CU occupancy appropriate for the iGPU rather than the general arch (mind you I haven't looked into these files so I might be talking out of my behind here). In other news, my build is finally done -- if I never have to see aotriton compile again, I'll be a happy man -- and I can confirm it works from scratch now without further tweaking (gfx1030, gfx1036, gfx1102 targets, for completeness). Side note in case you didn't already know: Linux kernel 6.10 (which went final a few days ago) has tweaks in amdkfd specifically to improve the memory allocation story for iGPUs. Instead of using the locked memory, it will go straight to GTT, making it much easier to process large data without changing software or reserving large amounts of memory through BIOS settings. I have been running 6.10 since before I knew this was a thing, which explains why |
My top dislike has traditionally been the rocFFT database generation phase, but I agree that the aotriton build is also bad :-) I would like to have a benchmark where we could run the same operations on the CPU, GPU and iGPU. Good benchmarks would then also be nice to run with both the older 6.8/6.9 kernels and with 6.10. Those changes look very interesting, and I want to build and boot with that kernel to test as well once the APU acceleration is in place. |
- add initial rocBLAS logic files for rembrandt (gfx1035), raphael (gfx1036) and phoenix (gfx1103) iGPUs.
- when testing with the https://github.com/LeiWang1999/rocblas-benchmark by using the std::make_tuple(8192, 8192, 8192, false, false, enable_tune), the speedup was about 4-5x.

- gfx1035 without logic files
Device 0: AMD Radeon Graphics
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,912.287,814.502,854.257,865.103

- gfx1035 with logic files
Device 0: AMD Radeon Graphics
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,652.499,834.796,237.42,189.945

- gfx1103 without logic files
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,916.684,820.721,823.48,1018.46

- gfx1103 with logic files
ROCR_VISIBLE_DEVICES="1" ./rocblas_benchmark
Device 0: AMD Radeon 780M
m,n,k,a_t,b_t,enable_tune,fp32 time (msec),fp16-f32 time (msec), f16-f16 time (msec), int8-int32 time (msec)
8192,8192,8192,n,n,0,1346.02,634.836,193.613,119.29

Signed-off-by: Mika Laitio <lamikr@gmail.com>
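For anyone who wants the per-precision ratios behind the "about 4-5x" figure, the timings in the commit message above can be turned into speedups with a few lines of Python. This is only a sketch and not part of the PR: the names `TIMINGS_MS`, `speedups` and `tera_ops_per_sec` are made up for illustration, the numbers are copied from the commit message, and the throughput figure assumes the conventional 2*m*n*k operation count for a GEMM.

```python
# Timings (msec) copied from the commit message for m = n = k = 8192.
# Column order: fp32, fp16-f32, f16-f16, int8-int32.
COLUMNS = ["fp32", "fp16-f32", "f16-f16", "int8-int32"]

# Hypothetical container name; the values are the ones posted in the PR.
TIMINGS_MS = {
    "gfx1035": {
        "without": [912.287, 814.502, 854.257, 865.103],
        "with": [652.499, 834.796, 237.42, 189.945],
    },
    "gfx1103": {
        "without": [916.684, 820.721, 823.48, 1018.46],
        "with": [1346.02, 634.836, 193.613, 119.29],
    },
}

M = N = K = 8192
OPS = 2 * M * N * K  # assumption: one multiply and one add per inner-product term


def speedups(target):
    """Return {column: time_without / time_with} for one GPU target."""
    t = TIMINGS_MS[target]
    return {c: wo / wi for c, wo, wi in zip(COLUMNS, t["without"], t["with"])}


def tera_ops_per_sec(time_ms):
    """Effective throughput of one 8192^3 GEMM, in tera-operations per second."""
    return OPS / (time_ms / 1000.0) / 1e12


if __name__ == "__main__":
    for target in TIMINGS_MS:
        for column, ratio in speedups(target).items():
            print(f"{target} {column}: {ratio:.2f}x")
```

A ratio above 1 means the run with logic files was faster. With these numbers the large gains are in the f16-f16 and int8-int32 columns, while the fp32 ratio on gfx1103 comes out below 1, i.e. slower with the logic files in this single run.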
I will merge these now; let's keep this discussion going. I moved our discussion to #114.
fixes: #101
fixes: #103