AMD - tinyBLAS windows prebuilt support stopped working with 0.8.5 #441
Comments
Was llamafile v0.8.4 prebuilt AMD GPU support working for you on Windows? I wasn't able to include prebuilt AMD GPU support for Windows users in the recent release for a couple of reasons, one of which is ggerganov/llama.cpp#7156. There's a workaround you should be able to use. You need to install the AMD ROCm "HIP SDK" on your computer. Once that's installed, llamafile will automatically compile a highly optimized GPU module just for your machine, which will give you a better experience. |
Yes, it was the prebuilt AMD GPU support that was working with 0.8.4. I understand that this is all moving very fast; thank you for your help. According to https://rocm.docs.amd.com/projects/install-on-windows/en/latest/reference/system-requirements.html it seems I can only have the HIP runtime and not the SDK. Does that mean I am out of luck with my GPU? I will try installing the HIP SDK anyway and report here what happens. |
You're not out of luck. The llama.cpp developers and I are working on finding a way to reduce the code size, so we can include the prebuilt ggml-rocm.dll for you in a future release real soon. I recommend just using 0.8.4 for a few weeks until that happens. Sound good? |
Yes, I will continue using 0.8.4 for now. So you think it won't work on my setup with "RX 6700 XT" even if I install the HIP SDK? It is true that for this card the AMD spec page only talks about "runtime" compatibility, so I guess that excludes the just-in-time compilation that you described. The libraries are described on https://rocm.docs.amd.com/en/latest/reference/api-libraries.html I am not sure I understand which ROCm component is needed as a dependency for the just-in-time GPU support compilation. Is it the C++ libraries that are mentioned as ? As another solution that would not involve a |
Here's the Windows AMD GPU DSO I built for the last release that wasn't included. You can use zipalign to include it yourself if you can make it fit. ggml-rocm.dll.zip I don't know what specific component is needed from ROCm. If you're proposing we bundle AMD's DSOs in our llamafile releases, I'd be reluctant to do that. I'm already unhappy about how the address space has to be tainted in order to talk to GPUs. I don't know how we'd call this project open source if our release artifacts were tainted too. |
I don't have Windows, but on Linux, to rebuild you need: (more details here: #188) Note: I still need to find time to test with the latest llamafile, so this may change. |
Quick test (on Linux / AMD Ryzen 9 5950X + AMD Radeon RX 6900 XT):
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 556.46 ms / 33 tokens ( 16.86 ms per token, 59.30 tokens per second)
llama_print_timings: eval time = 37776.07 ms / 133 runs ( 284.03 ms per token, 3.52 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --nocompile --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 229.94 ms / 33 tokens ( 6.97 ms per token, 143.52 tokens per second)
llama_print_timings: eval time = 73144.58 ms / 1411 runs ( 51.84 ms per token, 19.29 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 233.25 ms / 33 tokens ( 7.07 ms per token, 141.48 tokens per second)
llama_print_timings: eval time = 38342.75 ms / 811 runs ( 47.28 ms per token, 21.15 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --temp 0.7 -p '[INST]Write a story about llamas. Please write it in french for a young girle that nead to go to bed.[/INST]'
llama_print_timings: prompt eval time = 119.48 ms / 33 tokens ( 3.62 ms per token, 276.20 tokens per second)
llama_print_timings: eval time = 26408.86 ms / 583 runs ( 45.30 ms per token, 22.08 tokens per second)
For this quick test I used the new release with the old "weight". With a longer prompt:
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile -p "..."
llama_print_timings: prompt eval time = 1029.84 ms / 1466 tokens ( 0.70 ms per token, 1423.53 tokens per second)
llama_print_timings: eval time = 21118.46 ms / 432 runs ( 48.89 ms per token, 20.46 tokens per second)
> ./llamafile-0.8.6 -m Mistral-7b-instruct-v0.2.F16.llamafile -ngl 9999 --recompile --tinyblas -p "..."
llama_print_timings: prompt eval time = 1852.72 ms / 1466 tokens ( 1.26 ms per token, 791.27 tokens per second)
llama_print_timings: eval time = 28902.66 ms / 518 runs ( 55.80 ms per token, 17.92 tokens per second) |
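When comparing many runs like the ones above, it can help to pull the numbers out of the logs instead of reading them by eye. Here is a minimal sketch, assuming the log lines have exactly the `llama_print_timings:` shape pasted in this thread; `parse_timings` and the regular expression are illustrative helpers written for this comment, not part of llamafile.

```python
import re

# Matches lines such as (format copied from the logs in this thread):
#   llama_print_timings: prompt eval time = 556.46 ms / 33 tokens ( ... )
#   llama_print_timings: eval time = 37776.07 ms / 133 runs ( ... )
TIMING_RE = re.compile(
    r"llama_print_timings:\s*(?P<label>prompt eval|eval) time\s*=\s*"
    r"(?P<ms>[\d.]+)\s*ms\s*/\s*(?P<count>\d+)\s*(?:tokens|runs)"
)


def parse_timings(log_text):
    """Return one dict per timing line, with tokens/second recomputed."""
    results = []
    for match in TIMING_RE.finditer(log_text):
        ms = float(match.group("ms"))
        count = int(match.group("count"))
        results.append(
            {
                "label": match.group("label"),
                "ms": ms,
                "count": count,
                # Recompute the rate from the raw values instead of trusting
                # the rounded figure printed at the end of the line.
                "tokens_per_second": count / (ms / 1000.0),
            }
        )
    return results


if __name__ == "__main__":
    sample = (
        "llama_print_timings: prompt eval time = 556.46 ms / 33 tokens "
        "( 16.86 ms per token, 59.30 tokens per second)\n"
        "llama_print_timings: eval time = 37776.07 ms / 133 runs "
        "( 284.03 ms per token, 3.52 tokens per second)\n"
    )
    for row in parse_timings(sample):
        print(f"{row['label']}: {row['tokens_per_second']:.2f} tokens/s")
```

Keep in mind that the eval runs above generated different numbers of tokens (133 to 1411 runs), so per-token rates are the fairer thing to compare, not total times.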
Thanks for posting your numbers! |
v0.8.6 is really impressive for BF16 and Q6_K...
llamafile-bench-0.8.6 -p "256,512,1024" -m "mistral-7b-instruct-v0.2.BF16.gguf,mistral-7b-instruct-v0.2.F16.gguf,mistral-7b-instruct-v0.2.Q4_K_M.gguf,mistral-7b-instruct-v0.2.Q5_K_S.gguf,mistral-7b-instruct-v0.2.Q6_K.gguf,mistral-7b-instruct-v0.2.Q8_0.gguf"
(you can compare with #439 (comment) for actual llama.cpp ) |
is it possible to use llamafile-bench with GPU? |
Try passing the llamafile-bench will support GPU soon. It's a bit trickier because llama-bench was designed in a way that assumes GPU support was figured out at compile-time. So it'll likely take some overhauling. |
In my case it is slower... did I make a mistake? #> ryzen 7940HS:
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -fa -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 18860.70 ms / 1466 tokens ( 12.87 ms per token, 77.73 tokens per second)
llama_print_timings: eval time = 120744.94 ms / 437 runs ( 276.30 ms per token, 3.62 tokens per second)
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 17340.75 ms / 1466 tokens ( 11.83 ms per token, 84.54 tokens per second)
llama_print_timings: eval time = 103088.90 ms / 384 runs ( 268.46 ms per token, 3.72 tokens per second)
(For GPU I have to rebuild with LLAMA_HIP_UMA=1 after making some modifications to llamafile.) |
Interesting, so in some environments it can make things slower. I wonder why that is. Maybe that's why it isn't enabled by default. Thanks for sharing this. As for |
for GPU
|
It looks like flash attention is still a work in progress for AMD GPUs. It's probably due to it being 6 MB of code. AMD GPUs usually have smaller instruction caches and are more sensitive than NVIDIA to code size issues. |
I need to go to bed... but I will add HIP_UMA (and the optimisation) and test with that on the ryzen 7940HS tomorrow. |
OK, I made some patches (https://github.com/Djip007/llamafile/tree/feature/hip_uma)
Some results: (BF16 on CPU, FP16 on GPU) #> ryzen 7940HS:
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -fa -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 18860.70 ms / 1466 tokens ( 12.87 ms per token, 77.73 tokens per second)
llama_print_timings: eval time = 120744.94 ms / 437 runs ( 276.30 ms per token, 3.62 tokens per second)
> ../../llamafile-0.8.6 -m mistral-7b-instruct-v0.2.BF16.gguf -ngl 0 --temp 0 -c 2048
llama_print_timings: prompt eval time = 17340.75 ms / 1466 tokens ( 11.83 ms per token, 84.54 tokens per second)
llama_print_timings: eval time = 103088.90 ms / 384 runs ( 268.46 ms per token, 3.72 tokens per second)
>>- with HIP_UMA
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 31051.46 ms / 1466 tokens ( 21.18 ms per token, 47.21 tokens per second)
llama_print_timings: eval time = 138180.55 ms / 384 runs ( 359.85 ms per token, 2.78 tokens per second)
> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 14817.91 ms / 1466 tokens ( 10.11 ms per token, 98.93 tokens per second)
llama_print_timings: eval time = 157568.49 ms / 635 runs ( 248.14 ms per token, 4.03 tokens per second)
>>- with HIP_UMA+"CoarseGrain patch"
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -fa -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 12391.85 ms / 1466 tokens ( 8.45 ms per token, 118.30 tokens per second)
llama_print_timings: eval time = 102629.02 ms / 384 runs ( 267.26 ms per token, 3.74 tokens per second)
> ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --tinyblas --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 11119.20 ms / 1466 tokens ( 7.58 ms per token, 131.84 tokens per second)
llama_print_timings: eval time = 83272.67 ms / 384 runs ( 216.86 ms per token, 4.61 tokens per second)
> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -fa -ngl 9999 --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 9719.47 ms / 1466 tokens ( 6.63 ms per token, 150.83 tokens per second)
llama_print_timings: eval time = 114512.12 ms / 437 runs ( 262.04 ms per token, 3.82 tokens per second)
> HSA_OVERRIDE_GFX_VERSION=11.0.1 ../../usr/bin/llamafile -m mistral-7b-instruct-v0.2.F16.gguf -ngl 9999 --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time = 7208.44 ms / 1466 tokens ( 4.92 ms per token, 203.37 tokens per second)
llama_print_timings: eval time = 101313.12 ms / 507 runs ( 199.83 ms per token, 5.00 tokens per second)
As you can see:
|
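To put the ryzen 7940HS numbers above side by side, the tokens-per-second figures can be divided by the CPU baseline. This is only arithmetic on the values already posted for the runs without `-fa`; the run labels and the `speedups` helper are made up for this sketch.

```python
# (prompt tokens/s, eval tokens/s) copied from the runs without -fa above.
RUNS = {
    "CPU BF16, -ngl 0": (84.54, 3.72),
    "HIP_UMA, --tinyblas": (47.21, 2.78),
    "HIP_UMA, gfx override": (98.93, 4.03),
    "HIP_UMA + CoarseGrain, --tinyblas": (131.84, 4.61),
    "HIP_UMA + CoarseGrain, gfx override": (203.37, 5.00),
}


def speedups(runs, baseline):
    """Return {label: (prompt ratio, eval ratio)} relative to the baseline run."""
    base_prompt, base_eval = runs[baseline]
    return {
        label: (prompt / base_prompt, evaluation / base_eval)
        for label, (prompt, evaluation) in runs.items()
    }


if __name__ == "__main__":
    for label, (p, e) in speedups(RUNS, "CPU BF16, -ngl 0").items():
        print(f"{label:38s} prompt x{p:.2f}  eval x{e:.2f}")
```

Note the CPU baseline uses the BF16 file while the GPU runs use the F16 file, so the ratios are indicative only.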
I tested the ggml-rocm.dll you provided by simply putting it in the I am not totally familiar yet with the way the releases are built. I thought that 0.8.4 came bundled with the ggml-rocm.dll so my idea was that :
Also, I am not sure if this could help here (because of the needed alignments), but long ago, when I had high compression requirements on Windows, I used https://upx.github.io/ with good success. |
I tried to install HIP SDK (version 5.7) it added a the available .exe are so when llamafile looks for
then it looks for
should it look for the one in HIP_PATH ? then it looks for
then it seems to find it in $HIP_PATH
but it doesn't seem to find a graphics card even though there is no doubt I have a
and then it tries to fallback on the prebuilt AMD GPU support on Windows but does not find it, which is normal for 0.8.5 and 0.8.6 Note that the "missing graphics card" problem is also mentioned in #446 I tried adding the $HIP_PATH in $PATH to force it into finding |
I checked the Line 286 in 397175e
the parsing algorithm seems correct and correctly finds could the problem come from the execution stream for now I cannot test that in my compilation setup. |
There was a hipInfo.exe issue in 0.8.6 that stopped the @jart, I compiled a version of llamafile with Cosmopolitan v3.3.10 to try and see if the version could now build its own For this I installed
Removed
and launched after compilation the powershell terminal started shaking and repeating
Now if I remove the Note that this works both with the original In the rocm.bat version, the command is
while in the auto-compilation procedure the log shows
the main difference seems to be that the auto-compilation procedure seems to compile and indeed if I note that the now I don't know why the could it be because of my graphics card ? what is the difference between tinyblas and cublas support and do you think it can be solved or is it a problem inside the proprietary AMD SDK ? |
If I am correct in your case:
so there may be some "bug" with rocBLAS ... |
Indeed there was a log message in the console that I initially did not see
So it is indeed a problem with the rocBLAS support for gfx1031 on Windows. It does not seem to be officially supported on Linux either, according to https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html but I saw that there may be a way to make it work by recompiling rocBLAS from source or simply adding Tensile files and Kernels pre-compiled for gfx1031 inside I tried that. It makes the CUBLAS support work on my gfx1031, but there do not seem to be performance gains compared to tinyblas (~50 tokens/sec in both cases). I was expecting CUBLAS to bring a significant performance boost, but that does not seem to be the case in my setup. I will have to dig further to understand if this is to be expected or not. |
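A quick way to see whether a rocBLAS library directory already carries files for a given architecture is to search it for file names that contain the architecture string. This is a rough sketch, assuming the Tensile files and kernels embed the gfx name in their file names (an assumption, not something confirmed in this thread); `library_dir` is a placeholder for whatever directory your own install uses, and `find_arch_files` is a helper invented for this example.

```python
from pathlib import Path


def find_arch_files(library_dir, arch="gfx1031"):
    """List file names under library_dir whose name mentions the given arch.

    library_dir is supplied by the caller; this sketch does not know where
    a particular ROCm install keeps its rocBLAS library files.
    """
    root = Path(library_dir)
    if not root.is_dir():
        return []
    return sorted(
        path.name
        for path in root.rglob("*")
        if path.is_file() and arch in path.name
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_arch.py <library_dir> [arch]
    directory = sys.argv[1] if len(sys.argv) > 1 else "."
    arch = sys.argv[2] if len(sys.argv) > 2 else "gfx1031"
    matches = find_arch_files(directory, arch)
    print(f"{len(matches)} file(s) mentioning {arch} under {directory}")
    for name in matches:
        print("  " + name)
```

An empty result would be consistent with the "not supported" log message, but it proves nothing on its own, since the naming assumption may not hold for every rocBLAS build.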
Considered fixed after the release of 0.8.7 |
Hello,
On my computer, with an "AMD 6700 XT" graphics card, tinyBLAS works with 0.8.4.
Now with 0.8.5 it says
The file is present in
/C/Users/ordib/.llamafile/v/0.8.5/ggml-cuda.dll
when I look in the directory.
In the 0.8.4 version the file is
/C/Users/ordib/.llamafile/ggml-cuda.dll
and it loads correctly, logging that tinyBLAS was set up.
Note: I tried both with
-ngl 35
and -ngl 9999
- I am not sure what the correct way is now for AMD/tinyBLAS support.
Tell me if you need more information to understand the difference between 0.8.4 and 0.8.5 on this issue.
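For anyone who wants to check which of the two locations above holds the DLL on their own machine, here is a small sketch. It only mirrors the two paths observed in this report (a versioned `v/<version>/` directory for 0.8.5 and the flat `.llamafile/` directory for 0.8.4); it is not llamafile's actual lookup logic, and the function names are invented for the example.

```python
from pathlib import Path


def candidate_dso_paths(home, version, name="ggml-cuda.dll"):
    """Return the two locations seen in this report, newest layout first."""
    base = Path(home) / ".llamafile"
    return [
        base / "v" / version / name,  # layout observed with 0.8.5
        base / name,                  # layout observed with 0.8.4
    ]


def existing_dso(home, version, name="ggml-cuda.dll"):
    """Return the first candidate path that exists as a file, else None."""
    for path in candidate_dso_paths(home, version, name):
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    found = existing_dso(Path.home(), "0.8.5")
    print(found if found else "no ggml-cuda.dll found in either location")
```

Finding the file only shows that it was extracted; as this issue shows, the DLL can be present and still fail to load.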