Added the ability to use LLAMA_HIP_UMA #439

Djip007 · 2024-05-24T16:07:58Z

With AMD APU (like my Ryzen 7940HX) it is possible to use "UMA" to extand VRAM. And in my case I can't alloc more than 4Go of VRAM (bios config).

And with this (ggerganov/llama.cpp#7399) it may be as fast as with VRAM (I can't do a full test because I can't allocate more than 4Go of VRAM with my config)

I can (:crossed_fingers: ) make a PR here but need to know what the best is to made it available.

a runtime option
a failback alloc
a default on some hardware...

Djip007 · 2024-05-24T16:16:10Z

First in all case it is not good to use it on dGPU (it work but really slow) so only to activate on iGPU.
It look to work well on my Ryzen 7940HX on linux.

We may have to get more benchmark to deside what to do.

Djip007 · 2024-05-24T17:47:00Z

If you want to help with benchmark, what I did:
for Linux

# get PR (until it is merged.)
git clone https://github.com/ggerganov/llama.cpp.git llama.cpp_bench
cd llama.cpp_bench
git fetch origin pull/7414/head:benchmark
git checkout benchmark

# get Models: (to allow benchmark fare compare.)
cd ..
mkdir models
cd models
wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.F16.llamafile
unzip mistral-7b-instruct-v0.2.F16.llamafile mistral-7b-instruct-v0.2.F16.gguf
rm mistral-7b-instruct-v0.2.F16.llamafile 

wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q8_0.llamafile
unzip mistral-7b-instruct-v0.2.Q8_0.llamafile mistral-7b-instruct-v0.2.Q8_0.gguf
rm mistral-7b-instruct-v0.2.Q8_0.llamafile 

wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.llamafile
unzip mistral-7b-instruct-v0.2.Q4_K_M.llamafile mistral-7b-instruct-v0.2.Q4_K_M.gguf
rm mistral-7b-instruct-v0.2.Q4_K_M.llamafile 

wget https://huggingface.co/Mozilla/Mistral-7B-Instruct-v0.2-llamafile/resolve/main/mistral-7b-instruct-v0.2.BF16.llamafile
unzip mistral-7b-instruct-v0.2.BF16.llamafile mistral-7b-instruct-v0.2.BF16.gguf
rm mistral-7b-instruct-v0.2.BF16.llamafile

# build for CPU  [n°0]
cd llama.cpp_bench
make clean
make -j16

# build for GPU
#  - for ryzen 7040 gfx1103 is note "supported" use gfx1101 on linux
export HSA_OVERRIDE_GFX_VERSION=11.0.1
export GFX_HARDWARE=gfx1101
#  - for other ???

# - weight on VRAM [n°1]
make clean
make -j16 LLAMA_HIPBLAS=1 AMDGPU_TARGETS=${GFX_HARDWARE}

# - weight on "UMA" [n°2]
make clean
make -j16 LLAMA_HIPBLAS=1 LLAMA_HIP_UMA=1 AMDGPU_TARGETS=gfx1101

# benchmark:
# - for CPU:
./llama-bench --mmap 1 -p 256,512,1024 \
   -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
   -m ../models/mistral-7b-instruct-v0.2.Q8_0.gguf \
   -m ../models/mistral-7b-instruct-v0.2.F16.gguf \
   -m ../models/mistral-7b-instruct-v0.2.BF16.gguf 

# - for GPU:
./llama-bench --mmap 0 -p 256,512,1024 \
   -m ../models/mistral-7b-instruct-v0.2.Q4_K_M.gguf \
   -m ../models/mistral-7b-instruct-v0.2.Q8_0.gguf \
   -m ../models/mistral-7b-instruct-v0.2.F16.gguf

Djip007 · 2024-05-24T17:52:31Z

Hardware: Ryzen 7940HS / 64Go
export HSA_OVERRIDE_GFX_VERSION=11.0.1
export GFX_HARDWARE=gfx1101

CPU (n°0):

model	size	params	backend	threads	test	t/s
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	CPU	8	pp256	46.49 ± 0.57
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	CPU	8	pp512	45.33 ± 0.06
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	CPU	8	pp1024	44.08 ± 0.21
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	CPU	8	tg128	12.91 ± 0.04
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	CPU	8	pp512+tg128	29.58 ± 0.08
llama 7B Q8_0	7.17 GiB	7.24 B	CPU	8	pp256	57.58 ± 0.10
llama 7B Q8_0	7.17 GiB	7.24 B	CPU	8	pp512	55.43 ± 0.07
llama 7B Q8_0	7.17 GiB	7.24 B	CPU	8	pp1024	54.98 ± 0.04
llama 7B Q8_0	7.17 GiB	7.24 B	CPU	8	tg128	7.39 ± 0.09
llama 7B Q8_0	7.17 GiB	7.24 B	CPU	8	pp512+tg128	23.86 ± 0.25
llama 7B F16	13.49 GiB	7.24 B	CPU	8	pp256	54.08 ± 0.24
llama 7B F16	13.49 GiB	7.24 B	CPU	8	pp512	43.61 ± 0.05
llama 7B F16	13.49 GiB	7.24 B	CPU	8	pp1024	43.33 ± 0.08
llama 7B F16	13.49 GiB	7.24 B	CPU	8	tg128	3.96 ± 0.01
llama 7B F16	13.49 GiB	7.24 B	CPU	8	pp512+tg128	14.50 ± 0.02
llama 7B BF16	13.49 GiB	7.24 B	CPU	8	pp256	39.65 ± 0.20
llama 7B BF16	13.49 GiB	7.24 B	CPU	8	pp512	39.11 ± 0.01
llama 7B BF16	13.49 GiB	7.24 B	CPU	8	pp1024	38.44 ± 0.21
llama 7B BF16	13.49 GiB	7.24 B	CPU	8	tg128	4.06 ± 0.01
llama 7B BF16	13.49 GiB	7.24 B	CPU	8	pp512+tg128	14.33 ± 0.01

GPU-VRAM (n°1): (not enough VRAM for me)
GPU-UMA (n°2):

model	size	params	backend	ngl	test	t/s
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	ROCm	99	pp256	220.96 ± 0.94
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	ROCm	99	pp512	196.54 ± 0.15
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	ROCm	99	pp1024	192.39 ± 0.33
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	ROCm	99	tg128	15.08 ± 0.09
llama 7B Q4_K - Medium	4.07 GiB	7.24 B	ROCm	99	pp512+tg128	56.49 ± 0.04
llama 7B Q8_0	7.17 GiB	7.24 B	ROCm	99	pp256	213.08 ± 0.22
llama 7B Q8_0	7.17 GiB	7.24 B	ROCm	99	pp512	194.34 ± 0.57
llama 7B Q8_0	7.17 GiB	7.24 B	ROCm	99	pp1024	190.32 ± 0.45
llama 7B Q8_0	7.17 GiB	7.24 B	ROCm	99	tg128	10.37 ± 0.01
llama 7B Q8_0	7.17 GiB	7.24 B	ROCm	99	pp512+tg128	41.89 ± 0.04
llama 7B F16	13.49 GiB	7.24 B	ROCm	99	pp256	271.43 ± 1.39
llama 7B F16	13.49 GiB	7.24 B	ROCm	99	pp512	215.60 ± 0.41
llama 7B F16	13.49 GiB	7.24 B	ROCm	99	pp1024	209.43 ± 0.34
llama 7B F16	13.49 GiB	7.24 B	ROCm	99	tg128	5.01 ± 0.36
llama 7B F16	13.49 GiB	7.24 B	ROCm	99	pp512+tg128	21.40 ± 0.23

Djip007 · 2024-05-26T01:13:00Z

@jart
#441 (comment)

As for LLAMA_HIP_UMA=1 do you know what, if anything, it'll do to environments that don't have this? If you know how to detect it at runtime, I could change ggml-cuda to runtime dispatch to the right implementation.

What I test is add option: ./llamafile --recompile --use_hip_uma and change the args use for rebuild.

I don't know what to detect... all AMD APU ?

But main consern with this option is use GTT over VRAM ... but there is a new update in linux kernel 6.10: https://www.phoronix.com/news/Linux-6.10-AMDKFD-Small-APUs remove this need... (don't know what happen for Windows.) so may be a simple option for rebuild is good.

There is even more interesting thing after we did some POC ( ggerganov/llama.cpp#7399 (comment) ) it look we can leave weight mmap in place with good perf. But it is more complicate to do it properly...

Djip007 · 2024-05-26T13:32:36Z

Make a POC on that: https://github.com/Djip007/llamafile/tree/feature/hip_uma
add --use_hip_uma to use with --recompile

Do you want I make a MergeRequest?

Djip007 · 2024-05-26T19:24:12Z

OK some other benchmark...

# GPU: HSA_OVERRIDE_GFX_VERSION=11.0.1 llamafile -m mixtral-8x7b-instruct-v0.1.Q6_K.gguf -ngl 9999 --no-mmap --recompile --use-hip-uma --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =   15434.50 ms /  1466 tokens (   10.53 ms per token,    94.98 tokens per second)
llama_print_timings:        eval time =   85566.43 ms /   535 runs   (  159.94 ms per token,     6.25 tokens per second)

#> CPU: llamafile -m mixtral-8x7b-instruct-v0.1.Q6_K.gguf -ngl 0 --temp 0 -c 2048 -p "..."
llama_print_timings: prompt eval time =   31892.26 ms /  1466 tokens (   21.75 ms per token,    45.97 tokens per second)
llama_print_timings:        eval time =   89044.50 ms /   449 runs   (  198.32 ms per token,     5.04 tokens per second)

I'm thinking of finding a way to allow it to be activated by default...

Djip007 · 2024-08-02T23:23:57Z

Not 100% sure but the last ggml update may have break the patch I made here.
I'll check that in the next few days

reopened to remind me ;)

=> yes part of this PR have been remove with llama.cpp synchronize e9ee3f9

Patch is here: https://github.com/Djip007/llamafile/tree/feature/hip_uma_3 but I have some OS bug on my config so can't completely test it...

mofosyne added the question label May 25, 2024

Djip007 mentioned this issue May 26, 2024

AMD - tinyBLAS windows prebuilt support stopped working with 0.8.5 #441

Closed

Djip007 mentioned this issue Jun 14, 2024

update GGML_HIP_UMA #473

Merged

Djip007 closed this as completed Jun 22, 2024

Djip007 reopened this Aug 2, 2024

Djip007 mentioned this issue Aug 13, 2024

update GGML_HIP_UMA #536

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added the ability to use LLAMA_HIP_UMA #439

Added the ability to use LLAMA_HIP_UMA #439

Djip007 commented May 24, 2024

Djip007 commented May 24, 2024

Djip007 commented May 24, 2024 •

edited

Loading

Djip007 commented May 24, 2024 •

edited

Loading

Djip007 commented May 26, 2024

Djip007 commented May 26, 2024

Djip007 commented May 26, 2024

Djip007 commented Aug 2, 2024 •

edited

Loading

Added the ability to use LLAMA_HIP_UMA #439

Added the ability to use LLAMA_HIP_UMA #439

Comments

Djip007 commented May 24, 2024

Djip007 commented May 24, 2024

Djip007 commented May 24, 2024 • edited Loading

Djip007 commented May 24, 2024 • edited Loading

Djip007 commented May 26, 2024

Djip007 commented May 26, 2024

Djip007 commented May 26, 2024

Djip007 commented Aug 2, 2024 • edited Loading

Djip007 commented May 24, 2024 •

edited

Loading

Djip007 commented May 24, 2024 •

edited

Loading

Djip007 commented Aug 2, 2024 •

edited

Loading