Issue Description:
The token generation speed of llamafile is slower compared to the upstream llama.cpp project.
Details:
llamafile version 0.8.12, with ggml-cuda built using the following command:
nvcc -arch=all -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll.all ggml-cuda.cu -lcublas -lcuda
llama.cpp version tested: b3567 (llamafile should have included all updates from this version)
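(As a sketch for comparison: -arch=all makes nvcc embed code for every supported GPU architecture, whereas a build targeting only the RTX 2080 reported in the logs, compute capability 7.5, would swap that single flag; all other options below are copied from the command above, and the output name is illustrative.)
nvcc -arch=sm_75 -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll ggml-cuda.cu -lcublas -lcuda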
In comparison:
llamafile only achieves 26 tokens/s.
llama.cpp achieves 51 tokens/s.
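(The exact measurement commands are not stated here; as a sketch of a like-for-like rerun, assuming the model file is in the working directory, the llama.cpp side can use its bundled llama-bench tool, which reports prompt-processing and text-generation tokens/s with all layers offloaded.)
llama-bench -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99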
GPU Utilization:
According to nvidia-smi, the GPU utilization for llamafile is observed to be 41%, whereas for llama.cpp it reaches 80%.
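(Worth noting: the two gaps are roughly proportional, 26/51 ≈ 0.51 for tokens/s and 41/80 ≈ 0.51 for utilization, which would be consistent with the GPU sitting idle part of the time rather than running slower kernels. As a sketch, utilization can be sampled once per second while generating with:)
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1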
Model Used for Testing:
qwen2-7b-instruct-q3_k_m.gguf
Test Environment:
llamafile v0.8.12
Windows
Logs Comparison:
llama.cpp Log (b3567):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.32 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors: CPU buffer size = 223.33 MiB
llm_load_tensors: CUDA0 buffer size = 3402.96 MiB
llamafile Log:
ggml_cuda_link: welcome to CUDA SDK with cuBLAS
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.38 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloaded 28/29 layers to GPU
llm_load_tensors: CPU buffer size = 3626.29 MiB
llm_load_tensors: CUDA0 buffer size = 2976.59 MiB
The llamafile log is missing the line "offloading non-repeating layers to GPU", and it offloads only 28/29 layers (versus 29/29 for llama.cpp). I'm wondering if this could be the reason for the performance issue.
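(A quick check, sketched with flags from the llamafile documentation and an illustrative binary name: explicitly requesting that all layers be offloaded, then re-reading the llm_load_tensors lines, would show whether the non-repeating layer can be forced onto the GPU at all.)
llamafile-0.8.12 -m qwen2-7b-instruct-q3_k_m.gguf -ngl 9999 --gpu nvidia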
Model: Qwen/Qwen2-7B-Instruct-GGUF
Specific file: qwen2-7b-instruct-q3_k_m.gguf
Share qwen2_5-7b.llamafile