
Bug: Token generation speed is slower than in the upstream llama.cpp project #533

Closed
BIGPPWONG opened this issue Aug 13, 2024 · 1 comment

Comments


BIGPPWONG commented Aug 13, 2024

Contact Details

No response

What happened?

Issue Description:

The token generation speed of llamafile is slower than that of the upstream llama.cpp project.

Details:

  • llamafile version 0.8.12: ggml-cuda built with the command:

    nvcc -arch=all -DIGNORE123 -O3 --shared --use_fast_math --forward-unknown-to-host-compiler --compiler-options "/nologo /EHsc /O2 /GR /MT" -DGGML_BUILD=1 -DGGML_SHARED=1 -DGGML_MULTIPLATFORM -DGGML_CUDA_DMMV_X=32 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -DGGML_CUDA_MMV_Y=1 -DGGML_USE_CUBLAS -DTEHFLASH -o ggml-cuda.dll.all ggml-cuda.cu -lcublas -lcuda
  • llama.cpp versions tested:
    • b3567 (latest version)
    • b2968 (from May 22nd, meaning llamafile 0.8.12 should already include all upstream changes up to that version)

In comparison (rough reproduction commands are sketched after this list):

  • llamafile only achieves 26 tokens/s.
  • Both versions of llama.cpp achieve 51 tokens/s.
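
The exact invocations behind these numbers aren't stated in the report; a minimal sketch of how the two could be compared, assuming the model file listed below and full GPU offload (all flag values here are assumptions, not what was actually run), might look like:

    # llama.cpp (b3567 / b2968): built-in benchmark, reports tokens/s for prompt processing and generation
    llama-bench -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99

    # llamafile 0.8.12: CLI generation; tokens/s appears in the timing summary at the end
    llamafile --cli -m qwen2-7b-instruct-q3_k_m.gguf -ngl 99 -p "Hello" -n 128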

GPU Utilization:

  • Using nvidia-smi, the observed GPU utilization for llamafile is 41%, whereas llama.cpp reaches 80% (a sampling command is sketched below).
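
The report doesn't say exactly how utilization was sampled; one way to watch it during a generation run (a sketch, not necessarily what was used) is:

    # print GPU utilization and memory use once per second
    nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1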

Model Used for Testing:

  • Model: Qwen/Qwen2-7B-Instruct-GGUF
  • Specific file: qwen2-7b-instruct-q3_k_m.gguf (one way to fetch it is sketched below)
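
The report doesn't say how the file was obtained; one way to pull the exact quantization from Hugging Face (assuming the huggingface-cli tool is installed; a sketch, not part of the original report) is:

    # download the specific GGUF quantization that was tested into the current directory
    huggingface-cli download Qwen/Qwen2-7B-Instruct-GGUF qwen2-7b-instruct-q3_k_m.gguf --local-dir .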

Test Environment:

  • Operating System: Windows 10
  • GPU: RTX 2080
  • CUDA Version: 12.6

Version

llamafile v0.8.12

What operating system are you seeing the problem on?

Windows

Relevant log output

Logs Comparison:

  • llama.cpp Log (b3567):

    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.32 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 29/29 layers to GPU
    llm_load_tensors:        CPU buffer size =   223.33 MiB
    llm_load_tensors:      CUDA0 buffer size =  3402.96 MiB
  • llamafile Log:

    ggml_cuda_link: welcome to CUDA SDK with cuBLAS
    ...
    ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
    ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
    ggml_cuda_init: found 1 CUDA devices:
    Device 0: NVIDIA GeForce RTX 2080, compute capability 7.5, VMM: yes
    llm_load_tensors: ggml ctx size =    0.38 MiB
    llm_load_tensors: offloading 28 repeating layers to GPU
    llm_load_tensors: offloaded 28/29 layers to GPU
    llm_load_tensors:        CPU buffer size =  3626.29 MiB
    llm_load_tensors:      CUDA0 buffer size =  2976.59 MiB

The llamafile log is missing the "offloading non-repeating layers to GPU" line: it reports only 28/29 layers offloaded and a much larger CPU buffer (3626.29 MiB vs 223.33 MiB for llama.cpp). I'm wondering if this could be the reason for the performance issue.
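
If the non-repeating (output) layer is indeed staying on the CPU, one thing worth checking (a hedged suggestion, not a confirmed fix) is whether explicitly asking for more GPU layers than the model has changes the split:

    # request offload of every layer, including the non-repeating output layer; flags are assumptions
    llamafile --cli -m qwen2-7b-instruct-q3_k_m.gguf -ngl 999 -p "test" -n 128

If "offloaded 29/29 layers" still doesn't appear with a large -ngl value, the difference likely sits in llamafile's offload logic rather than in the flags passed.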


bphd commented Dec 19, 2024

  • Model: Qwen/Qwen2-7B-Instruct-GGUF

  • Specific file: qwen2-7b-instruct-q3_k_m.gguf

Could you share a qwen2_5-7b.llamafile?
