
Bug: llama cpp server arg LLAMA_ARG_N_GPU_LAYERS doesn't follow the same convention as llama cpp python n_gpu_layers #9556

Open

mvonpohle opened this issue Sep 20, 2024 · 1 comment
Labels: bug-unconfirmed, low severity

@mvonpohle
What happened?

If creating a Llama model in Python code, you can specify n_gpu_layers=-1 so that all layers are offloaded to the GPU (see the example below). When starting the llama.cpp server using the Docker image, setting LLAMA_ARG_N_GPU_LAYERS: -1 doesn't have the same effect: no layers are offloaded.

from llama_cpp import Llama

# n_gpu_layers=-1 offloads all model layers to the GPU
Llama('path/to/model', chat_format="llama-3", n_ctx=1024, n_gpu_layers=-1, verbose=False)

The Docker Compose configuration used for the server:

llamacpp-server:
  image: ghcr.io/ggerganov/llama.cpp:server-cuda@sha256:fe887bd3debd1a55ddd95f067435a38166f15a058bf50fee173517b9831081c8
  ports:
    - 8080:8080
  volumes:
    # TODO: change
    - ./model:/model
  environment:
    # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
    LLAMA_ARG_MODEL: /model/path-to-model.gguf
    LLAMA_ARG_N_GPU_LAYERS: -1

Name and Version

From the prebuilt docker image ghcr.io/ggerganov/llama.cpp:server-cuda@sha256:fe887bd3debd1a55ddd95f067435a38166f15a058bf50fee173517b9831081c8

version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

llamacpp-server-1  | ggml_cuda_init: found 1 CUDA devices:
llamacpp-server-1  |   Device 0: Tesla T4, compute capability 7.5, VMM: yes
llamacpp-server-1  | llm_load_tensors: ggml ctx size =    0.14 MiB
llamacpp-server-1  | llm_load_tensors: offloading 0 repeating layers to GPU
llamacpp-server-1  | llm_load_tensors: offloaded 0/33 layers to GPU
llamacpp-server-1  | llm_load_tensors:        CPU buffer size =  6282.97 MiB
@ggerganov (Owner)

You can set it to a very large number, for example LLAMA_ARG_N_GPU_LAYERS=999
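
Applied to the Compose file above (other keys unchanged), the workaround is a minimal tweak: 999 is not a special sentinel, just any value at or above the model's layer count (33 for this model).

llamacpp-server:
  environment:
    LLAMA_ARG_MODEL: /model/path-to-model.gguf
    # any value >= the model's layer count offloads all layers
    LLAMA_ARG_N_GPU_LAYERS: 999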
