
Bug: llama cpp server arg LLAMA_ARG_N_GPU_LAYERS doesn't follow the same convention as llama cpp python n_gpu_layers #9556

Open

mvonpohle opened this issue Sep 20, 2024 · 1 comment
Labels: bug-unconfirmed, low severity

@mvonpohle
What happened?

If creating a Llama model in Python code, you can specify n_gpu_layers=-1 so that all layers are offloaded to the GPU (see the example below). When starting the llama.cpp server using the Docker image, setting LLAMA_ARG_N_GPU_LAYERS: -1 doesn't have the same effect: no layers are offloaded.

from llama_cpp import Llama

# n_gpu_layers=-1 offloads all model layers to the GPU
Llama('path/to/model', chat_format="llama-3", n_ctx=1024, n_gpu_layers=-1, verbose=False)

The Docker Compose configuration used for the server:

llamacpp-server:
  image: ghcr.io/ggerganov/llama.cpp:server-cuda@sha256:fe887bd3debd1a55ddd95f067435a38166f15a058bf50fee173517b9831081c8
  ports:
    - 8080:8080
  volumes:
    # TODO: change
    - ./model:/model
  environment:
    # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
    LLAMA_ARG_MODEL: /model/path-to-model.gguf
    LLAMA_ARG_N_GPU_LAYERS: -1

Name and Version

From the prebuilt docker image ghcr.io/ggerganov/llama.cpp:server-cuda@sha256:fe887bd3debd1a55ddd95f067435a38166f15a058bf50fee173517b9831081c8

version: 0 (unknown)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

llamacpp-server-1  | ggml_cuda_init: found 1 CUDA devices:
llamacpp-server-1  |   Device 0: Tesla T4, compute capability 7.5, VMM: yes
llamacpp-server-1  | llm_load_tensors: ggml ctx size =    0.14 MiB
llamacpp-server-1  | llm_load_tensors: offloading 0 repeating layers to GPU
llamacpp-server-1  | llm_load_tensors: offloaded 0/33 layers to GPU
llamacpp-server-1  | llm_load_tensors:        CPU buffer size =  6282.97 MiB
@ggerganov (Owner)

You can set it to a very large number, for example LLAMA_ARG_N_GPU_LAYERS=999
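
Applied to the Compose file above (other keys unchanged), the workaround is a minimal tweak: 999 is not a special sentinel, just any value at or above the model's layer count (33 for this model).

llamacpp-server:
  environment:
    LLAMA_ARG_MODEL: /model/path-to-model.gguf
    # any value >= the model's layer count offloads all layers
    LLAMA_ARG_N_GPU_LAYERS: 999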
