
[User] CUDA is broken #1756


Closed
howard0su opened this issue Jun 8, 2023 · 2 comments


howard0su (Collaborator) commented Jun 8, 2023

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [X] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

Crash when I enable two GPU cards. Garbage output when I enable one GPU card.

Two Cards:

PS C:\gpt\llama.cpp> .\build\bin\RelWithDebInfo\main.exe -m ..\en-models\7B\ggml-alpaca-7b-q4.bin -p "what is cuda?" -ngl 40
main: build = 635 (5c64a09)
main: seed  = 1686202333
ggml_init_cublas: found 2 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB
  Device 1: NVIDIA GeForce GTX 1070
llama.cpp: loading model from ..\en-models\7B\ggml-alpaca-7b-q4.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (Tesla P100-PCIE-16GB) as main device
llama_model_load_internal: mem required  = 1932.71 MB (+ 1026.00 MB per state)
llama_model_load_internal: allocating batch_size x 1 MB = 512 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 layers to GPU
llama_model_load_internal: offloading output layer to GPU
llama_model_load_internal: total VRAM used: 3987 MB
...................................................................................................
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 what is cuda?CUDA error 9 at C:\GPT\llama.cpp\ggml-cuda.cu:1574: invalid configuration argument
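For context on the message: CUDA error 9 is cudaErrorInvalidConfiguration, which the runtime returns when a kernel launch requests a grid/block configuration (or shared-memory size) the device cannot satisfy. The sketch below only illustrates how that error text is produced; the kernel, launch parameters, and CUDA_CHECK macro are hypothetical stand-ins for illustration, not the actual ggml-cuda.cu code at line 1574.

// Minimal sketch: reproduce "invalid configuration argument" (CUDA error 9)
// with an error-check-and-abort pattern similar in spirit to llama.cpp's.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",       \
                    (int) err_, __FILE__, __LINE__,               \
                    cudaGetErrorString(err_));                    \
            exit(1);                                              \
        }                                                         \
    } while (0)

__global__ void dummy_kernel(float * x) {
    x[threadIdx.x] = 1.0f;
}

int main() {
    float * d_x = nullptr;
    CUDA_CHECK(cudaMalloc(&d_x, 2048 * sizeof(float)));

    // 2048 threads per block exceeds the 1024-thread limit of these GPUs,
    // so the launch fails with cudaErrorInvalidConfiguration (error 9).
    dummy_kernel<<<1, 2048>>>(d_x);
    CUDA_CHECK(cudaGetLastError());   // prints "invalid configuration argument"

    CUDA_CHECK(cudaFree(d_x));
    return 0;
}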

FlareP1 commented Jun 8, 2023

I suggest looking at the issues that are currently being worked on. There is a pull request that addresses the gibberish output with one GPU; maybe it also fixes the two-GPU issue?
Issue #1735
and
Pull #1735

howard0su (Collaborator, Author) commented

Yeah, it fixes this.
