Won't Use GPU+CPU in 1.78. #1225

Closed
sekushi18 opened this issue Nov 19, 2024 · 12 comments

@sekushi18

Describe the Issue
The latest update causes problems with most of the models I run, mostly anything above 11B. With Vulkan, CLBlast, CuBLAS, and all the legacy backends, the AI (with a character card injected) crashes by overflowing VRAM rather than using the CPU and GPU together.
Essentially, instead of the model being split between the GPU and CPU, it loads onto the GPU only and crashes.

Additional Information:
I'm on Ubuntu 24.04 (Cinnamon), fully updated, with the latest SillyTavern. For reference, 1.73 works well with 11B and 13B models by falling back to the CPU, whereas now a 13B model won't fit on the 10 GB RTX 3080 LHR. To help gauge specs: Ryzen 7 5700G, 128 GB of DDR4.

(I haven't opened GitHub issues often enough to know if I'm doing this right. Sorry.)

@LostRuins
Owner

How many layers is it currently offloading? Try offloading 1 or 2 fewer layers.
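For reference, a command-line launch with a reduced layer count would look roughly like the line below. --model, --usecublas, --gpulayers and --contextsize are the usual KoboldCpp launch flags, but the path and values here are purely illustrative, so check python koboldcpp.py --help on your build:

python koboldcpp.py --model /path/to/model.gguf --usecublas --gpulayers 35 --contextsize 4096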

@sekushi18
Author

I've tried everything from the maximum it suggests (49 layers) down to 33, with the same result. No matter the layer offload, it says it failed and falls back to the CPU-only backend.

@LostRuins
Owner

Alright, could you try running it in a command prompt terminal, then copying the console output (including the crash message) here?

@sekushi18
Author

sekushi18 commented Nov 28, 2024

Sorry for the long delay. Here is hopefully what you wanted, while running a 13B LLM.

CLBlast:
OpenCL GPU Offload Fallback...
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 8801.63 MiB

Processing Prompt [BLAS] (512 / 1247 tokens)CLBlast: OpenCL error: clEnqueueNDRangeKernel: -4

QF32 Matmul Failed (-4): [dims: 5120,5120,5120,512] You may be out of VRAM. Please check if you have enough.
ggml-opencl.cpp:1892: GGML_ASSERT(false) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

CuBLAS:
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 107.42 MiB
llm_load_tensors: CUDA0 model buffer size = 8694.21 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:17794.7).
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 6272
llama_new_context_with_model: n_ctx_per_seq = 6272
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 17794.7
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (6272) > n_ctx_train (4096) -- possible training context overflow
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4900.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
gpttype_load_model: error: failed to load model '/home/chris/kobold/llms/mythomax-l2-13b.Q5_K_M.gguf'
Load Text Model OK: False

Vulkan:
llm_load_tensors: offloading 37 repeating layers to GPU
llm_load_tensors: offloaded 37/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 8801.63 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:17794.7).
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 6272
llama_new_context_with_model: n_ctx_per_seq = 6272
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 17794.7
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (6272) > n_ctx_train (4096) -- possible training context overflow
llama_kv_cache_init: CPU KV buffer size = 4900.00 MiB
llama_new_context_with_model: KV self size = 4900.00 MiB, K (f16): 2450.00 MiB, V (f16): 2450.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 542.26 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
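(For context on the CuBLAS failure above: with all 41 layers offloaded, the weights alone take 8694 MiB, and the f16 KV cache for a 6272-token context on this 40-layer, 5120-wide model works out to about 2 (K and V) x 6272 tokens x 40 layers x 5120 dims x 2 bytes ≈ 4900 MiB, matching the 4900.00 MiB allocation that cudaMalloc rejects. Together that is roughly 13.3 GiB, which cannot fit in the 3080's 10 GB, so either fewer offloaded layers, a smaller context, or keeping the KV cache in system RAM is needed.)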

@LostRuins
Owner

Can you try with cublas, with lowvram enabled and flash attention disabled?
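Something like the line below should do it; the lowvram option keeps the KV cache in system RAM instead of VRAM, and flash attention stays off unless --flashattention is passed. The flag spellings are from recent KoboldCpp builds and the path and values are illustrative, so double-check against --help:

python koboldcpp.py --model /path/to/mythomax-l2-13b.Q5_K_M.gguf --usecublas lowvram --gpulayers 41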

@sekushi18
Author

Here's CuBLAS with lowvram; flash attention I always keep disabled.
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 107.42 MiB
llm_load_tensors: CUDA0 model buffer size = 8694.21 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:17794.7).
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 6272
llama_new_context_with_model: n_ctx_per_seq = 6272
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 17794.7
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (6272) > n_ctx_train (4096) -- possible training context overflow
llama_kv_cache_init: CUDA_Host KV buffer size = 4900.00 MiB
llama_new_context_with_model: KV self size = 4900.00 MiB, K (f16): 2450.00 MiB, V (f16): 2450.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 84.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 532.26 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 82

@LostRuins
Owner

And it just crashes after that line?

@sekushi18
Author

It doesn't crash, but as the report says it auto-falls back to CPU-only mode. It just makes the GPU do nothing for 40 of the 41 layers and saddles the rest on the CPU, turning my Linux box into a slow slideshow until I tell the terminal to stop.

@LostRuins
Owner

If you mean "llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead", that is not an error; it only applies to that one tensor.

@sekushi18
Author

Thank you, but that is what I opened this issue about: why is it only using one tensor? As I explained at the beginning, the last version I used let the GPU take the tensors for the AI and the CPU handle the rest. Currently, though, it seems that outside of 11B models the program can't actively use my GPU together with my CPU; as you already said, the GPU gets just one tensor while my CPU is throttled handling both jobs.

@LostRuins
Owner

No, it's the opposite. Everything is working fine on the GPU, aside from that one tensor, which is on the CPU.
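(That reading matches the earlier CuBLAS log: CUDA0 model buffer size = 8694.21 MiB is the 40 repeating layers plus the output layer sitting on the GPU, while CPU_Mapped model buffer size = 107.42 MiB is just the single token_embd.weight tensor that fell back to the CPU.)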

@sekushi18
Author

Oh, I see. Sorry for all of this.
