Won't Use GPU+CPU in 1.78. #1225

Closed
sekushi18 opened this issue Nov 19, 2024 · 12 comments

@sekushi18

Describe the Issue
The latest update causes problems with most of the models I run, mostly anything above 11B. With Vulkan, CLBlast, CuBLAS, and all the legacy backends, the AI (with a character card injected) crashes by overflowing VRAM rather than using the CPU and GPU together.
Essentially, instead of the model being split between the GPU and CPU, it loads onto the GPU only and crashes.

Additional Information:
I'm on Ubuntu 24.04 (Cinnamon), fully updated, with the latest SillyTavern. For reference, 1.73 works well with 11B and 13B models by falling back to the CPU, whereas now a 13B model won't fit on the 10 GB RTX 3080 LHR. To help gauge specs: Ryzen 7 5700G, 128 GB of DDR4.

(I haven't opened GitHub issues often enough to know if I'm doing this right. Sorry.)

@LostRuins
Owner

How many layers is it currently offloading? Try offloading 1 or 2 fewer layers.
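For reference, a command-line launch with a reduced layer count would look roughly like the line below. --model, --usecublas, --gpulayers and --contextsize are the usual KoboldCpp launch flags, but the path and values here are purely illustrative, so check python koboldcpp.py --help on your build:

python koboldcpp.py --model /path/to/model.gguf --usecublas --gpulayers 35 --contextsize 4096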

@sekushi18
Author

I've tried everything from the maximum it suggests (49 layers) down to 33, with the same result. No matter the layer offload, it says it failed and falls back to the CPU-only backend.

@LostRuins
Owner

Alright, could you try running it in a command prompt terminal, then copying the console output (including the crash message) here?

@sekushi18
Author

sekushi18 commented Nov 28, 2024

Sorry for the long delay. Here is hopefully what you wanted, while running a 13B LLM.

CLBlast:
OpenCL GPU Offload Fallback...
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 8801.63 MiB

Processing Prompt [BLAS] (512 / 1247 tokens)CLBlast: OpenCL error: clEnqueueNDRangeKernel: -4

QF32 Matmul Failed (-4): [dims: 5120,5120,5120,512] You may be out of VRAM. Please check if you have enough.
ggml-opencl.cpp:1892: GGML_ASSERT(false) failed
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)

CuBLAS:
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 107.42 MiB
llm_load_tensors: CUDA0 model buffer size = 8694.21 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:17794.7).
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 6272
llama_new_context_with_model: n_ctx_per_seq = 6272
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 17794.7
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (6272) > n_ctx_train (4096) -- possible training context overflow
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4900.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_new_context_with_model: llama_kv_cache_init() failed for self-attention cache
gpttype_load_model: error: failed to load model '/home/chris/kobold/llms/mythomax-l2-13b.Q5_K_M.gguf'
Load Text Model OK: False

Vulkan:
llm_load_tensors: offloading 37 repeating layers to GPU
llm_load_tensors: offloaded 37/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 8801.63 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:17794.7).
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 6272
llama_new_context_with_model: n_ctx_per_seq = 6272
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 17794.7
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (6272) > n_ctx_train (4096) -- possible training context overflow
llama_kv_cache_init: CPU KV buffer size = 4900.00 MiB
llama_new_context_with_model: KV self size = 4900.00 MiB, K (f16): 2450.00 MiB, V (f16): 2450.00 MiB
llama_new_context_with_model: CPU output buffer size = 0.12 MiB
llama_new_context_with_model: CPU compute buffer size = 542.26 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 1
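(For context on the CuBLAS failure above: with all 41 layers offloaded, the weights alone take 8694 MiB, and the f16 KV cache for a 6272-token context on this 40-layer, 5120-wide model works out to about 2 (K and V) x 6272 tokens x 40 layers x 5120 dims x 2 bytes ≈ 4900 MiB, matching the 4900.00 MiB allocation that cudaMalloc rejects. Together that is roughly 13.3 GiB, which cannot fit in the 3080's 10 GB, so either fewer offloaded layers, a smaller context, or keeping the KV cache in system RAM is needed.)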

@LostRuins
Owner

Can you try with cublas, with lowvram enabled and flash attention disabled?
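Something like the line below should do it; the lowvram option keeps the KV cache in system RAM instead of VRAM, and flash attention stays off unless --flashattention is passed. The flag spellings are from recent KoboldCpp builds and the path and values are illustrative, so double-check against --help:

python koboldcpp.py --model /path/to/mythomax-l2-13b.Q5_K_M.gguf --usecublas lowvram --gpulayers 41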

@sekushi18
Author

Here's CuBLAS with lowvram; flash attention I always keep disabled.
llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU_Mapped model buffer size = 107.42 MiB
llm_load_tensors: CUDA0 model buffer size = 8694.21 MiB
....................................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:17794.7).
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 6272
llama_new_context_with_model: n_ctx_per_seq = 6272
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 17794.7
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_pre_seq (6272) > n_ctx_train (4096) -- possible training context overflow
llama_kv_cache_init: CUDA_Host KV buffer size = 4900.00 MiB
llama_new_context_with_model: KV self size = 4900.00 MiB, K (f16): 2450.00 MiB, V (f16): 2450.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.12 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 84.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 532.26 MiB
llama_new_context_with_model: graph nodes = 1286
llama_new_context_with_model: graph splits = 82

@LostRuins
Owner

And it just crashes after that line?

@sekushi18
Author

It doesn't crash, but as the report says it auto-falls back to CPU-only mode. It just makes the GPU do nothing for 40 of the 41 layers and saddles the rest on the CPU, turning my Linux box into a slow slideshow until I tell the terminal to stop.

@LostRuins
Owner

If you mean "llm_load_tensors: tensor 'token_embd.weight' (q5_K) (and 0 others) cannot be used with preferred buffer type CUDA_Host, using CPU instead", that is not an error; it only applies to that one tensor.

@sekushi18
Author

Thank you, but that is what I opened this issue about: why is it only using one tensor? As I explained at the beginning, the last version I used let the GPU take the tensors for the AI and the CPU handle the rest. Currently, though, it seems that outside of 11B models the program can't actively use my GPU together with my CPU; as you already said, the GPU gets just one tensor while my CPU is throttled handling both jobs.

@LostRuins
Owner

No, it's the opposite. Everything is working fine on the GPU, aside from that one tensor, which is on the CPU.
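(That reading matches the earlier CuBLAS log: CUDA0 model buffer size = 8694.21 MiB is the 40 repeating layers plus the output layer sitting on the GPU, while CPU_Mapped model buffer size = 107.42 MiB is just the single token_embd.weight tensor that fell back to the CPU.)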

@sekushi18
Author

Oh, I see. Sorry for all of this.
