
Bug: [SYCL] Error loading models larger than Q4 #9472

Closed
HumerousGorgon opened this issue Sep 13, 2024 · 5 comments
Labels
bug-unconfirmed, medium severity, stale

Comments

@HumerousGorgon

What happened?

After building the SYCL server image, trying to load a model larger than Q4 on my Arc A770 fails with a memory error.
Anything below Q4 loads and runs, but only because its "llm_load_tensors: SYCL0 buffer size" stays below ~4200 MiB.
The Arc A770 has 16 GB of VRAM, so it should be perfectly capable of holding much larger buffers.

Looking for information on this. Thanks!

Name and Version

Relevant docker run command used:
docker run -it --rm -p 11434:11434 -v /mnt/user/models/model-files:/app/models --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -e OverrideGpuAddressSpace=48 -e NEOReadDebugKeys=1 llama-server-cpp-intel -m /app/models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf -n 2048 -e -ngl 33 --port 11434

What operating system are you seeing the problem on?

No response

Relevant log output

llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q8_0:    2 tensors
llama_model_loader: - type q5_K:  192 tensors
llama_model_loader: - type q6_K:   32 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.7999 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 5.63 GiB (6.03 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
ggml_sycl_init: GGML_SYCL_FORCE_MMQ:   no
ggml_sycl_init: SYCL_USE_XMX: yes
ggml_sycl_init: found 1 SYCL devices:
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  5236.84 MiB
llm_load_tensors:        CPU buffer size =   532.31 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 131072
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
[SYCL] call ggml_check_sycl
ggml_check_sycl: GGML_SYCL_DEBUG: 0
ggml_check_sycl: GGML_SYCL_F16: no
found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|    1.3|    512|    1024|   32| 16225M|            1.3.27642|
Native API failed. Native API returns: -6 (PI_ERROR_OUT_OF_HOST_MEMORY) -6 (PI_ERROR_OUT_OF_HOST_MEMORY)
Exception caught at file:/app/ggml/src/ggml-sycl.cpp, line:4313, func:operator()
SYCL error: CHECK_TRY_ERROR((*stream) .memset(ctx->dev_ptr, value, buffer->size) .wait()): Meet error in this line code!
  in function ggml_backend_sycl_buffer_clear at /app/ggml/src/ggml-sycl.cpp:4313
/app/ggml/src/ggml-sycl/common.hpp:107: SYCL error
@HumerousGorgon added the bug-unconfirmed and medium severity labels on Sep 13, 2024
@HumerousGorgon
Author

Another note: loading a pure FP16 model resulted in a SYCL0 buffer size of around 14 GB, and it loaded fine, so now I'm even more stumped.
I also tried to load a reflection-llama-3.1-8B-Q_8 model, which resulted in a buffer size of 7.6 GB, and it still didn't load, so I'm not sure what is going on.

@ggerganov
Owner

Reduce the context size with -c 8192
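
For illustration, this is the docker run command from the report with that flag appended (an untested sketch; all other arguments are unchanged):
docker run -it --rm -p 11434:11434 -v /mnt/user/models/model-files:/app/models --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -e OverrideGpuAddressSpace=48 -e NEOReadDebugKeys=1 llama-server-cpp-intel -m /app/models/Meta-Llama-3.1-8B-Instruct-Q5_K_L.gguf -n 2048 -e -ngl 33 --port 11434 -c 8192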

@NeoZhangJianyu
Collaborator

Yes, it works well.

@qnixsynapse
Contributor

Your context size is 131072, which adds roughly another 16 GB for the KV cache.
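
Rough arithmetic from the loader output above, assuming the default f16 KV cache: each token needs n_layer * (n_embd_k_gqa + n_embd_v_gqa) * 2 bytes = 32 * (1024 + 1024) * 2 = 131072 bytes (128 KiB), so at n_ctx = 131072 the cache alone is 131072 * 128 KiB = 16 GiB, on top of the 5236.84 MiB weight buffer.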

@github-actions bot added the stale label on Oct 16, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
