CUDA error: out of memory after b1697 #4680

Closed
hydai opened this issue Dec 29, 2023 · 2 comments · Fixed by #4687
Comments

hydai (Contributor) commented Dec 29, 2023

Summary

In b1696 everything works fine.
However, b1697 introduces the CUDA VMM pool, and since then llama.cpp always fails with an out-of-memory error on this device.

Hardware

NVIDIA Jetson AGX Orin 64GB

uname -a
Linux jetson-orin 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux

OS

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

Steps to reproduce

  1. Build llama.cpp b1710:
git checkout b1710
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release -- -j
  2. Run llama.cpp:
./build/bin/main -m /disk/models/baichuan2-7b-base.Q5_K.gguf -n 512 -ngl 35 -p '這是一段中文測試'
  3. Observe the error message (a short note on the failing call follows the log):
Log start
main: build = 1710 (65e5f6d)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1703835560
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /disk/models/baichuan2-7b-base.Q5_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,125696]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,125696]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,125696]  = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: mismatch in special tokens definition ( 1298/125696 vs 259/125696 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 125696
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 4.99 GiB (5.71 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =  337.67 MiB
llm_load_tensors: VRAM used           = 4775.16 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 256.69 MiB
llama_new_context_with_model: VRAM scratch buffer: 253.50 MiB
llama_new_context_with_model: total VRAM used: 5284.66 MiB (model: 4775.16 MiB, context: 509.50 MiB)
CUDA error: out of memory
  current device: 0, in function ggml_cuda_pool_malloc_vmm at /home/hydai/workspace/llama.cpp/ggml-cuda.cu:6694
  cuMemAddressReserve(&g_cuda_pool_addr[device], CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
GGML_ASSERT: /home/hydai/workspace/llama.cpp/ggml-cuda.cu:225: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
[1]    456065 abort (core dumped)  ./build/bin/main -m /disk/models/baichuan2-7b-base.Q5_K.gguf -n 512 -ngl 35 -
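
Note on the assertion above: the VMM pool added in b1697 reserves a CUDA_POOL_VMM_MAX_SIZE-sized virtual address range up front via cuMemAddressReserve inside ggml_cuda_pool_malloc_vmm, and it is that reservation, not a physical allocation, that returns "out of memory" here. The standalone probe below is my own sketch, not code from llama.cpp, and the sizes it tries are assumptions; it simply checks which reservation sizes the driver on this device will accept.

// vmm_probe.cu -- standalone probe, not part of llama.cpp.
// Tries progressively smaller virtual address reservations with the CUDA
// driver API to see what the device will grant. Illustration only.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS) { fprintf(stderr, "cuInit failed\n"); return 1; }
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Try 64 GiB, 32 GiB, 16 GiB, ... down to 1 GiB.
    for (size_t size = 64ull << 30; size >= 1ull << 30; size >>= 1) {
        CUdeviceptr addr = 0;
        CUresult res = cuMemAddressReserve(&addr, size, 0, 0, 0);
        if (res == CUDA_SUCCESS) {
            printf("reserved %zu GiB of virtual address space\n", size >> 30);
            cuMemAddressFree(addr, size);
        } else {
            const char * msg = NULL;
            cuGetErrorString(res, &msg);
            printf("reserving %zu GiB failed: %s\n", size >> 30, msg ? msg : "unknown");
        }
    }
    cuCtxDestroy(ctx);
    return 0;
}

Built with something like nvcc vmm_probe.cu -lcuda -o vmm_probe, this should show the largest range cuMemAddressReserve will grant on the Orin, which is the ceiling CUDA_POOL_VMM_MAX_SIZE would need to stay under.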

Expected output (b1696)

Log start
main: build = 1696 (708e179)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1703836082
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /disk/models/baichuan2-7b-base.Q5_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,125696]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,125696]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,125696]  = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: mismatch in special tokens definition ( 1298/125696 vs 259/125696 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 125696
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 4.99 GiB (5.71 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =  337.67 MiB
llm_load_tensors: VRAM used           = 4775.16 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 256.69 MiB
llama_new_context_with_model: VRAM scratch buffer: 253.50 MiB
llama_new_context_with_model: total VRAM used: 5284.66 MiB (model: 4775.16 MiB, context: 509.50 MiB)

system_info: n_threads = 6 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0


 這是一段中文測試,其中會呼叫 JavaFX API 的各種方法。
  - `HelloWorld` 類別包含兩個成員: `main()` 和 `greeting()`。主方法 `main()` 會傳入一個具有兩個參數的傳回值: (1) 一个 `String[]`,其中包含本機執行緒的名稱,以及 (2) 一個指向本機執行緒的指针。此方法的作業是呼叫函式 `greeting(),並將結果傳給 `System.out`。
  - 當您呼叫主方法時, JavaFX 的執行階段會在執行本機執行緒之前,先執行所有內嵌執行階段的執行方法。在本例中,我們指定了一個名為 `greeting()` 的執行方法作為內嵌執行方法的起始點;這與 JavaFX API 使用內嵌執行階段處理程序的目標一樣:
    - `stage.show();`會啟動舞台。
  - `new HelloWorld(args).greeting()`是從主方法傳入本機執行緒的一個參數,並呼叫此方法的作業,然後將結果傳給 System.out。這些程序與內嵌執行階段之間的關係在圖中顯示如下:
    ![Inner Stage](images/image_inner-stage1024x359.png "Inner stage")
  - 當您呼叫 `main()` 的時候,執行階段會先呼叫內嵌執行階段的 `greeting(),並將結果傳給 System.out。這些程序與內嵌執行階段之間的關係在圖中顯示如下:
    ![Inner Stage](images/image_inner-stage1024x359.png "Inner stage") [end of text]

llama_print_timings:        load time =    2055.94 ms
llama_print_timings:      sample time =     411.77 ms /   428 runs   (    0.96 ms per token,  1039.42 tokens per second)
llama_print_timings: prompt eval time =     335.33 ms /     7 tokens (   47.90 ms per token,    20.88 tokens per second)
llama_print_timings:        eval time =   28286.10 ms /   427 runs   (   66.24 ms per token,    15.10 tokens per second)
llama_print_timings:       total time =   29544.50 ms
Log end
slaren (Collaborator) commented Dec 29, 2023

Using a smaller value for CUDA_POOL_VMM_MAX_SIZE may fix it. I cannot test on this hardware, so PRs are welcome.

hydai (Contributor, Author) commented Dec 29, 2023

Hi @slaren
Thanks for the hint. After reducing CUDA_POOL_VMM_MAX_SIZE to 32 GB, it works.
I created #4687 for this; please let me know if the PR looks good or if you need more tests on this device.
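
For anyone who needs a local workaround before #4687 is merged: the constant is defined in ggml-cuda.cu (the file named in the error above). Below is a sketch of the kind of one-line change being discussed, under the assumption that the pool cap is written as a power-of-two shift; the exact original value and the final form of the PR may differ.

// ggml-cuda.cu -- illustrative sketch only, not the merged patch from #4687.
// Assumption: the pool cap was defined as a 1ull shift larger than 32 GiB.
//
// before (assumed):
// static const size_t CUDA_POOL_VMM_MAX_SIZE = 1ull << 36; // 64 GiB
//
// after: cap the reserved virtual address range at 32 GiB, the value the
// reporter confirmed to work on the Jetson AGX Orin 64GB
static const size_t CUDA_POOL_VMM_MAX_SIZE = 1ull << 35; // 32 GiB

After editing, rebuild with the same cmake invocation as in the reproduce steps and rerun main to confirm the VMM reservation succeeds.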
