CUDA error: out of memory after b1697 #4680

Closed
hydai opened this issue Dec 29, 2023 · 2 comments · Fixed by #4687
Comments

hydai (Contributor) commented Dec 29, 2023

Summary

In b1696 everything works fine.
However, b1697 introduces the CUDA VMM pool, and since then llama.cpp always fails with an out-of-memory error on this device.

Hardware

NVIDIA Jetson AGX Orin 64GB

uname -a
Linux jetson-orin 5.10.104-tegra #1 SMP PREEMPT Tue Jan 24 15:09:44 PST 2023 aarch64 aarch64 aarch64 GNU/Linux

OS

lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.6 LTS
Release:	20.04
Codename:	focal

Steps to reproduce

  1. Build llama.cpp b1710:
git checkout b1710
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release -- -j
  2. Run llama.cpp:
./build/bin/main -m /disk/models/baichuan2-7b-base.Q5_K.gguf -n 512 -ngl 35 -p '這是一段中文測試'
  3. Observe the error message (a short note on the failing call follows the log):
Log start
main: build = 1710 (65e5f6d)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1703835560
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /disk/models/baichuan2-7b-base.Q5_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,125696]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,125696]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,125696]  = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: mismatch in special tokens definition ( 1298/125696 vs 259/125696 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 125696
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 4.99 GiB (5.71 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =  337.67 MiB
llm_load_tensors: VRAM used           = 4775.16 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 256.69 MiB
llama_new_context_with_model: VRAM scratch buffer: 253.50 MiB
llama_new_context_with_model: total VRAM used: 5284.66 MiB (model: 4775.16 MiB, context: 509.50 MiB)
CUDA error: out of memory
  current device: 0, in function ggml_cuda_pool_malloc_vmm at /home/hydai/workspace/llama.cpp/ggml-cuda.cu:6694
  cuMemAddressReserve(&g_cuda_pool_addr[device], CUDA_POOL_VMM_MAX_SIZE, 0, 0, 0)
GGML_ASSERT: /home/hydai/workspace/llama.cpp/ggml-cuda.cu:225: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
[1]    456065 abort (core dumped)  ./build/bin/main -m /disk/models/baichuan2-7b-base.Q5_K.gguf -n 512 -ngl 35 -
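
Note on the assertion above: the VMM pool added in b1697 reserves a CUDA_POOL_VMM_MAX_SIZE-sized virtual address range up front via cuMemAddressReserve inside ggml_cuda_pool_malloc_vmm, and it is that reservation, not a physical allocation, that returns "out of memory" here. The standalone probe below is my own sketch, not code from llama.cpp, and the sizes it tries are assumptions; it simply checks which reservation sizes the driver on this device will accept.

// vmm_probe.cu -- standalone probe, not part of llama.cpp.
// Tries progressively smaller virtual address reservations with the CUDA
// driver API to see what the device will grant. Illustration only.
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS) { fprintf(stderr, "cuInit failed\n"); return 1; }
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Try 64 GiB, 32 GiB, 16 GiB, ... down to 1 GiB.
    for (size_t size = 64ull << 30; size >= 1ull << 30; size >>= 1) {
        CUdeviceptr addr = 0;
        CUresult res = cuMemAddressReserve(&addr, size, 0, 0, 0);
        if (res == CUDA_SUCCESS) {
            printf("reserved %zu GiB of virtual address space\n", size >> 30);
            cuMemAddressFree(addr, size);
        } else {
            const char * msg = NULL;
            cuGetErrorString(res, &msg);
            printf("reserving %zu GiB failed: %s\n", size >> 30, msg ? msg : "unknown");
        }
    }
    cuCtxDestroy(ctx);
    return 0;
}

Built with something like nvcc vmm_probe.cu -lcuda -o vmm_probe, this should show the largest range cuMemAddressReserve will grant on the Orin, which is the ceiling CUDA_POOL_VMM_MAX_SIZE would need to stay under.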

Expected output (b1696)

Log start
main: build = 1696 (708e179)
main: built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
main: seed  = 1703836082
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /disk/models/baichuan2-7b-base.Q5_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 17
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,125696]  = ["<unk>", "<s>", "</s>", "<SEP>", "<C...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,125696]  = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,125696]  = [2, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q5_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: mismatch in special tokens definition ( 1298/125696 vs 259/125696 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 125696
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 7.51 B
llm_load_print_meta: model size       = 4.99 GiB (5.71 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1099 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.11 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  =  337.67 MiB
llm_load_tensors: VRAM used           = 4775.16 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 256.69 MiB
llama_new_context_with_model: VRAM scratch buffer: 253.50 MiB
llama_new_context_with_model: total VRAM used: 5284.66 MiB (model: 4775.16 MiB, context: 509.50 MiB)

system_info: n_threads = 6 / 12 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
sampling:
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 512, n_keep = 0


 這是一段中文測試,其中會呼叫 JavaFX API 的各種方法。
  - `HelloWorld` 類別包含兩個成員: `main()` 和 `greeting()`。主方法 `main()` 會傳入一個具有兩個參數的傳回值: (1) 一个 `String[]`,其中包含本機執行緒的名稱,以及 (2) 一個指向本機執行緒的指针。此方法的作業是呼叫函式 `greeting(),並將結果傳給 `System.out`。
  - 當您呼叫主方法時, JavaFX 的執行階段會在執行本機執行緒之前,先執行所有內嵌執行階段的執行方法。在本例中,我們指定了一個名為 `greeting()` 的執行方法作為內嵌執行方法的起始點;這與 JavaFX API 使用內嵌執行階段處理程序的目標一樣:
    - `stage.show();`會啟動舞台。
  - `new HelloWorld(args).greeting()`是從主方法傳入本機執行緒的一個參數,並呼叫此方法的作業,然後將結果傳給 System.out。這些程序與內嵌執行階段之間的關係在圖中顯示如下:
    ![Inner Stage](images/image_inner-stage1024x359.png "Inner stage")
  - 當您呼叫 `main()` 的時候,執行階段會先呼叫內嵌執行階段的 `greeting(),並將結果傳給 System.out。這些程序與內嵌執行階段之間的關係在圖中顯示如下:
    ![Inner Stage](images/image_inner-stage1024x359.png "Inner stage") [end of text]

llama_print_timings:        load time =    2055.94 ms
llama_print_timings:      sample time =     411.77 ms /   428 runs   (    0.96 ms per token,  1039.42 tokens per second)
llama_print_timings: prompt eval time =     335.33 ms /     7 tokens (   47.90 ms per token,    20.88 tokens per second)
llama_print_timings:        eval time =   28286.10 ms /   427 runs   (   66.24 ms per token,    15.10 tokens per second)
llama_print_timings:       total time =   29544.50 ms
Log end
slaren (Collaborator) commented Dec 29, 2023

Using a smaller value for CUDA_POOL_VMM_MAX_SIZE may fix it. I cannot test on this hardware, so PRs are welcome.

hydai (Contributor, Author) commented Dec 29, 2023

Hi @slaren
Thanks for the hint. After reducing CUDA_POOL_VMM_MAX_SIZE to 32 GB, it works.
I created #4687 for this; please let me know if the PR looks good or if you need more tests on this device.
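
For anyone who needs a local workaround before #4687 is merged: the constant is defined in ggml-cuda.cu (the file named in the error above). Below is a sketch of the kind of one-line change being discussed, under the assumption that the pool cap is written as a power-of-two shift; the exact original value and the final form of the PR may differ.

// ggml-cuda.cu -- illustrative sketch only, not the merged patch from #4687.
// Assumption: the pool cap was defined as a 1ull shift larger than 32 GiB.
//
// before (assumed):
// static const size_t CUDA_POOL_VMM_MAX_SIZE = 1ull << 36; // 64 GiB
//
// after: cap the reserved virtual address range at 32 GiB, the value the
// reporter confirmed to work on the Jetson AGX Orin 64GB
static const size_t CUDA_POOL_VMM_MAX_SIZE = 1ull << 35; // 32 GiB

After editing, rebuild with the same cmake invocation as in the reproduce steps and rerun main to confirm the VMM reservation succeeds.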
