Prerequisites
Please answer the following questions for yourself before submitting an issue.
All yes. In particular, I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
I am loading AquilaChat2-34B-16K-Q4_0.gguf with llama.cpp, and I expect to be able to have a conversation with the model.
Current Behavior
With the command below, llama.cpp had no problem loading ggml-model-q4_0.gguf, but loading AquilaChat2-34B-16K-Q4_0.gguf ends with:
CUDA error 9 at ggml-cuda.cu:6863: invalid configuration argument
current device: 0
command:
./main -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000
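For reference, CUDA runtime error 9 is cudaErrorInvalidConfiguration: the kernel launch was given a configuration (grid/block dimensions or shared memory size) that the device cannot satisfy. A minimal standalone sketch (not llama.cpp code) that reproduces the same message:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noop() {}

int main() {
    // Grid y/z dimensions are capped at 65535 on current NVIDIA GPUs;
    // exceeding the cap makes the launch fail before the kernel runs.
    dim3 grid(1, 70000, 1);  // invalid: grid.y > 65535
    dim3 block(32, 1, 1);
    noop<<<grid, block>>>();

    cudaError_t err = cudaGetLastError();
    // Prints: "error 9: invalid configuration argument"
    printf("error %d: %s\n", (int)err, cudaGetErrorString(err));
    return 0;
}
```

If that is what is happening here, the launch at ggml-cuda.cu:6863 is presumably being handed dimensions outside the device limits for this model's tensor shapes.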
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7F72 24-Core Processor
CPU family: 23
Model: 49
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
Stepping: 0
Frequency boost: enabled
CPU max MHz: 3200.0000
CPU min MHz: 2500.0000
BogoMIPS: 6400.16
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 1.5 MiB (48 instances)
L1i: 1.5 MiB (48 instances)
L2: 24 MiB (48 instances)
L3: 384 MiB (24 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Mitigation; untrained return thunk; SMT enabled with STIBP protection
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
$ uname -a
Linux ps 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
$ python3 --version
Python 3.10.13
$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Failure Information (for bugs)
CUDA error 9 at ggml-cuda.cu:6863: invalid configuration argument
current device: 0
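The error fires on device 0. A quick way to see the launch limits that device actually enforces is to query its properties; this is a diagnostic sketch using only the CUDA runtime API, not part of llama.cpp:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, /*device=*/0);  // device 0, as in the log
    printf("%s (compute %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dim: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dim:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

Any kernel launch whose block or grid exceeds these numbers fails with exactly "invalid configuration argument".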
Steps to Reproduce
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUBLAS=1
./main -m /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf --color \
  --ctx_size 2048 -n -1 -ins -b 256 --top_k 10000 \
  --temp 0.2 --repeat_penalty 1.1 -t 8 -ngl 10000
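The two-line failure message has the shape of an error-check wrapper around CUDA calls. A simplified sketch of such a wrapper (the actual CUDA_CHECK in ggml-cuda.cu may differ in detail) shows where "CUDA error 9 at ggml-cuda.cu:6863" and "current device: 0" would come from:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified sketch of an error-check wrapper; it prints the same shape of
// message seen in the log: "CUDA error <n> at <file>:<line>: <msg>".
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            int dev_ = -1;                                        \
            cudaGetDevice(&dev_);                                 \
            fprintf(stderr, "CUDA error %d at %s:%d: %s\n",       \
                    (int)err_, __FILE__, __LINE__,                \
                    cudaGetErrorString(err_));                    \
            fprintf(stderr, "current device: %d\n", dev_);        \
            exit(1);                                              \
        }                                                         \
    } while (0)

int main() {
    CUDA_CHECK(cudaSetDevice(0));  // any runtime call can be wrapped
    return 0;
}
```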
Failure Logs
Full log of the failing run:
Log start
main: build = 1428 (6961c4b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1698634798
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA RTX A6000, compute capability 8.6
llama_model_loader: loaded meta data with 21 key-value pairs and 543 tensors from /home/ps/app/edison/Aquila2-main/checkpoints/AquilaChat2-34B-16K-Q4_0/AquilaChat2-34B-16K-Q4_0.gguf (version GGUF V2 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 6144, 100008, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_0 [ 6144, 6144, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_0 [ 6144, 1024, 1, 1 ]
... ...
llama_model_loader: - tensor 540: blk.59.ffn_norm.weight f32 [ 6144, 1, 1, 1 ]
llama_model_loader: - tensor 541: output_norm.weight f32 [ 6144, 1, 1, 1 ]
llama_model_loader: - tensor 542: output.weight q6_K [ 6144, 100008, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
... ...
tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 20: general.quantization_version u32
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 421 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: mismatch in special tokens definition ( 9/100008 vs 8/100008 ).
llm_load_print_meta: format = GGUF V2 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 100008
llm_load_print_meta: n_merges = 99743
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 6144
llm_load_print_meta: n_head = 48
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 60
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 6
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 24576
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 0.25
llm_load_print_meta: model type = 30B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 33.69 B
llm_load_print_meta: model size = 17.80 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 100006 '[CLS]'
llm_load_print_meta: EOS token = 100007 ''
llm_load_print_meta: UNK token = 0 '<|endoftext|>'
llm_load_print_meta: LF token = 129 'Ä'
llm_load_tensors: ggml ctx size = 0.18 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required = 329.80 MB
llm_load_tensors: offloading 60 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 63/63 layers to GPU
llm_load_tensors: VRAM used: 17898.53 MB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.25
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 480.00 MB
llama_new_context_with_model: kv self size = 480.00 MB
llama_new_context_with_model: compute buffer total size = 122.13 MB
llama_new_context_with_model: VRAM scratch buffer: 116.00 MB
llama_new_context_with_model: total VRAM used: 18494.53 MB (model: 17898.53 MB, context: 596.00 MB)
system_info: n_threads = 8 / 96 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 10000, tfs_z = 1.000, top_p = 0.950, typical_p = 1.000, temp = 0.200
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
generate: n_ctx = 2048, n_batch = 256, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
CUDA error 9 at ggml-cuda.cu:6863: invalid configuration argument
current device: 0