llama.cpp main hangs at prompt with latest mmap updates #669

Closed
vikrantrathore opened this issue Apr 1, 2023 · 2 comments

@vikrantrathore

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

After upgrading to the latest code, recompiling, and then running inference with main, the following prompt should return results as before:

./main -m models/13B/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.0 --color  -p "What is controlled delivery?"
main: seed = 1680321331
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: model size =  7759.39 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 512, n_keep = 0



 What is controlled delivery?
Controlled Delivery (CD) refers to the process of delivering content or messages in specific locations, such as on college campuses and at events like concerts where there are large crowds present. It can also refer specifically to a method for distributing print materials that is targeted towards particular audiences based upon their demographic characteristics (e.g., gender, age range).

Current Behavior

llama.cpp main just hangs without producing any output, showing only the prompt:


./main -m models/13B/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.0 --color  -p "What is controlled delivery?"
main: seed = 1680321575
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: model size =  7759.39 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 512, n_keep = 0


 What is controlled delivery?

Environment and Context

  • Linux Ubuntu 22.04
  • Nvidia GPU 3070 12GB.
  • RAM 128 GB
  • NVMe storage

CPU details (lscpu output, truncated):

Stepping: 0
Frequency boost: enabled
CPU max MHz: 5083.3979
CPU min MHz: 2200.0000
BogoMIPS: 6800.50
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 8 MiB (16 instances)
L3: 64 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected

Linux ai-llm-dev 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.11.2

GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@vikrantrathore vikrantrathore changed the title [User] Insert summary of your issue or enhancement.. llama.cpp main hangs at prompt with latest mmap updates Apr 1, 2023
@vikrantrathore
Author

The problem is fixed by setting the number of threads explicitly (-t 16 instead of the default n_threads = 32 shown in system_info); it then begins to respond after a short delay:

$ ./main -m models/13B/ggml-model-q4_0.bin -t 16 -n 512 --repeat_penalty 1.0 --color -p "What is controlled delivery?"
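How to pick the value is not something I have seen documented; as a rough heuristic (my own assumption, not an official llama.cpp recommendation), starting from the number of physical cores rather than all SMT hardware threads seems reasonable. This machine appears to have 16 physical cores behind 32 hardware threads (see the cache/NUMA details above), hence -t 16. The counts can be checked with:

$ nproc                                   # logical CPUs, including SMT threads -> 32 here
$ lscpu | grep -E 'Thread|Core|Socket'    # Thread(s) per core, Core(s) per socket, Socket(s)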

@chrissound

How do you know how many threads are required to prevent it hanging?
