llama.cpp main hangs at prompt with latest mmap updates #669

Closed
vikrantrathore opened this issue Apr 1, 2023 · 2 comments

@vikrantrathore

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

After upgrading to the latest code, recompiling, and then running inference with main, the following prompt should return results as before:

./main -m models/13B/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.0 --color  -p "What is controlled delivery?"
main: seed = 1680321331
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: model size =  7759.39 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 512, n_keep = 0



 What is controlled delivery?
Controlled Delivery (CD) refers to the process of delivering content or messages in specific locations, such as on college campuses and at events like concerts where there are large crowds present. It can also refer specifically to a method for distributing print materials that is targeted towards particular audiences based upon their demographic characteristics (e.g., gender, age range).

Current Behavior

llama.cpp main just hangs without producing any output, showing only the prompt:


./main -m models/13B/ggml-model-q4_0.bin -n 512 --repeat_penalty 1.0 --color  -p "What is controlled delivery?"
main: seed = 1680321575
llama_model_load: loading model from 'models/13B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.83 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)
llama_model_load: loading tensors from 'models/13B/ggml-model-q4_0.bin'
llama_model_load: model size =  7759.39 MB / num tensors = 363
llama_init_from_file: kv self size  =  400.00 MB

system_info: n_threads = 32 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 8, n_predict = 512, n_keep = 0


 What is controlled delivery?

Environment and Context

  • Linux Ubuntu 22.04
  • Nvidia GPU 3070 12GB.
  • RAM 128 GB
  • NVMe storage

CPU details (lscpu output, truncated):

Stepping: 0
Frequency boost: enabled
CPU max MHz: 5083.3979
CPU min MHz: 2200.0000
BogoMIPS: 6800.50
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 8 MiB (16 instances)
L3: 64 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected

Linux ai-llm-dev 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.11.2

GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@vikrantrathore vikrantrathore changed the title [User] Insert summary of your issue or enhancement.. llama.cpp main hangs at prompt with latest mmap updates Apr 1, 2023
@vikrantrathore
Author

The problem is fixed by setting the number of threads explicitly (-t 16 instead of the default n_threads = 32 shown in system_info); it then begins to respond after a short delay:

$ ./main -m models/13B/ggml-model-q4_0.bin -t 16 -n 512 --repeat_penalty 1.0 --color -p "What is controlled delivery?"
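How to pick the value is not something I have seen documented; as a rough heuristic (my own assumption, not an official llama.cpp recommendation), starting from the number of physical cores rather than all SMT hardware threads seems reasonable. This machine appears to have 16 physical cores behind 32 hardware threads (see the cache/NUMA details above), hence -t 16. The counts can be checked with:

$ nproc                                   # logical CPUs, including SMT threads -> 32 here
$ lscpu | grep -E 'Thread|Core|Socket'    # Thread(s) per core, Core(s) per socket, Socket(s)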

@chrissound

How do you know how many threads are required to prevent it hanging?
