CLBlast fails on context lengths above 2048 after merging #4256 #4296
Comments
Does it work with this patch:
diff --git a/llama.cpp b/llama.cpp
index fd905ade..69c45c3f 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -3813,7 +3813,7 @@ static struct ggml_tensor * llm_build_kqv(
     struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
     cb(kq, "kq", il);
-    if (max_alibi_bias > 0.0f) {
+    if (true) {
         // temporary branch until we figure out how to handle ggml_alibi through ggml_add
         kq = ggml_scale(ctx, kq, kq_scale);
         cb(kq, "kq_scaled", il); |
Nope, unfortunately this did not fix the issue, it still segfaults around the same point. |
Hm, I don't see what could have affected the OpenCL backend in that change. |
There's no stack trace. In fact, there's no printout whatsoever; the program simply halts. I tried it again with 0 layers offloaded and it happens there too, crashing at the same place. CUDA is fine, however. Here's a video of it: demo.mp4. The test text file I used for input is the first 5 sections of the GPL license. I am able to repro this consistently, as it crashes at the same place every time. Reducing the prompt to a shorter one allows it to work. |
Can confirm this happens for me too. Same command and prompt as @LostRuins. Hardware is RTX 2070S and Intel i7-8700, and I'm using Linux 6.5.9. Happens with
Different error followed by a segfault with
|
And |
I tested more, and I get a coredump with lower |
Reverting this specific commit: |
The |
I'm able to reproduce - looking into it |
I built with ASan, here's the error traceback I get when running with the command: ./main -m ~/models/openhermes-2-mistral-7b.Q6_K.gguf -c 4096 -b 512 -n 32 -ngl 99 -f test.txt Error:
Reverted |
@AlpinDale When running with ASAN, you need to add this env variable:
Doing that, I now get the following sanitizer errors, confirming a bug in
system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1,100, frequency_penalty = 0,000, presence_penalty = 0,000
top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
generate: n_ctx = 4096, n_batch = 512, n_predict = 32, n_keep = 0
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright © 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for software and other kinds of works.
The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.
When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions.
Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and modification follow.
TERMS AND CONDITIONS
0. Definitions.
“This License” refers to version 3 of the GNU General Public License.
“Copyright” also means copyright-like laws that apply to other kinds of works, such as semiconductor masks.
“The Program” refers to any copyrightable work licensed under this License. Each licensee is addressed as “you”. “Licensees” and “recipients” may be individuals or organizations.
To “modify” a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a “modified version” of the earlier work or a work “based on” the earlier work.
A “covered work” means either the unmodified Program or a work based on the Program.
To “propagate” a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes
=================================================================
==364805==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62d000fc6580 at pc 0x5620bf18802b bp 0x7fe2cf3f2840 sp 0x7fe2cf3f2830
WRITE of size 4 at 0x62d000fc6580 thread T28
#0 0x5620bf18802a in ggml_vec_cpy_f32 /home/ggerganov/development/github/llama.cpp/ggml.c:1158
#1 0x5620bf22385d in ggml_compute_forward_soft_max_f32 /home/ggerganov/development/github/llama.cpp/ggml.c:10614
#2 0x5620bf2244aa in ggml_compute_forward_soft_max /home/ggerganov/development/github/llama.cpp/ggml.c:10668
#3 0x5620bf25fbbe in ggml_compute_forward /home/ggerganov/development/github/llama.cpp/ggml.c:13905
#4 0x5620bf27e361 in ggml_graph_compute_thread /home/ggerganov/development/github/llama.cpp/ggml.c:15860
#5 0x7fe42b494ac2 in start_thread nptl/pthread_create.c:442
#6 0x7fe42b526a3f (/lib/x86_64-linux-gnu/libc.so.6+0x126a3f)
0x62d000fc6580 is located 0 bytes to the right of 33152-byte region [0x62d000fbe400,0x62d000fc6580)
allocated by thread T0 here:
#0 0x7fe42ccb61e7 in operator new(unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:99
#1 0x5620bf14270a in __gnu_cxx::new_allocator<unsigned char>::allocate(unsigned long, void const*) /usr/include/c++/11/ext/new_allocator.h:127
#2 0x5620bf11ee72 in std::allocator_traits<std::allocator<unsigned char> >::allocate(std::allocator<unsigned char>&, unsigned long) /usr/include/c++/11/bits/alloc_traits.h:464
#3 0x5620bf0ea3eb in std::_Vector_base<unsigned char, std::allocator<unsigned char> >::_M_allocate(unsigned long) /usr/include/c++/11/bits/stl_vector.h:346
#4 0x5620bf0a3ffb in std::vector<unsigned char, std::allocator<unsigned char> >::_M_default_append(unsigned long) /usr/include/c++/11/bits/vector.tcc:635
#5 0x5620bf06d1ab in std::vector<unsigned char, std::allocator<unsigned char> >::resize(unsigned long) /usr/include/c++/11/bits/stl_vector.h:940
#6 0x5620bef398d0 in ggml_graph_compute_helper /home/ggerganov/development/github/llama.cpp/llama.cpp:668
#7 0x5620bef8f6b2 in llama_decode_internal /home/ggerganov/development/github/llama.cpp/llama.cpp:5577
#8 0x5620befc9a09 in llama_decode /home/ggerganov/development/github/llama.cpp/llama.cpp:9462
#9 0x5620bedd4eb5 in llama_init_from_gpt_params(gpt_params&) /home/ggerganov/development/github/llama.cpp/common/common.cpp:996
#10 0x5620bed77fc5 in main /home/ggerganov/development/github/llama.cpp/examples/main/main.cpp:187
#11 0x7fe42b429d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
Thread T28 created by T0 here:
#0 0x7fe42cc58685 in __interceptor_pthread_create ../../../../src/libsanitizer/asan/asan_interceptors.cpp:216
#1 0x5620bf282b56 in ggml_graph_compute /home/ggerganov/development/github/llama.cpp/ggml.c:16094
#2 0x5620bef3994f in ggml_graph_compute_helper /home/ggerganov/development/github/llama.cpp/llama.cpp:672
#3 0x5620bef8f6b2 in llama_decode_internal /home/ggerganov/development/github/llama.cpp/llama.cpp:5577
#4 0x5620befc9a09 in llama_decode /home/ggerganov/development/github/llama.cpp/llama.cpp:9462
#5 0x5620bed8b2fa in main /home/ggerganov/development/github/llama.cpp/examples/main/main.cpp:605
#6 0x7fe42b429d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
SUMMARY: AddressSanitizer: heap-buffer-overflow /home/ggerganov/development/github/llama.cpp/ggml.c:1158 in ggml_vec_cpy_f32
Shadow bytes around the buggy address:
0x0c5a801f0c60: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5a801f0c70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5a801f0c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5a801f0c90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0c5a801f0ca0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
=>0x0c5a801f0cb0:[fa]fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5a801f0cc0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5a801f0cd0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5a801f0ce0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5a801f0cf0: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
0x0c5a801f0d00: fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa fa
Shadow byte legend (one shadow byte represents 8 application bytes):
Addressable: 00
Partially addressable: 01 02 03 04 05 06 07
Heap left redzone: fa
Freed heap region: fd
Stack left redzone: f1
Stack mid redzone: f2
Stack right redzone: f3
Stack after return: f5
Stack use after scope: f8
Global redzone: f9
Global init order: f6
Poisoned by user: f7
Container overflow: fc
Array cookie: ac
Intra object redzone: bb
ASan internal: fe
Left alloca redzone: ca
Right alloca redzone: cb
Shadow gap: cc
==364805==ABORTING |
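The report above gives enough to check the overflow by hand: the faulting address and the bounds of the allocated region. As a small sketch (the function names `offset_in_region` and `bytes_past_end` are illustrative only and are not part of ggml or ASan), the arithmetic looks like this:

```c
#include <assert.h>
#include <stdint.h>

/* Byte distance from the start of an allocation to a faulting address.
 * Hypothetical helper for reading an ASan report; not a ggml or ASan API. */
uint64_t offset_in_region(uint64_t region_start, uint64_t fault_addr) {
    assert(fault_addr >= region_start);
    return fault_addr - region_start;
}

/* Number of bytes by which the access [fault_addr, fault_addr + access_size)
 * extends past region_end, where region_end is the exclusive upper bound
 * as printed by ASan in "[start,end)". Returns 0 if the access fits. */
uint64_t bytes_past_end(uint64_t region_end, uint64_t fault_addr, uint64_t access_size) {
    uint64_t access_end = fault_addr + access_size;
    return access_end > region_end ? access_end - region_end : 0;
}
```

With the addresses from the report, `offset_in_region(0x62d000fbe400, 0x62d000fc6580)` gives 33152, which is the full size of the region, and `bytes_past_end(0x62d000fc6580, 0x62d000fc6580, 4)` gives 4. In other words, the 4-byte write starts at the first byte after the buffer, which is what "0 bytes to the right" means.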
Please confirm that #4307 works |
Sorry I couldn't help more with the debugging. Anyway #4307 seems to work for me. The segfault no longer occurs. |
No problem - thank you very much for reporting this issue |
Inference with CLBlast fails with a segfault after the commit that merged #4256 on context sizes above 2k when all GPU layers are offloaded.
Command line:
C:\test\llama-b1601-bin-win-clblast-x64>main.exe -m E:\LLaMA\models\airoboros-mistral2.2-7b.Q4_K_S.gguf -c 4096 -b 512 -n 32 -ngl 33 -f C:\test\test.txt
Result:
Prompt processing starts, then segfaults around the 2k token mark, before generation begins. It appears to work only if the prompt is short enough (fewer than 2k tokens).
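The ASan report in the comments traces the crash to a write just past a work buffer that was sized in `ggml_graph_compute_helper` and written from a soft max worker thread. The sketch below illustrates one general way this class of bug can arise: the buffer is planned with one formula while the worker threads index it with another. It is a hypothetical model, not the actual ggml code. `PAD_FLOATS` and all function names are assumptions, and the sketch does not attempt to explain why the crash only appears for long prompts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread padding, in floats (e.g. one 64-byte cache line). */
#define PAD_FLOATS 16

/* Bytes reserved by a planner that assumes n floats per thread, no padding. */
size_t planned_bytes(size_t n, size_t n_threads) {
    return n_threads * n * sizeof(float);
}

/* Bytes actually touched if thread ith writes n floats starting at
 * float index (n + PAD_FLOATS) * ith. The last thread sets the bound. */
size_t required_bytes(size_t n, size_t n_threads) {
    assert(n_threads > 0);
    size_t last_start = (n + PAD_FLOATS) * (n_threads - 1);
    return (last_start + n) * sizeof(float);
}

/* Returns 1 when the last thread's slice runs past the planned buffer. */
int overflows(size_t n, size_t n_threads) {
    return required_bytes(n, n_threads) > planned_bytes(n, n_threads);
}
```

Under these assumptions, 8 threads with rows of 1024 floats plan 32768 bytes but touch 33216, so the highest-numbered threads write past the end. With a single thread the two formulas agree and nothing overflows. Sizing the buffer with the same expression the workers use to index it removes the mismatch.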