Error: inlining failed in call to ‘always_inline’ ‘_mm256_cvtph_ps’ on x86_64 - better support for different x86_64 CPU instruction extensions #196
Comments
Looks like a duplicate of #107. Can you please confirm you're running on native x86_64 and not emulated?
Yes, not in a virtual environment such as docker.
If it is Arch, I'm guessing you're using a very recent
I tried to execute
The gcc version is as follows:
I'm not sure if this is specific to Arch Linux.
You need to add
Are the
I am using the provided Makefile, which sets those flags for you: https://github.com/ggerganov/llama.cpp/blob/master/Makefile#L92
I have the same issue using the provided Makefile. Ubuntu 22.04 LTS, gcc 11.3.0, Xeon E5-2690.
I believe that CPU supports only AVX, not AVX2. edit: this is wrong, see below
No, when I execute
Try running:
If it doesn’t print
My CPU does not support AVX2, but it can run normally with the above method.
I understand, but my CPU does not support F16C, only AVX.
We should set defines for each feature flag and decide which code to use inside ggml.c at a more granular level.
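The granular approach could look roughly like the sketch below. GCC and Clang predefine `__AVX__`, `__F16C__`, etc. to match the `-m` flags, so ggml.c could select a code path per feature instead of assuming AVX implies F16C. The function name and macro layout here are illustrative only, not ggml's actual structure:

```c
#include <string.h>

// Hypothetical per-feature dispatch: pick the FP16 conversion strategy
// from the feature macros the compiler actually enabled, rather than
// assuming AVX implies F16C. (Illustrative; ggml.c is organized differently.)
static const char *fp16_conversion_path(void) {
#if defined(__F16C__)
    return "f16c";        // hardware FP16<->FP32 conversion intrinsics
#elif defined(__AVX__)
    return "avx+scalar";  // AVX math, but a scalar FP16 conversion fallback
#else
    return "scalar";      // fully generic path
#endif
}
```

Compiling the same file with `-mavx` alone versus `-mavx -mf16c` would then switch paths automatically, which is exactly the AVX-without-F16C case hit in this issue.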
I made a patch, and make now completes normally.
I have this reported issue on my CPU. Apparently it has AVX, but no F16C (and no AVX2); it is a quite old, 10-15 year old Intel laptop CPU. Probably some old CPUs have AVX while having no F16C. I had this compilation issue on Windows with the latest Clang 16 when using the provided
My compilation was fixed and the program worked (although not very fast) after I implemented these conversion functions myself and placed the following code inside
If some C/C++ gurus know a faster implementation of this function for AVX, please share it here. For now, I suggest that some volunteer put the fix above into the main branch, if the code above is alright.
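For readers without the original patch, a scalar half-to-float conversion of this kind can be sketched as below. This is a minimal illustration of the technique (bit-level IEEE binary16 to binary32 conversion) that could stand in for `_mm256_cvtph_ps` on AVX-only CPUs; the function name is hypothetical, not the identifier used in the actual patch:

```c
#include <stdint.h>
#include <string.h>

// Scalar IEEE half (binary16) -> float (binary32) conversion sketch,
// handling normals, subnormals, zeros, and inf/NaN.
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h >> 15) << 31;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    uint32_t bits;

    if (exp == 0x1F) {
        bits = sign | 0x7F800000u | (mant << 13);          // inf or NaN
    } else if (exp != 0) {
        bits = sign | ((exp + 112u) << 23) | (mant << 13); // normal: rebias 15 -> 127
    } else if (mant == 0) {
        bits = sign;                                       // signed zero
    } else {
        uint32_t e = 113;                                  // subnormal: renormalize
        while (!(mant & 0x400)) { mant <<= 1; e--; }
        bits = sign | (e << 23) | ((mant & 0x3FF) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

A generic replacement for `_mm256_cvtph_ps` would simply apply this to each of the 8 lanes; the SIMD intrinsic does the same conversion in hardware.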
It would be great if @xiliuya and @polkovnikov could work together to create a pull request with your patches so we can support a wider range of CPUs.
Yes, it works. But it doesn't perform as well on my machine as not using it.
After adding the code:
@xiliuya You have two different answers for three reasons:
Please try a more descriptive query; for example, instead of
I am testing and there is no problem with the generated output. The problem is that generation speed has slowed down since AVX was turned on. Does turning on AVX really make the output better?
@xiliuya Turning on AVX changes only the speed of computation, not the quality of the answer. I don't know why the AVX version is 1.5 times slower in your case. Maybe AVX is not used that much, while F16C is used more, and in my code I emulate F16C through generic algorithms, which are slow. It could be that my generic version is somehow slower than the generic version in the non-AVX code, which would explain the slowdown.
There is nothing wrong with your code.
@xiliuya The problem with your last patch is that it completely removes the use of AVX or any other SIMD: if you don't define the GGML_SIMD macro, only generic non-SIMD code is used everywhere. In my code I only implement two functions as generic code, while the rest of the AVX path is still used. So if I were to compare my patch and yours, I would choose mine, as it should be faster. But other experts may disagree; we need more ideas here about ways to improve.
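The two functions mentioned are the FP16/FP32 conversions in both directions. For completeness, the reverse (float-to-half) side, which a generic replacement for `_mm256_cvtps_ph` also needs, can be sketched as below. This version truncates the mantissa and flushes subnormals to zero for brevity (real F16C rounds to nearest); the function name is illustrative, not from the actual patch:

```c
#include <stdint.h>
#include <string.h>

// Scalar float (binary32) -> IEEE half (binary16) conversion sketch.
// Truncating and subnormal-flushing for brevity; NaN maps to inf here.
static uint16_t fp32_to_fp16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint16_t sign = (uint16_t)((bits >> 16) & 0x8000);
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; // rebias 127 -> 15
    uint32_t mant = bits & 0x7FFFFF;

    if (exp >= 0x1F) return sign | 0x7C00;  // overflow (and NaN) -> inf
    if (exp <= 0)    return sign;           // underflow -> signed zero
    return sign | (uint16_t)(exp << 10) | (uint16_t)(mant >> 13);
}
```

A generic `_mm256_cvtps_ph` replacement would loop this over the 8 float lanes; the truncation versus round-to-nearest difference is one plausible source of small numeric drift between the generic and hardware paths.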
Good work, guys. I am not a C++ programmer, but I am interested in performance. I'd ideally want the most performant CPU code for any arch.
This patch allowed me to successfully run the make command.
I ran into the same error in a virtual machine. The content of my Makefile patch:

```diff
diff --git a/Makefile b/Makefile
index 98a2d85..1b0f28c 100644
--- a/Makefile
+++ b/Makefile
@@ -80,6 +80,8 @@ ifeq ($(UNAME_M),$(filter $(UNAME_M),x86_64 i686))
         CFLAGS += -mavx2
     endif
 else ifeq ($(UNAME_S),Linux)
+    CFLAGS += -mfma
+    CFLAGS += -mf16c
     AVX1_M := $(shell grep "avx " /proc/cpuinfo)
     ifneq (,$(findstring avx,$(AVX1_M)))
         CFLAGS += -mavx
```

After that, the build succeeds and running the model gives this result:

```
main: seed = 1679824080
llama_model_load: loading model from './models/7B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml ctx size = 4273.34 MB
llama_model_load: mem required = 6065.34 MB (+ 1026.00 MB per state)
llama_model_load: loading model part 1/1 from './models/7B/ggml-model-q4_0.bin'
llama_model_load: .................................... done
llama_model_load: model size = 4017.27 MB / num tensors = 291
llama_init_from_file: kv self size = 256.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

Hello there and welcome to my personal web-site. I'm a graphic designer, illustrator & photographer based in London, UK. This site is primarily used as an online portfolio showcasing my work but I also use it to blog about what I am doing in my spare time (and the occasional rant!)
I'm available for freelance projects and always open to new opportunities. Please feel free to contact me with any project proposals or enquiries via my details page, email address is there as well as a Twitter link. [end of text]

llama_print_timings: load time = 4220.08 ms
llama_print_timings: sample time = 90.68 ms / 117 runs ( 0.78 ms per run)
llama_print_timings: prompt eval time = 673.42 ms / 3 tokens ( 224.47 ms per token)
llama_print_timings: eval time = 34385.98 ms / 116 runs ( 296.43 ms per run)
llama_print_timings: total time = 40892.13 ms
```

BTW, I also noticed that CMakeLists.txt currently always enables F16C, FMA, AVX and AVX2 for x86 Linux, so I could just use the CMake build instead; that also works well for me.
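The Makefile already greps `/proc/cpuinfo` for `avx`, and that same approach could in principle cover every flag rather than unconditionally adding `-mf16c` and `-mfma`. A hedged sketch of that detection logic, run here against a hard-coded sample flag list (so the output is reproducible) instead of the real `/proc/cpuinfo`:

```shell
# Per-feature CFLAGS detection, in the spirit of the Makefile's grep.
# "flags" stands in for the flags line of /proc/cpuinfo; a real build
# would read it from the actual file.
flags="fpu vme avx f16c sse3"   # sample CPU: AVX and F16C, but no AVX2/FMA

CFLAGS=""
for f in avx avx2 f16c fma; do
    case " $flags " in
        *" $f "*) CFLAGS="$CFLAGS -m$f" ;;   # only add flags the CPU reports
    esac
done
echo "$CFLAGS"   # prints " -mavx -mf16c"
```

This would avoid the situation in this issue, where `-mavx` is passed to a CPU that lacks F16C and the `_mm256_cvtph_ps` path fails to compile or run.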
@beaclnd92 The problem with your solution is that it just enables the F16C feature. My old CPU has only AVX and no F16C, so your solution works for some CPUs but not for mine.
Yeah, it may only work for a guest virtual machine backed by a physical CPU with the required SIMD features. Generally it gives better performance than not enabling the features.
Hi, I have the same inlining problem when using -mavx, running Linux Mint on an i7-2630QM with 16GB RAM (a pretty old, 13-year-old laptop), and the problem is I'm not able to get AVX to be used, even though I know the CPU supports it. I tried @polkovnikov's patch; it allows make to complete, but still no AVX, and the prompt reply is really slow. Any idea?
Fixed in #563
@slaren @ggerganov I found a BAD mistake in #563 and posted a comment there. If you have the access/desire to modify the code, please fix the issue I commented on.
I can confirm I now have AVX1 activated. However, there is no apparent improvement in performance, also with the fix suggested by @polkovnikov. Probably AVX1 is not sufficient for good performance. I'm on an i7-2630QM with 16GB RAM, clearly not a quantum computer, but honestly I hoped for better. It takes minutes to complete one line of reply. Thank you for your work, btw.
When I compile with make, the following error occurs. The error is reported when executing:

```
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -mavx -msse3 -c ggml.c -o ggml.o
```

but does not occur when executing:

```
cc -I. -O3 -DNDEBUG -std=c11 -fPIC -pthread -msse3 -c ggml.c -o ggml.o
```

Must `-mavx` be used with `-mf16c`?

OS: Arch Linux x86_64
Kernel: 6.1.18-1-lts