Add AVX acceleration #617
Conversation
That is a very significant increase!
For the vector dot-product, have you looked at ...? In the inner loop, after subtracting 8:
```c
    // Get absolute values of x vectors
    const __m128i ax = _mm_sign_epi8(bx, bx);
    // Sign the values of the y vectors
    const __m128i sy = _mm_sign_epi8(by, bx);
    // Perform multiplication and create 16-bit values
    const __m128i dot = _mm_maddubs_epi16(ax, sy);
    const __m128i ones = _mm_set1_epi16(1);
    i32[j] = _mm_madd_epi16(ones, dot);
}
```

Beware: this is completely untested.
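For readers who want to see the trick end to end, here is a minimal self-contained sketch (not from the PR; the data and variable names are made up) that computes a signed 8-bit dot product with the same `_mm_sign_epi8` / `_mm_maddubs_epi16` / `_mm_madd_epi16` sequence and checks it against a scalar loop:

```c
// Build with e.g. `gcc -mssse3 demo.c` (illustrative sketch, not ggml.c code).
#include <stdio.h>
#include <stdint.h>
#include <tmmintrin.h> // SSSE3 intrinsics (_mm_sign_epi8, _mm_maddubs_epi16)

int main(void) {
    int8_t x[16], y[16];
    for (int i = 0; i < 16; ++i) { x[i] = (int8_t)(i - 8); y[i] = (int8_t)(3 - i); }

    const __m128i bx = _mm_loadu_si128((const __m128i *)x);
    const __m128i by = _mm_loadu_si128((const __m128i *)y);

    const __m128i ax  = _mm_sign_epi8(bx, bx);      // |x|: a valid unsigned operand
    const __m128i sy  = _mm_sign_epi8(by, bx);      // move x's sign onto y, so ax*sy == x*y
    const __m128i dot = _mm_maddubs_epi16(ax, sy);  // 8 pairwise sums of products (16-bit)
    const __m128i one = _mm_set1_epi16(1);
    const __m128i s32 = _mm_madd_epi16(one, dot);   // widen to 4 x 32-bit partial sums

    int32_t p[4];
    _mm_storeu_si128((__m128i *)p, s32);
    int simd = p[0] + p[1] + p[2] + p[3];

    int ref = 0;
    for (int i = 0; i < 16; ++i) ref += x[i] * y[i];

    printf("simd = %d, scalar = %d\n", simd, ref); // the two results should match
    return 0;
}
```

The reason for the sign shuffle is that `_mm_maddubs_epi16` treats its first operand as unsigned bytes; moving x's sign onto y keeps every product equal to x*y while the first operand stays non-negative (the values after subtracting 8 stay well away from -128, so taking the absolute value is safe).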
Great job! In my case I am using alpaca.cpp, so I just modified ggml.c according to @perserk's modifications; I don't know what difference that might make. I saw an improvement, although not as dramatic as the one the OP had. I'm on an i7-2630QM with 16 GB RAM. I hope I implemented it correctly. When I have time I'll try this on llama.cpp.
@RiccaDS why would you use alpaca.cpp? AFAIK llama.cpp has all of alpaca.cpp's features and more.
@Green-Sky Merely a matter of first approach. I heard about Alpaca and LLaMA a few days ago. From my few readings I understood that Alpaca was more ChatGPT-like, whilst LLaMA was more like the auto-completion core on which Alpaca is based. But I might have understood it incorrectly; my knowledge of these models is pretty scarce, let alone of AI in general. I will try out llama.cpp too, thanks.
@sw Thanks for the link and the patch. I changed the code and it got a bit faster, at least not slower.
It's looking good, almost too good ;-) because your AVX acceleration is almost as fast as the AVX2 we currently have. What I mean by that: if I have an AVX2-capable CPU and disable the explicit AVX2 code path like this:

```c
#elif 0 //defined(__AVX2__)
... // don't use AVX2 explicitly
#elif defined(__AVX__)
... // but use AVX intrinsics and let the compiler do its magic
```

it runs about as fast. This means we could just have a plain AVX implementation that would work on your Xeon but also be just as fast on AVX2 processors as before. Can anyone else confirm or disprove this? This would be preferable because the code is getting quite unwieldy with all the different processor optimizations.
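For context, here is a hedged sketch of that kind of per-ISA dispatch (illustrative only, not the actual layout of ggml.c, which has more branches); flipping the first condition to 0 makes an AVX2-capable build compile the plain AVX branch, so the two paths can be timed on the same machine:

```c
#if 0 // defined(__AVX2__)   // temporarily disabled to benchmark the AVX path
    // hand-written AVX2 implementation
#elif defined(__AVX__)
    // plain AVX implementation (the one added in this PR)
#else
    // scalar fallback
#endif
```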
I think this is very possible, and it may help explain the observations in #603. In short, I believe the AVX2 code has room for improvement.
I tried adapting the same technique used here to AVX2 (src) and it performs a little better:
While it is not exactly twice as fast, there is still a significant performance advantage in using AVX2. This change reduces eval times for me just a bit, but in the perplexity calculation the difference is bigger, from 43 seconds/pass to 35 seconds/pass.
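For illustration, here is a sketch of the same sign/maddubs idea expressed with 256-bit AVX2 intrinsics (an assumed analogue of the linked change, not a copy of it; the helper name `dot_i8x32` is invented):

```c
#include <immintrin.h>

// Multiply 32 signed 8-bit pairs and reduce to 8 x 32-bit partial sums.
static inline __m256i dot_i8x32(__m256i bx, __m256i by) {
    const __m256i ax  = _mm256_sign_epi8(bx, bx);     // absolute values of x
    const __m256i sy  = _mm256_sign_epi8(by, bx);     // y with x's sign applied
    const __m256i dot = _mm256_maddubs_epi16(ax, sy); // 16 x 16-bit pairwise sums
    const __m256i one = _mm256_set1_epi16(1);
    return _mm256_madd_epi16(one, dot);               // 8 x 32-bit partial sums
}
```

The 32-bit partial sums can be accumulated across blocks and horizontally added once outside the inner loop.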
If you observe good results with this and #642, go ahead and merge both PRs. I won't have much time to look today. Btw, here is also something worth looking into: #205 (comment)
* ggml : add AVX quantize_row_q4_0()
* ggml : add AVX ggml_vec_dot_q4_0()
* ggml : refactor AVX part of ggml_vec_dot_q4_0() ggml-org#617 (comment)
My old Xeon E5-2670 doesn't have AVX2 support. So I have added AVX acceleration to quantize_row_q4_0() and ggml_vec_dot_q4_0().
Here is the result before the change:
and after: