
Add initial AVX512 support for dot product on Linux #320

Merged
merged 4 commits on Mar 21, 2023

Conversation

Contributor

@Ameobea commented Mar 20, 2023

NOTE: I am seeing different outputs when running with these changes. They seem of equal quality, but this isn't something I observed when first testing this out on alpaca.cpp.

It's possible that some rounding behavior is happening slightly differently or something like that. If this is a dealbreaker, I can try to figure out what is causing the difference and check if it's possible to get rid of it.

Changes

  • Update Makefile to detect AVX512 support and add compiler flags if it's available
  • Add an AVX512 implementation based on the existing AVX2 one, computing the dot product on one 32-value block of 4-bit quantized ints at a time
  • Perform 8-bit -> 16-bit sign extension and multiply+add on 32 values at a time instead of 16
  • Use built-in AVX512 horizontal reduce add to get sum at the end
  • Manual unrolling on inner dot product loop to reduce loop counter overhead
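For readers unfamiliar with the quantized layout, here is a scalar sketch of the quantity the vectorized inner loop computes per block, assuming the q4_0 format of the time (32 values per block, two 4-bit values per byte, biased by 8, one float scale per block); the function and variable names are mine, not the PR's:

```c
#include <stdint.h>

#define QK 32  /* quantized values per block */

/* Scalar model of one block of the quantized dot product.  The
   AVX512 path computes the same quantity with intrinsics: unpack
   32 4-bit values per operand, subtract the bias of 8, widen to
   16 bits, multiply-add adjacent pairs (_mm512_madd_epi16), then
   reduce horizontally (_mm512_reduce_add_epi32) and apply the
   per-block scales. */
static float dot_block_q4(float dx, const uint8_t *px,
                          float dy, const uint8_t *py) {
    int32_t acc = 0;
    for (int i = 0; i < QK / 2; i++) {
        /* each byte packs two 4-bit values, stored biased by 8 */
        int x0 = (px[i] & 0x0F) - 8, x1 = (px[i] >> 4) - 8;
        int y0 = (py[i] & 0x0F) - 8, y1 = (py[i] >> 4) - 8;
        acc += x0 * y0 + x1 * y1;
    }
    return dx * dy * (float)acc;  /* scales applied once per block */
}
```

For example, with every byte of both operands set to 0x99 (all 32 quantized values equal to +1 after de-biasing) and unit scales, each of the 16 bytes contributes 2 to the accumulator, so the result is 32.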

Performance Impact

I initially implemented this over on alpaca.cpp, where I saw an ~10% speedup to inference.

Before:

main: mem per token = 14368644 bytes
main:     load time =   923.25 ms
main:   sample time =    85.94 ms
main:  predict time = 23502.37 ms / 92.17 ms per token
main:    total time = 24845.69 ms

After:

main: mem per token = 14368644 bytes
main:     load time =   928.89 ms
main:   sample time =    16.18 ms
main:  predict time =  5720.41 ms / 82.90 ms per token
main:    total time =  6982.89 ms

I was hoping for more, but some other things I tried, such as converting the bytesFromNibbles function to operate on two blocks at a time using AVX512, were not successful.
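For context on that last point: bytesFromNibbles expands one 16-byte block of packed nibbles into 32 one-value-per-byte outputs, which the widening multiply-add then consumes. A scalar sketch of the idea (the lane ordering shown — low nibbles first, then high nibbles — mirrors my reading of the AVX2 version's two 128-bit lanes, so treat it as an assumption and check the source):

```c
#include <stdint.h>

/* Scalar model of bytesFromNibbles: expand 16 packed bytes into
   32 unpacked values, one 4-bit value per output byte.  Low
   nibbles fill out[0..15] and high nibbles fill out[16..31],
   matching the low/high 128-bit lanes of the AVX2 version. */
static void bytes_from_nibbles_scalar(const uint8_t in[16], uint8_t out[32]) {
    for (int i = 0; i < 16; i++) {
        out[i]      = in[i] & 0x0F;  /* low nibble  */
        out[16 + i] = in[i] >> 4;    /* high nibble */
    }
}
```

The unsuccessful AVX512 variant mentioned above would expand two such 16-byte blocks per 512-bit register instead of one per 256-bit register.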


congdm commented Mar 21, 2023

Thank you for your effort! It also works on Windows and gives a little boost on my i7-11700F, from ~208 ms/token to 195 ms/token or sometimes even 185 ms/token on Alpaca7B.

@sw sw merged commit 2e664f1 into ggml-org:master Mar 21, 2023
mudler pushed a commit to go-skynet/llama that referenced this pull request Mar 21, 2023
Labels: enhancement (New feature or request), performance (Speed related topics)
4 participants