ggml : add ggml_soft_max_ext #4256
Merged
Commits (14, all by ggerganov):
e89597c  metal : implement soft_max_ext
88519fb  cuda : implement soft_max_ext
6a66f69  ggml : implement soft_max_ext (CPU)
390a445  batched-bench : print threads
580fe20  metal : simplify soft_max encoding
ebd062b  cuda : use 512 threads for soft_max instead of 32
c7c8dab  ggml : update soft max cpu
62532c0  cuda : do warp-based block reduce
6b86bcf  cuda : increase max block size to 1024
68e02c0  cuda : fix warp reduction initialization of shared mem
55717c9  metal : warp-based reduction for soft max kernel
c4db592  metal : warp-based reduce for rms_norm
d9c8fa3  metal : simplify soft max kernel
eb594c0  alloc : fix build with debug
Conversations
Replaced the old `WARP_SIZE` kernel with a new one that uses up to `CUDA_SOFT_MAX_BLOCK_SIZE = 512` threads per block (based on `ne00`). This seems to perform better on V100. Any reason to prefer the old kernel?
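For illustration, a minimal sketch of that launch configuration, assuming the block size grows in powers of two with the row size `ne00` and is capped at `CUDA_SOFT_MAX_BLOCK_SIZE`; the helper name and the commented-out kernel call are hypothetical, not the actual ggml-cuda code:

```cuda
#include <cuda_runtime.h>

#define WARP_SIZE                32
#define CUDA_SOFT_MAX_BLOCK_SIZE 512

// Hypothetical helper: pick a power-of-two block size that covers ne00,
// but never more than CUDA_SOFT_MAX_BLOCK_SIZE threads.
static dim3 soft_max_block_dims(const int ne00) {
    int nth = WARP_SIZE;
    while (nth < ne00 && nth < CUDA_SOFT_MAX_BLOCK_SIZE) {
        nth *= 2;
    }
    return dim3(nth, 1, 1);
}

// usage (one block per row; kernel call shown only as a comment):
//   const dim3 block_dims = soft_max_block_dims(ne00);
//   const dim3 block_nums(nrows, 1, 1);
//   soft_max_f32<<<block_nums, block_dims, 0, stream>>>(/* ... */);
```

Rounding the block size up to a power of two keeps the warp count and the stride arithmetic in the reduction simple.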
Block-wide reductions are a lot more expensive than warp reductions, so most of the kernels only distribute operations that need a reduction to a single warp. A block reduction is usually done by performing a reduction within every warp, storing each warp's result to shared memory, and then reducing the shared memory with a final warp-level pass.
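A minimal sketch of that pattern for a max reduction, assuming up to 1024 threads (32 warps) per block and a block size that is a multiple of `WARP_SIZE`; the function names are illustrative, not necessarily those used in ggml-cuda.cu:

```cuda
#include <cuda_runtime.h>
#include <float.h>

#define WARP_SIZE 32

// Reduce a float to the warp-wide maximum using shuffle instructions.
static __device__ __forceinline__ float warp_reduce_max(float x) {
#pragma unroll
    for (int mask = WARP_SIZE/2; mask > 0; mask >>= 1) {
        x = fmaxf(x, __shfl_xor_sync(0xffffffff, x, mask, WARP_SIZE));
    }
    return x;
}

// Block-wide maximum: every warp reduces its own values, lane 0 of each
// warp writes the partial result to shared memory, and a final warp-level
// reduction over those partials yields the block result. A buffer of
// WARP_SIZE floats supports up to 32 warps (1024 threads) per block.
static __device__ float block_reduce_max(float x) {
    __shared__ float buf[WARP_SIZE];

    const int lane_id = threadIdx.x % WARP_SIZE;
    const int warp_id = threadIdx.x / WARP_SIZE;

    x = warp_reduce_max(x);

    if (blockDim.x > WARP_SIZE) { // assumes blockDim.x is a multiple of WARP_SIZE
        if (lane_id == 0) {
            buf[warp_id] = x;
        }
        __syncthreads();

        // slots for warps that do not exist must not affect the result
        x = lane_id < blockDim.x/WARP_SIZE ? buf[lane_id] : -FLT_MAX;
        x = warp_reduce_max(x);
    }

    return x;
}
```

The sum reduction needed for the softmax normalization follows the same structure, with `fmaxf` swapped for addition and `0.0f` as the identity element instead of `-FLT_MAX`.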
Updated to use a warp-based block reduction with a maximum of 32 warps. Results improved, most significantly for the largest contexts.
I'll try to do a similar optimization for the Metal kernel.