metal : allow ops to run concurrently #15929
Merged
hi @ggerganov, thanks for the nice work. This PR can no longer be merged into the master branch:
Applied patch to 'ggml/src/ggml-metal/CMakeLists.txt' cleanly.
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.cpp: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.cpp'
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.h: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.h'
error: ggml/src/ggml-metal/ggml-metal-common.h: patch does not apply
Applied patch to 'ggml/src/ggml-metal/ggml-metal.m' cleanly.
0a6f0eb
to
417df40
@calvin2021y The branch is now rebased on latest
I get a 1% tps speedup with this patch. Will try more models and update later.
ggml-ci
17cf93d
to
faffbec
Labels
Apple Metal
https://en.wikipedia.org/wiki/Metal_(API)
ggml
changes relating to the ggml tensor library for machine learning
While queueing the graph nodes, keep track of the memory intervals/ranges from which we read data and to which we write data. Using this information, for the new node we can determine if it can safely run concurrently with all the concurrent ops prior to it:
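The interval check described above can be illustrated with a minimal C++ sketch. All names here (`MemRange`, `ConcurrentSet`, `can_run_concurrently`, `add`) are hypothetical and chosen for illustration; this is not the code from the PR, only one plausible way to express the idea under the assumption that each op is described by a list of byte ranges it reads and writes.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not the actual ggml-metal structures.
struct MemRange {
    uint64_t begin; // inclusive start address
    uint64_t end;   // exclusive end address
    bool     write; // true if the op writes to this range, false if it only reads
};

static bool overlaps(const MemRange & a, const MemRange & b) {
    return a.begin < b.end && b.begin < a.end;
}

// Ranges touched by the ops queued in the current concurrent group.
struct ConcurrentSet {
    std::vector<MemRange> ranges;

    // A new op conflicts if any of its ranges overlaps a tracked range and at
    // least one of the two accesses is a write (read-after-write,
    // write-after-read or write-after-write). Read/read overlap is safe.
    bool can_run_concurrently(const std::vector<MemRange> & op) const {
        for (const auto & r : op) {
            for (const auto & t : ranges) {
                if ((r.write || t.write) && overlaps(r, t)) {
                    return false;
                }
            }
        }
        return true;
    }

    // Queue an op. On conflict the tracked ranges are reset, which is the
    // point where a real backend would need to synchronize before continuing.
    // Returns true if the op joined the current concurrent group.
    bool add(const std::vector<MemRange> & op) {
        const bool ok = can_run_concurrently(op);
        if (!ok) {
            ranges.clear();
        }
        ranges.insert(ranges.end(), op.begin(), op.end());
        return ok;
    }
};
```

In this sketch, two ops that read the same source and write to disjoint destinations join the same group, while an op that reads a range written by an earlier op in the group starts a new one.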
This feature can be disabled with the GGML_METAL_CONCURRENCY_DISABLE=1 environment variable.

Improvements depend on the order of the nodes in the graph. Some models currently do not benefit much from this logic, but a graph optimization approach similar to #15850 should improve things.

Introduced logic for optimizing the graph to improve concurrency in a similar way as in #15850. The benefits are large for TG and decent for PP.

TODO:
Example
For example, before this patch, the graph of one layer of gpt-oss-20b is executed like this:

(concurrent) means that the node runs in parallel with the previous one.

After this patch, the nodes are reordered and executed like this:
Perf