# ggml : change ggml_graph_compute() API to not require context #1999
## Conversation
I think that [...] Good job!
Of course. So, create another data structure, for example [...]
Yes, precisely! That's what I was thinking as well. I am not sure that we need a new struct for this; is there any other data from the tensor that would belong there? I am thinking that adding a simple [...]
No way.
Great! The `n_tasks` array is good enough.
That will be a long story. I had tried adding another enum to [...] I think [...] Just some random thoughts :D
That's very similar to what I have been thinking. I am working on a CUDA implementation that can execute [...] While doing this, I think that we should also remove [...]
Yes, I have seen the changes over the last two months; it keeps getting better. Both the CUDA and CL implementations impose potential de-coupling requirements.
Sure, similar to [...] At present, I'm not sure, but it looks like for cross-cgraph computing, the [...] I think it's OK to leave [...]
What I was thinking is that [...] For now, the CUDA runner can still operate in the same way, unaffected by this, but in the long run this method of hijacking the compute flow should be removed. Mixed CPU/GPU execution can be achieved in other ways, such as by splitting the computation into multiple graphs, each executed by a different backend.
Force-pushed from d9876af to 2a773ce
I like the idea. However, I am not in favor of the implementation:
Yes, those are good points. A general context was used for future extensibility, but it should be seen as over-design. If you have read the previous comments, you will reach the conclusion that a general context tends to be "extended" like a basket. So, if we can't tell what we want beyond the current requirement, a specialized API is the best choice.
Agreed. Good points by @howard0su. Overall, I think this is on the right track and we should finalize the PR and merge it. There is no need to worry about backwards compatibility (i.e. the "_v2" stuff is not necessary). Just assume that we have just started developing the API and that nobody has been using it or expects it to be compatible. We will start to worry about this kind of thing after the v1.0 release sometime in the future.
This is great. I'll rewrite this PR and show you later.
Force-pushed from 2a773ce to e052bc4
The PR description was updated. You may want to have a look at it. Apart from [...] No crash; all of them output reasonable text. So this PR is ready for review again.
Looks good, I only have a few minor nits:
Thanks @slaren
Makes sense, this is also one of my concerns. I'll try the std::vector way.
The sugar is used by [...]
Force-pushed from 62ec4b8 to bf63002

Force-pushed from bf63002 to b1331d7
rebased
- backwards compatible API
- deduplicates a lot of copy-paste
Force-pushed from 2313c54 to 1b9994f
I've refactored the changes:
Overall the change is good.
I would suggest [...]
I think this looks good.
Eventually we should support custom ops on the GPU backends too. It will be slow since it will require copying data back and forth from the device, but that's still a lot better than not supporting them at all. So whenever we do this, I think we should do it in a backend-agnostic way, rather than tying it to the CPU backend.
Looks good. Verified main and test-grad0.
Anyway, better than never.
Good suggestion. Btw, this reminds me that [...]
Ah, good point. So the developer has to be careful and modify [...]
Try resolve ggml-org/ggml#287

EDIT: see latest update at the end.

### Intro

- The [...]
- One of the corner cases is: the [...]

The design is a bit different from the suggested one: I named the buffer type as a generalized one, `ggml_cgraph_context`. I'll explain `planned` later; let's focus on the APIs first.

- The first parameter `ctx` of `ggml_graph_compute()` is deprecated (passing in NULL is OK).
- Removed `wsize` and `work` from `ggml_cgraph`; this is unlikely to break external users because there is no reason to use them directly.
- To avoid breaking external users, we cannot simply change the signature of `ggml_graph_compute()`; we have to add `ggml_graph_compute_v2()`. The name looks weird, but this is the reality :/ Now `ggml_graph_compute()` is a thin wrapper of `ggml_graph_compute_v2()`.
- Extracted code into a new function `ggml_graph_compute_plan()`: it sets `node->n_tasks` and calculates `work_size`.

### Usage
Why did I add the field `planned`? Because:

- `ctx` is allowed to be NULL, empty, or initialized by `ggml_graph_compute_plan()`.
- `work_size` and `work_data` can still hold default values even after the plan has run, so we cannot determine whether or not `ggml_graph_compute_plan()` has been called just by looking for default values.
- `ggml_graph_compute_plan()` MUST be called because it also sets `node->n_tasks`; the `work_size` depends on `n_tasks`.

The `planned` field makes the plan-compute sequence stateful, which is not good enough. Any ideas?

### Update on JUL 3
- The plan phase MUST be executed before compute, and `ggml_graph_compute()` is no longer responsible for creating any kind of buffer. @ggerganov
- Removed `n_tasks` from `ggml_tensor`; removed `n_threads` from `ggml_cgraph`. @slaren
- The name `struct ggml_graph_compute_plan` implies that it should be initialized or created by some procedure. @howard0su

Usage:
I tested `main` and `perplexity`.

### Update JUL 6
🤖 Generated by Copilot at 551ed08

### Summary

🛠️📚🚀

This pull request improves the performance and usability of the `ggml_graph_compute` function by introducing a new `ggml_cplan` structure and a helper function `ggml_graph_compute_helper`. It also adds support for specifying the number of command buffers for the Metal context, and updates the examples, tests, and workflows to use the new API and settings.

### Walkthrough

- Change the `ggml_graph_compute` function to require a `ggml_cplan` argument, and add new functions `ggml_graph_plan` and `ggml_graph_compute_with_ctx` to support the new API (link)
- Add a new function `ggml_graph_compute_helper` to wrap the logic of creating and using a `ggml_cplan` structure, and update the example code in `ggml.h` to use the new API (5 links)
- Remove `n_threads`, `n_tasks`, and `work` from the `ggml_cgraph` and `ggml_tensor` structures and add them to the `ggml_cplan` structure (3 links)
- Refactor the `llama_eval_internal` function to use the new API and adjust the number of threads for the BLAS settings (4 links)
- Add a parameter `n_cb` to the `ggml_metal_init` function and a function `ggml_metal_set_n_cb` to control the number of command buffers for the Metal context, and update the `ggml_metal_graph_compute` function to use the `n_cb` field instead of the `n_threads` field (5 links)
- Update the `metal.cpp` example to pass the number of threads to the `ggml_metal_init` function (link)
- Update the `baby-llama.cpp`, `benchmark-matmult.cpp`, and `train-text-from-scratch.cpp` examples to use the new API and remove the assignments of `gf.n_threads` (21 links)
- Update the `llama.cpp` file to use the new API and pass the number 1 to the `ggml_metal_init` function (9 links)
- Refactor the `test-grad0.c` test and update it to use the new API and ignore the double promotion warning (7 links)
- Update the `test-opt.c` test to use the new API and ignore the double promotion warning (5 links)
- Adjust the `ctest` command for the GitHub workflow jobs to avoid the tests hanging or running too long, and add a new environment variable `GGML_NLOOP` to control the number of iterations for the benchmark tests (5 links)