# ggml : change ggml_graph_compute() API to not require context #1999
## Conversation
I think that [...] Good job!
Of course. So, create another data structure, for example [...]
Yes, precisely! That's what I was thinking as well. I am not sure that we need a new struct for this; is there any other data from the tensor that would belong there? I am thinking that adding a simple [...]
No way.
Great! The `n_tasks` array is good enough.
That will be a long story. I had tried adding another enum to [...] I think [...] Just some random thoughts :D
That's very similar to what I have been thinking. I am working on a CUDA implementation that can execute [...] While doing this, I think that we should also remove [...]
Yes, I have seen the changes over the last two months; it keeps getting better. Both the CUDA and CL implementations impose potential de-coupling requirements.
Sure, similar to [...] At present, I'm not sure, but it looks like for cross-cgraph computing, the [...] I think it's OK to leave [...]
What I was thinking is that [...] For now, the CUDA runner can still operate in the same way, unaffected by this, but in the long run this method of hijacking the compute flow should be removed. Mixed CPU/GPU execution can be achieved in other ways, such as by splitting the computation into multiple graphs, each executed by a different backend.
Force-pushed from d9876af to 2a773ce
I like the idea. However, I am not in favor of the implementation:
Yes, those are good points. A general context was used for future extensibility, but it should be seen as over-design. If you have read the previous comments, you will reach the conclusion that a general context tends to be "extended" like a basket. So, if we can't tell what we want beyond the current requirement, a specialized API is the best choice.
Agreed. Good points by @howard0su. Overall, I think this is on the right track and we should finalize the PR and merge it. There is no need to worry about backwards compatibility (i.e. the "_v2" stuff is not necessary). Just assume that we have just started developing the API and that nobody has been using it or expects it to be compatible. We will start to worry about this kind of thing after the v1.0 release sometime in the future.
This is great. I'll rewrite this PR and show you later.
Force-pushed from 2a773ce to e052bc4
The PR description was updated. You may want to have a look at it. Apart from [...] No crash; all of them output reasonable text. So this PR is ready for review again.
Looks good, I only have a few minor nits:
Thanks @slaren
Makes sense, this is also one of my concerns. I'll try the std::vector way.
The sugar is used by [...]
Force-pushed from 62ec4b8 to bf63002

Force-pushed from bf63002 to b1331d7
rebased
- backwards compatible API
- deduplicates a lot of copy-paste
Force-pushed from 2313c54 to 1b9994f
I've refactored the changes:
Overall the change is good.
I would suggest [...]
I think this looks good.
Eventually we should support custom ops on the GPU backends too. It will be slow since it will require copying data back and forth from the device, but that's still a lot better than not supporting them at all. So whenever we do this, I think we should do it in a backend-agnostic way, rather than tying it to the CPU backend.
Looks good. Verified main and test-grad0.
Anyway, better than never.
Good suggestion. Btw, this reminds me that [...]
Ah, good point. So the developer has to be careful and modify [...]
Try resolve ggml-org/ggml#287

EDIT: see latest update at the end.

### Intro

- The [...]
- One of the corner cases is: the [...]

The design is a bit different from the suggested one: I named the buffer type as a generalized one, `ggml_cgraph_context`. I'll explain `planned` later; let's focus on the APIs first.

- The first parameter `ctx` of `ggml_graph_compute()` is deprecated (passing in NULL is OK).
- Removed `wsize` and `work` from `ggml_cgraph`; this is unlikely to break external users because there is no reason to use them directly.
- To avoid breaking external users, we cannot simply change the signature of `ggml_graph_compute()`; we have to add `ggml_graph_compute_v2()`. The name looks weird, but this is the reality :/ Now `ggml_graph_compute()` is a thin wrapper of `ggml_graph_compute_v2()`.
- Extracted code into a new function `ggml_graph_compute_plan()`: it sets `node->n_tasks` and calculates `work_size`.

### Usage
Why did I add the field `planned`? Because:

- `ctx` is allowed to be NULL, empty, or initialized by `ggml_graph_compute_plan()`.
- `work_size` and `work_data` can still hold default values even after the plan has run, so we cannot determine whether or not `ggml_graph_compute_plan()` has been called just by looking for default values.
- `ggml_graph_compute_plan()` MUST be called because it also sets `node->n_tasks`; the `work_size` depends on `n_tasks`.

The `planned` field makes the plan-compute sequence stateful, which is not good enough. Any ideas?

### Update on JUL 3
- The plan phase MUST be executed before compute, and `ggml_graph_compute()` is no longer responsible for creating any kind of buffer. @ggerganov
- Removed `n_tasks` from `ggml_tensor`; removed `n_threads` from `ggml_cgraph`. @slaren
- The name `struct ggml_graph_compute_plan` implies that it should be initialized or created by some procedure. @howard0su

Usage:
I tested `main` and `perplexity`.

### Update JUL 6
🤖 Generated by Copilot at 551ed08

### Summary

🛠️📚🚀

This pull request improves the performance and usability of the `ggml_graph_compute` function by introducing a new `ggml_cplan` structure and a helper function `ggml_graph_compute_helper`. It also adds support for specifying the number of command buffers for the Metal context, and updates the examples, tests, and workflows to use the new API and settings.

### Walkthrough

- Change the `ggml_graph_compute` function to require a `ggml_cplan` argument, and add new functions `ggml_graph_plan` and `ggml_graph_compute_with_ctx` to support the new API (link)
- Add a new function `ggml_graph_compute_helper` to wrap the logic of creating and using a `ggml_cplan` structure, and update the example code in `ggml.h` to use the new API (5 links)
- Remove `n_threads`, `n_tasks`, and `work` from the `ggml_cgraph` and `ggml_tensor` structures and add them to the `ggml_cplan` structure (3 links)
- Refactor the `llama_eval_internal` function to use the new API and adjust the number of threads for the BLAS settings (4 links)
- Add a parameter `n_cb` to the `ggml_metal_init` function and a function `ggml_metal_set_n_cb` to control the number of command buffers for the Metal context, and update the `ggml_metal_graph_compute` function to use the `n_cb` field instead of the `n_threads` field (5 links)
- Update the `metal.cpp` example to pass the number of threads to the `ggml_metal_init` function (link)
- Update the `baby-llama.cpp`, `benchmark-matmult.cpp`, and `train-text-from-scratch.cpp` examples to use the new API and remove the assignments of `gf.n_threads` (21 links)
- Update the `llama.cpp` file to use the new API and pass the number 1 to the `ggml_metal_init` function (9 links)
- Refactor the `test-grad0.c` test and update it to use the new API and ignore the double promotion warning (7 links)
- Update the `test-opt.c` test to use the new API and ignore the double promotion warning (5 links)
- Adjust the `ctest` command for the GitHub workflow jobs to avoid the tests hanging or running too long, and add a new environment variable `GGML_NLOOP` to control the number of iterations for the benchmark tests (5 links)