ggml: new optimization interface #988
Conversation
How do you plan to support multiple GPUs? Currently this interface is taking a |
For some reason |
I mentioned some possible solutions to the problem with |
I added tests for the new optimization interface. I'll do the transition towards |
I adapted the new optimization interface to use |
With some changes it can be used with BLAS and Metal. On M3 Max with BLAS it takes just 3 seconds to train, compared to 15 seconds with 3090 Ti CUDA or ~9 seconds with 13900k CPU.
diff --git a/examples/mnist/mnist-common.h b/examples/mnist/mnist-common.h
index 6e2d235..c2a4464 100644
--- a/examples/mnist/mnist-common.h
+++ b/examples/mnist/mnist-common.h
@@ -134,6 +134,17 @@ struct mnist_model {
devices.push_back(dev);
}
+ // add accel devices
+ for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
+ ggml_backend_dev_t dev = ggml_backend_dev_get(i);
+ if (ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
+ ggml_backend_t backend = ggml_backend_dev_init(dev, nullptr);
+ GGML_ASSERT(backend);
+ backends.push_back(backend);
+ devices.push_back(dev);
+ }
+ }
+
ggml_backend_dev_t dev_cpu = ggml_backend_dev_by_name("CPU");
GGML_ASSERT(dev_cpu);
ggml_backend_t backend_cpu = ggml_backend_dev_init(dev_cpu, nullptr);
@@ -151,12 +162,17 @@ struct mnist_model {
if (backends.size() == 1) {
fprintf(stderr, "%s: using %s (%s) backend\n",
__func__, ggml_backend_name(backends[0]), ggml_backend_dev_description(devices[0]));
- } else if (backends.size() == 2) {
- fprintf(stderr, "%s: using %s (%s) backend with %s (%s) fallback\n",
- __func__, ggml_backend_name(backends[0]), ggml_backend_dev_description(devices[0]),
- ggml_backend_name(backends[1]), ggml_backend_dev_description(devices[1]));
} else {
- GGML_ASSERT(false);
+
+ fprintf(stderr, "%s: using %s (%s) backend with fallbacks: ",
+ __func__, ggml_backend_name(backends[0]), ggml_backend_dev_description(devices[0]));
+ for (size_t i = 1; i < backends.size(); ++i) {
+ fprintf(stderr, "%s (%s)", ggml_backend_name(backends[i]), ggml_backend_dev_description(devices[i]));
+ if (i + 1 < backends.size()) {
+ fprintf(stderr, ", ");
+ }
+ }
+ fprintf(stderr, "\n");
}
{
diff --git a/src/ggml-metal.m b/src/ggml-metal.m
index fb2efc6..a9f35c7 100644
--- a/src/ggml-metal.m
+++ b/src/ggml-metal.m
@@ -3285,6 +3285,12 @@ static void ggml_backend_metal_buffer_free_buffer(ggml_backend_buffer_t buffer)
return ctx->all_data;
}
+static void ggml_backend_metal_buffer_memset_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, uint8_t value, size_t offset, size_t size) {
+ memset((char *)tensor->data + offset, value, size);
+
+ UNUSED(buffer);
+}
+
static void ggml_backend_metal_buffer_set_tensor(ggml_backend_buffer_t buffer, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
memcpy((char *)tensor->data + offset, data, size);
@@ -3318,7 +3324,7 @@ static void ggml_backend_metal_buffer_clear(ggml_backend_buffer_t buffer, uint8_
/* .free_buffer = */ ggml_backend_metal_buffer_free_buffer,
/* .get_base = */ ggml_backend_metal_buffer_get_base,
/* .init_tensor = */ NULL,
- /* .memset_tensor = */ NULL,
+ /* .memset_tensor = */ ggml_backend_metal_buffer_memset_tensor,
/* .set_tensor = */ ggml_backend_metal_buffer_set_tensor,
/* .get_tensor = */ ggml_backend_metal_buffer_get_tensor,
/* .cpy_tensor = */ ggml_backend_metal_buffer_cpy_tensor, |
I changed the MNIST code slightly to a version that I think is simpler. Am I right in assuming that it's unproblematic to initialize two backends for the same device and to then pass those backends to the same instance of |
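A minimal sketch of the setup being asked about, assuming the truncated reference is to the backend scheduler; the ggml_backend_sched_new call and its argument list are an assumption for illustration, while the device/backend calls mirror the diff above:

#include "ggml-backend.h"

// Sketch: two backends initialized for the same device, both handed to a single
// scheduler instance. The ggml_backend_sched_new signature is an assumption.
int main(void) {
    ggml_backend_dev_t dev = ggml_backend_dev_by_name("CPU");
    ggml_backend_t backend_a = ggml_backend_dev_init(dev, /*params =*/ NULL);
    ggml_backend_t backend_b = ggml_backend_dev_init(dev, /*params =*/ NULL);

    ggml_backend_t backends[2] = { backend_a, backend_b };
    ggml_backend_sched_t sched = ggml_backend_sched_new(
        backends, /*bufts =*/ NULL, /*n_backends =*/ 2, /*graph_size =*/ 2048, /*parallel =*/ false);

    // ... allocate and compute graphs via sched as usual ...
    ggml_backend_sched_free(sched);
    return 0;
}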
The performance seems very poor, but since the model is so small this is basically just a measure of the overhead. I seem to remember that you are using WSL2, so maybe that has to do with it? On my machines (all running native Linux) I see the following performance:
Notably the RX 6800 is also performing much worse than the P40 even though with llama.cpp the performance is very similar. |
It may waste some resources and make graph splitting a bit slower, but not much. Generally I don't think it is very useful to have multiple GPU backends; the CPU backend is usually a better fallback since the cost of copying the state is lower.
Kernel launch overhead is higher on Windows (it's the same reason |
I removed the use of GGML graph exports from the MNIST example. In its current state the feature is fundamentally incompatible with the new interface because it relies on statically allocated CPU tensors (also, it would be necessary to mess with the internals of the optimization context). Currently the optimization interface works by making the user statically allocate the model weights and inputs and define the computation of the outputs without allocation. The optimization context then statically allocates tensors for e.g. the optimizer momenta and defines the backward pass without allocation. The unallocated tensors are then given to |
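Roughly, the pattern described above looks like the following sketch (a single linear layer on MNIST-sized inputs; backend and nbatch are assumed to be provided by the caller, and the exact ggml calls may differ slightly from this PR):

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Weights and inputs are statically allocated in a backend buffer; the outputs are
// only defined, not allocated -- the optimization context later allocates them
// together with the backward pass and the optimizer momenta.
static struct ggml_tensor * build_forward(ggml_backend_t backend, int64_t nbatch) {
    struct ggml_init_params ip = {
        /*.mem_size   =*/ 16*ggml_tensor_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true, // tensor data lives in backend buffers, not in the ggml context
    };

    // statically allocated part: model weights and inputs
    struct ggml_context * ctx_static = ggml_init(ip);
    struct ggml_tensor  * weights = ggml_new_tensor_2d(ctx_static, GGML_TYPE_F32, 784, 10);
    struct ggml_tensor  * inputs  = ggml_new_tensor_2d(ctx_static, GGML_TYPE_F32, 784, nbatch);
    ggml_set_param(ctx_static, weights); // mark as trainable
    ggml_backend_alloc_ctx_tensors(ctx_static, backend); // (keep and free the returned buffer in real code)

    // compute part: the outputs are defined without allocating their data
    struct ggml_context * ctx_compute = ggml_init(ip);
    struct ggml_tensor  * logits = ggml_mul_mat(ctx_compute, weights, inputs); // [10, nbatch]
    return logits;
}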
Force-pushed from 513a786 to 7a35e26.
After this is merged, can all the "opt" functions from ggml.h/ggml.c be removed, or are any of them still used? I am moving all the CPU backend-specific code to a separate file, and it would be easier if I could just remove these functions, since they only work with the CPU backend. |
About the graph exports - I don't think these are used, it seems that it was an experimental feature that never really took off. It may be better to remove these functions entirely. cc @ggerganov. |
Yes, everything related to graph export should be removed. |
Actually, my plan was to nail down the features for the new interface, then remove the old ggml_opt functions and rename the new interface to ggml_opt (I just thought it would be easier that way). The ggml_opt functionality on master can already be removed ahead of time I think. |
I fixed gradient accumulation and I think that this PR is now feature complete and just needs the ggml_opt_new -> ggml_opt transition. @slaren since you are currently also doing something where the old optimization interface would be removed, how should we coordinate this? |
I am almost done with the change, I was planning to open a PR later tonight. It's moving code around so there will be merge conflicts, but it should be fairly straightforward to resolve them since I am not changing the functions that you are modifying here. |
Unless I'm forgetting something I now have all features that I was targeting for this PR. After #1006 is merged all that is left to do is to rebase the code and change the prefix from ggml_opt_new to ggml_opt. |
Force-pushed from 63d4133 to 71da71d.
Minor patch to clear some compile warnings with clang:
diff --git a/src/ggml-opt.cpp b/src/ggml-opt.cpp
index ec9bccd..a1fb512 100644
--- a/src/ggml-opt.cpp
+++ b/src/ggml-opt.cpp
@@ -635,7 +635,7 @@ void ggml_opt_epoch_callback_progress_bar(
const int64_t t_eta_m = t_eta_s / 60;
t_eta_s -= t_eta_m * 60;
- fprintf(stderr, "| data=%06ld/%06ld, loss=%.6lf+-%.6lf, accuracy=%.2lf+-%.2lf%%, t=%02ld:%02ld:%02ld, ETA=%02ld:%02ld:%02ld]\r",
+ fprintf(stderr, "| data=%06" PRId64 "/%06" PRId64 ", loss=%.6lf+-%.6lf, accuracy=%.2lf+-%.2lf%%, t=%02" PRId64 ":%02" PRId64 ":%02" PRId64 ", ETA=%02" PRId64 ":%02" PRId64 ":%02" PRId64 "]\r",
idata, idata_max, loss, loss_unc, 100.0*accuracy, 100.0*accuracy_unc,
t_ibatch_h, t_ibatch_m, t_ibatch_s, t_eta_h, t_eta_m, t_eta_s);
if (ibatch == ibatch_max) {
@@ -712,7 +712,7 @@ void ggml_opt_fit(
t_total_s -= t_total_h * 3600;
const int64_t t_total_m = t_total_s / 60;
t_total_s -= t_total_m * 60;
- fprintf(stderr, "%s: training took %02ld:%02ld:%02ld\n", __func__, t_total_h, t_total_m, t_total_s);
+ fprintf(stderr, "%s: training took %02" PRId64 ":%02" PRId64 ":%02" PRId64 "\n", __func__, t_total_h, t_total_m, t_total_s);
}
ggml_opt_free(opt_ctx); |
The reasoning is that they are opaque types and it is not relevant to the user whether they are structs or not. This is done with all the structs that are hidden from user code. |
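For illustration, the opaque-handle pattern being referred to looks roughly like this (the struct members shown are purely illustrative, not the actual contents):

#include <stdint.h>

// public header: user code only ever sees a pointer type
typedef struct ggml_opt_context * ggml_opt_context_t;

// implementation file: the full definition can change freely without affecting users
struct ggml_opt_context {
    void *  backend_sched; // illustrative members only
    int64_t iter;
};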
It's up to you. We can also add a CODEOWNERS where maintainers can add themselves if they would like to be notified for PRs.
Sounds good. |
I noticed that the carriage return for the progress bar only results in the expected animation-like behavior if the progress bar is short enough to fit the terminal; otherwise the cursor only returns to the start of the wrapped line and the terminal is spammed with one new line per minibatch. I just reduced the size of the progress bar but maybe there is a better solution. |
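One possible direction, sketched under the assumption that a fixed conservative width is acceptable (querying the real terminal width would be an alternative):

#include <stdio.h>

// Sketch of the workaround: keep the progress line within a fixed width so that the
// carriage return reliably rewinds to the start of the whole line.
static void print_progress(FILE * f, int done, int total) {
    const int bar_width = 40; // conservative width that fits typical terminals
    const int filled    = bar_width * done / total;
    fprintf(f, "\r[");
    for (int i = 0; i < bar_width; ++i) {
        fputc(i < filled ? '=' : ' ', f);
    }
    fprintf(f, "] %3d%%", 100 * done / total);
    if (done == total) {
        fputc('\n', f); // keep the final state of the bar on screen
    }
    fflush(f);
}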
I pushed a version that stores the optimization parameters in the graph. They are allocated in a CPU buffer and written at the start of an eval. The CPU backend can use the parameters directly. I changed the CUDA backend to expect a device buffer with parameters instead of passing the parameters as kernel arguments. To change optimization parameters from their defaults users need to pass a custom function that calculates them. Edit: no changes to |
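A hypothetical sketch of what such a callback could look like; the struct layout and the callback signature are assumptions based on the description, not necessarily the exact API of this PR:

#include <stdint.h>

// Hypothetical user callback that recomputes the AdamW parameters before each eval;
// the parameters are then written into the CPU buffer stored in the graph.
struct adamw_params {
    float alpha; // learning rate
    float beta1;
    float beta2;
    float eps;
    float wd;    // weight decay
};

static struct adamw_params get_opt_params(void * userdata) {
    const int64_t iter = *(const int64_t *) userdata;
    struct adamw_params p = { 1e-3f, 0.9f, 0.999f, 1e-8f, 0.0f };
    if (iter < 100) {
        p.alpha *= (float) (iter + 1) / 100.0f; // e.g. a simple linear warmup
    }
    return p;
}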
Force-pushed from d321214 to cd2b9ab.
I did some fixup, from my end this PR would be ready to merge. There is still the issue of refactoring the GGML code in such a way that I adapted the MNIST example README and while doing so I noticed that the convolutional model can now be trained with partial CUDA support which is faster than CPU only. |
I pushed a refactor of the code around gradients. Currently, to get the gradient or gradient accumulator for a tensor there is a loop over |
Actually, I think the hash table should be part of |
The goal is to progressively port the code to C++, but modifying |
I did an implementation using hashsets but I realized that with the current GGML hashsets building the backward pass would still take quadratic time. If a tensor is not contained in the hash set |
Is this correct? It should only iterate until the first empty slot. That's just the way of dealing with collisions, but if the table is correctly sized, the number of collisions will be very close to zero. |
You are absolutely right, looking at the code again it seems I missed part of the condition in the while loop. |
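For reference, the lookup behavior described above is plain linear probing; a generic sketch, not the actual GGML hash set code:

#include <stddef.h>
#include <stdint.h>

// Linear-probing lookup: probing stops at the first empty slot, so a miss costs only
// the (expected small) number of collisions, not a scan of the whole table.
// Assumes the table always keeps at least one empty slot.
static size_t hash_find(const void ** keys, size_t size, const void * key) {
    size_t i = (uintptr_t) key % size; // toy hash: pointer value modulo table size
    while (keys[i] != NULL && keys[i] != key) {
        i = (i + 1) % size; // collision: probe the next slot
    }
    return i; // slot of the key if present, otherwise the first empty slot
}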
Force-pushed from 4004e25 to bbf203c.
Definitely, thank you for the clarification. My hashset-based implementation (which I have now pushed) had a defect related to my misunderstanding that, by chance, did not manifest as a bug. |
Are there still pending (re-)reviews or should we merge this? |
Let me have a look now. |
Very cool stuff!
Before merging, I would like to first sync the backend split from llama.cpp and resolve the PR conflicts here. Otherwise, if we merge it now, I will have a much harder time resolving these conflicts through the sync scripts and the git-am commands since some of the files have moved.
Would that be OK? The sync might be ready tonight, but more likely tomorrow.
From my end I am in no particular rush, I am definitely not running out of things to work on (even ignoring projects unrelated to llama.cpp/GGML). I just don't want to do more rebases than necessary since this PR touches a lot of lines. |
Should be ready to rebase and merge. |
remove test2.c, test3.c
store adamw params in tensor
move grads from tensor to graph
Force-pushed from bbf203c to e35567a.
I noticed that the memory for gradient (accumulator) pointers upon graph creation is not explicitly cleared so it was possible to provoke a segfault via API misuse. This is fixed. There also seem to be build issues on Apple where (I also noticed that |
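Schematically, the fix presumably amounts to zero-initializing the pointer storage when the graph is created; the function and parameter names below are assumptions for illustration, not the actual patch:

#include <string.h>
#include "ggml.h"

// Schematic fix: zero the gradient / gradient-accumulator pointer arrays at graph
// creation so that API misuse can never dereference uninitialized pointers.
static void clear_grad_ptrs(struct ggml_tensor ** grads, struct ggml_tensor ** grad_accs, size_t n) {
    if (grads)     memset(grads,     0, n * sizeof(*grads));
    if (grad_accs) memset(grad_accs, 0, n * sizeof(*grad_accs));
}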
Force-pushed from 4685431 to 8fdeb12.
* ggml: new optimization interface
  remove test2.c, test3.c
  store adamw params in tensor
  move grads from tensor to graph
* avoid segfault upon API misuse
* add ggml-opt.h to public headers
* remove dependence of ggml-opt.cpp on ggml-cpu.h
This PR adapts the training code from the MNIST example into GGML with the goal of establishing a new interface for training models. The goal is to provide downstream projects with a common, more high-level interface for training that can be tested and debugged more easily. The general design is procedural and relies on the definition of data structures for optimization contexts, datasets, and results.
As of right now essentially only feed-forward classifiers are supported. I put the code into a new file ggml-opt.cpp with a corresponding new header ggml-opt.h. One reason for this is that I am using some C++ functionality that is not performance critical but convenient. Another reason is that with the current GGML code there is no need to mess around with the internals of a GGML graph, so I think it makes sense to split off functionality that is only going to be used by a subset of the userbase into a separate header (also, the general vibe from what I can tell is that people find ggml.c hard to navigate due to its size).
There is still a lot to do but I would like to get feedback on the interface early if possible. In particular, one thing that is still missing is testing code for the new interface. For now the prefix that I am using for the new interface is ggml_opt_new; I plan to change this to ggml_opt and remove the old ggml_opt code prior to merging.
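To make the procedural shape concrete, a hypothetical sketch follows; none of the ggml_opt_new_* declarations below are taken from this PR, they are assumptions made purely for illustration:

#include <stdint.h>

// Hypothetical illustration of the procedural design: opaque data structures for the
// optimization context, the dataset, and the results, plus free functions operating
// on them. Every name and signature here is an assumption, not the PR's actual API.
typedef struct ggml_opt_new_context * ggml_opt_new_context_t; // optimizer state, graphs, momenta
typedef struct ggml_opt_new_dataset * ggml_opt_new_dataset_t; // datapoints and labels
typedef struct ggml_opt_new_result  * ggml_opt_new_result_t;  // accumulated loss / accuracy

ggml_opt_new_dataset_t ggml_opt_new_dataset_init(int64_t ne_datapoint, int64_t ne_label, int64_t ndata);
ggml_opt_new_result_t  ggml_opt_new_result_init(void);
void ggml_opt_new_epoch(ggml_opt_new_context_t opt_ctx, ggml_opt_new_dataset_t dataset,
                        ggml_opt_new_result_t result_train, ggml_opt_new_result_t result_eval);

static void train(ggml_opt_new_context_t opt_ctx, ggml_opt_new_dataset_t dataset, int nepoch) {
    for (int epoch = 0; epoch < nepoch; ++epoch) {
        ggml_opt_new_result_t result_train = ggml_opt_new_result_init();
        ggml_opt_new_result_t result_eval  = ggml_opt_new_result_init();
        ggml_opt_new_epoch(opt_ctx, dataset, result_train, result_eval);
        // read loss / accuracy out of the result objects and log them here
    }
}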