ggml : expose hash table API from ggml.c and reuse in ggml-alloc #549
I need a hash table in `ggml-alloc`. So I am not sure if it is worth doing this. Should we add a private header file for internal use within ggml only?
Sure, we should add an internal header (e.g. `ggml-impl.h`). Btw, I'm still not really worried about breaking things in the examples.
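
For illustration, a minimal sketch of what such an internal, pointer-keyed hash set could look like - the names, sizing and probing scheme here are assumptions for the sketch, not the actual implementation from #548:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ggml_tensor; // only the pointer identity is used as the key

// fixed-size open-addressing set of tensor pointers
struct ggml_hash_set {
    size_t size;                      // number of slots, never filled completely
    const struct ggml_tensor ** keys; // NULL marks an empty slot
};

static size_t ggml_hash(const struct ggml_tensor * t) {
    // pointer hash: drop the low bits that are always zero due to alignment
    return (size_t)((uintptr_t) t >> 4);
}

// returns true if the tensor was newly inserted, false if it was already present
static bool ggml_hash_insert(struct ggml_hash_set * set, const struct ggml_tensor * t) {
    size_t i = ggml_hash(t) % set->size;
    while (set->keys[i] != NULL) { // linear probing
        if (set->keys[i] == t) {
            return false;
        }
        i = (i + 1) % set->size;
    }
    set->keys[i] = t;
    return true;
}
```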
I am very concerned about breaking our own examples; the training examples especially depend very strongly on the internals of all of this. I think the whole graph building code should be redesigned. I don't know how to solve every problem with dependencies, but I think a good start would be to automatically attach all the tensors created in a context to a graph, with an option to explicitly detach them. Currently we do the opposite: we create all tensors detached and then attach them with calls to `ggml_build_forward_expand`.
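
For reference, a minimal sketch of the current explicit-attach flow (assuming today's `ggml_new_graph` / `ggml_build_forward_expand` API; the tensors and ops are made up):

```c
#include "ggml.h"

// tensors are created detached from any graph and only become part of one
// when ggml_build_forward_expand() is called on a result tensor
static struct ggml_cgraph * build(struct ggml_context * ctx,
                                  struct ggml_tensor * w, struct ggml_tensor * x) {
    struct ggml_cgraph * gf = ggml_new_graph(ctx);

    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x); // detached
    struct ggml_tensor * z = ggml_relu(ctx, y);       // still detached

    // only now are z and its parents (y, w, x) added to gf, in an order that is
    // recovered by traversing the src pointers
    ggml_build_forward_expand(gf, z);

    return gf;
}
```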
That's a good idea. I think it is totally doable. The order of the nodes of the graph will be determined by the order of creation (i.e. the order of calling the `ggml_` operators). The graph construction where we traverse the children / parents is probably completely unnecessary. It was one of the very first features in ggml.

Edit: Actually it is relevant for the backward pass, since in that situation you could have many extra ops that are not necessary for the calculation of the gradients that you are interested in. So in this case, traversing the dependencies will automatically avoid any unnecessary ops. Maybe we can think of a way to support both modes - it would make the API transition easier too.
The backwards pass should always start from a loss, right? So I think in that case it is perfectly reasonable to start from the loss node and follow the parents to build the backwards graph. But if we have all the nodes somewhere, we may be able to add missing dependencies automatically. The typical case is the KV update:

```c
ggml_cpy(ctx0, Kcur, ggml_view_1d(ctx0, kv_self.k, ...));

struct ggml_tensor * K  = ggml_view_3d(ctx0, kv_self.k, ...);
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
```

The `ggml_mul_mat` has to run after the `ggml_cpy` into `kv_self.k`, but that dependency is not represented anywhere in the graph.
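
For context, the way this ordering is usually enforced today is to expand the copy into the graph before the ops that read the cache - roughly along these lines (a sketch; `gf` is the graph being built, and the elided view arguments are the same as in the snippet above):

```c
// force the KV write into the graph first, so every node added after it is
// evaluated later - this relies on node order, not on the src pointers of KQ
ggml_build_forward_expand(gf, ggml_cpy(ctx0, Kcur, ggml_view_1d(ctx0, kv_self.k, ...)));

struct ggml_tensor * K  = ggml_view_3d(ctx0, kv_self.k, ...);
struct ggml_tensor * KQ = ggml_mul_mat(ctx0, K, Q);
ggml_build_forward_expand(gf, KQ);
```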
Let me think some more about the backward pass.
Why would we need to figure out the dependencies? The operations will be executed in the order in which they have been declared, so there is no need to find the dependencies explicitly. What I have in mind is that we would basically record the node order while constructing the graph, instead of trying to determine it by traversing the nodes post-construction.
For parallel execution we need to know all the dependencies. To minimize backend switches with partial offloading we may need to change the evaluation order, but we can only do that if we know all the dependencies. I assume that we also need to know the dependencies to build a correct backwards pass, but I don't know enough about that. We may also be able to find the evaluation order that minimizes memory usage during training if we know all the dependencies.
The order in which we invoke the ops (A) plus the existing dependency information through the `src` pointers (B) should be enough to cover all of these use cases.

If we start recording the order, it will add extra constraints: an op cannot be executed before a previously invoked op that writes to one of its sources. The KV cache update example would be trivially solved if we start recording (A) and apply the new constraints. The parallelisation can be achieved with an extra pass that analyses the graph and takes both (A) and (B) into account. Same goes for the training use cases. Having (A), we also avoid the `ggml_build_forward_expand` calls.

Let me know if you agree - is there any other use case that you think we might need to cover?
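
As a rough sketch of how (B) could be consulted - `ggml_depends_on` and `ggml_can_run_concurrently` are hypothetical helpers, not existing ggml API:

```c
#include <stdbool.h>
#include "ggml.h"

// true if node `a` (transitively) reads from node `b` through the src pointers,
// i.e. `a` must not start before `b` has finished
// (a real implementation would memoize visited nodes, e.g. with a hash set)
bool ggml_depends_on(const struct ggml_tensor * a, const struct ggml_tensor * b) {
    if (a == b) {
        return true;
    }
    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (a->src[i] != NULL && ggml_depends_on(a->src[i], b)) {
            return true;
        }
    }
    return false;
}

// given two ops recorded in this order (A), they may be dispatched concurrently
// only if the later one does not depend on the earlier one through (B)
bool ggml_can_run_concurrently(const struct ggml_tensor * earlier,
                               const struct ggml_tensor * later) {
    return !ggml_depends_on(later, earlier);
}
```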
I think this should be "an op cannot be executed before a previously invoked op that writes to one of its sources, (1): or if a view of one of its sources is written by it" - otherwise the KV update example looks like a contradiction.
How is it a contradiction - I mean that in the example above, the ops would still be evaluated in the order in which they were invoked.
The contradiction is that the `ggml_cpy` does not write to one of the sources of `KQ` - it writes to a different view of `kv_self.k` - so the constraint as worded does not order them.
Let's say we have the following ops:

```c
// these can go in parallel
K = ggml_mul_mat(ctx, model.wk, cur);
Q = ggml_mul_mat(ctx, model.wq, cur);
V = ggml_mul_mat(ctx, model.wv, cur);

// these can go in parallel
Q_scaled = ggml_scale(ctx, Q, scale);
V_scaled = ggml_scale(ctx, V, scale);

// these can go in parallel
Q_masked = ggml_add(ctx, Q_scaled, mask);
V_masked = ggml_add(ctx, V_scaled, mask);
```
Similarly, `K` can go in parallel with any of the ops after it, since none of them depend on it. In contrast, the following graph will not allow parallelization:

```c
K = ggml_mul_mat(ctx, model.wk, cur);
Q = ggml_mul_mat(ctx, model.wq, cur);
Q_scaled = ggml_scale(ctx, Q, scale);
Q_masked = ggml_add(ctx, Q_scaled, mask);
V = ggml_mul_mat(ctx, model.wv, cur);
V_scaled = ggml_scale(ctx, V, scale);
V_masked = ggml_add(ctx, V_scaled, mask);
```

Or technically, in this case we could parallelize (`Q_masked`, `V`), but not much more. With this approach, I don't think we even have to consider the views at all.
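
A sketch of what that extra pass could look like - hypothetical code, not existing ggml API - walking the nodes in the recorded order and greedily splitting them into batches of mutually independent ops. It assumes the graph fields `n_nodes` / `nodes` are accessible and reuses the `ggml_depends_on` helper sketched earlier:

```c
#include "ggml.h"

// helper from the earlier sketch: true if `a` transitively reads from `b`
bool ggml_depends_on(const struct ggml_tensor * a, const struct ggml_tensor * b);

// Split the nodes into consecutive batches; the ops inside a batch do not depend
// on each other, so a backend could dispatch each batch concurrently and place a
// barrier between batches. batch_start must have room for gf->n_nodes entries;
// the index of the first node of each batch is stored there, and the number of
// batches is returned.
int ggml_split_into_batches(const struct ggml_cgraph * gf, int * batch_start) {
    if (gf->n_nodes == 0) {
        return 0;
    }
    int n_batches = 0;
    int start = 0;
    for (int i = 0; i < gf->n_nodes; ++i) {
        for (int j = start; j < i; ++j) {
            // node i depends on a node in the current batch -> start a new batch at i
            if (ggml_depends_on(gf->nodes[i], gf->nodes[j])) {
                batch_start[n_batches++] = start;
                start = i;
                break;
            }
        }
    }
    batch_start[n_batches++] = start; // the final batch
    return n_batches;
}
```

On the first example above this greedy split recovers exactly the three commented groups; on the second one it yields (`K`, `Q`), (`Q_scaled`), (`Q_masked`, `V`), (`V_scaled`), (`V_masked`).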
I think I must be missing something. If the ops are going to be evaluated in the order in which they were declared anyway, what does it mean for two of them to go in parallel?
Parallel means we can run the ops concurrently. E.g. in multiple CUDA streams, or with Metal using a concurrent-dispatch command buffer followed by a sync / barrier.
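
To make the execution model concrete, a minimal CUDA-runtime sketch in C (illustrative only, not ggml code; buffer names are made up) - independent work is issued on separate streams and a synchronization call acts as the barrier:

```c
#include <stddef.h>
#include <cuda_runtime.h>

// issue two independent transfers on separate streams; the device is free to
// overlap them, and cudaDeviceSynchronize() is the barrier after the batch
// (for the copies to actually overlap, the host buffers should be pinned)
void run_concurrent_batch(float * d_a, const float * h_a,
                          float * d_b, const float * h_b, size_t n) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    cudaMemcpyAsync(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    cudaMemcpyAsync(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice, s1);
    // kernels for independent ops would be launched on s0 / s1 in the same way

    cudaDeviceSynchronize(); // barrier: everything issued above has completed

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}
```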
As with CPU threads, there is no guarantee that two kernels launched on different CUDA streams will run at the same time. They may run in any order. I think we need to be very clear that running multiple operations in parallel requires the operations to be invariant to the order of execution.

To add another example, with this wording, I don't see what would prevent executing the `ggml_cpy` and the `ggml_mul_mat` in parallel:

```c
ggml_cpy(ctx0, Kcur, ggml_view_1d(ctx0, kv_self.k, ...));

KQ = ggml_mul_mat(ctx0, ggml_view_3d(ctx0, kv_self.k, ...), Q);
```

The wording "an op cannot be executed before a previously invoked op that writes to one of its sources" does not prevent it, because the `ggml_cpy` does not write to any of the sources of `KQ` - it writes to a different view of `kv_self.k`.
Nothing would prevent the ops from executing in parallel, but I'm not convinced that this is actually wrong. In a similar example, I could be copying data into the first half of a tensor while the other op reads only the second half - in that case there is no real dependency.

I would rather have the user be explicit about the dependencies in the graph, rather than trying to solve the dependency problem in the general case. The logic seems to become too complicated and the only benefit is that we save an extra op in the graph. In the default scenario, the parallelization optimization would be disabled and the example above would evaluate correctly, without the need for carefully placed `ggml_build_forward_expand` calls.

We are still just brainstorming, so I am open to considerations and more discussion. But I'm thinking that starting to respect the invocation order would give us multiple benefits and should be easy to implement.
In the case of the KV update, it is undoubtedly wrong to not add a dependency between the `ggml_cpy` and the `ggml_mul_mat`.

Admittedly, the main difficulty of the full requirement is the complexity of the implementation. However, I think that this could actually be pretty simple to implement if we are just a bit smarter about the way we handle views. For example, if we kept a list of the views of a tensor, we could in principle simply modify the view ops to use as a source the most recent view of the source tensor (again, if we want to optimize this, we could revise this definition to "the most recent view of the source tensor whose range overlaps the current view"). Note that this may require allowing negative offsets internally in the view ops, but that's not a big issue.

The problem with requiring the user to handle these dependencies explicitly is that if they don't, the program will silently produce wrong results, and that is a very difficult problem to diagnose unless you are already an expert in ggml. In my opinion that's not good if we want ggml to be widely used.
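
A rough sketch of the bookkeeping this idea needs - hypothetical code, not existing ggml internals. It is a simplified variant that, instead of re-basing the new view on the previous one (which is what would require the negative offsets), just records the most recent view of each tensor and adds it as an extra dependency of the next view:

```c
#include <stddef.h>
#include "ggml.h"

// hypothetical side table: for each viewed tensor, remember the latest view of it
// (or the latest op that wrote through a view of it)
struct view_record {
    struct ggml_tensor * viewed;      // the underlying tensor, e.g. kv_self.k
    struct ggml_tensor * latest_view; // most recent view of `viewed`
};

// find the most recent prior view of `viewed`, if any (linear scan for
// simplicity; a hash table keyed by `viewed` would be the natural fit here)
static struct ggml_tensor * latest_view_of(struct view_record * recs, int n,
                                           struct ggml_tensor * viewed) {
    for (int i = 0; i < n; ++i) {
        if (recs[i].viewed == viewed) {
            return recs[i].latest_view;
        }
    }
    return NULL;
}

// called whenever a new view `v` of `viewed` is created; chains `v` to the most
// recent prior view by adding it as an extra source, so the ordering becomes an
// explicit dependency in the graph (a refined version would only do this when
// the byte ranges of the two views overlap)
static void record_view(struct view_record * recs, int * n, // recs needs spare capacity
                        struct ggml_tensor * viewed, struct ggml_tensor * v) {
    struct ggml_tensor * prev = latest_view_of(recs, *n, viewed);
    if (prev != NULL) {
        for (int i = 0; i < GGML_MAX_SRC; ++i) {
            if (v->src[i] == NULL) { v->src[i] = prev; break; } // spare src slot
        }
    }
    for (int i = 0; i < *n; ++i) {
        if (recs[i].viewed == viewed) { recs[i].latest_view = v; return; }
    }
    recs[(*n)++] = (struct view_record){ viewed, v };
}
```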
#548 adds a hash table implementation (ggml/src/ggml.c, lines 16926 to 16985 at a094d9c). It should be exposed through the API and reused in ggml-alloc.