
Is LayerSkip / self-speculative decoding possible (requires getting one intermediate layer's output + some KV cache changes)? #10820

Open · tc-wolf opened this issue Dec 13, 2024 · Discussed in #10787 · 1 comment

tc-wolf commented Dec 13, 2024

Discussed in #10787

Originally posted by tc-wolf December 11, 2024

Summary

There is a technique called LayerSkip (https://arxiv.org/abs/2404.16710) that accelerates decoding without a separate draft model (self-speculative decoding). As I understand it, the models are trained with layer dropout plus an early-exit loss so that earlier layers can predict tokens directly (and a subset of layers can therefore draft tokens for a speedup). There is a collection of Llama models trained with this property, so implementing this would be useful.

A basic implementation would split the model and use the first E layers as a separate draft model, but this is not optimal (weights and KV cache would not be reused). This was the initial strategy used in the transformers library before a more optimized implementation was merged.
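For reference, here is roughly what the merged transformers path looks like from the user side; this is a sketch, and the `assistant_early_exit` argument name, supported versions, and the checkpoint ID are assumptions on my part rather than something I've verified:

```python
# Sketch of the transformers self-speculative path (assumed API): draft with the
# first few layers + LM head, verify with the full model, all inside generate().
# The `assistant_early_exit` argument name and the checkpoint ID are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/layerskip-llama2-7B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tok("The capital of France is", return_tensors="pt").to(model.device)
out = model.generate(**inputs, assistant_early_exit=4, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```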

Challenges

To get "draft" tokens, we'd need to determine an exit layer, get outputs from that layer, and pass them through the LM head of the model. I could see this as being possible with llama.cpp if at startup time we make a split in the graph / make a separate graph with shared weights for early exit and the layer for early exit is fixed.

(Verification is done by running the remaining layers over the drafted positions, predicting the token that follows each drafted token with the full model, and looking for a disagreement, as with "normal" speculative decoding with a separate draft model.)
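To make the draft/verify loop concrete, here is a rough greedy sketch of one step. This is not llama.cpp code: `run_layers`, `norm`, and `lm_head` are hypothetical helpers, and E is the fixed exit layer.

```python
# Rough greedy sketch of one self-speculative step. run_layers(tokens, lo, hi)
# is a hypothetical helper that runs transformer layers [lo, hi) over `tokens`
# and returns one hidden state per position; norm() and lm_head() map a hidden
# state to logits. No sampling, no KV-cache plumbing - just the control flow.
def self_speculative_step(ctx_tokens, E, n_layers, n_draft):
    # 1) Draft: run only the first E layers and decode n_draft tokens greedily.
    draft = []
    for _ in range(n_draft):
        h = run_layers(ctx_tokens + draft, 0, E)
        draft.append(int(lm_head(norm(h[-1])).argmax()))

    # 2) Verify: run the full model once over the context + drafted tokens.
    #    (The first E layers could reuse the draft pass's KV cache here.)
    h_full = run_layers(ctx_tokens + draft, 0, n_layers)
    verified = [int(lm_head(norm(h_full[len(ctx_tokens) - 1 + i])).argmax())
                for i in range(len(draft) + 1)]

    # 3) Accept drafted tokens until the first disagreement, then append the
    #    full model's token at that position (so we always gain >= 1 token).
    accepted = []
    for i, tok in enumerate(draft):
        if tok != verified[i]:
            break
        accepted.append(tok)
    accepted.append(verified[len(accepted)])
    return accepted
```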

This also has some complications for how the KV cache is handled, since the KV cache of the draft pass (the first E layers) can be reused as part of the full model's KV cache, and the exit layer's query vector can be saved as well. I still need to read this section of the paper and look at the existing implementations more thoroughly to understand the details.
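As a rough picture of the sharing (an illustration only, not llama.cpp's actual KV cache layout):

```python
# Illustration only: one cache per layer, keyed by position. The draft pass
# fills layers 0..E-1; for the same positions the verification pass only has to
# fill layers E..n_layers-1, and positions whose drafted tokens are rejected
# have to be rolled back across all layers.
class SharedKVCache:
    def __init__(self, n_layers):
        self.kv = [{} for _ in range(n_layers)]  # layer -> {position: (K, V)}

    def store(self, layer, pos, k, v):
        self.kv[layer][pos] = (k, v)

    def layers_missing(self, pos):
        # After the draft pass, this is E..n_layers-1 for the drafted positions.
        return [l for l, cache in enumerate(self.kv) if pos not in cache]

    def rollback(self, pos):
        # Drop a position whose drafted token was rejected during verification.
        for cache in self.kv:
            cache.pop(pos, None)
```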

I've looked at #4224 and #2783, but I think this may be easier to implement, since the layer at which to exit early can be fixed before inference begins (so it doesn't require the full output of every hidden layer, and the graph could be split ahead of time).

Requests

  • Any advice on making a GGUF model from a subset of layers of Llama-architecture models? I think I can just limit n_layers in the model's config.json (from looking at convert_hf_to_gguf.py), but I'm not sure whether I'd have to override anything elsewhere, or whether it would take a little graph surgery to keep the first E layers plus the final LM head; see the sketch after this list.
    • The immediate thing I want to try myself is converting and quantizing their checkpoints, making a subset of the layers into a draft model, and measuring the token generation rate while treating them as fully separate models.
  • Do you think it's plausible to implement this (with weight sharing and cache reuse) in llama.cpp, or would it run into complications because of the way the graph is constructed / inference is done?
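For the first bullet, the approach I have in mind is to truncate the checkpoint in transformers before conversion and then run convert_hf_to_gguf.py on the saved directory. A minimal sketch (the checkpoint ID is just an example, and the conversion flags may differ):

```python
# Sketch: build a "draft" checkpoint from the first E layers of a LayerSkip model,
# then convert the saved directory with convert_hf_to_gguf.py as usual.
from transformers import AutoModelForCausalLM, AutoTokenizer

E = 8                                      # exit layer / number of layers to keep
model_id = "facebook/layerskip-llama2-7B"  # example checkpoint from the collection
out_dir = "layerskip-draft-8L"

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tok = AutoTokenizer.from_pretrained(model_id)

# Keep only the first E decoder layers; embeddings, final norm, and LM head stay.
model.model.layers = model.model.layers[:E]
model.config.num_hidden_layers = E

model.save_pretrained(out_dir)
tok.save_pretrained(out_dir)
# Then e.g.: python convert_hf_to_gguf.py layerskip-draft-8L --outfile draft-8L.gguf
```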

foldl (Contributor) commented Dec 19, 2024

You can do some experiments with generating drafts using only the first X layers on the fly with this (I named it "Layer shuffling"):

https://github.com/foldl/chatllm.cpp/blob/master/docs/fun.md
