Implement Llama longest prefix cache #158
Comments
This looks like it's good to go now. The cache is in-memory currently, so you'll need to set
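For reference, a minimal sketch of enabling the in-memory cache from Python, assuming the `Llama.set_cache()` / `LlamaCache` API discussed in this issue is available in the installed version of llama-cpp-python (the model path is a placeholder):

```python
# Sketch: attaching the in-memory prompt cache in llama-cpp-python.
# Assumes Llama.set_cache() and LlamaCache exist in the installed version.
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/model.bin")  # placeholder path

# The cache lives in memory only, so it has to be set up again on every run;
# cached prompt state is lost when the process exits.
llm.set_cache(LlamaCache())

# The first call ingests the full prompt; a later call that shares a long
# prefix with it can reuse the cached evaluation state.
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])
```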
Tried this out in oobabooga and it's a game-changer for chats with frequent editing. Before this, the model had to spend a lot of time re-ingesting the entire prompt even for a small edit.
@eiery glad to hear! I should point out though that text-generation-webui is just using the new
@abetlen I'm curious about using this feature in the Python API. I tried this approach but re-running the same
Which gives:
I also tried appending the model output to the previous prompt and sending it to the
But again got similar results:
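The commenter's actual code and output are truncated above, but a sketch of the kind of experiment being described might look like the following; the prompts, timing check, and model path are illustrative assumptions, not taken from the original comment:

```python
# Illustrative sketch (not the commenter's exact code): re-run a prompt that
# extends the previous one and see whether the cached prefix speeds it up.
import time
from llama_cpp import Llama, LlamaCache

llm = Llama(model_path="./models/model.bin")  # placeholder path
llm.set_cache(LlamaCache())

prompt = "Q: What is the capital of France? A:"

t0 = time.time()
first = llm(prompt, max_tokens=32)
print("first call:", time.time() - t0, "s")

# Append the model output to the previous prompt and continue; the new prompt
# shares a long prefix with the cached one, so ingestion should be faster.
followup = prompt + first["choices"][0]["text"] + "\nQ: And of Germany? A:"

t0 = time.time()
second = llm(followup, max_tokens=32)
print("second call:", time.time() - t0, "s")
```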
Opening this up to track the development of the new caching behaviour I'm planning to implement. This will leverage 2 significant improvements
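The details of those improvements are not preserved here, but the core "longest prefix" idea can be sketched independently of the library: keep saved evaluation states keyed by their token sequences and, for a new prompt, restore the entry that shares the longest common prefix with it. The names below (`PrefixCache`, the opaque `state` objects) are illustrative, not the project's actual implementation.

```python
# Standalone sketch of a longest-prefix cache over token sequences.
# The saved "state" would be whatever the backend can snapshot and restore
# (e.g. the bytes produced by a llama_copy_state_data-style call).
from typing import Any, Dict, Optional, Sequence, Tuple


class PrefixCache:
    def __init__(self) -> None:
        # Map from a tuple of token ids to a saved evaluation state.
        self._store: Dict[Tuple[int, ...], Any] = {}

    @staticmethod
    def _common_prefix_len(a: Sequence[int], b: Sequence[int]) -> int:
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def longest_prefix(
        self, tokens: Sequence[int]
    ) -> Optional[Tuple[Tuple[int, ...], Any, int]]:
        """Return (cached_tokens, state, shared_len) for the entry sharing the
        longest prefix with `tokens`, or None if nothing matches."""
        best = None
        best_len = 0
        for key, state in self._store.items():
            shared = self._common_prefix_len(key, tokens)
            if shared > best_len:
                best, best_len = (key, state, shared), shared
        return best

    def insert(self, tokens: Sequence[int], state: Any) -> None:
        self._store[tuple(tokens)] = state
```

With a cache like this, only the tokens after the shared prefix need to be re-evaluated, which is what makes small edits to a long chat prompt cheap.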