
Prompt Caching #91 (Closed)

lucasavila00 opened this issue Apr 8, 2024 · 1 comment

Labels: models (Additions to model or architectures), new feature (New feature or request), optimization, processing (Processing related to the model)

Comments

@lucasavila00 (Contributor):

Use case:

In a conversation, avoid re-running prefill over the entire prompt every time a new message is sent.

Store the previously computed KVs if there is enough VRAM for them (see the sketch below).
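
As a rough illustration of the use case, here is a minimal sketch in Rust (standard library only; `PrefixCache`, `KvCache`, and the token IDs are hypothetical and not mistral.rs APIs) of reusing previously computed KVs so that only the newly appended turn needs prefill:

```rust
use std::collections::HashMap;

/// Placeholder for the key/value tensors produced during prefill.
/// A real engine would hold device tensors here, not a counter.
#[derive(Clone, Debug)]
struct KvCache {
    /// Number of prompt tokens whose KVs are stored.
    cached_tokens: usize,
}

/// Hypothetical prefix cache: maps a prompt's token IDs to the KVs computed for it.
struct PrefixCache {
    entries: HashMap<Vec<u32>, KvCache>,
}

impl PrefixCache {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    /// After prefill, remember the KVs for this exact token sequence.
    fn insert(&mut self, tokens: &[u32], kv: KvCache) {
        self.entries.insert(tokens.to_vec(), kv);
    }

    /// Find the longest stored prefix of `tokens`. Prefill then only needs to
    /// run over the tokens that follow the matched prefix.
    fn longest_prefix(&self, tokens: &[u32]) -> Option<(usize, &KvCache)> {
        (1..=tokens.len())
            .rev()
            .find_map(|len| self.entries.get(&tokens[..len]).map(|kv| (len, kv)))
    }
}

fn main() {
    let mut cache = PrefixCache::new();

    // First turn of a conversation: prefill everything, then store the KVs.
    let turn1: Vec<u32> = vec![1, 15, 42, 7];
    cache.insert(&turn1, KvCache { cached_tokens: turn1.len() });

    // Second turn appends new tokens to the same conversation.
    let turn2: Vec<u32> = vec![1, 15, 42, 7, 99, 100];
    if let Some((matched, kv)) = cache.longest_prefix(&turn2) {
        println!(
            "reusing {} cached tokens, prefilling only {} new ones",
            kv.cached_tokens,
            turn2.len() - matched
        );
    }
}
```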

Examples

Automatic Prefix Caching

Fast and Expressive LLM Inference with RadixAttention and SGLang

@EricLBuehler added the optimization, models (Additions to model or architectures), processing (Processing related to the model), and new feature (New feature or request) labels and removed the backend (Backend work) label on Apr 8, 2024

@EricLBuehler (Owner) commented Apr 8, 2024

Tasklist:

  • Store the KV caches of sequences keyed by hash(token ids) to avoid the prefill step (see the sketch after this list).
  • Ensure that storing the KV caches does not cause an OOM; if VRAM runs low, swap the caches to CPU.
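
A minimal sketch of the two tasklist items, using only the Rust standard library. Everything here (`KvStore`, `KvLocation`, the byte accounting, and the eviction policy) is hypothetical and not mistral.rs code; it only illustrates keying stored KVs by hash(token ids) and demoting entries to CPU instead of running out of VRAM:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Where a cached sequence's KVs currently live.
#[derive(Debug)]
enum KvLocation {
    Gpu { bytes: usize },
    Cpu { bytes: usize },
}

/// Hypothetical store keyed by hash(token ids), with a VRAM budget.
struct KvStore {
    entries: HashMap<u64, KvLocation>,
    vram_used: usize,
    vram_budget: usize,
}

fn hash_tokens(tokens: &[u32]) -> u64 {
    let mut h = DefaultHasher::new();
    tokens.hash(&mut h);
    h.finish()
}

impl KvStore {
    fn new(vram_budget: usize) -> Self {
        Self { entries: HashMap::new(), vram_used: 0, vram_budget }
    }

    /// Cache the KVs for a sequence; swap GPU-resident entries to CPU if the
    /// new entry would exceed the VRAM budget (instead of OOM-ing).
    fn insert(&mut self, tokens: &[u32], bytes: usize) {
        while self.vram_used + bytes > self.vram_budget {
            // Pick any GPU-resident entry to demote. A real implementation
            // would use an LRU policy and actually copy tensors to host memory.
            let victim = self
                .entries
                .iter()
                .find(|(_, loc)| matches!(loc, KvLocation::Gpu { .. }))
                .map(|(k, _)| *k);
            match victim {
                Some(key) => {
                    if let Some(KvLocation::Gpu { bytes }) = self.entries.remove(&key) {
                        self.vram_used -= bytes;
                        self.entries.insert(key, KvLocation::Cpu { bytes });
                    }
                }
                None => break, // nothing left to swap; caller must shrink the request
            }
        }
        self.vram_used += bytes;
        self.entries.insert(hash_tokens(tokens), KvLocation::Gpu { bytes });
    }
}

fn main() {
    let mut store = KvStore::new(1_000);
    store.insert(&[1, 2, 3], 600);
    store.insert(&[1, 2, 3, 4], 600); // triggers a swap of the first entry to CPU
    println!("{:?} (VRAM used: {} bytes)", store.entries, store.vram_used);
}
```

Keying by hash(token ids) gives exact-sequence reuse; combining it with the prefix matching sketched in the issue body would also cover partial (prefix) reuse across conversation turns.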
