LMCache

💡 What is LMCache?

LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any reused text (not necessarily a prefix) in any serving engine instance. This reduces prefill delay, i.e., time to first token (TTFT), and saves precious GPU cycles.
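The core idea can be illustrated with a minimal sketch (purely illustrative; the class and method names below are not LMCache's actual API): KV caches are keyed by reusable text chunks, so any previously seen chunk can skip prefill, regardless of where it appears in the new prompt.

```python
# Illustrative sketch only -- not LMCache's actual API or data structures.
import hashlib

class ChunkKVStore:
    def __init__(self):
        self._store = {}  # chunk hash -> KV cache tensors (placeholder: any object)

    @staticmethod
    def _key(chunk: str) -> str:
        return hashlib.sha256(chunk.encode()).hexdigest()

    def put(self, chunk: str, kv_cache) -> None:
        self._store[self._key(chunk)] = kv_cache

    def get(self, chunk: str):
        # Returns the stored KV cache if this exact chunk was prefilled before,
        # whether or not it is a prefix of the current prompt.
        return self._store.get(self._key(chunk))
```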

By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.

Try LMCache with pre-built vLLM Docker images here.

🚀 Performance snapshot

[Performance snapshot image]

💻 Quickstart

We provide a Docker-based quickstart demo in the examples/ folder. This quickstart lets you start a serving engine (vLLM) with LMCache and then query it with a long context.

- Prerequisites

First, clone and cd into the LMCache repo with

git clone https://github.com/LMCache/LMCache && cd LMCache

To run the quickstart demo, your server needs one GPU and a Docker environment with the NVIDIA container runtime (nvidia-runtime) installed.

You may need sudo access to run Docker, depending on the server configuration.

This demo uses port 8000 (for vLLM) and port 8501 (for the frontend).

- Start the serving engine with LMCache

Start the Docker-based serving engine with:

bash examples/quickstart.sh

The vLLM serving engine is ready once the server startup lines appear in the log.
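If you prefer a programmatic readiness check, here is a small sketch (assuming the demo's default port 8000) that polls vLLM's OpenAI-compatible /v1/models endpoint until it responds:

```python
# Minimal readiness check for the quickstart's vLLM server (assumes port 8000).
# Uses only the Python standard library.
import json
import time
import urllib.request

def wait_for_vllm(url="http://localhost:8000/v1/models", timeout_s=300):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                models = json.load(resp)
                print("Server is up, serving:", [m["id"] for m in models["data"]])
                return True
        except OSError:
            time.sleep(5)  # server not ready yet, retry
    return False

if __name__ == "__main__":
    wait_for_vllm()
```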

- Start the frontend

The quickstart comes with a frontend. To run the frontend, use:

pip install openai streamlit
streamlit run examples/quickstart-frontend.py

You should be able to access the frontend from your browser at http://<your server's IP>:8501

The first query has a long TTFT because the server needs to prefill the long context. But once the first query finishes, the TTFT of all future queries will be much lower, because LMCache shares the KV cache with vLLM, which can then skip the prefill of the long context.
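You can also observe this effect without the frontend by calling vLLM's OpenAI-compatible API directly and timing the first streamed token. Below is a rough sketch; the model name and the long-context file are placeholders you would replace with whatever the demo actually serves.

```python
# Rough TTFT measurement against the quickstart's vLLM endpoint (port 8000).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

long_context = open("my_long_document.txt").read()  # placeholder long context

def time_to_first_token(question: str) -> float:
    start = time.time()
    stream = client.chat.completions.create(
        model="YOUR_MODEL_NAME",  # placeholder: use the model the demo serves
        messages=[{"role": "user", "content": long_context + "\n\n" + question}],
        stream=True,
    )
    for _ in stream:  # the first chunk arrives only after prefill finishes
        return time.time() - start

# The first call pays the full prefill; later calls reuse the cached KV.
print("1st query TTFT:", time_to_first_token("Summarize the document."))
print("2nd query TTFT:", time_to_first_token("List the key points."))
```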

- What's next

We provide multiple demos in the 🔗LMCache-demos repo. The demos cover the following use cases:

  • Share KV caches across multiple serving engines (🔗link)
  • Loading non-prefix KV caches for RAG (🔗link)

🛣️ Project Milestones

  • First release of LMCache
  • Support installation through pip install
  • Integration with latest vLLM

📖 Blogs and papers

LMCache is built on two key techniques:

  1. CacheGen [SIGCOMM'24]: A KV-cache compression system that encodes KV caches into compact bitstreams.
  2. CacheBlend [EuroSys'25]: A KV-cache blending system that dynamically composes new KV caches from smaller ones.

Please read our blog posts for more details.
