LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any reused text (not necessarily a prefix) in any serving engine instance. It thus reduces prefill delay, i.e., time to first token (TTFT), and saves precious GPU cycles.
By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.
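To make the idea concrete, here is a minimal sketch of content-addressed KV-cache reuse, assuming a toy in-memory store; the names (`kv_store`, `chunk_key`, `store_kv`, `lookup_kv`) are illustrative only, not LMCache's actual API:

```python
import hashlib

# Toy in-memory KV-cache store; LMCache's real storage backends are more elaborate.
kv_store = {}

def chunk_key(tokens: list[int]) -> str:
    """Key a chunk of token IDs by its content hash, so any serving engine
    instance that sees the same text can find it, regardless of position."""
    return hashlib.sha256(str(tokens).encode()).hexdigest()

def store_kv(tokens: list[int], kv_cache) -> None:
    """Save the KV tensors computed during prefill for one text chunk."""
    kv_store[chunk_key(tokens)] = kv_cache

def lookup_kv(tokens: list[int]):
    """Return cached KV tensors if the same text was prefilled before --
    it does not need to appear as a prompt prefix."""
    return kv_store.get(chunk_key(tokens))
```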
Try LMCache with pre-built vLLM Docker images here.
We provide a Docker-based quickstart demo in the examples/ folder. This quickstart lets you start a serving engine (vLLM) with LMCache and then query the serving engine with a long context.
First, clone and cd into the LMCache repo with

```bash
git clone https://github.com/LMCache/LMCache && cd LMCache
```
To run the quickstart demo, your server should have one GPU and a Docker environment with nvidia-runtime installed.
You may need sudo access to run Docker, depending on the server configuration.
This demo uses port 8000 (for vLLM) and port 8501 (for the frontend).
Start the Docker-based serving engine by:

```bash
bash examples/quickstart.sh
```
The vLLM serving engine is ready after you see the following lines in the log:
The quickstart comes with a frontend. To run the frontend, use:

```bash
pip install openai streamlit
streamlit run examples/quickstart-frontend.py
```
You should be able to access the frontend from your browser at `http://<your server's IP>:8501`.
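If you prefer to script against the server instead of using the frontend, vLLM exposes an OpenAI-compatible API on port 8000. A minimal sketch using the `openai` client; the model name below is a placeholder for whichever model the demo serves:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started by the quickstart.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-served-model-name",  # placeholder: use the model the demo serves
    messages=[{"role": "user", "content": "Summarize the long context."}],
)
print(response.choices[0].message.content)
```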
The first query has a long TTFT because the server needs to prefill the long context. But once the first query finishes, the TTFT of all future queries will be much lower, as LMCache shares the KV cache with vLLM, which can then skip prefilling the long context.
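To observe this yourself, you can time TTFT across repeated queries with streaming. A rough sketch, again assuming the quickstart server on port 8000 and a placeholder model name:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
long_context = "<paste a long document here>"  # placeholder long context

def measure_ttft(prompt: str) -> float:
    """Time from sending the request until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="your-served-model-name",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    next(iter(stream))  # the first chunk arrives once prefill completes
    return time.perf_counter() - start

print("cold TTFT:", measure_ttft(long_context))  # prefill runs, cache is populated
print("warm TTFT:", measure_ttft(long_context))  # cache hit, prefill is skipped
```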
We provide multiple demos in the 🔗LMCache-demos repo. The demos cover the following use cases:
- Share KV caches across multiple serving engines (🔗link)
- Loading non-prefix KV caches for RAG (🔗link)
- First release of LMCache
- Support installation through pip install
- Integration with latest vLLM
LMCache is built on two key techniques:
- CacheGen [SIGCOMM'24]: A KV-cache compression system that encodes KV caches into compact bitstreams (a toy sketch of the idea follows below).
- CacheBlend [EuroSys'25]: A KV-cache blending system that dynamically composes new KV caches from smaller ones.
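For intuition only, here is a toy uniform quantizer in the spirit of KV-cache compression. CacheGen's actual codec is substantially more sophisticated; everything below is illustrative, not its implementation:

```python
import numpy as np

def compress_kv(kv: np.ndarray, bits: int = 4):
    """Uniformly quantize a KV tensor to 2**bits levels.
    Stores one quantized value per byte; a real codec would also
    bit-pack and entropy-code the stream."""
    lo, hi = float(kv.min()), float(kv.max())
    levels = (1 << bits) - 1
    q = np.round((kv - lo) / (hi - lo + 1e-8) * levels).astype(np.uint8)
    return q.tobytes(), lo, hi, kv.shape

def decompress_kv(blob: bytes, lo: float, hi: float, shape, bits: int = 4) -> np.ndarray:
    """Invert the quantization (lossy: a small reconstruction error remains)."""
    levels = (1 << bits) - 1
    q = np.frombuffer(blob, dtype=np.uint8).reshape(shape)
    return q.astype(np.float32) / levels * (hi - lo) + lo
```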
Please read our blog posts for more details.