LMCache lets LLMs prefill each text only once. By storing the KV caches of all reusable texts, LMCache can reuse the KV cache of any reused text (not necessarily a prefix) in any serving engine instance. It thus reduces prefill delay, i.e., time to first token (TTFT), and saves precious GPU cycles.
By combining LMCache with vLLM, LMCache achieves 3-10x delay savings and GPU cycle reduction in many LLM use cases, including multi-round QA and RAG.
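To make the idea concrete, here is a minimal sketch of content-addressed KV-cache reuse, assuming a toy in-memory store; the names (`kv_store`, `chunk_key`, `store_kv`, `lookup_kv`) are illustrative only, not LMCache's actual API:

```python
import hashlib

# Toy in-memory KV-cache store; LMCache's real storage backends are more elaborate.
kv_store = {}

def chunk_key(tokens: list[int]) -> str:
    """Key a chunk of token IDs by its content hash, so any serving engine
    instance that sees the same text can find it, regardless of position."""
    return hashlib.sha256(str(tokens).encode()).hexdigest()

def store_kv(tokens: list[int], kv_cache) -> None:
    """Save the KV tensors computed during prefill for one text chunk."""
    kv_store[chunk_key(tokens)] = kv_cache

def lookup_kv(tokens: list[int]):
    """Return cached KV tensors if the same text was prefilled before --
    it does not need to appear as a prompt prefix."""
    return kv_store.get(chunk_key(tokens))
```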
Try LMCache with pre-built vLLM Docker images here.
We provide a Docker-based quickstart demo in the examples/ folder. This quickstart lets you start a serving engine (vLLM) with LMCache and then query the serving engine with a long context.
First, clone and cd into the LMCache repo with

```bash
git clone https://github.com/LMCache/LMCache && cd LMCache
```
To run the quickstart demo, your server should have one GPU and a Docker environment with nvidia-runtime installed.
You may need sudo access to run Docker, depending on the server configuration.
This demo uses port 8000 (for vLLM) and port 8501 (for the frontend).
Start the Docker-based serving engine by:

```bash
bash examples/quickstart.sh
```
The vLLM serving engine is ready after you see the following lines in the log:
The quickstart comes with a frontend. To run the frontend, use:

```bash
pip install openai streamlit
streamlit run examples/quickstart-frontend.py
```
You should be able to access the frontend from your browser at `http://<your server's IP>:8501`.
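If you prefer to script against the server instead of using the frontend, vLLM exposes an OpenAI-compatible API on port 8000. A minimal sketch using the `openai` client; the model name below is a placeholder for whichever model the demo serves:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started by the quickstart.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="your-served-model-name",  # placeholder: use the model the demo serves
    messages=[{"role": "user", "content": "Summarize the long context."}],
)
print(response.choices[0].message.content)
```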
The first query has a long TTFT because the server needs to prefill the long context. But once the first query finishes, the TTFT of all future queries will be much lower, as LMCache shares the KV cache with vLLM, which can then skip prefilling the long context.
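To observe this yourself, you can time TTFT across repeated queries with streaming. A rough sketch, again assuming the quickstart server on port 8000 and a placeholder model name:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
long_context = "<paste a long document here>"  # placeholder long context

def measure_ttft(prompt: str) -> float:
    """Time from sending the request until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="your-served-model-name",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    next(iter(stream))  # the first chunk arrives once prefill completes
    return time.perf_counter() - start

print("cold TTFT:", measure_ttft(long_context))  # prefill runs, cache is populated
print("warm TTFT:", measure_ttft(long_context))  # cache hit, prefill is skipped
```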
We provide multiple demos in the 🔗LMCache-demos repo. The demos cover the following use cases:
- Share KV caches across multiple serving engines (🔗link)
- Loading non-prefix KV caches for RAG (🔗link)
- First release of LMCache
- Support installation through pip install
- Integration with latest vLLM
LMCache is built on two key techniques:
- CacheGen [SIGCOMM'24]: A KV-cache compression system that encodes KV caches into compact bitstreams (a toy sketch of the idea follows below).
- CacheBlend [EuroSys'25]: A KV-cache blending system that dynamically composes new KV caches from smaller ones.
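For intuition only, here is a toy uniform quantizer in the spirit of KV-cache compression. CacheGen's actual codec is substantially more sophisticated; everything below is illustrative, not its implementation:

```python
import numpy as np

def compress_kv(kv: np.ndarray, bits: int = 4):
    """Uniformly quantize a KV tensor to 2**bits levels.
    Stores one quantized value per byte; a real codec would also
    bit-pack and entropy-code the stream."""
    lo, hi = float(kv.min()), float(kv.max())
    levels = (1 << bits) - 1
    q = np.round((kv - lo) / (hi - lo + 1e-8) * levels).astype(np.uint8)
    return q.tobytes(), lo, hi, kv.shape

def decompress_kv(blob: bytes, lo: float, hi: float, shape, bits: int = 4) -> np.ndarray:
    """Invert the quantization (lossy: a small reconstruction error remains)."""
    levels = (1 << bits) - 1
    q = np.frombuffer(blob, dtype=np.uint8).reshape(shape)
    return q.astype(np.float32) / levels * (hi - lo) + lo
```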
Please read our blog posts for more details.