This is the code repo for CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. The current implementation is based on vLLM.
The newest updates will always be at LMCache. Stay tuned!
Python >= 3.9 and CUDA >= 12.1 are required. An Nvidia GPU with >= 40 GB of memory is recommended.
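The requirements above can be checked before installing. The following is a minimal sketch, not part of the CacheBlend repo: the helper names `parse_version` and `check_environment` are hypothetical, and it assumes PyTorch is already installed in order to inspect CUDA and GPU memory. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system toolkit.

```python
import sys

# Thresholds taken from the requirements stated in this README.
MIN_PYTHON = (3, 9)
MIN_CUDA = (12, 1)
RECOMMENDED_GPU_GB = 40


def parse_version(text):
    """Turn a version string such as '12.1' into a tuple of ints, e.g. (12, 1)."""
    return tuple(int(part) for part in text.split(".")[:2])


def check_environment():
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.9 is required")

    try:
        import torch  # assumption: PyTorch is how we query CUDA and the GPU
    except ImportError:
        problems.append("PyTorch is not installed; CUDA and GPU memory were not checked")
        return problems

    cuda = torch.version.cuda  # None on CPU-only builds of PyTorch
    if cuda is None or parse_version(cuda) < MIN_CUDA:
        problems.append("CUDA >= 12.1 is required (PyTorch reports %s)" % cuda)

    if torch.cuda.is_available():
        # Decimal gigabytes, so that a card marketed as 40 GB passes the check.
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < RECOMMENDED_GPU_GB:
            problems.append("GPU 0 has %.1f GB; >= 40 GB is recommended" % total_gb)
    else:
        problems.append("no CUDA device is visible")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Only GPU 0 is inspected here; on a multi-GPU machine the loop would need to cover every device index.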
To install CacheBlend dependencies:
git clone git@github.com:YaoJiayi/CacheBlend.git
cd CacheBlend/vllm_blend
pip install -e .
cd ..
pip install -r requirements.txt
python example/blend.py
python example/blend_musique.py
To run datasets other than musique, please replace musique with samsum or wikimqa in the above command.
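To run all three datasets in one go, the substitution described above can be scripted. This is a minimal sketch under the assumption that the scripts follow the `example/blend_<dataset>.py` naming implied by the instruction; `build_command` and `run_all` are hypothetical helpers, not part of the repo. It should be run from the CacheBlend root directory.

```python
import subprocess
import sys

# Dataset names mentioned in this README.
DATASETS = ["musique", "samsum", "wikimqa"]


def build_command(dataset):
    """Build the command for one dataset, assuming the example/blend_<dataset>.py naming."""
    return [sys.executable, "example/blend_%s.py" % dataset]


def run_all(datasets=DATASETS):
    """Run each dataset script in turn, stopping at the first failure."""
    for dataset in datasets:
        cmd = build_command(dataset)
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError if the script exits with a non-zero status.
        subprocess.run(cmd, check=True)
```

Calling `run_all()` runs the scripts sequentially, which avoids several runs competing for the same GPU memory.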
@misc{yao2024cacheblendfastlargelanguage,
title={CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion},
author={Jiayi Yao and Hanchen Li and Yuhan Liu and Siddhant Ray and Yihua Cheng and Qizheng Zhang and Kuntai Du and Shan Lu and Junchen Jiang},
year={2024},
eprint={2405.16444},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2405.16444},
}