Keyformer proposes KV Cache reduction through key tokens identification and without the need for fine-tuning.
This repository contains the source code implementation of Keyformer.
This source code is available under the Apache 2.0 License.
For the readers, to get a quick overview of how keyformer reduces the KV Cache size and speeds up LLM inference, we invite you to browse through our blog -
- 🎫 Blog - Quick summary of Keyformer
But if you'd rather see keyformer in action first, please continue reading.
You can create a conda environment using the conda-env.yml
file.
conda env create --file=conda-env.yml
conda activate keyformer-env
Right now, keyformer has been tested on a 1xA100 node for mosaicml/mpt-7b
and
EleutherAI/gpt-j-6B
with the models in FP32
and evaluation done using FP16
data formats.
We welcome contributions to port and evaluate keyformer with different
data formats, targeting lower model memory requirements and overall KVCache size
Note - Keyformer has been evaluated on the following models and we restrict this tutorial to the use of these -
['cerebras/Cerebras-GPT-6.7B', 'mosaicml/mpt-7b', 'EleutherAI/gpt-j-6B']
Clone the relevant model from huggingface into the models
directory. For instance, to clone mosaicml/mpt-7b
into models/mpt-7b-keyformer
, do the following -
cd models
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/mosaicml/mpt-7b mpt-7b-keyformer
cd ..
Then, download the model weights (*.bin
) using download_model.py
file and use the model_name as the argument.
cd models/model_download
python3 download_model.py --model_name mosaicml/mpt-7b
cd ../../
Please keep in mind that supported arguments for model_name
are restricted to ['cerebras/Cerebras-GPT-6.7B', 'mosaicml/mpt-7b', 'EleutherAI/gpt-j-6B']
This step will download the model parameters by creating a model
folder inside model_download
directory.
Downloaded model parameters are required to be moved to the same directory where model files are available.
To move the model parameters from models/model_download/model/
to models/mpt-7b-keyformer
mv models/model_download/model/* models/mpt-7b-keyformer/.
Copy the Keyformer source files from models/mpt-keyformer-lib
to models/mpt-7b-keyformer
cp -r models/mpt-keyformer-lib models/mpt-7b-keyformer
This sets up models/mpt-7b-keyformer
to be used with the various tasks described below.
Similarly, find the keyformer-lib files for other supported models in their respective directories in models/
.
After setting up a model with keyformer
, let's run it with a downstream task of interest. In the case of keyformer
, we provide examples on how to run downstream tasks of summarization and conversation. Further, for the purposes of evaluation, we provide a step-by-step tutorial on how to use lm_eval_harness.
Depending on your task of interest, please refer to the following links for more details
- 📖 Summarization - Running the Summarization task with Keyformer
- 💬 Conversation - Running the Conversation task with Keyformer
- 📊 LM Eval Harness - Using LM Eval harness for evaluation with Keyformer
[ ] Instructions to integrate keyformer with any model from huggingface
[ ] Using keyformer with quantized models
Keyformer uses open source components available on huggingface, mosaicml/mpt-7b, cerebras/Cerebras-GPT-6.7B, and
EleutherAI/gpt-j-6B.
@article{2023keyformer,
title={Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference},
author={Adnan, Muhammad and Arunkumar, Akhil and Jain, Gaurav and Nair, Prashant and Soloveychik, Ilya and Kamath, Purushotham},
journal={Proceedings of Machine Learning and Systems},
volume={7},
year={2024}
}