A simple implementation of Accelerating Large Language Model Decoding with Speculative Sampling in NumPy for GPT-2. See main.py
. I also wrote a blog post for this implementation.
Install Dependencies:
pip install -r picoGPT/requirements.txt
Tested on Python 3.9.10
.
Usage:
python main.py \
--prompt "Alan Turing theorized that computers would one day become" \
--n_tokens_to_generate 40 \
--draft_model_size "124M" \
--target_model_size "1558M" \
--K 4 \
--temperature 0 # 0 for greedy sampling
Which outputs:
Autoregressive Decode
---------------------
Time = 60.64s
Text = Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.
In the 1950s, he proposed a way to build a computer that could think like a human. He called it the "T
Speculative Decode
------------------
Time = 27.15s
Text = Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.
In the 1950s, he proposed a way to build a computer that could think like a human. He called it the "T