Llama3.1 and kv_cache quantization #738

Merged: 4 commits into main on Aug 27, 2024
Conversation

@HDCharles (Contributor) commented Aug 23, 2024

This PR adds support for Llama 3.1, along with improvements to kv_cache quantization and to general peak-memory performance for llama.

At a high level, we can now run inference with a 130k context length in 18.9 GB of peak memory by applying kv_cache quantization, the linear causal mask, and int4 weight-only quantization.

Summary of changes:

  1. add 3.1 support for llama
  2. change the quantized_kv_cache init so it doesn't create a full-precision peak: see below
  3. reorder the causal mask init: see below
  4. add an option for a linear causal mask: see below
  5. add a cache_size option: the default generate.py behavior requires you to generate 32k tokens if you want a size-32k kv_cache/causal_mask; the cache_size option lets you simply set the cache size while generating fewer tokens, which makes benchmarking easier
  6. add an option to generate a memory profile: used to generate the images below

[image: line chart of peak memory (GB) vs. context length]

| context length (tokens) | normal peak (GB) | kv_quant peak (GB) | kv_quant + causal fix peak (GB) |
|---|---|---|---|
| 8192   | 17.86 | 17.52 | 17.47 |
| 16384  | 19.81 | 18.75 | 18.48 |
| 32768  | 23.83 | 21.72 | 20.64 |
| 65536  | 33.5  | 29.54 | 25.24 |
| 131072 | 59.27 | 52.62 | 34.18 |

Change to quantized kv_cache init

The first change avoids creating the full-precision kv_cache. Previously we would initialize the kv_cache in full precision and then convert it to the quantized form, as seen in this memory profile:

[image: memory profile of the quantized kv_cache init]

The horizontal lines from ~16.1 GB to 16.6 GB are the normal kv_cache; you can see them being deallocated on the right side of the image as the quantized kv_caches are instantiated. This created an unnecessary increase in peak memory whenever initialization was the peak (which was the case for very long context lengths).
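
To make the change concrete, here is a minimal sketch of the idea (class and buffer names are illustrative, not the exact torchao implementation): allocate the int8 cache buffers and their scales directly, so the transient full-precision cache never exists.

```python
import torch

class QuantizedKVCacheSketch(torch.nn.Module):
    """Illustrative only: allocate quantized storage directly instead of
    building a full-precision cache and converting it afterwards."""

    def __init__(self, max_batch_size, n_heads, max_seq_length, head_dim,
                 scale_dtype=torch.bfloat16):
        super().__init__()
        cache_shape = (max_batch_size, n_heads, max_seq_length, head_dim)
        scale_shape = (max_batch_size, n_heads, max_seq_length, 1)
        # int8 buffers are created directly; the full-precision cache that
        # previously caused the transient memory peak is never allocated
        self.register_buffer("k_cache", torch.zeros(cache_shape, dtype=torch.int8))
        self.register_buffer("v_cache", torch.zeros(cache_shape, dtype=torch.int8))
        self.register_buffer("k_scale", torch.ones(scale_shape, dtype=scale_dtype))
        self.register_buffer("v_scale", torch.ones(scale_shape, dtype=scale_dtype))
```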

Change to causal mask

[image: memory profile, 32k context length, no kv_cache quantization]

This is a memory profile for 32k context length without kv_cache quantization or any other changes; compare it to one with kv_cache quantization:

[image: memory profile, 32k context length, with kv_cache quantization]

The horizontal bands that run from 16 GB to 20.5 GB in the top image and to 18.5 GB in the bottom one are the kv_cache. With quantization it's 2 GB smaller, which shows the technique is performing as expected. However, a large blue (top) or green (bottom) blob with a spike on the left side also appears in the memory profile: this is the causal mask.

Normally the causal mask is handled by creating a (token_length x token_length) tensor of ones, then creating a lower-triangular copy of it and taking slices from it throughout the model run. Notice the sharp peak right at the start: copying a tensor of ones into a lower-triangular matrix requires holding two instances in memory for a moment, doubling its impact on top of the O(context_length^2) memory it already takes. The doubling issue was solved by creating the causal mask before the kv_cache; done that way, the momentary doubling spike doesn't affect the peak memory, since the kv_cache allocation is higher than the spike.
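
For reference, here is a minimal sketch of the original mask construction and where the transient doubling comes from (the context length is illustrative):

```python
import torch

L = 32768  # context length
# Original approach: build a ones tensor, then take its lower triangle.
ones = torch.ones(L, L, dtype=torch.bool)
causal_mask = torch.tril(ones)  # a second L x L tensor is allocated here,
                                # so both copies briefly coexist -> ~2x spike
# Slices of causal_mask are then used during the model run.
```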

[image: memory profile with the causal mask created before the kv_cache]

Although the earlier instantiation of the causal mask helps (it's the red blob now), it still takes up a ton of space, especially at even higher context lengths, which eats into the gains we expect from kv_cache quantization. Why do we need to actually store the causal mask, though? A slice of the causal mask is essentially just a sequence of n ones followed by context_length - n zeros, where n is the current token being generated. Each slice differs from the next by only a single value, so we can store just the slice and update it each iteration instead. Result:

[image: memory profile with the linear causal mask]
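
A minimal sketch of the linear causal mask idea described above (names and shapes are illustrative, not the exact implementation behind the --linear_causal_mask flag): keep one O(context_length) row and flip a single element per decoded token.

```python
import torch

max_seq_length = 131072
# One row instead of an O(L^2) matrix: position i becomes True once token i is visible.
linear_causal_mask = torch.zeros(1, 1, max_seq_length, dtype=torch.bool)

def update_linear_mask(input_pos: int) -> torch.Tensor:
    # Each decode step differs from the previous one by a single entry,
    # so we just mark the newly generated position as visible.
    linear_causal_mask[0, 0, input_pos] = True
    return linear_causal_mask
```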

Tests:

See benchmarks.sh.

The 18.9 GB number came from:

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt --cache_size 131072 --kv_cache_quantization --linear_causal_mask --quantization int4wo-64

pytorch-bot bot commented Aug 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/738

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit c5e4dcb with merge base 37276d6:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Aug 23, 2024
@msaroufim (Member)

Mostly looking good!

  1. There's a merge conflict; @gau-nernst recently added training support for gpt-fast.
  2. The memory traces you shared seem compelling; let's have the baseline be gpt-fast as-is and the intervention be kv-cache quantization + reshuffled mask init + vector mask.
  3. cc @Jack-Khuu and @kartikayk since this is landing soon.

Summary:

TODO: finish kv_cache testing
generate memory_trace

Added the 3.1 frequency rescaling and model definitions

testing is ongoing

Test Plan: python eval.py --checkpoint_path $../../../checkpoints/meta-llama/Meta-Llama-3.1-8B/model.pth --compile

wikitext: {'word_perplexity,none': 7.441690325135099, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.4554823564993407, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.541497351075118, 'bits_per_byte_stderr,none': 'N/A', 'alias': 'wikitext'}

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt --max_new_tokens 16384
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt --kv_cache_quantization --max_new_tokens 16384
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt --max_new_tokens 32768
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --write_result benchmark_results.txt --kv_cache_quantization --max_new_tokens 32768

@HDCharles (Contributor, Author)

> Mostly looking good!
>
> 1. There's a merge conflict; @gau-nernst recently added training support for gpt-fast.
> 2. The memory traces you shared seem compelling; let's have the baseline be gpt-fast as-is and the intervention be kv-cache quantization + reshuffled mask init + vector mask.
> 3. cc @Jack-Khuu and @kartikayk since this is landing soon.

  1. Fixed.
  2. gpt-fast errors out even at 32k context length; it requires a bunch of fixes to even get it working, so I don't have a good way to compare apples to apples. At the moment I'm comparing normal performance (with reordered causal mask init) vs. kv_cache quantization vs. kv_cache quantization + linear causal mask.

@msaroufim (Member) commented Aug 27, 2024

This feels good to merge to me. FWIW, @iseeyuan and @felipemello1 have also noticed a large difference from reducing the memory requirements of the logits (pytorch/executorch#4688) from O(context length) to O(1).

Also, remind me: did you also quantize the model? (Seems like no?) I'm trying to see whether we can hit a 24 GB VRAM budget or whether we need to explore int4 kv quantization. It'd be pretty sick to do full llama 8b inference at a 128K context length; that should, for example, be enough to fit the entire AO code repo.

Also, mind adding the top-line VRAM requirements at the top? The line chart doesn't have even ranges on the y-axis (log scale?), so it's a bit hard to eyeball.

@HDCharles merged commit 86c7b0d into main on Aug 27, 2024
16 checks passed
@vadimkantorov commented Aug 28, 2024

Could the mask be generated via a broadcasting trick (an arange broadcast and compared to another arange broadcast differently) to avoid the need for ones and then tril? Or not in this context? Otherwise, does FlexAttention allow avoiding materialization of such masks and computing the masking directly during attention? (I thought flash attention supported such materialization-free causal masks too...)
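
For reference, a minimal sketch of the broadcasting trick being described here (illustrative only; not what this PR implements): comparing two aranges produces the lower-triangular mask directly, without allocating a ones tensor and then a tril copy.

```python
import torch

seq_len = 8
rows = torch.arange(seq_len).unsqueeze(1)  # shape (seq_len, 1)
cols = torch.arange(seq_len).unsqueeze(0)  # shape (1, seq_len)
causal_mask = cols <= rows                 # (seq_len, seq_len) bool, lower triangular;
                                           # no intermediate ones tensor or tril copy
```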

@kir152 commented Aug 29, 2024

Great work on the memory optimizations! Have you measured any impact on model accuracy or perplexity with this method?
