
[Core generation] Adds support for static KV cache #27931

Merged · 121 commits · Feb 8, 2024
Conversation

@ArthurZucker (Collaborator) commented Dec 10, 2023

~4x speedups with cuda graphs! 🥳

Currently getting ~4x speedups compared to the dynamic cache with torch.compile for a single forward pass (agnostic to batch size, but faster for smaller batches).

The forward pass itself is very, very fast, but materializing the input costs a bit!
~10ms per forward pass is what we get down to!

  • Refactors the way we deal with the attention mask:
    • causal and padding masks are separated
    • does not rely on the past_key_values
    • merged in 2 lines (see the sketch just after this list). No attention mask utils are needed, no extra complicated logic, everything is explicit
    • LlamaAttention is not self-contained; this added 20% overhead in a simple forward
    • Gets rid of the entire mask_attn_utils 😄
  • Saves the cache class in the generation config
  • Inits the cache with the batch size (from the generate call) and the max_length from the generation config (taking max_new_tokens into account)
  • torch.compile
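To make the "merged in 2 lines" point concrete, here is a toy sketch (not the code from this PR) of combining a causal mask and a padding mask with a single triu plus a masked_fill; the shapes and the 1/0 padding-mask convention are assumptions for illustration only.

```python
import torch

def merge_causal_and_padding(padding_mask: torch.Tensor, q_len: int, dtype=torch.float16):
    """Toy illustration: build one additive attention mask from causal + padding.

    padding_mask: (batch, kv_len), 1 for real tokens, 0 for padding.
    Returns an additive mask of shape (batch, 1, q_len, kv_len).
    """
    bsz, kv_len = padding_mask.shape
    min_value = torch.finfo(dtype).min
    # Causal part: penalize every key position that lies in the future of its query.
    causal = torch.full((q_len, kv_len), min_value, dtype=dtype).triu(diagonal=kv_len - q_len + 1)
    # Padding part: also penalize key positions that are padding, in one extra line.
    return (causal[None, None, :, :]
            .expand(bsz, 1, q_len, kv_len)
            .masked_fill(padding_mask[:, None, None, :] == 0, min_value))

# Example: batch of 2, second sequence has one padding token at the end.
pad = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])
print(merge_causal_and_padding(pad, q_len=4).shape)  # torch.Size([2, 1, 4, 4])
```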

Benchmark using af097af

Use it in generate: EDIT: TO COME
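In the meantime, a minimal sketch of what the generate-side usage could look like, under the assumption that the cache class added here is exposed through the generation config as a "static" cache implementation (the flag name and the model checkpoint below are placeholders, not the official snippet):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The theory of relativity states", return_tensors="pt").to(model.device)

# Assumed interface: the generation config selects the static cache, which is then
# pre-allocated from the batch size of the call and max_length (max_new_tokens included).
model.generation_config.cache_implementation = "static"
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```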

Failing test left for @gante

This is related to the fact that I don't return past_key_values (it is None), so test_new_cache_format fails. I don't want to dive into this.

fixes #28075, fixes #28610, fixes #28190


@ArthurZucker changed the title from "[Core genration] Adds support for static KV cache" to "[Core generation] Adds support for static KV cache" on Dec 12, 2023
@oobabooga (Contributor)
If I understand correctly, this PR should close the existing gap between inference with transformers + AutoGPTQ and inference with ExLlama, as the VRAM usage would become much more controlled. I'm rooting for it :)

@ArthurZucker (Collaborator, Author)

Thanks! 🤗

@patrickvonplaten (Contributor)

Exciting PR!

@xkszltl (Contributor) commented Feb 26, 2024

Could you help clarify the removal of this comment?
[image]
It was probably removed by mistake, as attn_weights is still not supported in FA2.
We should also add a warning before setting it to False, just like the SDPA counterpart does.

@paulcx commented Mar 6, 2024

Hi @ArthurZucker, it seems that the increase in VRAM could potentially lead to out-of-memory (OOM) errors (comment1, comment2), as pointed out in this PR by @danielhanchen.

It seems a change was made in another PR that allocates a causal mask of size (16384, 16384): https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L940

The triu causes the causal mask to be upcast to float32, using 16384^2 * 4 bytes = 1GB of extra VRAM. Your screenshot shows n^2 * 4 bytes / 1024^3 = 37.25GB, so I'm assuming you're also doing RoPE scaling to a 100K context length, i.e. a (100K, 100K) matrix was being created.

Could you please take a look into it?
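For reference, the arithmetic above as a quick check (added here for illustration; the 37.25GB figure corresponds to n ≈ 100K):

```python
def dense_mask_gib(n: int, bytes_per_elem: int = 4) -> float:
    """Memory of a dense (n, n) float32 mask in GiB."""
    return n * n * bytes_per_elem / 1024**3

print(dense_mask_gib(16_384))   # ~1.0 GiB
print(dense_mask_gib(100_000))  # ~37.25 GiB
```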

@gante (Member) commented Mar 6, 2024

@paulcx your issue is related to this one (#29484) -- let's keep the discussion there! :)

@ArthurZucker (Collaborator, Author)

Yes! And @paulcx, I'm sorry this broke for you.

@fxmarty (Contributor) commented Mar 18, 2024

See #29241 which alleviates but does not fix the issue @paulcx

@aliencaocao (Contributor)

Does this work for llava? From my testing it doesn't.

@paulcx commented Mar 19, 2024

See #29241 which alleviates but does not fix the issue @paulcx

Does this temporary fix work for 200K?

@ArthurZucker (Collaborator, Author)

No, you have to update max_position_embedding. It is allocating 200K because you set it to 200K while your machine does not support a 200K input.

@ArthurZucker (Collaborator, Author) commented Mar 19, 2024

max_position_embedding is only used for the causal_mask now, whereas previously it was used for sin and cos. In both cases it is there to cache the maximum number of positions that will be passed to the model. If you have a 200K context length, that does not mean you can do inference / training with it!

We could also just remove it, but then you would need to allocate the causal_mask at each forward pass.
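To make the trade-off concrete, a rough sketch (not the actual modeling code) of the two options: pre-allocating a boolean causal mask once up to max_position_embeddings and slicing a view per forward, versus rebuilding the mask on every forward pass. Keeping the buffer boolean also sidesteps the float32 upcast discussed above.

```python
import torch

max_position_embeddings = 4096  # lowering this shrinks the pre-allocated buffer

# Option 1: allocate once at init, slice a view at every forward.
full_causal = torch.triu(
    torch.ones(max_position_embeddings, max_position_embeddings, dtype=torch.bool), diagonal=1
)

def causal_mask_cached(q_len: int, kv_len: int) -> torch.Tensor:
    # True marks positions that must NOT be attended to.
    return full_causal[kv_len - q_len : kv_len, :kv_len]

# Option 2: no persistent buffer, rebuild the mask for every forward pass.
def causal_mask_rebuilt(q_len: int, kv_len: int) -> torch.Tensor:
    return torch.triu(torch.ones(q_len, kv_len, dtype=torch.bool), diagonal=kv_len - q_len + 1)

assert torch.equal(causal_mask_cached(1, 17), causal_mask_rebuilt(1, 17))
```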

@paulcx commented Mar 20, 2024

@ArthurZucker If I understand correctly, my thought is to lower max_position_embedding to something like 4096, since it only affects the initial position embedding and mask caching. During inference, lengths approaching 200K would still be computed, just more slowly. With this workaround, normal training and inference for non-200K cases should still work. Is my understanding correct?

@ArthurZucker (Collaborator, Author) commented Mar 21, 2024

This was fixed by #29753! Sorry @paulcx for the inconvenience. For the static / compile cache you should still reduce the max position embeddings or it will OOM 😉

@paulcx commented Mar 21, 2024

static / compile cache

Thank you, and great work on the new release, @ArthurZucker.

Would you mind clarifying the use case of "static / compile cache" in the release notes? I'm not sure I understand it correctly.

@ArthurZucker (Collaborator, Author)

It is mostly this: https://gist.github.com/ArthurZucker/af34221def212259b43d55a2811d2dbb. You can get ~4x generation speed in transformers with torch.compile and the static cache!
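For readers who don't want to open the gist, a condensed sketch in the same spirit (a sketch, not the gist itself: the StaticCache constructor arguments and the cache_position keyword are assumed to match the API introduced around this PR and may differ in your transformers version):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StaticCache

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

inputs = tokenizer("My favourite condiment is", return_tensors="pt").to(model.device)
input_ids = inputs["input_ids"]

# Pre-allocate the KV cache so every decode step sees static shapes.
past_key_values = StaticCache(
    config=model.config, max_batch_size=1, max_cache_len=256,
    device=model.device, dtype=torch.float16,
)

# Compile the decode step once; later calls reuse the captured graph.
decode_step = torch.compile(model, mode="reduce-overhead", fullgraph=True)

cache_position = torch.arange(input_ids.shape[1], device=model.device)
generated = input_ids
with torch.no_grad():
    # Prefill the prompt eagerly, then greedily decode one token at a time.
    out = model(input_ids, past_key_values=past_key_values,
                cache_position=cache_position, use_cache=True)
    next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    generated = torch.cat([generated, next_token], dim=-1)
    for _ in range(32):
        cache_position = cache_position[-1:] + 1
        out = decode_step(next_token, past_key_values=past_key_values,
                          cache_position=cache_position, use_cache=True)
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```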

@aliencaocao (Contributor)

It is mostly this: https://gist.github.com/ArthurZucker/af34221def212259b43d55a2811d2dbb. You can get ~4x generation speed in transformers with torch.compile and the static cache!

Is this expected to work with llava-next?

@ArthurZucker (Collaborator, Author)

I believe so, yes; if not, we can add support for it.

@ArthurZucker (Collaborator, Author)

Feel free to open an issue if it doesn't work.

@aliencaocao (Contributor)

I have tried, and it doesn't work because the vision tower changes the shape of the inputs after encoding them into patches. Also, it doesn't work with bnb 4-bit.

@ArthurZucker (Collaborator, Author)

bnb is a different issue; torch.compile might not support it (int8 yes).
For the encoder part, cc @NielsRogge, that could be nice to have.

@aliencaocao (Contributor)

We are using NF4 for bnb.

@nxphi47 (Contributor) commented Mar 28, 2024

It is mostly this: https://gist.github.com/ArthurZucker/af34221def212259b43d55a2811d2dbb. You can get ~4x generation speed in transformers with torch.compile and the static cache!

@ArthurZucker Just checking: have you added this to model.generate, or do we still have to follow your script there to use the static KV cache?

@ArthurZucker (Collaborator, Author)

@gante is working on this here #29374

itazap pushed a commit that referenced this pull request May 14, 2024
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>