Add group kv support and fix past kv from cache #2263
Conversation
Hi @siddartha-RE, I was looking into this the other day too (inference support for FA2 Llama 2, i.e. past_kv/cache), and ended up using …
No, there is an issue with alignment. I opened an issue in the flash-attention repo. The owner has already responded and is planning to adjust the behavior. Bump the issue if you want to let him know there is more interest. This PR has a pretty comprehensive correctness check.
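For context, here is a minimal PyTorch sketch of the two possible causal-mask alignments when q_len != kv_len; the helper names are illustrative and do not correspond to flash-attn's API.

```python
import torch

def causal_mask_top_left(q_len: int, kv_len: int) -> torch.Tensor:
    # Treats the queries as the *first* q_len positions of the sequence:
    # query i may attend to key positions j <= i.
    i = torch.arange(q_len).unsqueeze(1)
    j = torch.arange(kv_len).unsqueeze(0)
    return j <= i

def causal_mask_bottom_right(q_len: int, kv_len: int) -> torch.Tensor:
    # Treats the queries as the *last* q_len positions (absolute position
    # kv_len - q_len + i), which is the alignment needed when decoding
    # with a key/value cache.
    offset = kv_len - q_len
    i = torch.arange(q_len).unsqueeze(1) + offset
    j = torch.arange(kv_len).unsqueeze(0)
    return j <= i

if __name__ == "__main__":
    # q_len == 1 (one new token), kv_len == 4 (3 cached + 1 new):
    # top-left alignment only lets the new token see the first key,
    # bottom-right alignment lets it see the entire cache.
    print(causal_mask_top_left(1, 4))      # [[ True, False, False, False]]
    print(causal_mask_bottom_right(1, 4))  # [[ True,  True,  True,  True]]
```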
@siddartha-RE Thanks for the contribution. If this is comprehensively tested, can we make it the default implementation and replace the old monkey patch?
The test runs the attention block against the current implementation, the new one, and the HF reference. The current and new implementations are exactly identical, and both differ from the HF implementation within tolerance. I have also tested training 70B with this and the loss looks correct, so the group kv handling appears to be right. I could add a test with group kv enabled and compare the new implementation against HF to further verify it. FYI, I have an issue open on the flash-attn code base.
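A minimal sketch of the kind of comparison described above, assuming hypothetical `current_fn` / `new_fn` / `hf_fn` wrappers around the three attention forward paths; the actual test in this PR may be structured differently.

```python
import torch

def compare_attention_outputs(current_fn, new_fn, hf_fn, bsz=2, seq_len=128,
                              num_heads=8, head_dim=64, atol=2e-3, rtol=2e-3):
    # Random hidden states fed through all three attention implementations.
    torch.manual_seed(0)
    hidden = torch.randn(bsz, seq_len, num_heads * head_dim, dtype=torch.float16)

    out_current = current_fn(hidden)
    out_new = new_fn(hidden)
    out_hf = hf_fn(hidden)

    # The two flash-attn paths should match exactly (same kernel), while both
    # are only expected to match the HF eager implementation within tolerance.
    assert torch.equal(out_current, out_new)
    assert torch.allclose(out_new.float(), out_hf.float(), atol=atol, rtol=rtol)
```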
One more comment: I tested the properly optimized implementation of use_cache against a build of flash-attn from this PR and confirmed that the test I added in the file still passes.
@siddartha-RE Is this ready for merge? Which one do you prefer?
This is ready for merge. It is well tested, so I am comfortable updating the train hook, but I suggest keeping both versions for now so that people can switch back if they see issues. I have a change that further fixes the inference usage, but it will need to wait until the underlying fix in flash-attention for the handling of the causal flag when q != kv is released.
@siddartha-RE Thanks! It is merged.
```python
):
    # [bsz, seq_len]
    if past_key_values_length > 0 and attention_mask is not None:
        attention_mask = torch.cat(
```
Is this concatenation necessary? The attention_mask passed in should have shape (bsz, kv_len), and past_kv_len is already included in kv_len.
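A small shape sketch of the reviewer's point, assuming the HF-style convention that the 2-D padding mask handed to the model already covers the cached tokens:

```python
import torch

# During cached decoding, the padding mask spans past tokens plus the new ones,
# so its second dimension is already past_kv_len + q_len.
bsz, past_kv_len, q_len = 2, 16, 1
attention_mask = torch.ones(bsz, past_kv_len + q_len, dtype=torch.bool)

kv_len = past_kv_len + q_len
assert attention_mask.shape == (bsz, kv_len)

# Prepending another past_kv_len worth of ones would give
# (bsz, past_kv_len + kv_len), which no longer matches the key/value length.
```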
Why are these changes needed?
The primary goal is to enable support for grouped key-value attention in Llama 2 models, specifically the 70B model.
In addition, it fixes the handling of past key values from the cache.
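For illustration, here is a minimal sketch of the grouped key-value expansion that group-kv support relies on; it mirrors the shape convention of the HF Llama `repeat_kv` helper but is not the exact code added in this PR.

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """Expand grouped key/value heads so each query head has a matching kv head.

    Input:  (batch, num_key_value_heads, seq_len, head_dim)
    Output: (batch, num_key_value_heads * n_rep, seq_len, head_dim)
    """
    if n_rep == 1:
        return hidden_states
    bsz, num_kv_heads, seq_len, head_dim = hidden_states.shape
    hidden_states = hidden_states[:, :, None, :, :].expand(
        bsz, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(bsz, num_kv_heads * n_rep, seq_len, head_dim)

# Llama 2 70B: 64 query heads share 8 key/value heads, so n_rep = 8.
k = torch.randn(1, 8, 10, 128)
k_expanded = repeat_kv(k, n_rep=64 // 8)
assert k_expanded.shape == (1, 64, 10, 128)
```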
Related issue number (if applicable)
Addresses #2229
Checks
- I've run format.sh to lint the changes in this PR.