
Llama: fix batched generation #29109

Merged: 3 commits merged into huggingface:main from the batched_llama branch on Feb 20, 2024

Conversation

@gante (Member) commented on Feb 19, 2024

What does this PR do?

Fixes batched inference on Llama after the static cache changes were added. For instance, RUN_SLOW=1 py.test tests/test_cache_utils.py::CacheIntegrationTest::test_dynamic_cache_beam_search now passes.

What was wrong?

position_ids has shape [bsz, seq_len]. The line computing freqs was correct for batch size = 1 but incorrect for larger batch sizes: it summed the values across the different batch members. This PR therefore adds an extra dimension so that the per-batch values stay separate instead of being summed.
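As a rough, self-contained sketch (not the PR diff itself; toy shapes and inv_freq values assumed), the extra singleton dimensions turn the computation into a batched outer product, so nothing is summed across batch members:

import torch

bsz, seq_len, half_dim = 2, 6, 4
inv_freq = 1.0 / (10000 ** (torch.arange(0, half_dim).float() / half_dim))  # toy rotary inverse frequencies
position_ids = torch.arange(seq_len).repeat(bsz, 1)                         # [bsz, seq_len]

# [bsz, half_dim, 1] @ [bsz, 1, seq_len] -> [bsz, half_dim, seq_len]: one frequency table per batch member
freqs = inv_freq[None, :, None].expand(bsz, -1, 1) @ position_ids[:, None, :].float()
freqs = freqs.transpose(1, 2)                    # [bsz, seq_len, half_dim]
emb = torch.cat((freqs, freqs), dim=-1)          # [bsz, seq_len, dim]
cos, sin = emb.cos(), emb.sin()                  # per-batch-member cos/sin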

Throughput impact of changes

None 🙌 [Measured on my end, RTX3090 + TinyLlama/TinyLlama-1.1B-Chat-v1.0]

Before this PR
[screenshot: throughput benchmark, 2024-02-19 13:10]

After this PR
[screenshot: throughput benchmark, 2024-02-19 13:43]

@gante marked this pull request as ready for review on February 19, 2024 13:53
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -293,7 +293,7 @@ def test_sink_cache_iterative_prompts(self):
@parameterized.expand(["eager", "sdpa", "flash_attention_2"])
def test_static_cache_greedy_sampling_pad_left(self, attn_implementation):
EXPECTED_GENERATION = [
"The best color is the one that complements the subject you are photograph",
"The best color is the one that complements the skin tone of the",
@gante (Member Author) commented on Feb 19, 2024

These changed test results were checked against 4b236aed7618d90546cd2e8797dab5b4a24c5dce (the commit before the static caches were introduced).

These tests do batched generation, hence the need for the change.

👉 the fact that this PR matches the commit before the static caches in this test means that we can now do left-padded batched generation with the same results!
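For illustration, a hypothetical minimal repro of what those slow tests exercise (checkpoint and prompts assumed, not the actual test code):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed; any Llama checkpoint applies
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Prompts of different lengths -> left padding within the batch
inputs = tokenizer(["The best color is", "Here is a short list of colors:"], padding=True, return_tensors="pt")
gen_out = model.generate(**inputs, do_sample=False, max_new_tokens=10)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))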

@gante changed the title from "batched llama" to "Llama: fix batched generation" on Feb 19, 2024
@ArthurZucker (Collaborator)

I'll have to run the benchmark on the A100 to make sure everything is alright, but otherwise it should be good.

@ArthurZucker (Collaborator) left a review:

Great work, nice catch! I'll approve but let me run the benchmark on my side!

Comment on lines +187 to +188
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
@ArthurZucker (Collaborator):

let's unsqueeze in the rotary embedding, no? Or would that change the shape we previously had?

@gante (Member Author) replied on Feb 20, 2024:

Same shapes / no shape problems, but unsqueezing here is preferred by some users (see #27117).
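(For context, roughly how these two lines sit in the modeling code; a simplified sketch of apply_rotary_pos_emb:)

def rotate_half(x):
    # Rotates half the hidden dims of the input.
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # cos/sin arrive as [bsz, seq_len, head_dim]; unsqueezing on dim 1 lets them
    # broadcast over the heads dimension of q/k ([bsz, num_heads, seq_len, head_dim]).
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed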

freqs = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1) @ (
position_ids[:, None, :].float()
)
freqs = freqs.transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
return emb.cos().to(dtype=x.dtype), emb.sin().to(dtype=x.dtype)
@ArthurZucker (Collaborator):

BTW, for BC we could / should still cache the RoPE values, no?
With a property _sin_cached that emits logger.warning_once("will be removed in 4.39"). WDYT?
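(A hypothetical sketch of that suggestion; property name and message assumed:)

@property
def sin_cached(self):
    # Kept only for backward compatibility; the forward pass now returns cos/sin directly.
    logger.warning_once("The sin_cached attribute will be removed in v4.39.")
    return self._sin_cached
# ...and an equivalent cos_cached property for self._cos_cached.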

Comment on lines -1051 to +1059
causal_mask = torch.triu(mask, diagonal=1).to(dtype)
causal_mask = torch.triu(mask, diagonal=1)

causal_mask = causal_mask.to(dtype=dtype, device=device)
@ArthurZucker (Collaborator):

good catch!

@@ -333,18 +333,18 @@ def test_static_cache_greedy_sampling_pad_left(self, attn_implementation):
@parameterized.expand(["eager", "sdpa", "flash_attention_2"])
def test_static_cache_greedy_sampling_pad_right(self, attn_implementation):
EXPECTED_GENERATION = [
"The best color is\n\n\n\n\n\n\n\n\n\n",
"We should not undermind the issues at hand, but address them head on.\nI think",
"The best color isЋ the one that complements the skin tone of",
@ArthurZucker (Collaborator):

-isЋ t
+is t

seems strange 😅 but alright

@gante (Member Author):

hehe this weird one is a copy/paste

(it has right-padding, so we should expect weird things at generation time)

@ArthurZucker (Collaborator)

Alright, no significant slowdowns, so 🟢, but I can't do naive dynamic generation with the same script as before.
Probably because I passed position_ids = torch.arange(seq_length, device=device) and it is not unsqueezed:

  File "/home/arthur/transformers/../static-kv-cache/clean_bench.py", line 147, in <module>
    outputs = model(input_ids, past_key_values=past_key_values,position_ids=position_ids,cache_position=cache_position, return_dict=False, use_cache = True)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1545, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arthur/transformers/src/transformers/models/llama/modeling_llama.py", line 1155, in forward
    outputs = self.model(
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1545, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arthur/transformers/src/transformers/models/llama/modeling_llama.py", line 995, in forward
    layer_outputs = decoder_layer(
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1545, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arthur/transformers/src/transformers/models/llama/modeling_llama.py", line 721, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1545, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arthur/transformers/src/transformers/models/llama/modeling_llama.py", line 628, in forward
    cos, sin = self.rotary_emb(value_states, position_ids, seq_len=None)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1536, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/arthur/miniconda3/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1545, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/arthur/transformers/src/transformers/models/llama/modeling_llama.py", line 107, in forward
    position_ids[:, None, :].float()
IndexError: too many indices for tensor of dimension 1

@gante (Member Author) commented on Feb 20, 2024

@ArthurZucker regarding the benchmark error: position ids should be a 2D tensor, just like the input ids :D I also had to adapt it on my end
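(i.e., a hypothetical one-line adaptation of the benchmark script:)

# position_ids must be 2D, [batch_size, seq_len], matching input_ids
position_ids = torch.arange(seq_length, device=device).unsqueeze(0)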

@ArthurZucker (Collaborator) commented on Feb 20, 2024

Alright, if passing a 1D tensor was already erroring out before!

@gante merged commit 7d312ad into huggingface:main on Feb 20, 2024
19 checks passed
@gante deleted the batched_llama branch on February 20, 2024 10:23
@fxmarty (Contributor) commented on Feb 20, 2024

@gante thanks a lot for this

Comment on lines +129 to +130
self._cos_cached = cos
self._sin_cached = sin
@ArthurZucker (Collaborator):

We should not always overwrite them. We need them to be accessible, but not overwritten on every forward.
