Fix: Jamba batched generation #32914
Conversation
def test_left_padding_compatibility(self):
    r"""
    Overriding the test_left_padding_compatibility test as the mamba layers accentuate the numerical
    differences caused by left padding (discussed in the issue linked in the note below). Using a more
    permissive tolerance value.
    """
    import inspect
    # NOTE: left-padding results in small numerical differences. This is expected.
    # See https://github.com/huggingface/transformers/issues/25420#issuecomment-1775317535

    # First, filter out models that don't support left padding - generative and decoder-only.
    # Jamba is a decoder-only architecture
    decoder_only_classes = self.all_generative_model_classes

    # Then, test left-padding
    def _prepare_model_kwargs(input_ids, attention_mask, signature):
        model_kwargs = {"input_ids": input_ids, "attention_mask": attention_mask}
        if "position_ids" in signature:
            position_ids = torch.cumsum(attention_mask, dim=-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            model_kwargs["position_ids"] = position_ids
        if "cache_position" in signature:
            cache_position = torch.arange(input_ids.shape[-1], device=torch_device)
            model_kwargs["cache_position"] = cache_position
        return model_kwargs

    for model_class in decoder_only_classes:
        config, input_ids, attention_mask = self._get_input_ids_and_config()
        model = model_class(config).to(torch_device).eval()
        signature = inspect.signature(model.forward).parameters.keys()

        # Without padding
        model_kwargs = _prepare_model_kwargs(input_ids, attention_mask, signature)
        next_logits_wo_padding = model(**model_kwargs).logits[:, -1, :]

        # With left-padding (length 32)
        pad_size = (input_ids.shape[0], 32)
        padding = torch.ones(pad_size, dtype=input_ids.dtype, device=torch_device) * config.pad_token_id
        padded_input_ids = torch.cat((padding, input_ids), dim=1)
        padded_attention_mask = torch.cat((torch.zeros_like(padding), attention_mask), dim=1)
        model_kwargs = _prepare_model_kwargs(padded_input_ids, padded_attention_mask, signature)
        next_logits_with_padding = model(**model_kwargs).logits[:, -1, :]

        # They should result in very similar logits
        self.assertTrue(torch.allclose(next_logits_wo_padding, next_logits_with_padding, atol=3e-3))
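If the tolerance ever needs re-tuning, a quick way to check the actual drift (a hypothetical debugging snippet, not part of the test) is to print the largest absolute logit difference:

# Hypothetical debugging aid, not part of the test: measure how large the
# left-padding drift actually is before settling on an atol value.
max_abs_diff = (next_logits_wo_padding - next_logits_with_padding).abs().max().item()
print(f"max |logit diff| with left padding: {max_abs_diff:.2e}")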
Passed locally without the higher rtol/atol. Will see if the CI agrees.
Seems like it does :D
Keeping it open for visibility: Left padding works fine now, it was an issue of how padding has been handled in general (for mamba-related models).
CI failure seems unrelated to the PR, some import issues from another model.
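As a rough illustration of what that fix looks like (a simplified sketch based on the PR description, not the exact merged code), the mamba path masks out left-padded positions so they cannot leak into the recurrent state:

# Simplified sketch, not the exact merged code: zero out hidden states at
# padded positions before the mamba mixer so padding cannot pollute the state.
if attention_mask is not None and not torch.all(attention_mask == 1):
    hidden_states = hidden_states * attention_mask.unsqueeze(-1).to(hidden_states.dtype)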
What I would find weird is if this did not improve / change the results, especially for batched generation! The test model is a tiny random one; it would be nice if we could run this with the big one 👀
Yea, I think it should definitely improve batched generation. Too GPU poor to run the Jamba models though; iirc they require at least an 80GB VRAM GPU 😢 Maybe we could notify the people behind Jamba? I doubt they are aware of this issue.
Force-pushed … batch gen (with todo on logits comp) from cfc73d9 to e2c2341.
@@ -737,10 +692,11 @@ def test_simple_batched_generate_with_padding(self):
    with torch.no_grad():
        logits = self.model(input_ids=inputs["input_ids"]).logits

    # TODO fix logits
For more visibility so that I don't forget about it.
LGTM, thank you for fixing! 🙌
Added a nit to confirm. Pre-approving assuming the logits tests will be addressed (sorry about that :) )
# No need for zeroing states when
# 1. Cached forward
# 2. Attending to all inputs
if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
I suspect this line will fail at compilation time (data-dependent conditional branch). Can you confirm, i.e. try running a compiled forward pass?
If it fails, we can add a compile guard, i.e. start the if with not is_torchdynamo_compiling().
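For reference, such a guard could look roughly like this (a sketch of the suggestion only, assuming the surrounding mixer variables; not code from this PR):

from transformers.utils import is_torchdynamo_compiling

# Sketch of the suggested compile guard: skip the data-dependent branch
# entirely while tracing under torch.compile.
if not is_torchdynamo_compiling() and (
    cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1))
):
    ...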
Tested via the following mini script:
import torch
from transformers import JambaForCausalLM, AutoTokenizer
model_id = "ai21labs/Jamba-tiny-random"
model = JambaForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True).to("cuda")
model = torch.compile(model)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# tested with both batched and non-batched input
#input = tokenizer(["Hey how are you doing on this lovely evening?", "What is the purpose of life?"], padding=True, return_tensors="pt").to("cuda")
input = tokenizer(["What is the purpose of life?"], padding=True, return_tensors="pt").to("cuda")
# tested with both a plain forward call and generate
out = model(**input)
#out = model.generate(**input, do_sample=False, max_new_tokens=10)
Haven't encountered any compilation errors locally, so seems to be fine. Is this what you had in mind to test compilation?
Yes, that's it!
Perfect, thank you for confirming :)
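If a stricter check is ever wanted, one option (just a suggestion, not something done in this PR) is to compile with fullgraph=True so that any graph break from a data-dependent branch raises an error instead of silently falling back to eager:

# Optional stricter check (suggestion only): fail loudly on graph breaks.
compiled_model = torch.compile(model, fullgraph=True)
out = compiled_model(**input)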
LGTM thanks again @vasqu for your great contributions!
* init fix
* fix mask during cached forward, move mask related stuff to own function
* adjust tests as left padding does not change logits as much anymore + batch gen (with todo on logits comp)
* revert overwriting new integration tests
* move some comments to docstring
What does this PR do?
Basically a continuation of #32677, this time implementing the fixes for Jamba. The batched generation tests might need to be changed, especially the expected logits, but I am not sure how to proceed there as the logits are hardware dependent.
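One possible way to handle the hardware-dependent logits (an assumption on my side, not something this PR implements) is to compare a small slice against reference values with a tolerance instead of exact equality, e.g. via torch.testing.assert_close:

# Hypothetical tolerance-based check; EXPECTED_LOGITS is a placeholder for
# reference values recorded on one machine, compared with a loose tolerance.
torch.testing.assert_close(logits[0, -1, :10].cpu(), EXPECTED_LOGITS, rtol=1e-3, atol=1e-3)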
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@molbap @ArthurZucker