
Add MistralAI's 7B Transformer as a backbone in KerasNLP Models #1314

Merged — 9 commits, Dec 20, 2023

Conversation

@tirthasheshpatel (Contributor) commented on Nov 13, 2023:

Fixes #1275

This PR adds a MistralBackbone backbone model and all its components.

Most of the components share a lot of code with #1203.

Reference implementation: mistralai/mistral-src

Colab for weight transfer: https://colab.research.google.com/drive/1MoD7JJasThxmalspG3c21oYMEc_qRbti?usp=sharing

TODOs:

  • Add docs for all the layers and the backbone.
  • Add tests to confirm the forward pass matches.
  • Add a checkpoint conversion script.
  • Add the 7B model preset.
  • Add dropout to the CachedMistralAttention and MistralTransformerDecoder layers.
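
For reviewers, a minimal usage sketch of the new backbone (tiny, randomly initialized config; the argument names follow this PR and may differ in the released API):

import numpy as np
from keras_nlp.models import MistralBackbone

# Tiny toy config; the real preset uses the 7B hyperparameters.
backbone = MistralBackbone(
    vocabulary_size=1000,
    num_layers=2,
    num_query_heads=8,
    num_key_value_heads=2,
    hidden_dim=64,
    intermediate_dim=128,
    sliding_window=512,
)
inputs = {
    "token_ids": np.random.randint(0, 1000, size=(1, 12)),
    "padding_mask": np.ones((1, 12), dtype="int32"),
}
# Returns a (batch, sequence, hidden_dim) tensor of hidden states.
print(backbone(inputs).shape)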

@tirthasheshpatel tirthasheshpatel added the type:feature New feature or request label Nov 13, 2023
@tirthasheshpatel tirthasheshpatel changed the title Add MistralAI's Transformer as a backbone in KerasNLP Models Add MistralAI's 7B Transformer as a backbone in KerasNLP Models Nov 13, 2023
@mattdangerw (Member) commented:

Still need to take a pass, but a quick note on tests.

It looks like some recent Keras nightly changes broke our CI; I'm debugging that now. You can ignore the Keras 3 failures. However, the Keras 2 failure looks Mistral-related and is worth digging into.

@mattdangerw (Member) left a comment:

Nice work! I will probably try to step more carefully through the sliding window caching part to understand it better, but left some initial comments.

# TODO(tirthasheshpatel): Generalize the attention layer
# TODO(tirthasheshpatel): Merge `LlamaAttention` with this layer
# TODO(tirthasheshpatel): Use flash attention
# TODO(tirthasheshpatel): Add dropout

Member:

Let's try to do this one if it's easy enough. We usually try to add dropout along with the original architecture.

Contributor (author):

Done.

query = self._query_dense(hidden_states)

# Note that the original PyTorch implementation uses
# view_as_complex/view_as_real while we use split/concatenate to

Member:

Can you explain this a bit more? Why do we need to consider complex numbers here?

Contributor (author):

Here's the Mistral source for computing the frequencies (same as Llama 2) and for applying the embeddings (also the same as Llama 2).

The frequencies and inputs are treated as complex numbers and the computation follows the "Theoretical Explanation" section in the paper.

PyTorch's view_as_complex is used to convert the tensors to complex numbers: it reshapes the inputs to shape (*x.shape[:-1], x.shape[-1] // 2, 2) and treats each pair of elements along the last axis as a (real, imaginary) pair. RotaryEmbedding instead uses ops.split(x, 2, axis=-1) to get a complex representation (after splitting, the first half of the last axis becomes the real part and the second half becomes the imaginary part).

This is the only fundamental difference between the two computations. We can get the same results if we shuffle the inputs so that the elements at even indices form the first half of the last axis and the elements at odd indices form the second half. Hence the x = ops.concatenate([x[..., ::2], x[..., 1::2]], axis=-1) bit before passing the inputs to the rotary embedding layer.

The reverse transformation exactly mirrors/undoes what we did above.

Code demonstration of the above explanation:
import torch
import numpy as np
from keras import ops

def _reshape_for_broadcast(freqs_cis, x):
    """
    freqs_cis: complex - (seq_len, head_dim / 2)
    x: complex - (bsz, seq_len, head_dim / 2)
    """
    ndim = x.ndim
    assert 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1]), (
        freqs_cis.shape,
        (x.shape[1], x.shape[-1]),
    )
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


# Llama's version of rotary embeddings
def apply_rotary_emb(
    xq,
    freqs_cis,
):
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    freqs_cis = _reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq)


# Our version of the same computation.
# With transformations to match the `apply_rotary_emb` function above.
def apply_rotary_pos_emb(tensor, cos_emb, sin_emb):
    tensor = ops.concatenate((tensor[..., ::2], tensor[..., 1::2]), axis=-1)
    x1, x2 = ops.split(tensor, 2, axis=-1)
    half_rot_tensor = ops.concatenate((-x2, x1), axis=-1)
    res = (tensor * cos_emb) + (half_rot_tensor * sin_emb)
    return res

# Toy inputs: (batch=1, seq_len=2, heads=1, head_dim=16).
x = np.random.standard_normal((1, 2, 1, 16))
cos_emb = np.random.standard_normal((2, 8))
sin_emb = np.random.standard_normal((2, 8))

print(x)

# The interleaved -> [even indices, odd indices] shuffle and its split into halves.
print(ops.concatenate((x[..., ::2], x[..., 1::2]), axis=-1))
print(np.split(np.concatenate((x[..., ::2], x[..., 1::2]), axis=-1), 2, axis=-1))

# Reference: PyTorch's complex-number formulation.
print(torch.view_as_complex(torch.tensor(x).reshape(*x.shape[:-1], -1, 2)))
print(apply_rotary_emb(torch.tensor(x), torch.tensor(cos_emb + sin_emb * 1.0j)))

# Our formulation, with the shuffle applied to the inputs inside `apply_rotary_pos_emb`.
y = apply_rotary_pos_emb(
    x,
    np.concatenate([cos_emb[None, :, None, :]] * 2, axis=-1),
    np.concatenate([sin_emb[None, :, None, :]] * 2, axis=-1),
)

# Undo the shuffle on the output; this matches `apply_rotary_emb` above.
print(ops.reshape(ops.stack(ops.split(y, 2, axis=-1), axis=-1), (y.shape[0], y.shape[1], y.shape[2], -1)))

A bit complicated, but it should be possible to achieve the same behavior by shuffling the weights with the same transformation (a toy sketch of the idea follows). I believe that's what the Hugging Face folks have done, which is why this isn't required in the Llama backbone PR.
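
A toy sketch of the weight-shuffling idea (shapes and names are illustrative, not the actual checkpoint-conversion code): permuting the output columns of the query/key projection weight once is equivalent to permuting the projection outputs before the rotary embedding.

import numpy as np

rng = np.random.default_rng(0)
head_dim = 8
w = rng.standard_normal((16, head_dim))  # toy query projection: hidden -> head_dim
x = rng.standard_normal((1, 16))         # toy hidden state

# The interleaved -> [even indices, odd indices] permutation used above.
perm = np.concatenate([np.arange(0, head_dim, 2), np.arange(1, head_dim, 2)])

out_then_perm = (x @ w)[..., perm]  # permute the projection output (what this PR does)
perm_then_out = x @ w[:, perm]      # permute the weight columns once ("shuffle the weights")

print(np.allclose(out_then_perm, perm_then_out))  # True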

Member:

Thanks!

by shuffling the weights using the same transformations

Shuffling what weights?

At the highest level, we should just consider whether we should pull this into the lower level RotaryEmbedding layer. We want it to be useful for the most common use cases of rotary embeddings.

key = ops.cast(
    cache_k[
        :,
        : (cache_update_index + seq_len - 1) % self._sliding_window

Member:

General note: use intermediate variables to improve readability here if you can, especially if you can come up with a good name for (cache_update_index + seq_len - 1) % self._sliding_window.
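
A hedged toy sketch of the naming idea (variable names and toy values are illustrative, not the PR's final code):

from keras import ops

seq_len, cache_update_index, sliding_window = 3, 4, 4
cache_k = ops.ones((1, sliding_window, 2, 8))  # (batch, window, heads, head_dim)

# The modular index from the slice above, pulled into a named variable.
cache_slice_end = (cache_update_index + seq_len - 1) % sliding_window
key = cache_k[:, :cache_slice_end, ...]
print(ops.shape(key))  # (1, 2, 2, 8)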

Contributor (author):

Done. This part was removed to make the caching step XLA compatible.

    attention_scores, attention_mask
)
attention_output = ops.einsum(
    "acbe,aecd->abcd", attention_scores, value

Member:

Consider something like this, where we co-locate all the einsum equations in build and add a symbol key at the top. It helps readability.

https://github.com/keras-team/keras/blob/master/keras/layers/attention/grouped_query_attention.py#L124-L167
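
A hedged sketch of the suggested pattern (equations and shapes here are illustrative, not this PR's exact ones): define every einsum equation in one place with a shared symbol key and reuse them.

import numpy as np
from keras import ops

# Symbol key:
#   b = batch, q = query length, k = key length, h = attention heads, d = head dim
dot_product_equation = "bqhd,bkhd->bhqk"
combine_equation = "bhqk,bkhd->bqhd"

query = np.random.standard_normal((2, 5, 4, 8))
key = np.random.standard_normal((2, 7, 4, 8))
value = np.random.standard_normal((2, 7, 4, 8))

scores = ops.einsum(dot_product_equation, query, key)   # (2, 4, 5, 7)
output = ops.einsum(combine_equation, scores, value)    # (2, 5, 4, 8)
print(ops.shape(scores), ops.shape(output))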

Contributor (author):

Done.

Member:

Could still pull these up into build for a nice co-location, and rewrite them to use the same symbols as the key above.

in build...
self._dot_product_equation = ...
self._combine_equation = ...

Contributor (author):

Done.

    sliding_window=512,
    **kwargs,
):
    decoder_sequence_shape = kwargs.pop("decoder_sequence_shape", None)

Member:

The reason we needed this for TransformerDecoder was that Keras 2 struggles with multiple build shape arguments. I don't think we should need it here.

Contributor (author):

Removed.

"kernel_initializer": keras.initializers.serialize(
self.kernel_initializer
),
"decoder_sequence_shape": self._decoder_sequence_shape,

Member:

I don't think we should need this.

Contributor (author):

Removed.

)
# Below is a workaround for `ops.triu` for Keras 2.
# TODO(tirthasheshpatel): Use `ops.triu` once Keras 2 support is removed.
# causal_mask = ops.triu(causal_mask_upper, k=-self.sliding_window)

Member:

If ops.triu is ready now, we could do it like this:

if config.keras_3:
    ops.triu(...)
else:
    ops.arange...

What does the overall structure of this mask look like?

Contributor (author):

Mistral's attention mask has a banded matrix structure. For example, for an input of sequence length 5 and a sliding window of size 2, we get something like:

In [1]: from keras import ops

In [2]: ops.triu(ops.tril(ops.ones((5, 5)), k=0), k=-2)  # generally k = -sliding_window
Out[2]: 
<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [0., 1., 1., 1., 0.],
       [0., 0., 1., 1., 1.]], dtype=float32)>
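
For reference, a hedged sketch (toy code, not the PR's exact workaround) of building the same banded mask with plain comparisons, which also works without ops.triu on Keras 2:

import numpy as np

def banded_causal_mask(seq_len, sliding_window):
    # Query position i may attend to key position j iff j <= i (causal) and
    # i - j <= sliding_window (inside the band).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return ((j <= i) & (i - j <= sliding_window)).astype("float32")

print(banded_causal_mask(5, 2))  # matches the ops.triu/ops.tril output above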

@tirthasheshpatel (Contributor, author) commented:

@mattdangerw I think I have addressed all your comments except the docs one. Will add docs in the next commit.

    **kwargs,
):
    # Get the dtype
    dtype = kwargs.pop("dtype", keras.backend.floatx())

Contributor (author):

Dtypes work as expected for the TensorFlow and JAX backends, but the PyTorch backend currently fails internally in Keras 3 due to dtype issues.

Comment on lines +225 to +228
cache_k = cache_k[:, :update_end_index, ...]
cache_v = cache_v[:, :update_end_index, ...]

Contributor (author):

JAX fails here if cache_update_index is a traced JAX array. But the value of cache_update_index should be known at each step. I think the right fix here is to make sure that the GenerateTask model passes concrete values here. Otherwise, it would be pretty tricky to make sliding window attention work in JAX.
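
A hedged standalone illustration of the failure mode (not this PR's code): under jit, Python slice bounds must be static, so slicing with a traced index raises.

import jax
import jax.numpy as jnp

@jax.jit
def take_prefix(cache, end_index):
    return cache[:, :end_index]  # `end_index` is a tracer here -> error

try:
    take_prefix(jnp.ones((2, 8, 4)), 5)
except Exception as err:
    print(type(err).__name__)

# Marking the index static (or passing a concrete Python value from the
# generation loop) avoids the problem:
take_prefix_static = jax.jit(lambda cache, end_index: cache[:, :end_index], static_argnums=1)
print(take_prefix_static(jnp.ones((2, 8, 4)), 5).shape)  # (2, 5, 4)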

Member:

Yeah, this looks unsupported by XLA today, as it would involve dynamic shapes in a compiled while_loop. Let's work on a fix as a follow-up.

Comment on lines +133 to +136
cache=None,
cache_update_index=None,

Contributor (author):

Note: Right now, caching doesn't work when the sequence length is greater than the sliding window.

Can be addressed as a follow-up when adding the Generator model; shouldn't be a blocker here.

Member:

Sounds good. If the upstream version is not solving this correctly, let's not worry too much about this.

@mattdangerw (Member) commented:

/gcbrun

@mattdangerw (Member) left a comment:

Looks good! Feel free to pull this in once tests are green and the remaining comments are addressed.

@@ -97,7 +97,7 @@ def _apply_rotary_pos_emb(self, tensor, cos_emb, sin_emb):
         return (tensor * cos_emb) + (half_rot_tensor * sin_emb)

     def _compute_cos_sin_embedding(self, x, rotary_dim, start_index):
-        freq_range = ops.arange(0, rotary_dim, 2, dtype="float32")
+        freq_range = ops.cast(ops.arange(0, rotary_dim, 2), self.compute_dtype)

Member:

This looks like a double cast (see the next line). Remove one or the other?

Contributor (author):

Done.


update_end_index = (
    cache_update_index + seq_len - 1
) % self._sliding_window + 1
update_end_index = ops.cast(update_end_index, "int32")

Member:

Just a general note: torch and JAX like int32 on GPU, but TensorFlow has limited op support for int32 (and does better with int64). We probably don't have GPU coverage for this code path on the TensorFlow backend yet, but it might come up down the line.



layers in each transformer decoder. Only `sliding_window` number of tokens
are saved in the cache and used to generate the next token.
Defaults to `512`.

Member:

Document dtype here, as most models won't support it in the way this backbone does.

Contributor (author):

Done.

from keras_nlp.backend import ops


# TODO: Deprecate this in favor of `keras.layers.LayerNormalization` once

Member:

keras.layers.LayerNormalization(rms_scaling=True)
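
A hedged sketch of the suggestion (assuming the installed Keras version supports rms_scaling): RMS-style normalization via the built-in layer instead of a custom layer norm.

import numpy as np
import keras

rms_norm = keras.layers.LayerNormalization(rms_scaling=True, epsilon=1e-6)
x = np.random.standard_normal((2, 8, 16)).astype("float32")
print(rms_norm(x).shape)  # (2, 8, 16)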

Contributor (author):

Done.


class MistralTransformerDecoder(keras.layers.Layer):
"""A Transformer decoder layer for the Mistral backbone."""

Member:

Remove newline.

Contributor (author):

Done.


def __init__(
    self,
    *,

Member:

Remove star for now.

Contributor (author):

Done.

@sampathweb sampathweb added the kokoro:force-run Runs Tests on GPU label Dec 9, 2023
@kokoro-team kokoro-team removed the kokoro:force-run Runs Tests on GPU label Dec 9, 2023
@mattdangerw (Member) commented:

Looks all green! Let's pull this in.

@mattdangerw mattdangerw merged commit 4ea8c23 into keras-team:master Dec 20, 2023
6 checks passed
Labels: type:feature (New feature or request)
Linked issue (may be closed by this merge): Add Mistral.AI's Transformer Model to KerasNLP
4 participants