Refactor Cohere Model #30027
Conversation
Thanks.
Let's keep BC for the layer norm name.
While we are at it, there's a missing `# Copied from` here! 🤗
```diff
@@ -76,10 +76,9 @@ def _get_unpad_data(attention_mask):
 class CohereLayerNorm(nn.Module):
-    def __init__(self, hidden_size, eps=1e-5, bias=False):
+    def __init__(self, param_shape=None, eps=1e-5, bias=False):
```
```diff
-    def __init__(self, param_shape=None, eps=1e-5, bias=False):
+    def __init__(self, hidden_size=None, eps=1e-5, bias=False):
+        """ The hidden size can be a tuple or an int. If a tuple is used, the layer will be applied on both dim """
```
```diff
     dtype = q.dtype
     q = q.float()
     k = k.float()
     cos = cos.unsqueeze(unsqueeze_dim)
     sin = sin.unsqueeze(unsqueeze_dim)
     q_embed = (q * cos) + (rotate_half(q) * sin)
     k_embed = (k * cos) + (rotate_half(k) * sin)
-    return q_embed, k_embed
+    return q_embed.to(dtype=dtype), k_embed.to(dtype=dtype)
```
If this is done outside `apply_rotary_pos_emb`, we can keep the `# Copied from` 😉 but it's not an issue.
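To make that concrete, here is a hedged sketch of doing the float32 cast at the call site so `apply_rotary_pos_emb` itself stays identical to the function it is copied from; `rotary_with_upcast` is a hypothetical wrapper for illustration only, not code from this PR:

```python
import torch

def rotate_half(x):
    # Rotate half the hidden dims of the input (same helper as in the Llama modeling code).
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)

def apply_rotary_pos_emb(q, k, cos, sin, unsqueeze_dim=1):
    # Unchanged rotary application, so a "# Copied from" marker could stay valid.
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed

def rotary_with_upcast(q, k, cos, sin):
    # Hypothetical call-site wrapper: upcast to float32 outside the helper, cast back after.
    dtype = q.dtype
    q_embed, k_embed = apply_rotary_pos_emb(q.float(), k.float(), cos.float(), sin.float())
    return q_embed.to(dtype), k_embed.to(dtype)

# Quick shape/dtype check with dummy tensors:
b, h, s, d = 1, 2, 4, 8
q = torch.randn(b, h, s, d, dtype=torch.float16)
k = torch.randn(b, h, s, d, dtype=torch.float16)
cos = torch.randn(b, s, d)
sin = torch.randn(b, s, d)
q2, k2 = rotary_with_upcast(q, k, cos, sin)
print(q2.dtype, q2.shape)  # torch.float16 torch.Size([1, 2, 4, 8])
```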
```python
        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
```

```python
        if self.use_qk_norm:
            # When sharding the model using Tensor Parallelism, need to be careful to use n_local_heads
```
Not sure if this comment is relevant as the model is not sharded by default
Oh, this is to warn others who port this from HF to other frameworks.
As a follow-up, adding integration tests for the new model would be nice.
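For porters, a hedged illustration of that caveat (hypothetical `PerHeadNorm`, not the PR code): when the q/k norm weights are sized by head count, a tensor-parallel shard has to build them with its local head count rather than the global one.

```python
import torch
import torch.nn as nn

class PerHeadNorm(nn.Module):
    """Normalize over head_dim, with a learnable scale per (head, head_dim) entry."""
    def __init__(self, shape, eps=1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(shape))
        self.eps = eps

    def forward(self, x):  # x: (batch, seq, heads, head_dim)
        mean = x.mean(-1, keepdim=True)
        var = (x - mean).pow(2).mean(-1, keepdim=True)
        return self.weight * (x - mean) * torch.rsqrt(var + self.eps)

num_attention_heads, head_dim, tp_world_size = 32, 128, 4
n_local_heads = num_attention_heads // tp_world_size  # heads actually present on this shard

# Sizing the norm with the global head count would not match the sharded activations;
# on each shard the weight must be (n_local_heads, head_dim):
q_norm = PerHeadNorm((n_local_heads, head_dim))
q_local = torch.randn(2, 10, n_local_heads, head_dim)  # (batch, seq, local heads, head_dim)
print(q_norm(q_local).shape)  # torch.Size([2, 10, 8, 128])
```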
I think there's an issue with the tokenizer.json (or with the way transformers converts it): random multilingual text appears out of nowhere. See: https://huggingface.co/CohereForAI/c4ai-command-r-plus/discussions/15. I found this after converting to mlx-lm, but it turned out to be reproducible in transformers alone. Concretely: loading -> saving -> loading the tokenizer, all through HF (see the sketch below).
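For reference, a minimal sketch of that round trip (the local path and the comparison are placeholders; the original reproduction snippet is not reproduced here):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
tok.save_pretrained("./c4ai-command-r-plus-resaved")   # hypothetical local path
tok2 = AutoTokenizer.from_pretrained("./c4ai-command-r-plus-resaved")

text = "Hello, world!"
# Compare encodings before and after the save/load round trip:
print(tok.encode(text) == tok2.encode(text))
```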
Worth noting: the tokenizer.json in the bitsandbytes 4-bit repo linked from the main Cohere repo is also a different size and looks very different from the original model repo's tokenizer.json (https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit/blob/main/tokenizer.json).
Another interesting difference between the 4-bit bnb tokenizer and the original: in the original, token id 255001 (<|END_OF_TURN_TOKEN|>) has special set to False; in the 4-bit bnb one, it's True.
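One way to inspect that flag, as a hedged aside (`added_tokens_decoder` maps token id to the corresponding AddedToken, including its special attribute):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CohereForAI/c4ai-command-r-plus")
# Shows the AddedToken for <|END_OF_TURN_TOKEN|>, including whether special is True or False.
print(tok.added_tokens_decoder[255001])
```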
I have no idea what you are talking about, a 4-bit tokenizer? Could you open an issue with a reproducer and a description of the problem?
I think the difference between the two tokenizer.json files is the unicode encoding after
What @ahmetustun mentioned is the difference: not a 4-bit tokenizer, but the tokenizer.json in https://huggingface.co/CohereForAI/c4ai-command-r-plus vs. https://huggingface.co/CohereForAI/c4ai-command-r-plus-4bit. Not at my workstation now, but if nobody else is seeing this issue, I'll assume I've got something wrong on my end. Thanks for the clarification.
Please leave further comments in the model repo if the problem continues. Thanks @fblissjr.
* changes
* addressing comments
* smol fix
What does this PR do?
Refactor Cohere Model
Fixes # (issue)
Before submitting
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@ArthurZucker and @younesbelkada
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.