Cohere: update RoPE structure #33408
Conversation
@@ -79,6 +80,43 @@ class CohereConfig(PretrainedConfig):
            Whether to tie weight embeddings
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
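For context, here is a hedged sketch (not code from this PR) of how the two documented fields are typically set together. The `rope_type`/`factor` keys follow the llama-style `rope_scaling` dict this docstring is copied from; the concrete values are illustrative.

```python
# Illustrative only: configuring RoPE on a Cohere config, assuming the
# llama-style `rope_scaling` dict documented above. Values are made up.
from transformers import CohereConfig

config = CohereConfig(
    rope_theta=10000.0,  # base period of the RoPE embeddings
    rope_scaling={"rope_type": "linear", "factor": 2.0},  # hypothetical scaling
)
print(config.rope_scaling)
```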
This is copy/paste from llama
# Note: the forward pass of this RoPE is slightly different from Llama's, resulting in different `sin`/`cos` for
# the same parameterization. The differences are highlighted with a comment.
Aside from the line highlighted with a comment, this is copy/paste from llama
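To make the highlighted difference concrete, here is a small sketch (my reading of the diff, not code from this PR): Cohere lays the frequencies out interleaved, whereas Llama concatenates the two halves, so the same `inv_freq` yields differently ordered `sin`/`cos` tables.

```python
import torch

dim = 8
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, dim, 2).float() / dim))
freqs = torch.outer(torch.arange(4).float(), inv_freq)  # (seq_len, dim // 2)

# Llama-style layout: [f0, f1, f2, f3, f0, f1, f2, f3]
emb_llama = torch.cat((freqs, freqs), dim=-1)

# Cohere-style layout (the line highlighted in the diff, as I read it):
# [f0, f0, f1, f1, f2, f2, f3, f3]
emb_cohere = torch.repeat_interleave(freqs, 2, dim=-1)

# Same parameterization, different element order -> different sin/cos tables.
assert not torch.equal(emb_llama.cos(), emb_cohere.cos())
```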
@LysandreJik a PR like this one will be opened for a few more modern models. Since part of the changes consists of having a global view of the model to update the
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This looks clean, nice to reuse the llama code.
No need to update the import structure for now!
What does this PR do?
This PR propagates the updates to the RoPE structure to `cohere` -- the logic for RoPE was abstracted into a separate module for `llama3.1` (#32135). Using the new structure, a model has access to all RoPE scaling strategies (see the sketch after this list).

While touching the modeling code, I've taken the liberty to:
- re-enable `copied from` statements, which were disabled in previous PRs.

✅ all slow tests passing
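As a sketch of what "access to all RoPE scaling strategies" means in practice: with the shared module introduced in #32135, the strategy is looked up by name from the config instead of being hard-coded per model. The module path below is my assumption based on that PR and may move.

```python
# Hedged sketch: list the RoPE initialization strategies registered by the
# shared module from #32135 (module path assumed from that PR).
from transformers.modeling_rope_utils import ROPE_INIT_FUNCTIONS

print(sorted(ROPE_INIT_FUNCTIONS))
# expected to include names such as "default", "dynamic", "linear", "yarn"
```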
Note: #31999 was originally opened to migrate all modern RoPE models into the upgraded structure. However, while working on `cohere`, I noticed that there may be important implementation differences in RoPE. As such, I'll be opening multiple PRs, batching similar RoPE implementations together.