Nomic Embed Text V2 with Mixture-of-Experts (MoE) architecture #12466

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

manyoso (Contributor) commented Mar 19, 2025:

  • Adds an MoE-based embedding model supporting multilingual embeddings.
  • Selects the architecture variant based on hyperparameter detection (MoE layers); see the sketch below.
  • Removes unnecessary subclass initialization checks for clarity.

https://www.nomic.ai/blog/posts/nomic-embed-text-v2
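The variant selection presumably boils down to checking the Hugging Face config for MoE-related hyperparameters, along the lines of the sketch below. The key names moe_every_n_layers and num_experts and the returned architecture strings are assumptions for illustration, not quoted from this PR.

```python
import json

def pick_nomic_arch(config_path: str) -> str:
    """Pick a GGUF architecture variant from an HF config.json.

    Illustrative sketch only: the keys checked here ("moe_every_n_layers",
    "num_experts") and the returned strings are assumptions, not the PR's code.
    """
    with open(config_path) as f:
        hparams = json.load(f)
    has_moe_layers = hparams.get("moe_every_n_layers", 0) > 0 and hparams.get("num_experts", 0) > 1
    return "nomic-bert-moe" if has_moe_layers else "nomic-bert"
```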


github-actions bot added the python (python script changes) label on Mar 19, 2025
manyoso marked this pull request as draft on March 19, 2025 13:46
manyoso added 3 commits March 19, 2025 09:52

manyoso marked this pull request as ready for review on March 19, 2025 15:54
@@ -702,6 +695,8 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "ccc2ef013c104be7bae2965776d611e1d7a8a2a9c547dd93a682c9a9fc80352e":
             # ref: https://huggingface.co/Xenova/gpt-4o
             res = "gpt-4o"
+        if chkhsh == "a81863d07e75497e2194eb1a1574d5e5cd4d5f85a87a0728b922bf2bed6fb327":
+            res = "bert"
A collaborator commented on this hunk:
The newly added tokenizer is nomic-embed-text-v2-moe, not bert; is this expected?

Also, this list is auto-generated, so please make sure not to modify it manually.
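For context on why the list should not be edited by hand: each chkhsh entry hashes the token IDs the tokenizer produces for a fixed probe string, and convert_hf_to_gguf_update.py regenerates the whole block. The snippet below only illustrates the hashing idea; the real script uses its own, much longer probe text, so it will not reproduce the hashes above.

```python
from hashlib import sha256
from transformers import AutoTokenizer

def tokenizer_fingerprint(model_id: str, probe_text: str) -> str:
    """Hash the token IDs a tokenizer emits for a fixed probe string.

    Mirrors the idea behind get_vocab_base_pre(); the real probe text (chktxt)
    differs, so this will not match the hashes in convert_hf_to_gguf.py.
    """
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    ids = tok.encode(probe_text)
    return sha256(str(ids).encode()).hexdigest()

# Example (model id assumed):
# tokenizer_fingerprint("nomic-ai/nomic-embed-text-v2-moe", "Hello 🚀 world")
```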


if "mlp.experts.mlp.w1" in name:
data_torch = data_torch.view(self.hparams["num_experts"], self.hparams["n_inner"], self.hparams["n_embd"])
return [(self.map_tensor_name(name) + ".weight", data_torch)]
ngxson (Collaborator) commented Mar 19, 2025:
Will this work? (There is no need to return here.)

map_tensor_name will append .weight if the given original name also has it.

Suggested change:
-            return [(self.map_tensor_name(name) + ".weight", data_torch)]
+            name += ".weight"

ngxson (Collaborator) left a comment:

Maybe I missed something, but llm_build_bert does not seem to support MoE, right? Should we also update the compute graph?

manyoso marked this pull request as draft on March 19, 2025 17:58

manyoso (Contributor, Author) commented Mar 19, 2025:

Working on tests to verify the accuracy of the model.
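One plausible shape for such a check (not necessarily what this PR will end up using): embed the same text with the upstream Hugging Face checkpoint and with the converted GGUF through llama-cpp-python, then compare cosine similarity. The GGUF file name, the task prefix, and the use of sentence-transformers here are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from llama_cpp import Llama

def cosine(a, b) -> float:
    a, b = np.asarray(a, dtype=np.float32), np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nomic embedding models expect a task prefix; "search_document:" is assumed here.
text = "search_document: GGUF is the file format used by llama.cpp."

# Reference embedding from the upstream checkpoint (custom modeling code on the Hub).
ref_model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
ref = ref_model.encode([text])[0]

# Embedding from the converted model; the GGUF path is a placeholder.
gguf_model = Llama(model_path="nomic-embed-text-v2-moe.f16.gguf", embedding=True)
got = gguf_model.embed(text)

print("cosine similarity:", cosine(ref, got))  # should be close to 1.0 if the graph matches
```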
