Fix load balancing loss func for mixtral #28256
Conversation
How does this differ from #28115?
Thanks a lot for deep diving. As discussed here, this is welcome; we indeed had a bug in the implementation. Let's try to help with shapes and use something explicit like this:
```python
_, selected_experts = torch.topk(routing_weights, top_k, dim=-1)  # [batch_size * sequence_length, top_k]
expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)  # [batch_size * sequence_length, top_k, num_experts]
tokens_per_expert = torch.mean(expert_mask.float(), dim=0)  # [top_k, num_experts]
# Compute the average probability of routing to these experts
router_prob_per_expert = torch.mean(routing_weights, dim=0)  # [num_experts]
overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0)) / top_k
return overall_loss * num_experts
```
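For context, the suggested computation can be exercised end to end. The following is a self-contained sketch of the idea; the function name and tensor sizes are illustrative, not the exact `modeling_mixtral.py` code:

```python
import torch

def load_balancing_loss(routing_weights, top_k, num_experts):
    # routing_weights: [batch_size * sequence_length, num_experts], rows sum to 1
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)  # [tokens, top_k]
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)  # [tokens, top_k, num_experts]
    # Fraction of tokens dispatched to each expert, per top-k slot
    tokens_per_expert = torch.mean(expert_mask.float(), dim=0)  # [top_k, num_experts]
    # Average router probability assigned to each expert
    router_prob_per_expert = torch.mean(routing_weights, dim=0)  # [num_experts]
    overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0)) / top_k
    return overall_loss * num_experts

routing_weights = torch.softmax(torch.randn(64, 8), dim=-1)
loss = load_balancing_loss(routing_weights, top_k=2, num_experts=8)
print(loss.dim(), loss.item())
```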
```
@@ -107,15 +107,16 @@ def load_balancing_loss_func(gate_logits: torch.Tensor, num_experts: torch.Tensor
    selected_experts = selected_experts.reshape(-1)

    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
    expert_mask = expert_mask.reshape(-1, top_k, num_experts)
```
This is not needed if we just remove `selected_experts = selected_experts.reshape(-1)`.
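The reviewer's point is easy to verify: `one_hot` accepts tensors of any shape and appends the class dimension last, so the flatten-then-reshape round trip is redundant. A small check (sizes below are illustrative):

```python
import torch

N, top_k, num_experts = 16, 2, 8
selected_experts = torch.randint(0, num_experts, (N, top_k))

# Flatten, one-hot, then reshape back ...
mask_flat = torch.nn.functional.one_hot(selected_experts.reshape(-1), num_experts)
mask_flat = mask_flat.reshape(-1, top_k, num_experts)
# ... produces exactly the same tensor as one-hot on the 2D tensor directly
mask_direct = torch.nn.functional.one_hot(selected_experts, num_experts)
print(torch.equal(mask_flat, mask_direct))  # True
```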
```
@@ -107,15 +107,16 @@ def load_balancing_loss_func(gate_logits: torch.Tensor, num_experts: torch.Tensor
    selected_experts = selected_experts.reshape(-1)

    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
    expert_mask = expert_mask.reshape(-1, top_k, num_experts)
    expert_mask = torch.max(expert_mask, dim=-2).values
```
`expert_mask = torch.max(expert_mask, dim=-2).values` should be removed.
What is the impact of this issue on Mixtral training? Will this fix conceivably improve the quality of training? Is it likely that previous Mixtral trainings are not as good as they could be? It seems like an important issue for those working with Mixtral, and it has been waiting on merge approval for a while.
I personally find the loss to be much lower with the new implementation, but I wasn't sure if that has to do with the `num_experts**2` scaling instead of just N. I'm pretty sure this is an error on the original Mixtral side. I'm still waiting for the training run with the newly implemented balancing loss to finish. DeepSpeed also has a top-2 implementation which we might be able to reference.
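One sanity check on the corrected formulation (a sketch reusing the suggestion above, not the exact library code): with perfectly uniform router probabilities, `router_prob_per_expert` is `1/num_experts` everywhere and `tokens_per_expert` sums to `top_k` over its entries, so the loss evaluates to exactly 1 regardless of which experts win the tied top-k:

```python
import torch

def load_balancing_loss(routing_weights, top_k, num_experts):
    _, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
    expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
    tokens_per_expert = torch.mean(expert_mask.float(), dim=0)
    router_prob_per_expert = torch.mean(routing_weights, dim=0)
    overall_loss = torch.sum(tokens_per_expert * router_prob_per_expert.unsqueeze(0)) / top_k
    return overall_loss * num_experts

num_experts, top_k = 8, 2
# Perfectly uniform router: every expert gets probability 1/num_experts
uniform = torch.full((128, num_experts), 1.0 / num_experts)
balanced_loss = load_balancing_loss(uniform, top_k, num_experts)
print(balanced_loss.item())  # 1.0
```

This gives a concrete baseline: the balanced minimum of the loss is 1, and imbalanced routing pushes it above 1.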
#28255 has information that could help. I am down to merge this for the release planned this week; just the comments need to be addressed. cc @liangxuZhang, do you need help to finish this?
@ArthurZucker LGTM. The new implementation is correct and concise, and I've made a new commit. In #28255, maybe we can have a deeper discussion about whether to concatenate the gate logits of all layers.
Alright! Pretty sure the math shows it's equivalent to compute on individual layers and then sum versus concatenate and compute, but let's merge this for now!
Thanks! The failing test seems unrelated; let's just rebase on main.
Force-pushed from cdd6509 to 0fe0244.
@ArthurZucker I've just rebased on the main branch, but I'm not sure if I did it right. Please tell me what else I need to do.
@liangxuZhang @ArthurZucker opinions about #28403? It looks complementary to this PR.
Something like
Thanks a lot @liangxuZhang for this fix! 🤗
great!
Impressive work!
* Correct the implementation of auxiliary loss of Mixtral
* Correct the implementation of auxiliary loss of Mixtral
* Implement a simpler calculation method

Co-authored-by: zhangliangxu3 <zhangliangxu3@jd.com>
What does this PR do?
Fixes #28255
Who can review?
@ArthurZucker and @younesbelkada
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.