Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deepseek moe #2467

Closed
wants to merge 5 commits into from
Closed

Deepseek moe #2467

wants to merge 5 commits into from

Conversation

esmeetu
Copy link
Collaborator

@esmeetu esmeetu commented Jan 17, 2024

This is a refactor version based #2453, since i have no permission to change onto that PR.

  • Make load_weight logic cleaner, and remove unnecessary merged_replicated_linear_loader function.
  • Remove reduce_results args of DeepseekMLP
  • Make shared_expert and expert use same DeepseekExpertMLP which is semantically appropriate.
  • Align Mixtral MLP with DeepseekExpertMLP by using ReplicatedLinear across gate_proj, up_proj, down_proj.

@zwd003 Thanks for your contribution again! And happy to be co-auther with you.
@zhuohan123 @WoosukKwon This PR is ready for review! Also we can wait for #2293

else:
final_hidden_states.add_(current_hidden_states)

y = tensor_model_parallel_all_reduce(final_hidden_states)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reducing after forwarding sharedexpert could avoiding one extra reduce operation, as sharedexpert also requires a reduce operation; these two operations can be merged.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't split shared_expert weights, and there is no need to reduce for that. Am i right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, it's right

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitting shared_expert may have higher performance, reducing memory required by each gpu

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emm..., It seems make sense. But maybe we can wait for #2293. And then consider improving performance.

@simon-mo simon-mo mentioned this pull request Jan 22, 2024
Copy link
Collaborator

@cadedaniel cadedaniel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the contribution @esmeetu and @zwd003 . @simon-mo asked me to review; I took an initial pass, will run on h100 and review later today.

Comment on lines +140 to +141
self.expert_indicies = np.array_split(range(
self.n_routed_experts), self.tp_size)[self.rank].tolist()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we can replace with pure-pytorch version

torch.arange(self.n_routed_experts
).split(self.n_routed_experts // self.tp_size)[self.rank].tolist()

@@ -0,0 +1,468 @@
# coding=utf-8
# Adapted from
# https://github.com/huggingface/transformers/blob/v4.28.0/src/transformers/models/llama/modeling_llama.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update this comment? seems adapted mostly from mixtral.py

max_position_embeddings=max_position_embeddings,
linear_method=linear_method,
)
self.mlp = DeepseekMoE(config=config,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure this would be easier to read if you did

if <...>:
    self.mlp = DeepseekMoE(...)
else:
    self.mlp = DeepseekMLP(...)

:)

@esmeetu
Copy link
Collaborator Author

esmeetu commented Jan 23, 2024

@cadedaniel @pcmoritz Thanks for your reviews! I suggest to merge the fused moe version in #2453 which is faster. And i will close this PR and let's pushing that PR being merged.

@esmeetu esmeetu closed this Jan 23, 2024
@esmeetu esmeetu deleted the deepseek-moe branch February 14, 2024 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants