Benchmark results for DeepSeek-v3 in 2x8xH200 Cluster #2738

Open
wants to merge 12 commits into base: main

Conversation

roG0d
Contributor

@roG0d commented Jan 5, 2025

Motivation

  • Establish baseline metrics on a 2x8xH200 GPU cluster for comparison with future work.
  • Explore the trade-offs between using chips with more GPU memory (H200) and increasing the parallel inference world size with H100s.
  • Measure the overhead of multi-node inference compared to single-node inference.
  • Explore the benefits of using FP8 quantization.

For output files and logs, please refer to: https://github.com/datacrunch-research/h200-benchmarks

Modifications

  • Added a new folder, benchmark_dsv3, following the structure of the other benchmark folders.
  • Added a deepseek_v3.sh script containing each benchmark performed (a sketch of the kind of commands it contains appears after this list).
  • Added a README.md with the metrics obtained from these benchmarks.
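
For reference, here is a minimal sketch of the kind of commands such a benchmark script might contain, using SGLang's multi-node server launcher and serving benchmark. The addresses, ports, and prompt count are illustrative placeholders, not the exact settings used for the results in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a 2-node (2x8 GPU) DeepSeek-V3 run with SGLang.
# Addresses, ports, and prompt counts are placeholders.

# Node 0 (run the same command on node 1 with --node-rank 1):
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --nnodes 2 \
  --node-rank 0 \
  --dist-init-addr 10.0.0.1:5000 \
  --trust-remote-code \
  --port 30000

# Serving benchmark against the launched server (uses the default ShareGPT dataset):
python3 -m sglang.bench_serving \
  --backend sglang \
  --host 127.0.0.1 \
  --port 30000 \
  --num-prompts 512
```

An FP8 comparison would follow the same pattern, with the quantization selected at server launch (e.g. via `--quantization fp8`).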

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs
Member

zhyncs commented Jan 5, 2025

Hi @roG0d, sorry for the late response. Could you try the latest version, v0.4.1.post4?

@roG0d
Contributor Author

roG0d commented Jan 7, 2025

Sure, @zhyncs! We were also thinking it might be useful to track the progress made in #2591 and benchmark it for single-node FP8 and BF16.

We could create another folder called benchmark_v0_4_1_post to store these benchmark results, similar to what was done for this PR.

@roG0d
Contributor Author

roG0d commented Jan 7, 2025

We already have the results for v0.4.1.post4 up to FusedMoE tuning for H200 here. Would you prefer us to create a new PR for v0.4.1.post4 with these results, or should we include them in the current one against main so we can keep updating it with future optimizations?
