Further optimization ideas from a DeepSeek R1 analysis; does this look right to everyone? O(∩_∩)O #26
Comments
The power of open source |
DeepAI |
deep-niubility |
niubi |
Can't we just have the AI give the best solution? |
Impressive! Self-iteration and evolution of AI! |
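Use Hopper's ability to issue four asynchronous copy instructions per cycle to raise SMEM fill throughput
Adjust the layout of the P matrix so that each thread accesses contiguous, 64-byte-aligned memory, reducing bank conflicts
Use __builtin_assume to give the compiler hints for optimizing conditional branches
Use union storage for temporary tensors to reduce register usage
Use Hopper's Tensor Memory Accelerator (TMA) to speed up large block data transfers
Introduce a software pipelining strategy to increase instruction-level parallelism
Use Hopper's hardware-accelerated mixed-precision conversion instructions
Dynamically adjust the thread count based on the head dimension to improve resource utilization
Hand-tune the PTX instruction scheduling of the key matrix multiplications
Use Hopper's L2 cache control instructions to improve data locality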
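To make the software pipelining idea in the list above more concrete, here is a minimal host-side C++ sketch. It is not Hopper kernel code and is not taken from this project: the function name `pipelined_sum`, the two staging buffers, and the synchronous `load` lambda are all hypothetical stand-ins. It only illustrates the ordering a pipelined kernel relies on, where the copy for tile t+1 is issued before the math on tile t, so that on real hardware an asynchronous copy could overlap with compute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: sums `data` tile by tile through two staging buffers.
// The load of tile t+1 is issued before the compute on tile t, mirroring the
// prologue / steady-state structure of a software-pipelined loop.
float pipelined_sum(const std::vector<float>& data, std::size_t tile) {
    assert(tile > 0);
    const std::size_t n_tiles = (data.size() + tile - 1) / tile;
    std::vector<float> stage[2] = {std::vector<float>(tile),
                                   std::vector<float>(tile)};
    std::size_t len[2] = {0, 0};

    // Copies tile t into staging buffer `slot`. This copy is synchronous;
    // it stands in for an asynchronous copy into shared memory.
    auto load = [&](std::size_t t, std::size_t slot) {
        const std::size_t begin = t * tile;
        len[slot] = std::min(tile, data.size() - begin);
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(begin);
        std::copy(first, first + static_cast<std::ptrdiff_t>(len[slot]),
                  stage[slot].begin());
    };

    float acc = 0.0f;
    if (n_tiles == 0) return acc;

    load(0, 0);  // prologue: fill the first stage before the loop starts
    for (std::size_t t = 0; t < n_tiles; ++t) {
        const std::size_t cur = t % 2;
        // Issue the next tile's load into the other buffer first ...
        if (t + 1 < n_tiles) load(t + 1, 1 - cur);
        // ... then compute on the tile that is already resident.
        for (std::size_t i = 0; i < len[cur]; ++i) acc += stage[cur][i];
    }
    return acc;
}
```

Calling `pipelined_sum(v, 3)` on a seven-element vector walks three tiles (3, 3, and 1 elements) and returns the same sum as a plain loop. Two buffers are the smallest depth that lets a load and a compute refer to different storage; a real kernel would also need a wait or barrier before reading each stage, which this sketch omits because its copies are synchronous.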
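Expected performance:
Combining the optimizations above is expected to yield improvements in the following areas:
Iterative verification with Nsight Compute is recommended, focusing on: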