performance: Horizontal fusion based on `slice` of an input tensor results in segmentation #58
naoyam added a commit that referenced this issue Mar 23, 2023
Previously, fusions like [this](https://github.com/NVIDIA/Fuser/pull/60/files#diff-a8f5333aa3f2d21440b3cea429bb2a588ed583f4d05486063ef1dc1a30996df9R2411) were segmented due to a limitation of `DomainMap`.

There appears to be no impact on the existing tests and benchmarks: none of them fail. I also dumped all generated CUDA kernels from the benchmarks and compared them before and after this PR. Nothing changed.

This is part of the fix for #58.

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
naoyam added a commit that referenced this issue Mar 24, 2023
Resize ops are not replayed, so they don't need to be exactly mapped

Previously, `FusionSliceForNanoGPT3_CUDA` was segmented because its `resize` ops are not exactly mapped: they have different expansion arguments. Since those `resize` ops are part of rfactor transformations, they were detected as conflicting rfactor transformations. However, unlike the `split` and `merge` ops used by `reshape`, `resize` ops are not replayed, so they don't need to be uniform.

This is also part of the fix for #58. The Python example no longer appears to be segmented, although I suspect there is still something that needs to be done for `permute`.
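To make "different expansion arguments" concrete, here is a minimal, framework-free sketch. It assumes a slice `[start:stop]` of an axis is modelled as a resize with a negative left expansion and a negative right expansion; that is my reading of the commit message, not a statement of nvFuser's actual API. The helper `slice_as_resize` and the size `n_embd = 4` are hypothetical.

```python
# Sketch only: models a slice as a "resize" with (left, right) expansion amounts.
# The helper name and the sizes are illustrative assumptions, not nvFuser code.

def slice_as_resize(extent, start, stop):
    """Express slice [start:stop] of an axis of size `extent` as resize arguments.

    Negative expansion values shrink the axis on that side.
    Returns (left_expand, right_expand, output_extent).
    """
    assert 0 <= start <= stop <= extent, "slice must lie within the axis"
    left_expand = -start            # drop `start` elements on the left
    right_expand = stop - extent    # drop `extent - stop` elements on the right
    output_extent = extent + left_expand + right_expand
    return left_expand, right_expand, output_extent


n_embd = 4                 # hypothetical per-branch width
extent = 3 * n_embd        # the input axis holds three branches back to back

# One resize per branch: same output extent, different expansion arguments.
resizes = [slice_as_resize(extent, i * n_embd, (i + 1) * n_embd) for i in range(3)]
for args in resizes:
    print(args)
```

All three branches produce the same output extent, but no two share the same pair of expansion arguments, which is why a check requiring exactly mapped `resize` ops would reject them.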
I'm going to close this issue as the repro is no longer segmented after #64. I haven't looked at detailed performance profiles, but here's the result of running the repro with
This was referenced Jun 6, 2024
Closed
In a use case from nanoGPT, the activations from the input linears of multi-head attention are split. This should produce a single horizontal fusion with 3 parallel sequences of `slice` + `reshape` + `permute`. Instead, the resulting fusion from nvFuser gets segmented into 6 kernels, which is not great.

Repro:
Nsys cmd:
Nsys output:
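For reference, the three parallel `slice` + `reshape` + `permute` branches described in this issue can be sketched as shape bookkeeping in plain Python. This is not the original repro. The function name `qkv_branch_shapes`, the permutation order `(0, 2, 1, 3)`, and all sizes are assumptions based on the usual multi-head attention layout.

```python
# Shape-only sketch of the slice + reshape + permute pattern; no tensors are allocated.
# Names, sizes, and the permutation order are illustrative assumptions.

def qkv_branch_shapes(batch, seq, n_embd, n_head):
    """Return the slice range and intermediate shapes for each of the three branches.

    The input is assumed to have shape (batch, seq, 3 * n_embd).
    """
    assert n_embd % n_head == 0, "embedding size must divide evenly across heads"
    head_dim = n_embd // n_head
    order = (0, 2, 1, 3)  # move the head axis ahead of the sequence axis
    branches = []
    for i, name in enumerate(("q", "k", "v")):
        start, stop = i * n_embd, (i + 1) * n_embd
        sliced = (batch, seq, stop - start)           # slice along the last axis
        reshaped = (batch, seq, n_head, head_dim)     # split channels into heads
        permuted = tuple(reshaped[d] for d in order)  # (batch, n_head, seq, head_dim)
        branches.append({
            "name": name,
            "slice": (start, stop),
            "sliced": sliced,
            "reshaped": reshaped,
            "permuted": permuted,
        })
    return branches


branches = qkv_branch_shapes(batch=2, seq=8, n_embd=12, n_head=3)
for b in branches:
    print(b["name"], b["slice"], b["sliced"], b["reshaped"], b["permuted"])
```

The branches differ only in their slice offsets and share every downstream shape, which is what makes them candidates for a single horizontally fused kernel.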