
slice performance: Horizontal fusion based on slice of an input tensor results in segmentation #58

Closed
kevinstephano opened this issue Mar 22, 2023 · 1 comment

@kevinstephano
Collaborator

In a use case from nanoGPT, the activations from the input linear of multi-head attention are split into three parallel sequences of slice+reshape+permute, which should generate a single horizontal fusion. Instead, the resulting fusion from nvFuser gets segmented into 6 kernels, which is not great.
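For context, a minimal eager-mode sketch of the nanoGPT pattern being traced (shapes taken from the repro below; the variable names `x`, `q`, `k`, `v` are illustrative, not from the original trace):

```python
import torch

# Sketch of the multi-head attention input split, assuming the repro's
# shapes: batch=16, seq=128, 3*embed=3072, 16 heads of dim 64.
# The input linear emits q, k, v concatenated along the last dimension.
x = torch.randn(16, 128, 3072)
q, k, v = x.split(1024, dim=-1)  # three slices of the last dimension
# Each piece exposes the heads and is permuted to (batch, heads, seq, dim).
q = q.reshape(16, 128, 16, 64).permute(0, 2, 1, 3)
print(q.shape)  # torch.Size([16, 16, 128, 64])
```

Each of the three pieces goes through the same slice+reshape+permute sequence, which is why a single horizontal fusion is expected.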

Repro:

import torch
from nvfuser import FusionDefinition, DataType

inputs = [
    torch.randn(16, 128, 3072, device='cuda'),
]

def nvfuser_fusion(fd: FusionDefinition) -> None:
    T0 = fd.from_pytorch(inputs[0])
    T0_slice1 = fd.ops.slice(T0, [0, 0, 0], [16, 128, 1024], [1, 1, 1])
    T0_slice2 = fd.ops.slice(T0, [0, 0, 1024], [16, 128, 2048], [1, 1, 1])
    T0_slice3 = fd.ops.slice(T0, [0, 0, 2048], [16, 128, 3072], [1, 1, 1])
    T1_slice1 = fd.ops.reshape(T0_slice1, [16, 128, 1024], [16, 128, 16, 64])
    T1_slice2 = fd.ops.reshape(T0_slice2, [16, 128, 1024], [16, 128, 16, 64])
    T1_slice3 = fd.ops.reshape(T0_slice3, [16, 128, 1024], [16, 128, 16, 64])
    T2_slice1 = fd.ops.permute(T1_slice1, [0, 2, 1, 3])
    T2_slice2 = fd.ops.permute(T1_slice2, [0, 2, 1, 3])
    T2_slice3 = fd.ops.permute(T1_slice3, [0, 2, 1, 3])
    fd.add_output(T2_slice1)
    fd.add_output(T2_slice2)
    fd.add_output(T2_slice3)

with FusionDefinition() as fd:
    nvfuser_fusion(fd)

out = fd.execute(inputs)
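The three outputs can be sanity-checked against an eager PyTorch reference that applies the same slice/reshape/permute sequence. This is a CPU sketch; on the GPU, `torch.allclose(out[i], refs[i])` would be the comparison, assuming `out` holds the three fusion outputs in order:

```python
import torch

# Eager reference for the three fusion outputs (CPU tensors here for
# illustration; the repro's tensors live on 'cuda').
x = torch.randn(16, 128, 3072)
refs = [
    x[:, :, i * 1024:(i + 1) * 1024]   # slice the concatenated last dim
     .reshape(16, 128, 16, 64)         # split out 16 heads of size 64
     .permute(0, 2, 1, 3)              # -> (batch, heads, seq, head_dim)
    for i in range(3)
]
print([tuple(r.shape) for r in refs])
```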

Nsys cmd:

nsys nvprof --print-gpu-trace python test.py

Nsys output:

Start (ns)  Duration (ns)  CorrId  GrdX  GrdY  GrdZ  BlkX  BlkY  BlkZ  Reg/Trd  StcSMem (MB)  DymSMem (MB)  Bytes (MB)  Throughput (MBps)  SrcMemKd  DstMemKd         Device         Ctx  Strm                                                  Name                                                
 ----------  -------------  ------  ----  ----  ----  ----  ----  ----  -------  ------------  ------------  ----------  -----------------  --------  --------  --------------------  ---  ----  ----------------------------------------------------------------------------------------------------
 1307711678          23104     146   912     1     1   256     1     1       40         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
 1499928416           6560     270   256    16     1   128     1     1       20         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel1(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)        
 1648504226           6048     311   256    16     1   128     1     1       20         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel2(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)        
 1796967171           7936     356   256    16     1   128     1     1       20         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel3(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)3>)        
 1949040639          11680     397    16   256     1   128     1     1       16         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel4(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)4>)        
 2101463421          11713     442    16   256     1   128     1     1       16         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel5(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)4>)        
 2253434746          11744     483    16   256     1   128     1     1       16         0.000         0.000                                                     NVIDIA H100 PCIe (0)    1     7  CudaCodeGen::kernel6(CudaCodeGen::Tensor<float, (int)3>, CudaCodeGen::Tensor<float, (int)4>)
naoyam added a commit that referenced this issue Mar 23, 2023
Previously, fusions like
[this](https://github.com/NVIDIA/Fuser/pull/60/files#diff-a8f5333aa3f2d21440b3cea429bb2a588ed583f4d05486063ef1dc1a30996df9R2411)
are segmented due to a limitation of `DomainMap`.

There seems to be no impact on the existing tests and benchmarks: no failures, and the CUDA kernels generated by the benchmarks, dumped and compared before and after this PR, are identical.

This is part of the fix for #58

---------

Co-authored-by: Gao, Xiang <qasdfgtyuiop@gmail.com>
naoyam added a commit that referenced this issue Mar 24, 2023
Resize ops are not replayed, so they don't need to be exactly mapped

Previously, `FusionSliceForNanoGPT3_CUDA` was segmented because the `resize`
ops are not exactly mapped, since they have different expansion
arguments. As those `resize` ops are part of rfactor transformations,
they were detected as conflicting rfactor transformations. However,
unlike the `split` and `merge` ops used by `reshape`, `resize` ops are not
replayed, so they don't need to be uniform.

This is also part of the fix for #58. The Python example no longer appears
to be segmented, although I suspect there's still something left to do
for `permute`.
@naoyam
Collaborator

naoyam commented Mar 24, 2023

I'm going to close this issue as the repro is no longer segmented after #64. I haven't looked at detailed performance profiles, but here's the result of running the repro with PYTORCH_NVFUSER_DUMP=dump_eff_bandwidth on an A100 80GB:

kernel1 run in 0.041984 ms, achieved: 1198.83 GB/s
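The reported bandwidth is consistent with a single fused kernel reading the input once and writing the three outputs once. A quick sanity check of the arithmetic, assuming fp32 elements and GB = 1e9 bytes:

```python
# Bytes moved by the fused kernel, assuming fp32 (4 bytes/element):
# one read of the (16, 128, 3072) input plus three writes of
# (16, 128, 16, 64) outputs -- the same number of elements in total.
elems_in = 16 * 128 * 3072
elems_out = 3 * 16 * 128 * 16 * 64
total_bytes = 4 * (elems_in + elems_out)   # 50,331,648 bytes

time_s = 0.041984e-3                       # 0.041984 ms from the dump
bandwidth_gbs = total_bytes / time_s / 1e9
print(f"{bandwidth_gbs:.2f} GB/s")         # 1198.83 GB/s, matching the log
```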
