Support aten::split_with_sizes_copy.out / aten::_chunk_cat / aten::_chunk_cat.out to align with CUDA as a fast path
#1213
🚀 The feature, motivation and pitch
Motivation
torch.split_with_sizes_copy.out is called in PyTorch FSDP2: https://github.com/pytorch/pytorch/blob/main/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py#L105. CUDA has a fast-path implementation for this aten op: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L14757. To achieve the same performance as CUDA, please enable a similar fast path for the XPU device.
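For reference, a minimal sketch of the op's semantics, runnable on CPU with a recent PyTorch build (the tensor sizes here are illustrative, not taken from FSDP2): the `.out` variant copies each split into preallocated output tensors, and the CUDA fast path fuses those copies into a single kernel instead of launching one copy per split.

```python
import torch

# Illustrative flat source buffer and per-split sizes (assumed values,
# mirroring the FSDP2 pattern of splitting an all-gather buffer into
# preallocated per-parameter outputs).
src = torch.arange(10.0)
split_sizes = [3, 3, 4]
outs = [torch.empty(s) for s in split_sizes]

# One call copies every split into its preallocated output tensor;
# the CUDA fast path does all of these copies in one fused kernel,
# which is what this issue asks to replicate for XPU.
torch.split_with_sizes_copy(src, split_sizes, dim=0, out=outs)

print([t.tolist() for t in outs])
# [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
```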
Other CUDA-specific aten ops used in FSDP2:
aten::_chunk_cat
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L5720
aten::_chunk_cat.out
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/native_functions.yaml#L5725
Customer impact
FSDP2 performance on the XPU device.
Alternatives
No response
Additional context
No response