Xccl all #7

Open
wants to merge 109 commits into base: main
Changes from all commits
109 commits
c0e1dc1
Happy Init
Chao1Han Aug 29, 2024
93a4bdb
update
Chao1Han Aug 30, 2024
ba6c4b7
update
Chao1Han Aug 30, 2024
68a6aee
update
Chao1Han Aug 30, 2024
31eeee9
register
Chao1Han Sep 2, 2024
b977abc
update
Chao1Han Sep 2, 2024
486b61a
update
Chao1Han Sep 3, 2024
6844932
fix typo and register frontend
Chao1Han Sep 3, 2024
7f6f8b9
update
Chao1Han Sep 3, 2024
be68320
update
Chao1Han Sep 3, 2024
2e21d4f
update
Chao1Han Sep 3, 2024
c9ef78f
update
Chao1Han Sep 4, 2024
076db36
update
Chao1Han Sep 4, 2024
8d739ac
update
Chao1Han Sep 4, 2024
fb9746b
register again
Chao1Han Sep 5, 2024
4f73180
update
Chao1Han Sep 5, 2024
7c2f018
update
Chao1Han Sep 6, 2024
229a80a
refine cmake
Chao1Han Sep 9, 2024
746007b
register all dist op and enable getXcclReduceOp
Chao1Han Sep 9, 2024
5195f52
update
Chao1Han Sep 9, 2024
227e98d
update
Chao1Han Sep 10, 2024
2eb0446
update
Chao1Han Sep 10, 2024
0f61762
update flag
Chao1Han Sep 10, 2024
df81919
update
Chao1Han Sep 11, 2024
b0c0592
update
Chao1Han Sep 11, 2024
366d208
rm redundance code
Chao1Han Sep 11, 2024
3530e43
enable timeout
Chao1Han Sep 11, 2024
485ae8b
add oneccl env
Chao1Han Sep 12, 2024
0cfd224
update
Chao1Han Sep 12, 2024
b99fd8c
Add simple test
Chao1Han Sep 12, 2024
dc41d6a
update
Chao1Han Sep 12, 2024
4c3f49f
enable coalese
Chao1Han Sep 10, 2024
afa2adc
Support broadcast
Chao1Han Sep 11, 2024
8efb5d0
update
Chao1Han Sep 12, 2024
e85c268
update
Chao1Han Sep 12, 2024
7488dbd
update
Chao1Han Sep 12, 2024
e5d6f37
update
Chao1Han Sep 12, 2024
0da5e77
add allgather
Chao1Han Sep 12, 2024
0ad5677
support allgather_into_tensor_coalesced
Chao1Han Sep 13, 2024
009e334
support reduce_scatter
Chao1Han Sep 13, 2024
ecbd989
refine test cases
Chao1Han Sep 13, 2024
a23ffb2
update ut
Chao1Han Sep 13, 2024
1d02dfe
add mpi check
Chao1Han Sep 13, 2024
c485bd8
update datatype map
Chao1Han Sep 13, 2024
2d1ae87
update
Chao1Han Sep 13, 2024
04226de
Merge branch 'xccl' into xccl-group
Chao1Han Sep 13, 2024
6184261
update
Chao1Han Sep 13, 2024
91d26d9
update
Chao1Han Sep 13, 2024
2a83d68
update
Chao1Han Sep 14, 2024
7f62b86
update
Chao1Han Sep 14, 2024
c48f5eb
Support reduce_scatter_base
Chao1Han Sep 14, 2024
9b17dc4
Support reduce_scatter_tensor_coalesced
Chao1Han Sep 14, 2024
6cb3227
support barrier
Chao1Han Sep 14, 2024
e59f051
Merge branch 'xccl' into xccl-group
Chao1Han Sep 14, 2024
d858c81
update
Chao1Han Sep 14, 2024
fea20f5
update
Chao1Han Sep 14, 2024
e0e27f3
update
Chao1Han Sep 14, 2024
029026d
add ut
Chao1Han Sep 18, 2024
682f40f
Support all2all_base
Chao1Han Sep 18, 2024
2694617
update
Chao1Han Sep 18, 2024
612df42
support all2all
Chao1Han Sep 18, 2024
001dac2
use lintrunner format code
Chao1Han Sep 19, 2024
f13b449
rm allgatherv align with nccl
Chao1Han Sep 19, 2024
af29d96
Support reduce
Chao1Han Sep 19, 2024
20b1188
Support gather
Chao1Han Sep 19, 2024
1463eca
Support scatter
Chao1Han Sep 19, 2024
156c2ac
update
Chao1Han Sep 19, 2024
79fd17e
Merge branch 'xccl' into xccl-group
Chao1Han Sep 20, 2024
652da01
Xccl process group for Pytorch
Chao1Han Aug 29, 2024
0cb0016
Merge remote-tracking branch 'upstream/main' into xccl-bak
Chao1Han Sep 20, 2024
a71d69a
Align latest
Chao1Han Sep 20, 2024
a1c2d6b
Merge branch 'xccl-bak' into xccl-group
Chao1Han Sep 20, 2024
ea7a4c9
Merge branch 'xccl-group' into xccl-p2p
Chao1Han Sep 20, 2024
4bf448d
update
Chao1Han Sep 20, 2024
1f83fbf
update
Chao1Han Sep 20, 2024
4f4ecf4
update
Chao1Han Sep 20, 2024
53142c2
Merge branch 'xccl-group' into xccl-p2p
Chao1Han Sep 20, 2024
b6bc4a8
update
Chao1Han Sep 20, 2024
1fbb7ed
support p2p
Chao1Han Sep 23, 2024
88bea25
refine findccl code
Chao1Han Sep 29, 2024
f6ea934
Add comments for build xccl
Chao1Han Sep 30, 2024
31d092d
minor fix
Chao1Han Oct 9, 2024
cbea299
rm duplicate code and refine cmake
Chao1Han Oct 9, 2024
ef261c6
update cmake
Chao1Han Oct 10, 2024
6c648cd
hidden xccl specific
Chao1Han Sep 24, 2024
e621fe6
fix ci fail
Chao1Han Oct 11, 2024
4a45d29
Merge branch 'xccl-bak' into xccl-group
Chao1Han Oct 11, 2024
7bee4d9
Merge branch 'xccl-group' into xccl-p2p
Chao1Han Oct 11, 2024
f85a845
rm ccl attr
Chao1Han Oct 12, 2024
56a5e7f
Refine specific code
Chao1Han Oct 17, 2024
a062f9f
accept comments
Chao1Han Oct 17, 2024
86b66c3
refine code
Chao1Han Oct 21, 2024
1e68c30
Merge branch 'xccl-bak' into xccl-p2p
Chao1Han Oct 21, 2024
d9ce636
align to latest
Chao1Han Oct 21, 2024
385c218
refine code
Chao1Han Oct 21, 2024
ea4fe8c
Merge branch 'xccl-bak' into xccl-p2p
Chao1Han Oct 21, 2024
e36a99c
update
Chao1Han Oct 21, 2024
5096354
update
Chao1Han Oct 21, 2024
9e6448b
add RECORD_PARAM_COMMS_DATA
Chao1Han Oct 22, 2024
8d9c24e
update
Chao1Han Oct 25, 2024
e808b6c
update
Chao1Han Oct 25, 2024
193d946
update
Chao1Han Oct 28, 2024
eb447f2
fix all_gather_v bug
Chao1Han Oct 31, 2024
55150c8
Merge remote-tracking branch 'origin/main' into xccl-p2p
Chao1Han Nov 4, 2024
20b60b1
correct get kvs
Chao1Han Nov 11, 2024
b442419
update kvs key
Chao1Han Nov 12, 2024
0aedc00
Merge remote-tracking branch 'origin/main' into xccl-p2p
Chao1Han Nov 14, 2024
65e0d9d
WA AVG reduction
Chao1Han Nov 14, 2024
3e97e67
update test case
Chao1Han Nov 15, 2024
4 changes: 4 additions & 0 deletions CMakeLists.txt
@@ -276,6 +276,8 @@ option(USE_NATIVE_ARCH "Use -march=native" OFF)
cmake_dependent_option(USE_MPS "Use MPS for macOS build" ON "MPS_FOUND" OFF)
cmake_dependent_option(USE_NCCL "Use NCCL" ON
"USE_CUDA OR USE_ROCM;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_XCCL "Use XCCL" ON
"USE_XPU;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON USE_NCCL OFF)
cmake_dependent_option(USE_STATIC_NCCL "Use static NCCL" OFF "USE_NCCL" OFF)
cmake_dependent_option(USE_SYSTEM_NCCL "Use system-wide NCCL" OFF "USE_NCCL"
@@ -352,6 +354,8 @@ cmake_dependent_option(
USE_C10D_GLOO "USE C10D GLOO" ON "USE_DISTRIBUTED;USE_GLOO" OFF)
cmake_dependent_option(
USE_C10D_NCCL "USE C10D NCCL" ON "USE_DISTRIBUTED;USE_NCCL" OFF)
cmake_dependent_option(
USE_C10D_XCCL "USE C10D XCCL" ON "USE_DISTRIBUTED;USE_XCCL" OFF)
cmake_dependent_option(
USE_C10D_MPI "USE C10D MPI" ON "USE_DISTRIBUTED;USE_MPI" OFF)
cmake_dependent_option(
4 changes: 4 additions & 0 deletions build_variables.bzl
@@ -704,6 +704,10 @@ libtorch_cuda_sources = libtorch_cuda_core_sources + libtorch_cuda_distributed_s
"torch/csrc/cuda/nccl.cpp",
]

libtorch_xpu_distributed_extra_sources = [
"torch/csrc/distributed/c10d/ProcessGroupXCCL.cpp",
]

torch_cpp_srcs = [
"torch/csrc/api/src/cuda.cpp", # this just forwards stuff, no real CUDA
"torch/csrc/api/src/data/datasets/mnist.cpp",
14 changes: 14 additions & 0 deletions caffe2/CMakeLists.txt
@@ -1049,6 +1049,13 @@ elseif(USE_CUDA)
endif()

if(USE_XPU)
# If the SYCL runtime and the oneCCL runtime are both installed on the system,
# the build flags USE_XPU=ON, USE_XCCL=ON and USE_C10D_XCCL=ON are set by default
# and the XCCL backend is built into libtorch_xpu;
# set `USE_XCCL=OFF` manually to disable building the XCCL backend.
if(USE_XCCL)
append_filelist("libtorch_xpu_distributed_extra_sources" Caffe2_XPU_SRCS)
endif()
list(APPEND Caffe2_XPU_SRCS ${GENERATED_CXX_TORCH_XPU})
add_library(torch_xpu ${Caffe2_XPU_SRCS})
torch_compile_options(torch_xpu) # see cmake/public/utils.cmake
@@ -1118,6 +1125,10 @@ if(USE_XPU)
include_directories(SYSTEM ${ATen_XPU_INCLUDE_DIRS})

endif()
if(USE_XCCL)
target_link_libraries(torch_xpu PRIVATE torch::xccl)
target_compile_definitions(torch_xpu PRIVATE USE_XCCL)
endif()
endif()

if(NOT MSVC AND USE_XNNPACK)
@@ -1404,6 +1415,9 @@ if(USE_DISTRIBUTED)
target_compile_definitions(torch_cuda PUBLIC USE_C10D_NCCL)
endif()
endif()
if(USE_XPU AND USE_C10D_XCCL)
target_compile_definitions(torch_xpu PUBLIC USE_C10D_XCCL)
endif()
if(USE_MPI AND USE_C10D_MPI)
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" OR CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
set_source_files_properties(
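Not part of the patch: a minimal sketch of how the build options above could be sanity-checked at runtime, assuming a finished build of this branch on an XPU machine. dist.is_xccl_available() is the frontend query exercised by the test changes later in this diff; expecting USE_XCCL to show up in torch.__config__.show() is an assumption based on the macros.h.in entry below.

import torch
import torch.distributed as dist

# Build settings are recorded via caffe2/core/macros.h.in; USE_XCCL appearing
# in this dump is an assumption, since that map feeds the config output.
print(torch.__config__.show())

# Frontend availability query registered for the XCCL backend.
print("XCCL available:", dist.is_xccl_available())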
1 change: 1 addition & 0 deletions caffe2/core/macros.h.in
@@ -45,6 +45,7 @@
{"USE_CUDNN", "${USE_CUDNN}"}, \
{"CUDNN_VERSION", "${CUDNN_VERSION}"}, \
{"USE_NCCL", "${USE_NCCL}"}, \
{"USE_XCCL", "${USE_XCCL}"}, \
{"USE_MPI", "${USE_MPI}"}, \
{"USE_GFLAGS", "${USE_GFLAGS}"}, \
{"USE_GLOG", "${USE_GLOG}"}, \
18 changes: 18 additions & 0 deletions cmake/Dependencies.cmake
@@ -1123,6 +1123,24 @@ if(USE_CUDA)
include_directories(SYSTEM ${CUB_INCLUDE_DIRS})
endif()

# ---[ XCCL
if(USE_XCCL)
if(NOT USE_XPU)
message(WARNING
"Not using XPU, so disabling USE_XCCL. Suppress this warning with "
"-DUSE_XCCL=OFF.")
caffe2_update_option(USE_XCCL OFF)
elseif(NOT CMAKE_SYSTEM_NAME STREQUAL "Linux")
message(WARNING "USE_XCCL is currently only supported under Linux.")
caffe2_update_option(USE_XCCL OFF)
else()
include(${CMAKE_CURRENT_LIST_DIR}/External/xccl.cmake)
if(NOT XCCL_FOUND)
caffe2_update_option(USE_XCCL OFF)
endif()
endif()
endif()

if(USE_DISTRIBUTED AND USE_TENSORPIPE)
if(MSVC)
message(WARNING "Tensorpipe cannot be used on Windows.")
15 changes: 15 additions & 0 deletions cmake/External/xccl.cmake
@@ -0,0 +1,15 @@
if(NOT __XCCL_INCLUDED)
set(__XCCL_INCLUDED TRUE)

# XCCL_ROOT, XCCL_LIBRARY_DIR, XCCL_INCLUDE_DIR are handled by FindXCCL.cmake.
find_package(XCCL REQUIRED)
if(XCCL_FOUND)
add_library(torch::xccl INTERFACE IMPORTED)
set_property(
TARGET torch::xccl PROPERTY INTERFACE_INCLUDE_DIRECTORIES
${XCCL_INCLUDE_DIR})
set_property(
TARGET torch::xccl PROPERTY INTERFACE_LINK_LIBRARIES
${XCCL_LIBRARY})
endif()
endif()
69 changes: 69 additions & 0 deletions cmake/Modules/FindXCCL.cmake
@@ -0,0 +1,69 @@
# This will define the following variables:
# XCCL_FOUND : True if the system has the XCCL library.
# XCCL_INCLUDE_DIR : Include directories needed to use XCCL.
# XCCL_LIBRARY_DIR : The path to the XCCL library directory.
# XCCL_LIBRARY : The XCCL library full name.

include(FindPackageHandleStandardArgs)

set(XCCL_ROOT "/opt/intel/oneapi/ccl/latest")
if (NOT EXISTS "${XCCL_ROOT}")
message(STATUS "Default OneCCL not found, using current environment OneAPI")
set(XCCL_ROOT $ENV{ONEAPI_ROOT}/ccl/latest)
endif()

string(COMPARE EQUAL "${XCCL_ROOT}" "" nocclfound)
if(nocclfound)
set(XCCL_FOUND False)
set(XCCL_REASON_FAILURE "oneCCL library not found!")
set(XCCL_NOT_FOUND_MESSAGE "${XCCL_REASON_FAILURE}")
return()
endif()

# Find the include directory under XCCL_ROOT.
find_file(
XCCL_INCLUDE_DIR
NAMES include
HINTS ${XCCL_ROOT}
NO_DEFAULT_PATH
)

# Find include/oneapi path from include path.
find_file(
XCCL_INCLUDE_ONEAPI_DIR
NAMES oneapi
HINTS ${XCCL_ROOT}/include/
NO_DEFAULT_PATH
)

list(APPEND XCCL_INCLUDE_DIR ${XCCL_INCLUDE_ONEAPI_DIR})

# Find the library directory under XCCL_ROOT.
find_file(
XCCL_LIBRARY_DIR
NAMES lib
HINTS ${XCCL_ROOT}
NO_DEFAULT_PATH
)

# Find XCCL library fullname.
find_library(
XCCL_LIBRARY
NAMES ccl
HINTS ${XCCL_LIBRARY_DIR}
NO_DEFAULT_PATH
)

if((NOT XCCL_INCLUDE_DIR) OR (NOT XCCL_LIBRARY_DIR) OR (NOT XCCL_LIBRARY))
set(XCCL_FOUND False)
set(XCCL_REASON_FAILURE "oneCCL library not found!")
set(XCCL_NOT_FOUND_MESSAGE "${XCCL_REASON_FAILURE}")
return()
endif()

find_package_handle_standard_args(
XCCL
FOUND_VAR XCCL_FOUND
REQUIRED_VARS XCCL_INCLUDE_DIR XCCL_LIBRARY_DIR XCCL_LIBRARY
REASON_FAILURE_MESSAGE "${XCCL_REASON_FAILURE}"
)
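As a side note on what this finder expects to exist on disk, here is a small pre-build check that mirrors its search order; the script is purely illustrative, not part of the PR, and the paths are copied from the defaults above.

import os

# Same search order as FindXCCL.cmake: the fixed oneAPI install path first,
# then $ONEAPI_ROOT/ccl/latest taken from the environment.
root = "/opt/intel/oneapi/ccl/latest"
if not os.path.isdir(root):
    root = os.path.join(os.environ.get("ONEAPI_ROOT", ""), "ccl", "latest")

# The finder needs include/, include/oneapi/ and lib/ (with libccl inside lib/).
for sub in ("include", os.path.join("include", "oneapi"), "lib"):
    path = os.path.join(root, sub)
    print(path, "->", "found" if os.path.isdir(path) else "missing")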
6 changes: 6 additions & 0 deletions cmake/Summary.cmake
@@ -159,6 +159,12 @@ function(caffe2_print_configuration_summary)
message(STATUS " USE_SYSTEM_UCC : ${USE_SYSTEM_UCC}")
endif()
message(STATUS " USE_ITT : ${USE_ITT}")
message(STATUS " USE_XCCL : ${USE_XCCL}")
if(${USE_XCCL})
message(STATUS " USE_C10D_XCCL : ${USE_C10D_XCCL}")
message(STATUS " XCCL include path : ${XCCL_INCLUDE_DIR}")
message(STATUS " XCCL library : ${XCCL_LIBRARY}")
endif()
message(STATUS " USE_NCCL : ${USE_NCCL}")
if(${USE_NCCL})
message(STATUS " USE_SYSTEM_NCCL : ${USE_SYSTEM_NCCL}")
4 changes: 4 additions & 0 deletions setup.py
@@ -658,6 +658,10 @@ def run(self):
report("-- Building NCCL library")
else:
report("-- Not using NCCL")
if cmake_cache_vars["USE_XCCL"]:
report("-- Building XCCL library")
else:
report("-- Not using XCCL")
if cmake_cache_vars["USE_DISTRIBUTED"]:
if IS_WINDOWS:
report("-- Building without distributed package")
15 changes: 10 additions & 5 deletions test/distributed/test_c10d_common.py
@@ -31,6 +31,7 @@
from torch.testing._internal.common_distributed import (
MultiProcessTestCase,
skip_if_lt_x_gpu,
get_device_count,
)
from torch.testing._internal.common_utils import (
instantiate_parametrized_tests,
@@ -60,14 +61,15 @@
torch.backends.cuda.matmul.allow_tf32 = False


def gpus_for_rank(world_size):
def gpus_for_rank(world_size, backend):
"""Multigpu tests are designed to simulate the multi nodes with multi
GPUs on each node. Nccl backend requires equal #GPUs in each process.
On a single node, all visible GPUs are evenly
divided to subsets, each process only uses a subset.
"""
visible_devices = list(range(torch.cuda.device_count()))
gpus_per_process = torch.cuda.device_count() // world_size
device_count = get_device_count(backend)
visible_devices = list(range(device_count))
gpus_per_process = device_count // world_size
gpus_for_rank = []
for rank in range(world_size):
gpus_for_rank.append(
@@ -828,7 +830,7 @@ def update_parameters(model):
def _gpu_model_with_ddp_comm_hook(
self, process_group, hook=None, gradient_as_bucket_view=False, state=None
):
device_id = gpus_for_rank(self.world_size)[self.rank][0]
device_id = gpus_for_rank(self.world_size, process_group.name())[self.rank][0]
gpu_model = DistributedDataParallel(
ModuleForDdpCommHook().to(device_id),
device_ids=[device_id],
@@ -845,7 +847,7 @@ def _gpu_model_with_builtin_ddp_comm_hook(
def _gpu_model_with_builtin_ddp_comm_hook(
self, process_group, hook=None, gradient_as_bucket_view=False
):
device_id = gpus_for_rank(self.world_size)[self.rank][0]
device_id = gpus_for_rank(self.world_size, process_group.name())[self.rank][0]
gpu_model = DistributedDataParallel(
ModuleForDdpCommHook().to(device_id),
device_ids=[device_id],
@@ -1834,6 +1836,9 @@ def test_init_process_group_for_all_backends(self):
elif backend == dist.Backend.UCC:
if not dist.is_ucc_available():
continue
elif backend == dist.Backend.XCCL:
if not dist.is_xccl_available():
continue
# Multi-threaded PG is defined as a pure python class.
# Its pg.name() does not go through Pybind, so its backend name
# is still "threaded" instead of "custom".
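To round off the diff, a minimal end-to-end sketch of what the new backend enables, assuming a build with USE_XCCL=ON, one XPU device per rank, and the usual env:// rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) set by the launcher. The backend name "xccl" and is_xccl_available() come from this PR; everything else is standard torch.distributed usage, not a verified recipe.

import os
import torch
import torch.distributed as dist

# Illustrative only: one rank of a multi-process XPU job.
if dist.is_xccl_available():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    dist.init_process_group("xccl", rank=rank, world_size=world_size)

    device = torch.device(f"xpu:{rank % torch.xpu.device_count()}")
    t = torch.ones(4, device=device)
    dist.all_reduce(t)  # summed across ranks through the XCCL process group
    print(rank, t)

    dist.destroy_process_group()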