[core][experimental] Build an operation-based execution schedule for each actor to avoid deadlocks caused by NCCL operations #46911

Merged Aug 14, 2024 (112 commits)
Changes from 68 commits
245ff6e
execute schedule
kevin85421 Jul 29, 2024
52453e2
change operation type enum from int to str
kevin85421 Jul 29, 2024
f7ea877
pass str as dict key
kevin85421 Jul 29, 2024
9e411c7
actor_to_execution_schedule
kevin85421 Jul 30, 2024
77e3a26
remove bind_index from DAGNodeOperation
kevin85421 Jul 30, 2024
e0332fb
it somehow works
kevin85421 Jul 31, 2024
4329e21
add a test
kevin85421 Jul 31, 2024
1d6756a
polish
kevin85421 Jul 31, 2024
2174bb6
polish
kevin85421 Jul 31, 2024
ea7685b
polish
kevin85421 Jul 31, 2024
a98b1eb
polish
kevin85421 Jul 31, 2024
ecefb26
polish
kevin85421 Jul 31, 2024
088384e
polish
kevin85421 Jul 31, 2024
b2ac0f9
polish
kevin85421 Jul 31, 2024
bc90966
polish
kevin85421 Jul 31, 2024
ff701e2
polish
kevin85421 Jul 31, 2024
53f5add
polish
kevin85421 Jul 31, 2024
3419151
polish
kevin85421 Jul 31, 2024
3a8acb0
polish
kevin85421 Jul 31, 2024
c0e6cb7
polish
kevin85421 Jul 31, 2024
b8b8ea5
add new tests
kevin85421 Aug 1, 2024
ce7f0e5
add new tests
kevin85421 Aug 1, 2024
900e790
add new tests
kevin85421 Aug 1, 2024
e36277e
add developer api
kevin85421 Aug 1, 2024
83f6209
update comment
kevin85421 Aug 1, 2024
8a8e75a
Add GPU tests
kevin85421 Aug 2, 2024
2fa6118
add complex test
kevin85421 Aug 2, 2024
75539ca
move import heapq to the top-level
Aug 5, 2024
b4e4a38
Apply suggestions from code review
kevin85421 Aug 5, 2024
d220045
update comments
Aug 5, 2024
474e852
use existing buffers
Aug 6, 2024
a5891ab
add prepare / cancel to ExecutableTask
Aug 6, 2024
f8112c0
add cache to ExecutableTask
kevin85421 Aug 6, 2024
ec43c66
use ExecutableTask's exec_operation instead
kevin85421 Aug 6, 2024
97d45f7
restore ser ctx
kevin85421 Aug 6, 2024
69fa038
add comments
kevin85421 Aug 6, 2024
ad51a68
move DAGNodeOperation to a separate file
kevin85421 Aug 6, 2024
ef28fd7
add comments to DAGOperationGraphNode
kevin85421 Aug 6, 2024
acb78f4
add comments for building graph
kevin85421 Aug 6, 2024
e9ecbe5
fix
kevin85421 Aug 6, 2024
bf69570
move select_next_nodes to top-level
kevin85421 Aug 6, 2024
b18df94
move comments to select_next_nodes
kevin85421 Aug 6, 2024
35288f6
add comments for schedule
kevin85421 Aug 6, 2024
ec7f191
add comments for ExecutableTask
kevin85421 Aug 6, 2024
d7ae7a0
update select_next_nodes comment
kevin85421 Aug 6, 2024
aeaedda
add comments for DAGOperationGraphNode
kevin85421 Aug 6, 2024
422a2d9
separate build graph into a independent func
kevin85421 Aug 6, 2024
f989665
add tests for _select_next_nodes
kevin85421 Aug 7, 2024
9189782
separate node generation and adding edge into two functions
kevin85421 Aug 7, 2024
56df8cc
add comments for _build_dag_node_operation_graph
kevin85421 Aug 7, 2024
07125ac
add unit tests for _build_dag_node_operation_graph
kevin85421 Aug 7, 2024
5c3750e
address comments
kevin85421 Aug 7, 2024
8934182
add comments
kevin85421 Aug 7, 2024
09fcc2d
add unit test
kevin85421 Aug 7, 2024
98df069
move tests to GPU instance
kevin85421 Aug 7, 2024
7b06b41
use existing GPU instance
kevin85421 Aug 7, 2024
e664e96
fix BUILD
kevin85421 Aug 7, 2024
0959599
fix BUILD
kevin85421 Aug 7, 2024
2199bda
add comments
kevin85421 Aug 7, 2024
383b877
add type hints
kevin85421 Aug 8, 2024
50ff5c1
use actor_id instead of actor handle
kevin85421 Aug 8, 2024
8667ed4
rename cache to intermediate_buffer
kevin85421 Aug 8, 2024
980f5a1
add asserts
kevin85421 Aug 8, 2024
cafd691
add asserts
kevin85421 Aug 8, 2024
3cb1eb0
add asserts
kevin85421 Aug 8, 2024
ed5836d
update comments for actor_to_candidates
kevin85421 Aug 8, 2024
c1becc9
add assert, and unpack tuple
kevin85421 Aug 8, 2024
255fe00
rename res to input_data
kevin85421 Aug 8, 2024
6b03d0b
use 4 GPUs in CI
kevin85421 Aug 9, 2024
51c2c71
move _select_next_nodes to dag_node_operation.py
kevin85421 Aug 9, 2024
eaa3a4d
update comments for _select_next_nodes
kevin85421 Aug 9, 2024
45d9f43
update comments
kevin85421 Aug 9, 2024
52c6f36
remove DAGNodeOperationType's DeveloperAPI
kevin85421 Aug 9, 2024
ea8def2
remove double negative
kevin85421 Aug 9, 2024
9a81203
move first_nccl_node into a separate loop
kevin85421 Aug 9, 2024
dc46013
add a comment for picking nccl write/read nodes
kevin85421 Aug 9, 2024
18b773a
move return outside try-except
kevin85421 Aug 9, 2024
f51bd1a
add comments for _build_execution_schedule
kevin85421 Aug 9, 2024
fc123cb
add test to ensure _select_next_nodes is deterministic
kevin85421 Aug 9, 2024
a41b9a6
add comments for class_handle
kevin85421 Aug 9, 2024
bb6db70
add comment for deterministism
kevin85421 Aug 9, 2024
a835314
add comment
kevin85421 Aug 9, 2024
0893e99
return actor_to_execution_schedule
kevin85421 Aug 9, 2024
d7bda83
move _build_dag_node_operation_graph to dag_node_operation.py
kevin85421 Aug 9, 2024
164457e
remove print function
kevin85421 Aug 9, 2024
fadf05f
rename DAGOperationGraphNode to _DAGOperationGraphNode
kevin85421 Aug 9, 2024
9895d86
rename DAGNodeOperation to _DAGNodeOperation
kevin85421 Aug 9, 2024
bcc10ad
add comments for in_edges / out_edges
kevin85421 Aug 9, 2024
b125f69
add comments for select_next_nodes behavior
kevin85421 Aug 9, 2024
45fb99f
add an example for actor_to_operation_nodes
kevin85421 Aug 9, 2024
72799f7
split tests into cpu/gpu
kevin85421 Aug 9, 2024
67857a5
add comments for ExecutableTask
kevin85421 Aug 9, 2024
e0df08b
move add_edge into a top-level function
kevin85421 Aug 9, 2024
6e23a3a
rename idx to dag_idx or local_idx
kevin85421 Aug 9, 2024
5691e8b
rename global_idx to dag_idx
kevin85421 Aug 9, 2024
a30f493
update comments for return in _build_dag_node_operation_graph
kevin85421 Aug 9, 2024
44abfbb
move topological sort to dag_node_operation.py
kevin85421 Aug 9, 2024
bca959b
add tests for _generate_actor_to_execution_schedule
kevin85421 Aug 9, 2024
c98362b
add a test for 1f1b
kevin85421 Aug 10, 2024
c45298b
add a test for 1f1b without nccl
kevin85421 Aug 10, 2024
e006cd8
simplify 1f1b test
kevin85421 Aug 10, 2024
5b7c318
troubleshoot ci
kevin85421 Aug 10, 2024
5485c30
troubleshoot ci
kevin85421 Aug 10, 2024
60b1643
troubleshoot ci
kevin85421 Aug 10, 2024
28d8760
revert troubleshoot ci
kevin85421 Aug 12, 2024
a272037
revert troubleshoot ci
kevin85421 Aug 12, 2024
1d9894b
Update python/ray/dag/dag_node_operation.py
kevin85421 Aug 13, 2024
c3819e4
update comments for _select_next_nodes
kevin85421 Aug 13, 2024
3e5feed
change __lt__ for _DAGOperationGraphNode
kevin85421 Aug 14, 2024
d97fda6
remove delete key
kevin85421 Aug 14, 2024
3241cda
update comments
kevin85421 Aug 14, 2024
b36469e
Merge remote-tracking branch 'upstream/master' into better-schedule
kevin85421 Aug 14, 2024
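
The commit messages above mention a topological sort, `heapq`, and a deterministic node-selection step. The block below is a minimal sketch of that general technique (Kahn's algorithm with a min-heap for tie-breaking), not the code from this PR. The function name `build_schedules`, the `(actor, local_idx, op_type)` tuple layout, and the `READ` / `COMPUTE` / `WRITE` labels are assumptions made for illustration, and it does not model the NCCL-specific pairing of write and read nodes that the PR addresses.

```python
import heapq
from collections import defaultdict


def build_schedules(ops, edges):
    """Hypothetical helper: order operations topologically, grouped per actor.

    ops:   list of (actor, local_idx, op_type) tuples (illustrative layout).
    edges: list of (src_op, dst_op) pairs; dst_op must run after src_op.
    """
    in_degree = {op: 0 for op in ops}
    out_edges = defaultdict(list)
    for src, dst in edges:
        out_edges[src].append(dst)
        in_degree[dst] += 1

    # Min-heap of ready operations. Tuples compare lexicographically,
    # so ties are broken the same way on every run.
    ready = [op for op in ops if in_degree[op] == 0]
    heapq.heapify(ready)

    schedules = defaultdict(list)
    visited = 0
    while ready:
        op = heapq.heappop(ready)
        schedules[op[0]].append(op)
        visited += 1
        for nxt in out_edges[op]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if visited != len(ops):
        raise ValueError("dependency cycle: no valid schedule exists")
    return dict(schedules)


# Two actors, one task each; actor A's WRITE feeds actor B's READ.
STEPS = ("READ", "COMPUTE", "WRITE")
ops = [(actor, 0, step) for actor in ("A", "B") for step in STEPS]
edges = []
for actor in ("A", "B"):
    edges.append(((actor, 0, "READ"), (actor, 0, "COMPUTE")))
    edges.append(((actor, 0, "COMPUTE"), (actor, 0, "WRITE")))
edges.append((("A", 0, "WRITE"), ("B", 0, "READ")))

schedules = build_schedules(ops, edges)
```

Popping from a heap rather than a plain list means the resulting order depends only on the graph, not on insertion order, which is one simple way to keep the per-actor schedules reproducible.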
18 changes: 18 additions & 0 deletions python/ray/dag/BUILD
@@ -146,3 +146,21 @@ py_test(
    ],
    deps = ["//:ray_lib"],
)

py_test(
    name = "test_execution_schedule",
    size = "medium",
    srcs = [
        "tests/experimental/test_execution_schedule.py",
    ],
    env = {"RAY_PYTEST_USE_GPU": "1"},
    main = "tests/experimental/test_execution_schedule.py",
    tags = [
        "accelerated_dag",
        "exclusive",
        "multi_gpu",
        "no_windows",
        "team:core",
    ],
    deps = ["//:ray_lib"],
)