[core][experimental] Build an operation-based execution schedule for each actor to avoid deadlocks caused by NCCL operations #46911
@@ -788,7 +788,7 @@ def test_compiled_dag_ref_del(ray_start_regular):
     compiled_dag.teardown()


-def test_dag_fault_tolerance_chain(ray_start_regular_shared):
+def test_dag_fault_tolerance_chain(ray_start_regular):
hmm this usually indicates the workers are not properly cleaned up.
    The communication between workers is done using NCCL. The communication
    within the worker actor is done using IntraProcessChannel.
    """
    monkeypatch.setattr(ray.dag.constants, "RAY_ADAG_ENABLE_DETECT_DEADLOCK", False)
I still need time to think about the deadlock detection in this PR. The existing deadlock detection generates some false alarms after this PR.
Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>
I moved a test from |
Create an issue to track the progress to add the GPU tests back to CI.
Looks good!
python/ray/dag/compiled_dag_node.py
Outdated
return False
next_nodes: List[DAGOperationGraphNode] = []
first_nccl_node: Optional[DAGOperationGraphNode] = None
for _, candidates in actor_to_candidates.items():
Sorry, I was using bind_index to mean the local_idx that you're using here. I didn't realize before that bind_index could be non-contiguous for a single DAG.
What's the motivation here? I think the execution schedule should be the same if there is no NCCL channel in the graph.
Yes, the motivation for this suggestion is the same. Right now the execution schedule will favor scheduling tasks on actors that appear first in the dictionary, which may not be the same depending on what order the actors are inserted into the dictionary.
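One possible way to make the choice independent of dictionary insertion order is to visit actors in a sorted order. The sketch below is only an illustration, not the code from this PR: the string actor IDs, the integer candidates, and the helper name actors_in_stable_order are all hypothetical.

```python
from typing import Dict, List


def actors_in_stable_order(actor_to_candidates: Dict[str, List[int]]) -> List[str]:
    """Return the actors that still have candidates, sorted by actor ID.

    Sorting by a stable key (here a hypothetical string ID) means the result
    does not depend on the order in which actors were inserted into the dict.
    """
    return sorted(actor for actor, cands in actor_to_candidates.items() if cands)


# The same candidates inserted in two different orders.
first = {"actor_b": [1], "actor_a": [0], "actor_c": []}
second = {"actor_a": [0], "actor_c": [], "actor_b": [1]}
order_first = actors_in_stable_order(first)
order_second = actors_in_stable_order(second)
```

Both calls yield the same visiting order, so a scheduler that walks actors this way would produce the same schedule on every run.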
can you run the microbenchmark? If we address the rest of @stephanie-wang's PR, I am okay with merging it without splitting to unblock @woshiyyya .
@@ -386,6 +324,79 @@ def __init__(
assert not isinstance(val, ChannelInterface)
assert not isinstance(val, DAGInputAdapter)

self.input_reader: ReaderInterface = SynchronousReader(self.input_channels)
We can handle it later. This just assumes the reader/writer is synchronous, and I wondered if this should be passed as an input (so that we can support different implementations).
Co-authored-by: Stephanie Wang <swang@cs.berkeley.edu> Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
lint failures!
Why are these changes needed?
Generate an execution schedule for each actor. The schedule is a list of DAGNodeOperation.
Step 1: Generate a graph based on the following rules:
- Divide a DAG node into three GraphNodes: READ, COMPUTE, and WRITE. Each GraphNode has a DAGNodeOperation.
- Add edges between READ and COMPUTE, and between COMPUTE and WRITE, which belong to the same task.
- Add an edge between COMPUTE with bind_index i and COMPUTE with bind_index i+1 if they belong to the same actor.
- Add an edge between WRITE of the writer task and READ of the reader task.
Step 2: Topological sort: If there are multiple GraphNodes with zero in-degree, select one based on the following rules:
(1) If the nodes are not NCCL write nodes, select the one with the smallest bind_index. If there are multiple candidate nodes with the smallest bind_index of the actors that they belong to, any one of them is acceptable. For the implementation details, we maintain a priority queue for each actor, where the peek of the priority queue is the node with the smallest bind_index.
(2) If the node is an NCCL write node, select it only if all of its downstream nodes are also the peeks of their priority queues.
(3) If (1) and (2) cannot be satisfied, it means that all candidate nodes are NCCL write nodes. In this case, select the one that is the peek of the priority queue and its downstream nodes, regardless of whether the downstream nodes are peeks of their priority queues or not.
Then, put the selected nodes into the corresponding actors' schedules.
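The two steps above can be sketched roughly as follows. This is a simplified illustration, not the implementation in this PR: it covers Step 1 and only rule (1) of Step 2, so it applies to a DAG without NCCL channels, and rules (2) and (3) are omitted. The Task class, the function names, and the example tasks are hypothetical. Because bind_index may be non-contiguous, the sketch assumes that "i and i+1" means consecutive tasks on the same actor after sorting by bind_index.

```python
import heapq
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

OPS = ("READ", "COMPUTE", "WRITE")
Node = Tuple[str, str]  # (task name, operation type)


@dataclass(frozen=True)
class Task:
    """Hypothetical stand-in for a DAG node bound to an actor method."""
    name: str
    actor: str
    bind_index: int


def build_graph(tasks: List[Task], data_edges: List[Tuple[str, str]]):
    """Step 1: split each task into READ/COMPUTE/WRITE nodes and add edges."""
    out_edges: Dict[Node, List[Node]] = defaultdict(list)
    in_degree: Dict[Node, int] = {(t.name, op): 0 for t in tasks for op in OPS}

    def add_edge(src: Node, dst: Node) -> None:
        out_edges[src].append(dst)
        in_degree[dst] += 1

    by_actor: Dict[str, List[Task]] = defaultdict(list)
    for t in tasks:
        # READ -> COMPUTE -> WRITE within the same task.
        add_edge((t.name, "READ"), (t.name, "COMPUTE"))
        add_edge((t.name, "COMPUTE"), (t.name, "WRITE"))
        by_actor[t.actor].append(t)
    for actor_tasks in by_actor.values():
        # Control edges between consecutive COMPUTE nodes on the same actor.
        actor_tasks.sort(key=lambda t: t.bind_index)
        for prev, nxt in zip(actor_tasks, actor_tasks[1:]):
            add_edge((prev.name, "COMPUTE"), (nxt.name, "COMPUTE"))
    for writer, reader in data_edges:
        # Data edges: WRITE of the writer task -> READ of the reader task.
        add_edge((writer, "WRITE"), (reader, "READ"))
    return out_edges, in_degree


def build_schedules(tasks: List[Task], data_edges: List[Tuple[str, str]]):
    """Step 2, rule (1) only: topological sort with one heap per actor."""
    out_edges, in_degree = build_graph(tasks, data_edges)
    task_by_name = {t.name: t for t in tasks}
    queues: Dict[str, list] = {a: [] for a in sorted({t.actor for t in tasks})}

    def push(node: Node) -> None:
        task = task_by_name[node[0]]
        entry = (task.bind_index, OPS.index(node[1]), node)
        heapq.heappush(queues[task.actor], entry)

    for node, degree in in_degree.items():
        if degree == 0:
            push(node)

    schedules: Dict[str, List[Node]] = {a: [] for a in queues}
    while any(queues.values()):
        # Among the per-actor peeks, take the smallest bind_index; ties are
        # broken by actor name so the result is deterministic.
        actor = min(
            (a for a in queues if queues[a]),
            key=lambda a: (queues[a][0][0], a),
        )
        _, _, node = heapq.heappop(queues[actor])
        schedules[actor].append(node)
        for nxt in out_edges[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                push(nxt)
    return schedules


# Hypothetical example: a1 feeds b1, and b1 feeds a2 (a1 and a2 share actor A).
tasks = [Task("a1", "A", 0), Task("b1", "B", 0), Task("a2", "A", 1)]
schedules = build_schedules(tasks, [("a1", "b1"), ("b1", "a2")])
for actor, ops in schedules.items():
    print(actor, ops)
```

Picking the globally smallest bind_index among the peeks is just one of the choices that rule (1) allows; any peek that is not an NCCL write node would be equally valid.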
Example: 1F1B pipeline parallelism for training
New dependency graph
A 'happen-before' graph for deadlock detection: Without this PR, the DAG will have a deadlock because the graph contains a cycle.
The schedule built by this PR.
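As a rough illustration of why a cycle in a 'happen-before' graph signals a deadlock, the check can be sketched with Kahn's algorithm: if a topological sort cannot visit every node, some operations wait on each other in a loop. The function name and the example edges below are hypothetical and are not the 1F1B graph from this PR.

```python
from collections import deque
from typing import Dict, List


def has_cycle(edges: Dict[str, List[str]]) -> bool:
    """Return True if the directed graph given as adjacency lists has a cycle."""
    nodes = set(edges)
    for dsts in edges.values():
        nodes.update(dsts)
    in_degree = {n: 0 for n in nodes}
    for dsts in edges.values():
        for dst in dsts:
            in_degree[dst] += 1

    ready = deque(n for n, degree in in_degree.items() if degree == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for dst in edges.get(node, []):
            in_degree[dst] -= 1
            if in_degree[dst] == 0:
                ready.append(dst)
    # Nodes on a cycle never reach in-degree zero, so they are never visited.
    return visited != len(nodes)


# Hypothetical happen-before edges: op_x must precede op_y, and so on.
deadlocked = has_cycle({"op_x": ["op_y"], "op_y": ["op_z"], "op_z": ["op_x"]})
schedulable = has_cycle({"op_x": ["op_y"], "op_y": ["op_z"]})
```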
Next steps
actor1.t1 and actor1.t2 send data via NCCL channels to actor2.t1, and actor1.t1 and actor1.t2 have a control dependency. In this case, actor2.t1 needs to read the channel between actor1.t1 and actor2.t1 first, and then read the channel between actor1.t2 and actor2.t1 to avoid deadlocks.
Related issue number