[docs] Revamp Ray core fault tolerance guide #27573
Conversation
You can experiment with this behavior by running the following code.

.. code-block:: python
Let's move the code to the `doc_code` folder.
    # After the actor has been restarted 5 times, all subsequent methods will
    # raise a `RayActorError`.
    for _ in range(10):
This code is actually wrong, see #26875
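Since the snippet under review is incorrect, here is a pure-Python sketch of the semantics the surrounding prose describes: after the restart budget is exhausted, every subsequent call fails. This simulates, rather than uses, Ray's `max_restarts`; the `RestartableActor` class and the local `RayActorError` are stand-ins, not Ray APIs.

```python
class RayActorError(Exception):
    """Local stand-in for ray.exceptions.RayActorError (simulation only)."""

class RestartableActor:
    """Simulates an actor with a restart budget: crashes are absorbed by
    restarts until max_restarts is exhausted, then calls raise."""

    def __init__(self, max_restarts=5):
        self.max_restarts = max_restarts
        self.restarts_used = 0
        self.alive = True

    def crash(self):
        # Simulate the actor process dying.
        self.alive = False

    def call(self):
        if not self.alive:
            if self.restarts_used < self.max_restarts:
                # The system restarts the actor and the call succeeds.
                self.restarts_used += 1
                self.alive = True
            else:
                raise RayActorError("actor exhausted its restart budget")
        return "ok"

actor = RestartableActor(max_restarts=5)
results = []
for _ in range(10):
    actor.crash()
    try:
        results.append(actor.call())
    except RayActorError:
        results.append("error")

# The first 5 crashes are absorbed by restarts; the remaining 5 raise.
print(results)  # ['ok', 'ok', 'ok', 'ok', 'ok', 'error', 'error', 'error', 'error', 'error']
```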
I'm wondering if we can have another page that puts these things together and tells people how to write fault-tolerant Ray applications.
The current task, actor, and object pages are more explanatory: people need to read through all of them and figure out for themselves how to apply those things to make their applications fault tolerant. We should probably have a page that's more actionable and practical.
For example, if your application is a task DAG, set `max_retries`
and make sure root objects are owned by the driver, ....
We can brainstorm more on this.
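To make the `max_retries` idea above concrete, here is a pure-Python sketch of retry-on-failure semantics. The `run_with_retries` helper and the `flaky` task are hypothetical names for illustration; this simulates the behavior rather than calling Ray.

```python
def run_with_retries(task, args, max_retries=3):
    """Re-run a failed task up to max_retries extra times before
    surfacing the error (so at most 1 + max_retries attempts)."""
    attempts = 0
    while True:
        try:
            return task(*args)
        except Exception:
            attempts += 1
            if attempts > max_retries:
                raise

# A task that fails twice (simulating worker crashes), then succeeds.
calls = {"n": 0}

def flaky(x):
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RuntimeError("transient worker failure")
    return x * 2

result = run_with_retries(flaky, (21,), max_retries=3)
print(result)  # 42, after two retried failures
```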
For at-least-once actors, the system will still guarantee execution ordering
according to the initial submission order. For example, any tasks submitted
This is only true for synchronous, single-threaded actors: https://docs.ray.io/en/master/ray-core/actors/task-orders.html
I think it's okay to not mention this since synchronous and single-threaded is the default.
that have critical state, it is best to take periodic checkpoints and either
manually restart the actor or automatically restart the actor with at-most-once
semantics. If the actor’s exact state at the time of failure is needed, the
application is responsible for resubmitting all tasks since the last
checkpoint.
We need an example to show people how to do it.
There's no canonical example for this since it's application-dependent. I simplified this text to make it less specific.
Probably we can add one checkpointing example in the Examples section.
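Until such an example lands, here is a minimal pure-Python sketch of the checkpoint-and-restore pattern the quoted text describes. The `CheckpointingCounter` class, the file name, and the JSON format are illustrative assumptions, not Ray APIs; the application-specific part (resubmitting work done since the checkpoint) is noted in the comments.

```python
import json
import os
import tempfile

class CheckpointingCounter:
    """Persist critical state every N updates; on (re)start, resume from
    the last checkpoint. Work done after the last checkpoint must be
    resubmitted by the application."""

    def __init__(self, path, checkpoint_every=5):
        self.path = path
        self.checkpoint_every = checkpoint_every
        self.count = 0
        if os.path.exists(path):
            # Restore from the last checkpoint, if any.
            with open(path) as f:
                self.count = json.load(f)["count"]

    def increment(self):
        self.count += 1
        if self.count % self.checkpoint_every == 0:
            with open(self.path, "w") as f:
                json.dump({"count": self.count}, f)

path = os.path.join(tempfile.mkdtemp(), "counter.ckpt")
a = CheckpointingCounter(path)
for _ in range(12):
    a.increment()              # checkpoints are written at 5 and 10

b = CheckpointingCounter(path)  # simulate a restart after failure
print(b.count)  # 10 -- increments 11 and 12 must be resubmitted
```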
automatically recover the value by :ref:`re-executing <fault-tolerance-tasks>`
the task that previously created the value. Arguments to the task are
recursively reconstructed through the same mechanism.
We need an example to demonstrate lineage reconstruction so people can see it in action.
I don't think we need that. Also, there isn't an easy way to show a node dying in a standalone script without using internal APIs to simulate a cluster.
I feel even a pure-text example (with a diagram) without runnable code can help people understand: something similar to the summit talk slide example. But we can leave it as it is for now and add examples later if needed, based on user feedback.
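In that spirit, a standalone pure-Python sketch of lineage reconstruction might look like this. The `ObjectStore` class and its methods are illustrative stand-ins for Ray internals, not real APIs; the key idea is that each object records the task that produced it, so a lost value is recomputed by re-executing that task, with lost arguments reconstructed recursively.

```python
class ObjectStore:
    """Simulated store where every object remembers its lineage:
    the function and argument object ids that produced it."""

    def __init__(self):
        self.values = {}    # object_id -> value (may be evicted)
        self.lineage = {}   # object_id -> (task_fn, arg_object_ids)

    def put_task_result(self, object_id, task_fn, arg_ids):
        self.lineage[object_id] = (task_fn, arg_ids)
        self.values[object_id] = task_fn(*[self.get(a) for a in arg_ids])

    def evict(self, object_id):
        # Simulate losing the value (e.g. the node holding it died).
        self.values.pop(object_id, None)

    def get(self, object_id):
        if object_id not in self.values:
            # Re-execute the creating task; lost arguments are
            # reconstructed recursively through the same get() path.
            task_fn, arg_ids = self.lineage[object_id]
            self.values[object_id] = task_fn(*[self.get(a) for a in arg_ids])
        return self.values[object_id]

store = ObjectStore()
store.put_task_result("a", lambda: 3, [])
store.put_task_result("b", lambda x: x + 4, ["a"])
store.evict("a")
store.evict("b")
value = store.get("b")
print(value)  # 7, recomputed from lineage
```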
-----------------------------

The owner of an object can die because of node or worker process failure.
Currently, **Ray does not support recovery from owner failure**. In this case, Ray
We should mention the implication of this: it means that, where possible, we should write the program so that the driver is the owner of the root objects, so they fate-share with the driver.
Let's save this for a follow-up that adds general recommendations.
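For when that follow-up happens, the fate-sharing implication can be sketched in plain Python. The `Cluster` class below is a simulation stand-in (the error class is named after Ray's `OwnerDiedError`, but nothing here calls Ray): objects owned by a dead worker become unrecoverable, while objects owned by the still-alive driver survive.

```python
class OwnerDiedError(Exception):
    """Stand-in for the error raised when an object's owner has died."""

class Cluster:
    """Each object records the process that owns its metadata. If that
    owner dies, the object cannot be recovered, which is why root
    objects should be owned by the driver."""

    def __init__(self):
        self.objects = {}        # object_id -> (owner, value)
        self.dead_owners = set()

    def put(self, object_id, owner, value):
        self.objects[object_id] = (owner, value)

    def kill(self, owner):
        self.dead_owners.add(owner)

    def get(self, object_id):
        owner, value = self.objects[object_id]
        if owner in self.dead_owners:
            raise OwnerDiedError(f"owner {owner!r} of {object_id!r} died")
        return value

cluster = Cluster()
cluster.put("root", owner="driver", value=1)
cluster.put("leaf", owner="worker-1", value=2)
cluster.kill("worker-1")
print(cluster.get("root"))   # 1 -- fate-shares with the (alive) driver
try:
    cluster.get("leaf")
except OwnerDiedError as e:
    print("lost:", e)
```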
reachable. This is a generic error thrown when lineage reconstruction is
disabled and all copies of the object are lost from the cluster.
.. _`lineage reconstruction`: https://docs.ray.io/en/master/ray-core/actors/fault-tolerance.html |
This link is removed as part of this PR.
- ``ReferenceCountingAssertionError``: The object has already been deleted,
  so it cannot be retrieved. Ray implements automatic memory management through
  distributed reference counting, so this error should not happen in general.
  However, there is a `known edge case`_ that can produce this error.
You can have the link inline here: `` `text <link>`_ ``

Let's do this in a follow-up.
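To illustrate the reference-counting behavior described in the bullet above, here is a minimal single-process sketch. The `RefCountedStore` class and its local error type are hypothetical stand-ins; Ray's real reference counting is distributed across workers.

```python
class ReferenceCountingAssertionError(Exception):
    """Stand-in for the error raised when a deleted object is fetched."""

class RefCountedStore:
    """When the last reference to an object is dropped, its value is
    deleted and any later retrieval fails."""

    def __init__(self):
        self.values = {}
        self.refcounts = {}

    def put(self, object_id, value):
        self.values[object_id] = value
        self.refcounts[object_id] = 1

    def add_ref(self, object_id):
        self.refcounts[object_id] += 1

    def remove_ref(self, object_id):
        self.refcounts[object_id] -= 1
        if self.refcounts[object_id] == 0:
            del self.values[object_id]   # memory reclaimed

    def get(self, object_id):
        if object_id not in self.values:
            raise ReferenceCountingAssertionError(
                f"{object_id!r} was already deleted")
        return self.values[object_id]

rc_store = RefCountedStore()
rc_store.put("x", 42)
rc_store.add_ref("x")
rc_store.remove_ref("x")     # one reference remains; value still alive
print(rc_store.get("x"))     # 42
rc_store.remove_ref("x")     # last reference dropped; value deleted
```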
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message. Please feel free to reopen or open a new issue if you'd still like it to be addressed. Again, you can always ask for help on our discussion forum or Ray's public slack channel. Thanks again for opening the issue!
Could we also move Environment Dependencies up one row in the TOC? That way the order will be:
- Tasks (section)
- Actors (section)
- Objects (section)
- Env Deps
- Scheduling (section)
- Fault Tolerance (section)
@@ -0,0 +1,69 @@
.. _fault-tolerance-internal-system:

Advanced topic: Ray system failure model
@stephanie-wang I think this page could use a general editing pass.
Discussed with @stephanie-wang offline, this page is removed from this PR and will be added in a follow-up.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The structure of the content looks good. My main request (like with the scheduling refactor) is that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?

Also, how about combining the sections into "Scheduling and Fault Tolerance"? The sections are quite small right now. The fault tolerance pages can be appended to the end of the scheduling guides.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
I think fault tolerance is an important and complex topic to be its own section. The fault tolerance page is small now but we plan to expand it in the follow-up PRs: we plan to add "How to write fault tolerant Ray applications" (#27573 (comment)).
> The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?

_Originally posted by @ericl in ray-project#27573 (review)_

Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Why are these changes needed?
Changes:
@iycheng also added a component failures page, but I left it hidden for now. We should address this in a follow-up PR, but it is an advanced topic.
Related issue number
Closes #27047.
Checks

- I've signed off every commit (by using the `-s` flag, i.e., `git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.