
[docs] Revamp Ray core fault tolerance guide #27573

Merged
merged 18 commits into from
Jan 27, 2023

Conversation

stephanie-wang
Contributor

Why are these changes needed?

Changes:

  1. Move all fault tolerance docs under a common heading. The organization is now "User guide" > "Fault tolerance" > "Task | Actor | Object fault tolerance".
  2. Update description of lineage reconstruction semantics.
  3. Add subsections for handling misbehaving tasks and actors.
  4. Move the "Understanding ObjectLostErrors" from troubleshooting page to "Object fault tolerance" page.
  5. Clarify how to destroy actors.

@iycheng also added a component failures page, but since it is an advanced topic I left it hidden for now; we should address it in a follow-up PR.

Related issue number

Closes #27047.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

fishbone and others added 4 commits July 29, 2022 19:13
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@stephanie-wang stephanie-wang requested a review from a team as a code owner August 5, 2022 19:23

You can experiment with this behavior by running the following code.

.. code-block:: python
Collaborator

Let's move the code to the doc_code folder.


# After the actor has been restarted 5 times, all subsequent methods will
# raise a `RayActorError`.
for _ in range(10):
Collaborator

This code is actually wrong, see #26875

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@jjyao jjyao left a comment

I'm wondering if we can have another page that puts these things together and tells people how to write a fault-tolerant Ray application.

The current task, actor, and object pages are more explanatory; people need to read through all of them and figure out for themselves how to apply those concepts to make their applications fault tolerant. We should probably have a page that's more actionable and practical.

For example, if your application is a task DAG, set max_retries and make sure root objects are owned by the driver, ....

We can brainstorm more on this.

doc/source/ray-core/fault-tolerance.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/actors.rst Outdated Show resolved Hide resolved
Comment on lines +52 to +53
For at-least-once actors, the system will still guarantee execution ordering
according to the initial submission order. For example, any tasks submitted
Collaborator

This is only true for synchronous, single-threaded actors: https://docs.ray.io/en/master/ray-core/actors/task-orders.html

Contributor Author

I think it's okay not to mention this, since synchronous and single-threaded is the default.

Comment on lines 62 to 66
that have critical state, it is best to take periodic checkpoints and either
manually restart the actor or automatically restart the actor with at-most-once
semantics. If the actor’s exact state at the time of failure is needed, the
application is responsible for resubmitting all tasks since the last
checkpoint.
Collaborator

We need an example to show people how to do it.

Contributor Author

There's no canonical example for this since it's application-dependent. I simplified this text to make it less specific.

Collaborator

Probably we can add a checkpointing example in the Examples section.

doc/source/ray-core/fault_tolerance/actors.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
automatically recover the value by :ref:`re-executing <fault-tolerance-tasks>`
the task that previously created the value. Arguments to the task are
recursively reconstructed through the same mechanism.

Collaborator

We need an example to demonstrate lineage reconstruction so people can see it in action.

Contributor Author

I don't think we need that. Also, there isn't an easy way to show a node dying in a standalone script without using internal APIs to simulate a cluster.

Collaborator

I feel even a pure-text example (with a diagram) and no runnable code could help people understand: something similar to the summit talk slide example. But we can leave it as is for now and add examples later if needed, based on user feedback.

-----------------------------

The owner of an object can die because of node or worker process failure.
Currently, **Ray does not support recovery from owner failure**. In this case, Ray
Collaborator

We should mention the implication of this: where possible, we should write the program so that the driver owns the root objects, so that they fate-share with the driver.

Contributor Author

Let's save this for a follow-up that adds general recommendations.

reachable. This is a generic error thrown when lineage reconstruction is
disabled and all copies of the object are lost from the cluster.

.. _`lineage reconstruction`: https://docs.ray.io/en/master/ray-core/actors/fault-tolerance.html
Collaborator

This link is removed as part of this PR.

- ``ReferenceCountingAssertionError``: The object has already been deleted,
so it cannot be retrieved. Ray implements automatic memory management through
distributed reference counting, so this error should not happen in general.
However, there is a `known edge case`_ that can produce this error.
Collaborator

You can have the link inline here

`text <link>`_

@stephanie-wang
Contributor Author

I'm wondering if we can have another page that puts these things together and tells people how to write a fault-tolerant Ray application.

The current task, actor, and object pages are more explanatory; people need to read through all of them and figure out for themselves how to apply those concepts to make their applications fault tolerant. We should probably have a page that's more actionable and practical.

For example, if your application is a task DAG, set max_retries and make sure root objects are owned by the driver, ....

We can brainstorm more on this.

Let's do this in a follow-up.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stale

stale bot commented Sep 26, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 26, 2022
@stale

stale bot commented Oct 12, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Oct 12, 2022
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 10, 2022
@stephanie-wang stephanie-wang assigned jjyao and unassigned fishbone Nov 10, 2022
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao assigned ericl and stephanie-wang and unassigned jjyao Dec 9, 2022
@ericl ericl left a comment

Could we also move Environment Dependencies up one row in the TOC? That way the order will be:

  • Tasks (section)
  • Actors (section)
  • Objects (section)
  • Env Deps
  • Scheduling (section)
  • Fault Tolerance (section)


doc/source/ray-core/fault_tolerance/actors.rst Outdated Show resolved Hide resolved
doc/source/ray-core/doc_code/actor_restart.py Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
@@ -0,0 +1,69 @@
.. _fault-tolerance-internal-system:

Advanced topic: Ray system failure model
Contributor

@stephanie-wang I think this page could use a general editing pass.

Collaborator

Discussed with @stephanie-wang offline, this page is removed from this PR and will be added in a follow-up.

doc/source/ray-core/fault_tolerance/ray_components.rst Outdated Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 12, 2022
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@stale

stale bot commented Jan 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 15, 2023
@jjyao jjyao removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 19, 2023
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from ericl January 24, 2023 13:10
@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 24, 2023
@ericl ericl left a comment

The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2023
@ericl
Contributor

ericl commented Jan 26, 2023

Also, how about combining the sections into "Scheduling and Fault Tolerance"? The sections are quite small right now. The fault tolerance pages can be appended to the end of the scheduling guides.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao
Collaborator

jjyao commented Jan 27, 2023

Also, how about combining the sections into "Scheduling and Fault Tolerance"? The sections are quite small right now. The fault tolerance pages can be appended to the end of the scheduling guides.

I think fault tolerance is an important and complex enough topic to warrant its own section. The fault tolerance section is small now, but we plan to expand it in follow-up PRs: for example, adding a "How to write fault-tolerant Ray applications" guide (#27573 (comment)).

@jjyao jjyao requested a review from ericl January 27, 2023 14:05
@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2023
@ericl ericl merged commit 2a7dd31 into ray-project:master Jan 27, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?

_Originally posted by @ericl in ray-project#27573 (review)

Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Successfully merging this pull request may close these issues.

[Core] Revamp fault tolerance user guide doc
4 participants