
[docs] Revamp Ray core fault tolerance guide #27573

Merged
merged 18 commits into from
Jan 27, 2023

Conversation

stephanie-wang
Contributor

Why are these changes needed?

Changes:

  1. Move all fault tolerance docs under a common heading. The organization is now "User guide" > "Fault tolerance" > "Task | Actor | Object fault tolerance".
  2. Update description of lineage reconstruction semantics.
  3. Add subsections for handling misbehaving tasks and actors.
  4. Move the "Understanding ObjectLostErrors" from troubleshooting page to "Object fault tolerance" page.
  5. Clarify how to destroy actors.

@iycheng also added a component failures page, but since it is an advanced topic I left it hidden for now; we should address it in a follow-up PR.

Related issue number

Closes #27047.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests; see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

fishbone and others added 4 commits July 29, 2022 19:13
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@stephanie-wang stephanie-wang requested a review from a team as a code owner August 5, 2022 19:23

You can experiment with this behavior by running the following code.

.. code-block:: python
Collaborator

Let's move the code to the doc_code folder.


# After the actor has been restarted 5 times, all subsequent methods will
# raise a `RayActorError`.
for _ in range(10):
Collaborator

This code is actually wrong, see #26875

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@jjyao jjyao left a comment

I'm wondering if we can have another page that puts these things together and tells people how to write a fault-tolerant Ray application.

The current task, actor, and object pages are more explanatory; people need to read through all of them and figure out for themselves how to apply those concepts to make their applications fault tolerant. We should probably have a page that's more actionable and practical.

For example, if your application is a task DAG, set max_retries and make sure root objects are owned by the driver, ....

We can brainstorm more on this.

doc/source/ray-core/fault-tolerance.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/actors.rst Outdated Show resolved Hide resolved
Comment on lines +52 to +53
For at-least-once actors, the system will still guarantee execution ordering
according to the initial submission order. For example, any tasks submitted
Collaborator

This is only true for synchronous, single-threaded actors: https://docs.ray.io/en/master/ray-core/actors/task-orders.html

Contributor Author

I think it's okay not to mention this, since synchronous and single-threaded is the default.

Comment on lines 62 to 66
that have critical state, it is best to take periodic checkpoints and either
manually restart the actor or automatically restart the actor with at-most-once
semantics. If the actor’s exact state at the time of failure is needed, the
application is responsible for resubmitting all tasks since the last
checkpoint.
Collaborator

We need an example to show people how to do it.

Contributor Author

There's no canonical example for this since it's application-dependent. I simplified this text to make it less specific.

Collaborator

Probably we can add a checkpointing example in the Examples section.

doc/source/ray-core/fault_tolerance/actors.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
automatically recover the value by :ref:`re-executing <fault-tolerance-tasks>`
the task that previously created the value. Arguments to the task are
recursively reconstructed through the same mechanism.

Collaborator

We need an example to demonstrate lineage reconstruction so people can see it in action.

Contributor Author

I don't think we need that. Also, there isn't an easy way to show a node dying in a standalone script without using internal APIs to simulate a cluster.

Collaborator

I feel even a pure-text example (with a diagram) and no runnable code could help people understand: something similar to the summit talk slide example. But we can leave it as is for now and add examples later if needed, based on user feedback.

-----------------------------

The owner of an object can die because of node or worker process failure.
Currently, **Ray does not support recovery from owner failure**. In this case, Ray
Collaborator

We should mention the implication of this: where possible, we should write the program so that the driver owns the root objects, so that they fate-share with the driver.

Contributor Author

Let's save this for a follow-up that adds general recommendations.

reachable. This is a generic error thrown when lineage reconstruction is
disabled and all copies of the object are lost from the cluster.

.. _`lineage reconstruction`: https://docs.ray.io/en/master/ray-core/actors/fault-tolerance.html
Collaborator

This link is removed as part of this PR.

- ``ReferenceCountingAssertionError``: The object has already been deleted,
so it cannot be retrieved. Ray implements automatic memory management through
distributed reference counting, so this error should not happen in general.
However, there is a `known edge case`_ that can produce this error.
Collaborator

You can have the link inline here

`text <link>`_

@stephanie-wang
Contributor Author

I'm wondering if we can have another page that puts these things together and tells people how to write a fault-tolerant Ray application.

The current task, actor, and object pages are more explanatory; people need to read through all of them and figure out for themselves how to apply those concepts to make their applications fault tolerant. We should probably have a page that's more actionable and practical.

For example, if your application is a task DAG, set max_retries and make sure root objects are owned by the driver, ....

We can brainstorm more on this.

Let's do this in a follow-up.

Signed-off-by: Stephanie Wang <swang@cs.berkeley.edu>
@stale

stale bot commented Sep 26, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Sep 26, 2022
@stale

stale bot commented Oct 12, 2022

Hi again! The issue will be closed because there has been no more activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!

@stale stale bot closed this Oct 12, 2022
@stale stale bot removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Nov 10, 2022
@stephanie-wang stephanie-wang assigned jjyao and unassigned fishbone Nov 10, 2022
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao assigned ericl and stephanie-wang and unassigned jjyao Dec 9, 2022
@ericl ericl left a comment

Could we also move Environment Dependencies up one row in the TOC? That way the order will be:

  • Tasks (section)
  • Actors (section)
  • Objects (section)
  • Env Deps
  • Scheduling (section)
  • Fault Tolerance (section)


doc/source/ray-core/fault_tolerance/actors.rst Outdated Show resolved Hide resolved
doc/source/ray-core/doc_code/actor_restart.py Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
doc/source/ray-core/fault_tolerance/objects.rst Outdated Show resolved Hide resolved
@@ -0,0 +1,69 @@
.. _fault-tolerance-internal-system:

Advanced topic: Ray system failure model
Contributor

@stephanie-wang I think this page could use a general editing pass.

Collaborator

Discussed with @stephanie-wang offline, this page is removed from this PR and will be added in a follow-up.

doc/source/ray-core/fault_tolerance/ray_components.rst Outdated Show resolved Hide resolved
@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 12, 2022
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@stale

stale bot commented Jan 15, 2023

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 15, 2023
@jjyao jjyao removed the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Jan 19, 2023
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao requested a review from ericl January 24, 2023 13:10
@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 24, 2023
@ericl ericl left a comment

The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2023
@ericl
Contributor

ericl commented Jan 26, 2023

Also, how about combining the sections into "Scheduling and Fault Tolerance"? The sections are quite small right now. The fault tolerance pages can be appended to the end of the scheduling guides.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao
Collaborator

jjyao commented Jan 27, 2023

Also, how about combining the sections into "Scheduling and Fault Tolerance"? The sections are quite small right now. The fault tolerance pages can be appended to the end of the scheduling guides.

I think fault tolerance is an important and complex enough topic to warrant its own section. The fault tolerance section is small now, but we plan to expand it in follow-up PRs: for example, adding a "How to write fault-tolerant Ray applications" guide (#27573 (comment)).

@jjyao jjyao requested a review from ericl January 27, 2023 14:05
@jjyao jjyao removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 27, 2023
@ericl ericl merged commit 2a7dd31 into ray-project:master Jan 27, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
The structure of the content looks good. My main request is (like with the scheduling refactor), that we make this discoverable with links from the main task/actor sections. Could we add 2-3 links each from the main tasks/actors/objects content to the appropriate fault tolerance sections?

_Originally posted by @ericl in ray-project#27573 (review)

Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Successfully merging this pull request may close these issues.

[Core] Revamp fault tolerance user guide doc
4 participants