[Core] Fix object reconstruction hang on arguments pending creation #47645

Merged: 7 commits, Sep 27, 2024

Conversation

jjyao
Collaborator

@jjyao jjyao commented Sep 13, 2024

Why are these changes needed?

Currently inside ObjectRecoveryManager::ReconstructObject we have

auto resubmitted = task_resubmitter_->ResubmitTask(task_id, &task_deps);

if (resubmitted) {
  reference_counter_->UpdateObjectPendingCreation(object_id, true);
  ...
}

The object being recovered has pending_creation set to true, and the flag is set back to false when ReferenceCounter::UpdateFinishedTaskReferences is called after the resubmitted task finishes (whether it succeeds or fails). However, if the task is an actor task and the actor is dead, ReferenceCounter::UpdateFinishedTaskReferences is called BEFORE ResubmitTask returns. In that case we set pending_creation to false first and then to true, which is wrong because it leaves the flag stuck at true. This PR fixes the issue by setting pending_creation to true BEFORE calling task_resubmitter_->ResubmitTask(task_id, &task_deps), so that if the actor is dead, pending_creation is set to false inside ResubmitTask().
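
The ordering problem can be reproduced in isolation. The following is a minimal sketch, not the actual Ray code: FakeReferenceCounter, ResubmitTaskToDeadActor, ReconstructOldOrder and ReconstructNewOrder are hypothetical names made up for illustration, and the rollback branch in the new order is an assumption rather than something stated in this PR. It only models the single pending_creation flag and a resubmit call that fails synchronously, as described for a dead actor.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for the reference counter: tracks one flag only.
struct FakeReferenceCounter {
  bool pending_creation = false;
  void UpdateObjectPendingCreation(bool pending) { pending_creation = pending; }
};

// Hypothetical stand-in for ResubmitTask on a dead actor: the task fails
// synchronously, so the "task finished" callback runs before this returns.
inline bool ResubmitTaskToDeadActor(const std::function<void()> &on_task_finished) {
  on_task_finished();
  return true;
}

// Old order: resubmit first, then mark the object as pending creation.
// Returns the final value of the flag.
inline bool ReconstructOldOrder() {
  FakeReferenceCounter rc;
  bool resubmitted = ResubmitTaskToDeadActor(
      [&rc] { rc.UpdateObjectPendingCreation(false); });
  if (resubmitted) {
    // Overwrites the "false" written by the already-finished task.
    rc.UpdateObjectPendingCreation(true);
  }
  return rc.pending_creation;
}

// New order: mark the object as pending creation first, then resubmit.
// Returns the final value of the flag.
inline bool ReconstructNewOrder() {
  FakeReferenceCounter rc;
  rc.UpdateObjectPendingCreation(true);
  bool resubmitted = ResubmitTaskToDeadActor(
      [&rc] { rc.UpdateObjectPendingCreation(false); });
  if (!resubmitted) {
    // Assumed rollback for the case where nothing was resubmitted.
    rc.UpdateObjectPendingCreation(false);
  }
  return rc.pending_creation;
}
```

In this model the old order ends with the flag still true even though the task has already finished, while the new order ends with it cleared.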

Related issue number

Closes #47606

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Sep 13, 2024
@jjyao jjyao marked this pull request as ready for review September 25, 2024 18:32
@jjyao
Collaborator Author

jjyao commented Sep 25, 2024

cc @Catch-Bull could you review the PR when you get a chance?

Contributor

@rkooo567 rkooo567 left a comment


Looks reasonable. Can we make sure that some release tests covering lineage reconstruction pass before merging?

src/ray/core_worker/test/direct_actor_transport_test.cc (outdated)
@jjyao jjyao assigned jjyao and unassigned Catch-Bull and rkooo567 Sep 26, 2024
// Post to the event loop to maintain the async nature of
// SubmitTask and avoid issues like
// https://github.com/ray-project/ray/issues/47606.
io_service_.post(
Collaborator Author


Ehh, we do have code that relies on the behavior that this piece of code is executed immediately instead of being posted to the event loop:

def is_shutdown(self) -> bool:
    """Return whether the proxy actor is shutdown.

    If the actor is dead, the health check will return RayActorError.
    """
    try:
        ray.get(self._actor_handle.check_health.remote(), timeout=0)
    except RayActorError:
        # The actor is dead, so it's ready for shutdown.
        return True

    # The actor is still alive, so it's not ready for shutdown.
    return False
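
To illustrate why posting to the event loop would change what a zero-timeout get observes, here is a minimal sketch. It is not Ray's submitter or its io_service_: TinyEventLoop, FakeSubmitter, ErrorVisibleImmediately and ErrorVisibleAfterLoopRuns are hypothetical names, and the "error stored" boolean is an assumed stand-in for the task's error becoming visible to the caller.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical minimal event loop; callbacks run only when RunAll() is called.
class TinyEventLoop {
 public:
  void Post(std::function<void()> fn) { queue_.push(std::move(fn)); }
  void RunAll() {
    while (!queue_.empty()) {
      auto fn = std::move(queue_.front());
      queue_.pop();
      fn();
    }
  }

 private:
  std::queue<std::function<void()>> queue_;
};

// Hypothetical submitter that fails a task sent to a dead actor either
// inline or by posting the failure to the event loop.
struct FakeSubmitter {
  TinyEventLoop loop;
  bool error_stored = false;

  void SubmitToDeadActor(bool post_to_loop) {
    auto fail_task = [this] { error_stored = true; };
    if (post_to_loop) {
      loop.Post(fail_task);  // failure becomes visible only after the loop runs
    } else {
      fail_task();  // failure is visible as soon as submission returns
    }
  }
};

// Is the error visible right after submission returns (the timeout=0 case)?
inline bool ErrorVisibleImmediately(bool post_to_loop) {
  FakeSubmitter submitter;
  submitter.SubmitToDeadActor(post_to_loop);
  return submitter.error_stored;
}

// Is the error visible once the event loop has drained?
inline bool ErrorVisibleAfterLoopRuns(bool post_to_loop) {
  FakeSubmitter submitter;
  submitter.SubmitToDeadActor(post_to_loop);
  submitter.loop.RunAll();
  return submitter.error_stored;
}
```

Under these assumptions, a caller that checks the result immediately after submission sees the error only in the inline case, which is the behavior the health check above depends on.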

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Contributor

@rkooo567 rkooo567 left a comment


I think the new approach makes sense.

@jjyao jjyao merged commit 1003da0 into ray-project:master Sep 27, 2024
4 of 5 checks passed
@jjyao jjyao deleted the jjyao/hhang branch September 27, 2024 16:41
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
…ay-project#47645)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
go add ONLY when ready to merge, run all tests
Development

Successfully merging this pull request may close these issues.

[Core][Object Store] During the object reconstruction, it is hang on arguments pending creation.
3 participants