[Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); #47659

Merged
merged 30 commits into from
Oct 9, 2024

Conversation

jjyao
Collaborator

@jjyao jjyao commented Sep 13, 2024

Why are these changes needed?

The check failure can happen under the following sequence of events:

  1. An async or threaded actor is launched.
  2. An actor task is submitted.
  3. The actor task is received by the actor and starts executing. The task_id is added to CoreWorker.current_tasks_.
  4. A transient network error happens and the caller retries the actor task.
  5. The second attempt of the actor task is received by the actor and is being executed.
  6. The first attempt of the actor task finishes and the task_id is removed from CoreWorker.current_tasks_.
  7. The second attempt of the actor task finishes and the check failure happens when we try to erase the same task_id from CoreWorker.current_tasks_ again:
auto it = current_tasks_.find(task_spec.TaskId());
RAY_CHECK(it != current_tasks_.end());
current_tasks_.erase(it);
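
The sequence above can be replayed with a small sketch. It assumes a container keyed by task id only, which is what the snippet suggests; `CurrentTasks`, `OnStart`, `OnFinish`, and `ReplayDoubleFinish` are hypothetical names for illustration and are not part of Ray.

```cpp
#include <string>
#include <unordered_set>
#include <utility>

// Hypothetical stand-in for CoreWorker.current_tasks_: keyed by task id only,
// so two attempts of the same task share a single entry.
struct CurrentTasks {
  std::unordered_set<std::string> ids;

  void OnStart(const std::string &task_id) { ids.insert(task_id); }

  // Returns false where the real code would fail RAY_CHECK(it != end()).
  bool OnFinish(const std::string &task_id) {
    auto it = ids.find(task_id);
    if (it == ids.end()) {
      return false;
    }
    ids.erase(it);
    return true;
  }
};

// Replays steps 3 to 7; returns {first finish ok, second finish ok}.
std::pair<bool, bool> ReplayDoubleFinish() {
  CurrentTasks tasks;
  tasks.OnStart("task-1");  // step 3: first attempt starts
  tasks.OnStart("task-1");  // step 5: retry starts, the insert is a no-op
  const bool first_ok = tasks.OnFinish("task-1");   // step 6
  const bool second_ok = tasks.OnFinish("task-1");  // step 7: entry is already gone
  return {first_ok, second_ok};
}
```

The second `OnFinish` call finds no entry, which corresponds to the failing check.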

This PR fixes the issue by making sure that different attempts of the same task are executed sequentially. We don't support running them in parallel because it is not safe to assume that user code can handle concurrent execution of the same actor method.
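
A minimal single-threaded sketch of the idea of serializing attempts follows. It is not the code in this PR: `AttemptSerializer`, `Submit`, `Finish`, and `ReplayRetry` are hypothetical names, and the sketch assumes that at most one attempt per task id is queued, with a newer attempt replacing an older queued one.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical helper: lets only one attempt per task id run at a time.
class AttemptSerializer {
 public:
  // Returns true if the attempt may start now, false if it was queued.
  bool Submit(const std::string &task_id, int attempt) {
    if (running_.count(task_id) > 0) {
      // Assumption: a newer attempt replaces any older queued attempt.
      queued_[task_id] = attempt;
      return false;
    }
    running_.insert(task_id);
    return true;
  }

  // Marks the running attempt as finished. Returns the queued attempt that
  // should start next, or -1 if nothing is queued for this task id.
  int Finish(const std::string &task_id) {
    auto it = queued_.find(task_id);
    if (it == queued_.end()) {
      running_.erase(task_id);
      return -1;
    }
    const int next = it->second;
    queued_.erase(it);
    return next;  // task_id stays in running_ for the next attempt
  }

 private:
  std::unordered_set<std::string> running_;
  std::unordered_map<std::string, int> queued_;
};

// Replays the retry scenario and records the order in which attempts start.
std::vector<int> ReplayRetry() {
  AttemptSerializer serializer;
  std::vector<int> start_order;
  if (serializer.Submit("task-1", 0)) start_order.push_back(0);
  if (serializer.Submit("task-1", 1)) start_order.push_back(1);  // queued instead
  const int next = serializer.Finish("task-1");  // attempt 0 finishes
  if (next >= 0) start_order.push_back(next);    // attempt 1 starts only now
  serializer.Finish("task-1");                   // attempt 1 finishes
  return start_order;
}
```

In this sketch the second attempt never overlaps with the first, so each start is paired with exactly one finish.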

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
This reverts commit 2acc718.
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label Sep 19, 2024
@jjyao jjyao changed the title [Core] Detect old client [Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); Sep 22, 2024
Contributor

@rkooo567 rkooo567 left a comment

the approach lgtm. lmk when I can review the PR!

jjyao added 3 commits October 4, 2024 12:34
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
@jjyao jjyao marked this pull request as ready for review October 4, 2024 22:28
@jjyao jjyao requested a review from rkooo567 October 4, 2024 22:54
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
jjyao added 2 commits October 8, 2024 07:34
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
/// This can happen if a transient network error occurs after an actor
/// task is submitted and received by the actor, and the caller retries
/// the same task.
absl::flat_hash_map<TaskID, InboundRequest> queued_actor_tasks_ ABSL_GUARDED_BY(mu_);
Contributor

We may want to change pending_task_id_to_is_canceled value to a state enum.

enum TaskState {
  // Waiting for deps, can be cancelled.
  // Invariant: queued_actor_tasks_[task_id] does not exist.
  WAITING_FOR_DEPS = 0,
  // Waiting for deps, but cancelled by user.
  CANCELLED_BY_USER,
  // Waiting for deps, but cancelled by newer attempts.
  CANCELLED_BY_NEW_ATTEMPTS,
  // Running, can't be cancelled.
  RUNNING,
};

On cancel:

  • case WAITING_FOR_DEPS, CANCELLED_BY_NEW_ATTEMPTS: change to CANCELLED_BY_USER, cancel queued_actor_tasks_[task_id]
  • else: cancel queued_actor_tasks_[task_id]

On new attempt Add:

  • case WAITING_FOR_DEPS: change to CANCELLED_BY_NEW_ATTEMPTS, add new attempt to queued_actor_tasks_
  • case CANCELLED_BY_USER: also cancel new attempt.
  • else: add new attempt to queued_actor_tasks_

On dep resolved:

  • case WAITING_FOR_DEPS: run
  • case CANCELLED_*: Cancel(reason)
  • case RUNNING: check fail

On run finished:

  • case RUNNING: erase the state, wait-dep for queued_actor_tasks_[task_id] if any
  • else: check fail
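
The transitions proposed above could be sketched as pure functions over the state. This is only an illustration of the reviewer's proposal, not code from the PR: the function names and `NewAttemptResult` are hypothetical, `std::logic_error` stands in for a check failure, and the sketch models only the state, leaving `queued_actor_tasks_` bookkeeping to the caller.

```cpp
#include <stdexcept>

enum class TaskState {
  kWaitingForDeps,
  kCancelledByUser,
  kCancelledByNewAttempts,
  kRunning,
};

// "On cancel": a user cancel overrides a waiting or superseded attempt.
// Other states are left unchanged; the caller cancels the queued attempt.
TaskState OnCancel(TaskState s) {
  if (s == TaskState::kWaitingForDeps ||
      s == TaskState::kCancelledByNewAttempts) {
    return TaskState::kCancelledByUser;
  }
  return s;
}

struct NewAttemptResult {
  TaskState state;         // state of the existing attempt afterwards
  bool queue_new_attempt;  // false means the new attempt is cancelled too
};

// "On new attempt Add".
NewAttemptResult OnNewAttempt(TaskState s) {
  switch (s) {
    case TaskState::kWaitingForDeps:
      return {TaskState::kCancelledByNewAttempts, true};
    case TaskState::kCancelledByUser:
      return {s, false};
    default:
      return {s, true};
  }
}

// "On dep resolved": returns true if the attempt should run, false if it
// was cancelled. Throws where the proposal says "check fail".
bool OnDepsResolved(TaskState &s) {
  switch (s) {
    case TaskState::kWaitingForDeps:
      s = TaskState::kRunning;
      return true;
    case TaskState::kRunning:
      throw std::logic_error("deps resolved for a task that is already running");
    default:
      return false;
  }
}

// "On run finished": only valid from kRunning. The caller then erases the
// state and starts waiting on deps for any queued attempt.
void OnRunFinished(TaskState s) {
  if (s != TaskState::kRunning) {
    throw std::logic_error("run finished for a task that is not running");
  }
}
```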

Collaborator Author

In pending_task_id_to_is_canceled, "canceled" only means user-triggered cancellation.

@jjyao jjyao merged commit b69b929 into ray-project:master Oct 9, 2024
4 of 5 checks passed
@jjyao jjyao deleted the jjyao/cheek branch October 9, 2024 17:28
ujjawal-khare pushed a commit to ujjawal-khare-27/ray that referenced this pull request Oct 15, 2024
…project#47659)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Labels
go add ONLY when ready to merge, run all tests
3 participants