-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core] Fix check failure RAY_CHECK(it != current_tasks_.end()); #47659
Conversation
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
This reverts commit 2acc718.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the approach lgtm. lmk when I can review the PR!
src/ray/core_worker/transport/out_of_order_actor_scheduling_queue.cc
Outdated
Show resolved
Hide resolved
/// This can happen if transient network error happens after an actor | ||
/// task is submitted and recieved by the actor and the caller retries | ||
/// the same task. | ||
absl::flat_hash_map<TaskID, InboundRequest> queued_actor_tasks_ ABSL_GUARDED_BY(mu_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to change pending_task_id_to_is_canceled value to a state enum.
enum TaskState {
// Waiting for deps, can be cancelled.
// Invariant: queued_actor_tasks_[task_id] does not exist.
WAITING_FOR_DEPS = 0;
// Waiting for deps, but cancelled by user.
CANCELLED_BY_USER;
// Waiting for deps, but cancelled by newer attempts.
CANCELLED_BY_NEW_ATTEMPTS;
// Running, can't be cancelled.
RUNNING;
};
On cancel:
- case WAITING_FOR_DEPS, CANCELLED_BY_NEW_ATTEMPTS: change to CANCELLED_BY_USER, cancel queued_actor_tasks_[task_id]
- else: cancel queued_actor_tasks_[task_id]
On new attempt Add:
- case WAITING_FOR_DEPS: change to CANCELLED_BY_NEW_ATTEMPTS, add new attempt to queued_actor_tasks_
- case CANCELLED_BY_USER: also cancel new attempt.
- else: add new attempt to queued_actor_tasks_
On dep resolved:
- case WAITING_FOR_DEPS: run
- case CANCELLED_*: Cancel(reason)
- case RUNNING: check fail
On run finished:
- case RUNNING: erase the state, wait-dep for queued_actor_tasks_[task_id] if any
- else: check fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pending_task_id_to_is_canceled
cancel here only means user triggered cancellation.
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
…project#47659) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com> Signed-off-by: ujjawal-khare <ujjawal.khare@dream11.com>
Why are these changes needed?
The check failure can happen under the following sequence of events:
CoreWorker.current_tasks_
.CoreWorker.current_tasks_
.CoreWorker.current_tasks_
again:This PR fixes the issue by making sure different attempts of the same task are executed sequentially. The reason why we don't support running them in parallel is that it's not safe to assume user's code can handle concurrent execution of the same actor method.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.