@robertnishihara
Collaborator

Creating this PR in order to get feedback.

This PR does the following:

  • To create an actor, we submit an actor creation task, which is scheduled like a normal task.
  • When a local scheduler executes an actor creation task, it basically just starts a worker.
  • This enables actor placement to take into account arbitrary resource requirements.
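
To illustrate the placement point, here is a minimal Python sketch, not the code in this PR. The names `LocalScheduler`, `fits`, and `place_actor_creation_task` are hypothetical, as is the first-fit policy; the only idea taken from the description above is that an actor creation task carries resource requirements and is placed by the same kind of check a normal task would go through.

```python
from dataclasses import dataclass, field


@dataclass
class LocalScheduler:
    """Hypothetical stand-in for a local scheduler and its free resources."""
    name: str
    available: dict = field(default_factory=dict)


def fits(scheduler, required):
    """Return True if every required resource is available in sufficient quantity."""
    return all(scheduler.available.get(k, 0) >= v for k, v in required.items())


def place_actor_creation_task(schedulers, required):
    """Pick the first scheduler that satisfies the requirements, or None.

    First-fit is an assumption made for this sketch only.
    """
    for scheduler in schedulers:
        if fits(scheduler, required):
            # Reserve the resources on the chosen scheduler.
            for k, v in required.items():
                scheduler.available[k] -= v
            return scheduler
    return None
```

For example, with one CPU-only scheduler and one scheduler that also has GPUs, a creation task requiring `{"CPU": 1, "GPU": 1}` would land on the second one.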

The main question is how we should handle fault tolerance. For example, what should trigger the re-execution of an actor creation task? Should it be triggered by the usual fault tolerance mechanism? Or should it be triggered by the monitor when it detects that a local scheduler has died (as is currently done)? The latter is a bit trickier to support, since we may need to trigger reconstruction of the actor creation task before it is known which local scheduler was going to create the actor. The former will require some change as well: the local schedulers need to be notified that an actor no longer exists, or that a local scheduler has died, so that they can stop submitting actor tasks directly to that local scheduler.

cc @stephanie-wang

@AmplabJenkins

Merged build finished. Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2860/
Test FAILed.

@stephanie-wang
Contributor

How does this sound?

The monitor does nothing for now.

When a local scheduler fails to schedule an actor task on another local scheduler (in give_task_to_local_scheduler_retry), it checks if the destination local scheduler is dead. If yes, it calls reconstruct_object on the task's dependencies (possibly only the dummy one) and then caches the task in cached_submitted_actor_tasks. Else, it follows the existing codepath to retry the task table message. Exactly one of the submitting local schedulers should succeed in the reconstruction back to the initial creation task.
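
A minimal Python sketch of that branch, under stated assumptions: the function name `handle_failed_actor_task`, the dictionary-based task, and the `reconstruct_object` and `retry` callbacks passed in as arguments are all hypothetical stand-ins, not the real local scheduler code. Only the control flow (dead destination: reconstruct dependencies, then cache; live destination: retry) comes from the proposal above.

```python
def handle_failed_actor_task(task, destination_alive, reconstruct_object,
                             cached_submitted_actor_tasks, retry):
    """Decide what to do after failing to hand an actor task to its destination.

    All parameters are hypothetical stand-ins for this sketch:
      task: dict with a "dependencies" list of object IDs.
      destination_alive: whether the destination local scheduler is alive.
      reconstruct_object: callback that requests reconstruction of one object.
      cached_submitted_actor_tasks: list acting as the cache of waiting tasks.
      retry: callback that retries the task table message.
    """
    if not destination_alive:
        # Ask for every dependency (possibly only the dummy one) to be rebuilt.
        for object_id in task["dependencies"]:
            reconstruct_object(object_id)
        # Park the task until a new actor creation notification arrives.
        cached_submitted_actor_tasks.append(task)
        return "cached"
    # Destination is still alive: follow the existing retry path.
    retry(task)
    return "retried"
```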

The one thing I am not sure about is what happens if all reconstruction attempts somehow fail. In that case, a bunch of tasks will be cached in cached_submitted_actor_tasks, but their dependencies will never be rebuilt, so there will never be a new actor creation notification. I'm still trying to figure out if this could ever happen. If it does, a possible fix would be to call reconstruction on the tasks in cached_submitted_actor_tasks on a timeout, similar to normal tasks, but I'm worried this might cause even more problems.

@robertnishihara
Collaborator Author

cc @atumanov in case you have thoughts about the design here :)

@robertnishihara
Collaborator Author

@stephanie-wang All of the reconstruction attempts could probably fail (it'd be more robust to have some retry mechanism). Also, we really only need to rerun the actor creation task in give_task_to_local_scheduler_retry.

We could do the following:

  1. Add the task ID of the relevant actor creation task to every actor task spec.
  2. Every once in a while, loop over all cached actor tasks and reissue the relevant actor creation tasks.
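
The two steps above can be sketched roughly as follows. This is an illustration only: the field name `actor_creation_task_id`, the function `reissue_creation_tasks`, and the `submit_creation_task` callback are hypothetical, and deduplicating within a single pass is an assumption added here, not something the proposal specifies.

```python
def reissue_creation_tasks(cached_actor_tasks, submit_creation_task):
    """Run one pass of the periodic loop over cached actor tasks.

    cached_actor_tasks: iterable of dicts, each carrying the (hypothetical)
        "actor_creation_task_id" field added to the actor task spec in step 1.
    submit_creation_task: hypothetical callback that reissues a creation task.

    Returns the set of creation task IDs that were reissued in this pass.
    """
    reissued = set()
    for task in cached_actor_tasks:
        creation_id = task["actor_creation_task_id"]
        # Many cached tasks can belong to the same actor; reissue once per pass.
        if creation_id not in reissued:
            reissued.add(creation_id)
            submit_creation_task(creation_id)
    return reissued
```

A caller would invoke this from whatever timer the local scheduler already uses; the interval is left open here, as it is in the proposal.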

@stephanie-wang
Contributor

Hmm yeah, that sounds good.

One thing I would add to that is a check in give_task_to_local_scheduler_retry to see if the currently assigned local scheduler is dead. If yes, cache the task and don't try again. Else, retry the request. This will help get rid of all the warning messages that we see right now when a local scheduler dies and the other local schedulers just keep retrying continually.

Also, be careful with reconstruction suppression. We only want to submit the actor creation task once. Ideally, we would reconstruct the result of the initial creation task, instead of directly resubmitting it, so we can reuse the current suppression mechanism.
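
To make the "submit only once" requirement concrete, here is a small Python sketch of a first-claim-wins check. It is not the existing suppression mechanism; `ReconstructionTable` and `try_claim` are hypothetical names, and an in-process lock stands in for whatever shared state the real system would use. The point is only that when several local schedulers ask to reconstruct the same creation-task output, exactly one of them should proceed.

```python
import threading


class ReconstructionTable:
    """Hypothetical record of who is reconstructing which object."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = {}  # object ID -> ID of the scheduler that claimed it

    def try_claim(self, object_id, scheduler_id):
        """Return True only for the first scheduler to claim this object."""
        with self._lock:
            if object_id in self._owner:
                # Someone else is already reconstructing it: suppress.
                return False
            self._owner[object_id] = scheduler_id
            return True
```

A scheduler would resubmit the actor creation task only when `try_claim` returns True for the creation task's output object.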

@robertnishihara
Collaborator Author

Closing for now. I'll do a fresh implementation of this soon.
