[WIP] Schedule actor creation like a regular task. #1351

robertnishihara · 2017-12-20T07:15:09Z

Creating this PR in order to get feedback.

This PR does the following:

To create an actor, we submit an actor creation task, which is scheduled like a normal task.
When a local scheduler executes an actor creation task, it basically just starts a worker.
This enables actor placement to take into account arbitrary resource requirements.
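
To make the placement idea above concrete, here is a minimal sketch, not the actual Ray scheduler code. The names `Task`, `LocalScheduler`, and `schedule` are hypothetical and exist only for illustration. It assumes each local scheduler advertises a dictionary of available resources and that an actor creation task carries resource requirements like any other task.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    # Hypothetical task record; only the fields needed for placement.
    task_id: str
    resources: Dict[str, float]
    is_actor_creation: bool = False


@dataclass
class LocalScheduler:
    # Hypothetical stand-in for a local scheduler and its free resources.
    name: str
    available: Dict[str, float]
    workers: List[str] = field(default_factory=list)

    def can_run(self, task: Task) -> bool:
        # A task fits if every requested resource is available in full.
        return all(
            self.available.get(name, 0.0) >= quantity
            for name, quantity in task.resources.items()
        )

    def execute(self, task: Task) -> None:
        # Reserve the resources, exactly as for a regular task.
        for name, quantity in task.resources.items():
            self.available[name] -= quantity
        if task.is_actor_creation:
            # Executing an actor creation task just starts a worker.
            self.workers.append(f"worker-for-{task.task_id}")


def schedule(task: Task, schedulers: List[LocalScheduler]) -> Optional[LocalScheduler]:
    # Same placement path for regular tasks and actor creation tasks:
    # pick the first local scheduler whose resources satisfy the request.
    for scheduler in schedulers:
        if scheduler.can_run(task):
            scheduler.execute(task)
            return scheduler
    return None
```

With this shape, an actor that needs a GPU lands on a node that has one without any actor-specific placement logic.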

The main question is how we should handle fault tolerance, e.g., what should trigger the re-execution of an actor creation task. Should it be triggered by the usual fault tolerance mechanism? Or should it be triggered by the monitor when it detects that a local scheduler has died (as is currently done)? The latter is a bit trickier to support, since we may need to trigger reconstruction of the actor task before it is known which local scheduler was going to create the actor. The former will require some changes as well: the local schedulers need to be notified that an actor no longer exists, or that a local scheduler has died, so that they can stop submitting actor tasks directly to that local scheduler.

cc @stephanie-wang

AmplabJenkins · 2017-12-20T09:19:16Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-12-20T09:19:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2860/
Test FAILed.

stephanie-wang · 2017-12-20T20:34:57Z

How does this sound?

The monitor does nothing for now.

When a local scheduler fails to schedule an actor task on another local scheduler (in give_task_to_local_scheduler_retry), it checks if the destination local scheduler is dead. If yes, it calls reconstruct_object on the task's dependencies (possibly only the dummy one) and then caches the task in cached_submitted_actor_tasks. Else, it follows the existing codepath to retry the task table message. Exactly one of the submitting local schedulers should succeed in the reconstruction back to the initial creation task.
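
A rough sketch of that branch, in Python for readability rather than the local scheduler's own code. `on_handoff_failure`, `dead_local_schedulers`, and `retry_handoff` are hypothetical names; `reconstruct_object` and `cached_submitted_actor_tasks` follow the names used above, but their signatures here are assumptions.

```python
from typing import Callable, Dict, List, Set


def on_handoff_failure(
    task: Dict,
    destination: str,
    dead_local_schedulers: Set[str],
    cached_submitted_actor_tasks: List[Dict],
    reconstruct_object: Callable[[str], None],
    retry_handoff: Callable[[Dict, str], None],
) -> str:
    """Decide what to do after failing to hand an actor task to `destination`."""
    if destination in dead_local_schedulers:
        # The destination is gone: ask for the task's dependencies to be
        # rebuilt (possibly only the dummy one), which should lead back to
        # the initial actor creation task.
        for object_id in task["dependencies"]:
            reconstruct_object(object_id)
        # Hold on to the task until a new actor creation notification arrives.
        cached_submitted_actor_tasks.append(task)
        return "cached"
    # The destination is still alive: follow the existing retry path.
    retry_handoff(task, destination)
    return "retried"
```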

The one thing I am not sure about is what happens if somehow all reconstruction attempts fail. Then a bunch of tasks will be cached in cached_submitted_actor_tasks, but their dependencies will never be rebuilt, so there will never be a new actor creation notification. I'm still trying to figure out if this could ever happen. If it does, a possible fix would be to call reconstruction on tasks in cached_submitted_actor_tasks on a timeout, similar to normal tasks, but I'm worried this might cause even more problems.

robertnishihara · 2017-12-20T21:40:46Z

cc @atumanov in case you have thoughts about the design here :)

robertnishihara · 2017-12-20T23:08:47Z

@stephanie-wang All of the reconstruction attempts could probably fail (it'd be more robust to have some retry mechanism). Also, we really only need to rerun the actor creation task in give_task_to_local_scheduler_retry.

We could do the following:

Add the task ID of the relevant actor creation task to every actor task spec.
Every once in a while, loop over all cached actor tasks and reissue the relevant actor creation tasks.
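
Roughly, as a sketch in Python (the names `ActorTaskSpec`, `actor_creation_task_id`, and `reissue_creation_tasks` are made up for illustration, and `resubmit` stands in for whatever call actually reissues a task):

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ActorTaskSpec:
    # Hypothetical task spec; the last field is the proposed addition.
    task_id: str
    actor_id: str
    actor_creation_task_id: str


def reissue_creation_tasks(
    cached_actor_tasks: Iterable[ActorTaskSpec],
    resubmit: Callable[[str], None],
) -> List[str]:
    """One sweep over the cached actor tasks.

    Reissues each distinct actor creation task at most once per sweep,
    no matter how many cached tasks belong to the same actor.
    """
    seen = set()
    reissued = []
    for spec in cached_actor_tasks:
        creation_id = spec.actor_creation_task_id
        if creation_id in seen:
            continue
        seen.add(creation_id)
        resubmit(creation_id)
        reissued.append(creation_id)
    return reissued
```

The sweep would be driven by a periodic timer in the local scheduler. Deduplicating within a sweep only limits the work done by one local scheduler; it does not stop several local schedulers from reissuing the same creation task.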

AmplabJenkins · 2017-12-20T23:39:22Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-12-20T23:39:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2867/
Test FAILed.

stephanie-wang · 2017-12-21T01:36:15Z

Hmm yeah, that sounds good.

One thing I would add to that is a check in give_task_to_local_scheduler_retry to see if the currently assigned local scheduler is dead. If yes, then cache the task and don't try again. Else, retry the request. This will help get rid of all the warning messages that we see right now when a local scheduler dies and the other local schedulers just keep continually retrying.

Also, be careful with reconstruction suppression. We only want to submit the actor creation task once. Ideally, we would reconstruct the result of the initial creation task, instead of directly resubmitting it, so we can reuse the current suppression mechanism.
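
To illustrate the "submit only once" property, here is a toy sketch of a suppression check. This is not the existing suppression mechanism; `SuppressionTable` and `try_claim` are hypothetical, and an in-process lock stands in for whatever atomic test-and-set the shared state actually provides.

```python
import threading
from typing import Dict


class SuppressionTable:
    """Toy shared table: the first caller to claim an object reconstructs it."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._claimed_by: Dict[str, str] = {}

    def try_claim(self, object_id: str, scheduler_id: str) -> bool:
        # Atomic test-and-set: succeeds only for the first claimant.
        with self._lock:
            if object_id in self._claimed_by:
                return False
            self._claimed_by[object_id] = scheduler_id
            return True


def maybe_reconstruct(table: SuppressionTable, object_id: str, scheduler_id: str) -> bool:
    # Only the local scheduler that wins the claim resubmits the task that
    # produced `object_id` (here, the initial actor creation task).
    return table.try_claim(object_id, scheduler_id)
```

If several local schedulers all ask to reconstruct the result of the same creation task, only one of them gets `True` back, so the creation task is submitted once.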

AmplabJenkins · 2017-12-26T22:59:19Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-12-26T22:59:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2972/
Test FAILed.

AmplabJenkins · 2017-12-30T07:09:15Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-30T07:09:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3015/
Test FAILed.

AmplabJenkins · 2017-12-30T07:34:16Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-30T07:34:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3017/
Test FAILed.

AmplabJenkins · 2017-12-30T07:44:19Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-30T07:44:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3018/
Test FAILed.

AmplabJenkins · 2017-12-31T02:44:19Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-31T02:44:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3031/
Test FAILed.

AmplabJenkins · 2017-12-31T02:54:19Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-31T02:54:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3032/
Test FAILed.

AmplabJenkins · 2017-12-31T03:14:19Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-31T03:14:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3033/
Test FAILed.

AmplabJenkins · 2017-12-31T05:59:20Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-31T05:59:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3035/
Test FAILed.

AmplabJenkins · 2017-12-31T06:49:21Z

Build finished. Test FAILed.

AmplabJenkins · 2017-12-31T06:49:22Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3038/
Test FAILed.

AmplabJenkins · 2017-12-31T07:49:16Z

Merged build finished. Test FAILed.

AmplabJenkins · 2017-12-31T07:49:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3039/
Test FAILed.

AmplabJenkins · 2018-01-01T07:59:18Z

Build finished. Test FAILed.

AmplabJenkins · 2018-01-01T07:59:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3048/
Test FAILed.

AmplabJenkins · 2018-01-01T08:09:18Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-01T08:09:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3049/
Test FAILed.

AmplabJenkins · 2018-01-02T02:14:18Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-02T02:14:19Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3063/
Test FAILed.

AmplabJenkins · 2018-01-02T05:39:13Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-02T05:39:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3076/
Test FAILed.

AmplabJenkins · 2018-01-02T05:54:13Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-02T05:54:14Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3080/
Test FAILed.

AmplabJenkins · 2018-01-02T06:14:17Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-02T06:14:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3070/
Test FAILed.

AmplabJenkins · 2018-01-02T06:30:28Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-02T06:30:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3083/
Test FAILed.

AmplabJenkins · 2018-01-04T22:09:17Z

Merged build finished. Test FAILed.

AmplabJenkins · 2018-01-04T22:09:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3117/
Test FAILed.

robertnishihara · 2018-02-22T04:31:21Z

Closing for now. I'll do a fresh implementation of this soon.

robertnishihara mentioned this pull request Dec 21, 2017

Keep track of which workers have never been used. #1359

Closed

Handle actor creation as a task.

ca4eda3

fixes to get reconstruction working

81d8738

robertnishihara added 2 commits January 1, 2018 20:13

some fixes

7602e26

Fix

9f4d279

Fix

d0c5c38

fix

2a26431

fix

9e537c2

robertnishihara closed this Feb 22, 2018

robertnishihara mentioned this pull request Mar 7, 2018

Treat actor creation like a regular task. #1668

Merged

7 tasks
