Conversation

@robertnishihara (Collaborator) commented Mar 7, 2018

This PR treats actor creation like a regular task. That is, when we attempt to create an actor, an actor creation task is scheduled and executed (and when it executes on a worker, it turns that worker into an actor).

#1351 was an earlier attempt at this.

Changes to actor resource handling:

  • If an actor is declared with no resources in the decorator, then the actor creation task requires no resources, and each method invocation requires 1 CPU.
  • If an actor is declared with some resources in the decorator, then those resources are required by the actor creation task and are acquired for the actor's lifetime, and no resources are associated with the actor methods (see the sketch after this list).
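
A minimal sketch of the two cases (illustrative only; it assumes the standard @ray.remote decorator API):

import ray

# Case 1: no resources in the decorator. The actor creation task requires
# no resources, and each method invocation requires 1 CPU while it runs.
@ray.remote
class Counter(object):
    def increment(self):
        return 1

# Case 2: resources in the decorator. The actor creation task requires
# 1 CPU and 1 GPU, which are held for the actor's lifetime; no resources
# are associated with the method invocations themselves.
@ray.remote(num_cpus=1, num_gpus=1)
class GpuWorker(object):
    def work(self):
        return "done"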

Remaining TODO:

  • Fix give_task_to_local_scheduler_retry as described in #1351 (comment).
  • Figure out how to handle the remaining reconstruction issues, that is, how to reconstruct actor tasks that are cached in a local scheduler.
  • Make sure only "unused" workers are turned into actors (decided to hold off on this for now).
  • Make sure load balancing does something approximately reasonable in the case where actor creation tasks require no CPUs (also make sure they will not be run on machines with no CPUs).
    • e.g., always send actor creation tasks to the global scheduler (decided not to do this)
    • e.g., forbid the global scheduler from assigning anything to a local scheduler with 0 CPUs
  • Clean up the PR.

Incidentally, this should fix #1716.

Contributor

nice

Collaborator Author

This is kind of silly, but when I need to remember to do something, I introduce linting errors so that Travis will remind me.

@AmplabJenkins posted CI results (access rights to CI server needed):

  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4176/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4190/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4194/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4211/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4216/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4212/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4220/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4237/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4238/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4243/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4256/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4263/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4262/
  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4266/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4271/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4277/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4279/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4315/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4319/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4320/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4321/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4322/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4326/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4334/

resources={"CPU": actor_method_cpus},
max_calls=0))

if actor_creation_resources is not None:
Contributor

What are the cases when this is not None?

Collaborator Author

Hm.. some of this needs to be rethought, but basically, submit_task looks up the task resource requirements based on the info in worker.function_properties, so we need to fill that out so that submit_task does the right thing. But maybe the right thing is for resources to be passed into submit_task instead.

Basically, whenever a worker might submit an actor creation task, register_actor_signatures needs to have been called with actor_creation_resources not equal to None.
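
Conceptually, the lookup looks something like this (a hypothetical sketch that mirrors the names in this discussion, not the actual implementation):

# Hypothetical sketch: submit_task reads the resource requirements out of
# worker.function_properties, so register_actor_signatures must have
# populated that table (with actor_creation_resources) before submission.
def lookup_task_resources(worker, driver_id, function_id):
    properties = worker.function_properties[driver_id][function_id]
    return properties.resources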

Contributor

Hmm, are there cases where it is equal to None then?

Collaborator Author

When a worker turns into an actor, it calls register_actor_signatures, and the actor_creation_resources data isn't really available at that point. I could pass it in by storing it in redis or through some other mechanism, but it seemed unnecessary. If this sounds a little strange, part of the issue is that there is a difference in the data that is needed at function invocation time versus at function execution time, but we're using worker.function_properties to store both things.

actor_method_num_return_vals, method_signatures,
checkpoint_interval, class_name,
actor_creation_dummy_object_id,
actor_creation_resources, actor_method_cpus):
Contributor

Do we need these extra fields in the serialized handle? I can see how we need the actor_creation_dummy_object_id, but not sure about the others.

(Actually on that note we probably can get rid of some of the other information here, like the checkpoint_interval.)

Collaborator Author

Hmm.. it's possibly not all needed. I can look through and get rid of the extras if you prefer.

I'm also planning on doing a follow-up PR to try to simplify some of this code after this PR and the join consistency PR are merged.

Contributor

Okay, maybe just leave a TODO about removing some of these fields then? I think we should be careful about adding too many fields to actor handles.

Collaborator Author

Yep, will add that.

if (ray.worker.global_worker.connected and
self._ray_actor_handle_id.id() == ray.worker.NIL_ACTOR_ID):
# TODO(rkn): Should we be passing in the actor cursor as a
# dependency here?
Contributor

I think it's okay for now, since we want the __ray_terminate__ task to get scheduled immediately.

Collaborator Author

Ok sounds good. Do you want me to remove the comment? By default I would leave it there because I think we may want to revisit this in the future.

Contributor

Nah, let's leave the comment.

log.warn("Failed to remove object location for "
"dead plasma manager.")

# TODO(rkn): Analogously to the above loop, we may want to remove
Contributor

This should actually be okay since dummy object IDs are only known by the local scheduler, so the object manager never adds them to the object table.

Collaborator Author

Good point. I'll remove this.

log.info(
"Driver {} has been removed.".format(binary_to_hex(driver_id)))

# Get a list of the local schedulers that have not been deleted.
Contributor

Yay :)

(algorithm_state->available_workers.size() > 0) &&
can_run(algorithm_state, execution_spec)) {
can_run(algorithm_state, execution_spec) &&
state->static_resources["CPU"] != 0) {
Contributor

Might make more sense to move this logic into resource_constraints_satisfied. Then we can also add logic to make sure that it only happens for actor creation tasks.

Collaborator Author

fixed


if (state->static_resources["CPU"] == 0) {
// Give the task to the global scheduler to schedule.
give_task_to_global_scheduler(state, algorithm_state, execution_spec);
Contributor

Is this statement reachable if the global scheduler never assigns actor creation tasks to a node with no CPUs? If it isn't, let's try to avoid the state->static_resources["CPU"] == 0 checks, since they're a little cryptic if you don't know about the special case for actor resources.

Collaborator Author

Yeah, I think it's not possible. What if I change this to

// The global scheduler should never assign a task to a machine with 0 CPUs.
RAY_CHECK(state->static_resources["CPU"] != 0);

Contributor

Sure, that sounds good.

*/
bool constraints_satisfied_hard(const LocalScheduler *scheduler,
const TaskSpec *spec) {
if (scheduler->info.static_resources.at("CPU") == 0) {
Contributor

Maybe not in this PR, but it would be great if we could store the error for when there are no feasible nodes as part of the return value for the task. How doable is that?

Collaborator Author

Hm, what do you mean by "as part of the return value for the task"? E.g., store an exception object in the object store so that ray.get on the return value raises an exception? I think something like this would be useful. Right now we just print an error to STDERR at

RAY_LOG(ERROR) << "Infeasible task. No nodes satisfy hard constraints for "
               << "task = " << Task_task_id(task);

It's possible for the lack of feasible nodes to be transient (e.g., if more nodes are being added to the cluster or some were just slow to start up).

Contributor

Hmm, I didn't think about more nodes getting added to the cluster. Yeah, I was thinking about the possibility of storing the exception, but I agree, I think the printed error is good then.

pid = ray.get(a.getpid.remote())

# Make sure that we can't create another actor.
with self.assertRaises(Exception):
Contributor

Should we propagate an error if we can't find enough resources to create an actor?

Collaborator Author

I think not, because other actors may go out of scope very soon, enabling this actor to be created.

Similarly, a task may not be able to run right away, but soon some resources will free up and then it can run.
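
For example (a hedged sketch of the intended behavior; GpuActor and the single-GPU setup are illustrative):

import ray

ray.init(num_gpus=1)

@ray.remote(num_gpus=1)
class GpuActor(object):
    def ping(self):
        return "pong"

a = GpuActor.remote()  # Acquires the only GPU for its lifetime.
b = GpuActor.remote()  # No error: creation stays pending until a GPU frees up.
del a                  # When 'a' goes out of scope, its GPU is released.
print(ray.get(b.ping.remote()))  # Now 'b' can be created and runs.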


@ray.remote
class Foo(object):
def __init__(self):
Contributor

Sort of off-topic, but is it okay to define an actor without an __init__ method? I had thought that we threw an error for that.

Collaborator Author

We print an error. It's valid Python and generally OK, I think.
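
For instance (a minimal sketch; the class is illustrative):

import ray

@ray.remote
class NoInit(object):  # No __init__; Ray prints a warning but allows this.
    def hello(self):
        return "hi"

actor = NoInit.remote()
print(ray.get(actor.hello.remote()))  # "hi"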

@AmplabJenkins posted CI results (access rights to CI server needed):

  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4341/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4342/

const TaskSpec *spec) {
if (scheduler->info.static_resources.at("CPU") == 0) {
if (scheduler->info.static_resources.count("CPU") == 1 &&
scheduler->info.static_resources.at("CPU") == 0) {
Contributor

Should we add an extra check here that it's only for actor creation tasks?

Contributor

What would be the reason for a special check? I encourage not special-casing anything. We need to have a common abstraction that works for both tasks and actors. Both can be thought of as resource consumers, and the scheduler's job is to meet resource demand with resource supply. Once the correct common abstraction is identified, there should be no need to special-case.

@robertnishihara (Collaborator Author) commented Mar 15, 2018

Probably no need for a check here. However, the reason would be that currently actor creation tasks are the only tasks where the task resource requirements at runtime differ from the task resource requirements during scheduling (because during scheduling we need to be aware of subsequent method resource requirements).

ActorID actor_id,
bool is_worker,
int64_t num_gpus);
bool is_worker);
@richardliaw (Contributor) commented Mar 15, 2018

dumb q: when are local scheduler clients not workers?

Collaborator Author

when they are drivers :)

"""
message = DriverTableMessage.GetRootAsDriverTableMessage(data, 0)
driver_id = message.DriverId()
log.info(
Contributor

does it make sense to log after the cleanup?

Collaborator Author

Yeah, I think it's helpful. This is only in the monitor logs anyway, which should generally be redirected to a file.

Contributor

oh I just meant to maybe log after L438?

actor_method_name).id()
worker.function_properties[driver_id][function_id] = (
# The extra return value is an actor dummy object.
# In the cases wher actor_method_cpus is None, that value should
Contributor

typo: where

Collaborator Author

thanks

function_id = compute_actor_creation_function_id(class_id)
worker.function_properties[driver_id][function_id.id()] = (
# The extra return value is an actor dummy object.
FunctionProperties(num_return_vals=0 + 1,
Contributor

why not just 1 and be (a little more) explicit in the comment?

Collaborator Author

I thought this was clearer.

each of the actor's methods.
actor_creation_resources: The resources required by the actor creation
task.
actor_method_cpus: The number of CPUs required by each actor method.
Contributor

Is this an int or a list? From the code, it seems like it's just an int, but this doc makes it sound like there is a number per method.

def remote_decorator(func_or_class):
if inspect.isfunction(func_or_class) or is_cython(func_or_class):
# Set the remote function default resources.
resources["CPU"] = (DEFAULT_REMOTE_FUNCTION_CPUS
Contributor

so if someone does num_gpus=2, custom_resources={"GPU": 1}, will the custom GPUs be overwritten?

Collaborator Author

If you include "GPU" in resources= then it will raise an exception. That can only be specified through num_gpus=.
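
To illustrate (a hedged sketch; the task and the custom resource name are made up):

import ray

# Fine: the GPU requirement comes from num_gpus=, and "CustomResource" is
# a custom resource.
@ray.remote(num_gpus=2, resources={"CustomResource": 1})
def uses_gpus():
    return "ok"

# Expected to raise an exception: "GPU" may only be specified via num_gpus=.
# @ray.remote(resources={"GPU": 1})
# def invalid():
#     pass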

DEFAULT_ACTOR_CREATION_CPUS_SIMPLE_CASE = 0
# Default resource requirements for actors when some resource requirements are
# specified.
DEFAULT_ACTOR_METHOD_CPUS_SPECIFIED_CASE = 0
Contributor

does this just mean that when any actor method is invoked, it will not ask for more CPUs?

Collaborator Author

That's correct (in the case where some resource requirements are specified in the actor decorator).

if num_cpus is None and num_gpus is None and resources == {}:
# In the default case, actors acquire no resources for
# their lifetime, and actor methods will require 1 CPU.
resources["CPU"] = DEFAULT_ACTOR_CREATION_CPUS_SIMPLE_CASE
Contributor

also another dumb q, what happens when resources["GPU"] is not specified, as is the case here?

Collaborator Author

Then it is not included in the resource dict (which is the same as having a value of 0).
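
In other words (a small illustration):

# Omitting a key from the resource dict is equivalent to requiring 0 of it;
# the scheduler treats these two requirement dicts the same way.
requirements_a = {"CPU": 1}
requirements_b = {"CPU": 1, "GPU": 0}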

@AmplabJenkins posted CI results (access rights to CI server needed):

  • Test FAILed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4355/

@robertnishihara (Collaborator Author)

retest this please

@AmplabJenkins posted CI results (access rights to CI server needed):

  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4356/
  • Test PASSed: https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4359/

@pcmoritz (Contributor) left a comment

LGTM! Nice job :)

@stephanie-wang (Contributor) left a comment

Woo!

Development: Successfully merging this pull request may close: [tune] ImportError: cannot import name '_default_registry'

6 participants