[Bug] Only up to 100 actors can be used #858

Closed
2 tasks done
dtsuzuku-ibm opened this issue Dec 5, 2024 · 3 comments
Labels
bug (Something isn't working), fixed (Marks an issue as fixed in the dev branch)

Comments

@dtsuzuku-ibm
Collaborator

Search before asking

  • I searched the issues and found no similar issues.

Component

Library/core

What happened + What you expected to happen

When num_workers is set to a value greater than 100, the Ray job fails with the following exception (found by @shivdeep-singh-ibm):

data_processing.utils.unrecoverable.UnrecoverableException: out of 200 created actors only 100 alive

It seems that listing actors is limited to 100 results by default:
https://github.com/ray-project/ray/blob/1b13782bc7702fbd7af2c89aff293acc4ff49727/python/ray/util/state/api.py#L784
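
As a rough illustration of the truncation described above, the sketch below counts alive actors through an injected listing function. `count_alive_actors` and `fake_list_actors` are hypothetical names introduced here, and the fake lister only imitates the default limit of 100; this is not the code from the library or from the fix, just one way the limit could be raised when more actors are expected.

```python
from typing import Callable, List, Optional

# Assumption: mirrors the default result limit of Ray's state API listing calls.
RAY_DEFAULT_LIST_LIMIT = 100


def count_alive_actors(list_fn: Callable[..., List], expected: Optional[int] = None) -> int:
    """Count ALIVE actors, raising the listing limit when more are expected."""
    kwargs = {"filters": [("state", "=", "ALIVE")]}
    if expected is not None and expected > RAY_DEFAULT_LIST_LIMIT:
        # Ask for enough rows to cover every actor that was created.
        kwargs["limit"] = expected
    return len(list_fn(**kwargs))


def fake_list_actors(filters=None, limit=RAY_DEFAULT_LIST_LIMIT):
    """Hypothetical stand-in for a cluster with 200 alive actors, truncated to `limit`."""
    return [{"actor_id": f"{i:04x}", "state": "ALIVE"} for i in range(200)][:limit]


truncated = count_alive_actors(fake_list_actors)               # default limit applies
complete = count_alive_actors(fake_list_actors, expected=200)  # limit raised to 200
print(truncated, complete)
```

Against a real cluster, `ray.util.state.list_actors` could be passed as `list_fn`, since it accepts `filters` and `limit` keyword arguments. Whether the fix referenced in this issue takes the same approach is not stated here.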

Reproduction script

import ray
from data_processing_ray.runtime.ray import RayUtils


class Dummy:
    def __init__(self, message):
        print("Created Actor: ", message)


# Spread the actors across the available nodes
@ray.remote(scheduling_strategy="SPREAD")
class DummyActor(Dummy):
    def __init__(self, params: dict):
        super().__init__(params["message"])


params = {"message": "hey ya!!"}

# Request more than 100 actors to trigger the exception
processors = RayUtils.create_actors(
    clazz=DummyActor,
    params=params,
    actor_options={"num_cpus": 1},
    n_actors=200,
)

Anything else

No response

OS

Red Hat Enterprise Linux (RHEL)

Python

3.11.x

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
@dtsuzuku-ibm
Collaborator Author

Fixing in: #839

@shivdeep-singh-ibm
Collaborator

I have also observed a crash like this:

    processors = RayUtils.create_actors(
  File "/home/ray/data-processing-lib-ray/src/data_processing_ray/runtime/ray/ray_utils.py", line 121, in create_actors
    raise UnrecoverableException(f"out of {len(actors)} created actors only {len(alive)} alive")


data_processing.utils.unrecoverable.UnrecoverableException: out of 2 created actors only 7 alive

Here we requested 2 actors, but 7 were reported alive, so the job crashed.

The check

if len(actors) == len(alive):
    return actors

expects them to be equal. Is this a problem?
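
To make the question concrete, here is a minimal sketch contrasting the count equality check with a membership check on actor IDs. It assumes, without confirmation from this thread, that the alive listing may include actors that this call did not create; `missing_actors` and the ID strings are hypothetical.

```python
from typing import Iterable, Set


def missing_actors(created_ids: Iterable[str], alive_ids: Iterable[str]) -> Set[str]:
    """Return the IDs of created actors that are absent from the alive listing."""
    return set(created_ids) - set(alive_ids)


# Two actors created by this call; the listing reports seven alive in total.
created = ["a1", "a2"]
alive = ["a1", "a2", "b1", "b2", "b3", "b4", "b5"]

counts_match = len(created) == len(alive)               # the equality check quoted above
all_created_alive = not missing_actors(created, alive)  # membership check on IDs
print(counts_match, all_created_alive)
```

Under that assumption the count comparison fails even though both created actors are alive, while the membership check passes.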

@daw3rd added the fixed label on Dec 12, 2024
@daw3rd
Member

daw3rd commented Dec 12, 2024

@shivdeep-singh-ibm @dtsuzuku-ibm do we think this is fixed?
