Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update to actor list limit to fix issue 803 #814

Open
wants to merge 12 commits into
base: dev
Choose a base branch
from

Conversation

blublinsky
Copy link
Collaborator

Why are these changes needed?

Default limit in the actor list API is 100, so if the number of actors is more than that, then the code was thinking that not all actors were created

Related issue number (if any).

#803

@cmadam
Copy link
Collaborator

cmadam commented Nov 19, 2024

Can you please increase the limit? Make it 100000 instead of 1000.

@@ -113,7 +113,7 @@ def operator() -> ActorHandle:
actors = [operator() for _ in range(n_actors)]
for i in range(120):
time.sleep(1)
alive = list_actors(filters=[("class_name", "=", cls_name), ("state", "=", "ALIVE")])
alive = list_actors(filters=[("class_name", "=", cls_name), ("state", "=", "ALIVE")], limit=n_actors + 10)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is 10 the right number. what if a transform's ray runtime starts its own actors? Maybe this should be n_actors*2 or the above with the ability to override with a config value (env var?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. The idea to use the number of actors to compute the limit is great, but l think it would be better to use

num_actors * 10

instead of

num_actors + 10

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the downside of setting the limit to 1E6

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gents, If we create 5 actors whats the reason asking for 100000? We are asking only for what we need and padding it with 10

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@blublinsky 1) what is the downside of using 1E6 and 2) what if the transform creates its own actors? Please advise.

Copy link
Collaborator

@cmadam cmadam Nov 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but I don't understand why we need to add 10 to the number of actors in the limit. Why not just set the limit to be equal to the number of actors, since this is what is tested in line 117.

For example, if there are 5 actors, why would we ever need a limit greater than 5? Can it happen that list_actors() will return 6 actors, if we only intended to create 5? And, in that case, do we need the run to fail because there are more actors available than what we requested to create?

And again, I am not clear why are we testing for exact equality in line 117. Would we be ok to proceed even if only 80% of the actors are alive? Or is there a reason why we need all the actors to be up when we start? I am thinking of a scenario when we request 10 worker groups, but only 9 are allocated, and one is pending due to lack of resources. Would it be ok to proceed with the execution in this case?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just in case, I am a bit concerned to use the same number

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that I found today, max limit supported is 10000

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmadam, @daw3rd any progress on this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the number of created actors is greater than the limit, the function create_actors() should still return the actor handles

Copy link
Collaborator

@touma-I touma-I left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the code is still changing. Please move the PR to draft mode and then promote to ready for review once it is ready. Thanks

if len(alive) >= n_actors/2 + c_len:
# At least half of the actors were created
logger.info(f"created {n_actors}, alive {len(alive)} Running with less actors")
return alive - current
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line has 2 problems:

  1. There is no such operator - between 2 python lists.
  2. This function is supposed to return list[ActorHandle], but alive and current are of type list[ActorStatus]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are correct. rewrote this (and tested)

Signed-off-by: Constantin M Adam <cmadam@us.ibm.com>
@cmadam
Copy link
Collaborator

cmadam commented Nov 22, 2024

@touma-I - the code is ready for review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants