-
Notifications
You must be signed in to change notification settings - Fork 148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update to actor list limit to fix issue 803 #814
base: dev
Are you sure you want to change the base?
Conversation
Can you please increase the limit? Make it 100000 instead of 1000. |
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Outdated
Show resolved
Hide resolved
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Outdated
Show resolved
Hide resolved
@@ -113,7 +113,7 @@ def operator() -> ActorHandle: | |||
actors = [operator() for _ in range(n_actors)] | |||
for i in range(120): | |||
time.sleep(1) | |||
alive = list_actors(filters=[("class_name", "=", cls_name), ("state", "=", "ALIVE")]) | |||
alive = list_actors(filters=[("class_name", "=", cls_name), ("state", "=", "ALIVE")], limit=n_actors + 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is 10 the right number. what if a transform's ray runtime starts its own actors? Maybe this should be n_actors*2 or the above with the ability to override with a config value (env var?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. The idea to use the number of actors to compute the limit is great, but l think it would be better to use
num_actors * 10
instead of
num_actors + 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the downside of setting the limit to 1E6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gents, If we create 5 actors whats the reason asking for 100000? We are asking only for what we need and padding it with 10
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@blublinsky 1) what is the downside of using 1E6 and 2) what if the transform creates its own actors? Please advise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but I don't understand why we need to add 10 to the number of actors in the limit. Why not just set the limit to be equal to the number of actors, since this is what is tested in line 117.
For example, if there are 5 actors, why would we ever need a limit greater than 5? Can it happen that list_actors()
will return 6 actors, if we only intended to create 5? And, in that case, do we need the run to fail because there are more actors available than what we requested to create?
And again, I am not clear why are we testing for exact equality in line 117. Would we be ok to proceed even if only 80% of the actors are alive? Or is there a reason why we need all the actors to be up when we start? I am thinking of a scenario when we request 10 worker groups, but only 9 are allocated, and one is pending due to lack of resources. Would it be ok to proceed with the execution in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just in case, I am a bit concerned to use the same number
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that I found today, max limit supported is 10000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the number of created actors is greater than the limit, the function create_actors()
should still return the actor handles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the code is still changing. Please move the PR to draft mode and then promote to ready for review once it is ready. Thanks
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Outdated
Show resolved
Hide resolved
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Outdated
Show resolved
Hide resolved
if len(alive) >= n_actors/2 + c_len: | ||
# At least half of the actors were created | ||
logger.info(f"created {n_actors}, alive {len(alive)} Running with less actors") | ||
return alive - current |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line has 2 problems:
- There is no such operator
-
between 2 python lists. - This function is supposed to return
list[ActorHandle]
, butalive
andcurrent
are of typelist[ActorStatus]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct. rewrote this (and tested)
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Show resolved
Hide resolved
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Show resolved
Hide resolved
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Show resolved
Hide resolved
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Show resolved
Hide resolved
data-processing-lib/ray/src/data_processing_ray/runtime/ray/ray_utils.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Constantin M Adam <cmadam@us.ibm.com>
@touma-I - the code is ready for review. |
Why are these changes needed?
Default limit in the actor list API is 100, so if the number of actors is more than that, then the code was thinking that not all actors were created
Related issue number (if any).
#803