[RLlib] Preparatory PR for multi-agent, multi-GPU learning agent (alpha-star style) #02. #21649
Conversation
overall looks great. a few comments, thanks.
```python
buffer = self.replay_buffers[policy_id]
# If buffer empty or no new samples to mix with replayed ones,
# return None.
if len(buffer) == 0 or len(buffer.last_added_batches) == 0:
```
may be safer to check:

```python
if len(buffer) == 0 and self.replay_ratio > 0.0:
    return None
if len(buffer.last_added_batches) == 0 and self.replay_ratio < 1.0:
    return None
```

In other words: if we need to have replay samples but the buffer is empty, or we actually need newly added samples but there aren't any, return None.
Not sure, actually. If `len(buffer) == 0`, then `self.last_added_batches` should also be empty. Everything that's in `self.last_added_batches` is guaranteed to be part of the buffer already.
What we should check, though, is whether `replay_ratio == 1.0` (in this case, we do NOT care about `buffer.last_added_batches`, because 100% of the returned samples are replayed (older) ones).
I'll fix this.
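A rough sketch of what that adjusted early-exit check could look like (illustrative only, using `buffer`, `last_added_batches`, and `replay_ratio` from the snippet above; the actual fix lands in a later commit):

```python
# Illustrative sketch, not the exact code from the PR:
buffer = self.replay_buffers[policy_id]

# Nothing in the buffer at all -> nothing to sample from.
if len(buffer) == 0:
    return None

# Unless 100% of the output is replayed (old) samples, we also need
# freshly added batches to mix in.
if self.replay_ratio < 1.0 and len(buffer.last_added_batches) == 0:
    return None
```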
done
yeah, that's pretty much what I meant. thanks.
```python
# replay ratio = old / [old + new]
num_new = len(output_batches)
num_old = 0
while random.random() > num_old / (num_old + num_new):
```
we are not looking at self.replay_ratio here?
can you explain a bit how this while loop is statistically connected to self.replay_ratio?
sorry, I am still confused. `old / [old + new]` is the current replay_ratio, right?
The logic is that we stop adding replayed samples when `random.random()` is less than the current replay ratio, which by gut feeling should result in an expected ratio of 0.5?
I feel like, at the very least, `self.replay_ratio` should get involved in the math here, something like:

```python
expected_replay_batches = self.replay_ratio * num_new
while random.random() < expected_replay_batches:
    output_batches.append(buffer.replay())
    expected_replay_batches -= 1.0
```
This has been fixed. There is also a test case now that verifies that different given ratios are being respected.
The randomness is necessary here to allow for "odd" ratios. E.g., if we assume that there is always just one new batch and the replay ratio is, say, 0.33, then only every 2nd returned batch should have 1 additional old batch in it.
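A minimal sketch of the kind of mixing logic this describes (a hedged reconstruction based on the comments in the snippet above, not the verbatim PR code):

```python
import random

def mix(new_batches, buffer, replay_ratio):
    """Mix newly added batches with replayed (old) ones.

    replay_ratio      = old / (old + new)
    replay_proportion = old / new = replay_ratio / (1 - replay_ratio)
    """
    assert 0.0 <= replay_ratio < 1.0
    output_batches = list(new_batches)
    replay_proportion = replay_ratio / (1.0 - replay_ratio)
    # Stochastic rounding: e.g. ratio=0.33 -> proportion=0.5, meaning on
    # average every 2nd call adds one extra old (replayed) batch.
    f = len(new_batches) * replay_proportion
    while random.random() < f:
        f -= 1.0
        output_batches.append(buffer.replay())
    return output_batches
```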
rllib/execution/parallel_requests.py (Outdated)
```
    `remote_fn()`, which will be applied to the actor(s) instead.

    Args:
        trainer: The Trainer object that we run the sampling for.
```
pass in remote_requests_in_flight, instead of the entire trainer?
Fair enough.
done
sorry, just double checking, it seems like trainer is still the parameter?
Nope, this has been fixed. Could you take another look?
```python
@ExperimentalAPI
def asynchronous_parallel_requests(
```
this function looks really familiar. are we not replacing some existing logic with this util func call somewhere?
Yeah, sorry, moved it into a new module for better clarity: This function may not only be used to collect SampleBatches from a RolloutWorker, but works generically on any set (and types!) of ray remote actors.
I removed the old code (not used anywhere yet anyway) in `replay_ops.py`.
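For readers unfamiliar with the pattern, here is a minimal, self-contained illustration of what such a utility does under the hood, using plain Ray. This is a sketch of the general idea only; the actor class, method names, and limits below are made up and do not reflect the actual `asynchronous_parallel_requests` signature:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class SampleActor:
    def sample(self):
        return "batch"

actors = [SampleActor.remote() for _ in range(4)]
max_in_flight_per_actor = 2
in_flight = {actor: [] for actor in actors}

# Keep up to `max_in_flight_per_actor` requests pending per actor.
for actor, refs in in_flight.items():
    while len(refs) < max_in_flight_per_actor:
        refs.append(actor.sample.remote())

# Collect whatever finished within the timeout; the rest stays in flight.
all_refs = [ref for refs in in_flight.values() for ref in refs]
ready, _ = ray.wait(all_refs, num_returns=len(all_refs), timeout=0.1)
results = ray.get(ready)

# Remove completed requests from the in-flight bookkeeping.
for refs in in_flight.values():
    refs[:] = [ref for ref in refs if ref not in ready]
```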
rllib/policy/policy.py (Outdated)
""" | ||
# Sample a batch from the given replay actor. | ||
# For better performance, make sure the replay actor is co-located | ||
# with this policy (on the same node). |
maybe this comment needs to be updated?
where are we making sure the replay_actor is co-located with the policy?
The driver (execution plan/Trainer.setup) needs to make sure of that. But it's not a hard requirement, just better for performance, as we then don't have to send the batch across the wire.
I see. Can you update the comment a little bit and say something like: "For better performance, the trainer will try to schedule replay actors co-located with this policy"? Thanks.
I'm not sure this comment should be here (we don't know for sure what the Trainer will do). The policy has no influence on its Trainer. We should simply clarify that it would be better if they WERE on the same node, but that it's not a hard requirement, and that the Trainer (which creates buffer and policy) needs to take care of this, not the policy itself.
I'll fix.
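As a side note, one way a driver could request such co-location in Ray is via the per-node `node:<ip>` custom resource. The sketch below is purely illustrative; the `ReplayActor` class is a placeholder, not RLlib's actual API:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class ReplayActor:  # Placeholder for an actual replay-buffer actor.
    def replay(self):
        return None

# Request a tiny amount of this node's "node:<ip>" resource to pin the
# actor to the same node as the current (policy-holding) process.
node_ip = ray.util.get_node_ip_address()
replay_actor = ReplayActor.options(
    resources={f"node:{node_ip}": 0.01}
).remote()
```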
couple of minor comments, one math question. thanks.
sorry, still have 2 comments.
```python
# Mix buffer's last added batches with older replayed batches.
with self.replay_timer:
    output_batches = self.last_added_batches[policy_id].copy()
    self.last_added_batches[policy_id].clear()
```
since you clear right after copy, why not:

```python
output_batches = self.last_added_batches[policy_id]
self.last_added_batches[policy_id] = []
```

?
this is to save a .copy() op
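For clarity, the two variants side by side (an editorial sketch based on the snippets above):

```python
# Current: copies the list, then clears the original (extra .copy() pass).
output_batches = self.last_added_batches[policy_id].copy()
self.last_added_batches[policy_id].clear()

# Suggested: hand over the existing list and start a fresh one,
# avoiding the .copy().
output_batches = self.last_added_batches[policy_id]
self.last_added_batches[policy_id] = []
```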
do you think this is a good idea? this is the only comment I have left.
Great catch! Indeed, it saves the copy. Done. :)
In `rllib/execution/buffers/mixin_replay_buffer.py`:

```diff
+ with self.replay_timer:
+     output_batches = self.last_added_batches[policy_id].copy()
+     self.last_added_batches[policy_id].clear()
+
+     # No replay desired -> Return here.
+     if self.replay_ratio == 0.0:
+         return SampleBatch.concat_samples(output_batches)
+     # Only replay desired -> Return a (replayed) sample from the
+     # buffer.
+     elif self.replay_ratio == 1.0:
+         return buffer.replay()
+
+     # Replay ratio = old / [old + new]
+     # Replay proportion: old / new
+     num_new = len(output_batches)
+     replay_proportion = self.replay_proportion
```
I think it's correct. That's how we also did it in the existing mixin implementation (in execution/replay_ops.py).
Also, the new test cases show that this logic is ok. We test for different replay ratios and measure the average batch composition (old vs. new samples).

oh, I get it, yeah, we are doing the same replay_ratio to replay_proportion computation above now. ok, never mind, I missed that part, my bad.
```diff
@@ -38,7 +38,7 @@ def __init__(self,
     # By default, use Glorot unform initializer.
     if initializer is None:
-        initializer = flax.nn.initializers.xavier_uniform()
+        initializer = nn.initializers.xavier_uniform()
```
@sven1977 I'm seeing `AttributeError: module 'flax' has no attribute 'nn'` on the 1.11.0 release branch. Is this line cherry-pickable, or does it need to be cherry-picked into 1.11.0?
Yes, this line is cherry-pickable. Let's also fix the comment.
Fixed code:

```python
# By default, use Glorot uniform initializer.
if initializer is None:
    initializer = nn.initializers.xavier_uniform()
```
@mwtian ^^
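A quick sanity check for the replacement call (an editorial sketch; it assumes a recent flax where the old `flax.nn` module is gone and `flax.linen` re-exports the JAX initializers):

```python
import jax
import jax.numpy as jnp
from flax import linen as nn

# Glorot/Xavier uniform initializer, as used in the fixed line above.
initializer = nn.initializers.xavier_uniform()

key = jax.random.PRNGKey(0)
weights = initializer(key, (4, 8), jnp.float32)  # (fan_in, fan_out)
print(weights.shape)  # (4, 8)
```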
Preparatory PR for multi-agent, multi-GPU learning agent (alpha-star style) #2.

- Adds the `asynchronous_parallel_requests` utility to rllib/execution/parallel_requests.py. Allows sending (up to n in-flight) parallel remote requests to a set of actors and collecting and returning those results that are available (given some timeout).
- Changes the `Policy` class to be usable as a remote ray actor, adding e.g. `get_host()` and also a (preliminary) `learn_on_batch_from_replay_buffer` method.

Why are these changes needed?

Related issue number

Checks

- I've run `scripts/format.sh` to lint the changes in this PR.
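To make the second bullet concrete, here is a minimal sketch of what "usable as a remote ray actor" means in practice (the class and method bodies below are illustrative placeholders, not RLlib's actual `Policy` code):

```python
import platform
import ray

class ToyPolicy:
    """Stand-in for a policy class that can also run as a Ray actor."""

    def get_host(self) -> str:
        # Report which node this (possibly remote) policy lives on.
        return platform.node()

ray.init(ignore_reinit_error=True)

# Wrap the plain class as a Ray actor and call its methods remotely.
RemoteToyPolicy = ray.remote(num_cpus=1)(ToyPolicy)
remote_policy = RemoteToyPolicy.remote()
print(ray.get(remote_policy.get_host.remote()))
```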