Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RLlib] New ConnectorV2 API #04: Changes to Learner/LearnerGroup API to allow updating from Episodes. #41235

Conversation

sven1977
Copy link
Contributor

@sven1977 sven1977 commented Nov 17, 2023

This PR introduces some major changes to the Learner and LearnerGroup APIs' update() methods.
This is in preparation for the rollout of the new ConnectorV2 API that will be run from within EnvRunners (env-to-module and module-to-env connector pipelines) as well as Learners (learner connector pipelines).

Changes:

  • Disseminate the existing update() methods between update_from_batch vs update_from_episodes().
  • Using update_from_episodes() will allow the final train batch to be computed/compiled by the Learner (vs EnvRunner or Algorithm.training_step()), further offloading things to where they belong and reducing the mental load on the users (e.g. they should not be concerned with gathering data in the EnvRunners that is only later needed for training).
  • Soft-deprecate (warning) the Learner/LearnerGroup.update() methods.
  • An additional self._postprocess_episodes() method is called by Learner during the update_from_...() and can be overridden by specific algorithms. By default, this is simply a noop. This will allow us to pave the way for the Learner ConnectorV2 pipeline to perform possible preprocessing of the train data (batch or episodes).

Why are these changes needed?

Deeper purpose:
This will enable PPO - once ConnectorV2 are used by EnvRunner and Learner - to fully separate value function predictions from the sampling phase. Instead, values predictions AND bootstrap value predictions will be performed solely on the Learner side (in distributed fashion, on GPU where available). PPO RLModule's forward_exploration/inference methods will then no longer be required to "think about" what the Learner might need and solely focus on action computations alone. Other algos will equally benefit from this separation of concerns.

As algos will be able to determine what special data they need from the sampled episodes (e.g. PG-style algos always need vf predictions), users will also be able to write custom (env-runner AND learner) connectors that transform the raw episode data into a data dict that's acceptable by their custom RLModule. For example, a user might provide an RLModule (independent of the algo used) that always requires the previous rewards. The user will then write a env-to-module as well as a learner connector that extracts this data from the episodes and builds it into the resulting action-computing- or train-batch.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: sven1977 <svenmika1977@gmail.com>
*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to discuss the alternative to provide two different (mutually exclusive?) methods that the user/algo can decide to call: update_from_batch (for algos that do NOT require episode processing, such as DQN) or update_from_episodes (for algos that require a view on the sampled episodes for e.g. vf-bootstrapping, vtrace, etc..).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it if the two methods are separated. I don't think there would be a case where a specific algorithm's learner would have both methods implemented. ie DQN will only implemented update_from_batch, and PPO would only implement update_from_episodes. This is much much cleaner than mixing both into one function. The user will have to deal with less cognitive load if they are separated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I separated them in the LearnerGroup and Learner APIs:

  • update_from_batch(async=False|True)
  • update_from_episodes(async=False|True)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think it's nicer to have the async_update bool option as an extra arg (instead of separate method) for better consistency and less code bloat.

raise ValueError(
"Batch contains module ids that are not in the learner: "
f"{missing_module_ids}"
if batch is not None:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the alternative design (two update methods), we could avoid these rather ugly if-blocks then.

@@ -1309,12 +1319,12 @@ def update(
metrics_per_module=defaultdict(dict, **metrics_per_module),
)
self._check_result(result)
# TODO (sven): Figure out whether `compile_metrics` should be forced
# TODO (sven): Figure out whether `compile_results` should be forced
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo

# to return all numpy/python data, then we can skip this conversion
# step here.
results.append(convert_to_numpy(result))

batch = self._set_slicing_by_batch_id(batch, value=False)
self._set_slicing_by_batch_id(batch, value=False)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

batch never used.

@@ -1330,6 +1340,34 @@ def update(
# dict.
return reduce_fn(results)

@OverrideToImplementCustomLogic
def _preprocess_train_data(self, *, batch, episodes) -> Tuple[Any, Any]:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this should be private or public?

*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same discussion as above.

]
)
)
# TODO (sven): Implement the case in which both batch and episodes might
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have mutually exclusive methods: update_from_batch and update_from_episodes, this case (both batch AND episodes provided by user) would not exist anyways.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
for module_id in rl_module_ckpt_dirs.keys()
):
raise ValueError(
f"module_id {module_id} was specified in both "
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: This error message will NOT have the correct missing module_id.

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 requested a review from a team as a code owner December 21, 2023 14:58
@@ -1201,6 +1223,39 @@ def update(
# dict.
return reduce_fn(results)

@OverrideToImplementCustomLogic
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is any neural network inference, does it happen here or in the connector?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! The answer is: sometimes both.

For example: If you have some preprocessing needs for your training data (no matter whether episodes or batches), then you might want to do some preprocessing on this data (e.g. clip rewards, extend episodes by one artificial timestep for v-trace or GAE) and then perform a pre-forward pass through your network (e.g. to get the value estimates). For that pre-forward pass, you'll need to call your connector first to make sure this batch has all custom-required data formats (e.g. LSTM zero-padding). Only after all these preprocessing steps, you will be able to continue with the regular forward_train + loss + ... procedure.

*,
minibatch_size: Optional[int] = None,
num_iters: int = 1,
batch: Optional[MultiAgentBatch] = None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it if the two methods are separated. I don't think there would be a case where a specific algorithm's learner would have both methods implemented. ie DQN will only implemented update_from_batch, and PPO would only implement update_from_episodes. This is much much cleaner than mixing both into one function. The user will have to deal with less cognitive load if they are separated.



@DeveloperAPI
class ShardEpisodesIterator:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a unittest for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
Signed-off-by: sven1977 <svenmika1977@gmail.com>
@sven1977 sven1977 added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Jan 10, 2024
@sven1977 sven1977 merged commit 806701e into ray-project:master Jan 10, 2024
9 checks passed
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants