RecurrentPPO #57

Open
corentinlger wants to merge 10 commits into base: master

Conversation

@corentinlger corentinlger commented Oct 20, 2024

Implement a first running version of RecurrentPPO in sbx (but the algorithm doesn't learn yet). It still needs to be improved to make it functional.

Description

Implement a first running version of RecurrentPPO with an LSTM layer. The algorithm doesn't support Dict observations yet, and doesn't work with arbitrary combinations of n_steps, n_envs and batch_size (n_steps has to be a multiple of batch_size).

Introduces:

  • sbx/recurrentppo directory with:
    • policies.py that adds an LSTM layer to the Actor and the Critic (a minimal flax sketch of this pattern is shown after this list)
    • recurrentppo.py that handles the RecurrentPPO model
  • recurrent.py in sbx/common to create helper functions for the recurrent rollout buffer
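
For context, here is a minimal sketch of the LSTM-in-actor pattern in flax. All class names, shapes and sizes below are illustrative assumptions, not the PR's actual code; the real Actor/Critic live in sbx/recurrentppo/policies.py.

import jax
import jax.numpy as jnp
import flax.linen as nn

class RecurrentActorSketch(nn.Module):
    # Illustrative only: a tiny actor that threads an LSTM carry through the forward pass.
    n_actions: int
    hidden_size: int = 64

    @nn.compact
    def __call__(self, obs, lstm_carry):
        x = nn.tanh(nn.Dense(self.hidden_size)(obs))
        # The (cell, hidden) carry is passed in and returned explicitly so the
        # rollout and update code can store/reset it per environment.
        lstm_carry, x = nn.LSTMCell(features=self.hidden_size)(lstm_carry, x)
        return lstm_carry, nn.Dense(self.n_actions)(x)

# Example initialization with a zero carry for a single environment and a 3-dim observation.
actor = RecurrentActorSketch(n_actions=2)
carry = nn.LSTMCell(features=64).initialize_carry(jax.random.PRNGKey(0), (1, 64))
params = actor.init(jax.random.PRNGKey(1), jnp.zeros((1, 3)), carry)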

I will keep working on the feature, but here is a list of the TODOs I thought of below. I tried to comment the code to make the changes clear, but let me know if I can improve that!

TODOs:

  • Check that the current code architecture is correct
  • Check that the LSTM implementation is correct
  • Implement the predict method in policies.py with the lstm_states (see the small carry-reset sketch after this list)
  • Fix the "Handle timeout" code in recurrentppo.py L313
  • Fix the get method of the rollout buffer so batch data comes back in the right sequence order
  • Update docs and tests
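
For the predict TODO above, a minimal sketch (the helper name and shapes are mine, not the PR's API) of how per-env lstm states could be reset at episode starts and kept otherwise:

import jax.numpy as jnp

def reset_carry_where_episode_starts(carry, initial_carry, episode_starts):
    # Illustrative helper, not code from this PR: for envs whose episode just
    # started, replace their slice of the (cell, hidden) carry with the reset
    # carry; keep the running carry for the others.
    mask = episode_starts.reshape(-1, 1)  # shape (n_envs, 1), True where a new episode begins
    return tuple(jnp.where(mask, init, prev) for init, prev in zip(initial_carry, carry))

# Example with 2 envs and a hidden size of 4, where env 0 just reset:
carry = (jnp.ones((2, 4)), jnp.ones((2, 4)))
initial = (jnp.zeros((2, 4)), jnp.zeros((2, 4)))
new_carry = reset_carry_where_episode_starts(carry, initial, jnp.array([True, False]))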

Do you see any other things to do, @araffin?

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

def initialize_carry(batch_size, hidden_size):
    # Returns a tuple of lstm states (hidden and cell states)
    return nn.LSTMCell(features=hidden_size).initialize_carry(
        rng=jax.random.PRNGKey(0), input_shape=(batch_size, hidden_size)
    )

Owner

always the same rng, is that intended?

Owner

if so, they can be precomputed, no?

Author

@corentinlger corentinlger Nov 1, 2024

always the same rng, is that intended?

I think it is, so the reset states are always the same (I borrowed this from purejaxrl)

if so, they can be precomputed, no?

In fact the function takes 3 different shapes during training: at the setup of RecurrentPPO, during rollout collection, and during the network updates. But these values can indeed be precomputed.

I'll ask a friend who knows LSTM PPO in JAX well to be sure.
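
For illustration, the precomputation could look like this (the sizes below are placeholders, not the PR's actual config; with flax's default zero carry initializer the rng has no effect on the result):

import jax
import flax.linen as nn

def initialize_carry(batch_size, hidden_size):
    # Fixed RNG so the reset carry is deterministic (borrowed from purejaxrl).
    return nn.LSTMCell(features=hidden_size).initialize_carry(
        rng=jax.random.PRNGKey(0), input_shape=(batch_size, hidden_size)
    )

# Hypothetical precomputation of the three carries used during training
# (setup, rollout collection, network updates); 1, N_ENVS and BATCH_SIZE are example values.
HIDDEN_SIZE, N_ENVS, BATCH_SIZE = 64, 8, 256
PRECOMPUTED_CARRIES = {n: initialize_carry(n, HIDDEN_SIZE) for n in (1, N_ENVS, BATCH_SIZE)}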

if normalize_advantage and len(advantages) > 1:
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# TODO : something weird here because the params argument isn't used and only actor_state.params instead
Owner

does this result in an error if params is used?

Author

No, the code still works. This comes from the sbx PPO, do you want me to do a quick PR to fix it?
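
For context, a generic JAX illustration of why the TODO above is worth checking (this is not the sbx code, just the general pattern): when a loss ignores the argument it is differentiated with respect to and uses a captured copy instead, jax.grad silently returns zero gradients.

import jax
import jax.numpy as jnp

params = {"w": jnp.array(2.0)}

def loss_fn(unused_params, x):
    # Bug pattern the TODO warns about: the differentiated argument is ignored
    # and the closed-over `params` dict is used instead.
    return (params["w"] * x) ** 2

grads = jax.grad(loss_fn)(params, jnp.array(3.0))
# grads == {"w": Array(0., dtype=float32)}: zero gradient w.r.t. the unused argument.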
