RecurrentPPO #57

Open
corentinlger wants to merge 10 commits into base: master

Conversation

@corentinlger corentinlger commented Oct 20, 2024

Implement a first running version of RecurrentPPO in sbx (but the algorithm doesn't learn yet). It still needs to be improved to make it functional.

Description

Implement a first running version of RecurrentPPO with an LSTM layer. The algorithm doesn't support Dict observations yet, and doesn't work with arbitrary combinations of n_steps, n_envs and batch_size (n_steps has to be a multiple of batch_size).

Introduces:

  • sbx/recurrentppo directory with:
    • policies.py that adds an LSTM layer to the Actor and the Critic (a minimal flax sketch of this pattern is shown after this list)
    • recurrentppo.py that handles the RecurrentPPO model
  • recurrent.py in sbx/common to create helper functions for the recurrent rollout buffer
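
For context, here is a minimal sketch of the LSTM-in-actor pattern in flax. All class names, shapes and sizes below are illustrative assumptions, not the PR's actual code; the real Actor/Critic live in sbx/recurrentppo/policies.py.

import jax
import jax.numpy as jnp
import flax.linen as nn

class RecurrentActorSketch(nn.Module):
    # Illustrative only: a tiny actor that threads an LSTM carry through the forward pass.
    n_actions: int
    hidden_size: int = 64

    @nn.compact
    def __call__(self, obs, lstm_carry):
        x = nn.tanh(nn.Dense(self.hidden_size)(obs))
        # The (cell, hidden) carry is passed in and returned explicitly so the
        # rollout and update code can store/reset it per environment.
        lstm_carry, x = nn.LSTMCell(features=self.hidden_size)(lstm_carry, x)
        return lstm_carry, nn.Dense(self.n_actions)(x)

# Example initialization with a zero carry for a single environment and a 3-dim observation.
actor = RecurrentActorSketch(n_actions=2)
carry = nn.LSTMCell(features=64).initialize_carry(jax.random.PRNGKey(0), (1, 64))
params = actor.init(jax.random.PRNGKey(1), jnp.zeros((1, 3)), carry)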

I will keep working on the feature, but here is a list of the TODOs I thought of below. I tried to comment the code to make the changes clear, but let me know if I can improve that!

TODOs:

  • Check that the current code architecture is correct
  • Check that the LSTM implementation is correct
  • Implement the predict method in policies.py with the lstm_states (see the small carry-reset sketch after this list)
  • Fix the "Handle timeout" code in recurrentppo.py L313
  • Fix the get method of the rollout buffer so batch data comes back in the right sequence order
  • Update docs and tests
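
For the predict TODO above, a minimal sketch (the helper name and shapes are mine, not the PR's API) of how per-env lstm states could be reset at episode starts and kept otherwise:

import jax.numpy as jnp

def reset_carry_where_episode_starts(carry, initial_carry, episode_starts):
    # Illustrative helper, not code from this PR: for envs whose episode just
    # started, replace their slice of the (cell, hidden) carry with the reset
    # carry; keep the running carry for the others.
    mask = episode_starts.reshape(-1, 1)  # shape (n_envs, 1), True where a new episode begins
    return tuple(jnp.where(mask, init, prev) for init, prev in zip(initial_carry, carry))

# Example with 2 envs and a hidden size of 4, where env 0 just reset:
carry = (jnp.ones((2, 4)), jnp.ones((2, 4)))
initial = (jnp.zeros((2, 4)), jnp.zeros((2, 4)))
new_carry = reset_carry_where_episode_starts(carry, initial, jnp.array([True, False]))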

Do you see any other things to do, @araffin?

Motivation and Context

  • I have raised an issue to propose this change (required for new features and bug fixes)

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (update in the documentation)

Checklist:

  • I've read the CONTRIBUTION guide (required)
  • I have updated the changelog accordingly (required).
  • My change requires a change to the documentation.
  • I have updated the tests accordingly (required for a bug fix or a new feature).
  • I have updated the documentation accordingly.
  • I have reformatted the code using make format (required)
  • I have checked the codestyle using make check-codestyle and make lint (required)
  • I have ensured make pytest and make type both pass. (required)
  • I have checked that the documentation builds using make doc (required)

Note: You can run most of the checks using make commit-checks.

Note: we are using a maximum length of 127 characters per line

def initialize_carry(batch_size, hidden_size):
    # Returns a tuple of lstm states (hidden and cell states)
    return nn.LSTMCell(features=hidden_size).initialize_carry(
        rng=jax.random.PRNGKey(0), input_shape=(batch_size, hidden_size)
    )

Owner

always the same rng, is that intended?

Owner

if so, they can be precomputed, no?

Author

@corentinlger corentinlger Nov 1, 2024

always the same rng, is that intended?

I think it is, so the reset states are always the same (I borrowed this from purejaxrl)

if so, they can be precomputed, no?

In fact the function takes 3 different shapes during training: at the setup of RecurrentPPO, during rollout collection, and during the network updates. But these values can indeed be precomputed.

I'll ask a friend who knows LSTM PPO in JAX well to be sure.
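
For illustration, the precomputation could look like this (the sizes below are placeholders, not the PR's actual config; with flax's default zero carry initializer the rng has no effect on the result):

import jax
import flax.linen as nn

def initialize_carry(batch_size, hidden_size):
    # Fixed RNG so the reset carry is deterministic (borrowed from purejaxrl).
    return nn.LSTMCell(features=hidden_size).initialize_carry(
        rng=jax.random.PRNGKey(0), input_shape=(batch_size, hidden_size)
    )

# Hypothetical precomputation of the three carries used during training
# (setup, rollout collection, network updates); 1, N_ENVS and BATCH_SIZE are example values.
HIDDEN_SIZE, N_ENVS, BATCH_SIZE = 64, 8, 256
PRECOMPUTED_CARRIES = {n: initialize_carry(n, HIDDEN_SIZE) for n in (1, N_ENVS, BATCH_SIZE)}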

if normalize_advantage and len(advantages) > 1:
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

# TODO : something weird here because the params argument isn't used and only actor_state.params instead
Owner

does this result in an error if params is used?

Author

No, the code still works. This comes from the sbx PPO, do you want me to do a quick PR to fix it?
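
For context, a generic JAX illustration of why the TODO above is worth checking (this is not the sbx code, just the general pattern): when a loss ignores the argument it is differentiated with respect to and uses a captured copy instead, jax.grad silently returns zero gradients.

import jax
import jax.numpy as jnp

params = {"w": jnp.array(2.0)}

def loss_fn(unused_params, x):
    # Bug pattern the TODO warns about: the differentiated argument is ignored
    # and the closed-over `params` dict is used instead.
    return (params["w"] * x) ** 2

grads = jax.grad(loss_fn)(params, jnp.array(3.0))
# grads == {"w": Array(0., dtype=float32)}: zero gradient w.r.t. the unused argument.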
