The stable baselines implementation of TD3 cannot achieve the same performance as the original TD3 [question] #840

Closed
jeppelangaa opened this issue May 3, 2020 · 7 comments
Labels: custom gym env, question, RTFM

Comments

@jeppelangaa

I wanted to use the stable baselines implementation of TD3 so that I can compare it to other reinforcement learning algorithms more easily.

I have compared the original implementation of TD3 from sfujim to the one from stable baselines. The stable baselines version is faster in terms of computation time, but it does not achieve results as good as the original implementation. I have created a custom policy with two hidden layers of 256 nodes and set the hyperparameters according to the original implementation, as shown below:

import peg_in_hole_env
import gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines import TD3
from stable_baselines.td3.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise
from stable_baselines.ddpg.policies import LnMlpPolicy
from stable_baselines import results_plotter
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines.common.noise import AdaptiveParamNoiseSpec
from stable_baselines.common.callbacks import BaseCallback
import os

# Custom MLP policy with two layers
class CustomTD3Policy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args, **kwargs,
                                           layers=[256, 256],
                                           layer_norm=False,
                                           feature_extraction="mlp")

env = gym.make('PegInHole-v0')

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))


model = TD3(CustomTD3Policy, env, action_noise=action_noise, verbose=1, buffer_size=int(1e6), batch_size=256, learning_starts=int(25e3), train_freq=1, gradient_steps=1)
# Train the agent
model.learn(total_timesteps=1000000)

Using this code, the original implementation consistently achieves roughly 150% of the performance of the stable baselines implementation. Does anyone have a clue about what I might be missing, or ideas on why the results differ?

System Info
Describe the characteristics of your environment:

  • The library is installed with pip3
  • Python version 3.6.9
  • Tensorflow version 1.14.0
araffin added the question, RTFM and custom gym env labels on May 3, 2020
araffin (Collaborator) commented May 3, 2020

Hello,

You should use the hyperparameters from the rl zoo (which match the original paper), as mentioned in the documentation.
There are several differences from your current code:

  • if PegInHole-v0 has a time limit, then you should add a time feature or ignore the done signals that are not real terminations (see Soft Actor-Critic #120 (comment))
  • the hyperparameters from the original paper are (a sketch of passing them to the TD3 constructor follows the list):
  policy: 'MlpPolicy'
  gamma: 0.99
  buffer_size: 1000000
  noise_type: 'normal'
  noise_std: 0.1
  learning_starts: 10000
  batch_size: 100
  learning_rate: !!float 1e-3
  train_freq: 1000
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"
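
For illustration, here is a minimal sketch of passing those values to the TD3 constructor on the custom env from this issue (not the rl-zoo code; PegInHole-v0 and the normal action noise are taken from the snippet above):

import gym
import numpy as np

from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('PegInHole-v0')  # custom env from this issue
n_actions = env.action_space.shape[-1]

# noise_type: 'normal', noise_std: 0.1
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# Original-paper hyperparameters from the rl zoo, passed explicitly
model = TD3('MlpPolicy', env,
            gamma=0.99,
            buffer_size=1000000,
            learning_starts=10000,
            batch_size=100,
            learning_rate=1e-3,
            train_freq=1000,
            gradient_steps=1000,
            action_noise=action_noise,
            policy_kwargs=dict(layers=[400, 300]),
            verbose=1)
model.learn(total_timesteps=1000000)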

@Mindgames

I mentioned this before in another issue here, and I have seen others report the same. I believe something is broken in TD3 in recent releases of stable-baselines (>2.8.0).

In fact, the same hyperparameter setup that I previously used to take 3rd or 4th place on the OpenAI leaderboard produces totally useless results with 2.9.0 and 2.10.0.

I have not had time to investigate it more myself. I will try to look into it when I have time, but I encourage the stable-baselines authors to look into it ASAP, as TD3 is one of the most promising algorithms included.

Miffyli (Collaborator) commented May 8, 2020

@Mindgames

If you could provide example code and the versions that produce the different results, that would be a good start for debugging. Indeed, if there has been such a bad regression (or even a change of default behaviour), it should be fixed ASAP.
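
For example, a minimal, self-contained script along these lines, run unchanged under 2.8.0 and 2.10.0, would make the comparison concrete (a sketch only; Pendulum-v0 is just a convenient standard continuous-control env, not something from this issue):

import gym
import numpy as np

import stable_baselines
from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('Pendulum-v0')
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3('MlpPolicy', env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=50000)

# Manual evaluation rollout, to avoid depending on version-specific helpers
returns = []
for _ in range(10):
    obs, done, episode_return = env.reset(), False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        episode_return += reward
    returns.append(episode_return)

print(stable_baselines.__version__, np.mean(returns), '+/-', np.std(returns))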

araffin (Collaborator) commented May 8, 2020

I just ran a quick sanity check on HalfCheetahBulletEnv-v0 using the rl-zoo on Google colab:

python train.py --algo td3 --env HalfCheetahBulletEnv-v0 -params gamma:0.98 buffer_size:300000

It reaches a mean reward > 2000 in ~5e5 steps, which is the expected behavior (I got similar results using the PyTorch version, and with SAC both in SB and SB3).

Some checkpoints:

Eval num_timesteps=100000, episode_reward=662.24 +/- 35.97
Eval num_timesteps=200000, episode_reward=1257.75 +/- 20.91
Eval num_timesteps=300000, episode_reward=1575.77 +/- 6.34
Eval num_timesteps=400000, episode_reward=1974.59 +/- 14.39
Eval num_timesteps=500000, episode_reward=2314.57 +/- 26.04
Eval num_timesteps=600000, episode_reward=2604.17 +/- 26.65
Eval num_timesteps=700000, episode_reward=2613.97 +/- 31.47
Eval num_timesteps=800000, episode_reward=2746.85 +/- 16.53
Eval num_timesteps=900000, episode_reward=2681.37 +/- 10.16
Eval num_timesteps=1000000, episode_reward=2788.40 +/- 27.27

Hyperparameters:

OrderedDict([('batch_size', 100),
             ('buffer_size', 300000),
             ('env_wrapper', 'utils.wrappers.TimeFeatureWrapper'),
             ('gamma', 0.98),
             ('gradient_steps', 1000),
             ('learning_rate', 0.001),
             ('learning_starts', 10000),
             ('n_timesteps', 2000000.0),
             ('noise_std', 0.1),
             ('noise_type', 'normal'),
             ('policy', 'MlpPolicy'),
             ('policy_kwargs', 'dict(layers=[400, 300])'),
             ('train_freq', 1000)])
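
Side note: the periodic "Eval num_timesteps=..." lines come from the zoo's built-in evaluation. Outside the zoo, a roughly equivalent log can be produced with an EvalCallback; a minimal sketch (with the TimeFeatureWrapper and the tuned hyperparameters omitted for brevity):

import gym
import pybullet_envs  # noqa: F401 -- registers the Bullet envs with gym

from stable_baselines import TD3
from stable_baselines.common.callbacks import EvalCallback

env = gym.make('HalfCheetahBulletEnv-v0')
eval_env = gym.make('HalfCheetahBulletEnv-v0')

# Evaluate for 5 episodes every 100k steps, similar to the checkpoints above
eval_callback = EvalCallback(eval_env, eval_freq=100000, n_eval_episodes=5)

model = TD3('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=1000000, callback=eval_callback)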

@jeppelangaa (Author)

@araffin The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure that I understand how to use it.

I am sorry for asking something that can probably be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about.

I have been using the hyperparameters that you stated, and it seems, even though I'm not done testing yet, that the stable baselines implementation can reach the same level as the original implementation. Thank you very much.

araffin (Collaborator) commented May 11, 2020

The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure that I understand how to use it.

I think this is related to araffin/rl-baselines-zoo#79; I recommend reading the associated paper.
If you remove the done signal (as is done in the original implementation), you need to modify the algorithm. Otherwise, you should use the TimeFeatureWrapper (araffin/rl-baselines-zoo#79).
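
To illustrate the idea behind the time feature, here is a simplified sketch (not the rl-zoo TimeFeatureWrapper; it assumes a fixed, known episode length): append the remaining fraction of the episode to the observation, so the agent can tell a timeout apart from a real terminal state.

import gym
import numpy as np
from gym import spaces

class SimpleTimeFeatureWrapper(gym.Wrapper):
    """Append a time feature (remaining episode fraction) to the observation."""

    def __init__(self, env, max_steps=1000):
        super(SimpleTimeFeatureWrapper, self).__init__(env)
        low, high = env.observation_space.low, env.observation_space.high
        self.observation_space = spaces.Box(low=np.append(low, 0.0),
                                            high=np.append(high, 1.0),
                                            dtype=np.float32)
        self._max_steps = max_steps
        self._step = 0

    def reset(self, **kwargs):
        self._step = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        self._step += 1
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        # Time feature: 1.0 at the start of the episode, 0.0 at the time limit
        return np.append(obs, 1.0 - self._step / self._max_steps).astype(np.float32)

# Usage (max_steps is hypothetical and should match the env's real time limit):
# env = SimpleTimeFeatureWrapper(gym.make('PegInHole-v0'), max_steps=200)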

can reach the same level as the original implementation.

Good to hear. Please read the RL tips and tricks carefully next time ;)

I am sorry for asking something that can probably be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about.

It is mentioned in the README, on the first page of the documentation, in the RL tips, and it has its own section in the documentation...
Anyway, the repo is here: https://github.com/araffin/rl-baselines-zoo

araffin (Collaborator) commented May 11, 2020

the stable baselines implementation can reach the same level as the original implementation

Closing this issue, then.
