The stable baselines implementation of TD3 cannot achieve the same performance as the original TD3 [question] #840

Closed
jeppelangaa opened this issue May 3, 2020 · 7 comments
Labels: custom gym env, question, RTFM

Comments

@jeppelangaa

I wanted to use the stable baselines implementation of TD3 so that I can compare it to other reinforcement learning algorithms more easily.

I have compared the original implementation of TD3 from sfujim to the one from stable baselines. The stable baselines version is faster in terms of computation time, but it does not achieve results as good as the original implementation. I have created a custom policy with two hidden layers of 256 nodes and set the hyperparameters according to the original implementation, as shown below:

import peg_in_hole_env
import gym
import numpy as np
import matplotlib.pyplot as plt

from stable_baselines import TD3
from stable_baselines.td3.policies import FeedForwardPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise
from stable_baselines.ddpg.policies import LnMlpPolicy
from stable_baselines import results_plotter
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines.common.noise import AdaptiveParamNoiseSpec
from stable_baselines.common.callbacks import BaseCallback
import os

# Custom MLP policy with two layers
class CustomTD3Policy(FeedForwardPolicy):
    def __init__(self, *args, **kwargs):
        super(CustomTD3Policy, self).__init__(*args, **kwargs,
                                           layers=[256, 256],
                                           layer_norm=False,
                                           feature_extraction="mlp")

env = gym.make('PegInHole-v0')

n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))


model = TD3(CustomTD3Policy, env, action_noise=action_noise, verbose=1, buffer_size=int(1e6), batch_size=256, learning_starts=int(25e3), train_freq=1, gradient_steps=1)
# Train the agent
model.learn(total_timesteps=1000000)

Using this code, the original implementation consistently achieves roughly 150% of the performance of the stable baselines implementation. Does anyone have a clue about what I might be missing, or ideas on why the results differ?

System Info
Describe the characteristics of your environment:

  • The library is installed with pip3
  • Python version 3.6.9
  • Tensorflow version 1.14.0
araffin added the question, RTFM and custom gym env labels on May 3, 2020
araffin (Collaborator) commented May 3, 2020

Hello,

You should use the hyperparameters from the rl zoo (which match the original paper), as mentioned in the documentation.
There are several differences from your current code:

  • if PegInHole-v0 has a time limit, then you should add a time feature or ignore the done signals that are not real terminations (see Soft Actor-Critic #120 (comment))
  • the hyperparameters from the original paper are (a sketch of passing them to the TD3 constructor follows the list):
  policy: 'MlpPolicy'
  gamma: 0.99
  buffer_size: 1000000
  noise_type: 'normal'
  noise_std: 0.1
  learning_starts: 10000
  batch_size: 100
  learning_rate: !!float 1e-3
  train_freq: 1000
  gradient_steps: 1000
  policy_kwargs: "dict(layers=[400, 300])"
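
For illustration, here is a minimal sketch of passing those values to the TD3 constructor on the custom env from this issue (not the rl-zoo code; PegInHole-v0 and the normal action noise are taken from the snippet above):

import gym
import numpy as np

from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('PegInHole-v0')  # custom env from this issue
n_actions = env.action_space.shape[-1]

# noise_type: 'normal', noise_std: 0.1
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

# Original-paper hyperparameters from the rl zoo, passed explicitly
model = TD3('MlpPolicy', env,
            gamma=0.99,
            buffer_size=1000000,
            learning_starts=10000,
            batch_size=100,
            learning_rate=1e-3,
            train_freq=1000,
            gradient_steps=1000,
            action_noise=action_noise,
            policy_kwargs=dict(layers=[400, 300]),
            verbose=1)
model.learn(total_timesteps=1000000)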

@Mindgames

I mentioned this before in another issue here, and I have seen others report the same. I believe something is broken in TD3 in recent releases of stable-baselines (>2.8.0).

In fact, the same hyperparameter setup that I previously used to take 3rd or 4th place on the OpenAI leaderboard produces totally useless results with 2.9.0 and 2.10.0.

I have not had time to investigate it more myself. I will try to look into it when I have time, but I encourage the stable-baselines authors to look into it ASAP, as TD3 is one of the most promising algorithms included.

Miffyli (Collaborator) commented May 8, 2020

@Mindgames

If you could provide example code and the versions that produce the different results, that would be a good start for debugging. Indeed, if there has been such a bad regression (or even a change of default behaviour), it should be fixed ASAP.
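
For example, a minimal, self-contained script along these lines, run unchanged under 2.8.0 and 2.10.0, would make the comparison concrete (a sketch only; Pendulum-v0 is just a convenient standard continuous-control env, not something from this issue):

import gym
import numpy as np

import stable_baselines
from stable_baselines import TD3
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('Pendulum-v0')
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3('MlpPolicy', env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=50000)

# Manual evaluation rollout, to avoid depending on version-specific helpers
returns = []
for _ in range(10):
    obs, done, episode_return = env.reset(), False, 0.0
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, done, _ = env.step(action)
        episode_return += reward
    returns.append(episode_return)

print(stable_baselines.__version__, np.mean(returns), '+/-', np.std(returns))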

araffin (Collaborator) commented May 8, 2020

I just ran a quick sanity check on HalfCheetahBulletEnv-v0 using the rl-zoo on Google colab:

python train.py --algo td3 --env HalfCheetahBulletEnv-v0 -params gamma:0.98 buffer_size:300000

It reaches a mean reward > 2000 in ~5e5 steps, which is the expected behavior (I got similar results using the PyTorch version, and with SAC both in SB and SB3).

Some checkpoints:

Eval num_timesteps=100000, episode_reward=662.24 +/- 35.97
Eval num_timesteps=200000, episode_reward=1257.75 +/- 20.91
Eval num_timesteps=300000, episode_reward=1575.77 +/- 6.34
Eval num_timesteps=400000, episode_reward=1974.59 +/- 14.39
Eval num_timesteps=500000, episode_reward=2314.57 +/- 26.04
Eval num_timesteps=600000, episode_reward=2604.17 +/- 26.65
Eval num_timesteps=700000, episode_reward=2613.97 +/- 31.47
Eval num_timesteps=800000, episode_reward=2746.85 +/- 16.53
Eval num_timesteps=900000, episode_reward=2681.37 +/- 10.16
Eval num_timesteps=1000000, episode_reward=2788.40 +/- 27.27

Hyperparameters:

OrderedDict([('batch_size', 100),
             ('buffer_size', 300000),
             ('env_wrapper', 'utils.wrappers.TimeFeatureWrapper'),
             ('gamma', 0.98),
             ('gradient_steps', 1000),
             ('learning_rate', 0.001),
             ('learning_starts', 10000),
             ('n_timesteps', 2000000.0),
             ('noise_std', 0.1),
             ('noise_type', 'normal'),
             ('policy', 'MlpPolicy'),
             ('policy_kwargs', 'dict(layers=[400, 300])'),
             ('train_freq', 1000)])
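
Side note: the periodic "Eval num_timesteps=..." lines come from the zoo's built-in evaluation. Outside the zoo, a roughly equivalent log can be produced with an EvalCallback; a minimal sketch (with the TimeFeatureWrapper and the tuned hyperparameters omitted for brevity):

import gym
import pybullet_envs  # noqa: F401 -- registers the Bullet envs with gym

from stable_baselines import TD3
from stable_baselines.common.callbacks import EvalCallback

env = gym.make('HalfCheetahBulletEnv-v0')
eval_env = gym.make('HalfCheetahBulletEnv-v0')

# Evaluate for 5 episodes every 100k steps, similar to the checkpoints above
eval_callback = EvalCallback(eval_env, eval_freq=100000, n_eval_episodes=5)

model = TD3('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=1000000, callback=eval_callback)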

@jeppelangaa (Author)

@araffin The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure that I understand how to use it.

I am sorry for asking something that can probably be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about.

I have been using the hyperparameters that you stated, and it seems, even though I'm not done testing yet, that the stable baselines implementation can reach the same level as the original implementation. Thank you very much.

araffin (Collaborator) commented May 11, 2020

The done signal is set to False all the time. Is that sufficient, or do I need to look more into the "hack" that you cited? I am not sure that I understand how to use it.

I think this is related to araffin/rl-baselines-zoo#79; I recommend reading the associated paper.
If you remove the done signal (as is done in the original implementation), you need to modify the algorithm. Otherwise, you should use the TimeFeatureWrapper (araffin/rl-baselines-zoo#79).
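
To illustrate the idea behind the time feature, here is a simplified sketch (not the rl-zoo TimeFeatureWrapper; it assumes a fixed, known episode length): append the remaining fraction of the episode to the observation, so the agent can tell a timeout apart from a real terminal state.

import gym
import numpy as np
from gym import spaces

class SimpleTimeFeatureWrapper(gym.Wrapper):
    """Append a time feature (remaining episode fraction) to the observation."""

    def __init__(self, env, max_steps=1000):
        super(SimpleTimeFeatureWrapper, self).__init__(env)
        low, high = env.observation_space.low, env.observation_space.high
        self.observation_space = spaces.Box(low=np.append(low, 0.0),
                                            high=np.append(high, 1.0),
                                            dtype=np.float32)
        self._max_steps = max_steps
        self._step = 0

    def reset(self, **kwargs):
        self._step = 0
        return self._augment(self.env.reset(**kwargs))

    def step(self, action):
        self._step += 1
        obs, reward, done, info = self.env.step(action)
        return self._augment(obs), reward, done, info

    def _augment(self, obs):
        # Time feature: 1.0 at the start of the episode, 0.0 at the time limit
        return np.append(obs, 1.0 - self._step / self._max_steps).astype(np.float32)

# Usage (max_steps is hypothetical and should match the env's real time limit):
# env = SimpleTimeFeatureWrapper(gym.make('PegInHole-v0'), max_steps=200)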

can reach the same level as the original implementation.

Good to hear. Please read the RL tips and tricks carefully next time ;)

I am sorry for asking something that can probably be found in the documentation, but I still haven't found the reference to the RL Zoo that you are talking about.

It is mentioned in the README, on the first page of the documentation, in the RL tips, and it has its own section in the documentation...
Anyway, the repo is here: https://github.com/araffin/rl-baselines-zoo

araffin (Collaborator) commented May 11, 2020

the stable baselines implementation can reach the same level as the original implementation

Closing this issue, then.
