Half-Cheetah and Ant reach high rewards (1000+) but get stuck in a state and don't walk #1718

hadelin2p · 2018-05-30T15:02:47Z

Hello,

I trained the ARS (from the paper here: https://arxiv.org/pdf/1803.07055.pdf) on the Half-Cheetah and Ant environments. It works very well with Mujoco but not with PyBullet. In PyBullet the reward keeps increasing up to 1000+ but the agent cannot walk, it gets stuck in a state at some point without moving. Same with the ant and the humanoid. Would you have any idea of what could be wrong? I'd highly appreciate your help on this. Please find below the code. Kind regards.

`# AI 2018

Importing the libraries

import os
import numpy as np
import gym
from gym import wrappers
import pybullet_envs

Setting the Hyper Parameters

class Hp():

def __init__(self):
    self.nb_steps = 1000
    self.episode_length = 1000
    self.learning_rate = 0.02
    self.nb_directions = 16
    self.nb_best_directions = 16
    assert self.nb_best_directions <= self.nb_directions
    self.noise = 0.03
    self.seed = 37
    self.env_name = 'HalfCheetahBulletEnv-v0'

Normalizing the states

class Normalizer():

def __init__(self, nb_inputs):
    self.n = np.zeros(nb_inputs)
    self.mean = np.zeros(nb_inputs)
    self.mean_diff = np.zeros(nb_inputs)
    self.var = np.zeros(nb_inputs)

def observe(self, x):
    self.n += 1.
    last_mean = self.mean.copy()
    self.mean += (x - self.mean) / self.n
    self.mean_diff += (x - last_mean) * (x - self.mean)
    self.var = (self.mean_diff / self.n).clip(min = 1e-2)

def normalize(self, inputs):
    obs_mean = self.mean
    obs_std = np.sqrt(self.var)
    return (inputs - obs_mean) / obs_std

Building the AI

class Policy():

def __init__(self, input_size, output_size):
    self.theta = np.zeros((output_size, input_size))

def evaluate(self, input, delta = None, direction = None):
    if direction is None:
        return self.theta.dot(input)
    elif direction == "positive":
        return (self.theta + hp.noise*delta).dot(input)
    else:
        return (self.theta - hp.noise*delta).dot(input)

def sample_deltas(self):
    return [np.random.randn(*self.theta.shape) for _ in range(hp.nb_directions)]

def update(self, rollouts, sigma_r):
    step = np.zeros(self.theta.shape)
    for r_pos, r_neg, d in rollouts:
        step += (r_pos - r_neg) * d
    self.theta += hp.learning_rate / (hp.nb_best_directions * sigma_r) * step

Exploring the policy on one specific direction and over one episode

def explore(env, normalizer, policy, direction = None, delta = None):
state = env.reset()
done = False
num_plays = 0.
sum_rewards = 0
while not done and num_plays < hp.episode_length:
normalizer.observe(state)
state = normalizer.normalize(state)
action = policy.evaluate(state, delta, direction)
state, reward, done, _ = env.step(action)
reward = max(min(reward, 1), -1)
sum_rewards += reward
num_plays += 1
return sum_rewards

Training the AI

def train(env, policy, normalizer, hp):

for step in range(hp.nb_steps):
    
    # Initializing the perturbations deltas and the positive/negative rewards
    deltas = policy.sample_deltas()
    positive_rewards = [0] * hp.nb_directions
    negative_rewards = [0] * hp.nb_directions
    
    # Getting the positive rewards in the positive directions
    for k in range(hp.nb_directions):
        positive_rewards[k] = explore(env, normalizer, policy, direction = "positive", delta = deltas[k])
    
    # Getting the negative rewards in the negative/opposite directions
    for k in range(hp.nb_directions):
        negative_rewards[k] = explore(env, normalizer, policy, direction = "negative", delta = deltas[k])
    
    # Gathering all the positive/negative rewards to compute the standard deviation of these rewards
    all_rewards = np.array(positive_rewards + negative_rewards)
    sigma_r = all_rewards.std()
    
    # Sorting the rollouts by the max(r_pos, r_neg) and selecting the best directions
    scores = {k:max(r_pos, r_neg) for k,(r_pos,r_neg) in enumerate(zip(positive_rewards, negative_rewards))}
    order = sorted(scores.keys(), key = lambda x:scores[x])[:hp.nb_best_directions]
    rollouts = [(positive_rewards[k], negative_rewards[k], deltas[k]) for k in order]
    
    # Updating our policy
    policy.update(rollouts, sigma_r)
    
    # Printing the final reward of the policy after the update
    reward_evaluation = explore(env, normalizer, policy)
    print('Step:', step, 'Reward:', reward_evaluation)

Running the main code

def mkdir(base, name):
path = os.path.join(base, name)
if not os.path.exists(path):
os.makedirs(path)
return path
work_dir = mkdir('exp', 'brs')
monitor_dir = mkdir(work_dir, 'monitor')

hp = Hp()
np.random.seed(hp.seed)
env = gym.make(hp.env_name)
env = wrappers.Monitor(env, monitor_dir, force = True)
nb_inputs = env.observation_space.shape[0]
nb_outputs = env.action_space.shape[0]
policy = Policy(nb_inputs, nb_outputs)
normalizer = Normalizer(nb_inputs)
train(env, policy, normalizer, hp)
`

The text was updated successfully, but these errors were encountered:

erwincoumans · 2018-05-30T15:08:11Z

Thanks for the report. Please attach a zip file with the modified ARS files.

Note that the PyBullet versions of the locomotion envs are derived from OpenAI Roboschool, and are much harder than default MuJoco versions.
In addition, you may need to clamp the output of the linear policy to the appropriate action range.
1000 is not a very high reward for Ant, you may want to train longer, at least until > 2000.

hadelin2p · 2018-05-30T16:08:26Z

Thanks a lot for your reply and help. Please find attached the zip file.

I understand, it's noted.
That's where I thought the problem came from. However I am not sure how to clamp the output actions with PyBullet. Could you please indicate me on this?
Sure I haven't trained that long for the ant, but long enough for the cheetah. I'd be already very happy to solve my issue for the cheetah.

ARS.zip

erwincoumans · 2018-05-30T16:11:50Z

See the modifications in this repo, in particular the numpy np.clip operations and the episode length.
jietan/ARS@f781beb

Ideally we make a 100% compatible locomotion environment to the original Gym MuJoCo envs (and drop the Roboschool version)

benelot · 2018-05-30T20:34:07Z

I have similar problems when training cheetah and ant with ppo. I will check your solution and see if there is a more generic way to make the envs work in pybullet. Is it because the simulation does not get some params from the xml file?

…

On Wed, May 30, 2018, 18:11 erwincoumans ***@***.***> wrote: See the modifications in this repo, in particular the numpy np.clip operations and the episode length. ***@***.*** <jietan/ARS@f781beb> — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#1718 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AC97q0JzQ89biZFsSZqm5wUDxgPfAKT9ks5t3sTNgaJpZM4UTdmK> .

benelot · 2018-05-30T20:40:08Z

Oh I see that you are just clipping to [-1;1]. Why does that make sense? Is this just to make the action not overload the motors in the sense that it uses the proper range of force, velocity or position? On Wed, May 30, 2018, 22:33 Benjamin Ellenberger <be.ellenberger@gmail.com> wrote:

…

I have similar problems when training cheetah and ant with ppo. I will check your solution and see if there is a more generic way to make the envs work in pybullet. Is it because the simulation does not get some params from the xml file? On Wed, May 30, 2018, 18:11 erwincoumans ***@***.***> wrote: > See the modifications in this repo, in particular the numpy np.clip > operations and the episode length. > ***@***.*** > <jietan/ARS@f781beb> > > — > You are receiving this because you are subscribed to this thread. > Reply to this email directly, view it on GitHub > <#1718 (comment)>, > or mute the thread > <https://github.com/notifications/unsubscribe-auth/AC97q0JzQ89biZFsSZqm5wUDxgPfAKT9ks5t3sTNgaJpZM4UTdmK> > . >

hadelin2p · 2018-05-30T21:30:14Z

Thanks a lot Erwin! I've just relaunched the training after adding np.clip(..., -1.0, +1.0). Besides I noticed pybullet==2.0.0 was just released so I've just upgraded as well. Let's see if it's going to work now. @benelot let me know if that fixes the issue on your side. Will keep you posted.

erwincoumans · 2018-05-31T02:50:58Z

Note that ESTool trains fine for PyBullet Ant and Half Cheetah.

You need to use the clipping for both training and test/rollout. You need to clip, since you cannot apply randomly large actions, the action space is in the range [-1.1]

hadelin2p · 2018-05-31T11:44:30Z

Hello,

I clipped the actions for both training and test but unfortunately I am getting the same results on the Half-Cheetah: the reward reached 1000 but the agent is still getting stuck not moving. Please find attached a zip folder containing:

      1. The modified code including the clipping (the clipping is applied in the evaluate() method of the Policy() class).
      2. A console screenshot of the final rewards obtained.
      3. One of the last output videos of the cheetah.

Maybe I am still missing something but that must be a tiny detail since it works very well with Mujoco (even with no clipping). Would you mind having a quick look at it? I'm close but another pair of eyes might be helpful :)

ARS.zip

hadelin2p · 2018-06-14T20:33:09Z

Hello,

I forced the exploration on the full episode by setting Done to False here:

https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_locomotion_envs.py#L64

Now it works and the agent walks just fine.

Will try on HumanoidFlagrunHarderBulletEnv-v0 tonight.

HonghuXue · 2020-04-05T17:10:30Z

Hi, Hadelin2p,
I am encountering the exact same problem using pybullet "halfcheetah" using the reinforcement learning algorithm A2C. Despite the increasing reward, but the simulation doesn't show meaningful/ expected movement, the agent gets stuck in a state as you mentioned.
How do you solve that? I don't get your answer "I forced the exploration on the full episode by setting Done to False here: https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_locomotion_envs.py#L64"
You modify the simulator or you change your algorithm setting?

erwincoumans · 2020-04-05T17:59:08Z

The Stable-Baselines has implementations of A2D and PPO that trains the HalfCheetahBulletEnv-v0 fine, check out:
https://github.com/araffin/rl-baselines-zoo

erwincoumans closed this as completed Apr 5, 2020

Clyde21c mentioned this issue Sep 28, 2021

Feature/maml_exp_baseline dongminlee94/meta-learning-for-everyone#57

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Half-Cheetah and Ant reach high rewards (1000+) but get stuck in a state and don't walk #1718

Half-Cheetah and Ant reach high rewards (1000+) but get stuck in a state and don't walk #1718

hadelin2p commented May 30, 2018

erwincoumans commented May 30, 2018 •

edited

Loading

hadelin2p commented May 30, 2018

erwincoumans commented May 30, 2018 •

edited

Loading

benelot commented May 30, 2018 via email

benelot commented May 30, 2018 via email

hadelin2p commented May 30, 2018

erwincoumans commented May 31, 2018 •

edited

Loading

hadelin2p commented May 31, 2018 •

edited

Loading

hadelin2p commented Jun 14, 2018

HonghuXue commented Apr 5, 2020

erwincoumans commented Apr 5, 2020 •

edited

Loading

Half-Cheetah and Ant reach high rewards (1000+) but get stuck in a state and don't walk #1718

Half-Cheetah and Ant reach high rewards (1000+) but get stuck in a state and don't walk #1718

Comments

hadelin2p commented May 30, 2018

Importing the libraries

Setting the Hyper Parameters

Normalizing the states

Building the AI

Exploring the policy on one specific direction and over one episode

Training the AI

Running the main code

erwincoumans commented May 30, 2018 • edited Loading

hadelin2p commented May 30, 2018

erwincoumans commented May 30, 2018 • edited Loading

benelot commented May 30, 2018 via email

benelot commented May 30, 2018 via email

hadelin2p commented May 30, 2018

erwincoumans commented May 31, 2018 • edited Loading

hadelin2p commented May 31, 2018 • edited Loading

hadelin2p commented Jun 14, 2018

HonghuXue commented Apr 5, 2020

erwincoumans commented Apr 5, 2020 • edited Loading

erwincoumans commented May 30, 2018 •

edited

Loading

erwincoumans commented May 30, 2018 •

edited

Loading

erwincoumans commented May 31, 2018 •

edited

Loading

hadelin2p commented May 31, 2018 •

edited

Loading

erwincoumans commented Apr 5, 2020 •

edited

Loading