
Half-Cheetah and Ant reach high rewards (1000+) but get stuck in a state and don't walk #1718

Closed
hadelin2p opened this issue May 30, 2018 · 11 comments


@hadelin2p

Hello,

I trained ARS (from the paper here: https://arxiv.org/pdf/1803.07055.pdf) on the Half-Cheetah and Ant environments. It works very well with MuJoCo but not with PyBullet. In PyBullet the reward keeps increasing up to 1000+, but the agent cannot walk: at some point it gets stuck in a state and stops moving. The same happens with the Ant and the Humanoid. Would you have any idea of what could be wrong? I'd highly appreciate your help on this. Please find the code below. Kind regards.

```python
# AI 2018

# Importing the libraries
import os
import numpy as np
import gym
from gym import wrappers
import pybullet_envs

# Setting the Hyper Parameters

class Hp():

    def __init__(self):
        self.nb_steps = 1000
        self.episode_length = 1000
        self.learning_rate = 0.02
        self.nb_directions = 16
        self.nb_best_directions = 16
        assert self.nb_best_directions <= self.nb_directions
        self.noise = 0.03
        self.seed = 37
        self.env_name = 'HalfCheetahBulletEnv-v0'

# Normalizing the states

class Normalizer():

    def __init__(self, nb_inputs):
        self.n = np.zeros(nb_inputs)
        self.mean = np.zeros(nb_inputs)
        self.mean_diff = np.zeros(nb_inputs)
        self.var = np.zeros(nb_inputs)

    def observe(self, x):
        self.n += 1.
        last_mean = self.mean.copy()
        self.mean += (x - self.mean) / self.n
        self.mean_diff += (x - last_mean) * (x - self.mean)
        self.var = (self.mean_diff / self.n).clip(min = 1e-2)

    def normalize(self, inputs):
        obs_mean = self.mean
        obs_std = np.sqrt(self.var)
        return (inputs - obs_mean) / obs_std

# Building the AI

class Policy():

    def __init__(self, input_size, output_size):
        self.theta = np.zeros((output_size, input_size))

    def evaluate(self, input, delta = None, direction = None):
        if direction is None:
            return self.theta.dot(input)
        elif direction == "positive":
            return (self.theta + hp.noise*delta).dot(input)
        else:
            return (self.theta - hp.noise*delta).dot(input)

    def sample_deltas(self):
        return [np.random.randn(*self.theta.shape) for _ in range(hp.nb_directions)]

    def update(self, rollouts, sigma_r):
        step = np.zeros(self.theta.shape)
        for r_pos, r_neg, d in rollouts:
            step += (r_pos - r_neg) * d
        self.theta += hp.learning_rate / (hp.nb_best_directions * sigma_r) * step

# Exploring the policy on one specific direction and over one episode

def explore(env, normalizer, policy, direction = None, delta = None):
    state = env.reset()
    done = False
    num_plays = 0.
    sum_rewards = 0
    while not done and num_plays < hp.episode_length:
        normalizer.observe(state)
        state = normalizer.normalize(state)
        action = policy.evaluate(state, delta, direction)
        state, reward, done, _ = env.step(action)
        reward = max(min(reward, 1), -1)
        sum_rewards += reward
        num_plays += 1
    return sum_rewards

# Training the AI

def train(env, policy, normalizer, hp):

    for step in range(hp.nb_steps):

        # Initializing the perturbations deltas and the positive/negative rewards
        deltas = policy.sample_deltas()
        positive_rewards = [0] * hp.nb_directions
        negative_rewards = [0] * hp.nb_directions

        # Getting the positive rewards in the positive directions
        for k in range(hp.nb_directions):
            positive_rewards[k] = explore(env, normalizer, policy, direction = "positive", delta = deltas[k])

        # Getting the negative rewards in the negative/opposite directions
        for k in range(hp.nb_directions):
            negative_rewards[k] = explore(env, normalizer, policy, direction = "negative", delta = deltas[k])

        # Gathering all the positive/negative rewards to compute the standard deviation of these rewards
        all_rewards = np.array(positive_rewards + negative_rewards)
        sigma_r = all_rewards.std()

        # Sorting the rollouts by the max(r_pos, r_neg) and selecting the best directions
        scores = {k:max(r_pos, r_neg) for k,(r_pos,r_neg) in enumerate(zip(positive_rewards, negative_rewards))}
        order = sorted(scores.keys(), key = lambda x:scores[x])[:hp.nb_best_directions]
        rollouts = [(positive_rewards[k], negative_rewards[k], deltas[k]) for k in order]

        # Updating our policy
        policy.update(rollouts, sigma_r)

        # Printing the final reward of the policy after the update
        reward_evaluation = explore(env, normalizer, policy)
        print('Step:', step, 'Reward:', reward_evaluation)

# Running the main code

def mkdir(base, name):
    path = os.path.join(base, name)
    if not os.path.exists(path):
        os.makedirs(path)
    return path

work_dir = mkdir('exp', 'brs')
monitor_dir = mkdir(work_dir, 'monitor')

hp = Hp()
np.random.seed(hp.seed)
env = gym.make(hp.env_name)
env = wrappers.Monitor(env, monitor_dir, force = True)
nb_inputs = env.observation_space.shape[0]
nb_outputs = env.action_space.shape[0]
policy = Policy(nb_inputs, nb_outputs)
normalizer = Normalizer(nb_inputs)
train(env, policy, normalizer, hp)
```

@erwincoumans
Member

erwincoumans commented May 30, 2018

Thanks for the report. Please attach a zip file with the modified ARS files.

  1. Note that the PyBullet versions of the locomotion envs are derived from OpenAI Roboschool and are much harder than the default MuJoCo versions.
  2. In addition, you may need to clamp the output of the linear policy to the appropriate action range (see the sketch after this list).
  3. 1000 is not a very high reward for Ant; you may want to train longer, at least until the reward exceeds 2000.
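
For point 2, here is a minimal sketch of one way to do the clamping, modifying the evaluate() method of the Policy() class posted above (my illustration, not necessarily the exact change intended; it assumes the action range is [-1, 1], as it is for the Bullet locomotion envs):

```python
# Illustrative sketch: clamp the linear policy's output to the action range.
# Assumes actions live in [-1, 1], as in the PyBullet locomotion envs,
# and that np and hp are available as in the code posted above.
def evaluate(self, input, delta = None, direction = None):
    if direction is None:
        action = self.theta.dot(input)
    elif direction == "positive":
        action = (self.theta + hp.noise * delta).dot(input)
    else:
        action = (self.theta - hp.noise * delta).dot(input)
    return np.clip(action, -1.0, 1.0)
```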

@hadelin2p
Author

Thanks a lot for your reply and help. Please find attached the zip file.

  1. I understand, it's noted.
  2. That's where I thought the problem came from. However, I am not sure how to clamp the output actions with PyBullet. Could you please advise me on how to do this?
  3. Sure, I haven't trained that long for the Ant, but long enough for the Cheetah. I'd already be very happy to solve my issue for the Cheetah.

ARS.zip

@erwincoumans
Member

erwincoumans commented May 30, 2018

See the modifications in this repo, in particular the numpy np.clip operations and the episode length.
jietan/ARS@f781beb

Ideally we would make a locomotion environment that is 100% compatible with the original Gym MuJoCo envs (and drop the Roboschool version).

@benelot
Contributor

benelot commented May 30, 2018 via email

@benelot
Contributor

benelot commented May 30, 2018 via email

@hadelin2p
Author

Thanks a lot Erwin! I've just relaunched the training after adding np.clip(..., -1.0, +1.0). Besides, I noticed that pybullet==2.0.0 was just released, so I upgraded as well. Let's see if it works now. @benelot let me know if that fixes the issue on your side. Will keep you posted.

@erwincoumans
Member

erwincoumans commented May 31, 2018

Note that ESTool trains fine for PyBullet Ant and Half Cheetah.

You need to apply the clipping for both training and test/rollout. You need to clip because you cannot apply arbitrarily large actions: the action space is in the range [-1, 1].
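
A minimal sketch of this (assuming the explore() loop from the code posted above): clip right before env.step() using the bounds reported by the environment, so the same clipping covers both training rollouts and evaluation rollouts.

```python
# Illustrative sketch: clip in the rollout loop, using the environment's
# own action bounds, so it applies to training and test rollouts alike.
action = policy.evaluate(state, delta, direction)
action = np.clip(action, env.action_space.low, env.action_space.high)
state, reward, done, _ = env.step(action)
```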

@hadelin2p
Author

hadelin2p commented May 31, 2018

Hello,

I clipped the actions for both training and test, but unfortunately I am getting the same results on the Half-Cheetah: the reward reached 1000 but the agent still gets stuck and does not move. Please find attached a zip folder containing:

  1. The modified code including the clipping (the clipping is applied in the evaluate() method of the Policy() class).
  2. A console screenshot of the final rewards obtained.
  3. One of the last output videos of the cheetah.

Maybe I am still missing something, but it must be a tiny detail, since it works very well with MuJoCo (even with no clipping). Would you mind having a quick look at it? I'm close, but another pair of eyes might be helpful :)

ARS.zip

@hadelin2p
Author

Hello,

I forced the exploration to run over the full episode by setting done to False here:

https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_locomotion_envs.py#L64

Now it works and the agent walks just fine.

Will try on HumanoidFlagrunHarderBulletEnv-v0 tonight.
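
For readers who would rather not edit the installed pybullet_envs sources, here is a minimal sketch of an alternative workaround (my own illustration, not the exact change described above): keep each rollout going for the full hp.episode_length inside explore(), resetting whenever the environment terminates early.

```python
# Illustrative sketch: run every rollout for the full episode length
# instead of stopping on early termination; reset the env if it ends.
# Assumes np and hp from the code posted earlier in this thread.
def explore(env, normalizer, policy, direction = None, delta = None):
    state = env.reset()
    sum_rewards = 0
    for _ in range(hp.episode_length):
        normalizer.observe(state)
        state = normalizer.normalize(state)
        action = policy.evaluate(state, delta, direction)
        state, reward, done, _ = env.step(action)
        sum_rewards += max(min(reward, 1), -1)
        if done:
            state = env.reset()
    return sum_rewards
```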

@HonghuXue

Hi hadelin2p,
I am encountering the exact same problem with the PyBullet HalfCheetah, using the A2C reinforcement learning algorithm. Despite the increasing reward, the simulation doesn't show meaningful/expected movement; the agent gets stuck in a state, as you mentioned.
How did you solve that? I don't understand your answer "I forced the exploration on the full episode by setting done to False here: https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/gym_locomotion_envs.py#L64"
Did you modify the simulator, or did you change your algorithm settings?

@erwincoumans
Member

erwincoumans commented Apr 5, 2020

Stable-Baselines has implementations of A2C and PPO that train HalfCheetahBulletEnv-v0 fine; check out:
https://github.com/araffin/rl-baselines-zoo
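
For reference, a minimal training sketch using the Stable-Baselines 2.x API (hyperparameters here are placeholders; the tuned ones live in the rl-baselines-zoo repo linked above):

```python
# Illustrative sketch: train PPO2 (stable-baselines 2.x) on HalfCheetahBulletEnv-v0.
# Hyperparameters are placeholders; see rl-baselines-zoo for tuned values.
import gym
import pybullet_envs  # noqa: F401  (registers the Bullet envs with gym)
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('HalfCheetahBulletEnv-v0')])
model = PPO2('MlpPolicy', env, verbose=1)
model.learn(total_timesteps=2000000)
model.save('ppo2_halfcheetah_bullet')
```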

[Screenshot attached: 2020-04-05 10-58-03]
