Average PPO implementation #212

Howuhh · 2022-06-21T17:48:05Z

Description

Implementation of new algorithm - APO. See #210. Work in progress.

Types of changes

Bug fix
New feature
New algorithm
Documentation

Checklist:

I've read the CONTRIBUTION guide (required).
I have ensured pre-commit run --all-files passes (required).
I have updated the documentation and previewed the changes via mkdocs serve.
I have updated the tests accordingly (if applicable).

If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See #137 as an example PR.

vercel · 2022-06-21T17:48:10Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Updated
cleanrl	✅ Ready (Inspect)	Visit Preview	Jun 24, 2022 at 1:51PM (UTC)

Howuhh · 2022-06-21T17:51:57Z

@vwxyzjn I'd appreciate if you could take a quick look at the code (without going into details) to check that I match the style of the rest of the library.

Howuhh · 2022-06-21T20:27:58Z

I also don't quite understand the decision to evaluate an agent by episodic reward and with stochastic actions. This is especially noticeable with --capture-video as it slows down the training a lot (from 3k fps for 64 envs to ~700-600, 5min -> 20min for 1M). This would not be a problem if the the agent was checked periodically, every few updates. This would add some code since you have to write a rollout function, but it is usually very similar for many agents.

vwxyzjn

@Howuhh Thanks for making the PR - it looks really great. I did a pass of editing and removed some of the formatting changes - we want to minimize the lines of code difference between say ppo_continous_action.py and apo_continuous_action.py to make it easy to spot exactly how these two algorithms are different. We do this also to improve maintainability in the future.

Formatting changes like values = torch.zeros(args.num_steps, args.num_envs, device=device) or lr_scheduler = optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0, end_factor=0.0, total_iters=num_updates) should come in a separate PR that fixes this for all affected files, and we should also discuss the design - if we are going with the LinearLR route, it might be worth it to make it an argparse to configure other types of lr scheduler.

vwxyzjn · 2022-06-21T22:02:38Z

I also don't quite understand the decision to evaluate an agent by episodic reward and with stochastic actions.

Are you suggesting we do deterministic evaluation by using the mean of the distribution as actions?

This is especially noticeable with --capture-video as it slows down the training a lot (from 3k fps for 64 envs to ~700-600, 5min -> 20min for 1M). This would not be a problem if the agent was checked periodically, every few updates. This would add some code since you have to write a rollout function, but it is usually very similar for many agents.

I don't follow. What is the problem and how is it related to capture-video?

Howuhh · 2022-06-21T22:13:25Z

What is the problem and how is it related to capture-video

Always capturing video from one of the envs during training has a noticeable overhead (3x slower on my PC on cpu). From my experience it will be faster to not recording video during experience gathering and record it separately on single eval_env every N updates. However, it would require to change a little bit of code for the environment creation, so maybe this is not a priority right now (or at all), I just wanted to share an observation.

Formatting changes like values = torch.zeros(args.num_steps, args.num_envs, device=device) ... should come in a separate PR that fixes this for all affected files

This is fine. I'm used to using this method because .to() sometimes creates an extra data copy, which can be avoided

vwxyzjn · 2022-06-21T22:59:21Z

the video recording wrapper actually records according to a particular schedule and should not slow down training a lot. See https://www.gymlibrary.ml/content/wrappers/

vwxyzjn · 2022-06-22T00:12:28Z

The prototype LGTM. Could you start running some benchmark experiments?

I suggest running

export WANDB_ENTITY=openrlbenchmark
poetry install -E "mujoco pybullet"
python -c "import mujoco_py"
OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
    --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 \
    --command "poetry run python cleanrl/apo_continuous_action.py --cuda False --track --capture-video" \
    --num-seeds 3 \
    --workers 9

Howuhh · 2022-06-22T00:30:33Z

Yup, I started to run some stuff. At first I will be testing on Swimmer-v3 as the results on it are very different from the other environments with PPO (for the better) and so it is easier for me to compare the results to the paper.

This leaves only the question of hyperparameters. Do I have to start with the default ones? I think that some values, such as gae-lambda and value-constraint can affect the quality quite a lot.

P.S. with some hparams tweaking it learn Swimmer well ~ 360 return on 1M steps (which is considered as solved).

vwxyzjn · 2022-06-22T00:37:31Z

Maybe leave use the same set of hyperparameters provided in the paper as a starting place. If another set of hyperparameters is used, we should clearly state this difference in the documentation.

Overall, it's desirable to have the same or better performance compared to the original paper.

Howuhh · 2022-06-22T00:38:53Z

Well, the paper indicates only the enumeration sweep that they used, but not the final best ones. 😞

vwxyzjn · 2022-06-22T00:41:18Z

Isn't there a hyperparameter table?

Howuhh · 2022-06-22T00:43:58Z

Yes, it is it! The most important parameters are given here only as a grid for a sweep. It doesn't specify num-envs/num-steps also, but I suppose we could leave them as standard 1 and 2048

vwxyzjn · 2022-06-22T00:50:46Z

Yeah let’s keep the default num_envs=1 and num_steps=2048 for now. I just looked through the original implementation and it looks like it's quite a different architecture.

In the future we can probably try different num_envs, especially with envpool. We have some early envpool expeirments with PPO and it seems to perform really well compared to gym's vecenv.

https://wandb.ai/openrlbenchmark/envpool-cleanrl/reports/CleanRL-experiments--VmlldzoyMTA4ODM2

Howuhh · 2022-06-22T12:22:11Z

@vwxyzjn So, what is the policy of submitting runs to the wandb? Should I first experiment on my local private project and then re-run final evaluation to the openrlbenchmark/cleanrl? Or I can make a project in openrlbenchmark/apo and submit experimantal runs here?

Also, how you group them for the reports? I don't see WANDB_RUN_GROUP set anyware

vwxyzjn · 2022-06-22T13:58:09Z

Feel free to submit to openrlbenchmark/cleanrl. We don’t use WANDB_GROUP really. The grouping happens manually in the Wandb site when we create reports.

Howuhh · 2022-06-22T14:17:43Z

First sanity-check on 3 seeds, seems like it is working as expected on Swimmer-v3. Even better than in paper, but they use more seeds.

vwxyzjn · 2022-06-22T14:18:36Z

Feel free to either use 3 or 5 seeds. It's awesome to see that the performance is even better than the paper!

Howuhh · 2022-06-22T14:24:23Z

It is a deep mystery to me why it works so well on this particular environment. Algorithms based on discounted reward can only solve it if you set gamma=0.9999, but with such gamma they became very unstable from my experience.

P.S. I have a theory (or a guess) that APO will be much more efficient in exploration with RND for example than PPO (based on one plot from paper for Humanoid-v3).

Howuhh · 2022-06-22T19:27:30Z

@vwxyzjn Results for APO Gym Mujoco will be in this report:
https://wandb.ai/openrlbenchmark/cleanrl/reports/-WIP-APO-on-Gym-Mujoco---VmlldzoyMjEwMjY4
Feel free to edit or suggest changes. Runtime ideally should be same as PPO (as there is no additional computations), but my PC is slow, so there will be a big difference.

Results for HalfCheetah also a little bit better than in paper.

vwxyzjn · 2022-06-22T21:23:13Z

Oh, this is very nice. Thanks for running the experiments. Oh here is quick trick for running the experiments faster - try run python apo_continuous_action.py --cuda False (yes in this case GPU makes it slower).

Also, could you rename the file to avg_ppo_continuous_action.py? It's going to be more explicit I think (and don't worry about re-running the experiments).

vwxyzjn · 2022-06-22T21:29:03Z

I modified the report a bit to make it look nicer:

Howuhh · 2022-06-22T21:34:21Z

@vwxyzjn While it will be more explicit, I think we should respect the choice of the author of this method, since he called it APO.

try run python apo_continuous_action.py --cuda False

I don't have cuda, so it is on cpu. Without --capture-video it a lot faster tho, I can easily test one gae-lambda choice in 20m for 3 seeds.
Thanks for the style corrections, looks better now.

vwxyzjn · 2022-06-24T12:58:51Z

@Howuhh are all the experiments done with the same set of hyperparameters?

Howuhh · 2022-06-24T13:35:45Z

No, for now I vary gae-lambda as my primary goal is to replicate results from paper, not to properly compare with PPO (but I will come to that). All other hparams are default. Also, comparison with PPO should be done by average reward metric, not return (for now I plot return, as ppo does not log average reward).

Howuhh · 2022-06-24T20:13:35Z

So, the short conclusion from first experiments on Swimmer, HalfCheetah, Ant, Walker, Hopper:

I can match paper performance on Swimmer, HalfCheetah, Ant
While on Walker and Hopper performance is significantly worse for some reason.

In general pattern also match paper results: APO excels in env without termination states, while fail where they are exists (in terms of return metric, in terms of average reward APO should beat PPO in every env).

P.S. I will try to find a difference in original implementation, there are some configs

vwxyzjn · 2022-06-26T22:14:18Z

That sounds good. Thank you!

Howuhh · 2022-06-27T15:28:49Z

I can restore the original quality on Hopper and Walker, but with quite specific parameters. I don't think it's worth it, given that the results on these environments are only needed for pedagogical purposes (to show that the average reward is not the best choice if it's an episodic task).

I will run benchmark on Swimmer, HalfCheetah, Ant (need to tune hparams a bit to work well on them all with same set). @vwxyzjn Would be this enough for algorithm adoption?

Still, I think this PR also should wait for new gym api as done bootstrapping is even more important here than in PPO. All runs I will transfer to cleanrl-cache project.

UPDATE: if I set terminate_when_unhealthy=False then performance dramatically increases which is quie expected, but I don't know whether it's fair.

vwxyzjn · 2022-06-29T13:58:42Z

Thanks for the update.

I will run benchmark on Swimmer, HalfCheetah, Ant (need to tune hparams a bit to work well on them all with same set). @vwxyzjn Would be this enough for algorithm adoption?

The main thing I'm looking for is one set of hyperparameters that work relatively well with most of the games tested. It's not as clean to me if APO has a tuned parameter per game because then the comparison with PPO would not be insightful - I could do some tuning with PPO, too. You might be also interested in https://wandb.ai/openrlbenchmark/envpool-cleanrl/reports/MuJoCo-CleanRL-EnvPool-vs-openai-baselines--VmlldzoyMjA2NjI2, where we tuned PPO to obtain higher rewards and shorter training time in every task tested except in Swimmer.

Still, I think this PR also should wait for new gym api as done bootstrapping is even more important here than in PPO. All runs I will transfer to cleanrl-cache project.

That's largely up to you. Note that then the contribution might be as long as 6 months (depending on when the new API is introduced). I have a slight preference for this PR to be merged early - you are already very close. If timelimit handling is really important, feel free to just do it with the current API like discussed in #198

            # Handle timeout by bootstraping with value function
            # see GitHub issue #633
            for idx, done_ in enumerate(dones):
                if (
                    done_
                    and infos[idx].get("terminal_observation") is not None
                    and infos[idx].get("TimeLimit.truncated", False)
                ):
                    terminal_obs = self.policy.obs_to_tensor(infos[idx]["terminal_observation"])[0]
                    with th.no_grad():
                        terminal_value = self.policy.predict_values(terminal_obs)[0]
                    rewards[idx] += self.gamma * terminal_value

terminate_when_unhealthy=False

What is this? Is this an option in the gym environment?

Howuhh · 2022-06-29T14:10:35Z

What is this? Is this an option in the gym environment?

Yeah, this option disables done=True on unsafe states like falling, such that done will be only on timelimit. For APO it is important as it is designed for tasks without termination states, with this option it learns on Hopper, Walker a lot quicklier and get higher reward in general. If safety is not an issue, then a don't see why we can't disable dones here.

It's not as clean to me if APO has a tuned parameter per game

I totally agree, I will run some tuning on all of them, it will just take some time. I will post an update with wandb report with results on all envs with same hparams.

vwxyzjn · 2022-06-29T15:44:27Z

Yeah, this option disables done=True on unsafe states like falling,

This changes the environment which makes a fair comparison more difficult; I am open to doing it though as long as it is clearly documented.

I totally agree, I will run some tuning on all of them, it will just take some time. I will post an update with wandb report with results on all envs with same hparams.

Thank you! You might also be interested in a hyperparameter tuning workflow I used.

Download https://wandb.ai/costa-huang/cleanRL/runs/6ai2uv8t/code?workspace=user-costa-huang and named it tune.py
Create a sweep.yaml file filled with

command:
  - ${env}
  - ${interpreter}
  - ${program}
  - ${args}
method: bayes
metric:
  goal: maximize
  name: normalized_scores
parameters:
  anneal-lr:
    values:
      - true
      - false
  clip-vloss:
    values:
      - true
      - false
  ent-coef:
    values:
      - 0
      - 0.01
  learning-rate:
    distribution: uniform
    max: 0.003
    min: 0.0003
  max-grad-norm:
    distribution: uniform
    max: 4
    min: 0.5
  num-minibatches:
    values:
      - 1
      - 2
      - 4
  num-steps:
    values:
      - 5
      - 16
      - 32
      - 64
      - 128
  update-epochs:
    values:
      - 1
      - 2
      - 4
  vf-coef:
    distribution: uniform
    max: 4
    min: 0
program: tuner.py

run wandb sweep sweep.yaml -p cleanrl which will prompt you to execute something like wandb agent costa-huang/cleanRL/ngpwrthg.

It will generate a sweep that looks like this https://wandb.ai/costa-huang/cleanRL/sweeps/ngpwrthg?workspace=user-costa-huang.

vwxyzjn · 2022-07-15T14:57:37Z

Please disregard the hyperparameter optimization guide above and follow the #228 if you want to try it.

The following is the one I used recently to tune ppo in mujoco.

import optuna

from cleanrl_utils.tuner import Tuner

tuner = Tuner(
    script="cleanrl/ppo_continuous_action_envpool_jax.py",
    metric="charts/episodic_return",
    metric_last_n_average_window=50,
    direction="maximize",
    target_scores={
        "HalfCheetah-v4": [-1000, 8000],
        "Walker2d-v4": [-1000, 6000],
        "Ant-v4": [-1000, 6000],
    },
    params_fn=lambda trial: {
        "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
        "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
        "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
        "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
        "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
        "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
        "num-envs": 64
    },
    pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
    wandb_kwargs={"project": "cleanrl"},
)
tuner.tune(
    num_trials=100,
    num_seeds=3,
)

wip APO implementation

c21fee4

vercel bot deployed to Preview June 21, 2022 17:48 View deployment

small bug fix, tmp asserts

6c61b6c

vercel bot deployed to Preview June 21, 2022 19:58 View deployment

Style changes

a8e99c1

vercel bot deployed to Preview June 21, 2022 21:52 View deployment

vwxyzjn reviewed Jun 21, 2022

View reviewed changes

default params change, log mean reward

d381f3d

Howuhh and others added 2 commits June 22, 2022 02:00

merge style changed after review

5716c25

Quick fix

c40982f

vercel bot deployed to Preview June 22, 2022 00:12 View deployment

Merge branch 'apo' of github.com:Howuhh/cleanrl into apo

8896323

start first runs for benchmarking

da62b6d

vercel bot deployed to Preview June 24, 2022 13:51 View deployment

vwxyzjn closed this Jul 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Average PPO implementation #212

Average PPO implementation #212

Howuhh commented Jun 21, 2022

vercel bot commented Jun 21, 2022 •

edited

Loading

Howuhh commented Jun 21, 2022 •

edited

Loading

Howuhh commented Jun 21, 2022 •

edited

Loading

vwxyzjn left a comment •

edited

Loading

vwxyzjn commented Jun 21, 2022

Howuhh commented Jun 21, 2022 •

edited

Loading

vwxyzjn commented Jun 21, 2022

vwxyzjn commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 24, 2022

Howuhh commented Jun 24, 2022 •

edited

Loading

Howuhh commented Jun 24, 2022 •

edited

Loading

vwxyzjn commented Jun 26, 2022

Howuhh commented Jun 27, 2022 •

edited

Loading

vwxyzjn commented Jun 29, 2022

Howuhh commented Jun 29, 2022

vwxyzjn commented Jun 29, 2022

vwxyzjn commented Jul 15, 2022 •

edited

Loading

Average PPO implementation #212

Average PPO implementation #212

Conversation

Howuhh commented Jun 21, 2022

Description

Types of changes

Checklist:

vercel bot commented Jun 21, 2022 • edited Loading

Howuhh commented Jun 21, 2022 • edited Loading

Howuhh commented Jun 21, 2022 • edited Loading

vwxyzjn left a comment • edited Loading

Choose a reason for hiding this comment

vwxyzjn commented Jun 21, 2022

Howuhh commented Jun 21, 2022 • edited Loading

vwxyzjn commented Jun 21, 2022

vwxyzjn commented Jun 22, 2022 • edited Loading

Howuhh commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 • edited Loading

Howuhh commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 22, 2022

Howuhh commented Jun 22, 2022 • edited Loading

vwxyzjn commented Jun 24, 2022

Howuhh commented Jun 24, 2022 • edited Loading

Howuhh commented Jun 24, 2022 • edited Loading

vwxyzjn commented Jun 26, 2022

Howuhh commented Jun 27, 2022 • edited Loading

vwxyzjn commented Jun 29, 2022

Howuhh commented Jun 29, 2022

vwxyzjn commented Jun 29, 2022

vwxyzjn commented Jul 15, 2022 • edited Loading

vercel bot commented Jun 21, 2022 •

edited

Loading

Howuhh commented Jun 21, 2022 •

edited

Loading

Howuhh commented Jun 21, 2022 •

edited

Loading

vwxyzjn left a comment •

edited

Loading

Howuhh commented Jun 21, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

vwxyzjn commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 22, 2022 •

edited

Loading

Howuhh commented Jun 24, 2022 •

edited

Loading

Howuhh commented Jun 24, 2022 •

edited

Loading

Howuhh commented Jun 27, 2022 •

edited

Loading

vwxyzjn commented Jul 15, 2022 •

edited

Loading