diff --git a/README.md b/README.md index c20277f..aca4033 100644 --- a/README.md +++ b/README.md @@ -11,12 +11,12 @@ MinAtar is a testbed for AI agents which implements miniaturized versions of sev

-## Quick Start +## Standard Quick Start To use MinAtar, you need python3 installed; make sure pip is also up to date. To run the included `DQN` and `AC_lambda` examples, you need `PyTorch`. To install MinAtar, please follow the steps below: 1. Clone the repo: ```bash -git clone https://github.com/kenjyoung/MinAtar.git +git clone https://github.com/Robertboy18/MinAtar-Faster.git ``` If you prefer running MinAtar in a virtualenv, you can do the following before step 2: ```bash @@ -29,6 +29,7 @@ pip install --upgrade pip 2. Install MinAtar: ```bash pip install . +pip install -r requirements.txt ``` If you have any issues with automatic dependency installation, you can instead install the necessary dependencies manually and run ```bash @@ -55,6 +56,193 @@ Use the arrow keys to move and space bar to fire. Also, press q to quit and r to Also included in the examples directory are example implementations of DQN (dqn.py) and online actor-critic with eligibility traces (AC_lambda.py). +## Using the Optimized Code with Various Agents + +To run your first experiment: +``` +python3 main.py --agent-json config/agent/SAC.json --env-json config/environment/AcrobotContinuous-v1.json --index 0 +``` + +# Usage +The file main.py trains an agent for a specified number of runs, based on an environment and an agent configuration file found in config/environment/ and config/agent/, respectively. The data is saved in the results directory, with a name derived from the environment and agent names. + +For more information on how to use the main.py program, see the `--help` option: +``` +Usage: main.py [OPTIONS] + + Given agent and environment configuration files, run the experiment defined + by the configuration files + +Options: + --env-json TEXT Path to the environment json configuration file + [required] + --agent-json TEXT Path to the agent json configuration file [required] + --index INTEGER The index of the hyperparameter to run + -m, --monitor Whether or not to render the scene as the agent trains. + -a, --after INTEGER How many timesteps (training) should pass before + rendering the scene + --save-dir TEXT Which directory to save the results file in + --help Show this message and exit. +``` + +Example: +``` +./main.py --env-json config/environment/MountainCarContinuous-v0.json --agent-json config/agent/linearAC.json --index 0 --monitor --after 1000 +``` +will run the experiment using linear-Gaussian actor-critic on the mountain +car environment. The experiment is run on one process (serially), and the +scene is rendered after 1000 timesteps of training. We will only run the +hyperparameter setting with index 0. + +# Hyperparameter settings +The hyperparameter settings are laid out in the agent configuration files. +The files are laid out such that each hyperparameter is specified as a list of values, and the +total number of hyperparameter settings is the product of the lengths of each +of these lists. For example, if the agent config file looks like: +``` +{ + "agent_name": "linearAC", + "parameters": + { + "decay": [0.5], + "critic_lr": [0.005, 0.1, 0.3], + "actor_lr": [0.005, 0.1, 0.3], + "avg_reward_lr": [0.1, 0.3, 0.5, 0.9], + "scaled": [true], + "clip_stddev": [1000] + } +} +``` +then, there are `1 x 3 x 3 x 4 x 1 x 1 = 36` different hyperparameter +settings. Each hyperparameter setting is given a specific index.
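As a rough illustration of how an index maps to a concrete setting (including the modulo wrap-around described below), the sketch that follows decodes an index from the Cartesian product of the value lists. The helper name `decode_index` and the enumeration order are hypothetical; the repository's own code defines the actual ordering.

```python
# Hypothetical sketch only -- not the repository's actual implementation.
# Decodes a hyperparameter index into one concrete setting plus a run number.
import json
from itertools import product


def decode_index(agent_config_path, index):
    with open(agent_config_path) as f:
        config = json.load(f)

    names = list(config["parameters"].keys())
    value_lists = [config["parameters"][name] for name in names]

    combos = list(product(*value_lists))  # every hyperparameter combination
    total = len(combos)                   # 36 for the example config above
    setting = dict(zip(names, combos[index % total]))
    run = index // total                  # indices i, i + total, ... share the
                                          # setting but use different seeds
    return setting, run
```

With the example config above, `decode_index(path, 1)` and `decode_index(path, 37)` would return the same `setting` but different `run` values.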
For example +hyperparameter setting index `1` would have the following hyperparameters: +``` +{ + "agent_name": "linearAC", + "parameters": + { + "decay": 0.5, + "critic_lr": 0.005, + "actor_lr": 0.005, + "avg_reward_lr": 0.1, + "scaled": true, + "clip_stddev": 1000 + } +} +``` +The hyperparameter settings indices are actually implemented `mod x`, +where `x` is the maximum number of hyperparameter settings (in the example +above, `36`). So, in the example above, the hyperparameter settings with +indices `1, 37, 73, ...` all refer to the same hyperparameter setting since +`1 = 37 = 73 = ... mod 36`. The difference is that consecutive indices +use different seeds. So, each time we run experiments with hyperparameter +setting `1`, it will have the same seed. If we run with hyperparameter setting +`37`, it will use the same hyperparameter setting as `1`, but with a different +seed, and this seed will be the same every time we run the experiment with +hyperparameter setting `37`. This is the scheme Martha and her students +used in their Actor-Expert implementation, and it works +nicely for hyperparameter sweeps. + + +# Saved Data +Each experiment saves all the data as a Python dictionary. The dictionary is +designed so that we store all information about the experiment, including all +agent hyperparameters and environment settings, so that the experiment is +exactly reproducible. + +If the data dictionary is called `data`, then the main data for the experiment +is stored in `data["experiment_data"]`, which is a dictionary mapping from +hyperparameter settings indices to agent parameters and experiment runs. +`data["experiment_data"][i]["agent_params"]` is a dictionary storing the +agent's hyperparameters (hyperparameter settings index `i`) for the experiment. +`data["experiment_data"][i]["runs"]` is a list storing the runs for the +`i-th` hyperparameter setting. Each element of the list is a dictionary, giving +all the information for that run and hyperparameter setting. For example, +`data["experiment_data"][i]["runs"][j]` will give all the information on +the `j-th` run of hyperparameter setting `i`. + +Below is a tree diagram of the data structure: +``` +data +├─── "experiment" +│ ├─── "environment": environment configuration file +│ └─── "agent": agent configuration file +└─── "experiment_data": dictionary of hyperparameter setting *index* to runs +    ├─── "agent_params": the hyperparameter settings + └─── "runs": a list containing all the runs for this hyperparameter setting (each run is a dictionary of elements) + └─── index i: information on the ith run + ├─── "run_number": the run number + ├─── "random_seed": the random seed used for the run + ├─── "total_timesteps": the total number of timesteps in the run + ├─── "eval_interval_timesteps": the interval of timesteps to pass before running offline evaluation + ├─── "episodes_per_eval": the number of episodes run at each offline evaluation + ├─── "eval_episode_rewards": list of the returns (np.array) from each evaluation episode; if there are 10 episodes per eval, + │ then this will be a list of np.arrays where each np.array has 10 elements (one per eval episode) + ├─── "eval_episode_steps": the number of timesteps per evaluation episode, with the same form as "eval_episode_rewards" + ├─── "timesteps_at_eval": the number of training steps that passed at each evaluation. For example, if there were 10
For example, if there were 10 + │ offline evaluations, then this will be a list of 10 integers, each stating how many training steps passed before each + │ evaluation. + ├─── "train_episode_rewards": the return seen for each training episode + ├─── "train_episode_steps": the number of timesteps passed for each training episode + ├─── "train_time": the total amount of training time in seconds + ├─── "eval_time": the total amount of evaluation time in seconds + └─── "total_train_episodes": the total number of training episodes for the run +``` + +For example, here is `data["experiment_data"][i]["runs"][j]` for a mock run +of the Linear-Gaussian Actor-Critic agent on MountainCarContinuous-v0: +``` +{'random_seed': 0, + 'total_timesteps': 1000, + 'eval_interval_timesteps': 500, + 'episodes_per_eval': 10, + 'eval_episode_rewards': array([[-200., -200., -200., -200., -200., -200., -200., -200., -200., + -200.], + [-200., -200., -200., -200., -200., -200., -200., -200., -200., + -200.]]), + 'eval_episode_steps': array([[200, 200, 200, 200, 200, 200, 200, 200, 200, 200], + [200, 200, 200, 200, 200, 200, 200, 200, 200, 200]]), + 'timesteps_at_eval': array([ 0, 600]), + 'train_episode_steps': array([200, 200, 200, 200, 200]), + 'train_episode_rewards': array([-200., -200., -200., -200., -200.]), + 'train_time': 0.12098526954650879, + 'eval_time': 0.044415950775146484, + 'total_train_episodes': 5, + ...} +``` + +# Configuration files +Each configuration file is a JSON file and has a few properties. There +are also templates in each configuration directory for the files. + +## Environment Configuration File +``` +{ + "env_name": "environment filename without .json, all files refer to this as env_name", + "total_timesteps": "int - total timesteps for the entire run", + "steps_per_episode": "int - max number of steps per episode", + "eval_interval_timesteps": "int - interval of timesteps at which offline evaluation should be done", + "eval_episodes": "int - the number of offline episodes per evaluation", + "gamma": "float - the discount factor", +} +``` + +## Agent Configuration File +The agent configuration file is more general. The template is below. Since +both agents already have configuration files, there is not much need to add +any new configurations for agents. Instead, it would suffice to alter the +existing configuration files. The issue is that each agent has very different +configurations and hyperparameters, and so the config files are very different. +``` +{ + "agent_name": "filename without .json, all code refers to this as agent_name", + "parameters": + { + "parameter name": "list of values" + } +} +``` + ## OpenAI Gym Wrapper MinAtar now includes an OpenAI Gym plugin using the Gym plugin system. If a sufficiently recent version of OpenAI gym (`pip install gym==0.21.0` works) is installed, this plugin should be automatically available after installing MinAtar as normal. A gym environment can then be constructed as follows: ```bash diff --git a/agent/Random.py b/agent/Random.py new file mode 100644 index 0000000..0c1f751 --- /dev/null +++ b/agent/Random.py @@ -0,0 +1,83 @@ +#!/usr/bin/env python3 + +# Adapted from https://github.com/pranz24/pytorch-soft-actor-critic + +# Import modules +import torch +import numpy as np +from agent.baseAgent import BaseAgent + + +class Random(BaseAgent): + """ + Random implements a random policy. 
+ """ + def __init__(self, action_space, seed): + super().__init__() + self.batch = False + + self.action_dims = len(action_space.high) + self.action_low = action_space.low + self.action_high = action_space.high + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.policy = torch.distributions.Uniform( + torch.Tensor(action_space.low), torch.Tensor(action_space.high)) + + def sample_action(self, _): + """ + Samples an action from the agent + + Parameters + ---------- + _ : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + action = self.policy.sample() + + return action.detach().cpu().numpy() + + def sample_action_(self, _, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not affected by running + this function. + """ + return self.rng.uniform(self.action_low, self.action_high, + size=(size, self.action_dims)) + + def update(self, _, _1, _2, _3, _4): + pass + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + pass + + def train(self): + pass + + # Save model parameters + def save_model(self, _, _1="", _2=None, _3=None): + pass + + # Load model parameters + def load_model(self, _, _1): + pass + + def get_parameters(self): + pass diff --git a/agent/baseAgent.py b/agent/baseAgent.py new file mode 100644 index 0000000..3e7d78f --- /dev/null +++ b/agent/baseAgent.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python3 + +# Import modules +from abc import ABC, abstractmethod + +# TODO: Given a data dictionary generated by main, create a static +# function to initialize any agent based on this dict. Note that since the +# dict has the agent name, only one function is needed to create ANY agent +# we could also use the experiment util create_agent() function + + +class BaseAgent(ABC): + """ + Class BaseAgent implements the base functionality for all agents + + Attributes + ---------- + self.batch : bool + Whether or not the agent is using batch updates, by default False. + This is needed for the Experiment class to determine what to save + for update transitions. The Experiment class will save all transitions + used in updates, but if an agent performs batch updates and keeps + an experience replay buffer, then the Experiment object must + determine the transitions used in the update from the agent, and + not from the environment. If an agent is not using batch updates, it + is fully online and incremental, and so it must be using the + last environment transition for the update. + self.info : dict + A dictionary which records some chaning agent attributes during + training, if any. For example, this dictionary can be used to + keep track of the entropy in SAC during training. 
+ """ + def __init__(self): + """ + Constructor + """ + self.batch = False + self.info = {} + + """ + BaseAgent is the abstract base class for all agents + """ + @abstractmethod + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + pass + + @abstractmethod + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + + Return + ------ + 4-tuple of array_like + A tuple containing array_like, each of which contains the states, + actions, rewards, and next states used in the update + """ + pass + + @abstractmethod + def reset(self): + """ + Resets the agent between episodes + """ + pass + + @abstractmethod + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + pass + + @abstractmethod + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + pass + + @abstractmethod + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
+ + Returns + ------- + dict of str to int, float, array_like, and/or torch.Tensor + The agent's weights + """ + pass diff --git a/agent/linear/ESarsa.py b/agent/linear/ESarsa.py new file mode 100644 index 0000000..f18347e --- /dev/null +++ b/agent/linear/ESarsa.py @@ -0,0 +1,242 @@ +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from PyFixedReps import TileCoder +import time +import warnings +import inspect + + +class ESarsa(BaseAgent): + """ + Class Esarsa implements the Expected Sarsa(λ) algorithm + """ + def __init__(self, decay, lr, gamma, epsilon, + action_space, bins, num_tilings, env, seed=None, + trace_type="replacing", policy_type="εgreedy", + include_bias=True): + super().__init__() + self.batch = False + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self.random = np.random.default_rng(seed=int(seed)) + self.seed = seed + + # Needed so that when evaluating offline, we don't explore + self.is_training = True + + # Tile Coder + self.include_bias = include_bias + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self.tiler = TileCoder(params) + state_features = self.tiler.features() + self.include_bias + + # The weight parameters + self.actions = action_space.n + self.weights = np.zeros((self.actions, state_features)) + + # Set learning rates and other scaling factors + if decay < 0.0: + raise ValueError("cannot have trace decay rate < 0") + self.decay = decay + self.lr = lr / (num_tilings + self.include_bias) + self.gamma = gamma + self.epsilon = epsilon + print(self.lr) + + if policy_type not in ("εgreedy"): + raise ValueError("policy_type must be one of 'εgreedy'") + self.policy_type = policy_type + + # Eligibility traces + if trace_type not in ("accumulating", "replacing"): + raise ValueError("trace_type must be one of 'accumulating', " + + "'replacing'") + self.use_trace = decay > 0.0 + if self.use_trace: + self.trace = np.zeros_like(self.weights) + self.trace_type = trace_type + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + return self._sample_action(state) + + def _sample_action(self, state): + """ + Samples an action + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + # Take random action with probability ε and only if in training mode + if self.policy_type == "εgreedy" and self.epsilon != 0 and \ + self.is_training: + if self.random.uniform() < self.epsilon: + action = self.random.choice(self.actions) + return action + + state = self._tiler_indices(state) + action_vals = self.weights[:, state].sum(axis=1) + + if self.policy_type == "εgreedy": + # Choose maximum action + max_actions = np.where(action_vals == np.max(action_vals))[0] + if len(max_actions) > 1: + return max_actions[self.random.choice(len(max_actions))] + else: + return max_actions[0] + + else: + raise ValueError(f"unknown policy type {self.policy_type}") + + def _get_probs(self, state): + """ + Gets the probability of taking each action in 
state `state` + + Parameters + ---------- + state : np.array + The state observation, not tile-coded + + Returns + ------- + np.array[float] + The probabilities of taking each action in state `state` + """ + state = self._tiler_indices(state) + if self.policy_type == "εgreedy": + probs = np.zeros(self.actions) + probs += self.epsilon / self.actions + + action_vals = self.weights[:, state].sum(axis=1) + max_actions = np.where(action_vals == np.max(action_vals))[0] + probs[max_actions] += (1 - self.epsilon) / len(max_actions) + else: + raise ValueError(f"unknown policy type {self.policy_type}") + + return probs + + def _tiler_indices(self, state): + if self.include_bias: + return np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + + return self.tiler.get_indices(state) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + state = self._tiler_indices(state) + + δ = reward + δ -= self.weights[action, state].sum() + + # Update the trace + if self.use_trace: + if self.trace_type == "accumulating": + self.trace[action, state] += 1 + elif self.trace_type == "replacing": + self.trace[action, state] = 1 + else: + raise ValueError(f"unknown trace type {self.trace_type}") + + # Adjust δ if we are in an intra-episode timestep + episode_done = not done_mask + if not episode_done: + probs = self._get_probs(next_state) + next_state = self._tiler_indices(next_state) + + next_q = self.gamma * self.weights[:, next_state].sum(axis=1) + 𝔼_next_q = probs @ next_q + δ += 𝔼_next_q + + # Update the weights + if self.use_trace: + self.weights += (self.lr * δ * self.trace) + + # Decay the trace + self.trace *= (self.decay * self.gamma) + else: + self.weights[action, state] += (self.lr * δ) + + return + + def reset(self): + """ + Resets the agent between episodes + """ + self.trace = np.zeros_like(self.weights) + self.first_call = True + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + def get_parameters(self): + pass diff --git a/agent/linear/GaussianAC.py b/agent/linear/GaussianAC.py new file mode 100644 index 0000000..7f67652 --- /dev/null +++ b/agent/linear/GaussianAC.py @@ -0,0 +1,422 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from time import time +from PyFixedReps import TileCoder +from env.Bimodal import Bimodal1DEnv + + +class GaussianAC(BaseAgent): + """ + Class GaussianAC implements Linear-Gaussian Actor-Critic with eligibility + trace, as outlined in 
"Model-Free Reinforcement Learning with Continuous + Action in Practice", which can be found at: + + https://hal.inria.fr/hal-00764281/document + + The major difference is that this algorithm uses the discounted setting + instead of the average reward setting as used in the above paper. This + linear actor critic support multi-dimensional actions as well. + """ + def __init__(self, decay, actor_lr_scale, critic_lr, + gamma, accumulate_trace, action_space, bins, num_tilings, + env, use_critic_trace, use_actor_trace, scaled=False, + clip_stddev=1000, seed=None, trace_type="replacing"): + """ + Constructor + + Parameters + ---------- + decay : float + The eligibility decay rate, lambda + actor_lr : float + The learning rate for the actor + critic_lr : float + The learning rate for the critic + state_features : int + The size of the state feature vectors + gamma : float + The environmental discount factor + accumulate_trace : bool + Whether or not to accumulate the eligibility traces or not, which + may be desirable if the task is continuing. If it is, then the + eligibility trace vectors will be accumulated and not reset between + "episodes" when calling the reset() method. + scaled : bool, optional + Whether the actor learning rate should be scaled by sigma^2 for + learning stability, by default False + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + seed : int + The seed to use for the normal distribution sampler, by default + None. If set to None, uses the integer value of the Unix time. + """ + super().__init__() + self.batch = False + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self.random = np.random.default_rng(seed=int(seed)) + self.seed = seed + + # Save whether or not the task is continuing + self.accumulate_trace = accumulate_trace + + # Needed so that when evaluating offline, we don't explore + self.is_training = True + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_threshold = np.log(clip_stddev) + + # Tile Coder + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self.tiler = TileCoder(params) + state_features = self.tiler.features() + 1 + + # The weight parameters + self.action_dims = action_space.high.shape[0] + self.sigma_weights = np.zeros((self.action_dims, state_features)) + self.mu_weights = np.zeros((self.action_dims, state_features)) + self.actor_weights = np.zeros(state_features * 2) + self.critic_weights = np.zeros(state_features) + + # Set learning rates and other scaling factors + self.scaled = scaled + self.decay = decay + self.critic_lr = critic_lr / (num_tilings + 1) + self.actor_lr = actor_lr_scale * self.critic_lr + self.gamma = gamma + + # Eligibility traces + self.use_actor_trace = use_actor_trace + if trace_type not in ("replacing", "accumulating"): + raise ValueError("trace_type must be one of 'accumulating', " + + "'replacing'") + self.trace_type = trace_type + + if self.use_actor_trace: + self.mu_trace = np.zeros_like(self.mu_weights) + self.sigma_trace = np.zeros_like(self.sigma_weights) + + self.use_critic_trace = use_critic_trace + if self.use_critic_trace: + self.critic_trace = np.zeros(state_features) + + if isinstance(env.env, 
Bimodal1DEnv): + self.info = { + "actor": {"mean": [], "stddev": []}, + } + self.store_dist = True + else: + self.store_dist = False + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def get_mean(self, state): + """ + Gets the mean of the parameterized normal distribution + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the one-hot encoded state + feature vector + + Returns + ------- + float + The mean of the normal distribution + """ + return self.mu_weights[:, state].sum(axis=1) + + def get_stddev(self, state): + """ + Gets the standard deviation of the parameterized normal distribution + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the one-hot encoded state + feature vector + + Returns + ------- + float + The standard deviation of the normal distribution + """ + # Return un-clipped standard deviation if no clipping + if not self.clip_stddev: + return np.exp(self.sigma_weights[:, state].sum(axis=1)) + + # Clip the standard deviation to prevent numerical overflow + log_std = np.clip(self.sigma_weights[:, state].sum(axis=1), + -self.clip_threshold, self.clip_threshold) + return np.exp(log_std) + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The observation, not tile coded + + Returns + ------- + np.array of float + The action to take + """ + # state = np.concatenate( + # [ + # np.ones((1,), dtype=np.int32), + # self.tiler.encode(state), + # ] + # ) + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + mean = self.get_mean(state) + + # If in offline evaluation mode, return the mean action + if not self.is_training: + return np.array(mean) + + stddev = self.get_stddev(state) + + # Sample action from a normal distribution + action = self.random.normal(loc=mean, scale=stddev) + return action + + def get_actor_grad(self, state, action): + """ + Gets the gradient of the actor's parameters + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the one-hot encoded state + feature vector + action : np.array of float + The action taken + + Returns + ------- + np.array + The gradient vector of the actor's weights, in the form + [grad_mu_weights^T, grad_sigma_weights^T]^T + """ + std = self.get_stddev(state) + mean = self.get_mean(state) + + grad_mu = np.zeros_like(self.mu_weights) + grad_sigma = np.zeros_like(self.sigma_weights) + + if action.shape[0] != 1: + # Repeat state along rows to match number of action dims + n = action.shape[0] + state = np.expand_dims(state, 0) + state = state.repeat(n, axis=0) + + scale_mu = (1 / (std ** 2)) * (action - mean) + scale_sigma = ((((action - mean) / std)**2) - 1) + + # Reshape scales so we can use broadcasted multiplication + scale_mu = np.expand_dims(scale_mu, axis=1) + scale_sigma = np.expand_dims(scale_sigma, axis=1) + + # grad_mu = scale_mu * state + # grad_sigma = scale_sigma * state + + else: + scale_mu = (1 / (std ** 2)) * (action - mean) + scale_sigma = ((((action - mean) / std)**2) - 1) + + grad_mu[:, state] = scale_mu + grad_sigma[:, state] = scale_sigma + + return grad_mu, grad_sigma + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector, not tile coded + action : np.array of float or array_like of 
np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action, not tile coded + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + # state = np.concatenate( + # [ + # np.ones((1,), dtype=np.int32), + # self.tiler.encode(state) + # ] + # ) + # next_state = np.concatenate( + # [ + # np.ones((1,), dtype=np.int32), + # self.tiler.encode(next_state) + # ] + # ) + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + next_state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(next_state) + 1, + ] + ) + + # Calculate TD error + v = self.critic_weights[state].sum() + next_v = self.critic_weights[next_state].sum() + target = reward + self.gamma * next_v * done_mask + delta = target - v + + # Critic update + if self.use_critic_trace: + # Update critic eligibility trace + self.critic_trace *= (self.gamma * self.decay) + # self.critic_trace = (self.gamma * self.decay * + # self.critic_trace) + state + if self.trace_type == "accumulating": + self.critic_trace[state] += 1 + elif self.trace_type == "replacing": + self.critic_trace[state] = 1 + else: + raise ValueError("unkown trace type {self.trace_type}") + # Update critic + self.critic_weights += (self.critic_lr * delta * self.critic_trace) + else: + grad = np.zeros_like(self.critic_weights) + grad[state] = 1 + self.critic_weights += (self.critic_lr * delta * grad) + + # Actor update + mu_grad, sigma_grad = self.get_actor_grad(state, action) + if self.use_actor_trace: + # Update actor eligibility traces + self.mu_trace *= (self.gamma * self.decay) + self.sigma_trace *= (self.gamma * self.decay) + if self.trace_type == "accumulating": + self.mu_trace[:, state] += mu_grad + self.sigma_trace[:, state] += sigma_grad + else: + self.mu_trace[:, state] = mu_grad[:, state] + self.sigma_trace[:, state] = sigma_grad[:, state] + + # Update actor weights + lr = self.actor_lr + lr *= 1 if not self.scaled else (self.get_stddev(state) ** 2) + self.mu_weights += (lr * delta * self.mu_trace) + self.sigma_weights += (lr * delta * self.sigma_trace) + else: + lr = self.actor_lr + lr *= 1 if not self.scaled else (self.get_stddev(state) ** 2) + self.mu_weights += (lr * delta * mu_grad) + self.sigma_trace = (lr * delta * sigma_grad) + + # In order to be consistent across all children of BaseAgent, we + # return all transitions with the shape B x N, where N is the number + # of state, action, or reward dimensions and B is the batch size = 1 + reward = np.array([reward]) + + return np.expand_dims(state, axis=0), np.expand_dims(action, axis=0), \ + np.expand_dims(reward, axis=0), np.expand_dims(next_state, axis=0) + + def reset(self): + """ + Resets the agent between episodes + """ + if self.accumulate_trace: + return + if self.use_actor_trace: + self.mu_trace = np.zeros_like(self.mu_trace) + self.sigma_trace = np.zeros_like(self.sigma_trace) + if self.use_critic_trace: + self.critic_trace = np.zeros_like(self.critic_trace) + + def eval(self): + """ + Sets the agent into offline 
evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the GaussianAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to array_like + The agent's weights + """ + pass + + +if __name__ == "__main__": + a = GaussianAC(0.9, 0.1, 0.1, 0.5, 3, False) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 1]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([1, 2, 2]), 0.9) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 2]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([3, 1, 2]), 0.9) + print(a.actor_weights, a.critic_weights) diff --git a/agent/linear/Sarsa.py b/agent/linear/Sarsa.py new file mode 100644 index 0000000..649a6eb --- /dev/null +++ b/agent/linear/Sarsa.py @@ -0,0 +1,280 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from PyFixedReps import TileCoder +import time +import warnings +import inspect + + +class Sarsa(BaseAgent): + def __init__(self, decay, lr, gamma, epsilon, + action_space, bins, num_tilings, env, seed=None, + trace_type="replacing", policy_type="εgreedy", + include_bias=True): + super().__init__() + self.batch = False + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self.random = np.random.default_rng(seed=int(seed)) + self.seed = seed + + # Needed so that when evaluating offline, we don't explore + self.is_training = True + + # Tile Coder + self.include_bias = include_bias + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self.tiler = TileCoder(params) + state_features = self.tiler.features() + self.include_bias + + # The weight parameters + self.actions = action_space.n + self.weights = np.zeros((self.actions, state_features)) + + # Set learning rates and other scaling factors + if decay < 0.0: + raise ValueError("cannot have trace decay rate < 0") + self.decay = decay + self.lr = lr / (num_tilings + self.include_bias) + self.gamma = gamma + self.epsilon = epsilon + print(self.lr) + + if policy_type not in ("εgreedy", "softmax"): + raise ValueError("policy_type must be one of 'εgreedy', " + + "'softmax'") + self.policy_type = policy_type + + # Eligibility traces + if trace_type not in ("accumulating", "replacing"): + raise ValueError("trace_type must be one of 'accumulating', " + + "'replacing'") + self.use_trace = self.decay > 0.0 + if self.use_trace: + self.trace = np.zeros_like(self.weights) + self.trace_type = trace_type + + # Keep track of the states and actions 
used in the SARSA update for + # error checking + self.sarsa_state = None + self.sarsa_action = None + self.first_call = True + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + if self.first_call: + self.first_call = False + return self._sample_action(state) + if np.any(state != self.sarsa_state) and self.is_training: + warnings.warn("Warning: input state was not used as " + + "the next state in SARSA update to select the" + + "next action. Sampling a new action.") + return self._sample_action(state) + else: + return self.sarsa_action + + def _sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + int + The action to take + """ + state = self._tiler_indices(state) + action_vals = self.weights[:, state].sum(axis=1) + + if self.policy_type == "εgreedy": + return self._sample_epsilon_greedy(action_vals) + elif self.policy_type == "softmax": + return self._sample_softmax(action_vals) + else: + raise ValueError(f"unknown policy type {self.policy_type}") + + def _sample_epsilon_greedy(self, action_vals): + if self.epsilon != 0 and self.random.uniform() < self.epsilon: + return self.random.choice(self.actions) + else: + # Choose maximum action + max_actions = np.where(action_vals == np.max(action_vals))[0] + if len(max_actions) > 1: + return max_actions[self.random.choice(len(max_actions))] + else: + return max_actions[0] + + def _sample_softmax(self, action_vals): + action_vals = action_vals - np.max(action_vals) + if self.epsilon != 0: + # If epsilon is non-zero, use it to determine the stochasticity + # of the policy as the temperature parameter + action_vals /= self.epsilon + probs = np.exp(action_vals) + probs /= np.sum(probs) + return np.random.choice(self.actions, p=probs) + else: + # If epsilon is zero, then we are acting greedily + max_actions = np.where(action_vals == np.max(action_vals))[0] + if len(max_actions) > 1: + return max_actions[self.random.choice(len(max_actions))] + else: + return max_actions[0] + + def _tiler_indices(self, state): + """ + Returns the tile coded representation of state + + Parameters + ---------- + state : np.array + The state observation to tile code + + Returns + ------- + np.array + The tile coded representation of the input state + """ + if self.include_bias: + return np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + + return self.tiler.get_indices(state) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + state = self._tiler_indices(state) + + δ = reward + δ -= self.weights[action, state].sum() + + # Update the trace + if self.use_trace: + if self.trace_type == "accumulating": + self.trace[action, state] += 1 + elif self.trace_type == "replacing": + self.trace[action, state] = 1 + else: + raise ValueError(f"unknown trace type {self.trace_type}") + + # Adjust δ if we are in an intra-episode timestep + episode_done = not done_mask + if not episode_done: + self.sarsa_action = next_action = self._sample_action(next_state) + self.sarsa_state = next_state + + next_state = self._tiler_indices(next_state) + + δ += (self.gamma * self.weights[next_action, next_state].sum()) + + # Update the weights + if self.use_trace: + self.weights += (self.lr * δ * self.trace) + + # Decay the trace + self.trace *= (self.decay * self.gamma) + else: + self.weights[action, state] += (self.lr * δ) + + return + + def reset(self): + """ + Resets the agent between episodes + """ + self.trace = np.zeros_like(self.weights) + self.first_call = True + self.sarasa_action = self.sarsa_state = None + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the ESarsa class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to array_like + The agent's weights + """ + pass diff --git a/agent/linear/SoftmaxAC.py b/agent/linear/SoftmaxAC.py new file mode 100644 index 0000000..4be00a9 --- /dev/null +++ b/agent/linear/SoftmaxAC.py @@ -0,0 +1,343 @@ +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from PyFixedReps import TileCoder +import time +from scipy import special +import inspect + + +class SoftmaxAC(BaseAgent): + """ + Class SoftmaxAC implements a Linear-Softmax Actor-Critic with eligibility + traces. The algorithm works in the discounted setting, rather than in the + average reward setting and is similar to the algorithm outlined in the + Policy Gradient chapter in the RL Book. 
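+    When the temperature is zero, the policy is greedy with respect to the action preferences (ties broken uniformly at random) rather than a softmax over them.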
+ """ + def __init__(self, decay, actor_lr, critic_lr, gamma, + accumulate_trace, action_space, bins, num_tilings, env, + use_critic_trace, use_actor_trace, temperature, seed=None, + trace_type="replacing"): + """ + Constructor + + Parameters + ---------- + decay : float + The eligibility decay rate, lambda + actor_lr : float + The learning rate for the actor + critic_lr : float + The learning rate for the critic + state_features : int + The size of the state feature vectors + gamma : float + The environmental discount factor + accumulate_trace : bool + Whether or not to accumulate the eligibility traces or not, which + may be desirable if the task is continuing. If it is, then the + eligibility trace vectors will be accumulated and not reset between + "episodes" when calling the reset() method. + scaled : bool, optional + Whether the actor learning rate should be scaled by sigma^2 for + learning stability, by default False + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + seed : int + The seed to use for the normal distribution sampler, by default + None. If set to None, uses the integer value of the Unix time. + """ + super().__init__() + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self._random = np.random.default_rng(seed=int(seed)) + self._seed = seed + + # Needed so that when evaluating offline, we don't explore + self._is_training = True + + # Tile Coder + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self._tiler = TileCoder(params) + state_features = self._tiler.features() + 1 + + # The weight parameters + self._action_n = action_space.n + self._avail_actions = np.array(range(self._action_n)) + self._size = state_features + self._actor_weights = np.zeros((self._action_n, state_features)) + self._critic_weights = np.zeros(state_features) # State value critic + + # Set learning rates and other scaling factors + self._critic_α = critic_lr / (num_tilings + 1) + self._actor_α = actor_lr / (num_tilings + 1) + self._γ = gamma + if temperature < 0: + raise ValueError("cannot use temperature < 0") + self._τ = temperature + + # Eligibility traces + if trace_type not in ("accumulating", "replacing"): + raise ValueError("trace_type must be one of accumulating', " + + "'replacing'") + if decay < 0: + raise ValueError("cannot use decay < 0") + elif decay >= 1: + raise ValueError("cannot use decay >= 1") + elif decay == 0: + use_actor_trace = use_critic_trace = False + else: + self._λ = decay + + self._trace_type = trace_type + self.use_actor_trace = use_actor_trace + if self.use_actor_trace: + self._actor_trace = np.zeros((self._action_n, state_features)) + self.use_critic_trace = use_critic_trace + if self.use_critic_trace: + self._critic_trace = np.zeros(state_features) + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def _get_logits(self, state): + """ + Gets the logits of the policy in state + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the tile coded state + representation + + Returns + ------- + np.array of float + The logits of each action + """ + if self._τ == 0: + raise ValueError("cannot compute logits when τ = 
0") + + logits = self._actor_weights[:, state].sum(axis=1) + logits -= np.max(logits) # For numerical stability + return logits / self._τ + + def _get_probs(self, state_ind): + if self._τ == 0: + q_values = self._actor_weights[:, state_ind].sum(axis=-1) + + max_value = np.max(q_values) + max_actions = np.where(q_values == max_value)[0] + + probs = np.zeros(self._action_n) + probs[max_actions] = 1 / len(max_actions) + return probs + + logits = self._get_logits(state_ind) + logits -= logits.max() # Subtract max because SciPy breaks things + pi = special.softmax(logits) + return pi + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self._tiler.get_indices(state) + 1, + ] + ) + probs = self._get_probs(state) + + # If in offline evaluation mode, return the action of maximum + # probability + if not self._is_training: + actions = np.where(probs == np.max(probs))[0] + if len(actions) == 1: + return actions[0] + else: + return self._random.choice(actions) + + return self._random.choice(self._action_n, p=probs) + + def _actor_grad(self, state, action): + """ + Returns the gradient of the actor's performance in `state` + evaluated at the action `action` + + Parameters + ---------- + state : np.ndarray + The state observation, not tile coded + action : int + The action to evaluate the gradient on + """ + π = self._get_probs(state) + π = np.reshape(π, (self._actor_weights.shape[0], 1)) + features = np.zeros_like(self._actor_weights) + features[action, state] = 1 + + grad = features + grad[:, state] -= π + return grad + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self._tiler.get_indices(state) + 1, + ] + ) + next_state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self._tiler.get_indices(next_state) + 1, + ] + ) + + # Calculate TD error + target = reward + done_mask * self._γ * \ + self._critic_weights[next_state].sum() + estimate = self._critic_weights[state].sum() + delta = target - estimate + + # Critic update + if self.use_critic_trace: + # Update critic eligibility trace + self._critic_trace *= (self._γ * self._λ) + if self._trace_type == "accumulating": + self._critic_trace[state] += 1 + elif self._trace_type == "replacing": + self._critic_trace[state] = 1 + else: + raise ValueError(f"unknown trace type {self._trace_type}") + + # Update critic + self._critic_weights += (self._critic_α * delta * + self._critic_trace) + else: + grad = np.zeros_like(self._critic_weights) + grad[state] = 1 + self._critic_weights += (self._critic_α * delta * grad) + + # Actor update + actor_grad = self._actor_grad(state, action) + if self.use_actor_trace: + # Update actor eligibility traces + self._actor_trace *= (self._γ * self._λ) + self._actor_trace += actor_grad + + # Update actor weights + self._actor_weights += (self._actor_α * delta * self._actor_trace) + else: + self._actor_weights += (self._actor_α * delta * actor_grad) + + # In order to be consistent across all children of BaseAgent, we + # return all transitions with the shape B x N, where N is the number + # of state, action, or reward dimensions and B is the batch size = 1 + reward = np.array([reward]) + return np.expand_dims(state, axis=0), np.expand_dims(action, axis=0), \ + np.expand_dims(reward, axis=0), np.expand_dims(next_state, axis=0) + + def reset(self): + """ + Resets the agent between episodes + """ + if self.use_actor_trace: + self._actor_trace = np.zeros_like(self._actor_trace) + if self.use_critic_trace: + self._critic_trace = np.zeros(self._size) + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self._is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self._is_training = True + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the SoftmaxAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
+ + Returns + ------- + dict of str to array_like + The agent's weights + """ + pass + + +if __name__ == "__main__": + a = SoftmaxAC(0.9, 0.1, 0.1, 0.5, 3, False) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 1]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([1, 2, 2]), 0.9) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 2]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([3, 1, 2]), 0.9) + print(a.actor_weights, a.critic_weights) diff --git a/agent/nonlinear/FKL.py b/agent/nonlinear/FKL.py new file mode 100644 index 0000000..ce3c6e1 --- /dev/null +++ b/agent/nonlinear/FKL.py @@ -0,0 +1,399 @@ +#!/usr/bin/env python3 + +# Import modules +import torch +import time +from gym.spaces import Box, Discrete +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy_utils import GaussianPolicy, SoftmaxPolicy +from agent.nonlinear.value_function_utils import QNetwork +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class FKL(BaseAgent): + """ + Class FKL implements a vanilla-style actor-critic algorithm, minimizing + the FKL between the learned policy and the Boltzmann distribution over + action values. This is in contrast to "regular" actor-critics (such as SAC + and VAC in this codebase) which minimize an RKL between these values. + + This algorithm also learns a soft action value function, where the entropy + regularization is determined by `alpha`. + + FKL works only with continuous action spaces and uses MLP function + approximators. + + See https://arxiv.org/abs/2107.08285 for more information on this + algorithm. This implementation is the same as the FKL implementation from + this paper. + """ + def __init__(self, num_inputs, action_space, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, + num_samples, actor_hidden_dim, critic_hidden_dim, + replay_capacity, seed, batch_size, betas, env, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + cuda : bool, optional + Whether or not cuda should be used for training, by default False. 
+ Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.discrete_action = isinstance(action_space, Discrete) + self.state_dims = num_inputs + self.num_samples = num_samples + assert num_samples >= 2 + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + if isinstance(action_space, Box): + self.action_dims = len(action_space.high) + + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + action_space.shape[0], self.device) + elif isinstance(action_space, Discrete): + self.action_dims = 1 + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + # Create the critic Q function + if isinstance(action_space, Box): + action_shape = action_space.shape[0] + elif isinstance(action_space, Discrete): + action_shape = 1 + + self.critic = QNetwork(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = QNetwork(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + if self.policy_type == "gaussian": + + self.policy = GaussianPolicy(num_inputs, action_space.shape[0], + actor_hidden_dim, activation, + action_space, clip_stddev, init).to( + self.device) + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + _, _, action = self.policy.sample(state) + + act = action.detach().cpu().numpy()[0] + return act + + def sample_action_(self, state, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not 
affected by running + this function. + """ + if len(state.shape) > 1 or state.shape[0] > 1: + raise ValueError("sample_action_ takes a single state") + with torch.no_grad(): + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + mean, log_std = self.policy.forward(state) + + if not self.is_training: + return mean.detach().cpu().numpy()[0] + + mean = mean.detach().cpu().numpy()[0] + std = np.exp(log_std.detach().cpu().numpy()[0]) + return self.rng.normal(mean, std, size=size) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + if self.discrete_action: + action = np.array([action]) + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, next_state_log_pi, _ = \ + self.policy.sample(next_state_batch) + q_next = self.critic_target(next_state_batch, next_state_action) + q_next -= (self.alpha * next_state_log_pi) + + q_target = reward_batch + mask_batch * self.gamma * q_next + + q_prediction = self.critic(state_batch, action_batch) + + # Calculate the losses on each critic + q_loss = F.mse_loss(q_prediction, q_target) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + sampled_actions, logprob, _ = self.policy.sample(state_batch, + self.num_samples) + if self.num_samples == 1: + raise ValueError("num_samples should be greater than 1") + sampled_actions = torch.permute(sampled_actions, (1, 0, 2)) + + # Calculate the importance sampling ratio + sampled_actions = torch.reshape(sampled_actions, + [-1, self.action_dims]) + stacked_s_batch = torch.repeat_interleave(state_batch, + self.num_samples, + dim=0) + stacked_s_batch = torch.reshape(stacked_s_batch, + [-1, self.state_dims]) + + # Calculate the weighted importance sampling ratio + # Right now, we follow the FKL/RKL paper equation (13) to compute the + # weighted importance sampling ratio, where: + # + # ρ_i = BQ(a_i | s) / π_θ(a_i | s) ∝ exp(Q(s, a_i)τ⁻¹) / π(a_i | s) + # ρ̂_i = ρ_i / ∑(ρ_j) + # + # We could compute a more numerically stable weighted importance + # sampling ratio if needed (but the implementation is very + # complicated): + # + # ρ̂ = π(a_i | s) [∑_{i≠j} ([h(s, a_j)/h(s, a_i)] * π(a_j | s)⁻¹) + 1] + # h(s, a_j, a_i) = exp[(Q(s, a_j) - M)τ⁻¹] / exp[(Q(s, a_i) - M)τ⁻¹] + # M = M(a_j, a_i) = max(Q(s, a_j), Q(s, a_i)) + with torch.no_grad(): + IS_q_values = self.critic(stacked_s_batch, + sampled_actions) + IS_q_values = torch.reshape(IS_q_values, [self.batch_size, + self.num_samples]) + + IS = 
IS_q_values / self.alpha + IS_max = torch.amax(IS, dim=1).unsqueeze(dim=-1) + IS -= IS_max + IS = IS.exp() + Z = torch.sum(IS, dim=1).unsqueeze(-1) + IS /= Z + prob = logprob.exp().squeeze(dim=-1).T + IS /= prob + + weight = torch.sum(IS, dim=1).unsqueeze(dim=-1) + WIS = IS / weight + + # Calculate the policy loss + logprob = logprob.squeeze() + policy_loss = WIS * logprob.T + policy_loss = -policy_loss.mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Update target network + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass diff --git a/agent/nonlinear/SAC.py b/agent/nonlinear/SAC.py new file mode 100644 index 0000000..e24c645 --- /dev/null +++ b/agent/nonlinear/SAC.py @@ -0,0 +1,542 @@ +# Import modules +import os +import torch +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy.MLP import SquashedGaussian +from agent.nonlinear.value_function.MLP import DoubleQ, Q +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class SAC(BaseAgent): + """ + SAC implements the Soft Actor-Critic agent found in the paper + https://arxiv.org/pdf/1812.05905.pdf. 
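+
+    This implementation optionally supports reparameterized or
+    likelihood-ratio policy gradients, hard or soft (entropy-regularized)
+    action values, and single or double Q critics, controlled by the
+    constructor flags reparameterized, soft_q, and double_q.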
+ + SAC works only with continuous action spaces and uses MLP function + approximators. + """ + def __init__(self, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, alpha_lr, + actor_hidden_dim, critic_hidden_dim, replay_capacity, seed, + batch_size, betas, env, reparameterized=True, soft_q=True, + double_q=True, automatic_entropy_tuning=False, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + alpha_lr : float + The learning rate for the entropy parameter, if using an automatic + entropy tuning algorithm (see automatic_entropy_tuning) parameter + below + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + automatic_entropy_tuning : bool, optional + Whether the agent should automatically tune its entropy + hyperparmeter alpha, by default False + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + soft_q : bool + Whether or not to learn soft Q functions, by default True. The + original SAC uses soft Q functions since we learn an + entropy-regularized policy. When learning an entropy regularized + policy, guaranteed policy improvement (in the ideal case) only + exists with respect to soft action values. + reparameterized: bool + Whether to use the reparameterization trick to learn the policy or + to use the log-likelihood trick. The original SAC uses the + reparameterization trick. + double_q : bool + Whether or not to use two Q value functions or not. The original + SAC uses two Q value functions. 
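+        betas : tuple of float
+            The (beta1, beta2) coefficients passed to the Adam optimizers
+            of both the actor and the critic
+        env : gym.Environment
+            The environment to train on; used here only to determine the
+            observation and action spaces
+        activation : str
+            The activation function to use in the actor and critic
+            networks, by default "relu"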
+ + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + action_space = env.action_space + obs_space = env.observation_space + obs_dim = obs_space.shape + # Ensure we are working with vector observations + if len(obs_dim) != 1: + raise ValueError( + f"""SAC works only with vector observations, but got + observation with shape {obs_dim}.""" + ) + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self._torch_rng = torch.manual_seed(seed) + self._rng = np.random.default_rng(seed) + + # Random hypers and fields + self._is_training = True # Whether in training or evaluation mode + self._gamma = gamma # Discount factor + self._tau = tau # Polyak averaging constant for target networks + self._alpha = alpha # Entropy scale + self._reparameterized = reparameterized # Whether to use reparam trick + self._soft_q = soft_q # Whether to use soft Q functions or nor + self._double_q = double_q # Whether or not to use a double Q critic + + self._device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + # Experience replay buffer + self._batch_size = batch_size + self._replay = ExperienceReplay(replay_capacity, seed, obs_space.shape, + action_space.shape[0], self._device) + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self._target_update_interval = target_update_interval + self._update_number = 0 + + # Automatic entropy tuning + self._automatic_entropy_tuning = automatic_entropy_tuning + assert not self._automatic_entropy_tuning + + # Set up the critic and target critic + self._init_critics( + obs_space, + action_space, + critic_hidden_dim, + init, + activation, + critic_lr, + betas, + ) + + # Set up the policy + self._policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + self._init_policy( + obs_space, + action_space, + actor_hidden_dim, + init, + activation, + actor_lr, + betas, + clip_stddev, + ) + + # Set up auto entropy tuning + if self._automatic_entropy_tuning is True: + self._target_entropy = -torch.prod( + torch.Tensor(action_space.shape).to(self._device) + ).item() + self._log_alpha = torch.zeros( + 1, + requires_grad=True, + device=self._device, + ) + self._alpha_optim = Adam([self._log_alpha], lr=alpha_lr) + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self._device).unsqueeze(0) + if self._is_training: + action, _, _, _ = self._policy.rsample(state) + else: + _, _, action, _ = self._policy.rsample(state) + + return action.detach().cpu().numpy()[0] # size (1, action_dims) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + 
next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + # Keep transition in replay buffer + self._replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self._replay.sample(batch_size=self._batch_size) + + self._update_critic(state_batch, action_batch, reward_batch, + next_state_batch, mask_batch) + + self._update_actor(state_batch, action_batch, reward_batch, + next_state_batch, mask_batch) + + def _update_actor(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the actor given a batch of transitions sampled from a replay + buffer. + """ + # Calculate the actor loss + if self._reparameterized: + # Reparameterization trick + pi, log_pi, _, _ = self._policy.rsample(state_batch) + q = self._get_q(state_batch, pi) + + policy_loss = ((self._alpha * log_pi) - q).mean() + + else: + # Log likelihood trick + with torch.no_grad(): + # Context manager ensures that we don't backprop through the q + # function when minimizing the policy loss + pi, log_pi, _, x_t = self._policy.sample(state_batch) + q = self._get_q(state_batch, pi) + + # Compute the policy loss, grad_log_pi will be the only + # differentiated value + grad_log_pi = self._policy.log_prob(state_batch, x_t) + policy_loss = grad_log_pi * (self._alpha * log_pi - q) + policy_loss = policy_loss.mean() + + # Update the actor + self._policy_optim.zero_grad() + policy_loss.backward() + self._policy_optim.step() + + # Tune the entropy if appropriate + if self._automatic_entropy_tuning: + alpha_loss = -(self._log_alpha * + (log_pi + self._target_entropy).detach()).mean() + + self._alpha_optim.zero_grad() + alpha_loss.backward() + self._alpha_optim.step() + + self._alpha = self._log_alpha.exp() + + def _update_critic(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the critic(s) given a batch of transitions sampled from a replay + buffer. + """ + if self._double_q: + self._update_double_critic( + state_batch, + action_batch, + reward_batch, + next_state_batch, + mask_batch, + ) + else: + self._update_single_critic( + state_batch, + action_batch, + reward_batch, + next_state_batch, + mask_batch, + ) + + def _update_single_critic(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the critic using a batch of transitions when using a single Q + critic. 
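+
+        The target is a SARSA-style bootstrap from the target critic,
+        with an entropy bonus subtracted when soft Q functions are used.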
+ """ + if self._double_q: + raise ValueError("cannot call _update_single_critic when using " + + "a double Q critic") + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + # Sample an action in the next state for the SARSA update + next_state_action, next_state_log_pi, _, _ = \ + self._policy.sample(next_state_batch) + + # Calculate the Q value of the next action in the next state + q_next = self._critic_target(next_state_batch, next_state_action) + if self._soft_q: + q_next -= self._alpha * next_state_log_pi + + # Calculate the target for the SARSA update + q_target = reward_batch + mask_batch * self._gamma * q_next + + # Calculate the Q value of each action in each respective state + q = self._critic(state_batch, action_batch) + + # Calculate the loss between the target and estimate Q values + q_loss = F.mse_loss(q, q_target) + + # Update the critic + self._critic_optim.zero_grad() + q_loss.backward() + self._critic_optim.step() + + # Increment the running total of updates and update the critic target + # if needed + self._update_number += 1 + if self._update_number % self._target_update_interval == 0: + self._update_number = 0 + nn_utils.soft_update(self._critic_target, self._critic, self._tau) + + def _update_double_critic(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the critic using a batch of transitions when using a double Q + critic. + """ + + if not self._double_q: + raise ValueError("cannot call _update_single_critic when using " + + "a double Q critic") + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + # Sample an action in the next state for the SARSA update + next_state_action, next_state_log_pi, _, _ = \ + self._policy.sample(next_state_batch) + + # Calculate the action values for the next state + next_q1, next_q2 = self._critic_target(next_state_batch, + next_state_action) + + # Double Q: target uses the minimum of the two computed action + # values + min_next_q = torch.min(next_q1, next_q2) + + # If using soft action value functions, then adjust the target + if self._soft_q: + min_next_q -= self._alpha * next_state_log_pi + + # Calculate the target for the action value function update + q_target = reward_batch + mask_batch * self._gamma * min_next_q + + # Calculate the two Q values of each action in each respective state + q1, q2 = self._critic(state_batch, action_batch) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q1_loss = F.mse_loss(q1, q_target) + + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q2_loss = F.mse_loss(q2, q_target) + q_loss = q1_loss + q2_loss + + # Update the critic + self._critic_optim.zero_grad() + q_loss.backward() + self._critic_optim.step() + + # Increment the running total of updates and update the critic target + # if needed + self._update_number += 1 + if self._update_number % self._target_update_interval == 0: + self._update_number = 0 + nn_utils.soft_update(self._critic_target, self._critic, self._tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self._is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + 
self._is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass + + def _init_critics(self, obs_space, action_space, critic_hidden_dim, init, + activation, critic_lr, betas): + """ + Initializes the critic(s) + """ + num_inputs = obs_space.shape[0] + if self._double_q: + critic_type = DoubleQ + else: + critic_type = Q + + self._critic = critic_type(num_inputs, action_space.shape[0], + critic_hidden_dim, init, + activation).to(device=self._device) + self._critic_optim = Adam(self._critic.parameters(), lr=critic_lr, + betas=betas) + + self._critic_target = critic_type(num_inputs, action_space.shape[0], + critic_hidden_dim, init, + activation).to(self._device) + + nn_utils.hard_update(self._critic_target, self._critic) + + def _init_policy(self, obs_space, action_space, actor_hidden_dim, init, + activation, actor_lr, betas, clip_stddev): + """ + Initializes the policy + """ + num_inputs = obs_space.shape[0] + if self._policy_type == "squashedgaussian": + self._policy = SquashedGaussian(num_inputs, action_space.shape[0], + actor_hidden_dim, activation, + action_space, clip_stddev, + init).to(self._device) + self._policy_optim = Adam(self._policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError(f"policy {self._policy_type} unknown") + + def _get_q(self, state_batch, action_batch): + """ + Gets the Q values for `action_batch` actions in `state_batch` states + from the critic, rather than the target critic. + + Parameters + ---------- + state_batch : torch.Tensor + The batch of states to calculate the action values in. Of the form + (batch_size, state_dims). + action_batch : torch.Tensor + The batch of actions to calculate the action values of in each + state. Of the form (batch_size, action_dims). 
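+
+        Returns
+        -------
+        torch.Tensor
+            The Q-value of each state-action pair in the batch, taken as
+            the minimum of the two critic outputs when a double Q critic
+            is used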
+ """ + if self._double_q: + q1, q2 = self._critic(state_batch, action_batch) + q = torch.min(q1, q2) + else: + q = self._critic(state_batch, action_batch) + + return q diff --git a/agent/nonlinear/SACDiscrete.py b/agent/nonlinear/SACDiscrete.py new file mode 100644 index 0000000..273016a --- /dev/null +++ b/agent/nonlinear/SACDiscrete.py @@ -0,0 +1,403 @@ +# Import modules +import os +from gym.spaces import Box +import torch +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy.MLP import Softmax +from agent.nonlinear.value_function.MLP import DoubleQ +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class SACDiscrete(BaseAgent): + """ + SACDiscrete implements a discrete-action Soft Actor-Critic agent with MLP + function approximation. + + SACDiscrete works only with discrete action spaces. + """ + def __init__(self, env, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, alpha_lr, + actor_hidden_dim, critic_hidden_dim, replay_capacity, seed, + batch_size, betas, automatic_entropy_tuning=False, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + env : gym.Environment + The environment to run on + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + alpha_lr : float + The learning rate for the entropy parameter, if using an automatic + entropy tuning algorithm (see automatic_entropy_tuning) parameter + below + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + automatic_entropy_tuning : bool, optional + Whether the agent should automatically tune its entropy + hyperparmeter alpha, by default False + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. 
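+        betas : tuple of float
+            The (beta1, beta2) coefficients passed to the Adam optimizers
+            of both the actor and the critic
+        activation : str
+            The activation function to use in the actor and critic
+            networks, by default "relu"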
+ + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + action_space = env.action_space + obs_space = env.observation_space + if isinstance(action_space, Box): + raise ValueError("SACDiscrete can only be used with " + + "discrete actions") + + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + # Keep a replay buffer + action_shape = 1 + obs_dim = obs_space.shape + self.replay = ExperienceReplay(replay_capacity, seed, obs_dim, + action_shape, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + self.automatic_entropy_tuning = automatic_entropy_tuning + assert not self.automatic_entropy_tuning + + # Ensure we are working with vector observations + if len(obs_dim) != 1: + raise ValueError( + f"""SACDiscrete works only with vector + observations, but got observation with shape + {obs_dim}.""" + ) + + num_inputs = obs_dim[0] + self.critic = DoubleQ(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = DoubleQ(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + if self.policy_type == "softmax": + # Target Entropy = −dim(A) + # (e.g. 
, -6 for HalfCheetah-v2) as given in the paper + if self.automatic_entropy_tuning: + raise ValueError("cannot use auto entropy tuning with" + + " discrete actions") + + self.num_actions = action_space.n + self.policy = Softmax( + num_inputs, self.num_actions, actor_hidden_dim, activation, + init + ).to(self.device) + + actor_lr = actor_lr_scale * critic_lr + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError(f"policy type {policy.lower()} not " + + "available") + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + raise ValueError("cannot sample actions in eval mode yet") + + act = action.detach().cpu().numpy()[0] + return int(act[0]) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + # Adjust action to ensure it can be sent to the experience replay + # buffer properly + action = np.array([action]) + + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, next_state_log_pi, _ = \ + self.policy.sample(next_state_batch) + + qf1_next_target, qf2_next_target = self.critic_target( + next_state_batch, next_state_action) + + min_qf_next_target = torch.min(qf1_next_target, qf2_next_target) \ + - self.alpha * next_state_log_pi + next_q_value = reward_batch + mask_batch * self.gamma * \ + (min_qf_next_target) + + # Two Q-functions to reduce positive bias in policy improvement + qf1, qf2 = self.critic(state_batch, action_batch) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + qf1_loss = F.mse_loss(qf1, next_q_value) + + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + qf2_loss = F.mse_loss(qf2, next_q_value) + qf_loss = qf1_loss + qf2_loss + + # Update the critic + self.critic_optim.zero_grad() + qf_loss.backward() + self.critic_optim.step() + + # Calculate the actor loss using Eqn(5) in FKL/RKL paper + # Repeat the state for each action + state_batch = state_batch.repeat_interleave(self.num_actions, dim=0) + actions = torch.tensor([n for n in range(self.num_actions)]) + actions = actions.repeat(self.batch_size) + actions = actions.unsqueeze(-1) + + qf1_actions, qf2_actions = 
self.critic(state_batch, actions) + min_qf_actions = torch.min(qf1_actions, qf2_actions) + + log_prob = self.policy.log_prob(state_batch, actions) + prob = log_prob.exp() + policy_loss = prob * (min_qf_actions - log_prob * self.alpha) + policy_loss = policy_loss.reshape([self.batch_size, self.num_actions]) + policy_loss = -policy_loss.sum(dim=1).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Tune the entropy if appropriate + if self.automatic_entropy_tuning: + print("warning: should not use auto entropy in these experiments") + alpha_loss = -(self.log_alpha * + (log_pi + self.target_entropy).detach()).mean() + + self.alpha_optim.zero_grad() + alpha_loss.backward() + self.alpha_optim.step() + + self.alpha = self.log_alpha.exp() + + # Increment the running total of updates and update the critic target + # if needed + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ +# if not os.path.exists('models/'): +# os.makedirs('models/') +# +# if actor_path is None: +# actor_path = "models/sac_actor_{}_{}".format(env_name, suffix) +# if critic_path is None: +# critic_path = "models/sac_critic_{}_{}".format(env_name, suffix) +# print('Saving models to {} and {}'.format(actor_path, critic_path)) +# torch.save(self.policy.state_dict(), actor_path) +# torch.save(self.critic.state_dict(), critic_path) + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
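+
+        Note: this is currently not implemented for this agent and the
+        method returns None; the commented-out code in the body indicates
+        the values that would be returned.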
+ + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ +# parameters = {} +# parameters["actor_weights"] = self.policy.state_dict() +# parameters["actor_optimizer"] = self.policy_optim.state_dict() +# parameters["critic_weights"] = self.critic.state_dict() +# parameters["critic_optimizer"] = self.critic_optim.state_dict() +# parameters["critic_target"] = self.critic_target.state_dict() +# parameters["entropy"] = self.alpha +# +# if self.automatic_entropy_tuning: +# parameters["log_entropy"] = self.log_alpha +# parameters["entropy_optimizer"] = self.alpha_optim.state_dict() +# parameters["target_entropy"] = self.target_entropy +# +# return parameters + + +if __name__ == "__main__": + import gym + a = gym.make("MountainCarContinuous-v0") + actions = a.action_space + s = SAC(num_inputs=5, action_space=actions, gamma=0.9, tau=0.8, + alpha=0.2, policy="Gaussian", target_update_interval=10, + critic_lr=0.01, actor_lr=0.01, alpha_lr=0.01, actor_hidden_dim=200, + critic_hidden_dim=200, replay_capacity=50, seed=0, batch_size=10, + automatic_entropy_tuning=False, cuda=False) diff --git a/agent/nonlinear/SACDiscreteCNN.py b/agent/nonlinear/SACDiscreteCNN.py new file mode 100644 index 0000000..b0cd16e --- /dev/null +++ b/agent/nonlinear/SACDiscreteCNN.py @@ -0,0 +1,402 @@ +# Import modules +import os +from gym.spaces import Box +import torch +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy.CNN import Softmax +from agent.nonlinear.value_function.CNN import DoubleDiscreteQ as Q +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class SACDiscrete(BaseAgent): + """ + SACDiscrete implements a discrete-action Soft Actor-Critic agent with CNN + function approximation. + + SACDiscrete works only with discrete action spaces. + """ + def __init__(self, gamma, tau, alpha, policy, env, + target_update_interval, critic_lr, actor_lr_scale, alpha_lr, + hidden_dim, kernel_sizes, channels, replay_capacity, seed, + batch_size, betas, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. 
+ policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + alpha_lr : float + The learning rate for the entropy parameter, if using an automatic + entropy tuning algorithm (see automatic_entropy_tuning) parameter + below + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + automatic_entropy_tuning : bool, optional + Whether the agent should automatically tune its entropy + hyperparmeter alpha, by default False + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + self.step = 0 + + # Do some random error checking + action_space = env.action_space + obs_space = env.observation_space + + if isinstance(action_space, Box): + raise ValueError("SACDiscrete can only be used with " + + "discrete actions") + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + super().__init__() + self.batch = True + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. 
PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + # Keep a replay buffer + obs_dim = obs_space.shape + self.replay = ExperienceReplay(replay_capacity, seed, obs_dim, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + self.num_actions = action_space.n + self.critic = Q(obs_dim, self.num_actions, channels, kernel_sizes, + hidden_dim, init, activation).to(device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = Q(obs_dim, self.num_actions, channels, + kernel_sizes, hidden_dim, init, + activation).to(self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + if self.policy_type == "softmax": + self.policy = Softmax(obs_dim, channels, kernel_sizes, hidden_dim, + init, activation, + action_space).to(self.device) + + actor_lr = actor_lr_scale * critic_lr + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError(f"policy type {policy.lower()} not " + + "available") + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + raise ValueError("cannot sample actions in eval mode yet") + + act = action.detach().cpu().numpy()[0] + return int(act[0]) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + """ + self.step += 1 + + # Adjust action to ensure it can be sent to the experience replay + # buffer properly + action = np.array([action]) + + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + if self.step % 4 != 0: + return + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # For rewards, actions, and masks, we know they are scalars, so + # squeeze the final dimension + reward_batch = reward_batch.squeeze() + mask_batch = mask_batch.squeeze() + action_batch = action_batch.squeeze().long() + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters when computing the target for + # the update + with torch.no_grad(): + next_state_action, next_state_log_pi, _ = \ + self.policy.sample(next_state_batch, log_prob=True) + next_state_log_pi = next_state_log_pi.squeeze(-1) + + next_q1, next_q2 = self.critic_target(next_state_batch) + + next_q1 = next_q1[np.arange(self.batch_size), + next_state_action.squeeze()] + next_q2 = next_q2[np.arange(self.batch_size), + next_state_action.squeeze()] + + min_soft_q = torch.min(next_q1, next_q2) \ + - self.alpha * next_state_log_pi + target = reward_batch + mask_batch * self.gamma * min_soft_q + + # Get the value of the current state + q1, q2 = self.critic(state_batch) + q1 = q1[np.arange(self.batch_size), action_batch] + q2 = q2[np.arange(self.batch_size), action_batch] + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q1_loss = F.mse_loss(q1, target) + + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q2_loss = F.mse_loss(q2, target) + q_loss = q1_loss + q2_loss + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Calculate the actor loss using Eqn(5) in FKL/RKL paper + # Repeat the state for each action + # state_batch = state_batch.repeat_interleave(self.num_actions, dim=0) + # actions = torch.tensor([n for n in range(self.num_actions)]) + # actions = actions.repeat(self.batch_size) + # actions = actions.long() + + q1, q2 = self.critic(state_batch) + q1 = q1.flatten() + q2 = q2.flatten() + # q1 = q1[np.arange(state_batch.shape[0]), actions] + # q2 = q2[np.arange(state_batch.shape[0]), actions] + min_q = torch.min(q1, q2) + + log_prob = self.policy.all_log_prob(state_batch).squeeze().flatten() + prob = log_prob.exp() + policy_loss = prob * (min_q - log_prob * self.alpha) + policy_loss = policy_loss.reshape([self.batch_size, self.num_actions]) + policy_loss = -policy_loss.sum(dim=1).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Increment the running total of updates and update the critic target + # if needed + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.policy.eval() + self.critic.eval() + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.policy.train() + self.critic.train() + 
self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ +# if not os.path.exists('models/'): +# os.makedirs('models/') +# +# if actor_path is None: +# actor_path = "models/sac_actor_{}_{}".format(env_name, suffix) +# if critic_path is None: +# critic_path = "models/sac_critic_{}_{}".format(env_name, suffix) +# print('Saving models to {} and {}'.format(actor_path, critic_path)) +# torch.save(self.policy.state_dict(), actor_path) +# torch.save(self.critic.state_dict(), critic_path) + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
+ + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ +# parameters = {} +# parameters["actor_weights"] = self.policy.state_dict() +# parameters["actor_optimizer"] = self.policy_optim.state_dict() +# parameters["critic_weights"] = self.critic.state_dict() +# parameters["critic_optimizer"] = self.critic_optim.state_dict() +# parameters["critic_target"] = self.critic_target.state_dict() +# parameters["entropy"] = self.alpha +# +# if self.automatic_entropy_tuning: +# parameters["log_entropy"] = self.log_alpha +# parameters["entropy_optimizer"] = self.alpha_optim.state_dict() +# parameters["target_entropy"] = self.target_entropy +# +# return parameters + + +if __name__ == "__main__": + import gym + a = gym.make("MountainCarContinuous-v0") + actions = a.action_space + s = SAC(num_inputs=5, action_space=actions, gamma=0.9, tau=0.8, + alpha=0.2, policy="Gaussian", target_update_interval=10, + critic_lr=0.01, actor_lr=0.01, alpha_lr=0.01, actor_hidden_dim=200, + critic_hidden_dim=200, replay_capacity=50, seed=0, batch_size=10, + automatic_entropy_tuning=False, cuda=False) diff --git a/agent/nonlinear/VAC.py b/agent/nonlinear/VAC.py new file mode 100644 index 0000000..f27fcbe --- /dev/null +++ b/agent/nonlinear/VAC.py @@ -0,0 +1,443 @@ +#!/usr/bin/env python3 + +# Import modules +import torch +from gym.spaces import Box, Discrete +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy_utils import GaussianPolicy, SoftmaxPolicy +from agent.nonlinear.value_function_utils import QMLP +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class VAC(BaseAgent): + """ + VAC implements the Vanilla Actor-Critic agent. + + VAC works only with continuous actions and uses MLP function approximators. + """ + def __init__(self, num_inputs, action_space, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, + num_samples, actor_hidden_dim, critic_hidden_dim, + replay_capacity, seed, batch_size, betas, env, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + cuda : bool, optional + Whether or not cuda should be used for training, by default False. 
+ Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.discrete_action = isinstance(action_space, Discrete) + self.state_dims = num_inputs + self.num_samples = num_samples - 1 + assert num_samples >= 2 + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + if isinstance(action_space, Box): + self.action_dims = action_space.high.shape[0] + + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + action_space.shape[0], self.device) + elif isinstance(action_space, Discrete): + self.action_dims = 1 + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + # Create the critic Q function + if isinstance(action_space, Box): + action_shape = action_space.shape[0] + elif isinstance(action_space, Discrete): + action_shape = 1 + + self.critic = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + if self.policy_type == "gaussian": + + self.policy = GaussianPolicy(num_inputs, action_space.shape[0], + actor_hidden_dim, activation, + action_space, clip_stddev, init).to( + self.device) + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + # elif self.policy_type == "softmax": + # num_actions = action_space.n + # self.policy = SoftmaxPolicy(num_inputs, num_actions, + # actor_hidden_dim, activation, + # action_space, init).to(self.device) + # self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + # betas=betas) + + + else: + raise NotImplementedError + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ 
= self.policy.sample(state) + else: + _, _, action = self.policy.sample(state) + + act = action.detach().cpu().numpy()[0] + + if not self.discrete_action: + return act + else: + return int(act[0]) + + def sample_action_(self, state, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not affected by running + this function. + """ + if len(state.shape) > 1 or state.shape[0] > 1: + raise ValueError("sample_action_ takes a single state") + with torch.no_grad(): + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + mean, log_std = self.policy.forward(state) + + if not self.is_training: + return mean.detach().cpu().numpy()[0] + + mean = mean.detach().cpu().numpy()[0] + std = np.exp(log_std.detach().cpu().numpy()[0]) + return self.rng.normal(mean, std, size=size) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + if self.discrete_action: + action = np.array([action]) + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, _, _ = \ + self.policy.sample(next_state_batch) + qf_next_value = self.critic_target(next_state_batch, + next_state_action) + + q_target = reward_batch + mask_batch * self.gamma * qf_next_value + + # Two Q-functions to reduce positive bias in policy improvement + q_prediction = self.critic(state_batch, action_batch) + # print(torch.cat([reward_batch, action_batch, mask_batch], dim=1)) + # print(q_prediction) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q_loss = F.mse_loss(q_prediction, q_target) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Sample action that the agent would take + pi, _, _ = self.policy.sample(state_batch) + + # Calculate the advantage + with torch.no_grad(): + q_pi = self.critic(state_batch, pi) + sampled_actions, _, _ = self.policy.sample(state_batch, + self.num_samples) + if self.num_samples == 1: + sampled_actions = sampled_actions.unsqueeze(1) + sampled_actions = torch.permute(sampled_actions, (1, 0, 2)) + + state_baseline = 0 + if self.num_samples > 2: + # Baseline computed with self.num_samples - 1 action + # value estimates + baseline_actions = sampled_actions[:, :-1] + baseline_actions = torch.reshape(baseline_actions, + [-1, self.action_dims]) + stacked_s_batch = torch.repeat_interleave(state_batch, + self.num_samples-1, + dim=0) + stacked_s_batch = 
torch.reshape(stacked_s_batch, + [-1, self.state_dims]) + + baseline_q_vals = self.critic(stacked_s_batch, + baseline_actions) + baseline_q_vals = torch.reshape(baseline_q_vals, + [self.batch_size, + self.num_samples-1]) + state_baseline = baseline_q_vals.mean(axis=1).unsqueeze(1) + advantage = q_pi - state_baseline + + # Estimate the entropy from a single sampled action in each state + entropy_actions = sampled_actions[:, -1] + entropy = -self.policy.log_prob(state_batch, entropy_actions) + + # Jπ = 𝔼st∼D,εt∼N[α * logπ(f(εt;st)|st) − Q(st,f(εt;st))] + policy_loss = self.policy.log_prob(state_batch, pi) * advantage + policy_loss = -(policy_loss + (self.alpha * entropy)).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Update target network + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def update_value_fn(self, state, action, reward, next_state, done_mask, + new_sample): + if new_sample: + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, _, _ = \ + self.policy.sample(next_state_batch) + + next_q = self.critic_target(next_state_batch, next_state_action) + target_q_value = reward_batch + mask_batch * self.gamma * next_q + + q_value = self.critic(state_batch, action_batch) + + # Calculate the loss on the critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q_loss = F.mse_loss(target_q_value, q_value) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Update target networks + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def sample_qs(self, num_q_samples): + """Get a number of samples of Q(s, a) for s in the replay buffer + and a according to current policy""" + # Sample a batch from memory + state_batch, _, _, _, _ = self.replay.sample(batch_size=num_q_samples) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + action_batch, _, _ = \ + self.policy.sample(state_batch) + + return self.critic(state_batch, action_batch).detach().\ + squeeze().numpy() + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. 
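+
+        At the moment this method is a stub (its body is simply `pass`), so
+        no files are written when it is called.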
+ + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass diff --git a/agent/nonlinear/VACDiscrete.py b/agent/nonlinear/VACDiscrete.py new file mode 100644 index 0000000..19efb32 --- /dev/null +++ b/agent/nonlinear/VACDiscrete.py @@ -0,0 +1,353 @@ +#!/usr/bin/env python3 + +# Import modules +import torch +import time +from gym.spaces import Box, Discrete +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy_utils import GaussianPolicy, SoftmaxPolicy +from agent.nonlinear.value_function_utils import QMLP +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class VACDiscrete(BaseAgent): + """ + VACDiscrete implements the Vanilla Actor-Critic agent. + + VACDiscrete works only with discrete actions and uses MLP function + approximators. + """ + def __init__(self, num_inputs, action_space, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, + actor_hidden_dim, critic_hidden_dim, + replay_capacity, seed, batch_size, betas, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. 
+ policy : str + The type of policy, currently, only support "softmax" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.discrete_action = isinstance(action_space, Discrete) + self.state_dims = num_inputs + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + if isinstance(action_space, Box): + raise ValueError("VACDiscrete can only be used with " + + "discrete actions") + elif isinstance(action_space, Discrete): + self.action_dims = 1 + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + # Create the critic Q function + if isinstance(action_space, Box): + action_shape = action_space.shape[0] + elif isinstance(action_space, Discrete): + action_shape = 1 + + self.critic = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + if self.policy_type == "softmax": + self.num_actions = action_space.n + self.policy = SoftmaxPolicy(num_inputs, self.num_actions, + actor_hidden_dim, activation, + action_space, init).to(self.device) + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + 
else: + raise NotImplementedError(f"policy type {policy} not implemented") + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + _, _, action = self.policy.sample(state) + + act = action.detach().cpu().numpy()[0] + if not self.discrete_action: + return act + else: + return int(act[0]) + + def sample_action_(self, state, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not affected by running + this function. + """ + if len(state.shape) > 1 or state.shape[0] > 1: + raise ValueError("sample_action_ takes a single state") + with torch.no_grad(): + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + mean, log_std = self.policy.forward(state) + + if not self.is_training: + return mean.detach().cpu().numpy()[0] + + mean = mean.detach().cpu().numpy()[0] + std = np.exp(log_std.detach().cpu().numpy()[0]) + return self.rng.normal(mean, std, size=size) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + """ + if self.discrete_action: + action = np.array([action]) + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, _, _ = \ + self.policy.sample(next_state_batch) + qf_next_value = self.critic_target(next_state_batch, + next_state_action) + + q_target = reward_batch + mask_batch * self.gamma * qf_next_value + + # Two Q-functions to reduce positive bias in policy improvement + q_prediction = self.critic(state_batch, action_batch) + # print(torch.cat([reward_batch, action_batch, mask_batch], dim=1)) + # print(q_prediction) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q_loss = F.mse_loss(q_prediction, q_target) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Calculate the actor loss using Eqn(5) in FKL/RKL paper + # No need to use a baseline in this setting + state_batch = state_batch.repeat_interleave(self.num_actions, dim=0) + actions = torch.tensor([n for n in range(self.num_actions)]) + actions = actions.repeat(self.batch_size) + actions = actions.unsqueeze(-1) + + q = self.critic(state_batch, actions) + log_prob = self.policy.log_prob(state_batch, actions) + prob = log_prob.exp() + + policy_loss = prob * (q - log_prob * self.alpha) + policy_loss = policy_loss.reshape([self.batch_size, self.num_actions]) + policy_loss = -policy_loss.sum(dim=1).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Update target network + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. 
+ + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass diff --git a/agent/nonlinear/nn_utils.py b/agent/nonlinear/nn_utils.py new file mode 100644 index 0000000..63215ba --- /dev/null +++ b/agent/nonlinear/nn_utils.py @@ -0,0 +1,331 @@ +# Import modules +import torch +import torch.nn as nn +import numpy as np + + +# Function definitions +def weights_init_(layer, init="kaiming", activation="relu"): + """ + Initializes the weights for a fully connected layer of a neural network. + + Parameters + ---------- + layer : torch.nn.Module + The layer to initialize + init : str + The type of initialization to use, one of 'xavier_uniform', + 'xavier_normal', 'uniform', 'normal', 'orthogonal', 'kaiming_uniform', + 'default', by default 'kaiming_uniform'. + activation : str + The activation function in use, used to calculate the optimal gain + value. + + """ + if "weight" in dir(layer): + gain = torch.nn.init.calculate_gain(activation) + + if init == "xavier_uniform": + torch.nn.init.xavier_uniform_(layer.weight, gain=gain) + elif init == "xavier_normal": + torch.nn.init.xavier_normal_(layer.weight, gain=gain) + elif init == "uniform": + torch.nn.init.uniform_(layer.weight) / layer.in_features + elif init == "normal": + torch.nn.init.normal_(layer.weight) / layer.in_features + elif init == "orthogonal": + torch.nn.init.orthogonal_(layer.weight) + elif init == "zeros": + torch.nn.init.zeros_(layer.weight) + elif init == "kaiming_uniform" or init == "default" or init is None: + # PyTorch default + return + else: + raise NotImplementedError(f"init {init} not implemented yet") + + if "bias" in dir(layer): + torch.nn.init.constant_(layer.bias, 0) + + +def soft_update(target, source, tau): + """ + Updates the parameters of the target network towards the parameters of + the source network by a weight average depending on tau. The new + parameters for the target network are: + + ((1 - τ) * target_parameters) + (τ * source_parameters) + + Parameters + ---------- + target : torch.nn.Module + The target network + source : torch.nn.Module + The source network + tau : float + The weighting for the weighted average + """ + with torch.no_grad(): + for target_param, param in zip(target.parameters(), + source.parameters()): + # Use in-place operations mul_ and add_ to avoid + # copying tensor data + target_param.data.mul_(1.0 - tau) + target_param.data.add_(tau * param.data) + + +def hard_update(target, source): + """ + Sets the parameters of the target network to the parameters of the + source network. 
Equivalent to soft_update(target, source, 1) + + Parameters + ---------- + target : torch.nn.Module + The target network + source : torch.nn.Module + The source network + """ + with torch.no_grad(): + for target_param, param in zip(target.parameters(), + source.parameters()): + target_param.data.copy_(param.data) + + +def init_layers(layers, init_scheme): + """ + Initializes the weights for the layers of a neural network. + + Parameters + ---------- + layers : list of nn.Module + The list of layers + init_scheme : str + The type of initialization to use, one of 'xavier_uniform', + 'xavier_normal', 'uniform', 'normal', 'orthogonal', by default None. + If None, leaves the default PyTorch initialization. + """ + def fill_weights(layers, init_fn): + for i in range(len(layers)): + init_fn(layers[i].weight) + + if init_scheme.lower() == "xavier_uniform": + fill_weights(layers, nn.init.xavier_uniform_) + elif init_scheme.lower() == "xavier_normal": + fill_weights(layers, nn.init.xavier_normal_) + elif init_scheme.lower() == "uniform": + fill_weights(layers, nn.init.uniform_) + elif init_scheme.lower() == "normal": + fill_weights(layers, nn.init.normal_) + elif init_scheme.lower() == "orthogonal": + fill_weights(layers, nn.init.orthogonal_) + elif init_scheme is None: + # Use PyTorch default + return + + +def _calc_conv_outputs(in_height, in_width, kernel_size, dilation=1, padding=0, + stride=1): + """ + Calculates the output height and width given in input height and width and + the kernel size. + + Parameters + ---------- + in_height : int + The height of the input image + in_width : int + The width of the input image + kernel_size : tuple[int, int] or int + The kernel size + dilation : tuple[int, int] or int + Spacing between kernel elements, by default 1 + padding : tuple[int, int] or int + Padding added to all four sides of the input, by default 0 + stride : tuple[int, int] or int + Stride of the convolution, by default 1 + + Returns + ------- + tuple[int, int] + The output width and height + """ + # Reshape so that kernel_size, padding, dilation, and stride have one + # element per dimension + if isinstance(kernel_size, int): + kernel_size = [kernel_size] * 2 + if isinstance(padding, int): + padding = [padding] * 2 + if isinstance(dilation, int): + dilation = [dilation] * 2 + if isinstance(stride, int): + stride = [stride] * 2 + + out_height = in_height + 2 * padding[0] - dilation[0] * ( + kernel_size[0] - 1) - 1 + out_height //= stride[0] + + out_width = in_width + 2 * padding[1] - dilation[1] * ( + kernel_size[1] - 1) - 1 + out_width //= stride[1] + + return out_height + 1, out_width + 1 + + +def _get_activation(activation): + """ + Returns an activation operation given a string describing the activation + operation + + Parameters + ---------- + activation : str + The string representation of the activation operation, one of 'relu', + 'tanh' + + Returns + ------- + nn.Module + The activation function + """ + # Set the activation funcitons + if activation.lower() == "relu": + act = nn.ReLU() + elif activation.lower() == "tanh": + act = nn.Tanh() + else: + raise ValueError(f"unknown activation {activation}") + + return act + + +def _construct_conv_linear(input_dim, num_actions, channels, kernel_sizes, + hidden_sizes, init, activation, single_output): + """ + Constructs a number of convolutional layers and a sequence of + densely-connected layers which operate on the output of the convolutional + layers, returning the convolutional sequence and densely-connected sequence + separately. 
+
+    This function is particularly suited to producing Q functions or
+    Softmax policies, but can also be used to construct other approximators
+    such as Gaussian policies or V functions (for a V function,
+    `num_actions` would be the number of state values to output, which is
+    always 1, and `single_output` should be set to `True`).
+
+    This function constructs a neural net of the form:
+
+        input --> convolutional layers --> densely-connected layers --> output
+
+    and returns the convolutional and densely-connected layers separately.
+
+    Parameters
+    ----------
+    input_dim : tuple[int, int, int]
+        Dimensionality of state features, which should be (channels,
+        height, width)
+    num_actions : int
+        If `single_output` is `True`, then this should be the dimensionality
+        of the action, since the action will be concatenated with the input
+        to the linear layers. If `single_output` is `False`, then this should
+        be the number of discrete actions available in the environment, and
+        the network will output `num_actions` action values.
+    channels : array-like[int]
+        The number of channels in each hidden convolutional layer
+    kernel_sizes : array-like[int]
+        The kernel size of each consecutive convolutional layer
+    hidden_sizes : array-like[int]
+        The number of units in each consecutive fully connected layer
+    init : str
+        The initialization scheme to use for the weights, one of
+        'xavier_uniform', 'xavier_normal', 'uniform', 'normal',
+        'orthogonal', by default None. If None, leaves the default
+        PyTorch initialization.
+    activation : indexable[str] or str
+        The activation function to use; each element should be one of
+        'relu', 'tanh'
+    single_output : bool
+        Whether or not the network should have a single output. If `True`,
+        then the action is concatenated with the input to the linear layers.
+        If `False`, then `num_actions` action values are outputted.
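+
+    As a rough usage sketch (mirroring the Q and DiscreteQ networks built
+    on top of this helper), a discrete-action network applies the returned
+    pair as `linear(torch.flatten(conv(state), start_dim=1))`, while a
+    single-output Q network concatenates the action with the flattened
+    convolution output before the linear layers.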
+    """
+    # Ensure the number of channels == the number of kernel sizes
+    if len(channels) != len(kernel_sizes):
+        raise ValueError("must have the same number of channels and " +
+                         f"kernels but got {len(channels)} channels " +
+                         f"and {len(kernel_sizes)} kernels")
+
+    # Build the list of activations: either a single activation repeated for
+    # every layer, or one activation specified per layer
+    if isinstance(activation, str):
+        act = [_get_activation(activation)] * (len(channels) +
+                                               len(hidden_sizes))
+    elif len(activation) != len(channels) + len(hidden_sizes):
+        raise ValueError("must have one activation per layer but got " +
+                         f"{len(activation)} activations for " +
+                         f"{len(channels) + len(hidden_sizes)} layers")
+    else:
+        act = [_get_activation(a) for a in activation]
+
+    # Convolutional layers
+    conv = []  # List of sequential convolutional layers and activations
+    in_channels = input_dim[0]
+    out_channels = channels[0]
+    kernel = kernel_sizes[0]
+    channel_size = input_dim[1:]
+    for i in range(1, len(channels)):
+        # Append the convolutional layer and activation to the list of layers
+        conv.append(nn.Conv2d(in_channels, out_channels, kernel))
+        conv.append(act[i-1])
+
+        # Calculate the next channel size to be used later for the number of
+        # inputs to the dense layers
+        channel_size = _calc_conv_outputs(channel_size[0],
+                                          channel_size[1], kernel)
+
+        # Update some running variables for the convolutional layer sizes for
+        # the next convolutional layer
+        in_channels = out_channels
+        out_channels = channels[i]
+        kernel = kernel_sizes[i]
+
+    # Append the last convolutional layer to the list of layers
+    conv.append(nn.Conv2d(in_channels, out_channels, kernel))
+    conv.append(act[len(channels)-1])
+    channel_size = _calc_conv_outputs(channel_size[0],
+                                      channel_size[1], kernel)
+
+    # Ensure that the final output size of the convolutional layers
+    # is non-negative
+    if np.any(np.array(channel_size) < 0):
+        raise ValueError("convolutions produce shape with negative size")
+
+    # Construct the chain of convolutions and activations
+    conv = nn.Sequential(*conv)
+    conv.apply(lambda module: weights_init_(module, init))
+
+    # Get the final number of elements of the output of the
+    # convolutional layers
+    conv_out = out_channels * np.prod(channel_size)
+
+    # Linear layers
+    linear = []  # List of dense connections and activations
+    in_units = conv_out + (num_actions if single_output else 0)
+    for i in range(len(hidden_sizes)):
+        # Add a dense layer and activation to the list of operations for the
+        # fully connected layers
+        linear.append(nn.Linear(in_units, hidden_sizes[i]))
+        linear.append(act[len(channels) + i])
+
+        # Update the number of inputs to the next layer
+        in_units = hidden_sizes[i]
+
+    # Add the final dense layer
+    if single_output:
+        linear.append(nn.Linear(in_units, 1))
+    else:
+        linear.append(nn.Linear(in_units, num_actions))
+
+    # Construct the chain of dense connections and activations
+    linear = nn.Sequential(*linear)
+    linear.apply(lambda module: weights_init_(module, init))
+
+    return conv, linear
diff --git a/agent/nonlinear/policy/CNN.py b/agent/nonlinear/policy/CNN.py
new file mode 100644
index 0000000..4ce5c47
--- /dev/null
+++ b/agent/nonlinear/policy/CNN.py
@@ -0,0 +1,146 @@
+# Import modules
+import agent.nonlinear.nn_utils as nn_utils
+import numpy as np
+import time
+import torch
+from torch.distributions import Normal, Independent
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+# Global variables
+EPSILON = 1e-6
+
+
+# Class definitions
+class Softmax(nn.Module):
+    """
+    Softmax implements a softmax policy in each state, parameterized
+    using a CNN to predict logits.
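+
+    The network maps an image-shaped state observation to one logit per
+    discrete action; actions are drawn from the categorical distribution
+    obtained by applying a softmax to these logits.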
+ """ + def __init__(self, input_dim, channels, kernel_sizes, + hidden_sizes, init, activation, action_space): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + action_space : gym.Spaces.Discrete + The action space + """ + super(Softmax, self).__init__() + + self.num_actions = action_space.n + self.conv, self.linear = nn_utils._construct_conv_linear( + input_dim, + self.num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + def forward(self, state): + """ + Performs the forward pass through the network, predicting a logit for + each action in `state`. + + Parameters + ---------- + state : torch.Tensor[float] or np.array[float] + The state that the action was taken in + + Returns + ------- + torch.Tensor + The logit for each action in `state` with shape `(batch, + num_actions)` + """ + if isinstance(state, np.ndarray): + x = torch.tensor(state) + + x = self.conv(state) + return self.linear(torch.flatten(x, start_dim=1)) + + def sample(self, state, num_samples=1, log_prob=False): + """ + Returns actions sampled from the policy in `state` + + Parameters + ---------- + state : torch.Tensor + The states to sample the actions in + num_samples : int, optional + The number of actions to sampler per state + log_prob : bool, optional + Whether or not to return the log probability of each action in + each state in `state`, by default `False` + + Returns + ------- + torch.Tensor + A sample of `num_samples` actions in each state, with shape + `(num_samples, batch, action_dims = 1)` + """ + logits = self.forward(state) + + probs = F.softmax(logits, dim=1) + + policy = torch.distributions.Categorical(probs) + actions = policy.sample((num_samples,)) + + log_prob_val = None + if log_prob: + log_prob_val = F.log_softmax(logits, dim=1) + log_prob_val = torch.gather(log_prob_val, dim=1, index=actions) + + if num_samples == 1: + actions = actions.squeeze(0) + if log_prob: + log_prob_val = log_prob_val.squeeze(0) + + actions = actions.unsqueeze(-1) + if log_prob: + log_prob_val = log_prob_val.unsqueeze(-1) + + return actions.long(), log_prob_val, None + + def all_log_prob(self, states): + """ + Returns the log probability of taking each action in `states`. + """ + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + + return log_probs + + def log_prob(self, states, actions): + """ + Returns the log probability of taking `actions` in `states`. 
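+
+        `actions` is expected to have shape `(batch, 1)`; a 1-dimensional
+        tensor of shape `(batch,)` is also accepted and is given a trailing
+        dimension before the per-action log probabilities are gathered.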
+ """ + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + if actions.shape[0] == log_probs.shape[0] and len(actions.shape) == 1: + actions = actions.unsqueeze(-1) + log_probs = torch.gather(log_probs, dim=1, index=actions.long()) + + return log_probs diff --git a/agent/nonlinear/policy/MLP.py b/agent/nonlinear/policy/MLP.py new file mode 100644 index 0000000..7d7cec9 --- /dev/null +++ b/agent/nonlinear/policy/MLP.py @@ -0,0 +1,723 @@ +# Import modules +import torch +import time +import numpy as np +import torch.nn as nn +import torch.nn.functional as F +from torch.distributions import Normal, Independent +from agent.nonlinear.nn_utils import weights_init_ +from utils.TruncatedNormal import TruncatedNormal + + +# Global variables +EPSILON = 1e-6 + + +# Class definitions +class SquashedGaussian(nn.Module): + """ + Class SquashedGaussian implements a policy following a squashed + Gaussian distribution in each state, parameterized by an MLP. + + The MLP architecture is implemented + as two shared hidden layers, followed by two separate output layers: + one to predict the mean, and the other to predict the log standard + deviation. + + For the the version that SAC used for the submission to ICML, see + commit f66e4bf666da8c4142ff5acd33aed91dc25f4110. + Basically there was a bug where the first and last layers + used xavier_uniform while the second layer used kaiming_uniform + """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + action_space=None, clip_stddev=1000, init=None): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of elements in the state feature vector + num_actions : int + The dimensionality of the action vector + hidden_dim : int + The number of units in each hidden layer of the network + activation : str + The activation function to use, one of 'relu', 'tanh' + action_space : gym.spaces.Space, optional + The action space of the environment, by default None. This argument + is used to ensure that the actions are within the correct scale. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + """ + super(SquashedGaussian, self).__init__() + + self.num_actions = num_actions + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_std_threshold = np.log(clip_stddev) + + # Set up the layers + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.mean_linear = nn.Linear(hidden_dim, num_actions) + self.log_std_linear = nn.Linear(hidden_dim, num_actions) + + # Initialize weights + self.apply(lambda module: weights_init_(module, init, activation)) + + # action rescaling + if action_space is None: + self.action_scale = torch.tensor(1.) + self.action_bias = torch.tensor(0.) + else: + self.action_scale = torch.FloatTensor( + (action_space.high - action_space.low) / 2.) + self.action_bias = torch.FloatTensor( + (action_space.high + action_space.low) / 2.) 
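+
+        # The raw sample x_t is squashed through tanh, so it lies in
+        # (-1, 1); action_scale and action_bias then map it affinely onto
+        # the environment's action bounds:
+        #     action = tanh(x_t) * action_scale + action_bias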
+ + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation function {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the mean + and the log standard deviation. + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + 2-tuple of torch.Tensor of float + The mean and log standard deviation of the Gaussian policy in the + argument state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + + mean = self.mean_linear(x) + log_std = self.log_std_linear(x) + + if self.clip_stddev: + log_std = torch.clamp(log_std, min=-self.clip_std_threshold, + max=self.clip_std_threshold) + return mean, log_std + + def sample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + + if self.num_actions > 1: + normal = Independent(normal, 1) + + x_t = normal.sample((num_samples,)) + if num_samples == 1: + x_t = x_t.squeeze(0) + y_t = torch.tanh(x_t) + action = y_t * self.action_scale + self.action_bias + log_prob = normal.log_prob(x_t) + + log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + + EPSILON).sum(axis=-1).reshape(log_prob.shape) + if self.num_actions > 1: + log_prob = log_prob.unsqueeze(-1) + + mean = torch.tanh(mean) * self.action_scale + self.action_bias + + return action, log_prob, mean, x_t + + def rsample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state using + the reparameterization trick + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + + if self.num_actions > 1: + normal = Independent(normal, 1) + + # For re-parameterization trick (mean + std * N(0,1)) + # rsample() implements the re-parameterization trick + x_t = normal.rsample((num_samples,)) + if num_samples == 1: + x_t = x_t.squeeze(0) + y_t = torch.tanh(x_t) + action = y_t * self.action_scale + self.action_bias + log_prob = normal.log_prob(x_t) + + log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + + EPSILON).sum(axis=-1).reshape(log_prob.shape) + if self.num_actions > 1: + log_prob = log_prob.unsqueeze(-1) + + mean = torch.tanh(mean) * self.action_scale + self.action_bias + + return action, log_prob, mean, x_t + + def log_prob(self, state_batch, x_t_batch): + """ + Calculates the log probability of taking the action generated + from x_t, where x_t is returned from sample or rsample. The + log probability is returned for each action dimension separately. 
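+
+        As in `sample` and `rsample`, the tanh change-of-variables
+        correction `log(action_scale * (1 - tanh(x_t)^2) + EPSILON)` is
+        subtracted from the Gaussian log density, so the returned value is
+        the log probability of the squashed, rescaled action.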
+ """ + mean, log_std = self.forward(state_batch) + std = log_std.exp() + normal = Normal(mean, std) + + if self.num_actions > 1: + normal = Independent(normal, 1) + + y_t = torch.tanh(x_t_batch) + log_prob = normal.log_prob(x_t_batch) + log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + + EPSILON).sum(axis=-1).reshape(log_prob.shape) + if self.num_actions > 1: + log_prob = log_prob.unsqueeze(-1) + + return log_prob + + def to(self, device): + """ + Moves the network to a device + + Parameters + ---------- + device : torch.device + The device to move the network to + + Returns + ------- + nn.Module + The current network, moved to a new device + """ + self.action_scale = self.action_scale.to(device) + self.action_bias = self.action_bias.to(device) + return super(SquashedGaussian, self).to(device) + + +class Softmax(nn.Module): + """ + Softmax implements a softmax policy in each state, parameterized + using an MLP to predict logits. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + init=None): + super(Softmax, self).__init__() + + self.num_actions = num_actions + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, num_actions) + + # self.apply(weights_init_) + self.apply(lambda module: weights_init_(module, init, activation)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + return self.linear3(x) + + def sample(self, state, num_samples=1): + logits = self.forward(state) + + if len(logits.shape) != 1 and (len(logits.shape) != 2 and 1 not in + logits.shape): + shape = logits.shape + raise ValueError(f"expected a vector of logits, got shape {shape}") + + probs = F.softmax(logits, dim=1) + + policy = torch.distributions.Categorical(probs) + actions = policy.sample((num_samples,)) + + log_prob = F.log_softmax(logits, dim=1) + + log_prob = torch.gather(log_prob, dim=1, index=actions) + if num_samples == 1: + actions = actions.squeeze(0) + log_prob = log_prob.squeeze(0) + + actions = actions.unsqueeze(-1) + log_prob = log_prob.unsqueeze(-1) + + # return actions.float(), log_prob, None + return actions.int(), log_prob, None + + def all_log_prob(self, states): + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + + return log_probs + + def log_prob(self, states, actions): + """TODO: Docstring for log_prob. + + Parameters + ---------- + states : TODO + actions : TODO + + Returns + ------- + TODO + + """ + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + log_probs = torch.gather(log_probs, dim=1, index=actions.long()) + + return log_probs + + +class Gaussian(nn.Module): + """ + Class Gaussian implements a policy following Gaussian distribution + in each state, parameterized as an MLP. The predicted mean is scaled to be + within `(action_min, action_max)`. + + The MLP architecture is implemented as two shared hidden layers, + followed by two separate output layers: one to predict the mean, and the + other to predict the log standard deviation. 
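+
+    Unlike `SquashedGaussian`, sampled actions are not squashed with tanh;
+    they are instead clamped to `[action_min, action_max]`, and only the
+    predicted mean is squashed into that range.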
+ """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + action_space, clip_stddev=1000, init=None): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of elements in the state feature vector + num_actions : int + The dimensionality of the action vector + hidden_dim : int + The number of units in each hidden layer of the network + action_space : gym.spaces.Space + The action space of the environment + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + """ + super(Gaussian, self).__init__() + + self.num_actions = num_actions + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_std_threshold = np.log(clip_stddev) + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + + self.mean_linear = nn.Linear(hidden_dim, num_actions) + self.log_std_linear = nn.Linear(hidden_dim, num_actions) + + # Initialize weights + self.apply(lambda module: weights_init_(module, init, activation)) + + # Action rescaling + self.action_max = torch.FloatTensor(action_space.high) + self.action_min = torch.FloatTensor(action_space.low) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the mean + and the log standard deviation. 
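+
+        The predicted mean is passed through tanh and rescaled as
+        `((tanh(m) + 1) / 2) * (action_max - action_min) + action_min`, so
+        it always lies in `[action_min, action_max]`, while the log standard
+        deviation is optionally clamped to `±log(clip_stddev)`.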
+ + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + 2-tuple of torch.Tensor of float + The mean and log standard deviation of the Gaussian policy in the + argument state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + + mean = torch.tanh(self.mean_linear(x)) + mean = ((mean + 1) / 2) * (self.action_max - self.action_min) + \ + self.action_min # ∈ [action_min, action_max] + log_std = self.log_std_linear(x) + + # Works better with std dev clipping to ±1000 + if self.clip_stddev: + log_std = torch.clamp(log_std, min=-self.clip_std_threshold, + max=self.clip_std_threshold) + return mean, log_std + + def rsample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + if self.num_actions > 1: + normal = Independent(normal, 1) + + # For re-parameterization trick (mean + std * N(0,1)) + # rsample() implements the re-parameterization trick + action = normal.rsample((num_samples,)) + action = torch.clamp(action, self.action_min, self.action_max) + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if self.num_actions == 1: + log_prob.unsqueeze(-1) + + return action, log_prob, mean + + def sample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + num_samples : int + The number of actions to sample + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + if self.num_actions > 1: + normal = Independent(normal, 1) + + # Non-differentiable + action = normal.sample((num_samples,)) + action = torch.clamp(action, self.action_min, self.action_max) + + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if self.num_actions == 1: + log_prob.unsqueeze(-1) + + # print(action.shape) + + return action, log_prob, mean + + def log_prob(self, states, actions, show=False): + """ + Returns the log probability of taking actions in states. The + log probability is returned for each action dimension + separately, and should be added together to get the final + log probability + """ + mean, log_std = self.forward(states) + std = log_std.exp() + normal = Normal(mean, std) + if self.num_actions > 1: + normal = Independent(normal, 1) + + log_prob = normal.log_prob(actions) + if self.num_actions == 1: + log_prob.unsqueeze(-1) + + if show: + print(torch.cat([mean, std], axis=1)[0]) + + return log_prob + + def to(self, device): + """ + Moves the network to a device + + Parameters + ---------- + device : torch.device + The device to move the network to + + Returns + ------- + nn.Module + The current network, moved to a new device + """ + self.action_max = self.action_max.to(device) + self.action_min = self.action_min.to(device) + return super(Gaussian, self).to(device) + + +class TruncatedGaussian(nn.Module): + """ + Class TruncatedGaussian implements a policy following + a truncated Gaussian distribution in each state. 
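+
+    The distribution is truncated to the action bounds, using the
+    `TruncatedNormal` helper from `utils.TruncatedNormal` with
+    `a=action_min` and `b=action_max`, so sampled actions respect the
+    bounds without a tanh squashing step.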
+ + The MLP architecture is implemented + as two shared hidden layers, followed by two separate output layers: + one to predict the mean, and the other to predict the log standard + deviation. The mean is scaled to be within (action_min, `action_max)`. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + action_space, clip_stddev=1000, init=None): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of elements in the state feature vector + num_actions : int + The dimensionality of the action vector + hidden_dim : int + The number of units in each hidden layer of the network + action_space : gym.spaces.Space + The action space of the environment + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + """ + super(TruncatedGaussian, self).__init__() + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_std_threshold = np.log(clip_stddev) + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + + self.mean_linear = nn.Linear(hidden_dim, num_actions) + self.log_std_linear = nn.Linear(hidden_dim, num_actions) + + # self.apply(weights_init_) + self.apply(lambda module: weights_init_(module, init, activation)) + + # action rescaling + assert len(action_space.low.shape) == 1 + self.action_max = torch.FloatTensor(action_space.high) + self.action_min = torch.FloatTensor(action_space.low) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the mean + and the log standard deviation. 
+ + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + 2-tuple of torch.Tensor of float + The mean and log standard deviation of the Gaussian policy in the + argument state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + + mean = torch.tanh(self.mean_linear(x)) + mean = ((mean + 1)/2) * (self.action_max - self.action_min) + \ + self.action_min # ∈ [action_min, action_max] + log_std = self.log_std_linear(x) + + # Works better with std dev clipping to ±1000 + if self.clip_stddev: + log_std = torch.clamp(log_std, min=-self.clip_std_threshold, + max=self.clip_std_threshold) + return mean, log_std + + def rsample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = TruncatedNormal(loc=mean, scale=std, a=self.action_min, + b=self.action_max) + + # For re-parameterization trick (mean + std * N(0,1)) + # rsample() implements the re-parameterization trick + x = normal.rsample((num_samples,)) + action = torch.clamp(x, self.action_min, self.action_max) + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if num_samples == 1: + log_prob = log_prob.sum(1, keepdim=True) + else: + log_prob = log_prob.sum(2, keepdim=True) + + return action, log_prob, mean + + def sample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + num_samples : int + The number of actions to sample + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = TruncatedNormal(loc=mean, scale=std, a=self.action_min, + b=self.action_max) + + # Non-differentiable + x = normal.sample((num_samples,)) + action = torch.clamp(x, self.action_min, self.action_max) + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if num_samples == 1: + log_prob = log_prob.sum(1, keepdim=True) + else: + log_prob = log_prob.sum(2, keepdim=True) + + return action, log_prob, mean + + def log_prob(self, states, actions, show=False): + """ + Returns the log probability of taking actions in states. 
The + log probability is returned for each action dimension + separately, and should be added together to get the final + log probability + """ + mean, log_std = self.forward(states) + std = log_std.exp() + normal = TruncatedNormal(loc=mean, scale=std, a=self.action_min, + b=self.action_max) + + log_prob = normal.log_prob(actions) + + if show: + print(torch.cat([mean, std], axis=1)[0]) + # print(log_prob.shape) + + return log_prob + + def to(self, device): + """ + Moves the network to a device + + + Parameters + ---------- + device : torch.device + The device to move the network to + + Returns + ------- + nn.Module + The current network, moved to a new device + """ + self.action_max = self.action_max.to(device) + self.action_min = self.action_min.to(device) + return super(TruncatedGaussian, self).to(device) diff --git a/agent/nonlinear/value_function/CNN.py b/agent/nonlinear/value_function/CNN.py new file mode 100644 index 0000000..6f304b8 --- /dev/null +++ b/agent/nonlinear/value_function/CNN.py @@ -0,0 +1,238 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import agent.nonlinear.nn_utils as nn_utils + + +class Q(nn.Module): + """ + Class Q implements an action-value network using a CNN function + approximator. The network has a single output, which is the action value + for the input action in the input state. + + The action value is compute by first convolving the state observation, the + concatenating the flattened state convolution with the action and using + this as input to the fully connected layers. A single action value is + outputted for the input action. + """ + def __init__(self, input_dim, action_dim, channels, kernel_sizes, + hidden_sizes, init, activation): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + action_dim : int + Dimensionality of the action vector + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + """ + super(Q, self).__init__() + + self.conv, self.linear = nn_utils._construct_conv_linear( + input_dim, + action_dim, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + True, + ) + + def forward(self, state, action): + """ + Performs the forward pass through the network, predicting the + action-value for `action` in `state`. + + Parameters + ---------- + state : torch.Tensor[float] + The state that the action was taken in + action : torch.Tensor[float] or np.ndarray[float] + The action to get the value of + + Returns + ------- + torch.Tensor + The action value prediction + """ + if isinstance(state, np.ndarray): + x = torch.tensor(state) + + x = self.conv(state) + + x = torch.flatten(x) + x = torch.cat([x, action]) + return self.linear(x) + + +class DiscreteQ(nn.Module): + """ + Class DiscreteQ implements an action-value network using a CNN function + approximator. 
The network outputs one action value for each available + action. + """ + def __init__(self, input_dim, num_actions, channels, kernel_sizes, + hidden_sizes, init, activation): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + num_actions : int + The number of available actions in the environment + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + """ + super(DiscreteQ, self).__init__() + + self.conv, self.linear = nn_utils._construct_conv_linear( + input_dim, + num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + def forward(self, state): + """ + Performs the forward pass through the network, predicting an action + value for each action in `state`. + + Parameters + ---------- + state : torch.Tensor[float] or np.array[float] + The state that the action was taken in + + Returns + ------- + torch.Tensor + The action value prediction for each action in `state` + """ + if isinstance(state, np.ndarray): + x = torch.tensor(x) + + x = self.conv(state) + return self.linear(torch.flatten(x, start_dim=1)) + + +class DoubleDiscreteQ(nn.Module): + """ + Class DoubleDiscreteQ implements a double action-value network + using a CNN function approximator. + The network outputs two action values for each available action. + """ + def __init__(self, input_dim, num_actions, channels, kernel_sizes, + hidden_sizes, init, activation): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + num_actions : int + The number of available actions in the environment + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + """ + super(DoubleDiscreteQ, self).__init__() + + self.conv1, self.linear1 = nn_utils._construct_conv_linear( + input_dim, + num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + self.conv2, self.linear2 = nn_utils._construct_conv_linear( + input_dim, + num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + def forward(self, state): + """ + Performs the forward pass through the network, predicting an action + value for each action in `state`. 
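+
+        The two convolutional-plus-linear networks are evaluated on the
+        same state and their outputs are returned as the pair `(q1, q2)`;
+        agents typically take the element-wise minimum of the two estimates
+        to reduce overestimation bias.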
+ + Parameters + ---------- + state : torch.Tensor[float] or np.array[float] + The state that the action was taken in + + Returns + ------- + torch.Tensor + The action value prediction for each action in `state` + """ + if isinstance(state, np.ndarray): + x = torch.tensor(x) + + x1 = self.conv1(state) + q1 = self.linear1(torch.flatten(x1, start_dim=1)) + + x2 = self.conv2(state) + q2 = self.linear2(torch.flatten(x2, start_dim=1)) + + return q1, q2 diff --git a/agent/nonlinear/value_function/MLP.py b/agent/nonlinear/value_function/MLP.py new file mode 100644 index 0000000..bebffd8 --- /dev/null +++ b/agent/nonlinear/value_function/MLP.py @@ -0,0 +1,282 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import agent.nonlinear.nn_utils as nn_utils + + +# Class definitions +class V(nn.Module): + """ + Class V is an MLP for estimating the state value function `v`. + """ + def __init__(self, num_inputs, hidden_dim, init, activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of input feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(V, self).__init__() + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, 1) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the value of + `state`. + + Parameters + ---------- + state : torch.Tensor of float + The feature vector of the state to compute the value of + + Returns + ------- + torch.Tensor of float + The value of the state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + x = self.linear3(x) + return x + + +class DiscreteQ(nn.Module): + """ + Class DiscreteQ implements an action value network with number of + predicted action values equal to the number of available actions. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, init, + activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of state feature vector + num_actions : int + Dimensionality of the action feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. 
+ activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(DiscreteQ, self).__init__() + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, num_actions) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through each network, predicting the + action-value for `action` in `state`. + + Parameters + ---------- + state : torch.Tensor of float + The state that the action was taken in + + Returns + ------- + torch.Tensor + The action value predictions + """ + + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + return self.linear3(x) + + +class Q(nn.Module): + """ + Class Q implements an action-value network using an MLP function + approximator. The action value is computed by concatenating the action to + the state observation as the input to the neural network. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, init, + activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of state feature vector + num_actions : int + Dimensionality of the action feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(Q, self).__init__() + + # Q1 architecture + self.linear1 = nn.Linear(num_inputs + num_actions, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, 1) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state, action): + """ + Performs the forward pass through each network, predicting the + action-value for `action` in `state`. + + Parameters + ---------- + state : torch.Tensor of float + The state that the action was taken in + action : torch.Tensor of float + The action taken in the input state to predict the value function + of + + Returns + ------- + torch.Tensor + The action value prediction + """ + xu = torch.cat([state, action], 1) + + x = self.act(self.linear1(xu)) + x = self.act(self.linear2(x)) + x = self.linear3(x) + + return x + + +class DoubleQ(nn.Module): + """ + Class DoubleQ implements two action-value networks, + computing the action-value function using two separate fully + connected neural net. This is useful for implementing double Q-learning. + The action values are computed by concatenating the action to the state + observation and using this as input to each neural network. 
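+
+    A minimal usage sketch (the dimensions below are illustrative, not taken
+    from any configuration file):
+
+        q = DoubleQ(num_inputs=4, num_actions=1, hidden_dim=64,
+                    init="xavier_uniform", activation="relu")
+        states = torch.zeros(32, 4)   # batch of 32 four-dimensional states
+        actions = torch.zeros(32, 1)  # batch of 32 one-dimensional actions
+        q1, q2 = q(states, actions)   # one estimate from each network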
+ """ + def __init__(self, num_inputs, num_actions, hidden_dim, init, + activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of state feature vector + num_actions : int + Dimensionality of the action feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(DoubleQ, self).__init__() + + # Q1 architecture + self.linear1 = nn.Linear(num_inputs + num_actions, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, 1) + + # Q2 architecture + self.linear4 = nn.Linear(num_inputs + num_actions, hidden_dim) + self.linear5 = nn.Linear(hidden_dim, hidden_dim) + self.linear6 = nn.Linear(hidden_dim, 1) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state, action): + """ + Performs the forward pass through each network, predicting two + action-values (from each action-value approximator) for the input + action in the input state. + + Parameters + ---------- + state : torch.Tensor of float + The state that the action was taken in + action : torch.Tensor of float + The action taken in the input state to predict the value function + of + + Returns + ------- + 2-tuple of torch.Tensor of float + A 2-tuple of action values, one predicted by each function + approximator + """ + xu = torch.cat([state, action], 1) + + x1 = self.act(self.linear1(xu)) + x1 = self.act(self.linear2(x1)) + x1 = self.linear3(x1) + + x2 = self.act(self.linear4(xu)) + x2 = self.act(self.linear5(x2)) + x2 = self.linear6(x2) + + return x1, x2 diff --git a/combine.py b/combine.py new file mode 100644 index 0000000..2ec2e33 --- /dev/null +++ b/combine.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 + +import sys +import pickle +import os +import json +import utils.experiment_utils as exp +import click + + +def add_dicts(data, newfiles): + """ + add_dicts adds the data dictionaries in newfiles to the existing + dictionary data. This function assumes that the hyperparameter + indices between data and those found in each file in newfiles are + consistent. 
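+
+    A minimal usage sketch (the file names are hypothetical; each file is a
+    pickled results dictionary produced by main.py):
+
+        data = add_dicts(None, ["results/run_0.pkl", "results/run_1.pkl"])
+        # data["experiment_data"][i]["runs"] now contains the runs from both
+        # files for each hyperparameter index i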
+ """ + set_experiment_val = False + if data is None: + set_experiment_val = True + data = { + "experiment_data": {}, + "experiment": {}, + } + # Add data from all other dictionaries + for file in newfiles: + with open(file, "rb") as in_file: + # Read in the new dictionary + try: + in_data = pickle.load(in_file) + except EOFError: + print(file) + continue + + if set_experiment_val: + data["experiment"] = in_data["experiment"] + + # Add experiment data to running dictionary + for key in in_data["experiment_data"]: + # Check if key exists + if key in data["experiment_data"]: + if "learned_params" in \ + data["experiment_data"][key]["runs"][0]: + del data["experiment_data"][key]["runs"][0][ + "learned_params"] + # continue + # Append data if existing + data["experiment_data"][key]["runs"].extend( + in_data["experiment_data"][key]["runs"]) + + else: + # Key doesn't exist - add data to dictionary + data["experiment_data"][key] = \ + in_data["experiment_data"][key] + + return data + + +@click.command(help="combine a number of data files in a single " + + "directory into a single data file called data.pkl") +@click.argument("directory", required=True, type=click.Path(exists=True)) +def main(directory): + data = None + if os.path.exists(os.path.join(directory, "data.pkl")): + print("remove data.pkl from directory first") + + files = os.listdir(directory) + if "data.pkl" in files: + files.remove("data.pkl") + filenames = list(map(lambda x: os.path.join(directory, x), files)) + data = add_dicts(data, filenames) + + with open(os.path.join(directory, "data.pkl"), "wb") as outfile: + pickle.dump(data, outfile) + + +if __name__ == "__main__": + main() diff --git a/config/agent/FKL.json b/config/agent/FKL.json new file mode 100644 index 0000000..232e6e5 --- /dev/null +++ b/config/agent/FKL.json @@ -0,0 +1,21 @@ +{ + "agent_name": "fkl", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "num_samples": [30], + "policy_type": ["Gaussian"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0], + "hidden_dim": [64], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git a/config/agent/LinearGaussianAC.json b/config/agent/LinearGaussianAC.json new file mode 100644 index 0000000..6534f01 --- /dev/null +++ b/config/agent/LinearGaussianAC.json @@ -0,0 +1,17 @@ +{ + "agent_name": "LinearGaussianAC", + "parameters": + { + "decay": [0.5, 0.75, 0.9], + "critic_lr": [2.0, 0.5, 0.25, 0.125, 0.0625], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 5.0], + "use_critic_trace": [true, false], + "use_actor_trace": [true, false], + "scaled": [false], + "bins": [4], + "num_tilings": [16], + "clip_stddev": [1000], + "count_interval": [10000], + "trace_type": ["replacing"] + } +} diff --git a/config/agent/LinearSoftmaxAC.json b/config/agent/LinearSoftmaxAC.json new file mode 100644 index 0000000..4081b47 --- /dev/null +++ b/config/agent/LinearSoftmaxAC.json @@ -0,0 +1,18 @@ +{ + "agent_name": "LinearSoftmaxAC", + "parameters": + { + "decay": [0.5, 0.75, 0.9], + "critic_lr": [2.0, 0.5, 0.25, 0.125, 0.0625], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 5.0], + "use_critic_trace": [true, false], + "use_actor_trace": [true, false], + "temperature": [1.0, 0.1, 0.001], + "scaled": [false], + "bins": [4], + "num_tilings": [16], + "clip_stddev": [1000], + "count_interval": [10000], + "trace_type": ["replacing"] + } 
+} diff --git a/config/agent/SAC.json b/config/agent/SAC.json new file mode 100644 index 0000000..dd86546 --- /dev/null +++ b/config/agent/SAC.json @@ -0,0 +1,25 @@ +{ + "agent_name": "SAC", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "num_hidden": [3], + "reparameterized": [true, false], + "soft_q": [true, false], + "double_q": [true, false], + "alpha": [0.001, 0.01, 0.1, 1.0], + "betas": [[0.9, 0.999]], + "policy_type": ["SquashedGaussian"], + "target_update_interval": [1], + "critic_lr": [1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0], + "alpha_lr": [0.0], + "hidden_dim": [64], + "automatic_entropy_tuning": [false], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} diff --git a/config/agent/SACDiscrete.json b/config/agent/SACDiscrete.json new file mode 100644 index 0000000..d74c903 --- /dev/null +++ b/config/agent/SACDiscrete.json @@ -0,0 +1,23 @@ +{ + "agent_name": "SACDiscrete", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "num_hidden": [3], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 10.0], + "alpha_lr": [0.0], + "hidden_dim": [64], + "automatic_entropy_tuning": [false], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git a/config/agent/SACDiscreteCNN.json b/config/agent/SACDiscreteCNN.json new file mode 100644 index 0000000..7a89761 --- /dev/null +++ b/config/agent/SACDiscreteCNN.json @@ -0,0 +1,24 @@ +{ + "agent_name": "SACDiscreteCNN", + "parameters": + { + "replay_capacity": [1000000], + "batch_size": [32], + "tau": [0.01], + "alpha": [1.0], + "betas": [[0.9, 0.999]], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "critic_lr": [0.1], + "actor_lr_scale": [10.0], + "alpha_lr": [0.0], + "hidden_dim": [[128]], + "channels": [[16]], + "kernel_sizes": [[3]], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false], + "activation": ["relu"] + } +} + diff --git a/config/agent/VAC.json b/config/agent/VAC.json new file mode 100644 index 0000000..ea5e9b7 --- /dev/null +++ b/config/agent/VAC.json @@ -0,0 +1,21 @@ +{ + "agent_name": "VAC", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "num_samples": [30], + "policy_type": ["Gaussian"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0], + "hidden_dim": [64], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git a/config/agent/VACDiscrete.json b/config/agent/VACDiscrete.json new file mode 100644 index 0000000..cc3073d --- /dev/null +++ b/config/agent/VACDiscrete.json @@ -0,0 +1,21 @@ +{ + "agent_name": "VACDiscrete", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "num_samples": [30], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 10.0], + "hidden_dim": [64], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git 
a/config/agent/VACDiscreteCNN.json b/config/agent/VACDiscreteCNN.json new file mode 100644 index 0000000..dd52ed1 --- /dev/null +++ b/config/agent/VACDiscreteCNN.json @@ -0,0 +1,24 @@ +{ + "agent_name": "VACDiscreteCNN", + "parameters": + { + "replay_capacity": [1000000], + "batch_size": [32], + "tau": [0.01], + "alpha" : [1.0, 1e-1, 1e-2, 1e-3], + "critic_lr" : [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale" : [10.0, 1.0, 1e-1, 1e-2, 1e-3], + "betas": [[0.9, 0.999]], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "alpha_lr": [0.0], + "hidden_dim": [[128]], + "channels": [[16]], + "kernel_sizes": [[3]], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false], + "activation": ["relu"] + } +} + diff --git a/config/environment/AcrobotContinuous-v1.json b/config/environment/AcrobotContinuous-v1.json new file mode 100644 index 0000000..5940ecd --- /dev/null +++ b/config/environment/AcrobotContinuous-v1.json @@ -0,0 +1,9 @@ +{ + "env_name": "Acrobot-v1", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 10000000, + "eval_episodes": 0, + "gamma": 0.99, + "continuous": true +} diff --git a/config/environment/AcrobotDiscrete-v1.json b/config/environment/AcrobotDiscrete-v1.json new file mode 100644 index 0000000..ed1fde8 --- /dev/null +++ b/config/environment/AcrobotDiscrete-v1.json @@ -0,0 +1,9 @@ +{ + "env_name": "Acrobot-v1", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99, + "continuous": false +} diff --git a/config/environment/Asterix.json b/config/environment/Asterix.json new file mode 100644 index 0000000..a9f7862 --- /dev/null +++ b/config/environment/Asterix.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarAsterix", + "total_timesteps": 1500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 100000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/BipedalWalker.json b/config/environment/BipedalWalker.json new file mode 100644 index 0000000..aa2b32b --- /dev/null +++ b/config/environment/BipedalWalker.json @@ -0,0 +1,9 @@ +{ + "env_name": "BipedalWalker-v3", + "total_timesteps": 2500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 10000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false +} diff --git a/config/environment/Breakout.json b/config/environment/Breakout.json new file mode 100644 index 0000000..1262b09 --- /dev/null +++ b/config/environment/Breakout.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarBreakout", + "total_timesteps": 1500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 100000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/Freeway.json b/config/environment/Freeway.json new file mode 100644 index 0000000..e033b27 --- /dev/null +++ b/config/environment/Freeway.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarFreeway", + "total_timesteps": 5000000, + "steps_per_episode": 2500, + "eval_interval_timesteps": 100000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/PendulumContinuous-v0.json 
b/config/environment/PendulumContinuous-v0.json new file mode 100644 index 0000000..ee10244 --- /dev/null +++ b/config/environment/PendulumContinuous-v0.json @@ -0,0 +1,8 @@ +{ + "env_name": "Pendulum-v0", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99 +} diff --git a/config/environment/PendulumDiscrete-v0.json b/config/environment/PendulumDiscrete-v0.json new file mode 100644 index 0000000..ee10244 --- /dev/null +++ b/config/environment/PendulumDiscrete-v0.json @@ -0,0 +1,8 @@ +{ + "env_name": "Pendulum-v0", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99 +} diff --git a/config/environment/Seaquest.json b/config/environment/Seaquest.json new file mode 100644 index 0000000..9d016b4 --- /dev/null +++ b/config/environment/Seaquest.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarSeaquest", + "total_timesteps": 2500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/SpaceInvaders.json b/config/environment/SpaceInvaders.json new file mode 100644 index 0000000..e91ee56 --- /dev/null +++ b/config/environment/SpaceInvaders.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarSpace_Invaders", + "total_timesteps": 1000000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/environment.py b/environment.py new file mode 100644 index 0000000..41f4ad6 --- /dev/null +++ b/environment.py @@ -0,0 +1,207 @@ +#!/usr/bin/env python3 + +# Import modules +import gym +from copy import deepcopy +from env.PendulumEnv import PendulumEnv +from env.Acrobot import AcrobotEnv +from env.Gridworld import GridworldEnv +import env.MinAtar as MinAtar +import numpy as np + + +class Environment: + """ + Environment is a wrapper around concrete implementations of environments + which logs data. + """ + def __init__(self, config, seed, monitor=False, monitor_after=0): + """ + Constructor + + Parameters + ---------- + config : dict + The environment configuration file + seed : int + The seed to use for all random number generators + monitor : bool + Whether or not to render the scenes as the agent learns, by + default False + monitor_after : int + If monitor is True, how many timesteps should pass before + the scene is rendered, by default 0. + """ + + self.steps = 0 + self.episodes = 0 + + # Whether to render the environment, and when to. Useful for debugging. 
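+        # Rendering begins only once `monitor_after` environment steps have
+        # elapsed: step() counts this value down and starts calling render()
+        # after it drops below zero.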
+ self.monitor = monitor + self.steps_until_monitor = monitor_after + + # Set up the wrapped environment + self.env_name = config["env_name"] + self.env = _env_factory(config) + self.env.seed(seed=seed) + self.steps_per_episode = config["steps_per_episode"] + + # Log environment info + if "info" in dir(self.env): + self.info = self.env.info + else: + self.info = {} + + @property + def action_space(self): + """ + Gets the action space of the Gym environment + + Returns + ------- + gym.spaces.Space + The action space + """ + return self.env.action_space + + @property + def observation_space(self): + """ + Gets the observation space of the Gym environment + + Returns + ------- + gym.spaces.Space + The observation space + """ + return self.env.observation_space + + def seed(self, seed): + """ + Seeds the environment with a random seed + + Parameters + ---------- + seed : int + The random seed to seed the environment with + """ + self.env.seed(seed) + + def reset(self): + """ + Resets the environment by resetting the step counter to 0 and resetting + the wrapped environment. This function also increments the total + episode count. + + Returns + ------- + 2-tuple of array_like, dict + The new starting state and an info dictionary + """ + self.steps = 0 + self.episodes += 1 + + state = self.env.reset() + + return state, {"orig_state": state} + + def render(self): + """ + Renders the current frame + """ + self.env.render() + + def step(self, action): + """ + Takes a single environmental step + + Parameters + ---------- + action : array_like of float + The action array. The number of elements in this array should be + the same as the action dimension. + + Returns + ------- + float, array_like of float, bool, dict + The reward and next state as well as a flag specifying if the + current episode has been completed and an info dictionary + """ + if self.monitor and self.steps_until_monitor < 0: + self.render() + elif self.monitor: + self.steps_until_monitor -= ( + 1 if self.steps_until_monitor >= 0 else 0 + ) + + self.steps += 1 + + # Get the next state, reward, and done flag + state, reward, done, info = self.env.step(action) + info["orig_state"] = state + + # If the episode completes, return the goal reward + if done: + info["steps_exceeded"] = False + return state, reward, done, info + + # If the maximum time-step was reached + if self.steps >= self.steps_per_episode > 0: + done = True + info["steps_exceeded"] = True + + return state, reward, done, info + + +def _env_factory(config): + """ + Instantiates and returns an environment given an environment configuration + file. + + Parameters + ---------- + config : dict + The environment config + + Returns + ------- + gym.Env + The environment to train on + """ + name = config["env_name"] + seed = config["seed"] + env = None + + if name == "Pendulum-v0": + env = PendulumEnv(seed=seed, continuous_action=config["continuous"]) + + elif name == "Gridworld": + env = GridworldEnv(config["rows"], config["cols"]) + env.seed(seed) + + elif name == "Acrobot-v1": + env = AcrobotEnv(seed=seed, continuous_action=config["continuous"]) + + # If using MinAtar environments, we need a wrapper to permute the batch + # dimensions to be consistent with PyTorch. 
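+    # For example, the included Breakout config sets "env_name" to
+    # "MinAtarBreakout", which is mapped below to the MinAtar environment
+    # name "breakout".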
+ elif "minatar" in name.lower(): + if "/" in name: + raise ValueError(f"specify environment as MinAtar{name} rather " + + "than MinAtar/{name}") + + minimal_actions = config.get("use_minimal_action_set", True) + stripped_name = name[7:].lower() # Strip off "MinAtar" + + env = MinAtar.BatchFirst( + MinAtar.GymEnv( + stripped_name, + use_minimal_action_set=minimal_actions, + ) + ) + + # Otherwise use a gym environment + else: + env = gym.make(name).env + env.seed(seed) + + return env diff --git a/experiment.py b/experiment.py new file mode 100644 index 0000000..2ff5645 --- /dev/null +++ b/experiment.py @@ -0,0 +1,330 @@ +#!/usr/bin/env python3 + +# Import modules +import time +from datetime import datetime +from copy import deepcopy +import numpy as np + + +class Experiment: + """ + Class Experiment will run a single experiment while logging data. An + experiment consists of a single run of agent-environment interaction. + """ + def __init__(self, agent, env, eval_env, eval_episodes, + total_timesteps, eval_interval_timesteps, max_episodes=-1): + """ + Constructor + + Parameters + ---------- + agent : baseAgent.BaseAgent + The agent to run the experiment on + env : environment.Environment + The environment to use for the experiment + eval_episodes : int + The number of evaluation episodes to run when measuring offline + performance + total_timesteps : int + The maximum number of allowable timesteps per experiment + eval_interval_timesteps: int + The interval of timesteps at which an agent's performance will be + evaluated + state_bins : tuple of int + For the sequence of states used in each update, the number of bins + per dimension with which to bin the states. + min_state_values : array_like + The minimum value of states along each dimension, used to encode + states used in updates to count the number of times states are + used in each update. + max_state_values : array_like + The maximum value of states along each dimension, used to encode + states used in updates to count the number of times states are + used in each update. + action_bins : tuple of int + For the sequence of actions used in each update, the number of bins + per dimension with which to bin the actions. + min_action_values : array_like + The minimum value of actions along each dimension, used to encode + actions used in updates to count the number of times actions are + used in each update. + max_state_values : array_like + The maximum value of actions along each dimension, used to encode + actions used in updates to count the number of times actions are + used in each update. + count_interval : int + The interval of timesteps at which we will store the counts of + state or action bins seen during training or used in updates. At + each timestep, we determine which state/action bins were used in + an update or seen at the current timestep. These values are + accumulated so that the total number of times each bin was + seen/used is stored up to the current timestep. This parameter + controls the timestep interval at which these accumulated values + should be checkpointed. + max_episodes : int + The maximum number of episodes to run. If <= 0, then there is no + episode limit. 
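+
+        Example
+        -------
+        A sketch of how main.py wires an experiment together (`agent`, `env`
+        and `eval_env` are assumed to be already-constructed objects, and the
+        numbers are illustrative):
+
+            exp = Experiment(agent, env, eval_env, eval_episodes=10,
+                             total_timesteps=100000,
+                             eval_interval_timesteps=10000)
+            exp.run()
+            returns = exp.info["train_episode_rewards"]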
+ """ + self.agent = agent + self.env = env + self.eval_env = eval_env + self.eval_env.monitor = False + + self.eval_episodes = eval_episodes + self.max_episodes = max_episodes + + # Track the number of time steps + self.timesteps_since_last_eval = 0 + self.eval_interval_timesteps = eval_interval_timesteps + self.timesteps_elapsed = 0 + self.total_timesteps = total_timesteps + + # Keep track of number of training episodes + self.train_episodes = 0 + + # Track the returns seen at each training episode + self.train_ep_return = [] + + # Track the steps per each training episode + self.train_ep_steps = [] + + # Track the steps at which evaluation occurs + self.timesteps_at_eval = [] + + # Track the returns seen at each eval episode + self.eval_ep_return = [] + + # Track the number of evaluation steps taken in each evaluation episode + self.eval_ep_steps = [] + + # Anything the experiment tracks + self.info = {} + + # Track the total training and evaluation time + self.train_time = 0.0 + self.eval_time = 0.0 + + def run(self): + """ + Runs the experiment + + Returns + ------- + 14-tuple of list of float, float, int + The online training episodic return, the return per + episode when evaluating offline, the training steps per + episode, the evaluation steps per episode when evaluating + offline, the list of timesteps at which the evaluation episodes + were run, the total amount of training time, the total amount + of evaluation time, and the number of total training episodes, + and the sequence of state, rewards, and actions during training. + Also returns the states, actions, and next states used in each + update to the agent. + """ + # Count total run time + start_run = time.time() + print(f"Starting experiment at: {datetime.now()}") + + # Evaluate once at the beginning + self.eval_time += self.eval() + self.timesteps_at_eval.append(self.timesteps_elapsed) + + # Train + i = 0 + while self.timesteps_elapsed < self.total_timesteps and \ + (self.train_episodes < self.max_episodes if + self.max_episodes > 0 else True): + + # Run the training episode and save the relevant info + ep_reward, ep_steps, train_time = self.run_episode_train() + self.train_ep_return.append(ep_reward) + self.train_ep_steps.append(ep_steps) + self.train_time += train_time + print(f"=== Train ep: {i}, r: {ep_reward}, n_steps: {ep_steps}, " + + f"elapsed: {train_time}") + i += 1 + + # Evaluate once at the end + self.eval_time += self.eval() + self.timesteps_at_eval.append(self.timesteps_elapsed) + + end_run = time.time() + print(f"End run at time {datetime.now()}") + print(f"Total time taken: {end_run - start_run}") + print(f"Training time: {self.train_time}") + print(f"Evaluation time: {self.eval_time}") + + self.info["eval_episode_rewards"] = np.array(self.eval_ep_return) + self.info["eval_episode_steps"] = np.array(self.eval_ep_steps) + self.info["timesteps_at_eval"] = np.array(self.timesteps_at_eval) + self.info["train_episode_steps"] = np.array(self.train_ep_steps) + self.info["train_episode_rewards"] = np.array(self.train_ep_return) + self.info["train_time"] = self.train_time + self.info["eval_time"] = self.eval_time + self.info["total_train_episodes"] = self.train_episodes + + def run_episode_train(self): + """ + Runs a single training episode, saving the evaluation metrics in + the corresponding instance variables. 
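+        The episode ends when the wrapped environment signals `done` (which
+        includes reaching the environment's `steps_per_episode` limit) or
+        when the experiment's total timestep budget is exhausted part-way
+        through the episode.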
+ + Returns + ------- + float, int, float + The return for the episode, the number of steps in the episode, + and the total amount of training time for the episode + """ + # Reset the agent + self.agent.reset() + + self.train_episodes += 1 + + # Track the sequences of states, rewards, and actions during training + # episode_states = [] + episode_rewards = [] + # episode_actions = [] + + start = time.time() + episode_return = 0.0 + episode_steps = 0 + + state, _ = self.env.reset() + + done = False + action = self.agent.sample_action(state) + + while not done: + # Evaluate offline at the appropriate intervals + if self.timesteps_since_last_eval >= \ + self.eval_interval_timesteps: + self.eval_time += self.eval() + self.timesteps_at_eval.append(self.timesteps_elapsed) + + # Sample the next transition + next_state, reward, done, info = self.env.step(action) + episode_steps += 1 + + # episode_states.append(next_state_info["orig_state"]) + episode_rewards.append(reward) + episode_return += reward + + # Compute the done mask, which is 1 if the episode terminated + # without the goal being reached or the episode is incomplete, + # and 0 if the agent reached the goal or terminal state + if self.env.steps_per_episode <= 1: + done_mask = 0 + else: + if episode_steps <= self.env.steps_per_episode and done and \ + not info["steps_exceeded"]: + done_mask = 0 + else: + done_mask = 1 + + # Update agent + self.agent.update(state, action, reward, next_state, done_mask) + + # Continue the episode if not done + if not done: + action = self.agent.sample_action(next_state) + state = next_state + + # Keep track of the timesteps since we last evaluated so we know + # when to evaluate again + self.timesteps_since_last_eval += 1 + + # Keep track of timesteps since we train for a specified number of + # timesteps + self.timesteps_elapsed += 1 + + # Stop if we are at the max allowable timesteps + if self.timesteps_elapsed >= self.total_timesteps: + break + + end = time.time() + + return episode_return, episode_steps, (end-start) + + def eval(self): + """ + Evaluates the agent's performance offline, for the appropriate number + of offline episodes as determined by the self.eval_episodes + instance variable. While evaluating, this function will populate the + appropriate instance variables with the evaluation data. 
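+        Note that if `self.eval_episodes` is 0 (as in several of the included
+        environment configuration files), the evaluation loop body never
+        runs, and this method just resets the evaluation counter, records
+        empty evaluation lists, and toggles the agent between evaluation and
+        training modes.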
+ + Returns + ------- + float + The total amount of evaluation time + """ + self.timesteps_since_last_eval = 0 + + # Set the agent to evaluation mode + self.agent.eval() + + # Save the episodic return and the number of steps per episode + temp_rewards_per_episode = [] + episode_steps = [] + eval_session_time = 0.0 + + # Evaluate offline + for i in range(self.eval_episodes): + eval_start_time = time.time() + episode_reward, num_steps = self.run_episode_eval() + eval_end_time = time.time() + + # Save the evaluation data + temp_rewards_per_episode.append(episode_reward) + episode_steps.append(num_steps) + + # Calculate time + eval_elapsed_time = eval_end_time - eval_start_time + eval_session_time += eval_elapsed_time + + # Display the offline episodic return + print("=== EVAL ep: " + str(i) + ", r: " + + str(episode_reward) + ", n_steps: " + str(num_steps) + + ", elapsed: " + + time.strftime("%H:%M:%S", time.gmtime(eval_elapsed_time))) + + # Save evaluation data + self.eval_ep_return.append(temp_rewards_per_episode) + self.eval_ep_steps.append(episode_steps) + + self.eval_time += eval_session_time + + # Return the agent to training mode + self.agent.train() + + return eval_session_time + + def run_episode_eval(self): + """ + Runs a single evaluation episode. + + Returns + ------- + float, int, list + The episodic return and number of steps and the sequence of states, + rewards, and actions during the episode + """ + state, _ = self.eval_env.reset() + + episode_return = 0.0 + episode_steps = 0 + done = False + + action = self.agent.sample_action(state) + + while not done: + next_state, reward, done, _ = self.eval_env.step(action) + + episode_return += reward + + if not done: + action = self.agent.sample_action(next_state) + + state = next_state + episode_steps += 1 + + return episode_return, episode_steps diff --git a/main.py b/main.py new file mode 100644 index 0000000..ba9f908 --- /dev/null +++ b/main.py @@ -0,0 +1,237 @@ +#!usr/bin/env python3 + +# Import modules +import numpy as np +import environment +import experiment +import pickle +from utils import experiment_utils as exp_utils +import click +import json +from copy import deepcopy +import os +import utils.hypers as hypers + + +@click.command(help="""Given agent and environment configuration files, run + the experiment defined by the configuration files + """) +@click.option("--env-json", help="Path to the environment json " + + "configuration file", + type=str, required=True) +@click.option("--agent-json", help="Path to the agent json configuration file", + type=str, required=True) +@click.option("--index", type=int, required=False, help="The index " + + "of the hyperparameter to run", default=1) +@click.option("--monitor", "-m", is_flag=True, help="Whether or not to " + + "render the scene as the agent trains.", type=bool) +@click.option("--after", "-a", type=int, default=-1, help="How many " + + "timesteps (training) should pass before " + + "rendering the scene") +@click.option("--save-dir", type=str, default="./results", help="Which " + + "directory to save the results file in", required=False) +def run(env_json, agent_json, index, monitor, after, save_dir): + """ + Perform runs over hyperparameter settings. + + Performs the runs on the hyperparameter settings indices specified by + range(start, stop step), with values over the total number of + hyperparameters wrapping around to perform successive runs on the same + hyperparameter settings. 
For example, if there are 10 hyperparameter + settings and we run with hyperparameter settings 12, then this is the + (12 // 10) = 1 run of hyperparameter settings 12 % 10 = 2, where runs + are 0-based indexed. + + Parameters + ---------- + env_json : str + The path to the JSON environment configuration file + agent_json : str + The path to the JSON agent configuration file + start : int + The hyperparameter index to start the sweep at + stop : int + The hyperparameter index to stop the sweep at + step : int + The stepping value between hyperparameter settings indices + monitor : bool + Whether or not to render the scene as the agent trains + after : int + How many training + evaluation timesteps should pass before rendering + the scene + save_dir : str + The directory to save the data in + """ + # Read the config files + with open(env_json) as in_json: + env_config = json.load(in_json) + with open(agent_json) as in_json: + agent_config = json.load(in_json) + + main(agent_config, env_config, index, monitor, after, save_dir) + + +def main(agent_config, env_config, index, monitor, after, + save_dir="./results"): + """ + Runs experiments on the agent and environment corresponding the the input + JSON files using the hyperparameter settings corresponding to the indices + returned from range(start, stop, step). + + Saves a pickled python dictionary of all training and evaluation data. + + Note: this function will run the experiments sequentially. + + Parameters + ---------- + agent_json : dict + The agent JSON configuration file, as a Python dict + env_json : dict + The environment JSON configuration file, as a Python dict + index : int + The index of the hyperparameter setting to run + monitor : bool + Whether to render the scene as the agent trains or not + after : int + How many training + evaluation timesteps should pass before rendering + the scene + save_dir : str + The directory to save all data in + """ + # Create the data dictionary + data = {} + data["experiment"] = {} + + # Experiment meta-data + data["experiment"]["environment"] = env_config + data["experiment"]["agent"] = agent_config + + # Experiment runs per each hyperparameter + data["experiment_data"] = {} + + # Calculate the number of timesteps before rendering. 
It is inputted as + # number of training steps, but the environment uses training + eval steps + if after >= 0: + eval_steps = env_config["eval_episodes"] * \ + env_config["steps_per_episode"] + eval_intervals = 1 + (after // env_config["eval_interval_timesteps"]) + after = after + eval_steps * eval_intervals + print(f"Evaluation intervals before monitor: {eval_intervals}") + + # Get the directory to save in + if not save_dir.startswith("./results"): + save_dir = os.path.join("./results", save_dir) + save_dir = os.path.join(save_dir, env_config["env_name"] + "_" + + agent_config["agent_name"] + "results/") + # Run the experiments + # Get agent params from config file for the next experiment + agent_run_params, total_sweeps = hypers.sweeps( + agent_config["parameters"], index) + agent_run_params["gamma"] = env_config["gamma"] + + print(f"Total number of hyperparam combinations: {total_sweeps}") + + # Calculate the run number and the random seed + RUN_NUM = index // total_sweeps + RANDOM_SEED = np.iinfo(np.int16).max - RUN_NUM + + # Create the environment + env_config["seed"] = RANDOM_SEED + if agent_config["agent_name"] == "linearAC" or \ + agent_config["agent_name"] == "linearAC_softmax": + if "use_tile_coding" in env_config: + use_tile_coding = env_config["use_tile_coding"] + env_config["use_full_tile_coding"] = use_tile_coding + del env_config["use_tile_coding"] + + env = environment.Environment(env_config, RANDOM_SEED, monitor, after) + eval_env = environment.Environment(env_config, RANDOM_SEED) + + num_features = env.observation_space.shape[0] + agent_run_params["feature_size"] = num_features + + # Set up the data dictionary to store the data from each run + hp_sweep = index % total_sweeps + if hp_sweep not in data["experiment_data"].keys(): + data["experiment_data"][hp_sweep] = {} + data["experiment_data"][hp_sweep]["agent_hyperparams"] = \ + dict(agent_run_params) + data["experiment_data"][hp_sweep]["runs"] = [] + + SETTING_NUM = index % total_sweeps + TOTAL_TIMESTEPS = env_config["total_timesteps"] + MAX_EPISODES = env_config.get("max_episodes", -1) + EVAL_INTERVAL = env_config["eval_interval_timesteps"] + EVAL_EPISODES = env_config["eval_episodes"] + + # Store the seed in the agent run parameters so that batch algorithms + # can sample randomly + agent_run_params["seed"] = RANDOM_SEED + + # Include the environment observation and action spaces in the agent's + # configuration so that neural networks can have the corrent number of + # output nodes + agent_run_params["observation_space"] = env.observation_space + agent_run_params["action_space"] = env.action_space + + # Saving this data is redundant since we save the env_config file as + # well. 
Also, each run has the run number as the random seed + run_data = {} + run_data["run_number"] = RUN_NUM + run_data["random_seed"] = RANDOM_SEED + run_data["total_timesteps"] = TOTAL_TIMESTEPS + run_data["eval_interval_timesteps"] = EVAL_INTERVAL + run_data["episodes_per_eval"] = EVAL_EPISODES + + # Print some data about the run + print(f"SETTING_NUM: {SETTING_NUM}") + print(f"RUN_NUM: {RUN_NUM}") + print(f"RANDOM_SEED: {RANDOM_SEED}") + print('Agent setting: ', agent_run_params) + + # Create the agent + print(agent_config["agent_name"]) + agent_run_params["env"] = env + agent = exp_utils.create_agent(agent_config["agent_name"], + agent_run_params) + + # Initialize and run experiment + exp = experiment.Experiment( + agent, + env, + eval_env, + EVAL_EPISODES, + TOTAL_TIMESTEPS, + EVAL_INTERVAL, + MAX_EPISODES, + ) + exp.run() + + # Save the agent's learned parameters, with these parameters and the + # hyperparams, training can be exactly resumed from the end of the run + run_data["learned_params"] = agent.get_parameters() + + # Save any information the agent saved during training + run_data = {**run_data, **agent.info, **exp.info, **env.info} + + # Save data in parent dictionary + data["experiment_data"][hp_sweep]["runs"].append(run_data) + + # After each run, save the data. Since data is accumulated, the + # later runs will overwrite earlier runs with updated data. + if not os.path.exists(save_dir): + os.makedirs(save_dir) + + save_file = save_dir + env_config["env_name"] + "_" + \ + agent_config["agent_name"] + f"_data_{index}.pkl" + + print("=== Saving ===") + print(save_file) + print("==============") + with open(save_file, "wb") as out_file: + pickle.dump(data, out_file) + + +if __name__ == "__main__": + # run_concurrent() + run() diff --git a/minatar/setup.py b/minatar/setup.py new file mode 100644 index 0000000..0c54513 --- /dev/null +++ b/minatar/setup.py @@ -0,0 +1,38 @@ +from setuptools import setup + +packages = ['minatar', 'minatar.environments'] +install_requires = [ + 'cycler>=0.10.0', + 'kiwisolver>=1.0.1', + 'matplotlib>=3.0.3', + 'numpy>=1.16.2', + 'pandas>=0.24.2', + 'pyparsing>=2.3.1', + 'python-dateutil>=2.8.0', + 'pytz>=2018.9', + 'scipy>=1.2.1', + 'seaborn>=0.9.0', + 'six>=1.12.0', +] + +examples_requires = [ + 'torch>=1.0.0', +] + +entry_points = { + 'gym.envs': ['MinAtar=minatar.gym:register_envs'] +} + +setup( + name='MinAtar-Faster', + version='1.1.0', + description='A faster miniaturized version of the arcade learning environment.', + url='https://github.com/kenjyoung/MinAtar', + author='Robert Joseph George', + author_email='rjoseph1@ualberta.com', + license='GPL', + packages=packages, + entry_points=entry_points, + install_requires=install_requires, + extras_require={'examples': examples_requires}, +) diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..355c11f --- /dev/null +++ b/requirements.txt @@ -0,0 +1,85 @@ +appnope==0.1.3 +asttokens==2.0.5 +autopep8==1.6.0 +backcall==0.2.0 +bootstrapped==0.0.2 +cffi==1.15.0 +click==8.1.2 +cloudpickle==2.0.0 +colorama==0.4.4 +commonmark==0.9.1 +cycler==0.11.0 +Cython==0.29.28 +debugpy==1.6.0 +decorator==5.1.1 +entrypoints==0.4 +executing==0.8.3 +fasteners==0.17.3 +filelock==3.6.0 +flake8==4.0.1 +fonttools==4.32.0 +glfw==2.5.3 +gym==0.23.1 +gym-notices==0.0.6 +h5py==3.6.0 +imageio==2.18.0 +importlib-metadata==4.11.3 +industrial-benchmark-python==2.0 +ipykernel==6.13.0 +ipython==8.2.0 +itermplot==0.331 +jedi==0.18.1 +jupyter-client==7.2.2 +jupyter-core==4.10.0 +kernel-driver==0.0.7 
+kiwisolver==1.4.2 +llvmlite==0.38.0 +matplotlib==3.5.1 +matplotlib-inline==0.1.3 +mccabe==0.6.1 +mujoco-py==2.1.2.14 +nest-asyncio==1.5.5 +numba==0.55.1 +numpy==1.21.6 +packaging==21.3 +pandas==1.4.2 +parso==0.8.3 +pexpect==4.8.0 +pickleshare==0.7.5 +Pillow==9.1.0 +prompt-toolkit==3.0.29 +psutil==5.9.0 +ptyprocess==0.7.0 +pure-eval==0.2.2 +pycodestyle==2.8.0 +pycparser==2.21 +git+ssh://git@github.com/andnp/PyExpUtils@2.18#egg=PyExpUtils +git+ssh://git@github.com/andnp/PyExpPlotting@0.7#egg=PyExpPlotting +git+ssh://git@github.com/andnp/PyFixedReps@0.5#egg=PyFixedReps +git+ssh://git@github.com/andnp/PyRlEnvs@0.19#egg=PyRlEnvs +pyflakes==2.4.0 +pygame==2.1.2 +pyglet==1.5.23 +Pygments==2.11.2 +pyparsing==3.0.8 +python-dateutil==2.8.2 +pytz==2022.1 +pyzmq==22.3.0 +rich==12.2.0 +rlglue==2.2 +scipy==1.8.0 +seaborn==0.11.2 +setuptools-scm==6.4.2 +six==1.16.0 +stack-data==0.2.0 +toml==0.10.2 +tomli==2.0.1 +torch==1.11.0 +tornado==6.1 +tqdm==4.64.0 +traitlets==5.1.1 +typer==0.4.1 +typing_extensions==4.2.0 +wcwidth==0.2.5 +wrapt==1.14.0 +zipp==3.8.0 diff --git a/utils/TruncatedNormal.py b/utils/TruncatedNormal.py new file mode 100644 index 0000000..702b87e --- /dev/null +++ b/utils/TruncatedNormal.py @@ -0,0 +1,148 @@ +# Taken from: +# # https://github.com/toshas/torch_truncnorm/blob/main/TruncatedNormal.py + +import math +from numbers import Number + +import torch +from torch.distributions import Distribution, constraints +from torch.distributions.utils import broadcast_all + +CONST_SQRT_2 = math.sqrt(2) +CONST_INV_SQRT_2PI = 1 / math.sqrt(2 * math.pi) +CONST_INV_SQRT_2 = 1 / math.sqrt(2) +CONST_LOG_INV_SQRT_2PI = math.log(CONST_INV_SQRT_2PI) +CONST_LOG_SQRT_2PI_E = 0.5 * math.log(2 * math.pi * math.e) + + +class TruncatedStandardNormal(Distribution): + """ + Truncated Standard Normal distribution + https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf + """ + + arg_constraints = { + 'a': constraints.real, + 'b': constraints.real, + } + has_rsample = True + + def __init__(self, a, b, validate_args=None): + self.a, self.b = broadcast_all(a, b) + if isinstance(a, Number) and isinstance(b, Number): + batch_shape = torch.Size() + else: + batch_shape = self.a.size() + super(TruncatedStandardNormal, self).__init__( + batch_shape, validate_args=validate_args) + if self.a.dtype != self.b.dtype: + raise ValueError('Truncation bounds types are different') + if any((self.a >= self.b).view(-1,).tolist()): + raise ValueError('Incorrect truncation range') + eps = torch.finfo(self.a.dtype).eps + self._dtype_min_gt_0 = eps + self._dtype_max_lt_1 = 1 - eps + self._little_phi_a = self._little_phi(self.a) + self._little_phi_b = self._little_phi(self.b) + self._big_phi_a = self._big_phi(self.a) + self._big_phi_b = self._big_phi(self.b) + self._Z = (self._big_phi_b - self._big_phi_a).clamp_min(eps) + self._log_Z = self._Z.log() + little_phi_coeff_a = torch.nan_to_num(self.a, nan=math.nan) + little_phi_coeff_b = torch.nan_to_num(self.b, nan=math.nan) + self._lpbb_m_lpaa_d_Z = (self._little_phi_b * little_phi_coeff_b - + self._little_phi_a * + little_phi_coeff_a) / self._Z + self._mean = -(self._little_phi_b - self._little_phi_a) / self._Z + self._variance = 1 - self._lpbb_m_lpaa_d_Z - ((self._little_phi_b - + self._little_phi_a) / + self._Z) ** 2 + self._entropy = CONST_LOG_SQRT_2PI_E + self._log_Z - 0.5 * \ + self._lpbb_m_lpaa_d_Z + + @constraints.dependent_property + def support(self): + return constraints.interval(self.a, self.b) + + @property + def mean(self): + return self._mean + + @property + 
def variance(self): + return self._variance + + @property + def entropy(self): + return self._entropy + + @property + def auc(self): + return self._Z + + @staticmethod + def _little_phi(x): + return (-(x ** 2) * 0.5).exp() * CONST_INV_SQRT_2PI + + @staticmethod + def _big_phi(x): + return 0.5 * (1 + (x * CONST_INV_SQRT_2).erf()) + + @staticmethod + def _inv_big_phi(x): + return CONST_SQRT_2 * (2 * x - 1).erfinv() + + def cdf(self, value): + if self._validate_args: + self._validate_sample(value) + return ((self._big_phi(value) - self._big_phi_a) / self._Z).clamp(0, 1) + + def icdf(self, value): + return self._inv_big_phi(self._big_phi_a + value * self._Z) + + def log_prob(self, value): + if self._validate_args: + self._validate_sample(value) + return CONST_LOG_INV_SQRT_2PI - self._log_Z - (value ** 2) * 0.5 + + def rsample(self, sample_shape=torch.Size()): + shape = self._extended_shape(sample_shape) + p = torch.empty(shape, device=self.a.device).uniform_( + self._dtype_min_gt_0, self._dtype_max_lt_1) + return self.icdf(p) + + +class TruncatedNormal(TruncatedStandardNormal): + """ + Truncated Normal distribution + https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf + """ + + has_rsample = True + + def __init__(self, loc, scale, a, b, validate_args=None): + self.loc, self.scale, a, b = broadcast_all(loc, scale, a, b) + a = (a - self.loc) / self.scale + b = (b - self.loc) / self.scale + super(TruncatedNormal, self).__init__(a, b, + validate_args=validate_args) + self._log_scale = self.scale.log() + self._mean = self._mean * self.scale + self.loc + self._variance = self._variance * self.scale ** 2 + self._entropy += self._log_scale + + def _to_std_rv(self, value): + return (value - self.loc) / self.scale + + def _from_std_rv(self, value): + return value * self.scale + self.loc + + def cdf(self, value): + return super(TruncatedNormal, self).cdf(self._to_std_rv(value)) + + def icdf(self, value): + return self._from_std_rv(super(TruncatedNormal, self).icdf(value)) + + def log_prob(self, value): + return super(TruncatedNormal, self).log_prob(self._to_std_rv(value)) \ + - self._log_scale diff --git a/utils/experience_replay.py b/utils/experience_replay.py new file mode 100644 index 0000000..01bdd03 --- /dev/null +++ b/utils/experience_replay.py @@ -0,0 +1,270 @@ +# Import modules +import numpy as np +import torch +from abc import ABC, abstractmethod + + +# Class definitions +class ExperienceReplay(ABC): + """ + Abstract base class ExperienceReplay implements an experience replay + buffer. The specific kind of buffer is determined by classes which + implement this base class. For example, NumpyBuffer stores all + transitions in a numpy array while TorchBuffer implements the buffer + as a torch tensor. + + Attributes + ---------- + self.cast : func + A function which will cast data into an appropriate form to be + stored in the replay buffer. All incoming data is assumed to be + a numpy array. 
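+        By default `cast` is the identity function; TorchBuffer overrides it
+        with `torch.from_numpy` so that incoming transitions are stored as
+        tensors.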
+ """ + def __init__(self, capacity, seed, state_size, action_size, + device=None): + """ + Constructor + + Parameters + ---------- + capacity : int + The capacity of the buffer + seed : int + The random seed used for sampling from the buffer + state_size : tuple[int] + The number of dimensions of the state features + action_size : int + The number of dimensions in the action vector + """ + self.device = device + self.is_full = False + self.position = 0 + self.capacity = capacity + + # Set the casting function, which is needed for implementations which + # may keep the ER buffer as a different data structure, for example + # a torch tensor, in this case all data needs to be cast to a torch + # tensor before storing + self.cast = lambda x: x + + # Set the random number generator + self.random = np.random.default_rng(seed=seed) + + # Save the size of states and actions + self.state_size = state_size + self.action_size = action_size + + # Buffer of state, action, reward, next_state, done + self.state_buffer = None + self.action_buffer = None + self.reward_buffer = None + self.next_state_buffer = None + self.done_buffer = None + self.init_buffer() + + @abstractmethod + def init_buffer(self): + """ + Initializes the buffers on which to store transitions. + + Note that different classes which implement this abstract base class + may use different data types as buffers. For example, NumpyBuffer + stores all transitions using a numpy array, while TorchBuffer + stores all transitions on a torch Tensor on a specific device in order + to speed up training by keeping transitions on the same device as + the device which holds the model. + + Post-Condition + -------------- + The replay buffer self.buffer has been initialized + """ + pass + + def push(self, state, action, reward, next_state, done): + """ + Pushes a trajectory onto the replay buffer + + Parameters + ---------- + state : array_like + The state observation + action : array_like + The action taken by the agent in the state + reward : float + The reward seen after taking the argument action in the argument + state + next_state : array_like + The next state transitioned to + done : bool + Whether or not the transition was a transition to a goal state + """ + reward = np.array([reward]) + done = np.array([done]) + + state = self.cast(state) + action = self.cast(action) + reward = self.cast(reward) + next_state = self.cast(next_state) + done = self.cast(done) + + self.state_buffer[self.position] = state + self.action_buffer[self.position] = action + self.reward_buffer[self.position] = reward + self.next_state_buffer[self.position] = next_state + self.done_buffer[self.position] = done + + if self.position >= self.capacity - 1: + self.is_full = True + self.position = (self.position + 1) % self.capacity + + def sample(self, batch_size): + """ + Samples a random batch from the buffer + + Parameters + ---------- + batch_size : int + The size of the batch to sample + + Returns + ------- + 5-tuple of torch.Tensor + The arrays of state, action, reward, next_state, and done from the + batch + """ + # Get the indices for the batch + if self.is_full: + indices = self.random.integers(low=0, high=len(self), + size=batch_size) + else: + indices = self.random.integers(low=0, high=self.position, + size=batch_size) + + # # Sample the batch + # batch = self.buffer[indices] + + # # Keep running indices and get state sample + # start = 0 + # end = self.state_size + # state = batch[:, start:end] + + # # Action sample + # start = end + # end += self.action_size + # 
action = batch[:, start:end] + + # # Reward sample + # start = end + # end += 1 + # reward = batch[:, start:end] + + # # Next state sample + # start = end + # end += self.state_size + # next_state = batch[:, start:end] + + # # Done mask sample + # start = end + # done = batch[:, start:] + + state = self.state_buffer[indices, :] + action = self.action_buffer[indices, :] + reward = self.reward_buffer[indices] + next_state = self.next_state_buffer[indices, :] + done = self.done_buffer[indices] + + return state, action, reward, next_state, done + + def __len__(self): + """ + Gets the number of elements in the buffer + + Returns + ------- + int + The number of elements currently in the buffer + """ + if not self.is_full: + return self.position + else: + return self.capacity + + +class NumpyBuffer(ExperienceReplay): + """ + Class NumpyBuffer implements an experience replay buffer. This + class stores all states, actions, and rewards as numpy arrays. + For an implementation that uses PyTorch tensors, see + TorchExperienceReplay + """ + def __init__(self, capacity, seed, state_size, action_size): + """ + Constructor + + Parameters + ---------- + capacity : int + The capacity of the buffer + seed : int + The random seed used for sampling from the buffer + state_size : tuple[int] + The dimensions of the state features + action_size : int + The number of dimensions in the action vector + """ + super().__init__(capacity, seed, state_size, action_size, None) + + def init_buffer(self): + self.state_buffer = np.zeros((self.capacity, *self.state_size)) + self.next_state_buffer = np.zeros((self.capacity, *self.state_size)) + self.action_buffer = np.zeros(self.capacity, self.action_size) + self.reward_buffer = np.zeros((self.capacity, 1)) + self.done_buffer = np.zeros((self.capacity, 1)) + + +class TorchBuffer(ExperienceReplay): + """ + Class TorchBuffer implements an experience replay buffer. The + difference between this class and the ExperienceReplay class is that this + class keeps all experiences as a torch Tensor on the appropriate device + so that if using PyTorch, we do not need to cast the batch to a + FloatTensor every time we sample and then place it on the appropriate + device, as this is very time consuming. This class is basically a + PyTorch efficient implementation of ExperienceReplay. 
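+
+    A minimal usage sketch (the shapes and device are illustrative; push()
+    expects NumPy arrays, which are cast with torch.from_numpy):
+
+        buffer = TorchBuffer(capacity=100000, seed=0, state_size=(4,),
+                             action_size=1, device=torch.device("cpu"))
+        state = np.zeros(4, dtype=np.float32)
+        action = np.zeros(1, dtype=np.float32)
+        buffer.push(state, action, 0.0, state, False)
+        s, a, r, ns, d = buffer.sample(batch_size=32)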
+ """ + def __init__(self, capacity, seed, state_size, action_size, device): + """ + Constructor + + Parameters + ---------- + capacity : int + The capacity of the buffer + seed : int + The random seed used for sampling from the buffer + device : torch.device + The device on which the buffer instances should be stored + state_size : int + The number of dimensions in the state feature vector + action_size : int + The number of dimensions in the action vector + """ + super().__init__(capacity, seed, state_size, action_size, device) + self.cast = torch.from_numpy + + def init_buffer(self): + self.state_buffer = torch.FloatTensor(self.capacity, *self.state_size) + self.state_buffer = self.state_buffer.to(self.device) + + self.next_state_buffer = torch.FloatTensor(self.capacity, + *self.state_size) + self.next_state_buffer = self.next_state_buffer.to(self.device) + + self.action_buffer = torch.FloatTensor(self.capacity, self.action_size) + self.action_buffer = self.action_buffer.to(self.device) + + self.reward_buffer = torch.FloatTensor(self.capacity, 1) + self.reward_buffer = self.reward_buffer.to(self.device) + + self.done_buffer = torch.FloatTensor(self.capacity, 1) + self.done_buffer = self.done_buffer.to(self.device) diff --git a/utils/experiment_utils.py b/utils/experiment_utils.py new file mode 100644 index 0000000..85561d6 --- /dev/null +++ b/utils/experiment_utils.py @@ -0,0 +1,1415 @@ +# Import modules +import os +import numpy as np +from glob import glob +# from env.tile_coder import TileCoding +import pickle +from tqdm import tqdm +from copy import deepcopy +import bootstrapped.bootstrap as bs +import bootstrapped.stats_functions as bs_stats +from scipy import signal as signal +try: + import runs +except ModuleNotFoundError: + import utils.runs + + +def create_agent(agent, config): + """ + Creates an agent given the agent name and configuration dictionary + + Parameters + ---------- + agent : str + The name of the agent + config : dict + The agent configuration dictionary + + Returns + ------- + baseAgent.BaseAgent + The agent to train + """ + # Random agent + if agent.lower() == "random": + from agent.Random import Random + return Random(config["action_space"], config["seed"]) + + # Sarsa(λ) + if agent.lower() == "sarsa": + from agent.linear.Sarsa import Sarsa + return Sarsa( + decay=config["decay"], + lr=config["lr"], + gamma=config["gamma"], + epsilon=config["epsilon"], + action_space=config["action_space"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + env=config["env"], + trace_type=config["trace_type"], + policy_type=config["policy_type"], + ) + + # 𝔼Sarsa(λ) + if agent.lower() == "esarsa": + from agent.linear.ESarsa import ESarsa + return ESarsa( + decay=config["decay"], + lr=config["lr"], + gamma=config["gamma"], + epsilon=config["epsilon"], + action_space=config["action_space"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + env=config["env"], + trace_type=config["trace_type"], + ) + + # Linear-Gaussian Actor-Critic + if agent.lower() == "LinearGaussianAC".lower(): + from agent.GaussianAC import GaussianAC + return GaussianAC( + decay=config["decay"], + actor_lr_scale=config["actor_lr_scale"], + critic_lr=config["critic_lr"], + gamma=config["gamma"], + accumulate_trace=config["accumulate_trace"], + action_space=config["action_space"], + scaled=config["scaled"], + clip_stddev=config["clip_stddev"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + 
env=config["env"], + use_critic_trace=config["use_critic_trace"], + use_actor_trace=config["use_actor_trace"], + trace_type=config["trace_type"], + ) + + # Linear-Softmax Actor-Critic + if agent.lower() == "LinearSoftmaxAC".lower(): + from agent.linear.SoftmaxAC import SoftmaxAC + return SoftmaxAC( + decay=config["decay"], + actor_lr=config["actor_lr"], + critic_lr=config["critic_lr"], + gamma=config["gamma"], + accumulate_trace=config["accumulate_trace"], + action_space=config["action_space"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + env=config["env"], + use_critic_trace=config["use_critic_trace"], + use_actor_trace=config["use_actor_trace"], + trace_type=config["trace_type"], + temperature=config["temperature"] + ) + + # FKL + if agent.lower() == "fkl": + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + # Vanilla Actor Critic using FKL + from agent.nonlinear.FKL import FKL + return FKL( + num_inputs=config["feature_size"], + action_space=config["action_space"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config["target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + num_samples=config["num_samples"], activation="relu", + env=config["env"], + ) + + # Vanilla Actor-Critic + if agent.lower() == "VAC".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + from agent.nonlinear.VAC import VAC + return VAC( + num_inputs=config["feature_size"], + action_space=config["action_space"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config["target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + num_samples=config["num_samples"], activation="relu", + env=config["env"], + ) + + # Discrete Vanilla Actor-Critic + if agent.lower() == "VACDiscrete".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + from agent.nonlinear.VACDiscrete import VACDiscrete + return VACDiscrete( + num_inputs=config["feature_size"], + action_space=config["action_space"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config[ + "target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], + clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation="relu", + ) + + # Soft Actor-Critic + if agent.lower() == "SAC".lower(): + if "activation" in 
config: + activation = config["activation"] + else: + activation = "relu" + + if "num_hidden" in config: + num_hidden = config["num_hidden"] + else: + num_hidden = 3 + from agent.nonlinear.SAC import SAC + return SAC( + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config["target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + alpha_lr=config["alpha_lr"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + automatic_entropy_tuning=config["automatic_entropy_tuning"], + cuda=config["cuda"], clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation=activation, env=config["env"], + ) + + # Discrete Soft Actor-Critic + if agent.lower() == "SACDiscrete".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + if "num_hidden" in config: + num_hidden = config["num_hidden"] + else: + num_hidden = 3 + + from agent.nonlinear.SACDiscrete import SACDiscrete + return SACDiscrete( + env=config["env"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config[ + "target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + alpha_lr=config["alpha_lr"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + automatic_entropy_tuning=config[ + "automatic_entropy_tuning"], + cuda=config["cuda"], + clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation=activation, + ) + + # Discrete Soft Actor-Critic + CNN + if agent.lower() == "SACDiscreteCNN".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + from agent.nonlinear.SACDiscreteCNN import SACDiscrete + return SACDiscrete( + env=config["env"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config[ + "target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + alpha_lr=config["alpha_lr"], + hidden_dim=config["hidden_dim"], + channels=config["channels"], + kernel_sizes=config["kernel_sizes"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], + clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation=activation, + ) + + raise NotImplementedError("No agent " + agent) + + +def _calculate_mean_return_episodic(hp_returns, type_, after=0): + """ + Calculates the mean return for an experiment run on an episodic environment + over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. 
+ type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + if type_ == "eval": + hp_returns = [np.mean(hp_returns[i][after:], axis=-1) for i in + range(len(hp_returns))] + + # Calculate the average return for all episodes in the run + run_returns = [np.mean(hp_returns[i][after:]) for i in + range(len(hp_returns))] + + mean = np.mean(run_returns) + stderr = np.std(run_returns) / np.sqrt(len(hp_returns)) + + return mean, stderr + + +def _calculate_mean_return_episodic_conf(hp_returns, type_, significance, + after=0): + """ + Calculates the mean return for an experiment run on an episodic environment + over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. + type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + significance: float + The level of significance for the confidence interval + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + if type_ == "eval": + hp_returns = [np.mean(hp_returns[i][after:], axis=-1) for i in + range(len(hp_returns))] + + # Calculate the average return for all episodes in the run + run_returns = [np.mean(hp_returns[i][after:]) for i in + range(len(hp_returns))] + + mean = np.mean(run_returns) + run_returns = np.array(run_returns) + + conf = bs.bootstrap(run_returns, stat_func=bs_stats.mean, + alpha=significance) + + return mean, conf + + +def _calculate_mean_return_continuing(hp_returns, type_, after=0): + """ + Calculates the mean return for an experiment run on a continuing + environment over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. + type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + hp_returns = np.stack(hp_returns) + + # If evaluating, use the mean return over all episodes for each + # evaluation interval. 
That is, if 10 eval episodes for each + # evaluation the take the average return over all these eval + # episodes + if type_ == "eval": + hp_returns = hp_returns.mean(axis=-1) + + # Calculate the average return over all runs + hp_returns = hp_returns[after:, :].mean(axis=-1) + + # Calculate the average return over all "episodes" + stderr = np.std(hp_returns) / np.sqrt(len(hp_returns)) + mean = hp_returns.mean(axis=0) + + return mean, stderr + + +def _calculate_mean_return_continuing_conf(hp_returns, type_, significance, + after=0): + """ + Calculates the mean return for an experiment run on a continuing + environment over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. + type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + hp_returns = np.stack(hp_returns) + + # If evaluating, use the mean return over all episodes for each + # evaluation interval. That is, if 10 eval episodes for each + # evaluation the take the average return over all these eval + # episodes + if type_ == "eval": + hp_returns = hp_returns.mean(axis=-1) + + # Calculate the average return over all episodes + hp_returns = hp_returns[after:, :].mean(axis=-1) + + # Calculate the average return over all runs + mean = hp_returns.mean(axis=0) + conf = bs.bootstrap(hp_returns, stat_func=bs_stats.mean, + alpha=significance) + + return mean, conf + + +def get_best_hp_by_file(dir, type_, after=0, env_type="continuing"): + """ + Find the best hyperparameters from a list of files. + + Gets and returns a list of the hyperparameter settings, sorted by average + return. This function assumes a single directory containing all data + dictionaries, where each data dictionary contains all data of all runs for + a *single* hyperparameter setting. There must be a single file for each + hyperparameter setting in the argument directory. + + Note: If any retrun is NaN within the range specified by after, then the + entire return is considered NaN. + + Parameters + ---------- + dir : str + The directory which contains the data dictionaries, with one data + dictionary per hyperparameter setting + type_ : str + The type of return by which to compare hyperparameter settings, one of + "train" or "eval" + after : int, optional + Hyperparameters will only be compared by their performance after + training for this many episodes (in continuing tasks, this is the + number of times the task is restarted). For example, if after = -10, + then only the last 10 returns from training/evaluation are taken + into account when comparing the hyperparameters. As usual, positive + values index from the front, and negative values index from the back. + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + n-tuple of 2-tuple(int, float) + A tuple with the number of elements equal to the total number of + hyperparameter combinations. 
Each sub-tuple is a tuple of (hyperparameter + setting number, mean return over all runs and episodes) + """ + files = glob(os.path.join(dir, "*.pkl")) + + if type_ not in ("train", "eval"): + raise ValueError("type_ should be one of 'train', 'eval'") + + return_type = "train_episode_rewards" if type_ == "train" \ + else "eval_episode_rewards" + + mean_returns = [] + # hp_settings = [] + # hp_settings = sorted(list(data["experiment_data"].keys())) + for file in tqdm(files): + hp_returns = [] + + # Get the data + file = open(file, "rb") + data = pickle.load(file) + + hp_setting = next(iter(data["experiment_data"])) + # hp_settings.append(hp_setting) + for run in data["experiment_data"][hp_setting]["runs"]: + hp_returns.append(run[return_type]) + + # Episodic and continuing must be dealt with differently since + # we may have many episodes for a given number of timesteps for + # episodic tasks + if env_type == "episodic": + hp_returns, _ = _calculate_mean_return_episodic(hp_returns, type_, + after) + + elif env_type == "continuing": + hp_returns, _ = _calculate_mean_return_continuing(hp_returns, + type_, after) + + # Save mean return + mean_returns.append((hp_setting, hp_returns)) + + # Close the file + file.close() + del data + + # Create a structured array for sorting by return + dtype = [("setting index", int), ("return", float)] + mean_returns = np.array(mean_returns, dtype=dtype) + + # Return the best hyperparam settings in order with the + # mean returns sorted by hyperparmater setting performance + # best_hp_settings = np.argsort(mean_returns) + # mean_returns = np.array(mean_returns)[best_hp_settings] + mean_returns = np.sort(mean_returns, order="return") + + # return tuple(zip(best_hp_settings, mean_returns)) + return mean_returns + + +def combine_runs(data1, data2): + """ + Adds the runs for each hyperparameter setting in data2 to the runs for the + corresponding hyperparameter setting in data1. + + Given two data dictionaries, this function will get each hyperparameter + setting and extend the runs done on this hyperparameter setting and saved + in data1 by the runs of this hyperparameter setting and saved in data2. + In short, this function extends the lists + data1["experiment_data"][i]["runs"] by the lists + data2["experiment_data"][i]["runs"] for all i. This is useful if + multiple runs are done at different times, and the two data files need + to be combined. + + Parameters + ---------- + data1 : dict + A data dictionary as generated by main.py + data2 : dict + A data dictionary as generated by main.py + + Raises + ------ + KeyError + If a hyperparameter setting exists in data2 but not in data1. This + signals that the hyperparameter settings indices are most likely + different, so the hyperparameter index i in data1 does not correspond + to the same hyperparameter index in data2. In addition, all other + functions expect the number of runs to be consistent for each + hyperparameter setting, which would be violated in this case. 
+    """
+    for hp_setting in data1["experiment_data"]:
+        if hp_setting not in data2["experiment_data"]:
+            # Ensure consistent hyperparam settings indices
+            raise KeyError("hyperparameter settings are different " +
+                           "between the two experiments")
+
+        extra_runs = data2["experiment_data"][hp_setting]["runs"]
+        data1["experiment_data"][hp_setting]["runs"].extend(extra_runs)
+
+
+def get_returns(data, type_, ind, env_type="continuing"):
+    """
+    Gets the returns seen by an agent
+
+    Gets the online or offline returns seen by an agent trained with
+    hyperparameter settings index ind.
+
+    Parameters
+    ----------
+    data : dict
+        The Python data dictionary generated from running main.py
+    type_ : str
+        Whether to get the training or evaluation returns, one of 'train',
+        'eval'
+    ind : int
+        Gets the returns of the agent trained with this hyperparameter
+        settings index
+    env_type : str, optional
+        The type of environment, one of 'continuing', 'episodic'. By default
+        'continuing'
+
+    Returns
+    -------
+    array_like
+        The array of returns of the form (N, R, C), where N is the number of
+        runs, R is the number of times performance was measured, and C is the
+        number of returns generated each time performance was measured
+        (offline >= 1; online = 1). For the online setting, R is the number
+        of episodes and C = 1. For the offline setting, R is the number of
+        times offline evaluation was performed, and C is the number of
+        episodes run each time performance was evaluated offline.
+    """
+    if env_type == "episodic":
+        # data = reduce_episodes(data, ind, type_)
+        data = runs.expand_episodes(data, ind, type_)
+
+    returns = []
+    if type_ == "eval":
+        # Get the offline evaluation episode returns per run
+        for run in data["experiment_data"][ind]["runs"]:
+            returns.append(run["eval_episode_rewards"])
+        returns = np.stack(returns)
+
+    elif type_ == "train":
+        # Get the returns per episode per run
+        for run in data["experiment_data"][ind]["runs"]:
+            returns.append(run["train_episode_rewards"])
+        returns = np.expand_dims(np.stack(returns), axis=-1)
+
+    return returns
+
+
+def get_avg_returns(data, type_, ind, after=0, before=None):
+    """
+    Gets the average return over all episodes seen by an agent for each run
+
+    Gets the online or offline returns seen by an agent trained with
+    hyperparameter settings index ind, averaged within each run.
+
+    Parameters
+    ----------
+    data : dict
+        The Python data dictionary generated from running main.py
+    type_ : str
+        Whether to get the training or evaluation returns, one of 'train',
+        'eval'
+    ind : int
+        Gets the returns of the agent trained with this hyperparameter
+        settings index
+    after : int, optional
+        Only consider episodes (or evaluation intervals) from this index
+        onwards, by default 0
+    before : int, optional
+        Only consider episodes (or evaluation intervals) before this index,
+        by default None (use all remaining episodes)
+
+    Returns
+    -------
+    array_like
+        An array of shape (N,), where N is the number of runs, containing
+        the average return over the selected episodes for each run
+ """ + returns = [] + if type_ == "eval": + # Get the offline evaluation episode returns per run + for run in data["experiment_data"][ind]["runs"]: + if before is not None: + run_returns = run["eval_episode_rewards"][after:before] + else: + run_returns = run["eval_episode_rewards"][after:before] + returns.append(run_returns) + + returns = np.stack(returns).mean(axis=(-2, -1)) + + elif type_ == "train": + # Get the returns per episode per run + for run in data["experiment_data"][ind]["runs"]: + if before is not None: + run_returns = run["train_episode_rewards"][after:before] + else: + run_returns = run["train_episode_rewards"][after:] + returns.append(np.mean(run_returns)) + + returns = np.array(returns) + + return returns + + +def get_mean_returns_with_stderr_hp_varying(dir_, type_, hp_name, combo, + env_config, agent_config, after=0, + env_type="continuing"): + """ + Calculate mean and standard error of return for each hyperparameter value. + + Gets the mean returns for each variation of a single hyperparameter, + with all other hyperparameters remaining constant. Since there are + many different ways this can happen (the hyperparameter can vary + with all other remaining constant, but there are many combinations + of these constant hyperparameters), the combo argument cycles through + the combinations of constant hyperparameters. + + Given hyperparameters a, b, and c, let's say we want to get all + hyperparameter settings indices where a varies, and b and c are constant. + if a, b, and c can each be 1 or 2, then there are four ways that a can + vary with b and c remaining constant: + + [ + ((a=1, b=1, c=1), (a=2, b=1, c=1)), combo = 0 + ((a=1, b=2, c=1), (a=2, b=2, c=1)), combo = 1 + ((a=1, b=1, c=2), (a=2, b=1, c=2)), combo = 2 + ((a=1, b=2, c=2), (a=2, b=2, c=2)) combo = 3 + ] + + The combo argument indexes into this list of hyperparameter settings + + Parameters + ---------- + dir_ : str + The directory of data dictionaries generated from running main.py, + separated into one data dictionary per HP setting + type_ : str + Which type of data to plot, one of "eval" or "train" + hp_name : str + The name of the hyperparameter to plot the sensitivity curves of + combo : int + Determines the values of the constant hyperparameters. Given that + only one hyperparameter may vary, there are many different sets + having this hyperparameter varying with all others remaining constant + since each constant hyperparameter may take on many values. This + argument cycles through all sets of hyperparameter settings indices + that have only one hyperparameter varying and all others constant. 
+ env_config : dict + The environment configuration file as a Python dictionary + agent_config : dict + The agent configuration file as a Python dictionary + after : int + Only consider returns after this episode + """ + hp_combo = get_varying_single_hyperparam(env_config, agent_config, + hp_name)[combo] + + env_name = env_config["env_name"] + agent_name = agent_config["agent_name"] + filename = f"{env_name}_{agent_name}_hp-" + "{hp}.pkl" + + mean_returns = [] + stderr_returns = [] + hp_values = [] + for hp in hp_combo: + if hp is None: + continue + + with open(os.path.join(dir_, filename.format(hp=hp)), "rb") as in_file: + data = pickle.load(in_file) + + hp_returns = [] + return_type = f"{type_}_episode_rewards" + for run in data["experiment_data"][hp]["runs"]: + hp_returns.append(run[return_type]) + + if env_type == "episodic": + mean_return, stderr_return = \ + _calculate_mean_return_episodic(hp_returns, type_, after) + elif env_type == "continuing": + mean_return, stderr_return = \ + _calculate_mean_return_continuing(hp_returns, type_, after) + + mean_returns.append(mean_return) + stderr_returns.append(stderr_return) + hp_value = data["experiment_data"][hp]["agent_hyperparams"][hp_name] + hp_values.append(hp_value) + + del data + + # Get each hp value and sort all results by hp value + # hp_values = np.array(agent_config["parameters"][hp_name]) + hp_values = np.array(hp_values) + indices = np.argsort(hp_values) + + mean_returns = np.array(mean_returns)[indices] + stderr_returns = np.array(stderr_returns)[indices] + hp_values = hp_values[indices] + + return hp_values, mean_returns, stderr_returns + + +def get_mean_returns_with_conf_hp_varying(dir_, type_, hp_name, combo, + env_config, agent_config, after=0, + env_type="continuing", + significance=0.1): + """ + Calculate mean and standard error of return for each hyperparameter value. + + Gets the mean returns for each variation of a single hyperparameter, + with all other hyperparameters remaining constant. Since there are + many different ways this can happen (the hyperparameter can vary + with all other remaining constant, but there are many combinations + of these constant hyperparameters), the combo argument cycles through + the combinations of constant hyperparameters. + + Given hyperparameters a, b, and c, let's say we want to get all + hyperparameter settings indices where a varies, and b and c are constant. + if a, b, and c can each be 1 or 2, then there are four ways that a can + vary with b and c remaining constant: + + [ + ((a=1, b=1, c=1), (a=2, b=1, c=1)), combo = 0 + ((a=1, b=2, c=1), (a=2, b=2, c=1)), combo = 1 + ((a=1, b=1, c=2), (a=2, b=1, c=2)), combo = 2 + ((a=1, b=2, c=2), (a=2, b=2, c=2)) combo = 3 + ] + + The combo argument indexes into this list of hyperparameter settings + + Parameters + ---------- + dir_ : str + The directory of data dictionaries generated from running main.py, + separated into one data dictionary per HP setting + type_ : str + Which type of data to plot, one of "eval" or "train" + hp_name : str + The name of the hyperparameter to plot the sensitivity curves of + combo : int + Determines the values of the constant hyperparameters. Given that + only one hyperparameter may vary, there are many different sets + having this hyperparameter varying with all others remaining constant + since each constant hyperparameter may take on many values. This + argument cycles through all sets of hyperparameter settings indices + that have only one hyperparameter varying and all others constant. 
+ env_config : dict + The environment configuration file as a Python dictionary + agent_config : dict + The agent configuration file as a Python dictionary + after : int + Only consider returns after this episode + """ + hp_combo = get_varying_single_hyperparam(env_config, agent_config, + hp_name)[combo] + + env_name = env_config["env_name"] + agent_name = agent_config["agent_name"] + filename = f"{env_name}_{agent_name}_hp-" + "{hp}.pkl" + + mean_returns = [] + conf_returns = [] + hp_values = [] + for hp in hp_combo: + if hp is None: + continue + + with open(os.path.join(dir_, filename.format(hp=hp)), "rb") as in_file: + data = pickle.load(in_file) + + hp_returns = [] + return_type = f"{type_}_episode_rewards" + for run in data["experiment_data"][hp]["runs"]: + hp_returns.append(run[return_type]) + + if env_type == "episodic": + mean_return, conf_return = \ + _calculate_mean_return_episodic_conf(hp_returns, type_, + significance, after) + elif env_type == "continuing": + mean_return, conf_return = \ + _calculate_mean_return_continuing_conf(hp_returns, type_, + significance, after) + + mean_returns.append(mean_return) + conf_returns.append([conf_return.lower_bound, conf_return.upper_bound]) + hp_value = data["experiment_data"][hp]["agent_hyperparams"][hp_name] + hp_values.append(hp_value) + + del data + + # Get each hp value and sort all results by hp value + # hp_values = np.array(agent_config["parameters"][hp_name]) + hp_values = np.array(hp_values) + indices = np.argsort(hp_values) + + mean_returns = np.array(mean_returns)[indices] + conf_returns = np.array(conf_returns)[indices, :].transpose() + hp_values = hp_values[indices] + + return hp_values, mean_returns, conf_returns + + +def get_mean_err(data, type_, ind, smooth_over, error, + env_type="continuing", keep_shape=False, + err_args={}): + """ + Gets the timesteps, mean, and standard error to be plotted for + a given hyperparameter settings index + + Note: This function assumes that each run has an equal number of episodes. + This is true for continuing tasks. For episodic tasks, you will need to + cutoff the episodes so all runs have the same number of episodes. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : int + The hyperparameter settings index to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + error: function + The error function to compute the error with + env_type : str, optional + The type of environment the data was generated on + keep_shape : bool, optional + Whether or not the smoothed data should discard or keep the first + few data points before smooth_over. 
+ err_args : dict + A dictionary of keyword arguments to pass to the error function + + Returns + ------- + 3-tuple of list(int), list(float), list(float) + The timesteps, mean episodic returns, and standard errors of the + episodic returns + """ + timesteps = None # So the linter doesn't have a temper tantrum + + # Determine the timesteps to plot at + if type_ == "eval": + timesteps = \ + data["experiment_data"][ind]["runs"][0]["timesteps_at_eval"] + + elif type_ == "train": + timesteps_per_ep = \ + data["experiment_data"][ind]["runs"][0]["train_episode_steps"] + timesteps = get_cumulative_timesteps(timesteps_per_ep) + + # Get the mean over all episodes per evaluation step (for online + # returns, this axis will have length 1 so we squeeze it) + returns = get_returns(data, type_, ind, env_type=env_type) + returns = returns.mean(axis=-1) + + returns = smooth(returns, smooth_over, keep_shape=keep_shape) + + # Get the standard error of mean episodes per evaluation + # step over all runs + if error is not None: + err = error(returns, **err_args) + else: + err = None + + # Get the mean over all runs + print("RUNS:", returns.shape) + mean = returns.mean(axis=0) + + # Return only the valid portion of timesteps. If smoothing and not + # keeping the first data points, then the first smooth_over columns + # will not have any data + if not keep_shape: + end = len(timesteps) - smooth_over + 1 + timesteps = timesteps[:end] + + return timesteps, mean, err + + +def bootstrap_conf(runs, significance=0.01): + """ + THIS NEEDS TO BE UPDATED + + + Gets the bootstrap confidence interval of the distribution of mean return + per episode for a single hyperparameter setting. + + Note that this function assumes that there are an equal number of episodes + for each run. This is true for continuing environments. If using an + episodic environment, ensure that the episodes have been made consistent + across runs before running this function. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + array_like + An array with two rows and n columns. The first row denotes the lower + bound of the confidence interval and the second row denotes the upper + bound of the confidence interval. The number of columns, n, is the + number of episodes. 
+ """ + # return_type = type_ + "_episode_rewards" + # runs = [] + # for run in data["experiment_data"][hp]["runs"]: + # if type_ == "eval": + # runs.append(run[return_type].mean()) + # else: + # runs.append(run[return_type]) + + # Rows are the returns for the episode number == row number for each run + ep_conf = [] + run_returns = [] + for ep in range(runs.shape[1]): + ep_returns = [] + for run in range(runs.shape[0]): + ep_returns.append(np.mean(runs[run][ep])) + run_returns.append(ep_returns) + + run_returns = np.array(run_returns) + + conf_interval = [] + for ep in range(run_returns.shape[0]): + ep_conf = bs.bootstrap(run_returns[ep, :], stat_func=bs_stats.mean, + alpha=significance) + conf_interval.append([ep_conf.lower_bound, ep_conf.upper_bound]) + + return np.array(conf_interval).transpose() + + +def stderr(matrix, axis=0): + """ + Calculates the standard error along a specified axis + + Parameters + ---------- + matrix : array_like + The matrix to calculate standard error along the rows of + axis : int, optional + The axis to calculate the standard error along, by default 0 + + Returns + ------- + array_like + The standard error of each row along the specified axis + + Raises + ------ + np.AxisError + If an invalid axis is passed in + """ + if axis > len(matrix.shape) - 1: + raise np.AxisError(f"""axis {axis} is out of bounds for array with + {len(matrix.shape) - 1} dimensions""") + + samples = matrix.shape[axis] + return np.std(matrix, axis=axis) / np.sqrt(samples) + + +def smooth(matrix, smooth_over, keep_shape=False): + """ + Smooth the rows of returns + + Smooths the rows of returns by replacing the value at index i in a + row of returns with the average of the next smooth_over elements, + starting at element i. + + Parameters + ---------- + matrix : array_like + The array to smooth over + smooth_over : int + The number of elements to smooth over + keep_shape : bool, optional + Whether the smoothed array should have the same shape as + as the input array, by default True. If True, then for the first + few i < smooth_over columns of the input array, the element at + position i is replaced with the average of all elements at + positions j <= i. + + Returns + ------- + array_like + The smoothed over array + """ + if smooth_over > 1: + # Smooth each run separately + kernel = np.ones(smooth_over) / smooth_over + smoothed_matrix = _smooth(matrix, kernel, "valid", axis=1) + + # Smooth the first few episodes + if keep_shape: + beginning_cols = [] + for i in range(1, smooth_over): + # Calculate smoothing over the first i columns + beginning_cols.append(matrix[:, :i].mean(axis=1)) + + # Numpy will use each smoothed col as a row, so transpose + beginning_cols = np.array(beginning_cols).transpose() + else: + return matrix + + if keep_shape: + # Return the smoothed array + return np.concatenate([beginning_cols, smoothed_matrix], + axis=1) + else: + return smoothed_matrix + + +def _smooth(matrix, kernel, mode="valid", axis=0): + """ + Performs an axis-wise convolution of matrix with kernel + + Parameters + ---------- + matrix : array_like + The matrix to convolve + kernel : array_like + The kernel to convolve on each row of matrix + mode : str, optional + The mode of convolution, by default "valid". 
One of 'valid', + 'full', 'same' + axis : int, optional + The axis to perform the convolution along, by default 0 + + Returns + ------- + array_like + The convolved array + + Raises + ------ + ValueError + If kernel is multi-dimensional + """ + if len(kernel.shape) != 1: + raise ValueError("kernel must be 1D") + + def convolve(mat): + return np.convolve(mat, kernel, mode=mode) + + return np.apply_along_axis(convolve, axis=axis, arr=matrix) + + +def get_cumulative_timesteps(timesteps_per_episode): + """ + Creates an array of cumulative timesteps. + + Creates an array of timesteps, where each timestep is the cumulative + number of timesteps up until that point. This is needed for plotting the + training data, where the training timesteps are stored for each episode, + and we need to plot on the x-axis the cumulative timesteps, not the + timesteps per episode. + + Parameters + ---------- + timesteps_per_episode : list + A list where each element in the list denotes the amount of timesteps + for the corresponding episode. + + Returns + ------- + array_like + An array where each element is the cumulative number of timesteps up + until that point. + """ + timesteps_per_episode = np.array(timesteps_per_episode) + cumulative_timesteps = [timesteps_per_episode[:i].sum() + for i in range(timesteps_per_episode.shape[0])] + + return np.array(cumulative_timesteps) + + +def combine_data_dictionaries_by_hp(dir_, env, agent, num_hp_settings, + num_runs, save_dir=".", save_returns=True, + env_type="continuing", offset=0): + """ + Combines all data dictionaries by hyperparameter setting. + + Given a directory, combines all data dictionaries relating to the argument + agent and environment, grouped by hyperparameter settings index. This way, + each resulting data dictionary will contain all data of all runs for + a single hyperparameter setting. This function will save one data + dictionary, consisting of all runs, for each hyperparameter setting. + + This function looks for files named like + "env_agent_data_start_stop_step.pkl" in the argument directory and + combines all those whose start index refers to the same hyperparameter + settings index. + + Parameters + ---------- + dir_ : str + The directory containing the data files + env : str + The name of the environment the experiments were run on + agent : str + The name of the agent in the experiments + num_hp_settings : int + The total number of hyperparameter settings used in the experiment + num_runs : int + The number of runs in the experiment + save_dir : str, optional + The directory to save the combined data in, by default "." + save_returns : bool, optinal + Whether or not to save the mean training and evaluation returns over + all episodes and runs in a text file, by default True + env_type : str, optional + Whether the environment is continuing or episodic, one of 'continuing', + 'episodic'; by default 'continuing'. This determines how the average + return is calculated. For continuing environments, each episode's + performance is first averaged over runs and then over episodes. For + episodic environments, the average return is calculated by first + averaging over all episodes in each run, and then averaging over all + runs; this is required since each run may have a different number of + episodes. 
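+
+    Example
+    -------
+    An illustrative call only; the environment, agent, and counts below are
+    assumptions, not defaults used by this repository:
+
+    >>> combine_data_dictionaries_by_hp("./results", "Pendulum-v0", "SAC",
+    ...                                 num_hp_settings=36, num_runs=10,
+    ...                                 save_dir="./results/combined")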
+ """ + hp_returns = [] + + for hp_ind in range(num_hp_settings): + _, train_mean, eval_mean = \ + combine_data_dictionaries_single_hp(dir_, env, agent, hp_ind, + num_hp_settings, num_runs, + save_dir, save_returns, + env_type, offset=offset) + if save_returns: + hp_returns.append((hp_ind, train_mean, eval_mean)) + + # Write the mean training and evaluation returns to a file + if save_returns: + filename = f"{env}_{agent}_avg_returns.pkl" + with open(os.path.join(save_dir, filename), "wb") as out_file: + # out_file.write(f"{train_mean}, {eval_mean}") + pickle.dump(hp_returns, out_file) + + +def combine_data_dictionaries_single_hp(dir_, env, agent, hp_ind, + num_hp_settings, num_runs, + save_dir=".", calculate_returns=True, + env_type="continuing", offset=0): + filenames = f"{env}_{agent}_data_" + "{start}.pkl" + + hp_run_files = [] + hp_offset = offset * num_hp_settings + start = hp_ind + hp_offset + for j in range(start, start + num_hp_settings * num_runs, num_hp_settings): + filename = os.path.join(dir_, filenames.format(start=j)) + if os.path.exists(filename): + hp_run_files.append(filename) + data = combine_data_dictionaries(hp_run_files, True, save_dir=save_dir, + filename=f"{env}_{agent}_hp-{hp_ind}") + + if not calculate_returns: + return hp_ind, -1., -1. + + # Get the returns for each episode in each run + train_returns = [] + eval_returns = [] + for run in data["experiment_data"][hp_ind]["runs"]: + train_returns.append(run["train_episode_rewards"]) + eval_returns.append(run["eval_episode_rewards"]) + + # Get the mean performance + if env_type == "continuing": + train_mean, _ = _calculate_mean_return_continuing(train_returns, + "train") + eval_mean, _ = _calculate_mean_return_continuing(eval_returns, + "eval") + + elif env_type == "episodic": + train_mean, _ = _calculate_mean_return_episodic(train_returns, + "train") + eval_mean, _ = _calculate_mean_return_episodic(eval_returns, + "eval") + + return hp_ind, train_mean, eval_mean + + +def combine_data_dictionaries(files, save=True, save_dir=".", filename="data"): + """ + Combine data dictionaries given a list of filenames + + Given a list of paths to data dictionaries, combines each data dictionary + into a single one. 
+ + Parameters + ---------- + files : list of str + A list of the paths to data dictionary files to combine + save : bool + Whether or not to save the data + save_dir : str, optional + The directory to save the resulting data dictionaries in + filename : str, optional + The name of the file to save which stores the combined data, by default + 'data' + + Returns + ------- + dict + The combined dictionary + """ + # Use first dictionary as base dictionary + with open(files[0], "rb") as in_file: + data = pickle.load(in_file) + + # Add data from all other dictionaries + for file in files[1:]: + with open(file, "rb") as in_file: + # Read in the new dictionary + in_data = pickle.load(in_file) + + # Add experiment data to running dictionary + for key in in_data["experiment_data"]: + # Check if key exists + if key in data["experiment_data"]: + # Append data if existing + data["experiment_data"][key]["runs"] \ + .extend(in_data["experiment_data"][key]["runs"]) + + else: + # Key doesn't exist - add data to dictionary + data["experiment_data"][key] = \ + in_data["experiment_data"][key] + + if save: + with open(os.path.join(save_dir, f"{filename}.pkl"), "wb") as out_file: + pickle.dump(data, out_file) + + return data + + +def combine_data_dictionaries_by_dir(dir): + """ + Combines the many data dictionaries created during the concurrent + training procedure into a single data dictionary. The combined data is + saved as "data.pkl" in the argument dir. + + Parameters + ---------- + dir : str + The path to the directory containing all data dictionaries to combine + + Returns + ------- + dict + The combined dictionary + """ + files = glob(os.path.join(dir, "*.pkl")) + + combine_data_dictionaries(files) + + +if __name__ == "__main__": + f = open("results/MountainCarContinuous-v0_linearACresults" + + "/MountainCarContinuous-v0_linearAC_hp-12.pkl", "rb") + data = pickle.load(f) + f.close() + + # get_mean_stderr(data, "train", 12, 5) + r = get_returns(data, "train", 12, "episodic") + print(r.shape) + + +def detrend_linear(arr, axis=-1, type_="linear"): + """ + Detrends a matrix along an axis using linear model fitting + + Parameters + ---------- + arr : array_like + The array to detrend + axis : int, optional + The axis along which to detrend, by default -1 + type_ : str, optional + Whether to use the prediction of the linear model or the mean + generated by the linear model, by default "linear". One of "linear", + "mean" + + Returns + ------- + array_like + The array of detrended data + """ + return signal.detrend(arr, axis=axis, type=type_) + + +def detrend_difference(arr, axis=-1): + """ + Detrends a matrix along an axis using the method of differences + + Parameters + ---------- + arr : array_like + The array to detrend + axis : int, optional + The axis along which to detrend, by default -1 + + Returns + ------- + array_like + The array of detrended data + """ + return np.diff(arr, axis=axis) diff --git a/utils/hypers.py b/utils/hypers.py new file mode 100644 index 0000000..0c1061a --- /dev/null +++ b/utils/hypers.py @@ -0,0 +1,520 @@ +import numpy as np +from collections.abc import Iterable +from copy import deepcopy +from pprint import pprint +try: + from utils.runs import expand_episodes +except ModuleNotFoundError: + from runs import expand_episodes + + +CONTINIUING = "continuing" +EPISODIC = "episodic" +TRAIN = "train" +EVAL = "eval" + + +def sweeps(parameters, index): + """ + Gets the parameters for the hyperparameter sweep defined by the index. 
+
+    Each hyperparameter setting has a specific index number, and this function
+    will get the appropriate parameters for the argument index. In addition,
+    the indices will wrap around, so if there are a total of 10 different
+    hyperparameter settings, then the indices 0 and 10 will return the same
+    hyperparameter settings. This is useful when looping over runs.
+
+    For example, if you had 10 hyperparameter settings and you wanted to do
+    10 runs, then you could just call this for indices in range(0, 10*10). If
+    you only wanted to do runs for hyperparameter setting i, then you would
+    use indices in range(i, 10*10, 10).
+
+    Parameters
+    ----------
+    parameters : dict
+        The dictionary of parameters, as found in the agent's json
+        configuration file
+    index : int
+        The index of the hyperparameters configuration to return
+
+    Returns
+    -------
+    dict, int
+        The dictionary of hyperparameters to use for the agent and the total
+        number of combinations of hyperparameters (i.e. the number of
+        distinct hyperparameter settings)
+    """
+    # If the algorithm is a batch algorithm, ensure the batch size is no
+    # larger than the replay buffer size
+    if "batch_size" in parameters and "replay_capacity" in parameters:
+        batches = np.array(parameters["batch_size"])
+        replays = np.array(parameters["replay_capacity"])
+        legal_settings = []
+
+        # Calculate the legal combinations of batch sizes and replay capacities
+        for batch in batches:
+            legal = np.where(replays >= batch)[0]
+            legal_settings.extend(list(zip([batch] *
+                                           len(legal), replays[legal])))
+
+        # Replace the configs batch/replay combos with the legal ones
+        parameters["batch/replay"] = legal_settings
+        replaced_hps = ["batch_size", "replay_capacity"]
+    else:
+        replaced_hps = []
+
+    # Get the hyperparameters corresponding to the argument index
+    out_params = {}
+    accum = 1
+    for key in parameters:
+        if key in replaced_hps:
+            # Ignore the HPs that have been sanitized and replaced by a new
+            # set of HPs
+            continue
+
+        num = len(parameters[key])
+        if key == "batch/replay":
+            # Batch/replay must be treated differently
+            batch_replay_combo = parameters[key][(index // accum) % num]
+            out_params["batch_size"] = batch_replay_combo[0]
+            out_params["replay_capacity"] = batch_replay_combo[1]
+            accum *= num
+            continue
+
+        out_params[key] = parameters[key][(index // accum) % num]
+        accum *= num
+
+    return (out_params, accum)
+
+
+def total(parameters):
+    """
+    Similar to sweeps but only returns the total number of
+    hyperparameter combinations. This number is the total number of distinct
+    hyperparameter settings. If this function returns k, then there are k
+    distinct hyperparameter settings, and indices 0 and k refer to the same
+    distinct hyperparameter setting.
+
+    Parameters
+    ----------
+    parameters : dict
+        The dictionary of parameters, as found in the agent's json
+        configuration file
+
+    Returns
+    -------
+    int
+        The number of distinct hyperparameter settings
+    """
+    return sweeps(parameters, 0)[1]
+
+
+def satisfies(data, f):
+    """
+    Similar to hold_constant. Returns all hyperparameter settings
+    that result in f evaluating to True.
+
+    For each hyperparameter setting, the hyperparameter dictionary for that
+    setting is passed to f. If f returns True, then that setting is kept.
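+
+    For example (illustrative only; assumes the agent configuration sweeps a
+    hyperparameter named "critic_lr"):
+
+    >>> data = ...  # Some data from an experiment
+    >>> indices, new_hypers = satisfies(data,
+    ...                                 lambda h: h["critic_lr"] == 0.01)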
+ + Parameters + ---------- + data : dict + The data dictionary generate from running an experiment + f : f(dict) -> bool + A function mapping hyperparameter settings (in a dictionary) to a + boolean value + + Returns + ------- + tuple of list[int], dict + The list of hyperparameter settings satisfying the constraints + defined by constant_hypers and a dictionary of new hyperparameters + which satisfy these constraints + """ + indices = [] + + # Generate a new hyperparameter configuration based on the old + # configuration + new_hypers = deepcopy(data["experiment"]["agent"]["parameters"]) + # Clear the hyper configuration + for key in new_hypers: + if isinstance(new_hypers[key], list): + new_hypers[key] = set() + + for index in data["experiment_data"]: + hypers = data["experiment_data"][index]["agent_hyperparams"] + if not f(hypers): + continue + + # Track the hyper indices and the full hyper settings + indices.append(index) + for key in new_hypers: + if key not in data["experiment_data"][index]["agent_hyperparams"]: + # print(f"{key} not in agent hyperparameters, ignoring...") + continue + + if isinstance(new_hypers[key], set): + agent_val = data["experiment_data"][index][ + "agent_hyperparams"][key] + + # Convert lists to a hashable type + if isinstance(agent_val, list): + agent_val = tuple(agent_val) + + new_hypers[key].add(agent_val) + else: + if key in new_hypers: + value = new_hypers[key] + raise IndexError("clobbering existing hyper " + + f"{key} with value {value} with " + + f"new value {agent_val}") + new_hypers[key] = agent_val + + # Convert each set in new_hypers to a list + for key in new_hypers: + if isinstance(new_hypers[key], set): + new_hypers[key] = sorted(list(new_hypers[key])) + + return indices, new_hypers + + +def hold_constant(data, constant_hypers): + """ + Returns the hyperparameter settings indices and hyperparameter values + of the hyperparameter settings satisfying the constraints constant_hypers. + + Returns the hyperparameter settings indices in the data that + satisfy the constraints as well as a new dictionary of hypers which satisfy + the constraints. The indices returned are the hyper indices of the original + data and not the indices into the new hyperparameter configuration + returned. + + Parameters + ---------- + data: dict + The data dictionary generated from an experiment + + constant_hypers: dict[string]any + A dictionary mapping hyperparameters to a value that they should be + equal to. + + Returns + ------- + tuple of list[int], dict + The list of hyperparameter settings satisfying the constraints + defined by constant_hypers and a dictionary of new hyperparameters + which satisfy these constraints + + Example + ------- + >>> data = ... 
+ >>> contraints = {"stepsize": 0.8} + >>> hold_constant(data, constraints) + ( + [0, 1, 6, 7], + { + "stepsize": [0.8], + "decay": [0.0, 0.5], + "epsilon": [0.0, 0.1], + } + ) + """ + indices = [] + + # Generate a new hyperparameter configuration based on the old + # configuration + new_hypers = deepcopy(data["experiment"]["agent"]["parameters"]) + # Clear the hyper configuration + for key in new_hypers: + if isinstance(new_hypers[key], list): + new_hypers[key] = set() + + # Go through each hyperparameter index, checking if it satisfies the + # constraints + for index in data["experiment_data"]: + # Assume we hyperparameter satisfies the constraints + constraint_satisfied = True + + # Check to see if the agent hyperparameter satisfies the constraints + for hyper in constant_hypers: + constant_val = constant_hypers[hyper] + + # Ensure the constrained hyper exists in the data + if hyper not in data["experiment_data"][index][ + "agent_hyperparams"]: + raise IndexError(f"no such hyper {hyper} in agent hypers") + + agent_val = data["experiment_data"][index]["agent_hyperparams"][ + hyper] + + if agent_val != constant_val: + # Hyperparameter does not satisfy the constraints + constraint_satisfied = False + break + + # If the constraint is satisfied, then we will store the hypers + if constraint_satisfied: + indices.append(index) + + # Add the hypers to the configuration + for key in new_hypers: + if isinstance(new_hypers[key], set): + agent_val = data["experiment_data"][index][ + "agent_hyperparams"][key] + + if isinstance(agent_val, list): + agent_val = tuple(agent_val) + + new_hypers[key].add(agent_val) + else: + if key in new_hypers: + value = new_hypers[key] + raise IndexError("clobbering existing hyper " + + f"{key} with value {value} with " + + f"new value {agent_val}") + new_hypers[key] = agent_val + + # Convert each set in new_hypers to a list + for key in new_hypers: + if isinstance(new_hypers[key], set): + new_hypers[key] = sorted(list(new_hypers[key])) + + return indices, new_hypers + + +def renumber(data, hypers): + """ + Renumbers the hyperparameters in data to reflect the hyperparameter map + hypers. If any hyperparameter settings exist in data that do not exist in + hypers, then those data are discarded. + + Note that each hyperparameter listed in hypers must also be listed in data + and vice versa, but the specific hyperparameter values need not be the + same. For example if "decay" ∈ data[hypers], then it also must be in hypers + and vice versa. If 0.9 ∈ data[hypers][decay], then it need *not* be in + hypers[decay]. + + This function does not mutate the input data, but rather returns a copy of + the input data, appropriately mutated. + + Parameters + ---------- + data : dict + The data dictionary generated from running the experiment + hypers : dict + The new dictionary of hyperparameter values + + Returns + ------- + dict + The modified data dictionary + + Examples + -------- + >>> data = ... + >>> contraints = {"stepsize": 0.8} + >>> new_hypers = hold_constant(data, constraints)[1] + >>> new_data = renumber(data, new_hypers) + """ + data = deepcopy(data) + # Ensure each hyperparameter is in both hypers and data; hypers need not + # list every hyperparameter *value* that is listed in data, but it needs to + # have the same hyperparameters. E.g. if "decay" exists in data then it + # should also exist in hypers, but if 0.9 ∈ data[hypers][decay], this value + # need not exist in hypers. 
+ for key in data["experiment"]["agent"]["parameters"]: + if key not in hypers: + raise ValueError("data and hypers should have all the same " + + f"hyperparameters but {key} ∈ data but ∉ hypers") + + # Ensure each hyperparameter listed in hypers is also listed in data. If it + # isn't then it isn't clear which value of this hyperparamter the data in + # data should map to. E.g. if "decay" = [0.1, 0.2] ∈ hypers but ∉ data, + # which value should we set for the data in data when renumbering? 0.1 or + # 0.2? + for key in hypers: + if key not in data["experiment"]["agent"]["parameters"]: + raise ValueError("data and hypers should have all the same " + + f"hyperparameters but {key} ∈ hypers but ∉ data") + + new_data = {} + new_data["experiment"] = data["experiment"] + new_data["experiment"]["agent"]["parameters"] = hypers + new_data["experiment_data"] = {} + + total_hypers = total(hypers) + + for i in range(total_hypers): + setting = sweeps(hypers, i)[0] + + for j in data["experiment_data"]: + agent_hypers = data["experiment_data"][j]["agent_hyperparams"] + setting_in_data = True + + # For each hyperparameter value in setting, ensure that the + # corresponding agent hyperparameter is equal. If not, ignore that + # hyperparameter setting. + for key in setting: + # If the hyper setting is iterable, then check each value in + # the iterable to ensure it is equal to the corresponding + # value in the agent hyperparameters + if isinstance(setting[key], Iterable): + if len(setting[key]) != len(agent_hypers[key]): + setting_in_data = False + break + for k in range(len(setting[key])): + if setting[key][k] != agent_hypers[key][k]: + setting_in_data = False + break + + # Non-iterable data + elif setting[key] != agent_hypers[key]: + setting_in_data = False + break + + if setting_in_data: + new_data["experiment_data"][i] = data["experiment_data"][j] + + return new_data + + +def get_performance(data, hyper, type_=TRAIN, repeat=True): + """ + Returns the data for each run of key, optionally adjusting the runs' + data so that each run has the same number of data points. This is + accomplished by repeating each episode's performance by the number of + timesteps the episode took to complete + + Parameters + ---------- + data : dict + The data dictionary + hyper : int + The hyperparameter index to get the run data of + repeat : bool + Whether or not to repeat the runs data + + Returns + ------- + np.array + The array of performance data + """ + if type_ not in (TRAIN, EVAL): + raise ValueError(f"unknown type {type_}") + + key = type_ + "_episode_rewards" + + if repeat: + data = expand_episodes(data, hyper, type_) + + run_data = [] + for run in data["experiment_data"][hyper]["runs"]: + run_data.append(run[key]) + + return np.array(run_data) + + +def best(data, perf=TRAIN): + """ + Returns the hyperparameter index of the hyper setting which resulted in the + highest AUC of the learning curve. AUC is calculated by computing the AUC + for each run, then taking the average over all runs. 
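+
+    For example (an illustrative sketch; assumes data was produced by
+    main.py):
+
+    >>> data = ...  # Some data from an experiment
+    >>> best_ind, best_return = best(data, perf="train")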
+ + Parameters + ---------- + data : dict + The data dictionary + perf : str + The type of performance to evaluate, train or eval + + Returns + ------- + np.array[int], np.float32 + The hyper settings that resulted in the maximum return as well as the + maximum return + """ + max_hyper = int(np.max(list(data["experiment_data"].keys()))) + hypers = [np.finfo(np.float64).min] * (max_hyper + 1) + for hyper in data["experiment_data"]: + hyper_data = [] + for run in data["experiment_data"][hyper]["runs"]: + hyper_data.append(run[f"{perf}_episode_rewards"].mean()) + + hyper_data = np.array(hyper_data) + hypers[hyper] = hyper_data.mean() + + return np.argmax(hypers), np.max(hypers) + + +def get(data, ind): + """ + Gets the hyperparameters for hyperparameter settings index ind + + data : dict + The Python data dictionary generated from running main.py + ind : int + Gets the returns of the agent trained with this hyperparameter + settings index + + Returns + ------- + dict + The dictionary of hyperparameters + """ + return data["experiment_data"][ind]["agent_hyperparams"] + + +def which(data, hypers, equal_keys=False): + """ + Get the hyperparameter index at which all agent hyperparameters are + equal to those specified by hypers. + + Parameters + ---------- + data : dict + The data dictionary that resulted from running an experiment + hypers : dict[string]any + A dictionary of hyperparameters to the values that those + hyperparameters should take on + equal_keys : bool, optional + Whether or not all keys must be shared between the sets of agent + hyperparameters and the argument hypers. By default False. + + Returns + ------- + int, None + The hyperparameter index at which the agent had hyperparameters equal + to those specified in hypers. + + Examples + -------- + >>> data = ... # Some data from an experiment + >>> hypers = {"critic_lr": 0.01, "actor_lr": 1.0} + >>> ind = which(data, hypers) + >>> print(ind in data["experiment_data"]) + True + """ + for ind in data["experiment_data"]: + is_equal = True + agent_hypers = data["experiment_data"][ind]["agent_hyperparams"] + + # Ensure that all keys in each dictionary are equal + if equal_keys and set(agent_hypers.keys()) != set(hypers.keys()): + continue + + # For the current set of agent hyperparameters (index ind), check to + # see if all hyperparameters used by the agent are equal to those + # specified by hypers. If not, then break and check the next set of + # agent hyperparameters. + for h in hypers: + if h in agent_hypers and hypers[h] != agent_hypers[h]: + is_equal = False + break + + if is_equal: + return ind + + # No agent hyperparameters were found that coincided with the argument + # hypers + return None diff --git a/utils/max_time.py b/utils/max_time.py new file mode 100644 index 0000000..85d6f39 --- /dev/null +++ b/utils/max_time.py @@ -0,0 +1,27 @@ +#!/usr/bin/env python3 + +# This script looks through all runs of an experiment, over all hyper settings. +# It will return the runtime from the longest running experiment. 
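+#
+# Example (hypothetical path; any data.pkl produced by main.py works):
+#     python3 utils/max_time.py results/MountainCar-v0_linearACresults/data.pkl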
+
+import numpy as np
+import pickle
+import sys
+import os
+
+if len(sys.argv) != 2:
+    print(f"{sys.argv[0]}: checks the maximum runtime over all runs for an "
+          "experiment")
+    print("usage:")
+    print(f"\t{sys.argv[0]} path/to/data.pkl")
+    sys.exit(1)
+
+f = sys.argv[1]
+with open(f, "rb") as infile:
+    data = pickle.load(infile)
+
+time = []
+for hyper in data["experiment_data"]:
+    for run in data["experiment_data"][hyper]["runs"]:
+        total = run["train_time"] + run["eval_time"]
+        time.append(total)
+
+print("Maximum run time (hours):", np.max(time) / 3600)
diff --git a/utils/plot_hypers.py b/utils/plot_hypers.py
new file mode 100644
index 0000000..3423907
--- /dev/null
+++ b/utils/plot_hypers.py
@@ -0,0 +1,103 @@
+import pickle
+import functools
+from tqdm import tqdm
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import scipy
+import json
+import sys
+import seaborn as sns
+import plot_utils as plot
+import matplotlib as mpl
+import experiment_utils as exp
+import hypers
+
+
+# Place environment name with type of environment in type_map so that we know
+# how to plot/evaluate. This terrible code-style is due to legacy code which
+# badly needs to be fixed.
+CONTINUING = "continuing"
+EPISODIC = "episodic"
+type_map = {
+    "MinAtarBreakout": EPISODIC,
+    "MinAtarFreeway": EPISODIC,
+    "PendulumFixed-v0": EPISODIC,
+    "Acrobot-v1": EPISODIC,
+    "BipedalWalker-v3": EPISODIC,
+    "LunarLanderContinuous-v2": EPISODIC,
+    "Bimodal1DEnv": CONTINUING,
+    "Hopper-v2": EPISODIC,
+    "PuddleWorld-v1": EPISODIC,
+    "MountainCar-v0": EPISODIC,
+    "MountainCarContinuous-v0": EPISODIC,
+    "Pendulum-v0": CONTINUING,
+    "Pendulum-v1": CONTINUING,
+    "Walker2d": EPISODIC,
+    "Swimmer-v2": EPISODIC
+}
+
+if len(sys.argv) < 5:
+    print("invalid number of inputs:")
+    print(f"\t{sys.argv[0]} env_json save_dir hyper agent_json(s)")
+    sys.exit(1)
+
+env_json = sys.argv[1]
+DIR = sys.argv[2]
+HYPER = sys.argv[3]
+agent_json = sys.argv[4:]
+
+# Load configuration files
+with open(env_json, "r") as infile:
+    env_config = json.load(infile)
+if "gamma" not in env_config:
+    env_config["gamma"] = -1
+
+agent_configs = []
+for j in agent_json:
+    with open(j, "r") as infile:
+        agent_configs.append(json.load(infile))
+
+ENV = env_config["env_name"]
+ENV_TYPE = type_map[ENV]
+PERFORMANCE_METRIC_TYPE = "train"
+DATA_FILE = "data.pkl"
+
+
+# Script
+DATA_FILES = []
+for config in agent_configs:
+    agent = config["agent_name"]
+    if DIR:
+        DATA_FILES.append(f"./results/{DIR}/{ENV}_{agent}results")
+    else:
+        DATA_FILES.append(f"./results/{ENV}_{agent}results")
+
+DATA = []
+for f in DATA_FILES:
+    with open(os.path.join(f, DATA_FILE), "rb") as infile:
+        DATA.append(pickle.load(infile))
+
+# Generate labels for plots
+labels = []
+for ag in DATA:
+    labels.append([ag["experiment"]["agent"]["agent_name"]])
+colours = [["#003f5c"], ["#bc5090"], ["#ffa600"], ["#ff6361"], ["#58cfa1"]]
+
+# Plot the hyperparameter sensitivities
+all_fig, all_ax = plot.hyper_sensitivity(DATA, HYPER)
+
+# Adjust axis spines
+all_ax.spines['top'].set_visible(False)
+all_ax.spines['right'].set_visible(False)
+all_ax.spines['bottom'].set_linewidth(2)
+all_ax.spines['left'].set_linewidth(2)
+
+# Set title and legend
+all_ax.set_title(
+    HYPER + " " + os.path.splitext(os.path.basename(env_json))[0]
+)
+all_ax.legend()
+
+all_fig.savefig(
+    f"{os.path.expanduser('~')}/{ENV}_{HYPER}.png",
+    bbox_inches="tight",
+)
+exit(0)
diff --git a/utils/plot_mse.py b/utils/plot_mse.py
new file mode 100644
index 0000000..f705bf4
--- /dev/null
+++ b/utils/plot_mse.py
@@ -0,0 +1,117 @@
+import pickle
+import seaborn as sns
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import hypers
+import json
+import sys
+import plot_utils as plot
+import matplotlib as mpl
+mpl.rcParams["font.size"] = 24
+mpl.rcParams["svg.fonttype"] = "none"
+
+
+# Place environment name with type of environment in type_map so that we know
+# how to plot/evaluate. This terrible code-style is due to legacy code which
+# badly needs to be fixed.
+CONTINUING = "continuing"
+EPISODIC = "episodic"
+type_map = {
+    "MinAtarBreakout": EPISODIC,
+    "MinAtarFreeway": EPISODIC,
+    "LunarLanderContinuous-v2": EPISODIC,
+    "Bimodal3Env": CONTINUING,
+    "Bimodal2DEnv": CONTINUING,
+    "Bimodal1DEnv": CONTINUING,
+    "BipedalWalker-v3": EPISODIC,
+    "Hopper-v2": EPISODIC,
+    "PuddleWorld-v1": EPISODIC,
+    "MountainCar-v0": EPISODIC,
+    "MountainCarContinuous-v0": EPISODIC,
+    "PendulumFixed-v0": CONTINUING,
+    "Pendulum-v0": CONTINUING,
+    "Acrobot-v1": EPISODIC,
+    "Pendulum-v1": CONTINUING,
+    "Walker2d": EPISODIC,
+    "Swimmer-v2": EPISODIC
+    }
+
+if len(sys.argv) < 4:
+    raise ValueError("""invalid arguments, call ./plot_mse.py
+                     path/to/env_config results_subdir
+                     path/to/agent_config(s)
+                     """)
+env_json = sys.argv[1]
+DIR = sys.argv[2]
+agent_json = sys.argv[3:]
+
+# Load configuration files
+with open(env_json, "r") as infile:
+    env_config = json.load(infile)
+agent_configs = []
+for j in agent_json:
+    with open(j, "r") as infile:
+        agent_configs.append(json.load(infile))
+
+ENV = env_config["env_name"]
+ENV_TYPE = type_map[ENV]
+PERFORMANCE_METRIC_TYPE = "train"
+DATA_FILE = "data.pkl"
+
+
+# Script
+DATA_FILES = []
+for config in agent_configs:
+    agent = config["agent_name"]
+    if DIR:
+        DATA_FILES.append(f"./results/{DIR}/{ENV}_{agent}results")
+    else:
+        DATA_FILES.append(f"./results/{ENV}_{agent}results")
+
+DATA = []
+for f in DATA_FILES:
+    print(f"Opening file: {f}")
+    with open(os.path.join(f, DATA_FILE), "rb") as infile:
+        DATA.append(pickle.load(infile))
+
+# Find best hypers
+BEST_IND = []
+for agent in DATA:
+    best_hp = hypers.best(agent)[0]
+    BEST_IND.append(best_hp)
+
+# Generate labels for plots
+labels = []
+for ag in DATA:
+    labels.append([ag["experiment"]["agent"]["agent_name"]])
+
+CMAP = "tab10"
+colours = list(sns.color_palette(CMAP, 8).as_hex())
+colours = list(map(lambda x: [x], colours))
+plt.rcParams["axes.prop_cycle"] = mpl.cycler(color=sns.color_palette(CMAP))
+
+# Plot the mean + standard error
+print("=== Plotting mean with standard error")
+PLOT_TYPE = "train"
+SOLVED = 0
+TYPE = "online" if PLOT_TYPE == "train" else "offline"
+best_ind = list(map(lambda x: [x], BEST_IND))
+
+plot_labels = list(map(lambda x: x[0], labels))  # Adjust labels for plot
+fig, ax = plot.mean_with_stderr(
+    DATA,
+    PLOT_TYPE,
+    best_ind,
+    [5000]*len(best_ind),
+    plot_labels,
+    env_type="episodic",
+    figsize=(16, 16),
+    colours=colours,
+)
+ax.set_title(ENV)
+
+fig.savefig(
+    f"{os.path.expanduser('~')}/{ENV}.png",
+    bbox_inches="tight",
+)
diff --git a/utils/plot_runs_separate.py b/utils/plot_runs_separate.py
new file mode 100644
index 0000000..9259ad4
--- /dev/null
+++ b/utils/plot_runs_separate.py
@@ -0,0 +1,232 @@
+# Plot each separate run on a different sub-axis, ordered by AUC
+
+import pickle
+from math import ceil
+import seaborn as sns
+import functools
+from tqdm import tqdm
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import scipy
+import json
+import sys
+import plot_utils as plot
+import matplotlib as mpl
+mpl.rcParams["font.size"] = 24
+
+try:
+    import hypers
+    import runs
+except ModuleNotFoundError:
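+    # Fall back to package-qualified imports when this script is run from the
+    # repository root rather than from within utils/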
+    import utils.hypers as hypers
+    import utils.runs as runs
+
+# Set up plots
+params = {
+    'axes.labelsize': 8,
+    'axes.titlesize': 32,
+    'legend.fontsize': 16,
+    'xtick.labelsize': 24,
+    'ytick.labelsize': 24
+}
+plt.rcParams.update(params)
+
+plt.rc('text', usetex=False)  # You might want usetex=True to get DejaVu Sans
+plt.rc('font', **{'family': 'sans-serif', 'serif': ['DejaVu Sans']})
+plt.rcParams["font.family"] = "DejaVu Sans"
+plt.rcParams.update({'font.size': 32})
+plt.tick_params(top=False, right=False, labelsize=24)
+
+mpl.rcParams["svg.fonttype"] = "none"
+
+if len(sys.argv) != 4:
+    raise ValueError("""should run ./plot_runs_separate.py
+                     path/to/env_config save/dir path/to/agent_config
+                     """)
+
+env_json = sys.argv[1]
+DIR = sys.argv[2]
+agent_json = sys.argv[3]
+
+
+def get_y_bounds(env, per_env_tuning):
+    """
+    Get the bounds for the y-axis plots on `env` given that `per_env_tuning`
+    determines whether we are tuning per environment or across environments.
+    """
+    if per_env_tuning:
+        if "mountaincar" in env.lower():
+            return (-1000, -50)
+        elif "acrobot" in env.lower():
+            return (-1000, -50)
+        elif "pendulum" in env.lower():
+            return (-1000, 1000)
+    else:
+        if "mountaincar" in env.lower():
+            return (-1000, -50)
+        elif "acrobot" in env.lower():
+            return (-1000, -50)
+        elif "pendulum" in env.lower():
+            return (-1000, 950)
+
+    if "breakout" in env.lower():
+        return (0, 25)
+
+
+# Load configuration files
+with open(env_json, "r") as infile:
+    env_config = json.load(infile)
+
+with open(agent_json, "r") as infile:
+    agent_config = json.load(infile)
+agent = agent_config["agent_name"]
+
+ENV = env_config["env_name"]
+
+# The next lines are only needed if using the ICML data; comment them out
+# otherwise
+if agent == "GreedyAC":
+    agent = "cem"
+elif agent == "GreedyACSoftmax":
+    agent = "cem_softmax"
+
+if ENV == "Pendulum-v0":
+    env = "PendulumFixed-v0"
+else:
+    env = ENV
+
+if ENV == "MountainCarContinuous-v0":
+    env_config["env_name"] = "MountainCar-v0"
+    env_config["continuous"] = True
+    ENV = env_config["env_name"]
+
+# Script
+if DIR:
+    data_file = f"./results/{DIR}/{env}_{agent}results/data.pkl"
+else:
+    data_file = f"./results/{env}_{agent}results/data.pkl"
+with open(data_file, "rb") as infile:
+    data = pickle.load(infile)
+
+# Find best hypers
+# #################################
+# For new runs
+# #################################
+best_hp = hypers.best(data)[0]
+per_env_tuning = True
+
+# Expand data to ensure episodic environments have the same number of data
+# points per run
+if "pendulum" not in ENV.lower():
+    data = runs.expand_episodes(data, best_hp)
+low_x = 0
+if "pendulum" not in ENV.lower():
+    high_x = np.cumsum(
+        data["experiment_data"][best_hp]["runs"][0]["train_episode_steps"]
+    )[-1]
+else:
+    high_x = len(
+        data["experiment_data"][best_hp]["runs"][0]["train_episode_steps"]
+    )
+
+# Go through and get the list of run indices, ordered by AUC
+num_runs = list(range(len(data["experiment_data"][best_hp]["runs"])))
+auc = hypers.get_performance(data, best_hp, repeat=False).mean(axis=-1)
+order = np.argsort(auc)
+
+# Figure out the number of rows and columns for the subplots
+num_plots = len(num_runs)
+COLS = 4
+ROWS = max(1, ceil(num_plots / COLS))
+fig = plt.figure(figsize=(7 * COLS, 4.8 * ROWS), constrained_layout=True)
+spec = fig.add_gridspec(ROWS, COLS)
+
+# Plot
+low_y, high_y = get_y_bounds(ENV, per_env_tuning)
+returns = []
+for i, run_num in enumerate(order):
+    run = data["experiment_data"][best_hp]["runs"][run_num]
+
+    # Figure out which row and column of the subplots we are on
+    y = i //
COLS + x = i - y * COLS + ax = fig.add_subplot(spec[y, x]) + + if "pendulum" not in ENV.lower(): + # If an episodic environment, ignore the last episode, since it will be + # cut off. We actually are cutting off too much here, but the + # alternative is to iterate over the entire data set twice, since we + # need to find the maximum steps for the last episode. + cutoff = env_config["steps_per_episode"] + performance = run["train_episode_rewards"][:-cutoff] + else: + performance = run["train_episode_rewards"] + + ax.plot( + performance, + label=f"Run {i}", + linewidth=2.5, + color="#007bff", + ) + + # Only set x ticks for bottom row + if y == ROWS-1: + ax.set_xticks([low_x, high_x]) + else: + ax.set_xticks([]) + + # Only set y ticks for leftmost column + if x == 0: + ax.set_yticks(get_y_bounds(ENV, per_env_tuning)) + else: + ax.set_yticks([]) + + # Set axis title and bounds + ax.set_title(f"Run {i}") + ax.set_xlim(low_x, high_x) + ax.set_ylim(low_y-10, high_y+10) + + # Adjust axis spines + ax.spines['top'].set_visible(False) + ax.spines['right'].set_visible(False) + ax.spines['bottom'].set_linewidth(2) + ax.spines['left'].set_linewidth(2) + + returns.append(performance) + +# Calculate returns and stderr of returns +returns = np.array(returns) +mean = returns.mean(axis=0) +stderr = np.std(returns, axis=0, ddof=1) +stderr /= np.sqrt(returns.shape[0]) + +ax = fig.add_subplot(spec[:, COLS-1]) +ax.fill_between( + np.arange(mean.shape[-1]), + mean-stderr, + mean+stderr, + alpha=0.1, + color="#161c1e", +) +ax.plot(mean, label="Mean", linewidth=3.0, color="#161c1e") + +# Set title and axes limits +ax.set_title("Mean") +ax.set_xlim(low_x, high_x) +ax.set_ylim(low_y-10, high_y+10) +ax.set_yticks(get_y_bounds(ENV, per_env_tuning)) +ax.set_xticks([low_x, high_x]) + +# Adjust axis spines +ax.spines['top'].set_visible(False) +ax.spines['right'].set_visible(False) +ax.spines['bottom'].set_linewidth(2) +ax.spines['left'].set_linewidth(2) + +# Add the figure title +fig.suptitle(ENV) + +fig.savefig( + f"{os.path.expanduser('~')}/{ENV}_{agent}_runs.png", + bbox_inches="tight", +) diff --git a/utils/plot_utils.py b/utils/plot_utils.py new file mode 100644 index 0000000..c572561 --- /dev/null +++ b/utils/plot_utils.py @@ -0,0 +1,1456 @@ +# Import modules +import matplotlib.pyplot as plt +from matplotlib.lines import Line2D +from matplotlib import ticker, gridspec +import experiment_utils as exp +import numpy as np +from scipy import ndimage +import seaborn as sns +from collections.abc import Iterable +import pickle +import matplotlib as mpl +import hypers +import warnings +import runs + +TRAIN = "train" +EVAL = "eval" + + +# Set up plots +params = { + 'axes.labelsize': 48, + 'axes.titlesize': 36, + 'legend.fontsize': 16, + 'xtick.labelsize': 48, + 'ytick.labelsize': 48 +} +plt.rcParams.update(params) + +plt.rc('text', usetex=False) # You might want usetex=True to get DejaVu Sans +plt.rc('font', **{'family': 'sans-serif', 'serif': ['DejaVu Sans']}) +plt.rcParams["font.family"] = "DejaVu Sans" +plt.rcParams.update({'font.size': 15}) +plt.tick_params(top=False, right=False, labelsize=20) + +mpl.rcParams["svg.fonttype"] = "none" + + +# Constants +EPISODIC = "episodic" +CONTINUING = "continuing" + + +# Colours +CMAP = "tab10" +DEFAULT_COLOURS = list(sns.color_palette(CMAP, 6).as_hex()) +plt.rcParams["axes.prop_cycle"] = mpl.cycler(color=sns.color_palette(CMAP)) +OFFSET = 0 # The offset to start in DEFAULT_COLOURS + + +def episode_steps(data, type_, ind, labels, xlim=None, + ylim=None, colours=None, 
xlabel="episodes", + ylabel="steps to goal", figsize=(16, 9), + title="Steps to Goal", α=0.2): + """ + Plot steps per episode + + Parameters + ---------- + data : TODO + type_ : TODO + ind : TODO + smooth_over : TODO + labels : TODO + xlim : TODO, optional + ylim : TODO, optional + colours : TODO, optional + xlabel : TODO, optional + ylabel : TODO, optional + + Returns + ------- + TODO + + """ + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + fig, ax = _setup_fig(None, None, figsize, xlim=xlim, ylim=ylim, + xlabel=xlabel, ylabel=ylabel, title=title) + + # For a single dict, then many + for i in range(len(data)): + for j in range(len(ind[i])): + _episode_steps(data[i], type_, ind[i][j], colours[i][j], + labels[i], ax, α) + + ax.legend() + return fig, ax + + +def _episode_steps(data, type_, ind, colour, label, ax, α=0.2): + """ + Plot steps per episode + + Parameters + ---------- + data : TODO + type_ : TODO + ind : TODO + smooth_over : TODO + label : TODO + xlim : TODO, optional + ylim : TODO, optional + colours : TODO, optional + xlabel : TODO, optional + ylabel : TODO, optional + + Returns + ------- + TODO + + """ + key = type_ + "_episode_steps" + + # For a single dict, then many + steps_per_run = [] + lengths = [] + for run in data["experiment_data"][ind]["runs"]: + steps_per_run.append(run[key]) + lengths.append(len(steps_per_run[-1])) + + # Adjust the lengths of each run so that there are a consistent number of + # episodes in each row + min_length = min(lengths) + for i in range(len(steps_per_run)): + steps_per_run[i] = steps_per_run[i][0:min_length] + steps_per_run = np.array(steps_per_run) + + mean = steps_per_run.mean(axis=0) + std_err = np.std(steps_per_run, axis=0, ddof=1) / \ + np.sqrt(steps_per_run.shape[0]) + + print(f"Final steps to goal for {label}:", mean[-1]) + + _plot_shaded(ax, np.arange(mean.shape[0]), mean, std_err, colour, + label, α) + + +def hyper_sensitivity(data_dicts, hyper, type_=TRAIN, figsize=(16, 9), + labels=None, metric="return"): + """ + Plots the hyperparameter sensitivity curves + + Parameters + ---------- + data_dicts : list[dict] + A list of data dictionaries resulting from some experiments + hyper : str + The hyper to plot the sensitivity of + type_ : str + The type of data to plot, one of train or eval + figsize : tuple[int] + The figure size + labels : list[str] + A list of labels, of the same length as data_dicts. If None, then the + agent name is used + metric : str + The metric to gauge sensitivity by, one of 'return', 'steps' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes plotted on + """ + fig = plt.figure(figsize=figsize) + ax = fig.add_subplot() + + if type_ not in (TRAIN, EVAL): + raise ValueError(f"type_ must be one of '{TRAIN}', '{EVAL}'") + + metric = metric.lower() + if metric not in ("return", "steps"): + raise ValueError(f"metric must be one of 'return', 'steps'") + + key = type_ + "_episode_" + ("rewards" if metric == "return" else "steps") + + for j, ag in enumerate(data_dicts): + config = ag["experiment"]["agent"] + num_settings = hypers.total(config["parameters"]) + hps = config["parameters"][hyper] + + max_returns = [None] * len(hps) + max_inds = [-1] * len(hps) + runs = -1 + + for i in range(num_settings): + setting = hypers.sweeps(config["parameters"], i)[0] + ind = hps.index(setting[hyper]) + + # Get the average return for the run. 
If no such data exists, we + # assume that the agent diverged and we give it minimum performance + if i not in ag["experiment_data"].keys(): + avg_return = np.finfo(np.float64).min + continue + + # Store the total number of runs for each setting, which will be + # needed in the final loop + if len(ag["experiment_data"][i]["runs"]) > runs: + runs = len(ag["experiment_data"][i]["runs"]) + + avg_return = [] + for run in ag["experiment_data"][i]["runs"]: + avg_return.append(run[key]) + + avg_run_return = [np.mean(run) for run in avg_return] + avg_return = np.mean(avg_run_return) + + if max_returns[ind] is None or ( + metric == "return" and avg_return > max_returns[ind]) or ( + metric == "steps" and avg_return < max_returns[ind]): + max_inds[ind] = i + max_returns[ind] = avg_return + + # Go through each best hyper and get the mean performance + std err + # per run. If no data exists due to divergence, then just append nans + returns = [] + for index in max_inds: + if index not in ag["experiment_data"]: + returns.append([np.nan] * runs) + continue + + index_returns = [] + for run in ag["experiment_data"][index]["runs"]: + index_returns.append(run[key].mean()) + returns.append(index_returns) + + # Warn the user if some hyper setting does not have the expected + # number of runs + n = len(index_returns) + if n != runs: + warnings.warn(f"hyper setting {index} has only {n} " + + f"runs when {runs} runs expected") + + # To deal with hyper settings which don't have the full number of runs, + # we take each mean and standard error separately before adding to an + # array. + mean = np.array([np.mean(r) for r in returns]) + std_err = np.array([np.std(r, ddof=1) / len(r) for r in returns]) + + ag_name = ag["experiment"]["agent"]["agent_name"] + + # Any runs that failed due to invalid hypers and resulted in nans + # should have low performance. We make it 10 * lower than the lowest + # performance + std_err[np.where(np.isnan(std_err))] = 0 + min_ = np.min(mean[np.where(~np.isnan(mean))]) + mean[np.where(np.isnan(mean))] = min_ * (10 if min_ < 0 else 0.1) + + if not labels: + label = ag_name + else: + label = labels[j] + ax.plot(hps, mean, label=label) + ax.fill_between(hps, mean-std_err, mean+std_err, alpha=0.1) + + ylabel = "Steps to Goal" if metric == "steps" else "Average Return" + ax.set_ylabel(ylabel) + ax.set_xlabel(hyper) + + return fig, ax + + +def mean_with_bootstrap_conf(data, type_, ind, smooth_over, names, + fig=None, ax=None, figsize=(12, 6), + xlim=None, ylim=None, alpha=0.1, + colours=None, env_type="continuing", + significance=0.05, keep_shape=False, + xlabel=None, ylabel=None): + """ + Plots the average training or evaluation return over all runs with + confidence intervals. + + Given a list of data dictionaries of the form returned by main.py, this + function will plot each episodic return for the list of hyperparameter + settings ind each data dictionary. The ind argument is a list, where each + element is a list of hyperparameter settings to plot for the data + dictionary at the same index as this list. For example, if ind[i] = [1, 2], + then plots will be generated for the data dictionary at location i + in the data argument for hyperparameter settings ind[i] = [1, 2]. 
+ The smooth_over argument tells how many previous data points to smooth + over + + Parameters + ---------- + data : list of dict + The Python data dictionaries generated from running main.py for the + agents + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of iter of int + The list of lists of hyperparameter settings indices to plot for + each agent. For example [[1, 2], [3, 4]] means that the first agent + plots will use hyperparameter settings indices 1 and 2, while the + second will use 3 and 4. + smooth_over : list of int + The number of previous data points to smooth over for the agent's + plot for each data dictionary. Note that this is *not* the number of + timesteps to smooth over, but rather the number of data points to + smooth over. For example, if you save the return every 1,000 + timesteps, then setting this value to 15 will smooth over the last + 15 readings, or 15,000 timesteps. For example, [1, 2] will mean that + the plots using the first data dictionary will smooth over the past 1 + data points, while the second will smooth over the passed 2 data + points for each hyperparameter setting. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + names : list of str + The name of the agents, used for the legend + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of list of str + The colours to use for each hyperparameter settings plot for each data + dictionary + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + """ + fig, ax = _setup_fig(fig, ax, figsize, None, xlim, ylim, xlabel, ylabel) + + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Track the total timesteps per hyperparam setting over all episodes and + # the cumulative timesteps per episode per data dictionary (timesteps + # should be consistent between all hp settings in a single data dict) + total_timesteps = [] + cumulative_timesteps = [] + + for i in range(len(data)): + if type_ == "train": + cumulative_timesteps.append(exp.get_cumulative_timesteps(data[i] + ["experiment_data"][ind[i][0]]["runs"] + [0]["train_episode_steps"])) + elif type_ == "eval": + cumulative_timesteps.append(data[i]["experiment_data"][ind[i][0]] + ["runs"][0]["timesteps_at_eval"]) + else: + raise ValueError("type_ must be one of 'train', 'eval'") + total_timesteps.append(cumulative_timesteps[-1][-1]) + + # Find the minimum of total trained-for timesteps. 
Each plot will only + # be plotted on the x-axis until this value + min_timesteps = min(total_timesteps) + + # For each data dictionary, find the minimum index where the timestep at + # that index is >= minimum timestep + ind_ge_min_timesteps = [] + for cumulative_timesteps_per_data in cumulative_timesteps: + final_ind = np.where(cumulative_timesteps_per_data >= + min_timesteps)[0][0] + # Since indexing will stop right before the minimum, increment it + ind_ge_min_timesteps.append(final_ind + 1) + + # Plot all data for all HP settings, only up until the minimum index + # fig, ax = None, None + if env_type == "continuing": + plot_fn = _plot_mean_with_conf_continuing + else: + plot_fn = _plot_mean_with_conf_episodic + + for i in range(len(data)): + fig, ax = \ + plot_fn(data=data[i], type_=type_, + ind=ind[i], smooth_over=smooth_over[i], name=names[i], + fig=fig, ax=ax, figsize=figsize, xlim=xlim, ylim=ylim, + last_ind=ind_ge_min_timesteps[i], alpha=alpha, + colours=colours[i], significance=significance, + keep_shape=keep_shape) + + return fig, ax + + +def _plot_mean_with_conf_continuing(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), name="", + last_ind=-1, xlabel="Timesteps", + ylabel="Average Return", xlim=None, + ylim=None, alpha=0.1, colours=None, + significance=0.05, keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a single + data dictionary on a continuing environment. Bootstrap confidence intervals + are plotted as shaded regions. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + last_ind : int, optional + The index of the last element to plot in the returns list, + by default -1. This is useful if you want to plot many things on the + same axis, but all of which have a different number of elements. This + way, we can plot the first last_ind elements of each returns for each + agent. + timestep_multiply : int, optional + A value to multiply each timstep by, by default 1. This is useful if + your agent does multiple updates per timestep and you want to plot + performance vs. number of updates. + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. 
By default + 'continuing' + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + # This should be the exact same as the episodic version except without + # reducing the episodes. Follow the same structure as the episodic function + # and the continuing function with standard error. + raise NotImplementedError + + +def _plot_mean_with_conf_episodic(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), name="", + last_ind=-1, xlabel="Timesteps", + ylabel="Average Return", xlim=None, + ylim=None, alpha=0.1, colours=None, + significance=0.05, keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a single + data dictionary on an episodic environment. Bootstrap confidence intervals + are plotted as shaded regions. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + last_ind : int, optional + The index of the last element to plot in the returns list, + by default -1. This is useful if you want to plot many things on the + same axis, but all of which have a different number of elements. This + way, we can plot the first last_ind elements of each returns for each + agent. + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. 
By default + 'continuing' + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + if colours is not None and len(colours) != len(ind): + raise ValueError("must have one colour for each hyperparameter " + + "setting") + + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + # if ax is None and fig is None: + # fig = plt.figure(figsize=figsize) + # ax = fig.add_subplot() + + # if xlim is not None: + # ax.set_xlim(xlim) + # if ylim is not None: + # ax.set_ylim(ylim) + + conf_level = "{:.2f}".format(1-significance) + title = f"""Average {type_.title()} Return per Run with {conf_level} + Confidence Intervals""" + fig, ax = _setup_fig(fig, ax, figsize, title, xlim, ylim, xlabel, ylabel) + + # Plot with bootstrap confidence interval + for i in range(len(ind)): + data = runs.expand_episodes(data, ind[i], type_=type_) + + _, mean, conf = exp.get_mean_err(data, type_, ind[i], smooth_over, + exp.bootstrap_conf, + err_args={ + "significance": significance, + }, + keep_shape=keep_shape) + + mean = mean[:last_ind] + conf = conf[:, :last_ind] + + episodes = np.arange(mean.shape[0]) + + # Plot based on colours + label = f"{name}" + print(mean.shape, conf.shape, episodes.shape) + _plot_shaded(ax, episodes, mean, conf, colours[i], label, alpha) + + ax.legend() + conf_level = "{:.2f}".format(1-significance) + ax.set_title(f"""Average {type_.title()} Return per Run with {conf_level} + Confidence Intervals""") + # ax.set_ylabel(ylabel) + # ax.set_xlabel(xlabel) + + fig.show() + return fig, ax + + +def plot_mean_with_runs(data, type_, ind, smooth_over, names, colours=None, + figsize=(12, 6), xlim=None, ylim=None, alpha=0.1, + plot_avg=True, env_type="continuing", + keep_shape=False, fig=None, ax=None): + """ + Plots the mean return over all runs and the return for each run for a list + of data dictionaries and hyperparameter indices + + Plots both the mean return per episode (over runs) as well as the return + for each individual run (including "mini-runs", if a set of concurrent + episodes were run for all runs, e.g. multiple evaluation episodes per + run at set intervals) + + Note that this function takes in a list of data dictionaries and will + plot the runs for each ind (which is a list of lists, where each super-list + refers to a data dictionary and each sub-list refers to the indices for + that data dictionary to plot). + + Example + ------- + plot_mean_with_runs([sac_data, linear_data], "train", [[3439], [38, 3]], + smooth_over=[5, 2], names=["SAC", "LAC"], figsize=(12, 6), alpha=0.2, + plot_avg=True, env_type="episodic") + + will plot hyperparameter index 3439 for the sac_data, smoothing over the + last 5 episodes, and the label will have the term "SAC" in it; also plots + the mean and each individual run on the linear_data for hyperparameter + settings 38 and 3, smoothing over the last 2 episodes for each and with + the term "LAC" in the labels. + + Parameters + ---------- + data : list of dict + The Python data dictionaries generated from running main.py for the + agents + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of iter of int + The list of lists of hyperparameter settings indices to plot for + each agent. 
For example [[1, 2], [3, 4]] means that the first agent + plots will use hyperparameter settings indices 1 and 2, while the + second will use 3 and 4. + smooth_over : list of int + The number of previous data points to smooth over for the agent's + plot for each data dictionary. Note that this is *not* the number of + timesteps to smooth over, but rather the number of data points to + smooth over. For example, if you save the return every 1,000 + timesteps, then setting this value to 15 will smooth over the last + 15 readings, or 15,000 timesteps. For example, [1, 2] will mean that + the plots using the first data dictionary will smooth over the past 1 + data points, while the second will smooth over the passed 2 data + points for each hyperparameter setting. + figsize : tuple(int, int) + The size of the figure to plot + names : list of str + The name of the agents, used for the legend + colours : list of list of str, optional + The colours to use for each hyperparameter settings plot for each data + dictionary, by default None + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha to use for plots of the runs, by default 0.1 + plot_avg : bool, optional + If concurrent episodes are executed in each run (e.g. multiple + evaluation episodes are run at set intervals), then whether to plot the + performance of each separately or the average performance over all + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + fig : plt.Figure + The figure to plot on + ax : plt.Axes + The axis to plot on + + Returns + ------- + tuple of plt.figure, plt.Axes + The figure and axis plotted on + """ + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + # fig = plt.figure(figsize=figsize) + # ax = fig.add_subplot() + if env_type == "continuing": + xlabel = "Timesteps" + ylabel = "Average Reward" + else: + xlabel = "Timesteps" + ylabel = "Return" + title = "Mean Return with Runs" + + fig, ax = _setup_fig(fig, ax, figsize, title, xlim, ylim, xlabel, ylabel) + + # Plot for each data dictionary given + legend_lines = [] + legend_labels = [] + for i in range(len(data)): + for _ in range(len(ind[i])): + fig, ax, labels, lines = \ + _plot_mean_with_runs(data[i], type_, ind[i], smooth_over[i], + names[i], colours[i], figsize, xlim, ylim, + alpha, plot_avg, env_type, fig, ax, + keep_shape) + + legend_lines.extend(lines) + legend_labels.extend(labels) + + ax.legend(legend_lines, legend_labels) + fig.show() + + return fig, ax + + +def _plot_mean_with_runs(data, type_, ind, smooth_over, name, colours=None, + figsize=(12, 6), xlim=None, ylim=None, alpha=0.1, + plot_avg=True, env_type="continuing", fig=None, + ax=None, keep_shape=False): + """ + Plots the mean return over all runs and the return for each run for a + single data dictionary and for each in a list of hyperparameter settings. + + Similar to plot_mean_with_runs, except that this function takes in only + a single data dictionary. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot for + each agent. For example [1, 2] means that the agent plots will use + hyperparameter settings indices 1 and 2. 
+ smooth_over : int + The number of previous data points to smooth over for the agent's + plot. Note that this is *not* the number of timesteps to smooth over, + but rather the number of data points to smooth over. For example, + if you save the return every 1,000 timesteps, then setting this value + to 15 will smooth over the last 15 readings, or 15,000 timesteps. + figsize : tuple(int, int) + The size of the figure to plot + name : str + The name of the agents, used for the legend + colours : list of list of str, optional + The colours to use for each hyperparameter settings plot for each data + dictionary, by default None + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha to use for plots of the runs, by default 0.1 + plot_avg : bool, optional + If concurrent episodes are executed in each run (e.g. multiple + evaluation episodes are run at set intervals), then whether to plot the + performance of each separately or the average performance over all + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + tuple of plt.figure, plt.Axes, list of str, list of mpl.Lines2D + The figure and axis plotted on as well as the list of strings to use + as labels and the list of lines to include in the legend + """ + # Set up figure and axis + fig, ax = _setup_fig(fig, ax, figsize, None, xlim, ylim) + + if colours is None: + colours = _get_default_colours(ind) + + # Store the info to keep in the legend + legend_labels = [] + legend_lines = [] + + # Plot each selected hyperparameter setting in the data dictionary + for j in range(len(ind)): + fig, ax, labels, lines = \ + _plot_mean_with_runs_single_hp(data, type_, ind[j], smooth_over, + name, colours[j], figsize, xlim, + ylim, alpha, plot_avg, env_type, + fig, ax, keep_shape) + legend_labels.extend(labels) + legend_lines.extend(lines) + + return fig, ax, legend_labels, legend_lines + + +def _plot_mean_with_runs_single_hp(data, type_, ind, smooth_over, names, + colour=None, figsize=(12, 6), xlim=None, + ylim=None, alpha=0.1, plot_avg=True, + env_type="continuing", fig=None, ax=None, + keep_shape=False): + """ + Plots the mean return over all runs and the return for each run for a + single data dictionary and a single hyperparameter setting. + + Similar to _plot_mean_with_runs, except that this function takes in only + a single data dictionary and a single hyperparameter setting. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : int + The hyperparameter settings indices to plot for the agent. For example + 5 means that the agent plots will use hyperparameter settings index 5. + smooth_over : int + The number of previous data points to smooth over for the agent's + plot. Note that this is *not* the number of timesteps to smooth over, + but rather the number of data points to smooth over. For example, + if you save the return every 1,000 timesteps, then setting this value + to 15 will smooth over the last 15 readings, or 15,000 timesteps. 
+ figsize : tuple(int, int) + The size of the figure to plot + name : str + The name of the agents, used for the legend + colours : list of list of str, optional + The colours to use for each hyperparameter settings plot for each data + dictionary, by default None + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha to use for plots of the runs, by default 0.1 + plot_avg : bool, optional + If concurrent episodes are executed in each run (e.g. multiple + evaluation episodes are run at set intervals), then whether to plot the + performance of each separately or the average performance over all + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + tuple of plt.figure, plt.Axes, list of str, list of mpl.Lines2D + The figure and axis plotted on as well as the list of strings to use + as labels and the list of lines to include in the legend + """ + # if env_type == "episodic": + # data = runs.expand_episodes(data, ind, type_=type_) + + # Set up figure and axis + fig, ax = _setup_fig(fig, ax, figsize, None, xlim, ylim) + + if colour is None: + colour = _get_default_colours([ind])[0] + + # Determine the timesteps to plot at + if type_ == "eval": + timesteps = \ + data["experiment_data"][ind]["runs"][0]["timesteps_at_eval"] + + elif type_ == "train": + timesteps_per_ep = \ + data["experiment_data"][ind]["runs"][0]["train_episode_steps"] + timesteps = exp.get_cumulative_timesteps(timesteps_per_ep) + + # Plot the average reward + if env_type == "continuing": + episode_steps = data["experiment"]["environment"]["steps_per_episode"] + + # Get returns + all_returns = exp.get_returns(data, type_, ind, env_type) + + # If concurrent episodes are run in each run then average them if + # appropriate + if type_ == "eval" and plot_avg: + all_returns = all_returns.mean(axis=-1) + elif type_ == "eval" and not plot_avg: + all_returns = np.concatenate(all_returns, axis=1) + all_returns = all_returns.transpose() + elif type_ == "train": + all_returns = np.squeeze(all_returns) + + # Smooth returns + all_returns = exp.smooth(all_returns, smooth_over, keep_shape) + + # Plot the average reward + if env_type == "continuing": + episode_steps = data["experiment"]["environment"] + episode_steps = episode_steps["steps_per_episode"] + # all_returns /= episode_steps + + # Determine whether to plot episodes or timesteps on the x-axis, which is + # dependent on the environment type + if env_type == "episodic": + xvalues = np.arange(all_returns.shape[1]) # episodes + else: + xvalues = timesteps[:all_returns.shape[1]] + + # Plot each run + for run in range(all_returns.shape[0]): + print(all_returns[run].shape) + ax.plot(xvalues, all_returns[run], color=colour, linestyle="-", + alpha=alpha) + + # Plot the mean + mean_colour = "black" + # mean = all_returns.mean(axis=0) + # ax.plot(xvalues, mean, color=mean_colour) + + # Store legend identifiers for the run + legend_labels = [] + legend_lines = [] + legend_labels.append("Individual Runs") + legend_lines.append(Line2D([0], [0], color=colour, linestyle="--", + alpha=alpha)) + + # Set up the legend variables for the mean over all runs + label = f"{names}" + legend_labels.append(label) + legend_lines.append(Line2D([0], [0], color=mean_colour, linestyle="-")) + + return fig, ax, legend_labels, legend_lines + + +def mean_with_stderr(data, type_, ind, smooth_over, names, + fig=None, 
ax=None, figsize=(12, 6), + xlim=None, ylim=None, alpha=0.1, + colours=None, env_type="continuing", + keep_shape=False, xlabel="", ylabel=""): + """ + Plots the average training or evaluation return over all runs with standard + error. + + Given a list of data dictionaries of the form returned by main.py, this + function will plot each episodic return for the list of hyperparameter + settings ind each data dictionary. The ind argument is a list, where each + element is a list of hyperparameter settings to plot for the data + dictionary at the same index as this list. For example, if ind[i] = [1, 2], + then plots will be generated for the data dictionary at location i + in the data argument for hyperparameter settings ind[i] = [1, 2]. + The smooth_over argument tells how many previous data points to smooth + over + + Parameters + ---------- + data : list of dict + The Python data dictionaries generated from running main.py for the + agents + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of iter of int + The list of lists of hyperparameter settings indices to plot for + each agent. For example [[1, 2], [3, 4]] means that the first agent + plots will use hyperparameter settings indices 1 and 2, while the + second will use 3 and 4. + smooth_over : list of int + The number of previous data points to smooth over for the agent's + plot for each data dictionary. Note that this is *not* the number of + timesteps to smooth over, but rather the number of data points to + smooth over. For example, if you save the return every 1,000 + timesteps, then setting this value to 15 will smooth over the last + 15 readings, or 15,000 timesteps. For example, [1, 2] will mean that + the plots using the first data dictionary will smooth over the past 1 + data points, while the second will smooth over the passed 2 data + points for each hyperparameter setting. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + names : list of str + The name of the agents, used for the legend + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of list of str + The colours to use for each hyperparameter settings plot for each data + dictionary + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. 
By default + 'continuing' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + """ + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + title = f"Average {type_.title()} Return per Run with Standard Error" + fig, ax = _setup_fig(fig, ax, figsize, xlim=xlim, ylim=ylim, xlabel=xlabel, + ylabel=ylabel, title=title) + + # Track the total timesteps per hyperparam setting over all episodes and + # the cumulative timesteps per episode per data dictionary (timesteps + # should be consistent between all hp settings in a single data dict) + total_timesteps = [] + cumulative_timesteps = [] + + for i in range(len(data)): + if type_ == "train": + cumulative_timesteps.append(exp.get_cumulative_timesteps(data[i] + ["experiment_data"][ind[i][0]]["runs"] + [0]["train_episode_steps"])) + elif type_ == "eval": + cumulative_timesteps.append(data[i]["experiment_data"][ind[i][0]] + ["runs"][0]["timesteps_at_eval"]) + else: + raise ValueError("type_ must be one of 'train', 'eval'") + total_timesteps.append(cumulative_timesteps[-1][-1]) + + # Find the minimum of total trained-for timesteps. Each plot will only + # be plotted on the x-axis until this value + min_timesteps = min(total_timesteps) + + # For each data dictionary, find the minimum index where the timestep at + # that index is >= minimum timestep + ind_ge_min_timesteps = [] + for cumulative_timesteps_per_data in cumulative_timesteps: + final_ind = np.where(cumulative_timesteps_per_data >= + min_timesteps)[0][0] + # Since indexing will stop right before the minimum, increment it + ind_ge_min_timesteps.append(final_ind + 1) + + # Plot all data for all HP settings, only up until the minimum index + # fig, ax = None, None + plot_fn = _plot_mean_with_stderr_continuing if env_type == "continuing" \ + else _plot_mean_with_stderr_episodic + for i in range(len(data)): + fig, ax = \ + plot_fn(data=data[i], type_=type_, + ind=ind[i], smooth_over=smooth_over[i], name=names[i], + fig=fig, ax=ax, figsize=figsize, xlim=xlim, ylim=ylim, + last_ind=ind_ge_min_timesteps[i], alpha=alpha, + colours=colours[i], keep_shape=keep_shape) + + return fig, ax + + +def _plot_mean_with_stderr_continuing(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), xlim=None, + ylim=None, xlabel=None, ylabel=None, + name="", last_ind=-1, + timestep_multiply=None, alpha=0.1, + colours=None, + keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a single + data dictionary on a continuing environment. Standard error + is plotted as shaded regions. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. 
If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + last_ind : int, optional + The index of the last element to plot in the returns list, + by default -1. This is useful if you want to plot many things on the + same axis, but all of which have a different number of elements. This + way, we can plot the first last_ind elements of each returns for each + agent. + timestep_multiply : array_like of float, optional + A value to multiply each timstep by, by default None. This is useful if + your agent does multiple updates per timestep and you want to plot + performance vs. number of updates. + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + if colours is not None and len(colours) != len(ind): + raise ValueError("must have one colour for each hyperparameter " + + "setting") + + if timestep_multiply is None: + timestep_multiply = [1] * len(ind) + + if ax is not None and fig is None: + raise ValueError("must pass figure when passing axis") + + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + if ax is None and fig is None: + title = f"Average {type_.title()} Return per Run with Standard Error" + fig, ax = _setup_fig(fig, ax, figsize, xlim=xlim, ylim=ylim, + xlabel=xlabel, ylabel=ylabel, title=title) + + episode_length = data["experiment"]["environment"]["steps_per_episode"] + + # Plot with the standard error + for i in range(len(ind)): + timesteps, mean, std = exp.get_mean_err(data, type_, ind[i], + smooth_over, exp.stderr, + keep_shape=keep_shape) + timesteps = np.array(timesteps[:last_ind]) * timestep_multiply[i] + # mean = mean[:last_ind] / episode_length + # std = std[:last_ind] / episode_length + + # Plot based on colours + label = f"{name}" + if colours is not None: + _plot_shaded(ax, timesteps, mean, std, colours[i], label, alpha) + else: + _plot_shaded(ax, timesteps, mean, std, None, label, alpha) + + ax.legend() + + fig.show() + return fig, ax + + +def _plot_mean_with_stderr_episodic(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), name="", + last_ind=-1, xlabel="Timesteps", + ylabel="Average Return", xlim=None, + ylim=None, alpha=0.1, colours=None, + keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a + single data dictionary on an episodic environment. Plots shaded retions + as standard error. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. 
Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + if colours is not None and len(colours) != len(ind): + raise ValueError("must have one colour for each hyperparameter " + + "setting") + + if ax is not None and fig is None: + raise ValueError("must pass figure when passing axis") + + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + if ax is None and fig is None: + fig = plt.figure(figsize=figsize) + ax = fig.add_subplot() + + if xlim is not None: + ax.set_xlim(xlim) + if ylim is not None: + ax.set_ylim(ylim) + + # Plot with the standard error + for i in range(len(ind)): + # data = exp.reduce_episodes(data, ind[i], type_=type_) + data = runs.expand_episodes(data, ind[i], type_=type_) + + # data has consistent # of episodes, so treat as env_type="continuing" + _, mean, std = exp.get_mean_err(data, type_, ind[i], smooth_over, + exp.stderr, keep_shape=keep_shape) + print(mean.shape, std.shape, "HERE") + episodes = np.arange(mean.shape[0]) + print(mean.shape, episodes[0], episodes[-1]) + + # Plot based on colours + label = f"{name}" + if colours is not None: + _plot_shaded(ax, episodes, mean, std, colours[i], label, alpha) + else: + _plot_shaded(ax, episodes, mean, std, None, label, alpha) + + ax.legend() + ax.set_title(f"Average {type_.title()} Return per Run with Standard Error") + ax.set_ylabel(ylabel) + ax.set_xlabel(xlabel) + + fig.show() + return fig, ax + + +def return_distribution(data, type_, hp_ind, bins, figsize=(12, 6), xlim=None, + ylim=None, after=0, before=-1): + """ + Plots the distribution of returns on either an episodic or continuing + environment + + Parameters + ---------- + data : dict + The data dictionary containing the runs of a single hyperparameter + setting + type_ : str, optional + The type of surface to plot, by default "surface". One of 'surface', + 'wireframe', or 'bar' + hp_ind : int, optional + The hyperparameter settings index in the data dictionary to use for + the plot, by default -1. If less than 0, then the first hyperparameter + setting in the dictionary is used. + bins : Iterable, int + The bins to use for the plot. If an Iterable, then each value in the + Iterable is considered as a cutoff for bins. 
+        If an integer, separates the returns into that many bins
+    figsize : tuple, optional
+        The size of the figure to plot, by default (12, 6)
+    xlim : 2-tuple of float, optional
+        The cutoff points for the x-axis to plot between, by default None
+    ylim : 2-tuple of float, optional
+        The cutoff points for the y-axis to plot between, by default None
+    after : int, optional
+        The index of the first episode return in each run to include when
+        computing that run's average return, by default 0
+    before : int, optional
+        The index at which to stop including episode returns from each run,
+        by default -1
+
+    Returns
+    -------
+    plt.figure, plt.Axes
+        The figure and axis plotted on
+    """
+    # Get the episode returns for each run
+    run_returns = []
+    return_type = type_ + "_episode_rewards"
+    for run in data["experiment_data"][hp_ind]["runs"]:
+        run_returns.append(np.mean(run[return_type][after:before]))
+
+    title = f"Learning Curve Distribution - HP Settings {hp_ind}"
+    return _return_histogram(run_returns, bins, figsize, title, xlim, ylim)
+
+
+def _return_histogram(run_returns, bins, figsize, title, xlim, ylim, kde=True):
+    fig = plt.figure(figsize=figsize)
+    ax = fig.add_subplot()
+
+    ax.set_title(title)
+    ax.set_xlabel("Average Return Per Run")
+    ax.set_ylabel("Relative Frequency")
+    _ = sns.histplot(run_returns, bins=bins, kde=kde)
+
+    if xlim is not None:
+        ax.set_xlim(xlim)
+    if ylim is not None:
+        ax.set_ylim(ylim)
+
+    # Plot relative frequency on the y-axis
+    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:
+                                 "{:.2f}".format(x / len(run_returns))))
+
+    fig.show()
+    return fig, ax
+
+
+def _get_default_colours(iter_):
+    """
+    Recursively turns elements of an Iterable into strings representing
+    colours.
+
+    This function will turn each element of an Iterable into strings that
+    represent colours, recursively. If the elements of an Iterable are
+    also Iterable, then this function will recursively descend all the way
+    through every Iterable until it finds an Iterable with non-Iterable
+    elements. These elements will be replaced by strings that represent
+    colours. In effect, this function keeps the data structure, but replaces
+    non-Iterable elements by strings representing colours. Note that this
+    function assumes that all elements of an Iterable are of the same type,
+    and so it only checks if the first element of an Iterable object is
+    Iterable or not to stop the recursion.
+
+    Parameters
+    ----------
+    iter_ : collections.Iterable
+        The top-level Iterable object to turn into an Iterable of strings of
+        colours, recursively.
+
+    Returns
+    -------
+    list of list of ... of strings
+        A data structure that has the same architecture as the input Iterable
+        but with all non-Iterable elements replaced by strings.
+    """
+    colours = []
+
+    # Calculate the number of lists at the current level to go through
+    paths = range(len(iter_))
+
+    # Return a list of colours if the elements of the list are not lists
+    if not isinstance(iter_[0], Iterable):
+        global OFFSET
+        col = [DEFAULT_COLOURS[(OFFSET + i) % len(DEFAULT_COLOURS)]
+               for i in paths]
+        OFFSET += len(paths)
+        return col
+
+    # For each list at the current level, get the colours corresponding to
+    # this level
+    for i in paths:
+        colours.append(_get_default_colours(iter_[i]))
+
+    return colours
+
+
+def _plot_shaded(ax, x, y, region, colour, label, alpha):
+    """
+    Plots a curve with a shaded region.
+
+    Parameters
+    ----------
+    ax : plt.Axes
+        The axis to plot on
+    x : Iterable
+        The points on the x-axis
+    y : Iterable
+        The points on the y-axis
+    region : list or array_like
+        The region to shade about the y points. The shaded region will be
+        y +/- region. If region is a list or 1D np.ndarray, then the region
+        is used both for the + and - portions.
+        If region is a 2D np.ndarray, then the first row will be used as the
+        lower bound (-) and the second row will be used for the upper bound
+        (+). That is, the region between (lower_bound, upper_bound) will be
+        shaded, and there will be no subtraction/adding of the y-values.
+    colour : str
+        The colour to plot with
+    label : str
+        The label to use for the plot
+    alpha : float
+        The alpha value for the shaded region
+    """
+    if colour is None:
+        colour = DEFAULT_COLOURS[0]
+
+    ax.plot(x, y, color=colour, label=label)
+    if type(region) == list:
+        ax.fill_between(x, y-region, y+region, alpha=alpha, color=colour)
+    elif type(region) == np.ndarray and len(region.shape) == 1:
+        ax.fill_between(x, y-region, y+region, alpha=alpha, color=colour)
+    elif type(region) == np.ndarray and len(region.shape) == 2:
+        ax.fill_between(x, region[0, :], region[1, :], alpha=alpha,
+                        color=colour)
+
+
+def _setup_fig(fig, ax, figsize=None, title=None, xlim=None, ylim=None,
+               xlabel=None, ylabel=None, xscale=None, yscale=None, xbase=None,
+               ybase=None):
+    if fig is None:
+        if ax is not None:
+            raise ValueError("Must specify figure when axis given")
+        if figsize is not None:
+            fig = plt.figure(figsize=figsize)
+        else:
+            fig = plt.figure()
+
+    if ax is None:
+        ax = fig.add_subplot()
+
+    if title is not None:
+        ax.set_title(title)
+
+    if xlabel is not None:
+        ax.set_xlabel(xlabel)
+
+    if ylabel is not None:
+        ax.set_ylabel(ylabel)
+
+    if xlim is not None:
+        ax.set_xlim(xlim)
+
+    if ylim is not None:
+        ax.set_ylim(ylim)
+
+    if xscale is not None:
+        if xbase is not None:
+            ax.set_xscale(xscale, base=xbase)
+        else:
+            ax.set_xscale(xscale)
+
+    if yscale is not None:
+        if ybase is not None:
+            ax.set_yscale(yscale, base=ybase)
+        else:
+            ax.set_yscale(yscale)
+
+    return fig, ax
+
+
+def reset():
+    """
+    Resets the colours offset
+    """
+    global OFFSET
+    OFFSET = 0
diff --git a/utils/runs.py b/utils/runs.py
new file mode 100644
index 0000000..f9e4d96
--- /dev/null
+++ b/utils/runs.py
@@ -0,0 +1,184 @@
+import numpy as np
+from copy import deepcopy
+
+TRAIN = "train"
+EVAL = "eval"
+
+def episodes_to(in_data, i, type_=TRAIN):
+    """
+    Restricts the number of `type_` episodes to be from episode 0 to the
+    episode right before episode i.
+    The input data dictionary is not changed. If `type_` is 'train', then the
+    training returns are restricted to be only from episodes 0 to i and the
+    'eval' episodes are restricted to reflect this. If `type_` is 'eval', then
+    the evaluation returns are restricted to be only from episode 0 to i and
+    the 'train' returns are restricted to reflect this.
+
+    By 'restricted to reflect this', we mean that the returns are
+    restricted so that the final return is at the same timestep (or
+    nearest timestep, rounding up to episode completion) as the
+    final timestep of episode i for the data of type `type_`.
+
+
+    Parameters
+    ----------
+    in_data : dict
+        The data dictionary
+    i : int
+        The episode to restrict values to
+    type_ : str
+        The type of data to restrict to be from episode 0 to i. One
+        of 'train', 'eval'.
+
+    Returns
+    -------
+    dict
+        The modified data dictionary
+    """
+    data = deepcopy(in_data)
+
+    if type_ not in (TRAIN, EVAL):
+        raise ValueError("type_ must be one of 'train', 'eval'")
+
+    key = type_
+    other = "eval" if key == "train" else "train"
+
+    for hyper in data["experiment_data"]:
+        for j in range(len(data["experiment_data"][hyper]["runs"])):
+            run_data = data["experiment_data"][hyper]["runs"][j]
+
+            if i > len(run_data[f"{key}_episode_rewards"]):
+                last = len(run_data[f"{key}_episode_rewards"])
+                raise IndexError(f"no such episode i={i}, largest episode " +
+                                 f"index is {last}")
+
+            # Adjust the data of type `type_`
+            run_data[f"{key}_episode_rewards"] = run_data[
+                f"{key}_episode_rewards"][:i]
+
+            run_data[f"{key}_episode_steps"] = run_data[
+                f"{key}_episode_steps"][:i]
+
+            # Figure out which timestep episode i happened on
+            last_step = np.cumsum(run_data[f"{key}_episode_steps"])[-1]
+
+            # Figure out which episodes to keep of the "other" type (if type_
+            # is 'train' then other is 'eval' and vice versa)
+            to_discard = np.cumsum(run_data[f"{other}_episode_steps"]) \
+                > last_step
+
+            if len(to_discard):
+                last_other_step = np.argmax(to_discard)
+
+                # Adjust the "other" data
+                run_data[f"{other}_episode_rewards"] = run_data[
+                    f"{other}_episode_rewards"][:last_other_step]
+
+                run_data[f"{other}_episode_steps"] = run_data[
+                    f"{other}_episode_steps"][:last_other_step]
+            else:
+                # Adjust the "other" data
+                run_data[f"{other}_episode_rewards"] = []
+
+                run_data[f"{other}_episode_steps"] = []
+
+    return data
+
+
+def expand_episodes(data, ind, type_='train'):
+    """
+    For each run, repeat each episode's performance measure by how many
+    timesteps that episode took to finish. This results in episodic
+    experiments having the same number of data readings per run, so that
+    performances can be averaged over runs and can be easily plotted.
+
+    This function will modify a single run's data such that if you plotted
+    only that run's data, then it would appear as a step plot. For example,
+    if we had the following episode performances:
+
+    [100, 110]
+
+    with the following number of timesteps for each episode:
+
+    [2, 3]
+
+    Then this function will modify the data so that it looks like:
+
+    [100, 100, 110, 110, 110]
+
+    Parameters
+    ----------
+    data : dict
+        The data dictionary generated by the experiment
+    ind : int
+        The hyperparameter index to adjust
+    type_ : str
+        Which data type to adjust, one of 'train', 'eval'
+
+    Returns
+    -------
+    dict
+        The modified data dictionary
+    """
+    data = deepcopy(data)
+    runs = data["experiment_data"][ind]["runs"]
+    if type_ == "train":
+        for i in range(len(runs)):
+            run_return = []
+            for j in range(len(runs[i]["train_episode_rewards"])):
+                run_return.extend([runs[i]["train_episode_rewards"][j] for _ in
+                                   range(runs[i]["train_episode_steps"][j])])
+            data["experiment_data"][ind]["runs"][i][
+                "train_episode_rewards"] = run_return
+
+    elif type_ == "eval":
+        for i in range(len(runs)):
+            run_return = []
+            for j in range(len(runs[i]["eval_episode_rewards"])):
+                run_return.extend([runs[i]["eval_episode_rewards"][j] for _ in
+                                   range(runs[i]["eval_episode_steps"][j])])
+            data["experiment_data"][ind]["runs"][i][
+                "eval_episode_rewards"] = run_return
+
+    else:
+        raise ValueError(f"unknown type {type_}")
+    return data
+
+
+def reduce_episodes(data, ind, type_):
+    """
+    Reduce the number of episodes in an episodic setting
+
+    Given a data dictionary, this function will reduce the number of episodes
+    seen on each run to the minimum among all runs for that hyperparameter
+    settings index.
+    This is needed to plot curves by episodic return.
+
+    Parameters
+    ----------
+    data : dict
+        The Python data dictionary generated from running main.py
+    ind : int
+        The hyperparameter settings index to reduce the episodes of
+    type_ : str
+        Whether to reduce the training or evaluation returns, one of 'train',
+        'eval'
+
+    Returns
+    -------
+    dict
+        The modified data dictionary
+    """
+    data = deepcopy(data)
+    runs = data["experiment_data"][ind]["runs"]
+    episodes = []
+    if type_ == "train":
+        for run in data["experiment_data"][ind]["runs"]:
+            episodes.append(len(run["train_episode_rewards"]))
+
+        min_ = np.min(episodes)
+        for i in range(len(runs)):
+            runs[i]["train_episode_rewards"] = \
+                runs[i]["train_episode_rewards"][:min_]
+
+    elif type_ == "eval":
+        for run in data["experiment_data"][ind]["runs"]:
+            episodes.append(run["eval_episode_rewards"].shape[0])
+
+        min_ = np.min(episodes)
+
+        for i in range(len(runs)):
+            runs[i]["eval_episode_rewards"] = \
+                runs[i]["eval_episode_rewards"][:min_, :]
+
+    return data
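+
+
+if __name__ == "__main__":
+    # Illustrative sketch only, not part of the experiment pipeline: it shows
+    # what expand_episodes does to a small hand-built dictionary that mirrors
+    # the documented data format (experiment_data -> hyperparameter setting
+    # index -> runs). The names below are purely for demonstration.
+    _demo = {
+        "experiment_data": {
+            0: {
+                "runs": [{
+                    "train_episode_rewards": [100, 110],
+                    "train_episode_steps": [2, 3],
+                }],
+            },
+        },
+    }
+    _expanded = expand_episodes(_demo, ind=0, type_="train")
+    # Episode 0 lasted 2 timesteps and episode 1 lasted 3, so this prints
+    # [100, 100, 110, 110, 110]
+    print(_expanded["experiment_data"][0]["runs"][0]["train_episode_rewards"])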