diff --git a/README.md b/README.md index c20277f..aca4033 100644 --- a/README.md +++ b/README.md @@ -11,12 +11,12 @@ MinAtar is a testbed for AI agents which implements miniaturized versions of sev

-## Quick Start +## Standard Quick Start To use MinAtar, you need python3 installed; make sure pip is also up to date. To run the included `DQN` and `AC_lambda` examples, you need `PyTorch`. To install MinAtar, please follow the steps below: 1. Clone the repo: ```bash -git clone https://github.com/kenjyoung/MinAtar.git +git clone https://github.com/Robertboy18/MinAtar-Faster.git ``` If you prefer running MinAtar in a virtualenv, you can do the following before step 2: ```bash @@ -29,6 +29,7 @@ pip install --upgrade pip 2. Install MinAtar: ```bash pip install . +pip install -r requirements.txt ``` If you have any issues with automatic dependency installation, you can instead install the necessary dependencies manually and run ```bash @@ -55,6 +56,193 @@ Use the arrow keys to move and space bar to fire. Also, press q to quit and r to Also included in the examples directory are example implementations of DQN (dqn.py) and online actor-critic with eligibility traces (AC_lambda.py). +## Using the Optimized Code with Various Agents + +To run your first experiment: +``` +python3 main.py --agent-json config/agent/SAC.json --env-json config/environment/AcrobotContinuous-v1.json --index 0 +``` + +# Usage +The file main.py trains an agent for a specified number of runs, based on an environment and an agent configuration file found in config/environment/ and config/agent/, respectively. The data is saved in the results directory, with a name derived from the environment and agent names. + +For more information on how to use the main.py program, see the `--help` option: +``` +Usage: main.py [OPTIONS] + + Given agent and environment configuration files, run the experiment defined + by the configuration files + +Options: + --env-json TEXT Path to the environment json configuration file + [required] + --agent-json TEXT Path to the agent json configuration file [required] + --index INTEGER The index of the hyperparameter to run + -m, --monitor Whether or not to render the scene as the agent trains. + -a, --after INTEGER How many timesteps (training) should pass before + rendering the scene + --save-dir TEXT Which directory to save the results file in + --help Show this message and exit. +``` + +Example: +``` +./main.py --env-json config/environment/MountainCarContinuous-v0.json --agent-json config/agent/linearAC.json --index 0 --monitor --after 1000 +``` +will run the experiment using linear-Gaussian actor-critic on the mountain +car environment. The experiment is run on one process (serially), and the +scene is rendered after 1000 timesteps of training. We will only run the +hyperparameter setting with index 0. + +# Hyperparameter settings +The hyperparameter settings are laid out in the agent configuration files. +The files are laid out such that each hyperparameter is specified as a list of values, and the +total number of hyperparameter settings is the product of the lengths of each +of these lists. For example, if the agent config file looks like: +``` +{ + "agent_name": "linearAC", + "parameters": + { + "decay": [0.5], + "critic_lr": [0.005, 0.1, 0.3], + "actor_lr": [0.005, 0.1, 0.3], + "avg_reward_lr": [0.1, 0.3, 0.5, 0.9], + "scaled": [true], + "clip_stddev": [1000] + } +} +``` +then, there are `1 x 3 x 3 x 4 x 1 x 1 = 36` different hyperparameter +settings. Each hyperparameter setting is given a specific index.
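As a rough illustration of how an index maps to a concrete setting (including the modulo wrap-around described below), the sketch that follows decodes an index from the Cartesian product of the value lists. The helper name `decode_index` and the enumeration order are hypothetical; the repository's own code defines the actual ordering.

```python
# Hypothetical sketch only -- not the repository's actual implementation.
# Decodes a hyperparameter index into one concrete setting plus a run number.
import json
from itertools import product


def decode_index(agent_config_path, index):
    with open(agent_config_path) as f:
        config = json.load(f)

    names = list(config["parameters"].keys())
    value_lists = [config["parameters"][name] for name in names]

    combos = list(product(*value_lists))  # every hyperparameter combination
    total = len(combos)                   # 36 for the example config above
    setting = dict(zip(names, combos[index % total]))
    run = index // total                  # indices i, i + total, ... share the
                                          # setting but use different seeds
    return setting, run
```

With the example config above, `decode_index(path, 1)` and `decode_index(path, 37)` would return the same `setting` but different `run` values.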
For example +hyperparameter setting index `1` would have the following hyperparameters: +``` +{ + "agent_name": "linearAC", + "parameters": + { + "decay": 0.5, + "critic_lr": 0.005, + "actor_lr": 0.005, + "avg_reward_lr": 0.1, + "scaled": true, + "clip_stddev": 1000 + } +} +``` +The hyperparameter settings indices are actually implemented `mod x`, +where `x` is the maximum number of hyperparameter settings (in the example +above, `36`). So, in the example above, the hyperparameter settings with +indices `1, 37, 73, ...` all refer to the same hyperparameter setting since +`1 = 37 = 73 = ... mod 36`. The difference is that consecutive indices +use different seeds. So, each time we run experiments with hyperparameter +setting `1`, it will have the same seed. If we run with hyperparameter setting +`37`, it will use the same hyperparameter setting as `1`, but with a different +seed, and this seed will be the same every time we run the experiment with +hyperparameter setting `37`. This is the scheme Martha and her students +used in their Actor-Expert implementation, and it works +nicely for hyperparameter sweeps. + + +# Saved Data +Each experiment saves all the data as a Python dictionary. The dictionary is +designed so that we store all information about the experiment, including all +agent hyperparameters and environment settings, so that the experiment is +exactly reproducible. + +If the data dictionary is called `data`, then the main data for the experiment +is stored in `data["experiment_data"]`, which is a dictionary mapping from +hyperparameter settings indices to agent parameters and experiment runs. +`data["experiment_data"][i]["agent_params"]` is a dictionary storing the +agent's hyperparameters (hyperparameter settings index `i`) for the experiment. +`data["experiment_data"][i]["runs"]` is a list storing the runs for the +`i-th` hyperparameter setting. Each element of the list is a dictionary, giving +all the information for that run and hyperparameter setting. For example, +`data["experiment_data"][i]["runs"][j]` will give all the information on +the `j-th` run of hyperparameter setting `i`. + +Below is a tree diagram of the data structure: +``` +data +├─── "experiment" +│ ├─── "environment": environment configuration file +│ └─── "agent": agent configuration file +└─── "experiment_data": dictionary of hyperparameter setting *index* to runs +    ├─── "agent_params": the hyperparameter settings + └─── "runs": a list containing all the runs for this hyperparameter setting (each run is a dictionary of elements) + └─── index i: information on the ith run + ├─── "run_number": the run number + ├─── "random_seed": the random seed used for the run + ├─── "total_timesteps": the total number of timesteps in the run + ├─── "eval_interval_timesteps": the interval of timesteps to pass before running offline evaluation + ├─── "episodes_per_eval": the number of episodes run at each offline evaluation + ├─── "eval_episode_rewards": list of the returns (np.array) from each evaluation episode; if there are 10 episodes per eval, + │ then this will be a list of np.arrays where each np.array has 10 elements (one per eval episode) + ├─── "eval_episode_steps": the number of timesteps per evaluation episode, with the same form as "eval_episode_rewards" + ├─── "timesteps_at_eval": the number of training steps that passed at each evaluation. For example, if there were 10
For example, if there were 10 + │ offline evaluations, then this will be a list of 10 integers, each stating how many training steps passed before each + │ evaluation. + ├─── "train_episode_rewards": the return seen for each training episode + ├─── "train_episode_steps": the number of timesteps passed for each training episode + ├─── "train_time": the total amount of training time in seconds + ├─── "eval_time": the total amount of evaluation time in seconds + └─── "total_train_episodes": the total number of training episodes for the run +``` + +For example, here is `data["experiment_data"][i]["runs"][j]` for a mock run +of the Linear-Gaussian Actor-Critic agent on MountainCarContinuous-v0: +``` +{'random_seed': 0, + 'total_timesteps': 1000, + 'eval_interval_timesteps': 500, + 'episodes_per_eval': 10, + 'eval_episode_rewards': array([[-200., -200., -200., -200., -200., -200., -200., -200., -200., + -200.], + [-200., -200., -200., -200., -200., -200., -200., -200., -200., + -200.]]), + 'eval_episode_steps': array([[200, 200, 200, 200, 200, 200, 200, 200, 200, 200], + [200, 200, 200, 200, 200, 200, 200, 200, 200, 200]]), + 'timesteps_at_eval': array([ 0, 600]), + 'train_episode_steps': array([200, 200, 200, 200, 200]), + 'train_episode_rewards': array([-200., -200., -200., -200., -200.]), + 'train_time': 0.12098526954650879, + 'eval_time': 0.044415950775146484, + 'total_train_episodes': 5, + ...} +``` + +# Configuration files +Each configuration file is a JSON file and has a few properties. There +are also templates in each configuration directory for the files. + +## Environment Configuration File +``` +{ + "env_name": "environment filename without .json, all files refer to this as env_name", + "total_timesteps": "int - total timesteps for the entire run", + "steps_per_episode": "int - max number of steps per episode", + "eval_interval_timesteps": "int - interval of timesteps at which offline evaluation should be done", + "eval_episodes": "int - the number of offline episodes per evaluation", + "gamma": "float - the discount factor", +} +``` + +## Agent Configuration File +The agent configuration file is more general. The template is below. Since +both agents already have configuration files, there is not much need to add +any new configurations for agents. Instead, it would suffice to alter the +existing configuration files. The issue is that each agent has very different +configurations and hyperparameters, and so the config files are very different. +``` +{ + "agent_name": "filename without .json, all code refers to this as agent_name", + "parameters": + { + "parameter name": "list of values" + } +} +``` + ## OpenAI Gym Wrapper MinAtar now includes an OpenAI Gym plugin using the Gym plugin system. If a sufficiently recent version of OpenAI gym (`pip install gym==0.21.0` works) is installed, this plugin should be automatically available after installing MinAtar as normal. A gym environment can then be constructed as follows: ```bash diff --git a/agent/Random.py b/agent/Random.py new file mode 100644 index 0000000..0c1f751 --- /dev/null +++ b/agent/Random.py @@ -0,0 +1,83 @@ +#!/usr/bin/env python3 + +# Adapted from https://github.com/pranz24/pytorch-soft-actor-critic + +# Import modules +import torch +import numpy as np +from agent.baseAgent import BaseAgent + + +class Random(BaseAgent): + """ + Random implements a random policy. 
+ """ + def __init__(self, action_space, seed): + super().__init__() + self.batch = False + + self.action_dims = len(action_space.high) + self.action_low = action_space.low + self.action_high = action_space.high + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.policy = torch.distributions.Uniform( + torch.Tensor(action_space.low), torch.Tensor(action_space.high)) + + def sample_action(self, _): + """ + Samples an action from the agent + + Parameters + ---------- + _ : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + action = self.policy.sample() + + return action.detach().cpu().numpy() + + def sample_action_(self, _, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not affected by running + this function. + """ + return self.rng.uniform(self.action_low, self.action_high, + size=(size, self.action_dims)) + + def update(self, _, _1, _2, _3, _4): + pass + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + pass + + def train(self): + pass + + # Save model parameters + def save_model(self, _, _1="", _2=None, _3=None): + pass + + # Load model parameters + def load_model(self, _, _1): + pass + + def get_parameters(self): + pass diff --git a/agent/baseAgent.py b/agent/baseAgent.py new file mode 100644 index 0000000..3e7d78f --- /dev/null +++ b/agent/baseAgent.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python3 + +# Import modules +from abc import ABC, abstractmethod + +# TODO: Given a data dictionary generated by main, create a static +# function to initialize any agent based on this dict. Note that since the +# dict has the agent name, only one function is needed to create ANY agent +# we could also use the experiment util create_agent() function + + +class BaseAgent(ABC): + """ + Class BaseAgent implements the base functionality for all agents + + Attributes + ---------- + self.batch : bool + Whether or not the agent is using batch updates, by default False. + This is needed for the Experiment class to determine what to save + for update transitions. The Experiment class will save all transitions + used in updates, but if an agent performs batch updates and keeps + an experience replay buffer, then the Experiment object must + determine the transitions used in the update from the agent, and + not from the environment. If an agent is not using batch updates, it + is fully online and incremental, and so it must be using the + last environment transition for the update. + self.info : dict + A dictionary which records some chaning agent attributes during + training, if any. For example, this dictionary can be used to + keep track of the entropy in SAC during training. 
+ """ + def __init__(self): + """ + Constructor + """ + self.batch = False + self.info = {} + + """ + BaseAgent is the abstract base class for all agents + """ + @abstractmethod + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + pass + + @abstractmethod + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + + Return + ------ + 4-tuple of array_like + A tuple containing array_like, each of which contains the states, + actions, rewards, and next states used in the update + """ + pass + + @abstractmethod + def reset(self): + """ + Resets the agent between episodes + """ + pass + + @abstractmethod + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + pass + + @abstractmethod + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + pass + + @abstractmethod + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
+ + Returns + ------- + dict of str to int, float, array_like, and/or torch.Tensor + The agent's weights + """ + pass diff --git a/agent/linear/ESarsa.py b/agent/linear/ESarsa.py new file mode 100644 index 0000000..f18347e --- /dev/null +++ b/agent/linear/ESarsa.py @@ -0,0 +1,242 @@ +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from PyFixedReps import TileCoder +import time +import warnings +import inspect + + +class ESarsa(BaseAgent): + """ + Class Esarsa implements the Expected Sarsa(λ) algorithm + """ + def __init__(self, decay, lr, gamma, epsilon, + action_space, bins, num_tilings, env, seed=None, + trace_type="replacing", policy_type="εgreedy", + include_bias=True): + super().__init__() + self.batch = False + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self.random = np.random.default_rng(seed=int(seed)) + self.seed = seed + + # Needed so that when evaluating offline, we don't explore + self.is_training = True + + # Tile Coder + self.include_bias = include_bias + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self.tiler = TileCoder(params) + state_features = self.tiler.features() + self.include_bias + + # The weight parameters + self.actions = action_space.n + self.weights = np.zeros((self.actions, state_features)) + + # Set learning rates and other scaling factors + if decay < 0.0: + raise ValueError("cannot have trace decay rate < 0") + self.decay = decay + self.lr = lr / (num_tilings + self.include_bias) + self.gamma = gamma + self.epsilon = epsilon + print(self.lr) + + if policy_type not in ("εgreedy"): + raise ValueError("policy_type must be one of 'εgreedy'") + self.policy_type = policy_type + + # Eligibility traces + if trace_type not in ("accumulating", "replacing"): + raise ValueError("trace_type must be one of 'accumulating', " + + "'replacing'") + self.use_trace = decay > 0.0 + if self.use_trace: + self.trace = np.zeros_like(self.weights) + self.trace_type = trace_type + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + return self._sample_action(state) + + def _sample_action(self, state): + """ + Samples an action + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + # Take random action with probability ε and only if in training mode + if self.policy_type == "εgreedy" and self.epsilon != 0 and \ + self.is_training: + if self.random.uniform() < self.epsilon: + action = self.random.choice(self.actions) + return action + + state = self._tiler_indices(state) + action_vals = self.weights[:, state].sum(axis=1) + + if self.policy_type == "εgreedy": + # Choose maximum action + max_actions = np.where(action_vals == np.max(action_vals))[0] + if len(max_actions) > 1: + return max_actions[self.random.choice(len(max_actions))] + else: + return max_actions[0] + + else: + raise ValueError(f"unknown policy type {self.policy_type}") + + def _get_probs(self, state): + """ + Gets the probability of taking each action in 
state `state` + + Parameters + ---------- + state : np.array + The state observation, not tile-coded + + Returns + ------- + np.array[float] + The probabilities of taking each action in state `state` + """ + state = self._tiler_indices(state) + if self.policy_type == "εgreedy": + probs = np.zeros(self.actions) + probs += self.epsilon / self.actions + + action_vals = self.weights[:, state].sum(axis=1) + max_actions = np.where(action_vals == np.max(action_vals))[0] + probs[max_actions] += (1 - self.epsilon) / len(max_actions) + else: + raise ValueError(f"unknown policy type {self.policy_type}") + + return probs + + def _tiler_indices(self, state): + if self.include_bias: + return np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + + return self.tiler.get_indices(state) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + state = self._tiler_indices(state) + + δ = reward + δ -= self.weights[action, state].sum() + + # Update the trace + if self.use_trace: + if self.trace_type == "accumulating": + self.trace[action, state] += 1 + elif self.trace_type == "replacing": + self.trace[action, state] = 1 + else: + raise ValueError(f"unknown trace type {self.trace_type}") + + # Adjust δ if we are in an intra-episode timestep + episode_done = not done_mask + if not episode_done: + probs = self._get_probs(next_state) + next_state = self._tiler_indices(next_state) + + next_q = self.gamma * self.weights[:, next_state].sum(axis=1) + 𝔼_next_q = probs @ next_q + δ += 𝔼_next_q + + # Update the weights + if self.use_trace: + self.weights += (self.lr * δ * self.trace) + + # Decay the trace + self.trace *= (self.decay * self.gamma) + else: + self.weights[action, state] += (self.lr * δ) + + return + + def reset(self): + """ + Resets the agent between episodes + """ + self.trace = np.zeros_like(self.weights) + self.first_call = True + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + def get_parameters(self): + pass diff --git a/agent/linear/GaussianAC.py b/agent/linear/GaussianAC.py new file mode 100644 index 0000000..7f67652 --- /dev/null +++ b/agent/linear/GaussianAC.py @@ -0,0 +1,422 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from time import time +from PyFixedReps import TileCoder +from env.Bimodal import Bimodal1DEnv + + +class GaussianAC(BaseAgent): + """ + Class GaussianAC implements Linear-Gaussian Actor-Critic with eligibility + trace, as outlined in 
"Model-Free Reinforcement Learning with Continuous + Action in Practice", which can be found at: + + https://hal.inria.fr/hal-00764281/document + + The major difference is that this algorithm uses the discounted setting + instead of the average reward setting as used in the above paper. This + linear actor critic support multi-dimensional actions as well. + """ + def __init__(self, decay, actor_lr_scale, critic_lr, + gamma, accumulate_trace, action_space, bins, num_tilings, + env, use_critic_trace, use_actor_trace, scaled=False, + clip_stddev=1000, seed=None, trace_type="replacing"): + """ + Constructor + + Parameters + ---------- + decay : float + The eligibility decay rate, lambda + actor_lr : float + The learning rate for the actor + critic_lr : float + The learning rate for the critic + state_features : int + The size of the state feature vectors + gamma : float + The environmental discount factor + accumulate_trace : bool + Whether or not to accumulate the eligibility traces or not, which + may be desirable if the task is continuing. If it is, then the + eligibility trace vectors will be accumulated and not reset between + "episodes" when calling the reset() method. + scaled : bool, optional + Whether the actor learning rate should be scaled by sigma^2 for + learning stability, by default False + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + seed : int + The seed to use for the normal distribution sampler, by default + None. If set to None, uses the integer value of the Unix time. + """ + super().__init__() + self.batch = False + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self.random = np.random.default_rng(seed=int(seed)) + self.seed = seed + + # Save whether or not the task is continuing + self.accumulate_trace = accumulate_trace + + # Needed so that when evaluating offline, we don't explore + self.is_training = True + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_threshold = np.log(clip_stddev) + + # Tile Coder + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self.tiler = TileCoder(params) + state_features = self.tiler.features() + 1 + + # The weight parameters + self.action_dims = action_space.high.shape[0] + self.sigma_weights = np.zeros((self.action_dims, state_features)) + self.mu_weights = np.zeros((self.action_dims, state_features)) + self.actor_weights = np.zeros(state_features * 2) + self.critic_weights = np.zeros(state_features) + + # Set learning rates and other scaling factors + self.scaled = scaled + self.decay = decay + self.critic_lr = critic_lr / (num_tilings + 1) + self.actor_lr = actor_lr_scale * self.critic_lr + self.gamma = gamma + + # Eligibility traces + self.use_actor_trace = use_actor_trace + if trace_type not in ("replacing", "accumulating"): + raise ValueError("trace_type must be one of 'accumulating', " + + "'replacing'") + self.trace_type = trace_type + + if self.use_actor_trace: + self.mu_trace = np.zeros_like(self.mu_weights) + self.sigma_trace = np.zeros_like(self.sigma_weights) + + self.use_critic_trace = use_critic_trace + if self.use_critic_trace: + self.critic_trace = np.zeros(state_features) + + if isinstance(env.env, 
Bimodal1DEnv): + self.info = { + "actor": {"mean": [], "stddev": []}, + } + self.store_dist = True + else: + self.store_dist = False + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def get_mean(self, state): + """ + Gets the mean of the parameterized normal distribution + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the one-hot encoded state + feature vector + + Returns + ------- + float + The mean of the normal distribution + """ + return self.mu_weights[:, state].sum(axis=1) + + def get_stddev(self, state): + """ + Gets the standard deviation of the parameterized normal distribution + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the one-hot encoded state + feature vector + + Returns + ------- + float + The standard deviation of the normal distribution + """ + # Return un-clipped standard deviation if no clipping + if not self.clip_stddev: + return np.exp(self.sigma_weights[:, state].sum(axis=1)) + + # Clip the standard deviation to prevent numerical overflow + log_std = np.clip(self.sigma_weights[:, state].sum(axis=1), + -self.clip_threshold, self.clip_threshold) + return np.exp(log_std) + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The observation, not tile coded + + Returns + ------- + np.array of float + The action to take + """ + # state = np.concatenate( + # [ + # np.ones((1,), dtype=np.int32), + # self.tiler.encode(state), + # ] + # ) + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + mean = self.get_mean(state) + + # If in offline evaluation mode, return the mean action + if not self.is_training: + return np.array(mean) + + stddev = self.get_stddev(state) + + # Sample action from a normal distribution + action = self.random.normal(loc=mean, scale=stddev) + return action + + def get_actor_grad(self, state, action): + """ + Gets the gradient of the actor's parameters + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the one-hot encoded state + feature vector + action : np.array of float + The action taken + + Returns + ------- + np.array + The gradient vector of the actor's weights, in the form + [grad_mu_weights^T, grad_sigma_weights^T]^T + """ + std = self.get_stddev(state) + mean = self.get_mean(state) + + grad_mu = np.zeros_like(self.mu_weights) + grad_sigma = np.zeros_like(self.sigma_weights) + + if action.shape[0] != 1: + # Repeat state along rows to match number of action dims + n = action.shape[0] + state = np.expand_dims(state, 0) + state = state.repeat(n, axis=0) + + scale_mu = (1 / (std ** 2)) * (action - mean) + scale_sigma = ((((action - mean) / std)**2) - 1) + + # Reshape scales so we can use broadcasted multiplication + scale_mu = np.expand_dims(scale_mu, axis=1) + scale_sigma = np.expand_dims(scale_sigma, axis=1) + + # grad_mu = scale_mu * state + # grad_sigma = scale_sigma * state + + else: + scale_mu = (1 / (std ** 2)) * (action - mean) + scale_sigma = ((((action - mean) / std)**2) - 1) + + grad_mu[:, state] = scale_mu + grad_sigma[:, state] = scale_sigma + + return grad_mu, grad_sigma + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector, not tile coded + action : np.array of float or array_like of 
np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action, not tile coded + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + # state = np.concatenate( + # [ + # np.ones((1,), dtype=np.int32), + # self.tiler.encode(state) + # ] + # ) + # next_state = np.concatenate( + # [ + # np.ones((1,), dtype=np.int32), + # self.tiler.encode(next_state) + # ] + # ) + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + next_state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(next_state) + 1, + ] + ) + + # Calculate TD error + v = self.critic_weights[state].sum() + next_v = self.critic_weights[next_state].sum() + target = reward + self.gamma * next_v * done_mask + delta = target - v + + # Critic update + if self.use_critic_trace: + # Update critic eligibility trace + self.critic_trace *= (self.gamma * self.decay) + # self.critic_trace = (self.gamma * self.decay * + # self.critic_trace) + state + if self.trace_type == "accumulating": + self.critic_trace[state] += 1 + elif self.trace_type == "replacing": + self.critic_trace[state] = 1 + else: + raise ValueError("unkown trace type {self.trace_type}") + # Update critic + self.critic_weights += (self.critic_lr * delta * self.critic_trace) + else: + grad = np.zeros_like(self.critic_weights) + grad[state] = 1 + self.critic_weights += (self.critic_lr * delta * grad) + + # Actor update + mu_grad, sigma_grad = self.get_actor_grad(state, action) + if self.use_actor_trace: + # Update actor eligibility traces + self.mu_trace *= (self.gamma * self.decay) + self.sigma_trace *= (self.gamma * self.decay) + if self.trace_type == "accumulating": + self.mu_trace[:, state] += mu_grad + self.sigma_trace[:, state] += sigma_grad + else: + self.mu_trace[:, state] = mu_grad[:, state] + self.sigma_trace[:, state] = sigma_grad[:, state] + + # Update actor weights + lr = self.actor_lr + lr *= 1 if not self.scaled else (self.get_stddev(state) ** 2) + self.mu_weights += (lr * delta * self.mu_trace) + self.sigma_weights += (lr * delta * self.sigma_trace) + else: + lr = self.actor_lr + lr *= 1 if not self.scaled else (self.get_stddev(state) ** 2) + self.mu_weights += (lr * delta * mu_grad) + self.sigma_trace = (lr * delta * sigma_grad) + + # In order to be consistent across all children of BaseAgent, we + # return all transitions with the shape B x N, where N is the number + # of state, action, or reward dimensions and B is the batch size = 1 + reward = np.array([reward]) + + return np.expand_dims(state, axis=0), np.expand_dims(action, axis=0), \ + np.expand_dims(reward, axis=0), np.expand_dims(next_state, axis=0) + + def reset(self): + """ + Resets the agent between episodes + """ + if self.accumulate_trace: + return + if self.use_actor_trace: + self.mu_trace = np.zeros_like(self.mu_trace) + self.sigma_trace = np.zeros_like(self.sigma_trace) + if self.use_critic_trace: + self.critic_trace = np.zeros_like(self.critic_trace) + + def eval(self): + """ + Sets the agent into offline 
evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the GaussianAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to array_like + The agent's weights + """ + pass + + +if __name__ == "__main__": + a = GaussianAC(0.9, 0.1, 0.1, 0.5, 3, False) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 1]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([1, 2, 2]), 0.9) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 2]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([3, 1, 2]), 0.9) + print(a.actor_weights, a.critic_weights) diff --git a/agent/linear/Sarsa.py b/agent/linear/Sarsa.py new file mode 100644 index 0000000..649a6eb --- /dev/null +++ b/agent/linear/Sarsa.py @@ -0,0 +1,280 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from PyFixedReps import TileCoder +import time +import warnings +import inspect + + +class Sarsa(BaseAgent): + def __init__(self, decay, lr, gamma, epsilon, + action_space, bins, num_tilings, env, seed=None, + trace_type="replacing", policy_type="εgreedy", + include_bias=True): + super().__init__() + self.batch = False + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self.random = np.random.default_rng(seed=int(seed)) + self.seed = seed + + # Needed so that when evaluating offline, we don't explore + self.is_training = True + + # Tile Coder + self.include_bias = include_bias + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self.tiler = TileCoder(params) + state_features = self.tiler.features() + self.include_bias + + # The weight parameters + self.actions = action_space.n + self.weights = np.zeros((self.actions, state_features)) + + # Set learning rates and other scaling factors + if decay < 0.0: + raise ValueError("cannot have trace decay rate < 0") + self.decay = decay + self.lr = lr / (num_tilings + self.include_bias) + self.gamma = gamma + self.epsilon = epsilon + print(self.lr) + + if policy_type not in ("εgreedy", "softmax"): + raise ValueError("policy_type must be one of 'εgreedy', " + + "'softmax'") + self.policy_type = policy_type + + # Eligibility traces + if trace_type not in ("accumulating", "replacing"): + raise ValueError("trace_type must be one of 'accumulating', " + + "'replacing'") + self.use_trace = self.decay > 0.0 + if self.use_trace: + self.trace = np.zeros_like(self.weights) + self.trace_type = trace_type + + # Keep track of the states and actions 
used in the SARSA update for + # error checking + self.sarsa_state = None + self.sarsa_action = None + self.first_call = True + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + if self.first_call: + self.first_call = False + return self._sample_action(state) + if np.any(state != self.sarsa_state) and self.is_training: + warnings.warn("Warning: input state was not used as " + + "the next state in SARSA update to select the" + + "next action. Sampling a new action.") + return self._sample_action(state) + else: + return self.sarsa_action + + def _sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + int + The action to take + """ + state = self._tiler_indices(state) + action_vals = self.weights[:, state].sum(axis=1) + + if self.policy_type == "εgreedy": + return self._sample_epsilon_greedy(action_vals) + elif self.policy_type == "softmax": + return self._sample_softmax(action_vals) + else: + raise ValueError(f"unknown policy type {self.policy_type}") + + def _sample_epsilon_greedy(self, action_vals): + if self.epsilon != 0 and self.random.uniform() < self.epsilon: + return self.random.choice(self.actions) + else: + # Choose maximum action + max_actions = np.where(action_vals == np.max(action_vals))[0] + if len(max_actions) > 1: + return max_actions[self.random.choice(len(max_actions))] + else: + return max_actions[0] + + def _sample_softmax(self, action_vals): + action_vals = action_vals - np.max(action_vals) + if self.epsilon != 0: + # If epsilon is non-zero, use it to determine the stochasticity + # of the policy as the temperature parameter + action_vals /= self.epsilon + probs = np.exp(action_vals) + probs /= np.sum(probs) + return np.random.choice(self.actions, p=probs) + else: + # If epsilon is zero, then we are acting greedily + max_actions = np.where(action_vals == np.max(action_vals))[0] + if len(max_actions) > 1: + return max_actions[self.random.choice(len(max_actions))] + else: + return max_actions[0] + + def _tiler_indices(self, state): + """ + Returns the tile coded representation of state + + Parameters + ---------- + state : np.array + The state observation to tile code + + Returns + ------- + np.array + The tile coded representation of the input state + """ + if self.include_bias: + return np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self.tiler.get_indices(state) + 1, + ] + ) + + return self.tiler.get_indices(state) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + state = self._tiler_indices(state) + + δ = reward + δ -= self.weights[action, state].sum() + + # Update the trace + if self.use_trace: + if self.trace_type == "accumulating": + self.trace[action, state] += 1 + elif self.trace_type == "replacing": + self.trace[action, state] = 1 + else: + raise ValueError(f"unknown trace type {self.trace_type}") + + # Adjust δ if we are in an intra-episode timestep + episode_done = not done_mask + if not episode_done: + self.sarsa_action = next_action = self._sample_action(next_state) + self.sarsa_state = next_state + + next_state = self._tiler_indices(next_state) + + δ += (self.gamma * self.weights[next_action, next_state].sum()) + + # Update the weights + if self.use_trace: + self.weights += (self.lr * δ * self.trace) + + # Decay the trace + self.trace *= (self.decay * self.gamma) + else: + self.weights[action, state] += (self.lr * δ) + + return + + def reset(self): + """ + Resets the agent between episodes + """ + self.trace = np.zeros_like(self.weights) + self.first_call = True + self.sarasa_action = self.sarsa_state = None + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the ESarsa class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to array_like + The agent's weights + """ + pass diff --git a/agent/linear/SoftmaxAC.py b/agent/linear/SoftmaxAC.py new file mode 100644 index 0000000..4be00a9 --- /dev/null +++ b/agent/linear/SoftmaxAC.py @@ -0,0 +1,343 @@ +# Import modules +import numpy as np +from agent.baseAgent import BaseAgent +from PyFixedReps import TileCoder +import time +from scipy import special +import inspect + + +class SoftmaxAC(BaseAgent): + """ + Class SoftmaxAC implements a Linear-Softmax Actor-Critic with eligibility + traces. The algorithm works in the discounted setting, rather than in the + average reward setting and is similar to the algorithm outlined in the + Policy Gradient chapter in the RL Book. 
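+    When the temperature is zero, the policy is greedy with respect to the action preferences (ties broken uniformly at random) rather than a softmax over them.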
+ """ + def __init__(self, decay, actor_lr, critic_lr, gamma, + accumulate_trace, action_space, bins, num_tilings, env, + use_critic_trace, use_actor_trace, temperature, seed=None, + trace_type="replacing"): + """ + Constructor + + Parameters + ---------- + decay : float + The eligibility decay rate, lambda + actor_lr : float + The learning rate for the actor + critic_lr : float + The learning rate for the critic + state_features : int + The size of the state feature vectors + gamma : float + The environmental discount factor + accumulate_trace : bool + Whether or not to accumulate the eligibility traces or not, which + may be desirable if the task is continuing. If it is, then the + eligibility trace vectors will be accumulated and not reset between + "episodes" when calling the reset() method. + scaled : bool, optional + Whether the actor learning rate should be scaled by sigma^2 for + learning stability, by default False + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + seed : int + The seed to use for the normal distribution sampler, by default + None. If set to None, uses the integer value of the Unix time. + """ + super().__init__() + + # Set the agent's policy sampler + if seed is None: + seed = int(time()) + self._random = np.random.default_rng(seed=int(seed)) + self._seed = seed + + # Needed so that when evaluating offline, we don't explore + self._is_training = True + + # Tile Coder + input_ranges = list(zip(env.observation_space.low, + env.observation_space.high)) + dims = env.observation_space.shape[0] + params = { + "dims": dims, + "tiles": bins, + "tilings": num_tilings, + "input_ranges": input_ranges, + "scale_output": False, + } + self._tiler = TileCoder(params) + state_features = self._tiler.features() + 1 + + # The weight parameters + self._action_n = action_space.n + self._avail_actions = np.array(range(self._action_n)) + self._size = state_features + self._actor_weights = np.zeros((self._action_n, state_features)) + self._critic_weights = np.zeros(state_features) # State value critic + + # Set learning rates and other scaling factors + self._critic_α = critic_lr / (num_tilings + 1) + self._actor_α = actor_lr / (num_tilings + 1) + self._γ = gamma + if temperature < 0: + raise ValueError("cannot use temperature < 0") + self._τ = temperature + + # Eligibility traces + if trace_type not in ("accumulating", "replacing"): + raise ValueError("trace_type must be one of accumulating', " + + "'replacing'") + if decay < 0: + raise ValueError("cannot use decay < 0") + elif decay >= 1: + raise ValueError("cannot use decay >= 1") + elif decay == 0: + use_actor_trace = use_critic_trace = False + else: + self._λ = decay + + self._trace_type = trace_type + self.use_actor_trace = use_actor_trace + if self.use_actor_trace: + self._actor_trace = np.zeros((self._action_n, state_features)) + self.use_critic_trace = use_critic_trace + if self.use_critic_trace: + self._critic_trace = np.zeros(state_features) + + source = inspect.getsource(inspect.getmodule(inspect.currentframe())) + self.info = {"source": source} + + def _get_logits(self, state): + """ + Gets the logits of the policy in state + + Parameters + ---------- + state : np.array + The indices of the nonzero features in the tile coded state + representation + + Returns + ------- + np.array of float + The logits of each action + """ + if self._τ == 0: + raise ValueError("cannot compute logits when τ = 
0") + + logits = self._actor_weights[:, state].sum(axis=1) + logits -= np.max(logits) # For numerical stability + return logits / self._τ + + def _get_probs(self, state_ind): + if self._τ == 0: + q_values = self._actor_weights[:, state_ind].sum(axis=-1) + + max_value = np.max(q_values) + max_actions = np.where(q_values == max_value)[0] + + probs = np.zeros(self._action_n) + probs[max_actions] = 1 / len(max_actions) + return probs + + logits = self._get_logits(state_ind) + logits -= logits.max() # Subtract max because SciPy breaks things + pi = special.softmax(logits) + return pi + + def sample_action(self, state): + """ + Samples an action from the actor + + Parameters + ---------- + state : np.array + The state feature vector, not one hot encoded + + Returns + ------- + np.array of float + The action to take + """ + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self._tiler.get_indices(state) + 1, + ] + ) + probs = self._get_probs(state) + + # If in offline evaluation mode, return the action of maximum + # probability + if not self._is_training: + actions = np.where(probs == np.max(probs))[0] + if len(actions) == 1: + return actions[0] + else: + return self._random.choice(actions) + + return self._random.choice(self._action_n, p=probs) + + def _actor_grad(self, state, action): + """ + Returns the gradient of the actor's performance in `state` + evaluated at the action `action` + + Parameters + ---------- + state : np.ndarray + The state observation, not tile coded + action : int + The action to evaluate the gradient on + """ + π = self._get_probs(state) + π = np.reshape(π, (self._actor_weights.shape[0], 1)) + features = np.zeros_like(self._actor_weights) + features[action, state] = 1 + + grad = features + grad[:, state] -= π + return grad + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + Note: this parameter is not used; it is only kept so that the + interface BaseAgent is consistent and can be used for both + Soft Actor-Critic and Linear-Gaussian Actor-Critic + """ + state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self._tiler.get_indices(state) + 1, + ] + ) + next_state = np.concatenate( + [ + np.zeros((1,), dtype=np.int32), + self._tiler.get_indices(next_state) + 1, + ] + ) + + # Calculate TD error + target = reward + done_mask * self._γ * \ + self._critic_weights[next_state].sum() + estimate = self._critic_weights[state].sum() + delta = target - estimate + + # Critic update + if self.use_critic_trace: + # Update critic eligibility trace + self._critic_trace *= (self._γ * self._λ) + if self._trace_type == "accumulating": + self._critic_trace[state] += 1 + elif self._trace_type == "replacing": + self._critic_trace[state] = 1 + else: + raise ValueError(f"unknown trace type {self._trace_type}") + + # Update critic + self._critic_weights += (self._critic_α * delta * + self._critic_trace) + else: + grad = np.zeros_like(self._critic_weights) + grad[state] = 1 + self._critic_weights += (self._critic_α * delta * grad) + + # Actor update + actor_grad = self._actor_grad(state, action) + if self.use_actor_trace: + # Update actor eligibility traces + self._actor_trace *= (self._γ * self._λ) + self._actor_trace += actor_grad + + # Update actor weights + self._actor_weights += (self._actor_α * delta * self._actor_trace) + else: + self._actor_weights += (self._actor_α * delta * actor_grad) + + # In order to be consistent across all children of BaseAgent, we + # return all transitions with the shape B x N, where N is the number + # of state, action, or reward dimensions and B is the batch size = 1 + reward = np.array([reward]) + return np.expand_dims(state, axis=0), np.expand_dims(action, axis=0), \ + np.expand_dims(reward, axis=0), np.expand_dims(next_state, axis=0) + + def reset(self): + """ + Resets the agent between episodes + """ + if self.use_actor_trace: + self._actor_trace = np.zeros_like(self._actor_trace) + if self.use_critic_trace: + self._critic_trace = np.zeros(self._size) + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self._is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self._is_training = True + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the SoftmaxAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
+ + Returns + ------- + dict of str to array_like + The agent's weights + """ + pass + + +if __name__ == "__main__": + a = SoftmaxAC(0.9, 0.1, 0.1, 0.5, 3, False) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 1]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([1, 2, 2]), 0.9) + print(a.actor_weights, a.critic_weights) + state = np.array([1, 2, 2]) + action = a.sample_action(state) + a.update(state, action, 1, np.array([3, 1, 2]), 0.9) + print(a.actor_weights, a.critic_weights) diff --git a/agent/nonlinear/FKL.py b/agent/nonlinear/FKL.py new file mode 100644 index 0000000..ce3c6e1 --- /dev/null +++ b/agent/nonlinear/FKL.py @@ -0,0 +1,399 @@ +#!/usr/bin/env python3 + +# Import modules +import torch +import time +from gym.spaces import Box, Discrete +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy_utils import GaussianPolicy, SoftmaxPolicy +from agent.nonlinear.value_function_utils import QNetwork +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class FKL(BaseAgent): + """ + Class FKL implements a vanilla-style actor-critic algorithm, minimizing + the FKL between the learned policy and the Boltzmann distribution over + action values. This is in contrast to "regular" actor-critics (such as SAC + and VAC in this codebase) which minimize an RKL between these values. + + This algorithm also learns a soft action value function, where the entropy + regularization is determined by `alpha`. + + FKL works only with continuous action spaces and uses MLP function + approximators. + + See https://arxiv.org/abs/2107.08285 for more information on this + algorithm. This implementation is the same as the FKL implementation from + this paper. + """ + def __init__(self, num_inputs, action_space, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, + num_samples, actor_hidden_dim, critic_hidden_dim, + replay_capacity, seed, batch_size, betas, env, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + cuda : bool, optional + Whether or not cuda should be used for training, by default False. 
+ Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.discrete_action = isinstance(action_space, Discrete) + self.state_dims = num_inputs + self.num_samples = num_samples + assert num_samples >= 2 + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + if isinstance(action_space, Box): + self.action_dims = len(action_space.high) + + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + action_space.shape[0], self.device) + elif isinstance(action_space, Discrete): + self.action_dims = 1 + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + # Create the critic Q function + if isinstance(action_space, Box): + action_shape = action_space.shape[0] + elif isinstance(action_space, Discrete): + action_shape = 1 + + self.critic = QNetwork(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = QNetwork(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + if self.policy_type == "gaussian": + + self.policy = GaussianPolicy(num_inputs, action_space.shape[0], + actor_hidden_dim, activation, + action_space, clip_stddev, init).to( + self.device) + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + _, _, action = self.policy.sample(state) + + act = action.detach().cpu().numpy()[0] + return act + + def sample_action_(self, state, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not 
affected by running + this function. + """ + if len(state.shape) > 1 or state.shape[0] > 1: + raise ValueError("sample_action_ takes a single state") + with torch.no_grad(): + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + mean, log_std = self.policy.forward(state) + + if not self.is_training: + return mean.detach().cpu().numpy()[0] + + mean = mean.detach().cpu().numpy()[0] + std = np.exp(log_std.detach().cpu().numpy()[0]) + return self.rng.normal(mean, std, size=size) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + if self.discrete_action: + action = np.array([action]) + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, next_state_log_pi, _ = \ + self.policy.sample(next_state_batch) + q_next = self.critic_target(next_state_batch, next_state_action) + q_next -= (self.alpha * next_state_log_pi) + + q_target = reward_batch + mask_batch * self.gamma * q_next + + q_prediction = self.critic(state_batch, action_batch) + + # Calculate the losses on each critic + q_loss = F.mse_loss(q_prediction, q_target) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + sampled_actions, logprob, _ = self.policy.sample(state_batch, + self.num_samples) + if self.num_samples == 1: + raise ValueError("num_samples should be greater than 1") + sampled_actions = torch.permute(sampled_actions, (1, 0, 2)) + + # Calculate the importance sampling ratio + sampled_actions = torch.reshape(sampled_actions, + [-1, self.action_dims]) + stacked_s_batch = torch.repeat_interleave(state_batch, + self.num_samples, + dim=0) + stacked_s_batch = torch.reshape(stacked_s_batch, + [-1, self.state_dims]) + + # Calculate the weighted importance sampling ratio + # Right now, we follow the FKL/RKL paper equation (13) to compute the + # weighted importance sampling ratio, where: + # + # ρ_i = BQ(a_i | s) / π_θ(a_i | s) ∝ exp(Q(s, a_i)τ⁻¹) / π(a_i | s) + # ρ̂_i = ρ_i / ∑(ρ_j) + # + # We could compute a more numerically stable weighted importance + # sampling ratio if needed (but the implementation is very + # complicated): + # + # ρ̂ = π(a_i | s) [∑_{i≠j} ([h(s, a_j)/h(s, a_i)] * π(a_j | s)⁻¹) + 1] + # h(s, a_j, a_i) = exp[(Q(s, a_j) - M)τ⁻¹] / exp[(Q(s, a_i) - M)τ⁻¹] + # M = M(a_j, a_i) = max(Q(s, a_j), Q(s, a_i)) + with torch.no_grad(): + IS_q_values = self.critic(stacked_s_batch, + sampled_actions) + IS_q_values = torch.reshape(IS_q_values, [self.batch_size, + self.num_samples]) + + IS = 
IS_q_values / self.alpha + IS_max = torch.amax(IS, dim=1).unsqueeze(dim=-1) + IS -= IS_max + IS = IS.exp() + Z = torch.sum(IS, dim=1).unsqueeze(-1) + IS /= Z + prob = logprob.exp().squeeze(dim=-1).T + IS /= prob + + weight = torch.sum(IS, dim=1).unsqueeze(dim=-1) + WIS = IS / weight + + # Calculate the policy loss + logprob = logprob.squeeze() + policy_loss = WIS * logprob.T + policy_loss = -policy_loss.mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Update target network + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass diff --git a/agent/nonlinear/SAC.py b/agent/nonlinear/SAC.py new file mode 100644 index 0000000..e24c645 --- /dev/null +++ b/agent/nonlinear/SAC.py @@ -0,0 +1,542 @@ +# Import modules +import os +import torch +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy.MLP import SquashedGaussian +from agent.nonlinear.value_function.MLP import DoubleQ, Q +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class SAC(BaseAgent): + """ + SAC implements the Soft Actor-Critic agent found in the paper + https://arxiv.org/pdf/1812.05905.pdf. 
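+
+    This implementation optionally supports reparameterized or
+    likelihood-ratio policy gradients, hard or soft (entropy-regularized)
+    action values, and single or double Q critics, controlled by the
+    constructor flags reparameterized, soft_q, and double_q.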
+ + SAC works only with continuous action spaces and uses MLP function + approximators. + """ + def __init__(self, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, alpha_lr, + actor_hidden_dim, critic_hidden_dim, replay_capacity, seed, + batch_size, betas, env, reparameterized=True, soft_q=True, + double_q=True, automatic_entropy_tuning=False, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + alpha_lr : float + The learning rate for the entropy parameter, if using an automatic + entropy tuning algorithm (see automatic_entropy_tuning) parameter + below + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + automatic_entropy_tuning : bool, optional + Whether the agent should automatically tune its entropy + hyperparmeter alpha, by default False + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + soft_q : bool + Whether or not to learn soft Q functions, by default True. The + original SAC uses soft Q functions since we learn an + entropy-regularized policy. When learning an entropy regularized + policy, guaranteed policy improvement (in the ideal case) only + exists with respect to soft action values. + reparameterized: bool + Whether to use the reparameterization trick to learn the policy or + to use the log-likelihood trick. The original SAC uses the + reparameterization trick. + double_q : bool + Whether or not to use two Q value functions or not. The original + SAC uses two Q value functions. 
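+        betas : tuple of float
+            The (beta1, beta2) coefficients passed to the Adam optimizers
+            of both the actor and the critic
+        env : gym.Environment
+            The environment to train on; used here only to determine the
+            observation and action spaces
+        activation : str
+            The activation function to use in the actor and critic
+            networks, by default "relu"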
+ + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + action_space = env.action_space + obs_space = env.observation_space + obs_dim = obs_space.shape + # Ensure we are working with vector observations + if len(obs_dim) != 1: + raise ValueError( + f"""SAC works only with vector observations, but got + observation with shape {obs_dim}.""" + ) + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self._torch_rng = torch.manual_seed(seed) + self._rng = np.random.default_rng(seed) + + # Random hypers and fields + self._is_training = True # Whether in training or evaluation mode + self._gamma = gamma # Discount factor + self._tau = tau # Polyak averaging constant for target networks + self._alpha = alpha # Entropy scale + self._reparameterized = reparameterized # Whether to use reparam trick + self._soft_q = soft_q # Whether to use soft Q functions or nor + self._double_q = double_q # Whether or not to use a double Q critic + + self._device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + # Experience replay buffer + self._batch_size = batch_size + self._replay = ExperienceReplay(replay_capacity, seed, obs_space.shape, + action_space.shape[0], self._device) + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self._target_update_interval = target_update_interval + self._update_number = 0 + + # Automatic entropy tuning + self._automatic_entropy_tuning = automatic_entropy_tuning + assert not self._automatic_entropy_tuning + + # Set up the critic and target critic + self._init_critics( + obs_space, + action_space, + critic_hidden_dim, + init, + activation, + critic_lr, + betas, + ) + + # Set up the policy + self._policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + self._init_policy( + obs_space, + action_space, + actor_hidden_dim, + init, + activation, + actor_lr, + betas, + clip_stddev, + ) + + # Set up auto entropy tuning + if self._automatic_entropy_tuning is True: + self._target_entropy = -torch.prod( + torch.Tensor(action_space.shape).to(self._device) + ).item() + self._log_alpha = torch.zeros( + 1, + requires_grad=True, + device=self._device, + ) + self._alpha_optim = Adam([self._log_alpha], lr=alpha_lr) + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self._device).unsqueeze(0) + if self._is_training: + action, _, _, _ = self._policy.rsample(state) + else: + _, _, action, _ = self._policy.rsample(state) + + return action.detach().cpu().numpy()[0] # size (1, action_dims) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + 
next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + # Keep transition in replay buffer + self._replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self._replay.sample(batch_size=self._batch_size) + + self._update_critic(state_batch, action_batch, reward_batch, + next_state_batch, mask_batch) + + self._update_actor(state_batch, action_batch, reward_batch, + next_state_batch, mask_batch) + + def _update_actor(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the actor given a batch of transitions sampled from a replay + buffer. + """ + # Calculate the actor loss + if self._reparameterized: + # Reparameterization trick + pi, log_pi, _, _ = self._policy.rsample(state_batch) + q = self._get_q(state_batch, pi) + + policy_loss = ((self._alpha * log_pi) - q).mean() + + else: + # Log likelihood trick + with torch.no_grad(): + # Context manager ensures that we don't backprop through the q + # function when minimizing the policy loss + pi, log_pi, _, x_t = self._policy.sample(state_batch) + q = self._get_q(state_batch, pi) + + # Compute the policy loss, grad_log_pi will be the only + # differentiated value + grad_log_pi = self._policy.log_prob(state_batch, x_t) + policy_loss = grad_log_pi * (self._alpha * log_pi - q) + policy_loss = policy_loss.mean() + + # Update the actor + self._policy_optim.zero_grad() + policy_loss.backward() + self._policy_optim.step() + + # Tune the entropy if appropriate + if self._automatic_entropy_tuning: + alpha_loss = -(self._log_alpha * + (log_pi + self._target_entropy).detach()).mean() + + self._alpha_optim.zero_grad() + alpha_loss.backward() + self._alpha_optim.step() + + self._alpha = self._log_alpha.exp() + + def _update_critic(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the critic(s) given a batch of transitions sampled from a replay + buffer. + """ + if self._double_q: + self._update_double_critic( + state_batch, + action_batch, + reward_batch, + next_state_batch, + mask_batch, + ) + else: + self._update_single_critic( + state_batch, + action_batch, + reward_batch, + next_state_batch, + mask_batch, + ) + + def _update_single_critic(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the critic using a batch of transitions when using a single Q + critic. 
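+
+        The target is a SARSA-style bootstrap from the target critic,
+        with an entropy bonus subtracted when soft Q functions are used.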
+ """ + if self._double_q: + raise ValueError("cannot call _update_single_critic when using " + + "a double Q critic") + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + # Sample an action in the next state for the SARSA update + next_state_action, next_state_log_pi, _, _ = \ + self._policy.sample(next_state_batch) + + # Calculate the Q value of the next action in the next state + q_next = self._critic_target(next_state_batch, next_state_action) + if self._soft_q: + q_next -= self._alpha * next_state_log_pi + + # Calculate the target for the SARSA update + q_target = reward_batch + mask_batch * self._gamma * q_next + + # Calculate the Q value of each action in each respective state + q = self._critic(state_batch, action_batch) + + # Calculate the loss between the target and estimate Q values + q_loss = F.mse_loss(q, q_target) + + # Update the critic + self._critic_optim.zero_grad() + q_loss.backward() + self._critic_optim.step() + + # Increment the running total of updates and update the critic target + # if needed + self._update_number += 1 + if self._update_number % self._target_update_interval == 0: + self._update_number = 0 + nn_utils.soft_update(self._critic_target, self._critic, self._tau) + + def _update_double_critic(self, state_batch, action_batch, reward_batch, + next_state_batch, mask_batch): + """ + Update the critic using a batch of transitions when using a double Q + critic. + """ + + if not self._double_q: + raise ValueError("cannot call _update_single_critic when using " + + "a double Q critic") + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + # Sample an action in the next state for the SARSA update + next_state_action, next_state_log_pi, _, _ = \ + self._policy.sample(next_state_batch) + + # Calculate the action values for the next state + next_q1, next_q2 = self._critic_target(next_state_batch, + next_state_action) + + # Double Q: target uses the minimum of the two computed action + # values + min_next_q = torch.min(next_q1, next_q2) + + # If using soft action value functions, then adjust the target + if self._soft_q: + min_next_q -= self._alpha * next_state_log_pi + + # Calculate the target for the action value function update + q_target = reward_batch + mask_batch * self._gamma * min_next_q + + # Calculate the two Q values of each action in each respective state + q1, q2 = self._critic(state_batch, action_batch) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q1_loss = F.mse_loss(q1, q_target) + + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q2_loss = F.mse_loss(q2, q_target) + q_loss = q1_loss + q2_loss + + # Update the critic + self._critic_optim.zero_grad() + q_loss.backward() + self._critic_optim.step() + + # Increment the running total of updates and update the critic target + # if needed + self._update_number += 1 + if self._update_number % self._target_update_interval == 0: + self._update_number = 0 + nn_utils.soft_update(self._critic_target, self._critic, self._tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self._is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + 
self._is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass + + def _init_critics(self, obs_space, action_space, critic_hidden_dim, init, + activation, critic_lr, betas): + """ + Initializes the critic(s) + """ + num_inputs = obs_space.shape[0] + if self._double_q: + critic_type = DoubleQ + else: + critic_type = Q + + self._critic = critic_type(num_inputs, action_space.shape[0], + critic_hidden_dim, init, + activation).to(device=self._device) + self._critic_optim = Adam(self._critic.parameters(), lr=critic_lr, + betas=betas) + + self._critic_target = critic_type(num_inputs, action_space.shape[0], + critic_hidden_dim, init, + activation).to(self._device) + + nn_utils.hard_update(self._critic_target, self._critic) + + def _init_policy(self, obs_space, action_space, actor_hidden_dim, init, + activation, actor_lr, betas, clip_stddev): + """ + Initializes the policy + """ + num_inputs = obs_space.shape[0] + if self._policy_type == "squashedgaussian": + self._policy = SquashedGaussian(num_inputs, action_space.shape[0], + actor_hidden_dim, activation, + action_space, clip_stddev, + init).to(self._device) + self._policy_optim = Adam(self._policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError(f"policy {self._policy_type} unknown") + + def _get_q(self, state_batch, action_batch): + """ + Gets the Q values for `action_batch` actions in `state_batch` states + from the critic, rather than the target critic. + + Parameters + ---------- + state_batch : torch.Tensor + The batch of states to calculate the action values in. Of the form + (batch_size, state_dims). + action_batch : torch.Tensor + The batch of actions to calculate the action values of in each + state. Of the form (batch_size, action_dims). 
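+
+        Returns
+        -------
+        torch.Tensor
+            The Q-value of each state-action pair in the batch, taken as
+            the minimum of the two critic outputs when a double Q critic
+            is used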
+ """ + if self._double_q: + q1, q2 = self._critic(state_batch, action_batch) + q = torch.min(q1, q2) + else: + q = self._critic(state_batch, action_batch) + + return q diff --git a/agent/nonlinear/SACDiscrete.py b/agent/nonlinear/SACDiscrete.py new file mode 100644 index 0000000..273016a --- /dev/null +++ b/agent/nonlinear/SACDiscrete.py @@ -0,0 +1,403 @@ +# Import modules +import os +from gym.spaces import Box +import torch +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy.MLP import Softmax +from agent.nonlinear.value_function.MLP import DoubleQ +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class SACDiscrete(BaseAgent): + """ + SACDiscrete implements a discrete-action Soft Actor-Critic agent with MLP + function approximation. + + SACDiscrete works only with discrete action spaces. + """ + def __init__(self, env, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, alpha_lr, + actor_hidden_dim, critic_hidden_dim, replay_capacity, seed, + batch_size, betas, automatic_entropy_tuning=False, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + env : gym.Environment + The environment to run on + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + alpha_lr : float + The learning rate for the entropy parameter, if using an automatic + entropy tuning algorithm (see automatic_entropy_tuning) parameter + below + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + automatic_entropy_tuning : bool, optional + Whether the agent should automatically tune its entropy + hyperparmeter alpha, by default False + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. 
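+        betas : tuple of float
+            The (beta1, beta2) coefficients passed to the Adam optimizers
+            of both the actor and the critic
+        activation : str
+            The activation function to use in the actor and critic
+            networks, by default "relu"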
+ + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + action_space = env.action_space + obs_space = env.observation_space + if isinstance(action_space, Box): + raise ValueError("SACDiscrete can only be used with " + + "discrete actions") + + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + # Keep a replay buffer + action_shape = 1 + obs_dim = obs_space.shape + self.replay = ExperienceReplay(replay_capacity, seed, obs_dim, + action_shape, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + self.automatic_entropy_tuning = automatic_entropy_tuning + assert not self.automatic_entropy_tuning + + # Ensure we are working with vector observations + if len(obs_dim) != 1: + raise ValueError( + f"""SACDiscrete works only with vector + observations, but got observation with shape + {obs_dim}.""" + ) + + num_inputs = obs_dim[0] + self.critic = DoubleQ(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = DoubleQ(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + if self.policy_type == "softmax": + # Target Entropy = −dim(A) + # (e.g. 
, -6 for HalfCheetah-v2) as given in the paper + if self.automatic_entropy_tuning: + raise ValueError("cannot use auto entropy tuning with" + + " discrete actions") + + self.num_actions = action_space.n + self.policy = Softmax( + num_inputs, self.num_actions, actor_hidden_dim, activation, + init + ).to(self.device) + + actor_lr = actor_lr_scale * critic_lr + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError(f"policy type {policy.lower()} not " + + "available") + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + raise ValueError("cannot sample actions in eval mode yet") + + act = action.detach().cpu().numpy()[0] + return int(act[0]) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + # Adjust action to ensure it can be sent to the experience replay + # buffer properly + action = np.array([action]) + + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, next_state_log_pi, _ = \ + self.policy.sample(next_state_batch) + + qf1_next_target, qf2_next_target = self.critic_target( + next_state_batch, next_state_action) + + min_qf_next_target = torch.min(qf1_next_target, qf2_next_target) \ + - self.alpha * next_state_log_pi + next_q_value = reward_batch + mask_batch * self.gamma * \ + (min_qf_next_target) + + # Two Q-functions to reduce positive bias in policy improvement + qf1, qf2 = self.critic(state_batch, action_batch) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + qf1_loss = F.mse_loss(qf1, next_q_value) + + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + qf2_loss = F.mse_loss(qf2, next_q_value) + qf_loss = qf1_loss + qf2_loss + + # Update the critic + self.critic_optim.zero_grad() + qf_loss.backward() + self.critic_optim.step() + + # Calculate the actor loss using Eqn(5) in FKL/RKL paper + # Repeat the state for each action + state_batch = state_batch.repeat_interleave(self.num_actions, dim=0) + actions = torch.tensor([n for n in range(self.num_actions)]) + actions = actions.repeat(self.batch_size) + actions = actions.unsqueeze(-1) + + qf1_actions, qf2_actions = 
self.critic(state_batch, actions) + min_qf_actions = torch.min(qf1_actions, qf2_actions) + + log_prob = self.policy.log_prob(state_batch, actions) + prob = log_prob.exp() + policy_loss = prob * (min_qf_actions - log_prob * self.alpha) + policy_loss = policy_loss.reshape([self.batch_size, self.num_actions]) + policy_loss = -policy_loss.sum(dim=1).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Tune the entropy if appropriate + if self.automatic_entropy_tuning: + print("warning: should not use auto entropy in these experiments") + alpha_loss = -(self.log_alpha * + (log_pi + self.target_entropy).detach()).mean() + + self.alpha_optim.zero_grad() + alpha_loss.backward() + self.alpha_optim.step() + + self.alpha = self.log_alpha.exp() + + # Increment the running total of updates and update the critic target + # if needed + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ +# if not os.path.exists('models/'): +# os.makedirs('models/') +# +# if actor_path is None: +# actor_path = "models/sac_actor_{}_{}".format(env_name, suffix) +# if critic_path is None: +# critic_path = "models/sac_critic_{}_{}".format(env_name, suffix) +# print('Saving models to {} and {}'.format(actor_path, critic_path)) +# torch.save(self.policy.state_dict(), actor_path) +# torch.save(self.critic.state_dict(), critic_path) + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
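+
+        Note: this is currently not implemented for this agent and the
+        method returns None; the commented-out code in the body indicates
+        the values that would be returned.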
+ + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ +# parameters = {} +# parameters["actor_weights"] = self.policy.state_dict() +# parameters["actor_optimizer"] = self.policy_optim.state_dict() +# parameters["critic_weights"] = self.critic.state_dict() +# parameters["critic_optimizer"] = self.critic_optim.state_dict() +# parameters["critic_target"] = self.critic_target.state_dict() +# parameters["entropy"] = self.alpha +# +# if self.automatic_entropy_tuning: +# parameters["log_entropy"] = self.log_alpha +# parameters["entropy_optimizer"] = self.alpha_optim.state_dict() +# parameters["target_entropy"] = self.target_entropy +# +# return parameters + + +if __name__ == "__main__": + import gym + a = gym.make("MountainCarContinuous-v0") + actions = a.action_space + s = SAC(num_inputs=5, action_space=actions, gamma=0.9, tau=0.8, + alpha=0.2, policy="Gaussian", target_update_interval=10, + critic_lr=0.01, actor_lr=0.01, alpha_lr=0.01, actor_hidden_dim=200, + critic_hidden_dim=200, replay_capacity=50, seed=0, batch_size=10, + automatic_entropy_tuning=False, cuda=False) diff --git a/agent/nonlinear/SACDiscreteCNN.py b/agent/nonlinear/SACDiscreteCNN.py new file mode 100644 index 0000000..b0cd16e --- /dev/null +++ b/agent/nonlinear/SACDiscreteCNN.py @@ -0,0 +1,402 @@ +# Import modules +import os +from gym.spaces import Box +import torch +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy.CNN import Softmax +from agent.nonlinear.value_function.CNN import DoubleDiscreteQ as Q +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class SACDiscrete(BaseAgent): + """ + SACDiscrete implements a discrete-action Soft Actor-Critic agent with CNN + function approximation. + + SACDiscrete works only with discrete action spaces. + """ + def __init__(self, gamma, tau, alpha, policy, env, + target_update_interval, critic_lr, actor_lr_scale, alpha_lr, + hidden_dim, kernel_sizes, channels, replay_capacity, seed, + batch_size, betas, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. 
+ policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + alpha_lr : float + The learning rate for the entropy parameter, if using an automatic + entropy tuning algorithm (see automatic_entropy_tuning) parameter + below + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + automatic_entropy_tuning : bool, optional + Whether the agent should automatically tune its entropy + hyperparmeter alpha, by default False + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + self.step = 0 + + # Do some random error checking + action_space = env.action_space + obs_space = env.observation_space + + if isinstance(action_space, Box): + raise ValueError("SACDiscrete can only be used with " + + "discrete actions") + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + super().__init__() + self.batch = True + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. 
PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + # Keep a replay buffer + obs_dim = obs_space.shape + self.replay = ExperienceReplay(replay_capacity, seed, obs_dim, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + self.num_actions = action_space.n + self.critic = Q(obs_dim, self.num_actions, channels, kernel_sizes, + hidden_dim, init, activation).to(device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = Q(obs_dim, self.num_actions, channels, + kernel_sizes, hidden_dim, init, + activation).to(self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + if self.policy_type == "softmax": + self.policy = Softmax(obs_dim, channels, kernel_sizes, hidden_dim, + init, activation, + action_space).to(self.device) + + actor_lr = actor_lr_scale * critic_lr + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + else: + raise NotImplementedError(f"policy type {policy.lower()} not " + + "available") + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + raise ValueError("cannot sample actions in eval mode yet") + + act = action.detach().cpu().numpy()[0] + return int(act[0]) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + """ + self.step += 1 + + # Adjust action to ensure it can be sent to the experience replay + # buffer properly + action = np.array([action]) + + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + if self.step % 4 != 0: + return + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # For rewards, actions, and masks, we know they are scalars, so + # squeeze the final dimension + reward_batch = reward_batch.squeeze() + mask_batch = mask_batch.squeeze() + action_batch = action_batch.squeeze().long() + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters when computing the target for + # the update + with torch.no_grad(): + next_state_action, next_state_log_pi, _ = \ + self.policy.sample(next_state_batch, log_prob=True) + next_state_log_pi = next_state_log_pi.squeeze(-1) + + next_q1, next_q2 = self.critic_target(next_state_batch) + + next_q1 = next_q1[np.arange(self.batch_size), + next_state_action.squeeze()] + next_q2 = next_q2[np.arange(self.batch_size), + next_state_action.squeeze()] + + min_soft_q = torch.min(next_q1, next_q2) \ + - self.alpha * next_state_log_pi + target = reward_batch + mask_batch * self.gamma * min_soft_q + + # Get the value of the current state + q1, q2 = self.critic(state_batch) + q1 = q1[np.arange(self.batch_size), action_batch] + q2 = q2[np.arange(self.batch_size), action_batch] + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q1_loss = F.mse_loss(q1, target) + + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q2_loss = F.mse_loss(q2, target) + q_loss = q1_loss + q2_loss + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Calculate the actor loss using Eqn(5) in FKL/RKL paper + # Repeat the state for each action + # state_batch = state_batch.repeat_interleave(self.num_actions, dim=0) + # actions = torch.tensor([n for n in range(self.num_actions)]) + # actions = actions.repeat(self.batch_size) + # actions = actions.long() + + q1, q2 = self.critic(state_batch) + q1 = q1.flatten() + q2 = q2.flatten() + # q1 = q1[np.arange(state_batch.shape[0]), actions] + # q2 = q2[np.arange(state_batch.shape[0]), actions] + min_q = torch.min(q1, q2) + + log_prob = self.policy.all_log_prob(state_batch).squeeze().flatten() + prob = log_prob.exp() + policy_loss = prob * (min_q - log_prob * self.alpha) + policy_loss = policy_loss.reshape([self.batch_size, self.num_actions]) + policy_loss = -policy_loss.sum(dim=1).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Increment the running total of updates and update the critic target + # if needed + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.policy.eval() + self.critic.eval() + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.policy.train() + self.critic.train() + 
self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ +# if not os.path.exists('models/'): +# os.makedirs('models/') +# +# if actor_path is None: +# actor_path = "models/sac_actor_{}_{}".format(env_name, suffix) +# if critic_path is None: +# critic_path = "models/sac_critic_{}_{}".format(env_name, suffix) +# print('Saving models to {} and {}'.format(actor_path, critic_path)) +# torch.save(self.policy.state_dict(), actor_path) +# torch.save(self.critic.state_dict(), critic_path) + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. 
+ + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ +# parameters = {} +# parameters["actor_weights"] = self.policy.state_dict() +# parameters["actor_optimizer"] = self.policy_optim.state_dict() +# parameters["critic_weights"] = self.critic.state_dict() +# parameters["critic_optimizer"] = self.critic_optim.state_dict() +# parameters["critic_target"] = self.critic_target.state_dict() +# parameters["entropy"] = self.alpha +# +# if self.automatic_entropy_tuning: +# parameters["log_entropy"] = self.log_alpha +# parameters["entropy_optimizer"] = self.alpha_optim.state_dict() +# parameters["target_entropy"] = self.target_entropy +# +# return parameters + + +if __name__ == "__main__": + import gym + a = gym.make("MountainCarContinuous-v0") + actions = a.action_space + s = SAC(num_inputs=5, action_space=actions, gamma=0.9, tau=0.8, + alpha=0.2, policy="Gaussian", target_update_interval=10, + critic_lr=0.01, actor_lr=0.01, alpha_lr=0.01, actor_hidden_dim=200, + critic_hidden_dim=200, replay_capacity=50, seed=0, batch_size=10, + automatic_entropy_tuning=False, cuda=False) diff --git a/agent/nonlinear/VAC.py b/agent/nonlinear/VAC.py new file mode 100644 index 0000000..f27fcbe --- /dev/null +++ b/agent/nonlinear/VAC.py @@ -0,0 +1,443 @@ +#!/usr/bin/env python3 + +# Import modules +import torch +from gym.spaces import Box, Discrete +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy_utils import GaussianPolicy, SoftmaxPolicy +from agent.nonlinear.value_function_utils import QMLP +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class VAC(BaseAgent): + """ + VAC implements the Vanilla Actor-Critic agent. + + VAC works only with continuous actions and uses MLP function approximators. + """ + def __init__(self, num_inputs, action_space, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, + num_samples, actor_hidden_dim, critic_hidden_dim, + replay_capacity, seed, batch_size, betas, env, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. + policy : str + The type of policy, currently, only support "gaussian" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + cuda : bool, optional + Whether or not cuda should be used for training, by default False. 
+ Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.discrete_action = isinstance(action_space, Discrete) + self.state_dims = num_inputs + self.num_samples = num_samples - 1 + assert num_samples >= 2 + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + if isinstance(action_space, Box): + self.action_dims = action_space.high.shape[0] + + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + action_space.shape[0], self.device) + elif isinstance(action_space, Discrete): + self.action_dims = 1 + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + # Create the critic Q function + if isinstance(action_space, Box): + action_shape = action_space.shape[0] + elif isinstance(action_space, Discrete): + action_shape = 1 + + self.critic = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + if self.policy_type == "gaussian": + + self.policy = GaussianPolicy(num_inputs, action_space.shape[0], + actor_hidden_dim, activation, + action_space, clip_stddev, init).to( + self.device) + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + # elif self.policy_type == "softmax": + # num_actions = action_space.n + # self.policy = SoftmaxPolicy(num_inputs, num_actions, + # actor_hidden_dim, activation, + # action_space, init).to(self.device) + # self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + # betas=betas) + + + else: + raise NotImplementedError + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ 
= self.policy.sample(state) + else: + _, _, action = self.policy.sample(state) + + act = action.detach().cpu().numpy()[0] + + if not self.discrete_action: + return act + else: + return int(act[0]) + + def sample_action_(self, state, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not affected by running + this function. + """ + if len(state.shape) > 1 or state.shape[0] > 1: + raise ValueError("sample_action_ takes a single state") + with torch.no_grad(): + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + mean, log_std = self.policy.forward(state) + + if not self.is_training: + return mean.detach().cpu().numpy()[0] + + mean = mean.detach().cpu().numpy()[0] + std = np.exp(log_std.detach().cpu().numpy()[0]) + return self.rng.normal(mean, std, size=size) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. max number of steps + reached) + """ + if self.discrete_action: + action = np.array([action]) + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, _, _ = \ + self.policy.sample(next_state_batch) + qf_next_value = self.critic_target(next_state_batch, + next_state_action) + + q_target = reward_batch + mask_batch * self.gamma * qf_next_value + + # Two Q-functions to reduce positive bias in policy improvement + q_prediction = self.critic(state_batch, action_batch) + # print(torch.cat([reward_batch, action_batch, mask_batch], dim=1)) + # print(q_prediction) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q_loss = F.mse_loss(q_prediction, q_target) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Sample action that the agent would take + pi, _, _ = self.policy.sample(state_batch) + + # Calculate the advantage + with torch.no_grad(): + q_pi = self.critic(state_batch, pi) + sampled_actions, _, _ = self.policy.sample(state_batch, + self.num_samples) + if self.num_samples == 1: + sampled_actions = sampled_actions.unsqueeze(1) + sampled_actions = torch.permute(sampled_actions, (1, 0, 2)) + + state_baseline = 0 + if self.num_samples > 2: + # Baseline computed with self.num_samples - 1 action + # value estimates + baseline_actions = sampled_actions[:, :-1] + baseline_actions = torch.reshape(baseline_actions, + [-1, self.action_dims]) + stacked_s_batch = torch.repeat_interleave(state_batch, + self.num_samples-1, + dim=0) + stacked_s_batch = 
torch.reshape(stacked_s_batch, + [-1, self.state_dims]) + + baseline_q_vals = self.critic(stacked_s_batch, + baseline_actions) + baseline_q_vals = torch.reshape(baseline_q_vals, + [self.batch_size, + self.num_samples-1]) + state_baseline = baseline_q_vals.mean(axis=1).unsqueeze(1) + advantage = q_pi - state_baseline + + # Estimate the entropy from a single sampled action in each state + entropy_actions = sampled_actions[:, -1] + entropy = -self.policy.log_prob(state_batch, entropy_actions) + + # Jπ = 𝔼st∼D,εt∼N[α * logπ(f(εt;st)|st) − Q(st,f(εt;st))] + policy_loss = self.policy.log_prob(state_batch, pi) * advantage + policy_loss = -(policy_loss + (self.alpha * entropy)).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Update target network + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def update_value_fn(self, state, action, reward, next_state, done_mask, + new_sample): + if new_sample: + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, _, _ = \ + self.policy.sample(next_state_batch) + + next_q = self.critic_target(next_state_batch, next_state_action) + target_q_value = reward_batch + mask_batch * self.gamma * next_q + + q_value = self.critic(state_batch, action_batch) + + # Calculate the loss on the critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q_loss = F.mse_loss(target_q_value, q_value) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Update target networks + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def sample_qs(self, num_q_samples): + """Get a number of samples of Q(s, a) for s in the replay buffer + and a according to current policy""" + # Sample a batch from memory + state_batch, _, _, _, _ = self.replay.sample(batch_size=num_q_samples) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + action_batch, _, _ = \ + self.policy.sample(state_batch) + + return self.critic(state_batch, action_batch).detach().\ + squeeze().numpy() + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. 
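+
+        At the moment this method is a stub (its body is simply `pass`), so
+        no files are written when it is called.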
+ + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. + + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass diff --git a/agent/nonlinear/VACDiscrete.py b/agent/nonlinear/VACDiscrete.py new file mode 100644 index 0000000..19efb32 --- /dev/null +++ b/agent/nonlinear/VACDiscrete.py @@ -0,0 +1,353 @@ +#!/usr/bin/env python3 + +# Import modules +import torch +import time +from gym.spaces import Box, Discrete +import numpy as np +import torch.nn.functional as F +from torch.optim import Adam +from agent.baseAgent import BaseAgent +import agent.nonlinear.nn_utils as nn_utils +from agent.nonlinear.policy_utils import GaussianPolicy, SoftmaxPolicy +from agent.nonlinear.value_function_utils import QMLP +from utils.experience_replay import TorchBuffer as ExperienceReplay + + +class VACDiscrete(BaseAgent): + """ + VACDiscrete implements the Vanilla Actor-Critic agent. + + VACDiscrete works only with discrete actions and uses MLP function + approximators. + """ + def __init__(self, num_inputs, action_space, gamma, tau, alpha, policy, + target_update_interval, critic_lr, actor_lr_scale, + actor_hidden_dim, critic_hidden_dim, + replay_capacity, seed, batch_size, betas, cuda=False, + clip_stddev=1000, init=None, activation="relu"): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of input features + action_space : gym.spaces.Space + The action space from the gym environment + gamma : float + The discount factor + tau : float + The weight of the weighted average, which performs the soft update + to the target critic network's parameters toward the critic + network's parameters, that is: target_parameters = + ((1 - τ) * target_parameters) + (τ * source_parameters) + alpha : float + The entropy regularization temperature. See equation (1) in paper. 
+ policy : str + The type of policy, currently, only support "softmax" + target_update_interval : int + The number of updates to perform before the target critic network + is updated toward the critic network + critic_lr : float + The critic learning rate + actor_lr : float + The actor learning rate + actor_hidden_dim : int + The number of hidden units in the actor's neural network + critic_hidden_dim : int + The number of hidden units in the critic's neural network + replay_capacity : int + The number of transitions stored in the replay buffer + seed : int + The random seed so that random samples of batches are repeatable + batch_size : int + The number of elements in a batch for the batch update + cuda : bool, optional + Whether or not cuda should be used for training, by default False. + Note that if True, cuda is only utilized if available. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + + Raises + ------ + ValueError + If the batch size is larger than the replay buffer + """ + super().__init__() + self.batch = True + + # Ensure batch size < replay capacity + if batch_size > replay_capacity: + raise ValueError("cannot have a batch larger than replay " + + "buffer capacity") + + # Set the seed for all random number generators, this includes + # everything used by PyTorch, including setting the initial weights + # of networks. PyTorch prefers seeds with many non-zero binary units + self.torch_rng = torch.manual_seed(seed) + self.rng = np.random.default_rng(seed) + + self.is_training = True + self.gamma = gamma + self.tau = tau + self.alpha = alpha + + self.discrete_action = isinstance(action_space, Discrete) + self.state_dims = num_inputs + + self.device = torch.device("cuda:0" if cuda and + torch.cuda.is_available() else "cpu") + + if isinstance(action_space, Box): + raise ValueError("VACDiscrete can only be used with " + + "discrete actions") + elif isinstance(action_space, Discrete): + self.action_dims = 1 + # Keep a replay buffer + self.replay = ExperienceReplay(replay_capacity, seed, num_inputs, + 1, self.device) + self.batch_size = batch_size + + # Set the interval between timesteps when the target network should be + # updated and keep a running total of update number + self.target_update_interval = target_update_interval + self.update_number = 0 + + # Create the critic Q function + if isinstance(action_space, Box): + action_shape = action_space.shape[0] + elif isinstance(action_space, Discrete): + action_shape = 1 + + self.critic = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + device=self.device) + self.critic_optim = Adam(self.critic.parameters(), lr=critic_lr, + betas=betas) + + self.critic_target = QMLP(num_inputs, action_shape, + critic_hidden_dim, init, activation).to( + self.device) + nn_utils.hard_update(self.critic_target, self.critic) + + self.policy_type = policy.lower() + actor_lr = actor_lr_scale * critic_lr + if self.policy_type == "softmax": + self.num_actions = action_space.n + self.policy = SoftmaxPolicy(num_inputs, self.num_actions, + actor_hidden_dim, activation, + action_space, init).to(self.device) + self.policy_optim = Adam(self.policy.parameters(), lr=actor_lr, + betas=betas) + + 
else: + raise NotImplementedError(f"policy type {policy} not implemented") + + def sample_action(self, state): + """ + Samples an action from the agent + + Parameters + ---------- + state : np.array + The state feature vector + + Returns + ------- + array_like of float + The action to take + """ + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + action, _, _ = self.policy.sample(state) + else: + _, _, action = self.policy.sample(state) + + act = action.detach().cpu().numpy()[0] + if not self.discrete_action: + return act + else: + return int(act[0]) + + def sample_action_(self, state, size): + """ + sample_action_ is like sample_action, except the rng for + action selection in the environment is not affected by running + this function. + """ + if len(state.shape) > 1 or state.shape[0] > 1: + raise ValueError("sample_action_ takes a single state") + with torch.no_grad(): + state = torch.FloatTensor(state).to(self.device).unsqueeze(0) + if self.is_training: + mean, log_std = self.policy.forward(state) + + if not self.is_training: + return mean.detach().cpu().numpy()[0] + + mean = mean.detach().cpu().numpy()[0] + std = np.exp(log_std.detach().cpu().numpy()[0]) + return self.rng.normal(mean, std, size=size) + + def update(self, state, action, reward, next_state, done_mask): + """ + Takes a single update step, which may be a number of offline + batch updates + + Parameters + ---------- + state : np.array or array_like of np.array + The state feature vector + action : np.array of float or array_like of np.array + The action taken + reward : float or array_like of float + The reward seen by the agent after taking the action + next_state : np.array or array_like of np.array + The feature vector of the next state transitioned to after the + agent took the argument action + done_mask : bool or array_like of bool + False if the agent reached the goal, True if the agent did not + reach the goal yet the episode ended (e.g. 
max number of steps + reached) + """ + if self.discrete_action: + action = np.array([action]) + # Keep transition in replay buffer + self.replay.push(state, action, reward, next_state, done_mask) + + # Sample a batch from memory + state_batch, action_batch, reward_batch, next_state_batch, \ + mask_batch = self.replay.sample(batch_size=self.batch_size) + + # When updating Q functions, we don't want to backprop through the + # policy and target network parameters + with torch.no_grad(): + next_state_action, _, _ = \ + self.policy.sample(next_state_batch) + qf_next_value = self.critic_target(next_state_batch, + next_state_action) + + q_target = reward_batch + mask_batch * self.gamma * qf_next_value + + # Two Q-functions to reduce positive bias in policy improvement + q_prediction = self.critic(state_batch, action_batch) + # print(torch.cat([reward_batch, action_batch, mask_batch], dim=1)) + # print(q_prediction) + + # Calculate the losses on each critic + # JQ = 𝔼(st,at)~D[0.5(Q1(st,at) - r(st,at) - γ(𝔼st+1~p[V(st+1)]))^2] + q_loss = F.mse_loss(q_prediction, q_target) + + # Update the critic + self.critic_optim.zero_grad() + q_loss.backward() + self.critic_optim.step() + + # Calculate the actor loss using Eqn(5) in FKL/RKL paper + # No need to use a baseline in this setting + state_batch = state_batch.repeat_interleave(self.num_actions, dim=0) + actions = torch.tensor([n for n in range(self.num_actions)]) + actions = actions.repeat(self.batch_size) + actions = actions.unsqueeze(-1) + + q = self.critic(state_batch, actions) + log_prob = self.policy.log_prob(state_batch, actions) + prob = log_prob.exp() + + policy_loss = prob * (q - log_prob * self.alpha) + policy_loss = policy_loss.reshape([self.batch_size, self.num_actions]) + policy_loss = -policy_loss.sum(dim=1).mean() + + # Update the actor + self.policy_optim.zero_grad() + policy_loss.backward() + self.policy_optim.step() + + # Update target network + self.update_number += 1 + if self.update_number % self.target_update_interval == 0: + self.update_number = 0 + nn_utils.soft_update(self.critic_target, self.critic, self.tau) + + def reset(self): + """ + Resets the agent between episodes + """ + pass + + def eval(self): + """ + Sets the agent into offline evaluation mode, where the agent will not + explore + """ + self.is_training = False + + def train(self): + """ + Sets the agent to online training mode, where the agent will explore + """ + self.is_training = True + + # Save model parameters + def save_model(self, env_name, suffix="", actor_path=None, + critic_path=None): + """ + Saves the models so that after training, they can be used. + + Parameters + ---------- + env_name : str + The name of the environment that was used to train the models + suffix : str, optional + The suffix to the filename, by default "" + actor_path : str, optional + The path to the file to save the actor network as, by default None + critic_path : str, optional + The path to the file to save the critic network as, by default None + """ + pass + + # Load model parameters + def load_model(self, actor_path, critic_path): + """ + Loads in a pre-trained actor and a pre-trained critic to resume + training. + + Parameters + ---------- + actor_path : str + The path to the file which contains the actor + critic_path : str + The path to the file which contains the critic + """ + pass + + def get_parameters(self): + """ + Gets all learned agent parameters such that training can be resumed. 
+ + Gets all parameters of the agent such that, if given the + hyperparameters of the agent, training is resumable from this exact + point. This include the learned average reward, the learned entropy, + and other such learned values if applicable. This does not only apply + to the weights of the agent, but *all* values that have been learned + or calculated during training such that, given these values, training + can be resumed from this exact point. + + For example, in the LinearAC class, we must save not only the actor + and critic weights, but also the accumulated eligibility traces. + + Returns + ------- + dict of str to float, torch.Tensor + The agent's weights + """ + pass diff --git a/agent/nonlinear/nn_utils.py b/agent/nonlinear/nn_utils.py new file mode 100644 index 0000000..63215ba --- /dev/null +++ b/agent/nonlinear/nn_utils.py @@ -0,0 +1,331 @@ +# Import modules +import torch +import torch.nn as nn +import numpy as np + + +# Function definitions +def weights_init_(layer, init="kaiming", activation="relu"): + """ + Initializes the weights for a fully connected layer of a neural network. + + Parameters + ---------- + layer : torch.nn.Module + The layer to initialize + init : str + The type of initialization to use, one of 'xavier_uniform', + 'xavier_normal', 'uniform', 'normal', 'orthogonal', 'kaiming_uniform', + 'default', by default 'kaiming_uniform'. + activation : str + The activation function in use, used to calculate the optimal gain + value. + + """ + if "weight" in dir(layer): + gain = torch.nn.init.calculate_gain(activation) + + if init == "xavier_uniform": + torch.nn.init.xavier_uniform_(layer.weight, gain=gain) + elif init == "xavier_normal": + torch.nn.init.xavier_normal_(layer.weight, gain=gain) + elif init == "uniform": + torch.nn.init.uniform_(layer.weight) / layer.in_features + elif init == "normal": + torch.nn.init.normal_(layer.weight) / layer.in_features + elif init == "orthogonal": + torch.nn.init.orthogonal_(layer.weight) + elif init == "zeros": + torch.nn.init.zeros_(layer.weight) + elif init == "kaiming_uniform" or init == "default" or init is None: + # PyTorch default + return + else: + raise NotImplementedError(f"init {init} not implemented yet") + + if "bias" in dir(layer): + torch.nn.init.constant_(layer.bias, 0) + + +def soft_update(target, source, tau): + """ + Updates the parameters of the target network towards the parameters of + the source network by a weight average depending on tau. The new + parameters for the target network are: + + ((1 - τ) * target_parameters) + (τ * source_parameters) + + Parameters + ---------- + target : torch.nn.Module + The target network + source : torch.nn.Module + The source network + tau : float + The weighting for the weighted average + """ + with torch.no_grad(): + for target_param, param in zip(target.parameters(), + source.parameters()): + # Use in-place operations mul_ and add_ to avoid + # copying tensor data + target_param.data.mul_(1.0 - tau) + target_param.data.add_(tau * param.data) + + +def hard_update(target, source): + """ + Sets the parameters of the target network to the parameters of the + source network. 
Equivalent to soft_update(target, source, 1) + + Parameters + ---------- + target : torch.nn.Module + The target network + source : torch.nn.Module + The source network + """ + with torch.no_grad(): + for target_param, param in zip(target.parameters(), + source.parameters()): + target_param.data.copy_(param.data) + + +def init_layers(layers, init_scheme): + """ + Initializes the weights for the layers of a neural network. + + Parameters + ---------- + layers : list of nn.Module + The list of layers + init_scheme : str + The type of initialization to use, one of 'xavier_uniform', + 'xavier_normal', 'uniform', 'normal', 'orthogonal', by default None. + If None, leaves the default PyTorch initialization. + """ + def fill_weights(layers, init_fn): + for i in range(len(layers)): + init_fn(layers[i].weight) + + if init_scheme.lower() == "xavier_uniform": + fill_weights(layers, nn.init.xavier_uniform_) + elif init_scheme.lower() == "xavier_normal": + fill_weights(layers, nn.init.xavier_normal_) + elif init_scheme.lower() == "uniform": + fill_weights(layers, nn.init.uniform_) + elif init_scheme.lower() == "normal": + fill_weights(layers, nn.init.normal_) + elif init_scheme.lower() == "orthogonal": + fill_weights(layers, nn.init.orthogonal_) + elif init_scheme is None: + # Use PyTorch default + return + + +def _calc_conv_outputs(in_height, in_width, kernel_size, dilation=1, padding=0, + stride=1): + """ + Calculates the output height and width given in input height and width and + the kernel size. + + Parameters + ---------- + in_height : int + The height of the input image + in_width : int + The width of the input image + kernel_size : tuple[int, int] or int + The kernel size + dilation : tuple[int, int] or int + Spacing between kernel elements, by default 1 + padding : tuple[int, int] or int + Padding added to all four sides of the input, by default 0 + stride : tuple[int, int] or int + Stride of the convolution, by default 1 + + Returns + ------- + tuple[int, int] + The output width and height + """ + # Reshape so that kernel_size, padding, dilation, and stride have one + # element per dimension + if isinstance(kernel_size, int): + kernel_size = [kernel_size] * 2 + if isinstance(padding, int): + padding = [padding] * 2 + if isinstance(dilation, int): + dilation = [dilation] * 2 + if isinstance(stride, int): + stride = [stride] * 2 + + out_height = in_height + 2 * padding[0] - dilation[0] * ( + kernel_size[0] - 1) - 1 + out_height //= stride[0] + + out_width = in_width + 2 * padding[1] - dilation[1] * ( + kernel_size[1] - 1) - 1 + out_width //= stride[1] + + return out_height + 1, out_width + 1 + + +def _get_activation(activation): + """ + Returns an activation operation given a string describing the activation + operation + + Parameters + ---------- + activation : str + The string representation of the activation operation, one of 'relu', + 'tanh' + + Returns + ------- + nn.Module + The activation function + """ + # Set the activation funcitons + if activation.lower() == "relu": + act = nn.ReLU() + elif activation.lower() == "tanh": + act = nn.Tanh() + else: + raise ValueError(f"unknown activation {activation}") + + return act + + +def _construct_conv_linear(input_dim, num_actions, channels, kernel_sizes, + hidden_sizes, init, activation, single_output): + """ + Constructs a number of convolutional layers and a sequence of + densely-connected layers which operate on the output of the convolutional + layers, returning the convolutional sequence and densely-connected sequence + separately. 
+
+    This function is particularly suited to producing Q functions or
+    Softmax policies, but can also be used to construct other approximators
+    such as Gaussian policies or V functions (for a V function,
+    `num_actions` would be the number of state values to output, which is
+    always 1, and `single_output` should be set to `True`).
+
+    This function constructs a neural net of the form:
+
+        input --> convolutional layers --> densely-connected layers --> output
+
+    and returns the convolutional and densely-connected layers separately.
+
+    Parameters
+    ----------
+    input_dim : tuple[int, int, int]
+        Dimensionality of state features, which should be (channels,
+        height, width)
+    num_actions : int
+        If `single_output` is `True`, then this should be the dimensionality
+        of the action, since the action will be concatenated with the input
+        to the linear layers. If `single_output` is `False`, then this should
+        be the number of discrete actions available in the environment, and
+        the network will output `num_actions` action values.
+    channels : array-like[int]
+        The number of channels in each hidden convolutional layer
+    kernel_sizes : array-like[int]
+        The kernel size of each consecutive convolutional layer
+    hidden_sizes : array-like[int]
+        The number of units in each consecutive fully connected layer
+    init : str
+        The initialization scheme to use for the weights, one of
+        'xavier_uniform', 'xavier_normal', 'uniform', 'normal',
+        'orthogonal', by default None. If None, leaves the default
+        PyTorch initialization.
+    activation : indexable[str] or str
+        The activation function to use; each element should be one of
+        'relu', 'tanh'
+    single_output : bool
+        Whether or not the network should have a single output. If `True`,
+        then the action is concatenated with the input to the linear layers.
+        If `False`, then `num_actions` action values are outputted.
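+
+    As a rough usage sketch (mirroring the Q and DiscreteQ networks built
+    on top of this helper), a discrete-action network applies the returned
+    pair as `linear(torch.flatten(conv(state), start_dim=1))`, while a
+    single-output Q network concatenates the action with the flattened
+    convolution output before the linear layers.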
+    """
+    # Ensure the number of channels == the number of kernel sizes
+    if len(channels) != len(kernel_sizes):
+        raise ValueError("must have the same number of channels and " +
+                         f"kernels but got {len(channels)} channels " +
+                         f"and {len(kernel_sizes)} kernels")
+
+    # Build the list of activations: either a single activation repeated for
+    # every layer, or one activation specified per layer
+    if isinstance(activation, str):
+        act = [_get_activation(activation)] * (len(channels) +
+                                               len(hidden_sizes))
+    elif len(activation) != len(channels) + len(hidden_sizes):
+        raise ValueError("must have one activation per layer but got " +
+                         f"{len(activation)} activations for " +
+                         f"{len(channels) + len(hidden_sizes)} layers")
+    else:
+        act = [_get_activation(a) for a in activation]
+
+    # Convolutional layers
+    conv = []  # List of sequential convolutional layers and activations
+    in_channels = input_dim[0]
+    out_channels = channels[0]
+    kernel = kernel_sizes[0]
+    channel_size = input_dim[1:]
+    for i in range(1, len(channels)):
+        # Append the convolutional layer and activation to the list of layers
+        conv.append(nn.Conv2d(in_channels, out_channels, kernel))
+        conv.append(act[i-1])
+
+        # Calculate the next channel size to be used later for the number of
+        # inputs to the dense layers
+        channel_size = _calc_conv_outputs(channel_size[0],
+                                          channel_size[1], kernel)
+
+        # Update some running variables for the convolutional layer sizes for
+        # the next convolutional layer
+        in_channels = out_channels
+        out_channels = channels[i]
+        kernel = kernel_sizes[i]
+
+    # Append the last convolutional layer to the list of layers
+    conv.append(nn.Conv2d(in_channels, out_channels, kernel))
+    conv.append(act[len(channels)-1])
+    channel_size = _calc_conv_outputs(channel_size[0],
+                                      channel_size[1], kernel)
+
+    # Ensure that the final output size of the convolutional layers
+    # is non-negative
+    if np.any(np.array(channel_size) < 0):
+        raise ValueError("convolutions produce shape with negative size")
+
+    # Construct the chain of convolutions and activations
+    conv = nn.Sequential(*conv)
+    conv.apply(lambda module: weights_init_(module, init))
+
+    # Get the final number of elements of the output of the
+    # convolutional layers
+    conv_out = out_channels * np.prod(channel_size)
+
+    # Linear layers
+    linear = []  # List of dense connections and activations
+    in_units = conv_out + (num_actions if single_output else 0)
+    for i in range(len(hidden_sizes)):
+        # Add a dense layer and activation to the list of operations for the
+        # fully connected layers
+        linear.append(nn.Linear(in_units, hidden_sizes[i]))
+        linear.append(act[len(channels) + i])
+
+        # Update the number of inputs to the next layer
+        in_units = hidden_sizes[i]
+
+    # Add the final dense layer
+    if single_output:
+        linear.append(nn.Linear(in_units, 1))
+    else:
+        linear.append(nn.Linear(in_units, num_actions))
+
+    # Construct the chain of dense connections and activations
+    linear = nn.Sequential(*linear)
+    linear.apply(lambda module: weights_init_(module, init))
+
+    return conv, linear
diff --git a/agent/nonlinear/policy/CNN.py b/agent/nonlinear/policy/CNN.py
new file mode 100644
index 0000000..4ce5c47
--- /dev/null
+++ b/agent/nonlinear/policy/CNN.py
@@ -0,0 +1,146 @@
+# Import modules
+import agent.nonlinear.nn_utils as nn_utils
+import numpy as np
+import time
+import torch
+from torch.distributions import Normal, Independent
+import torch.nn as nn
+import torch.nn.functional as F
+
+
+# Global variables
+EPSILON = 1e-6
+
+
+# Class definitions
+class Softmax(nn.Module):
+    """
+    Softmax implements a softmax policy in each state, parameterized
+    using a CNN to predict logits.
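+
+    The network maps an image-shaped state observation to one logit per
+    discrete action; actions are drawn from the categorical distribution
+    obtained by applying a softmax to these logits.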
+ """ + def __init__(self, input_dim, channels, kernel_sizes, + hidden_sizes, init, activation, action_space): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + action_space : gym.Spaces.Discrete + The action space + """ + super(Softmax, self).__init__() + + self.num_actions = action_space.n + self.conv, self.linear = nn_utils._construct_conv_linear( + input_dim, + self.num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + def forward(self, state): + """ + Performs the forward pass through the network, predicting a logit for + each action in `state`. + + Parameters + ---------- + state : torch.Tensor[float] or np.array[float] + The state that the action was taken in + + Returns + ------- + torch.Tensor + The logit for each action in `state` with shape `(batch, + num_actions)` + """ + if isinstance(state, np.ndarray): + x = torch.tensor(state) + + x = self.conv(state) + return self.linear(torch.flatten(x, start_dim=1)) + + def sample(self, state, num_samples=1, log_prob=False): + """ + Returns actions sampled from the policy in `state` + + Parameters + ---------- + state : torch.Tensor + The states to sample the actions in + num_samples : int, optional + The number of actions to sampler per state + log_prob : bool, optional + Whether or not to return the log probability of each action in + each state in `state`, by default `False` + + Returns + ------- + torch.Tensor + A sample of `num_samples` actions in each state, with shape + `(num_samples, batch, action_dims = 1)` + """ + logits = self.forward(state) + + probs = F.softmax(logits, dim=1) + + policy = torch.distributions.Categorical(probs) + actions = policy.sample((num_samples,)) + + log_prob_val = None + if log_prob: + log_prob_val = F.log_softmax(logits, dim=1) + log_prob_val = torch.gather(log_prob_val, dim=1, index=actions) + + if num_samples == 1: + actions = actions.squeeze(0) + if log_prob: + log_prob_val = log_prob_val.squeeze(0) + + actions = actions.unsqueeze(-1) + if log_prob: + log_prob_val = log_prob_val.unsqueeze(-1) + + return actions.long(), log_prob_val, None + + def all_log_prob(self, states): + """ + Returns the log probability of taking each action in `states`. + """ + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + + return log_probs + + def log_prob(self, states, actions): + """ + Returns the log probability of taking `actions` in `states`. 
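+
+        `actions` is expected to have shape `(batch, 1)`; a 1-dimensional
+        tensor of shape `(batch,)` is also accepted and is given a trailing
+        dimension before the per-action log probabilities are gathered.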
+ """ + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + if actions.shape[0] == log_probs.shape[0] and len(actions.shape) == 1: + actions = actions.unsqueeze(-1) + log_probs = torch.gather(log_probs, dim=1, index=actions.long()) + + return log_probs diff --git a/agent/nonlinear/policy/MLP.py b/agent/nonlinear/policy/MLP.py new file mode 100644 index 0000000..7d7cec9 --- /dev/null +++ b/agent/nonlinear/policy/MLP.py @@ -0,0 +1,723 @@ +# Import modules +import torch +import time +import numpy as np +import torch.nn as nn +import torch.nn.functional as F +from torch.distributions import Normal, Independent +from agent.nonlinear.nn_utils import weights_init_ +from utils.TruncatedNormal import TruncatedNormal + + +# Global variables +EPSILON = 1e-6 + + +# Class definitions +class SquashedGaussian(nn.Module): + """ + Class SquashedGaussian implements a policy following a squashed + Gaussian distribution in each state, parameterized by an MLP. + + The MLP architecture is implemented + as two shared hidden layers, followed by two separate output layers: + one to predict the mean, and the other to predict the log standard + deviation. + + For the the version that SAC used for the submission to ICML, see + commit f66e4bf666da8c4142ff5acd33aed91dc25f4110. + Basically there was a bug where the first and last layers + used xavier_uniform while the second layer used kaiming_uniform + """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + action_space=None, clip_stddev=1000, init=None): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of elements in the state feature vector + num_actions : int + The dimensionality of the action vector + hidden_dim : int + The number of units in each hidden layer of the network + activation : str + The activation function to use, one of 'relu', 'tanh' + action_space : gym.spaces.Space, optional + The action space of the environment, by default None. This argument + is used to ensure that the actions are within the correct scale. + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + """ + super(SquashedGaussian, self).__init__() + + self.num_actions = num_actions + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_std_threshold = np.log(clip_stddev) + + # Set up the layers + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.mean_linear = nn.Linear(hidden_dim, num_actions) + self.log_std_linear = nn.Linear(hidden_dim, num_actions) + + # Initialize weights + self.apply(lambda module: weights_init_(module, init, activation)) + + # action rescaling + if action_space is None: + self.action_scale = torch.tensor(1.) + self.action_bias = torch.tensor(0.) + else: + self.action_scale = torch.FloatTensor( + (action_space.high - action_space.low) / 2.) + self.action_bias = torch.FloatTensor( + (action_space.high + action_space.low) / 2.) 
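+
+        # The raw sample x_t is squashed through tanh, so it lies in
+        # (-1, 1); action_scale and action_bias then map it affinely onto
+        # the environment's action bounds:
+        #     action = tanh(x_t) * action_scale + action_bias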
+ + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation function {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the mean + and the log standard deviation. + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + 2-tuple of torch.Tensor of float + The mean and log standard deviation of the Gaussian policy in the + argument state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + + mean = self.mean_linear(x) + log_std = self.log_std_linear(x) + + if self.clip_stddev: + log_std = torch.clamp(log_std, min=-self.clip_std_threshold, + max=self.clip_std_threshold) + return mean, log_std + + def sample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + + if self.num_actions > 1: + normal = Independent(normal, 1) + + x_t = normal.sample((num_samples,)) + if num_samples == 1: + x_t = x_t.squeeze(0) + y_t = torch.tanh(x_t) + action = y_t * self.action_scale + self.action_bias + log_prob = normal.log_prob(x_t) + + log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + + EPSILON).sum(axis=-1).reshape(log_prob.shape) + if self.num_actions > 1: + log_prob = log_prob.unsqueeze(-1) + + mean = torch.tanh(mean) * self.action_scale + self.action_bias + + return action, log_prob, mean, x_t + + def rsample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state using + the reparameterization trick + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + + if self.num_actions > 1: + normal = Independent(normal, 1) + + # For re-parameterization trick (mean + std * N(0,1)) + # rsample() implements the re-parameterization trick + x_t = normal.rsample((num_samples,)) + if num_samples == 1: + x_t = x_t.squeeze(0) + y_t = torch.tanh(x_t) + action = y_t * self.action_scale + self.action_bias + log_prob = normal.log_prob(x_t) + + log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + + EPSILON).sum(axis=-1).reshape(log_prob.shape) + if self.num_actions > 1: + log_prob = log_prob.unsqueeze(-1) + + mean = torch.tanh(mean) * self.action_scale + self.action_bias + + return action, log_prob, mean, x_t + + def log_prob(self, state_batch, x_t_batch): + """ + Calculates the log probability of taking the action generated + from x_t, where x_t is returned from sample or rsample. The + log probability is returned for each action dimension separately. 
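+
+        As in `sample` and `rsample`, the tanh change-of-variables
+        correction `log(action_scale * (1 - tanh(x_t)^2) + EPSILON)` is
+        subtracted from the Gaussian log density, so the returned value is
+        the log probability of the squashed, rescaled action.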
+ """ + mean, log_std = self.forward(state_batch) + std = log_std.exp() + normal = Normal(mean, std) + + if self.num_actions > 1: + normal = Independent(normal, 1) + + y_t = torch.tanh(x_t_batch) + log_prob = normal.log_prob(x_t_batch) + log_prob -= torch.log(self.action_scale * (1 - y_t.pow(2)) + + EPSILON).sum(axis=-1).reshape(log_prob.shape) + if self.num_actions > 1: + log_prob = log_prob.unsqueeze(-1) + + return log_prob + + def to(self, device): + """ + Moves the network to a device + + Parameters + ---------- + device : torch.device + The device to move the network to + + Returns + ------- + nn.Module + The current network, moved to a new device + """ + self.action_scale = self.action_scale.to(device) + self.action_bias = self.action_bias.to(device) + return super(SquashedGaussian, self).to(device) + + +class Softmax(nn.Module): + """ + Softmax implements a softmax policy in each state, parameterized + using an MLP to predict logits. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + init=None): + super(Softmax, self).__init__() + + self.num_actions = num_actions + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, num_actions) + + # self.apply(weights_init_) + self.apply(lambda module: weights_init_(module, init, activation)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + return self.linear3(x) + + def sample(self, state, num_samples=1): + logits = self.forward(state) + + if len(logits.shape) != 1 and (len(logits.shape) != 2 and 1 not in + logits.shape): + shape = logits.shape + raise ValueError(f"expected a vector of logits, got shape {shape}") + + probs = F.softmax(logits, dim=1) + + policy = torch.distributions.Categorical(probs) + actions = policy.sample((num_samples,)) + + log_prob = F.log_softmax(logits, dim=1) + + log_prob = torch.gather(log_prob, dim=1, index=actions) + if num_samples == 1: + actions = actions.squeeze(0) + log_prob = log_prob.squeeze(0) + + actions = actions.unsqueeze(-1) + log_prob = log_prob.unsqueeze(-1) + + # return actions.float(), log_prob, None + return actions.int(), log_prob, None + + def all_log_prob(self, states): + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + + return log_probs + + def log_prob(self, states, actions): + """TODO: Docstring for log_prob. + + Parameters + ---------- + states : TODO + actions : TODO + + Returns + ------- + TODO + + """ + logits = self.forward(states) + log_probs = F.log_softmax(logits, dim=1) + log_probs = torch.gather(log_probs, dim=1, index=actions.long()) + + return log_probs + + +class Gaussian(nn.Module): + """ + Class Gaussian implements a policy following Gaussian distribution + in each state, parameterized as an MLP. The predicted mean is scaled to be + within `(action_min, action_max)`. + + The MLP architecture is implemented as two shared hidden layers, + followed by two separate output layers: one to predict the mean, and the + other to predict the log standard deviation. 
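+
+    Unlike `SquashedGaussian`, sampled actions are not squashed with tanh;
+    they are instead clamped to `[action_min, action_max]`, and only the
+    predicted mean is squashed into that range.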
+ """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + action_space, clip_stddev=1000, init=None): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of elements in the state feature vector + num_actions : int + The dimensionality of the action vector + hidden_dim : int + The number of units in each hidden layer of the network + action_space : gym.spaces.Space + The action space of the environment + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + """ + super(Gaussian, self).__init__() + + self.num_actions = num_actions + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_std_threshold = np.log(clip_stddev) + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + + self.mean_linear = nn.Linear(hidden_dim, num_actions) + self.log_std_linear = nn.Linear(hidden_dim, num_actions) + + # Initialize weights + self.apply(lambda module: weights_init_(module, init, activation)) + + # Action rescaling + self.action_max = torch.FloatTensor(action_space.high) + self.action_min = torch.FloatTensor(action_space.low) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the mean + and the log standard deviation. 
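+
+        The predicted mean is passed through tanh and rescaled as
+        `((tanh(m) + 1) / 2) * (action_max - action_min) + action_min`, so
+        it always lies in `[action_min, action_max]`, while the log standard
+        deviation is optionally clamped to `±log(clip_stddev)`.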
+ + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + 2-tuple of torch.Tensor of float + The mean and log standard deviation of the Gaussian policy in the + argument state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + + mean = torch.tanh(self.mean_linear(x)) + mean = ((mean + 1) / 2) * (self.action_max - self.action_min) + \ + self.action_min # ∈ [action_min, action_max] + log_std = self.log_std_linear(x) + + # Works better with std dev clipping to ±1000 + if self.clip_stddev: + log_std = torch.clamp(log_std, min=-self.clip_std_threshold, + max=self.clip_std_threshold) + return mean, log_std + + def rsample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + if self.num_actions > 1: + normal = Independent(normal, 1) + + # For re-parameterization trick (mean + std * N(0,1)) + # rsample() implements the re-parameterization trick + action = normal.rsample((num_samples,)) + action = torch.clamp(action, self.action_min, self.action_max) + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if self.num_actions == 1: + log_prob.unsqueeze(-1) + + return action, log_prob, mean + + def sample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + num_samples : int + The number of actions to sample + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = Normal(mean, std) + if self.num_actions > 1: + normal = Independent(normal, 1) + + # Non-differentiable + action = normal.sample((num_samples,)) + action = torch.clamp(action, self.action_min, self.action_max) + + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if self.num_actions == 1: + log_prob.unsqueeze(-1) + + # print(action.shape) + + return action, log_prob, mean + + def log_prob(self, states, actions, show=False): + """ + Returns the log probability of taking actions in states. The + log probability is returned for each action dimension + separately, and should be added together to get the final + log probability + """ + mean, log_std = self.forward(states) + std = log_std.exp() + normal = Normal(mean, std) + if self.num_actions > 1: + normal = Independent(normal, 1) + + log_prob = normal.log_prob(actions) + if self.num_actions == 1: + log_prob.unsqueeze(-1) + + if show: + print(torch.cat([mean, std], axis=1)[0]) + + return log_prob + + def to(self, device): + """ + Moves the network to a device + + Parameters + ---------- + device : torch.device + The device to move the network to + + Returns + ------- + nn.Module + The current network, moved to a new device + """ + self.action_max = self.action_max.to(device) + self.action_min = self.action_min.to(device) + return super(Gaussian, self).to(device) + + +class TruncatedGaussian(nn.Module): + """ + Class TruncatedGaussian implements a policy following + a truncated Gaussian distribution in each state. 
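+
+    The distribution is truncated to the action bounds, using the
+    `TruncatedNormal` helper from `utils.TruncatedNormal` with
+    `a=action_min` and `b=action_max`, so sampled actions respect the
+    bounds without a tanh squashing step.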
+ + The MLP architecture is implemented + as two shared hidden layers, followed by two separate output layers: + one to predict the mean, and the other to predict the log standard + deviation. The mean is scaled to be within (action_min, `action_max)`. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, activation, + action_space, clip_stddev=1000, init=None): + """ + Constructor + + Parameters + ---------- + num_inputs : int + The number of elements in the state feature vector + num_actions : int + The dimensionality of the action vector + hidden_dim : int + The number of units in each hidden layer of the network + action_space : gym.spaces.Space + The action space of the environment + clip_stddev : float, optional + The value at which the standard deviation is clipped in order to + prevent numerical overflow, by default 1000. If <= 0, then + no clipping is done. + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + """ + super(TruncatedGaussian, self).__init__() + + # Determine standard deviation clipping + self.clip_stddev = clip_stddev > 0 + self.clip_std_threshold = np.log(clip_stddev) + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + + self.mean_linear = nn.Linear(hidden_dim, num_actions) + self.log_std_linear = nn.Linear(hidden_dim, num_actions) + + # self.apply(weights_init_) + self.apply(lambda module: weights_init_(module, init, activation)) + + # action rescaling + assert len(action_space.low.shape) == 1 + self.action_max = torch.FloatTensor(action_space.high) + self.action_min = torch.FloatTensor(action_space.low) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the mean + and the log standard deviation. 
+ + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + 2-tuple of torch.Tensor of float + The mean and log standard deviation of the Gaussian policy in the + argument state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + + mean = torch.tanh(self.mean_linear(x)) + mean = ((mean + 1)/2) * (self.action_max - self.action_min) + \ + self.action_min # ∈ [action_min, action_max] + log_std = self.log_std_linear(x) + + # Works better with std dev clipping to ±1000 + if self.clip_stddev: + log_std = torch.clamp(log_std, min=-self.clip_std_threshold, + max=self.clip_std_threshold) + return mean, log_std + + def rsample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = TruncatedNormal(loc=mean, scale=std, a=self.action_min, + b=self.action_max) + + # For re-parameterization trick (mean + std * N(0,1)) + # rsample() implements the re-parameterization trick + x = normal.rsample((num_samples,)) + action = torch.clamp(x, self.action_min, self.action_max) + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if num_samples == 1: + log_prob = log_prob.sum(1, keepdim=True) + else: + log_prob = log_prob.sum(2, keepdim=True) + + return action, log_prob, mean + + def sample(self, state, num_samples=1): + """ + Samples the policy for an action in the argument state + + Parameters + ---------- + state : torch.Tensor of float + The input state to predict the policy in + num_samples : int + The number of actions to sample + + Returns + ------- + torch.Tensor of float + A sampled action + """ + mean, log_std = self.forward(state) + std = log_std.exp() + normal = TruncatedNormal(loc=mean, scale=std, a=self.action_min, + b=self.action_max) + + # Non-differentiable + x = normal.sample((num_samples,)) + action = torch.clamp(x, self.action_min, self.action_max) + if num_samples == 1: + action = action.squeeze(0) + + log_prob = normal.log_prob(action) + if num_samples == 1: + log_prob = log_prob.sum(1, keepdim=True) + else: + log_prob = log_prob.sum(2, keepdim=True) + + return action, log_prob, mean + + def log_prob(self, states, actions, show=False): + """ + Returns the log probability of taking actions in states. 
The + log probability is returned for each action dimension + separately, and should be added together to get the final + log probability + """ + mean, log_std = self.forward(states) + std = log_std.exp() + normal = TruncatedNormal(loc=mean, scale=std, a=self.action_min, + b=self.action_max) + + log_prob = normal.log_prob(actions) + + if show: + print(torch.cat([mean, std], axis=1)[0]) + # print(log_prob.shape) + + return log_prob + + def to(self, device): + """ + Moves the network to a device + + + Parameters + ---------- + device : torch.device + The device to move the network to + + Returns + ------- + nn.Module + The current network, moved to a new device + """ + self.action_max = self.action_max.to(device) + self.action_min = self.action_min.to(device) + return super(TruncatedGaussian, self).to(device) diff --git a/agent/nonlinear/value_function/CNN.py b/agent/nonlinear/value_function/CNN.py new file mode 100644 index 0000000..6f304b8 --- /dev/null +++ b/agent/nonlinear/value_function/CNN.py @@ -0,0 +1,238 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import agent.nonlinear.nn_utils as nn_utils + + +class Q(nn.Module): + """ + Class Q implements an action-value network using a CNN function + approximator. The network has a single output, which is the action value + for the input action in the input state. + + The action value is compute by first convolving the state observation, the + concatenating the flattened state convolution with the action and using + this as input to the fully connected layers. A single action value is + outputted for the input action. + """ + def __init__(self, input_dim, action_dim, channels, kernel_sizes, + hidden_sizes, init, activation): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + action_dim : int + Dimensionality of the action vector + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + """ + super(Q, self).__init__() + + self.conv, self.linear = nn_utils._construct_conv_linear( + input_dim, + action_dim, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + True, + ) + + def forward(self, state, action): + """ + Performs the forward pass through the network, predicting the + action-value for `action` in `state`. + + Parameters + ---------- + state : torch.Tensor[float] + The state that the action was taken in + action : torch.Tensor[float] or np.ndarray[float] + The action to get the value of + + Returns + ------- + torch.Tensor + The action value prediction + """ + if isinstance(state, np.ndarray): + x = torch.tensor(state) + + x = self.conv(state) + + x = torch.flatten(x) + x = torch.cat([x, action]) + return self.linear(x) + + +class DiscreteQ(nn.Module): + """ + Class DiscreteQ implements an action-value network using a CNN function + approximator. 
The network outputs one action value for each available + action. + """ + def __init__(self, input_dim, num_actions, channels, kernel_sizes, + hidden_sizes, init, activation): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + num_actions : int + The number of available actions in the environment + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + """ + super(DiscreteQ, self).__init__() + + self.conv, self.linear = nn_utils._construct_conv_linear( + input_dim, + num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + def forward(self, state): + """ + Performs the forward pass through the network, predicting an action + value for each action in `state`. + + Parameters + ---------- + state : torch.Tensor[float] or np.array[float] + The state that the action was taken in + + Returns + ------- + torch.Tensor + The action value prediction for each action in `state` + """ + if isinstance(state, np.ndarray): + x = torch.tensor(x) + + x = self.conv(state) + return self.linear(torch.flatten(x, start_dim=1)) + + +class DoubleDiscreteQ(nn.Module): + """ + Class DoubleDiscreteQ implements a double action-value network + using a CNN function approximator. + The network outputs two action values for each available action. + """ + def __init__(self, input_dim, num_actions, channels, kernel_sizes, + hidden_sizes, init, activation): + """ + Constructor + + Parameters + ---------- + input_dim : tuple[int, int, int] + Dimensionality of state features, which should be (channels, + height, width) + num_actions : int + The number of available actions in the environment + channels : array-like[int] + The number of channels in each hidden convolutional layer + kernel_sizes : array-like[int] + The number of channels in each consecutive convolutional layer + hidden_sizes : array-like[int] + The number of units in each consecutive fully connected layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : indexable[str] or str + The activation function to use; each element should be one of + 'relu', 'tanh' + """ + super(DoubleDiscreteQ, self).__init__() + + self.conv1, self.linear1 = nn_utils._construct_conv_linear( + input_dim, + num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + self.conv2, self.linear2 = nn_utils._construct_conv_linear( + input_dim, + num_actions, + channels, + kernel_sizes, + hidden_sizes, + init, + activation, + False, + ) + + def forward(self, state): + """ + Performs the forward pass through the network, predicting an action + value for each action in `state`. 
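+
+        The two convolutional-plus-linear networks are evaluated on the
+        same state and their outputs are returned as the pair `(q1, q2)`;
+        agents typically take the element-wise minimum of the two estimates
+        to reduce overestimation bias.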
+ + Parameters + ---------- + state : torch.Tensor[float] or np.array[float] + The state that the action was taken in + + Returns + ------- + torch.Tensor + The action value prediction for each action in `state` + """ + if isinstance(state, np.ndarray): + x = torch.tensor(x) + + x1 = self.conv1(state) + q1 = self.linear1(torch.flatten(x1, start_dim=1)) + + x2 = self.conv2(state) + q2 = self.linear2(torch.flatten(x2, start_dim=1)) + + return q1, q2 diff --git a/agent/nonlinear/value_function/MLP.py b/agent/nonlinear/value_function/MLP.py new file mode 100644 index 0000000..bebffd8 --- /dev/null +++ b/agent/nonlinear/value_function/MLP.py @@ -0,0 +1,282 @@ +#!/usr/bin/env python3 + +# Import modules +import numpy as np +import torch +import torch.nn as nn +import torch.nn.functional as F +import agent.nonlinear.nn_utils as nn_utils + + +# Class definitions +class V(nn.Module): + """ + Class V is an MLP for estimating the state value function `v`. + """ + def __init__(self, num_inputs, hidden_dim, init, activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of input feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(V, self).__init__() + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, 1) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through the network, predicting the value of + `state`. + + Parameters + ---------- + state : torch.Tensor of float + The feature vector of the state to compute the value of + + Returns + ------- + torch.Tensor of float + The value of the state + """ + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + x = self.linear3(x) + return x + + +class DiscreteQ(nn.Module): + """ + Class DiscreteQ implements an action value network with number of + predicted action values equal to the number of available actions. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, init, + activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of state feature vector + num_actions : int + Dimensionality of the action feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. 
+ activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(DiscreteQ, self).__init__() + + self.linear1 = nn.Linear(num_inputs, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, num_actions) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state): + """ + Performs the forward pass through each network, predicting the + action-value for `action` in `state`. + + Parameters + ---------- + state : torch.Tensor of float + The state that the action was taken in + + Returns + ------- + torch.Tensor + The action value predictions + """ + + x = self.act(self.linear1(state)) + x = self.act(self.linear2(x)) + return self.linear3(x) + + +class Q(nn.Module): + """ + Class Q implements an action-value network using an MLP function + approximator. The action value is computed by concatenating the action to + the state observation as the input to the neural network. + """ + def __init__(self, num_inputs, num_actions, hidden_dim, init, + activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of state feature vector + num_actions : int + Dimensionality of the action feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(Q, self).__init__() + + # Q1 architecture + self.linear1 = nn.Linear(num_inputs + num_actions, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, 1) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state, action): + """ + Performs the forward pass through each network, predicting the + action-value for `action` in `state`. + + Parameters + ---------- + state : torch.Tensor of float + The state that the action was taken in + action : torch.Tensor of float + The action taken in the input state to predict the value function + of + + Returns + ------- + torch.Tensor + The action value prediction + """ + xu = torch.cat([state, action], 1) + + x = self.act(self.linear1(xu)) + x = self.act(self.linear2(x)) + x = self.linear3(x) + + return x + + +class DoubleQ(nn.Module): + """ + Class DoubleQ implements two action-value networks, + computing the action-value function using two separate fully + connected neural net. This is useful for implementing double Q-learning. + The action values are computed by concatenating the action to the state + observation and using this as input to each neural network. 
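+
+    A minimal usage sketch (the dimensions below are illustrative, not taken
+    from any configuration file):
+
+        q = DoubleQ(num_inputs=4, num_actions=1, hidden_dim=64,
+                    init="xavier_uniform", activation="relu")
+        states = torch.zeros(32, 4)   # batch of 32 four-dimensional states
+        actions = torch.zeros(32, 1)  # batch of 32 one-dimensional actions
+        q1, q2 = q(states, actions)   # one estimate from each network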
+ """ + def __init__(self, num_inputs, num_actions, hidden_dim, init, + activation): + """ + Constructor + + Parameters + ---------- + num_inputs : int + Dimensionality of state feature vector + num_actions : int + Dimensionality of the action feature vector + hidden_dim : int + The number of units in each hidden layer + init : str + The initialization scheme to use for the weights, one of + 'xavier_uniform', 'xavier_normal', 'uniform', 'normal', + 'orthogonal', by default None. If None, leaves the default + PyTorch initialization. + activation : str + The activation function to use; one of 'relu', 'tanh' + """ + super(DoubleQ, self).__init__() + + # Q1 architecture + self.linear1 = nn.Linear(num_inputs + num_actions, hidden_dim) + self.linear2 = nn.Linear(hidden_dim, hidden_dim) + self.linear3 = nn.Linear(hidden_dim, 1) + + # Q2 architecture + self.linear4 = nn.Linear(num_inputs + num_actions, hidden_dim) + self.linear5 = nn.Linear(hidden_dim, hidden_dim) + self.linear6 = nn.Linear(hidden_dim, 1) + + self.apply(lambda module: nn_utils.weights_init_(module, init)) + + if activation == "relu": + self.act = F.relu + elif activation == "tanh": + self.act = torch.tanh + else: + raise ValueError(f"unknown activation {activation}") + + def forward(self, state, action): + """ + Performs the forward pass through each network, predicting two + action-values (from each action-value approximator) for the input + action in the input state. + + Parameters + ---------- + state : torch.Tensor of float + The state that the action was taken in + action : torch.Tensor of float + The action taken in the input state to predict the value function + of + + Returns + ------- + 2-tuple of torch.Tensor of float + A 2-tuple of action values, one predicted by each function + approximator + """ + xu = torch.cat([state, action], 1) + + x1 = self.act(self.linear1(xu)) + x1 = self.act(self.linear2(x1)) + x1 = self.linear3(x1) + + x2 = self.act(self.linear4(xu)) + x2 = self.act(self.linear5(x2)) + x2 = self.linear6(x2) + + return x1, x2 diff --git a/combine.py b/combine.py new file mode 100644 index 0000000..2ec2e33 --- /dev/null +++ b/combine.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python3 + +import sys +import pickle +import os +import json +import utils.experiment_utils as exp +import click + + +def add_dicts(data, newfiles): + """ + add_dicts adds the data dictionaries in newfiles to the existing + dictionary data. This function assumes that the hyperparameter + indices between data and those found in each file in newfiles are + consistent. 
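+
+    A minimal usage sketch (the file names are hypothetical; each file is a
+    pickled results dictionary produced by main.py):
+
+        data = add_dicts(None, ["results/run_0.pkl", "results/run_1.pkl"])
+        # data["experiment_data"][i]["runs"] now contains the runs from both
+        # files for each hyperparameter index i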
+ """ + set_experiment_val = False + if data is None: + set_experiment_val = True + data = { + "experiment_data": {}, + "experiment": {}, + } + # Add data from all other dictionaries + for file in newfiles: + with open(file, "rb") as in_file: + # Read in the new dictionary + try: + in_data = pickle.load(in_file) + except EOFError: + print(file) + continue + + if set_experiment_val: + data["experiment"] = in_data["experiment"] + + # Add experiment data to running dictionary + for key in in_data["experiment_data"]: + # Check if key exists + if key in data["experiment_data"]: + if "learned_params" in \ + data["experiment_data"][key]["runs"][0]: + del data["experiment_data"][key]["runs"][0][ + "learned_params"] + # continue + # Append data if existing + data["experiment_data"][key]["runs"].extend( + in_data["experiment_data"][key]["runs"]) + + else: + # Key doesn't exist - add data to dictionary + data["experiment_data"][key] = \ + in_data["experiment_data"][key] + + return data + + +@click.command(help="combine a number of data files in a single " + + "directory into a single data file called data.pkl") +@click.argument("directory", required=True, type=click.Path(exists=True)) +def main(directory): + data = None + if os.path.exists(os.path.join(directory, "data.pkl")): + print("remove data.pkl from directory first") + + files = os.listdir(directory) + if "data.pkl" in files: + files.remove("data.pkl") + filenames = list(map(lambda x: os.path.join(directory, x), files)) + data = add_dicts(data, filenames) + + with open(os.path.join(directory, "data.pkl"), "wb") as outfile: + pickle.dump(data, outfile) + + +if __name__ == "__main__": + main() diff --git a/config/agent/FKL.json b/config/agent/FKL.json new file mode 100644 index 0000000..232e6e5 --- /dev/null +++ b/config/agent/FKL.json @@ -0,0 +1,21 @@ +{ + "agent_name": "fkl", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "num_samples": [30], + "policy_type": ["Gaussian"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0], + "hidden_dim": [64], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git a/config/agent/LinearGaussianAC.json b/config/agent/LinearGaussianAC.json new file mode 100644 index 0000000..6534f01 --- /dev/null +++ b/config/agent/LinearGaussianAC.json @@ -0,0 +1,17 @@ +{ + "agent_name": "LinearGaussianAC", + "parameters": + { + "decay": [0.5, 0.75, 0.9], + "critic_lr": [2.0, 0.5, 0.25, 0.125, 0.0625], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 5.0], + "use_critic_trace": [true, false], + "use_actor_trace": [true, false], + "scaled": [false], + "bins": [4], + "num_tilings": [16], + "clip_stddev": [1000], + "count_interval": [10000], + "trace_type": ["replacing"] + } +} diff --git a/config/agent/LinearSoftmaxAC.json b/config/agent/LinearSoftmaxAC.json new file mode 100644 index 0000000..4081b47 --- /dev/null +++ b/config/agent/LinearSoftmaxAC.json @@ -0,0 +1,18 @@ +{ + "agent_name": "LinearSoftmaxAC", + "parameters": + { + "decay": [0.5, 0.75, 0.9], + "critic_lr": [2.0, 0.5, 0.25, 0.125, 0.0625], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 5.0], + "use_critic_trace": [true, false], + "use_actor_trace": [true, false], + "temperature": [1.0, 0.1, 0.001], + "scaled": [false], + "bins": [4], + "num_tilings": [16], + "clip_stddev": [1000], + "count_interval": [10000], + "trace_type": ["replacing"] + } 
+} diff --git a/config/agent/SAC.json b/config/agent/SAC.json new file mode 100644 index 0000000..dd86546 --- /dev/null +++ b/config/agent/SAC.json @@ -0,0 +1,25 @@ +{ + "agent_name": "SAC", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "num_hidden": [3], + "reparameterized": [true, false], + "soft_q": [true, false], + "double_q": [true, false], + "alpha": [0.001, 0.01, 0.1, 1.0], + "betas": [[0.9, 0.999]], + "policy_type": ["SquashedGaussian"], + "target_update_interval": [1], + "critic_lr": [1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0], + "alpha_lr": [0.0], + "hidden_dim": [64], + "automatic_entropy_tuning": [false], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} diff --git a/config/agent/SACDiscrete.json b/config/agent/SACDiscrete.json new file mode 100644 index 0000000..d74c903 --- /dev/null +++ b/config/agent/SACDiscrete.json @@ -0,0 +1,23 @@ +{ + "agent_name": "SACDiscrete", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "num_hidden": [3], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 10.0], + "alpha_lr": [0.0], + "hidden_dim": [64], + "automatic_entropy_tuning": [false], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git a/config/agent/SACDiscreteCNN.json b/config/agent/SACDiscreteCNN.json new file mode 100644 index 0000000..7a89761 --- /dev/null +++ b/config/agent/SACDiscreteCNN.json @@ -0,0 +1,24 @@ +{ + "agent_name": "SACDiscreteCNN", + "parameters": + { + "replay_capacity": [1000000], + "batch_size": [32], + "tau": [0.01], + "alpha": [1.0], + "betas": [[0.9, 0.999]], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "critic_lr": [0.1], + "actor_lr_scale": [10.0], + "alpha_lr": [0.0], + "hidden_dim": [[128]], + "channels": [[16]], + "kernel_sizes": [[3]], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false], + "activation": ["relu"] + } +} + diff --git a/config/agent/VAC.json b/config/agent/VAC.json new file mode 100644 index 0000000..ea5e9b7 --- /dev/null +++ b/config/agent/VAC.json @@ -0,0 +1,21 @@ +{ + "agent_name": "VAC", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "num_samples": [30], + "policy_type": ["Gaussian"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0], + "hidden_dim": [64], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git a/config/agent/VACDiscrete.json b/config/agent/VACDiscrete.json new file mode 100644 index 0000000..cc3073d --- /dev/null +++ b/config/agent/VACDiscrete.json @@ -0,0 +1,21 @@ +{ + "agent_name": "VACDiscrete", + "parameters": + { + "replay_capacity": [100000], + "batch_size": [32], + "tau": [0.01], + "alpha": [0.001, 0.01, 0.1, 1.0, 10.0], + "betas": [[0.9, 0.999], [0.0, 0.999]], + "num_samples": [30], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "critic_lr": [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale": [0.01, 0.1, 1.0, 2.0, 10.0], + "hidden_dim": [64], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false] + } +} + diff --git 
a/config/agent/VACDiscreteCNN.json b/config/agent/VACDiscreteCNN.json new file mode 100644 index 0000000..dd52ed1 --- /dev/null +++ b/config/agent/VACDiscreteCNN.json @@ -0,0 +1,24 @@ +{ + "agent_name": "VACDiscreteCNN", + "parameters": + { + "replay_capacity": [1000000], + "batch_size": [32], + "tau": [0.01], + "alpha" : [1.0, 1e-1, 1e-2, 1e-3], + "critic_lr" : [1e-1, 1e-2, 1e-3, 1e-4, 1e-5], + "actor_lr_scale" : [10.0, 1.0, 1e-1, 1e-2, 1e-3], + "betas": [[0.9, 0.999]], + "policy_type": ["Softmax"], + "target_update_interval": [1], + "alpha_lr": [0.0], + "hidden_dim": [[128]], + "channels": [[16]], + "kernel_sizes": [[3]], + "weight_init": ["xavier_uniform"], + "clip_stddev": [1000], + "cuda": [false], + "activation": ["relu"] + } +} + diff --git a/config/environment/AcrobotContinuous-v1.json b/config/environment/AcrobotContinuous-v1.json new file mode 100644 index 0000000..5940ecd --- /dev/null +++ b/config/environment/AcrobotContinuous-v1.json @@ -0,0 +1,9 @@ +{ + "env_name": "Acrobot-v1", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 10000000, + "eval_episodes": 0, + "gamma": 0.99, + "continuous": true +} diff --git a/config/environment/AcrobotDiscrete-v1.json b/config/environment/AcrobotDiscrete-v1.json new file mode 100644 index 0000000..ed1fde8 --- /dev/null +++ b/config/environment/AcrobotDiscrete-v1.json @@ -0,0 +1,9 @@ +{ + "env_name": "Acrobot-v1", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99, + "continuous": false +} diff --git a/config/environment/Asterix.json b/config/environment/Asterix.json new file mode 100644 index 0000000..a9f7862 --- /dev/null +++ b/config/environment/Asterix.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarAsterix", + "total_timesteps": 1500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 100000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/BipedalWalker.json b/config/environment/BipedalWalker.json new file mode 100644 index 0000000..aa2b32b --- /dev/null +++ b/config/environment/BipedalWalker.json @@ -0,0 +1,9 @@ +{ + "env_name": "BipedalWalker-v3", + "total_timesteps": 2500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 10000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false +} diff --git a/config/environment/Breakout.json b/config/environment/Breakout.json new file mode 100644 index 0000000..1262b09 --- /dev/null +++ b/config/environment/Breakout.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarBreakout", + "total_timesteps": 1500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 100000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/Freeway.json b/config/environment/Freeway.json new file mode 100644 index 0000000..e033b27 --- /dev/null +++ b/config/environment/Freeway.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarFreeway", + "total_timesteps": 5000000, + "steps_per_episode": 2500, + "eval_interval_timesteps": 100000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/PendulumContinuous-v0.json 
b/config/environment/PendulumContinuous-v0.json new file mode 100644 index 0000000..ee10244 --- /dev/null +++ b/config/environment/PendulumContinuous-v0.json @@ -0,0 +1,8 @@ +{ + "env_name": "Pendulum-v0", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99 +} diff --git a/config/environment/PendulumDiscrete-v0.json b/config/environment/PendulumDiscrete-v0.json new file mode 100644 index 0000000..ee10244 --- /dev/null +++ b/config/environment/PendulumDiscrete-v0.json @@ -0,0 +1,8 @@ +{ + "env_name": "Pendulum-v0", + "total_timesteps": 100000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99 +} diff --git a/config/environment/Seaquest.json b/config/environment/Seaquest.json new file mode 100644 index 0000000..9d016b4 --- /dev/null +++ b/config/environment/Seaquest.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarSeaquest", + "total_timesteps": 2500000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/config/environment/SpaceInvaders.json b/config/environment/SpaceInvaders.json new file mode 100644 index 0000000..e91ee56 --- /dev/null +++ b/config/environment/SpaceInvaders.json @@ -0,0 +1,13 @@ +{ + "env_name": "MinAtarSpace_Invaders", + "total_timesteps": 1000000, + "steps_per_episode": 1000, + "eval_interval_timesteps": 1000000000, + "eval_episodes": 0, + "gamma": 0.99, + "accumulate_trace": false, + "overwrite_rewards": false, + "continuous": false, + "rewards": {}, + "start_state": [] +} diff --git a/environment.py b/environment.py new file mode 100644 index 0000000..41f4ad6 --- /dev/null +++ b/environment.py @@ -0,0 +1,207 @@ +#!/usr/bin/env python3 + +# Import modules +import gym +from copy import deepcopy +from env.PendulumEnv import PendulumEnv +from env.Acrobot import AcrobotEnv +from env.Gridworld import GridworldEnv +import env.MinAtar as MinAtar +import numpy as np + + +class Environment: + """ + Environment is a wrapper around concrete implementations of environments + which logs data. + """ + def __init__(self, config, seed, monitor=False, monitor_after=0): + """ + Constructor + + Parameters + ---------- + config : dict + The environment configuration file + seed : int + The seed to use for all random number generators + monitor : bool + Whether or not to render the scenes as the agent learns, by + default False + monitor_after : int + If monitor is True, how many timesteps should pass before + the scene is rendered, by default 0. + """ + + self.steps = 0 + self.episodes = 0 + + # Whether to render the environment, and when to. Useful for debugging. 
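+        # Rendering begins only once `monitor_after` environment steps have
+        # elapsed: step() counts this value down and starts calling render()
+        # after it drops below zero.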
+ self.monitor = monitor + self.steps_until_monitor = monitor_after + + # Set up the wrapped environment + self.env_name = config["env_name"] + self.env = _env_factory(config) + self.env.seed(seed=seed) + self.steps_per_episode = config["steps_per_episode"] + + # Log environment info + if "info" in dir(self.env): + self.info = self.env.info + else: + self.info = {} + + @property + def action_space(self): + """ + Gets the action space of the Gym environment + + Returns + ------- + gym.spaces.Space + The action space + """ + return self.env.action_space + + @property + def observation_space(self): + """ + Gets the observation space of the Gym environment + + Returns + ------- + gym.spaces.Space + The observation space + """ + return self.env.observation_space + + def seed(self, seed): + """ + Seeds the environment with a random seed + + Parameters + ---------- + seed : int + The random seed to seed the environment with + """ + self.env.seed(seed) + + def reset(self): + """ + Resets the environment by resetting the step counter to 0 and resetting + the wrapped environment. This function also increments the total + episode count. + + Returns + ------- + 2-tuple of array_like, dict + The new starting state and an info dictionary + """ + self.steps = 0 + self.episodes += 1 + + state = self.env.reset() + + return state, {"orig_state": state} + + def render(self): + """ + Renders the current frame + """ + self.env.render() + + def step(self, action): + """ + Takes a single environmental step + + Parameters + ---------- + action : array_like of float + The action array. The number of elements in this array should be + the same as the action dimension. + + Returns + ------- + float, array_like of float, bool, dict + The reward and next state as well as a flag specifying if the + current episode has been completed and an info dictionary + """ + if self.monitor and self.steps_until_monitor < 0: + self.render() + elif self.monitor: + self.steps_until_monitor -= ( + 1 if self.steps_until_monitor >= 0 else 0 + ) + + self.steps += 1 + + # Get the next state, reward, and done flag + state, reward, done, info = self.env.step(action) + info["orig_state"] = state + + # If the episode completes, return the goal reward + if done: + info["steps_exceeded"] = False + return state, reward, done, info + + # If the maximum time-step was reached + if self.steps >= self.steps_per_episode > 0: + done = True + info["steps_exceeded"] = True + + return state, reward, done, info + + +def _env_factory(config): + """ + Instantiates and returns an environment given an environment configuration + file. + + Parameters + ---------- + config : dict + The environment config + + Returns + ------- + gym.Env + The environment to train on + """ + name = config["env_name"] + seed = config["seed"] + env = None + + if name == "Pendulum-v0": + env = PendulumEnv(seed=seed, continuous_action=config["continuous"]) + + elif name == "Gridworld": + env = GridworldEnv(config["rows"], config["cols"]) + env.seed(seed) + + elif name == "Acrobot-v1": + env = AcrobotEnv(seed=seed, continuous_action=config["continuous"]) + + # If using MinAtar environments, we need a wrapper to permute the batch + # dimensions to be consistent with PyTorch. 
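+    # For example, the included Breakout config sets "env_name" to
+    # "MinAtarBreakout", which is mapped below to the MinAtar environment
+    # name "breakout".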
+ elif "minatar" in name.lower(): + if "/" in name: + raise ValueError(f"specify environment as MinAtar{name} rather " + + "than MinAtar/{name}") + + minimal_actions = config.get("use_minimal_action_set", True) + stripped_name = name[7:].lower() # Strip off "MinAtar" + + env = MinAtar.BatchFirst( + MinAtar.GymEnv( + stripped_name, + use_minimal_action_set=minimal_actions, + ) + ) + + # Otherwise use a gym environment + else: + env = gym.make(name).env + env.seed(seed) + + return env diff --git a/experiment.py b/experiment.py new file mode 100644 index 0000000..2ff5645 --- /dev/null +++ b/experiment.py @@ -0,0 +1,330 @@ +#!/usr/bin/env python3 + +# Import modules +import time +from datetime import datetime +from copy import deepcopy +import numpy as np + + +class Experiment: + """ + Class Experiment will run a single experiment while logging data. An + experiment consists of a single run of agent-environment interaction. + """ + def __init__(self, agent, env, eval_env, eval_episodes, + total_timesteps, eval_interval_timesteps, max_episodes=-1): + """ + Constructor + + Parameters + ---------- + agent : baseAgent.BaseAgent + The agent to run the experiment on + env : environment.Environment + The environment to use for the experiment + eval_episodes : int + The number of evaluation episodes to run when measuring offline + performance + total_timesteps : int + The maximum number of allowable timesteps per experiment + eval_interval_timesteps: int + The interval of timesteps at which an agent's performance will be + evaluated + state_bins : tuple of int + For the sequence of states used in each update, the number of bins + per dimension with which to bin the states. + min_state_values : array_like + The minimum value of states along each dimension, used to encode + states used in updates to count the number of times states are + used in each update. + max_state_values : array_like + The maximum value of states along each dimension, used to encode + states used in updates to count the number of times states are + used in each update. + action_bins : tuple of int + For the sequence of actions used in each update, the number of bins + per dimension with which to bin the actions. + min_action_values : array_like + The minimum value of actions along each dimension, used to encode + actions used in updates to count the number of times actions are + used in each update. + max_state_values : array_like + The maximum value of actions along each dimension, used to encode + actions used in updates to count the number of times actions are + used in each update. + count_interval : int + The interval of timesteps at which we will store the counts of + state or action bins seen during training or used in updates. At + each timestep, we determine which state/action bins were used in + an update or seen at the current timestep. These values are + accumulated so that the total number of times each bin was + seen/used is stored up to the current timestep. This parameter + controls the timestep interval at which these accumulated values + should be checkpointed. + max_episodes : int + The maximum number of episodes to run. If <= 0, then there is no + episode limit. 
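+
+        Example
+        -------
+        A sketch of how main.py wires an experiment together (`agent`, `env`
+        and `eval_env` are assumed to be already-constructed objects, and the
+        numbers are illustrative):
+
+            exp = Experiment(agent, env, eval_env, eval_episodes=10,
+                             total_timesteps=100000,
+                             eval_interval_timesteps=10000)
+            exp.run()
+            returns = exp.info["train_episode_rewards"]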
+ """ + self.agent = agent + self.env = env + self.eval_env = eval_env + self.eval_env.monitor = False + + self.eval_episodes = eval_episodes + self.max_episodes = max_episodes + + # Track the number of time steps + self.timesteps_since_last_eval = 0 + self.eval_interval_timesteps = eval_interval_timesteps + self.timesteps_elapsed = 0 + self.total_timesteps = total_timesteps + + # Keep track of number of training episodes + self.train_episodes = 0 + + # Track the returns seen at each training episode + self.train_ep_return = [] + + # Track the steps per each training episode + self.train_ep_steps = [] + + # Track the steps at which evaluation occurs + self.timesteps_at_eval = [] + + # Track the returns seen at each eval episode + self.eval_ep_return = [] + + # Track the number of evaluation steps taken in each evaluation episode + self.eval_ep_steps = [] + + # Anything the experiment tracks + self.info = {} + + # Track the total training and evaluation time + self.train_time = 0.0 + self.eval_time = 0.0 + + def run(self): + """ + Runs the experiment + + Returns + ------- + 14-tuple of list of float, float, int + The online training episodic return, the return per + episode when evaluating offline, the training steps per + episode, the evaluation steps per episode when evaluating + offline, the list of timesteps at which the evaluation episodes + were run, the total amount of training time, the total amount + of evaluation time, and the number of total training episodes, + and the sequence of state, rewards, and actions during training. + Also returns the states, actions, and next states used in each + update to the agent. + """ + # Count total run time + start_run = time.time() + print(f"Starting experiment at: {datetime.now()}") + + # Evaluate once at the beginning + self.eval_time += self.eval() + self.timesteps_at_eval.append(self.timesteps_elapsed) + + # Train + i = 0 + while self.timesteps_elapsed < self.total_timesteps and \ + (self.train_episodes < self.max_episodes if + self.max_episodes > 0 else True): + + # Run the training episode and save the relevant info + ep_reward, ep_steps, train_time = self.run_episode_train() + self.train_ep_return.append(ep_reward) + self.train_ep_steps.append(ep_steps) + self.train_time += train_time + print(f"=== Train ep: {i}, r: {ep_reward}, n_steps: {ep_steps}, " + + f"elapsed: {train_time}") + i += 1 + + # Evaluate once at the end + self.eval_time += self.eval() + self.timesteps_at_eval.append(self.timesteps_elapsed) + + end_run = time.time() + print(f"End run at time {datetime.now()}") + print(f"Total time taken: {end_run - start_run}") + print(f"Training time: {self.train_time}") + print(f"Evaluation time: {self.eval_time}") + + self.info["eval_episode_rewards"] = np.array(self.eval_ep_return) + self.info["eval_episode_steps"] = np.array(self.eval_ep_steps) + self.info["timesteps_at_eval"] = np.array(self.timesteps_at_eval) + self.info["train_episode_steps"] = np.array(self.train_ep_steps) + self.info["train_episode_rewards"] = np.array(self.train_ep_return) + self.info["train_time"] = self.train_time + self.info["eval_time"] = self.eval_time + self.info["total_train_episodes"] = self.train_episodes + + def run_episode_train(self): + """ + Runs a single training episode, saving the evaluation metrics in + the corresponding instance variables. 
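+        The episode ends when the wrapped environment signals `done` (which
+        includes reaching the environment's `steps_per_episode` limit) or
+        when the experiment's total timestep budget is exhausted part-way
+        through the episode.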
+ + Returns + ------- + float, int, float + The return for the episode, the number of steps in the episode, + and the total amount of training time for the episode + """ + # Reset the agent + self.agent.reset() + + self.train_episodes += 1 + + # Track the sequences of states, rewards, and actions during training + # episode_states = [] + episode_rewards = [] + # episode_actions = [] + + start = time.time() + episode_return = 0.0 + episode_steps = 0 + + state, _ = self.env.reset() + + done = False + action = self.agent.sample_action(state) + + while not done: + # Evaluate offline at the appropriate intervals + if self.timesteps_since_last_eval >= \ + self.eval_interval_timesteps: + self.eval_time += self.eval() + self.timesteps_at_eval.append(self.timesteps_elapsed) + + # Sample the next transition + next_state, reward, done, info = self.env.step(action) + episode_steps += 1 + + # episode_states.append(next_state_info["orig_state"]) + episode_rewards.append(reward) + episode_return += reward + + # Compute the done mask, which is 1 if the episode terminated + # without the goal being reached or the episode is incomplete, + # and 0 if the agent reached the goal or terminal state + if self.env.steps_per_episode <= 1: + done_mask = 0 + else: + if episode_steps <= self.env.steps_per_episode and done and \ + not info["steps_exceeded"]: + done_mask = 0 + else: + done_mask = 1 + + # Update agent + self.agent.update(state, action, reward, next_state, done_mask) + + # Continue the episode if not done + if not done: + action = self.agent.sample_action(next_state) + state = next_state + + # Keep track of the timesteps since we last evaluated so we know + # when to evaluate again + self.timesteps_since_last_eval += 1 + + # Keep track of timesteps since we train for a specified number of + # timesteps + self.timesteps_elapsed += 1 + + # Stop if we are at the max allowable timesteps + if self.timesteps_elapsed >= self.total_timesteps: + break + + end = time.time() + + return episode_return, episode_steps, (end-start) + + def eval(self): + """ + Evaluates the agent's performance offline, for the appropriate number + of offline episodes as determined by the self.eval_episodes + instance variable. While evaluating, this function will populate the + appropriate instance variables with the evaluation data. 
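+        Note that if `self.eval_episodes` is 0 (as in several of the included
+        environment configuration files), the evaluation loop body never
+        runs, and this method just resets the evaluation counter, records
+        empty evaluation lists, and toggles the agent between evaluation and
+        training modes.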
+ + Returns + ------- + float + The total amount of evaluation time + """ + self.timesteps_since_last_eval = 0 + + # Set the agent to evaluation mode + self.agent.eval() + + # Save the episodic return and the number of steps per episode + temp_rewards_per_episode = [] + episode_steps = [] + eval_session_time = 0.0 + + # Evaluate offline + for i in range(self.eval_episodes): + eval_start_time = time.time() + episode_reward, num_steps = self.run_episode_eval() + eval_end_time = time.time() + + # Save the evaluation data + temp_rewards_per_episode.append(episode_reward) + episode_steps.append(num_steps) + + # Calculate time + eval_elapsed_time = eval_end_time - eval_start_time + eval_session_time += eval_elapsed_time + + # Display the offline episodic return + print("=== EVAL ep: " + str(i) + ", r: " + + str(episode_reward) + ", n_steps: " + str(num_steps) + + ", elapsed: " + + time.strftime("%H:%M:%S", time.gmtime(eval_elapsed_time))) + + # Save evaluation data + self.eval_ep_return.append(temp_rewards_per_episode) + self.eval_ep_steps.append(episode_steps) + + self.eval_time += eval_session_time + + # Return the agent to training mode + self.agent.train() + + return eval_session_time + + def run_episode_eval(self): + """ + Runs a single evaluation episode. + + Returns + ------- + float, int, list + The episodic return and number of steps and the sequence of states, + rewards, and actions during the episode + """ + state, _ = self.eval_env.reset() + + episode_return = 0.0 + episode_steps = 0 + done = False + + action = self.agent.sample_action(state) + + while not done: + next_state, reward, done, _ = self.eval_env.step(action) + + episode_return += reward + + if not done: + action = self.agent.sample_action(next_state) + + state = next_state + episode_steps += 1 + + return episode_return, episode_steps diff --git a/main.py b/main.py new file mode 100644 index 0000000..ba9f908 --- /dev/null +++ b/main.py @@ -0,0 +1,237 @@ +#!usr/bin/env python3 + +# Import modules +import numpy as np +import environment +import experiment +import pickle +from utils import experiment_utils as exp_utils +import click +import json +from copy import deepcopy +import os +import utils.hypers as hypers + + +@click.command(help="""Given agent and environment configuration files, run + the experiment defined by the configuration files + """) +@click.option("--env-json", help="Path to the environment json " + + "configuration file", + type=str, required=True) +@click.option("--agent-json", help="Path to the agent json configuration file", + type=str, required=True) +@click.option("--index", type=int, required=False, help="The index " + + "of the hyperparameter to run", default=1) +@click.option("--monitor", "-m", is_flag=True, help="Whether or not to " + + "render the scene as the agent trains.", type=bool) +@click.option("--after", "-a", type=int, default=-1, help="How many " + + "timesteps (training) should pass before " + + "rendering the scene") +@click.option("--save-dir", type=str, default="./results", help="Which " + + "directory to save the results file in", required=False) +def run(env_json, agent_json, index, monitor, after, save_dir): + """ + Perform runs over hyperparameter settings. + + Performs the runs on the hyperparameter settings indices specified by + range(start, stop step), with values over the total number of + hyperparameters wrapping around to perform successive runs on the same + hyperparameter settings. 
For example, if there are 10 hyperparameter + settings and we run with hyperparameter settings 12, then this is the + (12 // 10) = 1 run of hyperparameter settings 12 % 10 = 2, where runs + are 0-based indexed. + + Parameters + ---------- + env_json : str + The path to the JSON environment configuration file + agent_json : str + The path to the JSON agent configuration file + start : int + The hyperparameter index to start the sweep at + stop : int + The hyperparameter index to stop the sweep at + step : int + The stepping value between hyperparameter settings indices + monitor : bool + Whether or not to render the scene as the agent trains + after : int + How many training + evaluation timesteps should pass before rendering + the scene + save_dir : str + The directory to save the data in + """ + # Read the config files + with open(env_json) as in_json: + env_config = json.load(in_json) + with open(agent_json) as in_json: + agent_config = json.load(in_json) + + main(agent_config, env_config, index, monitor, after, save_dir) + + +def main(agent_config, env_config, index, monitor, after, + save_dir="./results"): + """ + Runs experiments on the agent and environment corresponding the the input + JSON files using the hyperparameter settings corresponding to the indices + returned from range(start, stop, step). + + Saves a pickled python dictionary of all training and evaluation data. + + Note: this function will run the experiments sequentially. + + Parameters + ---------- + agent_json : dict + The agent JSON configuration file, as a Python dict + env_json : dict + The environment JSON configuration file, as a Python dict + index : int + The index of the hyperparameter setting to run + monitor : bool + Whether to render the scene as the agent trains or not + after : int + How many training + evaluation timesteps should pass before rendering + the scene + save_dir : str + The directory to save all data in + """ + # Create the data dictionary + data = {} + data["experiment"] = {} + + # Experiment meta-data + data["experiment"]["environment"] = env_config + data["experiment"]["agent"] = agent_config + + # Experiment runs per each hyperparameter + data["experiment_data"] = {} + + # Calculate the number of timesteps before rendering. 
It is inputted as + # number of training steps, but the environment uses training + eval steps + if after >= 0: + eval_steps = env_config["eval_episodes"] * \ + env_config["steps_per_episode"] + eval_intervals = 1 + (after // env_config["eval_interval_timesteps"]) + after = after + eval_steps * eval_intervals + print(f"Evaluation intervals before monitor: {eval_intervals}") + + # Get the directory to save in + if not save_dir.startswith("./results"): + save_dir = os.path.join("./results", save_dir) + save_dir = os.path.join(save_dir, env_config["env_name"] + "_" + + agent_config["agent_name"] + "results/") + # Run the experiments + # Get agent params from config file for the next experiment + agent_run_params, total_sweeps = hypers.sweeps( + agent_config["parameters"], index) + agent_run_params["gamma"] = env_config["gamma"] + + print(f"Total number of hyperparam combinations: {total_sweeps}") + + # Calculate the run number and the random seed + RUN_NUM = index // total_sweeps + RANDOM_SEED = np.iinfo(np.int16).max - RUN_NUM + + # Create the environment + env_config["seed"] = RANDOM_SEED + if agent_config["agent_name"] == "linearAC" or \ + agent_config["agent_name"] == "linearAC_softmax": + if "use_tile_coding" in env_config: + use_tile_coding = env_config["use_tile_coding"] + env_config["use_full_tile_coding"] = use_tile_coding + del env_config["use_tile_coding"] + + env = environment.Environment(env_config, RANDOM_SEED, monitor, after) + eval_env = environment.Environment(env_config, RANDOM_SEED) + + num_features = env.observation_space.shape[0] + agent_run_params["feature_size"] = num_features + + # Set up the data dictionary to store the data from each run + hp_sweep = index % total_sweeps + if hp_sweep not in data["experiment_data"].keys(): + data["experiment_data"][hp_sweep] = {} + data["experiment_data"][hp_sweep]["agent_hyperparams"] = \ + dict(agent_run_params) + data["experiment_data"][hp_sweep]["runs"] = [] + + SETTING_NUM = index % total_sweeps + TOTAL_TIMESTEPS = env_config["total_timesteps"] + MAX_EPISODES = env_config.get("max_episodes", -1) + EVAL_INTERVAL = env_config["eval_interval_timesteps"] + EVAL_EPISODES = env_config["eval_episodes"] + + # Store the seed in the agent run parameters so that batch algorithms + # can sample randomly + agent_run_params["seed"] = RANDOM_SEED + + # Include the environment observation and action spaces in the agent's + # configuration so that neural networks can have the corrent number of + # output nodes + agent_run_params["observation_space"] = env.observation_space + agent_run_params["action_space"] = env.action_space + + # Saving this data is redundant since we save the env_config file as + # well. 
Also, each run has the run number as the random seed + run_data = {} + run_data["run_number"] = RUN_NUM + run_data["random_seed"] = RANDOM_SEED + run_data["total_timesteps"] = TOTAL_TIMESTEPS + run_data["eval_interval_timesteps"] = EVAL_INTERVAL + run_data["episodes_per_eval"] = EVAL_EPISODES + + # Print some data about the run + print(f"SETTING_NUM: {SETTING_NUM}") + print(f"RUN_NUM: {RUN_NUM}") + print(f"RANDOM_SEED: {RANDOM_SEED}") + print('Agent setting: ', agent_run_params) + + # Create the agent + print(agent_config["agent_name"]) + agent_run_params["env"] = env + agent = exp_utils.create_agent(agent_config["agent_name"], + agent_run_params) + + # Initialize and run experiment + exp = experiment.Experiment( + agent, + env, + eval_env, + EVAL_EPISODES, + TOTAL_TIMESTEPS, + EVAL_INTERVAL, + MAX_EPISODES, + ) + exp.run() + + # Save the agent's learned parameters, with these parameters and the + # hyperparams, training can be exactly resumed from the end of the run + run_data["learned_params"] = agent.get_parameters() + + # Save any information the agent saved during training + run_data = {**run_data, **agent.info, **exp.info, **env.info} + + # Save data in parent dictionary + data["experiment_data"][hp_sweep]["runs"].append(run_data) + + # After each run, save the data. Since data is accumulated, the + # later runs will overwrite earlier runs with updated data. + if not os.path.exists(save_dir): + os.makedirs(save_dir) + + save_file = save_dir + env_config["env_name"] + "_" + \ + agent_config["agent_name"] + f"_data_{index}.pkl" + + print("=== Saving ===") + print(save_file) + print("==============") + with open(save_file, "wb") as out_file: + pickle.dump(data, out_file) + + +if __name__ == "__main__": + # run_concurrent() + run() diff --git a/minatar/setup.py b/minatar/setup.py new file mode 100644 index 0000000..0c54513 --- /dev/null +++ b/minatar/setup.py @@ -0,0 +1,38 @@ +from setuptools import setup + +packages = ['minatar', 'minatar.environments'] +install_requires = [ + 'cycler>=0.10.0', + 'kiwisolver>=1.0.1', + 'matplotlib>=3.0.3', + 'numpy>=1.16.2', + 'pandas>=0.24.2', + 'pyparsing>=2.3.1', + 'python-dateutil>=2.8.0', + 'pytz>=2018.9', + 'scipy>=1.2.1', + 'seaborn>=0.9.0', + 'six>=1.12.0', +] + +examples_requires = [ + 'torch>=1.0.0', +] + +entry_points = { + 'gym.envs': ['MinAtar=minatar.gym:register_envs'] +} + +setup( + name='MinAtar-Faster', + version='1.1.0', + description='A faster miniaturized version of the arcade learning environment.', + url='https://github.com/kenjyoung/MinAtar', + author='Robert Joseph George', + author_email='rjoseph1@ualberta.com', + license='GPL', + packages=packages, + entry_points=entry_points, + install_requires=install_requires, + extras_require={'examples': examples_requires}, +) diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..355c11f --- /dev/null +++ b/requirements.txt @@ -0,0 +1,85 @@ +appnope==0.1.3 +asttokens==2.0.5 +autopep8==1.6.0 +backcall==0.2.0 +bootstrapped==0.0.2 +cffi==1.15.0 +click==8.1.2 +cloudpickle==2.0.0 +colorama==0.4.4 +commonmark==0.9.1 +cycler==0.11.0 +Cython==0.29.28 +debugpy==1.6.0 +decorator==5.1.1 +entrypoints==0.4 +executing==0.8.3 +fasteners==0.17.3 +filelock==3.6.0 +flake8==4.0.1 +fonttools==4.32.0 +glfw==2.5.3 +gym==0.23.1 +gym-notices==0.0.6 +h5py==3.6.0 +imageio==2.18.0 +importlib-metadata==4.11.3 +industrial-benchmark-python==2.0 +ipykernel==6.13.0 +ipython==8.2.0 +itermplot==0.331 +jedi==0.18.1 +jupyter-client==7.2.2 +jupyter-core==4.10.0 +kernel-driver==0.0.7 
+kiwisolver==1.4.2 +llvmlite==0.38.0 +matplotlib==3.5.1 +matplotlib-inline==0.1.3 +mccabe==0.6.1 +mujoco-py==2.1.2.14 +nest-asyncio==1.5.5 +numba==0.55.1 +numpy==1.21.6 +packaging==21.3 +pandas==1.4.2 +parso==0.8.3 +pexpect==4.8.0 +pickleshare==0.7.5 +Pillow==9.1.0 +prompt-toolkit==3.0.29 +psutil==5.9.0 +ptyprocess==0.7.0 +pure-eval==0.2.2 +pycodestyle==2.8.0 +pycparser==2.21 +git+ssh://git@github.com/andnp/PyExpUtils@2.18#egg=PyExpUtils +git+ssh://git@github.com/andnp/PyExpPlotting@0.7#egg=PyExpPlotting +git+ssh://git@github.com/andnp/PyFixedReps@0.5#egg=PyFixedReps +git+ssh://git@github.com/andnp/PyRlEnvs@0.19#egg=PyRlEnvs +pyflakes==2.4.0 +pygame==2.1.2 +pyglet==1.5.23 +Pygments==2.11.2 +pyparsing==3.0.8 +python-dateutil==2.8.2 +pytz==2022.1 +pyzmq==22.3.0 +rich==12.2.0 +rlglue==2.2 +scipy==1.8.0 +seaborn==0.11.2 +setuptools-scm==6.4.2 +six==1.16.0 +stack-data==0.2.0 +toml==0.10.2 +tomli==2.0.1 +torch==1.11.0 +tornado==6.1 +tqdm==4.64.0 +traitlets==5.1.1 +typer==0.4.1 +typing_extensions==4.2.0 +wcwidth==0.2.5 +wrapt==1.14.0 +zipp==3.8.0 diff --git a/utils/TruncatedNormal.py b/utils/TruncatedNormal.py new file mode 100644 index 0000000..702b87e --- /dev/null +++ b/utils/TruncatedNormal.py @@ -0,0 +1,148 @@ +# Taken from: +# # https://github.com/toshas/torch_truncnorm/blob/main/TruncatedNormal.py + +import math +from numbers import Number + +import torch +from torch.distributions import Distribution, constraints +from torch.distributions.utils import broadcast_all + +CONST_SQRT_2 = math.sqrt(2) +CONST_INV_SQRT_2PI = 1 / math.sqrt(2 * math.pi) +CONST_INV_SQRT_2 = 1 / math.sqrt(2) +CONST_LOG_INV_SQRT_2PI = math.log(CONST_INV_SQRT_2PI) +CONST_LOG_SQRT_2PI_E = 0.5 * math.log(2 * math.pi * math.e) + + +class TruncatedStandardNormal(Distribution): + """ + Truncated Standard Normal distribution + https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf + """ + + arg_constraints = { + 'a': constraints.real, + 'b': constraints.real, + } + has_rsample = True + + def __init__(self, a, b, validate_args=None): + self.a, self.b = broadcast_all(a, b) + if isinstance(a, Number) and isinstance(b, Number): + batch_shape = torch.Size() + else: + batch_shape = self.a.size() + super(TruncatedStandardNormal, self).__init__( + batch_shape, validate_args=validate_args) + if self.a.dtype != self.b.dtype: + raise ValueError('Truncation bounds types are different') + if any((self.a >= self.b).view(-1,).tolist()): + raise ValueError('Incorrect truncation range') + eps = torch.finfo(self.a.dtype).eps + self._dtype_min_gt_0 = eps + self._dtype_max_lt_1 = 1 - eps + self._little_phi_a = self._little_phi(self.a) + self._little_phi_b = self._little_phi(self.b) + self._big_phi_a = self._big_phi(self.a) + self._big_phi_b = self._big_phi(self.b) + self._Z = (self._big_phi_b - self._big_phi_a).clamp_min(eps) + self._log_Z = self._Z.log() + little_phi_coeff_a = torch.nan_to_num(self.a, nan=math.nan) + little_phi_coeff_b = torch.nan_to_num(self.b, nan=math.nan) + self._lpbb_m_lpaa_d_Z = (self._little_phi_b * little_phi_coeff_b - + self._little_phi_a * + little_phi_coeff_a) / self._Z + self._mean = -(self._little_phi_b - self._little_phi_a) / self._Z + self._variance = 1 - self._lpbb_m_lpaa_d_Z - ((self._little_phi_b - + self._little_phi_a) / + self._Z) ** 2 + self._entropy = CONST_LOG_SQRT_2PI_E + self._log_Z - 0.5 * \ + self._lpbb_m_lpaa_d_Z + + @constraints.dependent_property + def support(self): + return constraints.interval(self.a, self.b) + + @property + def mean(self): + return self._mean + + @property + 
def variance(self): + return self._variance + + @property + def entropy(self): + return self._entropy + + @property + def auc(self): + return self._Z + + @staticmethod + def _little_phi(x): + return (-(x ** 2) * 0.5).exp() * CONST_INV_SQRT_2PI + + @staticmethod + def _big_phi(x): + return 0.5 * (1 + (x * CONST_INV_SQRT_2).erf()) + + @staticmethod + def _inv_big_phi(x): + return CONST_SQRT_2 * (2 * x - 1).erfinv() + + def cdf(self, value): + if self._validate_args: + self._validate_sample(value) + return ((self._big_phi(value) - self._big_phi_a) / self._Z).clamp(0, 1) + + def icdf(self, value): + return self._inv_big_phi(self._big_phi_a + value * self._Z) + + def log_prob(self, value): + if self._validate_args: + self._validate_sample(value) + return CONST_LOG_INV_SQRT_2PI - self._log_Z - (value ** 2) * 0.5 + + def rsample(self, sample_shape=torch.Size()): + shape = self._extended_shape(sample_shape) + p = torch.empty(shape, device=self.a.device).uniform_( + self._dtype_min_gt_0, self._dtype_max_lt_1) + return self.icdf(p) + + +class TruncatedNormal(TruncatedStandardNormal): + """ + Truncated Normal distribution + https://people.sc.fsu.edu/~jburkardt/presentations/truncated_normal.pdf + """ + + has_rsample = True + + def __init__(self, loc, scale, a, b, validate_args=None): + self.loc, self.scale, a, b = broadcast_all(loc, scale, a, b) + a = (a - self.loc) / self.scale + b = (b - self.loc) / self.scale + super(TruncatedNormal, self).__init__(a, b, + validate_args=validate_args) + self._log_scale = self.scale.log() + self._mean = self._mean * self.scale + self.loc + self._variance = self._variance * self.scale ** 2 + self._entropy += self._log_scale + + def _to_std_rv(self, value): + return (value - self.loc) / self.scale + + def _from_std_rv(self, value): + return value * self.scale + self.loc + + def cdf(self, value): + return super(TruncatedNormal, self).cdf(self._to_std_rv(value)) + + def icdf(self, value): + return self._from_std_rv(super(TruncatedNormal, self).icdf(value)) + + def log_prob(self, value): + return super(TruncatedNormal, self).log_prob(self._to_std_rv(value)) \ + - self._log_scale diff --git a/utils/experience_replay.py b/utils/experience_replay.py new file mode 100644 index 0000000..01bdd03 --- /dev/null +++ b/utils/experience_replay.py @@ -0,0 +1,270 @@ +# Import modules +import numpy as np +import torch +from abc import ABC, abstractmethod + + +# Class definitions +class ExperienceReplay(ABC): + """ + Abstract base class ExperienceReplay implements an experience replay + buffer. The specific kind of buffer is determined by classes which + implement this base class. For example, NumpyBuffer stores all + transitions in a numpy array while TorchBuffer implements the buffer + as a torch tensor. + + Attributes + ---------- + self.cast : func + A function which will cast data into an appropriate form to be + stored in the replay buffer. All incoming data is assumed to be + a numpy array. 
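+        By default `cast` is the identity function; TorchBuffer overrides it
+        with `torch.from_numpy` so that incoming transitions are stored as
+        tensors.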
+ """ + def __init__(self, capacity, seed, state_size, action_size, + device=None): + """ + Constructor + + Parameters + ---------- + capacity : int + The capacity of the buffer + seed : int + The random seed used for sampling from the buffer + state_size : tuple[int] + The number of dimensions of the state features + action_size : int + The number of dimensions in the action vector + """ + self.device = device + self.is_full = False + self.position = 0 + self.capacity = capacity + + # Set the casting function, which is needed for implementations which + # may keep the ER buffer as a different data structure, for example + # a torch tensor, in this case all data needs to be cast to a torch + # tensor before storing + self.cast = lambda x: x + + # Set the random number generator + self.random = np.random.default_rng(seed=seed) + + # Save the size of states and actions + self.state_size = state_size + self.action_size = action_size + + # Buffer of state, action, reward, next_state, done + self.state_buffer = None + self.action_buffer = None + self.reward_buffer = None + self.next_state_buffer = None + self.done_buffer = None + self.init_buffer() + + @abstractmethod + def init_buffer(self): + """ + Initializes the buffers on which to store transitions. + + Note that different classes which implement this abstract base class + may use different data types as buffers. For example, NumpyBuffer + stores all transitions using a numpy array, while TorchBuffer + stores all transitions on a torch Tensor on a specific device in order + to speed up training by keeping transitions on the same device as + the device which holds the model. + + Post-Condition + -------------- + The replay buffer self.buffer has been initialized + """ + pass + + def push(self, state, action, reward, next_state, done): + """ + Pushes a trajectory onto the replay buffer + + Parameters + ---------- + state : array_like + The state observation + action : array_like + The action taken by the agent in the state + reward : float + The reward seen after taking the argument action in the argument + state + next_state : array_like + The next state transitioned to + done : bool + Whether or not the transition was a transition to a goal state + """ + reward = np.array([reward]) + done = np.array([done]) + + state = self.cast(state) + action = self.cast(action) + reward = self.cast(reward) + next_state = self.cast(next_state) + done = self.cast(done) + + self.state_buffer[self.position] = state + self.action_buffer[self.position] = action + self.reward_buffer[self.position] = reward + self.next_state_buffer[self.position] = next_state + self.done_buffer[self.position] = done + + if self.position >= self.capacity - 1: + self.is_full = True + self.position = (self.position + 1) % self.capacity + + def sample(self, batch_size): + """ + Samples a random batch from the buffer + + Parameters + ---------- + batch_size : int + The size of the batch to sample + + Returns + ------- + 5-tuple of torch.Tensor + The arrays of state, action, reward, next_state, and done from the + batch + """ + # Get the indices for the batch + if self.is_full: + indices = self.random.integers(low=0, high=len(self), + size=batch_size) + else: + indices = self.random.integers(low=0, high=self.position, + size=batch_size) + + # # Sample the batch + # batch = self.buffer[indices] + + # # Keep running indices and get state sample + # start = 0 + # end = self.state_size + # state = batch[:, start:end] + + # # Action sample + # start = end + # end += self.action_size + # 
action = batch[:, start:end] + + # # Reward sample + # start = end + # end += 1 + # reward = batch[:, start:end] + + # # Next state sample + # start = end + # end += self.state_size + # next_state = batch[:, start:end] + + # # Done mask sample + # start = end + # done = batch[:, start:] + + state = self.state_buffer[indices, :] + action = self.action_buffer[indices, :] + reward = self.reward_buffer[indices] + next_state = self.next_state_buffer[indices, :] + done = self.done_buffer[indices] + + return state, action, reward, next_state, done + + def __len__(self): + """ + Gets the number of elements in the buffer + + Returns + ------- + int + The number of elements currently in the buffer + """ + if not self.is_full: + return self.position + else: + return self.capacity + + +class NumpyBuffer(ExperienceReplay): + """ + Class NumpyBuffer implements an experience replay buffer. This + class stores all states, actions, and rewards as numpy arrays. + For an implementation that uses PyTorch tensors, see + TorchExperienceReplay + """ + def __init__(self, capacity, seed, state_size, action_size): + """ + Constructor + + Parameters + ---------- + capacity : int + The capacity of the buffer + seed : int + The random seed used for sampling from the buffer + state_size : tuple[int] + The dimensions of the state features + action_size : int + The number of dimensions in the action vector + """ + super().__init__(capacity, seed, state_size, action_size, None) + + def init_buffer(self): + self.state_buffer = np.zeros((self.capacity, *self.state_size)) + self.next_state_buffer = np.zeros((self.capacity, *self.state_size)) + self.action_buffer = np.zeros(self.capacity, self.action_size) + self.reward_buffer = np.zeros((self.capacity, 1)) + self.done_buffer = np.zeros((self.capacity, 1)) + + +class TorchBuffer(ExperienceReplay): + """ + Class TorchBuffer implements an experience replay buffer. The + difference between this class and the ExperienceReplay class is that this + class keeps all experiences as a torch Tensor on the appropriate device + so that if using PyTorch, we do not need to cast the batch to a + FloatTensor every time we sample and then place it on the appropriate + device, as this is very time consuming. This class is basically a + PyTorch efficient implementation of ExperienceReplay. 
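+
+    A minimal usage sketch (the shapes and device are illustrative; push()
+    expects NumPy arrays, which are cast with torch.from_numpy):
+
+        buffer = TorchBuffer(capacity=100000, seed=0, state_size=(4,),
+                             action_size=1, device=torch.device("cpu"))
+        state = np.zeros(4, dtype=np.float32)
+        action = np.zeros(1, dtype=np.float32)
+        buffer.push(state, action, 0.0, state, False)
+        s, a, r, ns, d = buffer.sample(batch_size=32)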
+ """ + def __init__(self, capacity, seed, state_size, action_size, device): + """ + Constructor + + Parameters + ---------- + capacity : int + The capacity of the buffer + seed : int + The random seed used for sampling from the buffer + device : torch.device + The device on which the buffer instances should be stored + state_size : int + The number of dimensions in the state feature vector + action_size : int + The number of dimensions in the action vector + """ + super().__init__(capacity, seed, state_size, action_size, device) + self.cast = torch.from_numpy + + def init_buffer(self): + self.state_buffer = torch.FloatTensor(self.capacity, *self.state_size) + self.state_buffer = self.state_buffer.to(self.device) + + self.next_state_buffer = torch.FloatTensor(self.capacity, + *self.state_size) + self.next_state_buffer = self.next_state_buffer.to(self.device) + + self.action_buffer = torch.FloatTensor(self.capacity, self.action_size) + self.action_buffer = self.action_buffer.to(self.device) + + self.reward_buffer = torch.FloatTensor(self.capacity, 1) + self.reward_buffer = self.reward_buffer.to(self.device) + + self.done_buffer = torch.FloatTensor(self.capacity, 1) + self.done_buffer = self.done_buffer.to(self.device) diff --git a/utils/experiment_utils.py b/utils/experiment_utils.py new file mode 100644 index 0000000..85561d6 --- /dev/null +++ b/utils/experiment_utils.py @@ -0,0 +1,1415 @@ +# Import modules +import os +import numpy as np +from glob import glob +# from env.tile_coder import TileCoding +import pickle +from tqdm import tqdm +from copy import deepcopy +import bootstrapped.bootstrap as bs +import bootstrapped.stats_functions as bs_stats +from scipy import signal as signal +try: + import runs +except ModuleNotFoundError: + import utils.runs + + +def create_agent(agent, config): + """ + Creates an agent given the agent name and configuration dictionary + + Parameters + ---------- + agent : str + The name of the agent + config : dict + The agent configuration dictionary + + Returns + ------- + baseAgent.BaseAgent + The agent to train + """ + # Random agent + if agent.lower() == "random": + from agent.Random import Random + return Random(config["action_space"], config["seed"]) + + # Sarsa(λ) + if agent.lower() == "sarsa": + from agent.linear.Sarsa import Sarsa + return Sarsa( + decay=config["decay"], + lr=config["lr"], + gamma=config["gamma"], + epsilon=config["epsilon"], + action_space=config["action_space"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + env=config["env"], + trace_type=config["trace_type"], + policy_type=config["policy_type"], + ) + + # 𝔼Sarsa(λ) + if agent.lower() == "esarsa": + from agent.linear.ESarsa import ESarsa + return ESarsa( + decay=config["decay"], + lr=config["lr"], + gamma=config["gamma"], + epsilon=config["epsilon"], + action_space=config["action_space"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + env=config["env"], + trace_type=config["trace_type"], + ) + + # Linear-Gaussian Actor-Critic + if agent.lower() == "LinearGaussianAC".lower(): + from agent.GaussianAC import GaussianAC + return GaussianAC( + decay=config["decay"], + actor_lr_scale=config["actor_lr_scale"], + critic_lr=config["critic_lr"], + gamma=config["gamma"], + accumulate_trace=config["accumulate_trace"], + action_space=config["action_space"], + scaled=config["scaled"], + clip_stddev=config["clip_stddev"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + 
env=config["env"], + use_critic_trace=config["use_critic_trace"], + use_actor_trace=config["use_actor_trace"], + trace_type=config["trace_type"], + ) + + # Linear-Softmax Actor-Critic + if agent.lower() == "LinearSoftmaxAC".lower(): + from agent.linear.SoftmaxAC import SoftmaxAC + return SoftmaxAC( + decay=config["decay"], + actor_lr=config["actor_lr"], + critic_lr=config["critic_lr"], + gamma=config["gamma"], + accumulate_trace=config["accumulate_trace"], + action_space=config["action_space"], + seed=config["seed"], + bins=config["bins"], + num_tilings=config["num_tilings"], + env=config["env"], + use_critic_trace=config["use_critic_trace"], + use_actor_trace=config["use_actor_trace"], + trace_type=config["trace_type"], + temperature=config["temperature"] + ) + + # FKL + if agent.lower() == "fkl": + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + # Vanilla Actor Critic using FKL + from agent.nonlinear.FKL import FKL + return FKL( + num_inputs=config["feature_size"], + action_space=config["action_space"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config["target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + num_samples=config["num_samples"], activation="relu", + env=config["env"], + ) + + # Vanilla Actor-Critic + if agent.lower() == "VAC".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + from agent.nonlinear.VAC import VAC + return VAC( + num_inputs=config["feature_size"], + action_space=config["action_space"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config["target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + num_samples=config["num_samples"], activation="relu", + env=config["env"], + ) + + # Discrete Vanilla Actor-Critic + if agent.lower() == "VACDiscrete".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + from agent.nonlinear.VACDiscrete import VACDiscrete + return VACDiscrete( + num_inputs=config["feature_size"], + action_space=config["action_space"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config[ + "target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], + clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation="relu", + ) + + # Soft Actor-Critic + if agent.lower() == "SAC".lower(): + if "activation" in 
config: + activation = config["activation"] + else: + activation = "relu" + + if "num_hidden" in config: + num_hidden = config["num_hidden"] + else: + num_hidden = 3 + from agent.nonlinear.SAC import SAC + return SAC( + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config["target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + alpha_lr=config["alpha_lr"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + automatic_entropy_tuning=config["automatic_entropy_tuning"], + cuda=config["cuda"], clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation=activation, env=config["env"], + ) + + # Discrete Soft Actor-Critic + if agent.lower() == "SACDiscrete".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + if "num_hidden" in config: + num_hidden = config["num_hidden"] + else: + num_hidden = 3 + + from agent.nonlinear.SACDiscrete import SACDiscrete + return SACDiscrete( + env=config["env"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config[ + "target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + alpha_lr=config["alpha_lr"], + actor_hidden_dim=config["hidden_dim"], + critic_hidden_dim=config["hidden_dim"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + automatic_entropy_tuning=config[ + "automatic_entropy_tuning"], + cuda=config["cuda"], + clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation=activation, + ) + + # Discrete Soft Actor-Critic + CNN + if agent.lower() == "SACDiscreteCNN".lower(): + if "activation" in config: + activation = config["activation"] + else: + activation = "relu" + + from agent.nonlinear.SACDiscreteCNN import SACDiscrete + return SACDiscrete( + env=config["env"], + gamma=config["gamma"], tau=config["tau"], + alpha=config["alpha"], policy=config["policy_type"], + target_update_interval=config[ + "target_update_interval"], + critic_lr=config["critic_lr"], + actor_lr_scale=config["actor_lr_scale"], + alpha_lr=config["alpha_lr"], + hidden_dim=config["hidden_dim"], + channels=config["channels"], + kernel_sizes=config["kernel_sizes"], + replay_capacity=config["replay_capacity"], + seed=config["seed"], batch_size=config["batch_size"], + cuda=config["cuda"], + clip_stddev=config["clip_stddev"], + init=config["weight_init"], betas=config["betas"], + activation=activation, + ) + + raise NotImplementedError("No agent " + agent) + + +def _calculate_mean_return_episodic(hp_returns, type_, after=0): + """ + Calculates the mean return for an experiment run on an episodic environment + over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. 
+ type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + if type_ == "eval": + hp_returns = [np.mean(hp_returns[i][after:], axis=-1) for i in + range(len(hp_returns))] + + # Calculate the average return for all episodes in the run + run_returns = [np.mean(hp_returns[i][after:]) for i in + range(len(hp_returns))] + + mean = np.mean(run_returns) + stderr = np.std(run_returns) / np.sqrt(len(hp_returns)) + + return mean, stderr + + +def _calculate_mean_return_episodic_conf(hp_returns, type_, significance, + after=0): + """ + Calculates the mean return for an experiment run on an episodic environment + over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. + type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + significance: float + The level of significance for the confidence interval + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + if type_ == "eval": + hp_returns = [np.mean(hp_returns[i][after:], axis=-1) for i in + range(len(hp_returns))] + + # Calculate the average return for all episodes in the run + run_returns = [np.mean(hp_returns[i][after:]) for i in + range(len(hp_returns))] + + mean = np.mean(run_returns) + run_returns = np.array(run_returns) + + conf = bs.bootstrap(run_returns, stat_func=bs_stats.mean, + alpha=significance) + + return mean, conf + + +def _calculate_mean_return_continuing(hp_returns, type_, after=0): + """ + Calculates the mean return for an experiment run on a continuing + environment over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. + type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + hp_returns = np.stack(hp_returns) + + # If evaluating, use the mean return over all episodes for each + # evaluation interval. 
That is, if 10 eval episodes for each + # evaluation the take the average return over all these eval + # episodes + if type_ == "eval": + hp_returns = hp_returns.mean(axis=-1) + + # Calculate the average return over all runs + hp_returns = hp_returns[after:, :].mean(axis=-1) + + # Calculate the average return over all "episodes" + stderr = np.std(hp_returns) / np.sqrt(len(hp_returns)) + mean = hp_returns.mean(axis=0) + + return mean, stderr + + +def _calculate_mean_return_continuing_conf(hp_returns, type_, significance, + after=0): + """ + Calculates the mean return for an experiment run on a continuing + environment over all runs and episodes + + Parameters + ---------- + hp_returns : Iterable of Iterable + A list of lists, where the outer list has a single inner list for each + run. The inner lists store the return per episode for that run. Note + that these returns should be for a single hyperparameter setting, as + everything in these lists are averaged and returned as the average + return. + type_ : str + Whether calculating the training or evaluation mean returns, one of + 'train', 'eval' + after : int, optional + Only consider episodes after this episode, by default 0 + + Returns + ------- + 2-tuple of float + The mean and standard error of the returns over all episodes and all + runs + """ + hp_returns = np.stack(hp_returns) + + # If evaluating, use the mean return over all episodes for each + # evaluation interval. That is, if 10 eval episodes for each + # evaluation the take the average return over all these eval + # episodes + if type_ == "eval": + hp_returns = hp_returns.mean(axis=-1) + + # Calculate the average return over all episodes + hp_returns = hp_returns[after:, :].mean(axis=-1) + + # Calculate the average return over all runs + mean = hp_returns.mean(axis=0) + conf = bs.bootstrap(hp_returns, stat_func=bs_stats.mean, + alpha=significance) + + return mean, conf + + +def get_best_hp_by_file(dir, type_, after=0, env_type="continuing"): + """ + Find the best hyperparameters from a list of files. + + Gets and returns a list of the hyperparameter settings, sorted by average + return. This function assumes a single directory containing all data + dictionaries, where each data dictionary contains all data of all runs for + a *single* hyperparameter setting. There must be a single file for each + hyperparameter setting in the argument directory. + + Note: If any retrun is NaN within the range specified by after, then the + entire return is considered NaN. + + Parameters + ---------- + dir : str + The directory which contains the data dictionaries, with one data + dictionary per hyperparameter setting + type_ : str + The type of return by which to compare hyperparameter settings, one of + "train" or "eval" + after : int, optional + Hyperparameters will only be compared by their performance after + training for this many episodes (in continuing tasks, this is the + number of times the task is restarted). For example, if after = -10, + then only the last 10 returns from training/evaluation are taken + into account when comparing the hyperparameters. As usual, positive + values index from the front, and negative values index from the back. + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + n-tuple of 2-tuple(int, float) + A tuple with the number of elements equal to the total number of + hyperparameter combinations. 
Each sub-tuple is a tuple of (hyperparameter + setting number, mean return over all runs and episodes) + """ + files = glob(os.path.join(dir, "*.pkl")) + + if type_ not in ("train", "eval"): + raise ValueError("type_ should be one of 'train', 'eval'") + + return_type = "train_episode_rewards" if type_ == "train" \ + else "eval_episode_rewards" + + mean_returns = [] + # hp_settings = [] + # hp_settings = sorted(list(data["experiment_data"].keys())) + for file in tqdm(files): + hp_returns = [] + + # Get the data + file = open(file, "rb") + data = pickle.load(file) + + hp_setting = next(iter(data["experiment_data"])) + # hp_settings.append(hp_setting) + for run in data["experiment_data"][hp_setting]["runs"]: + hp_returns.append(run[return_type]) + + # Episodic and continuing must be dealt with differently since + # we may have many episodes for a given number of timesteps for + # episodic tasks + if env_type == "episodic": + hp_returns, _ = _calculate_mean_return_episodic(hp_returns, type_, + after) + + elif env_type == "continuing": + hp_returns, _ = _calculate_mean_return_continuing(hp_returns, + type_, after) + + # Save mean return + mean_returns.append((hp_setting, hp_returns)) + + # Close the file + file.close() + del data + + # Create a structured array for sorting by return + dtype = [("setting index", int), ("return", float)] + mean_returns = np.array(mean_returns, dtype=dtype) + + # Return the best hyperparam settings in order with the + # mean returns sorted by hyperparmater setting performance + # best_hp_settings = np.argsort(mean_returns) + # mean_returns = np.array(mean_returns)[best_hp_settings] + mean_returns = np.sort(mean_returns, order="return") + + # return tuple(zip(best_hp_settings, mean_returns)) + return mean_returns + + +def combine_runs(data1, data2): + """ + Adds the runs for each hyperparameter setting in data2 to the runs for the + corresponding hyperparameter setting in data1. + + Given two data dictionaries, this function will get each hyperparameter + setting and extend the runs done on this hyperparameter setting and saved + in data1 by the runs of this hyperparameter setting and saved in data2. + In short, this function extends the lists + data1["experiment_data"][i]["runs"] by the lists + data2["experiment_data"][i]["runs"] for all i. This is useful if + multiple runs are done at different times, and the two data files need + to be combined. + + Parameters + ---------- + data1 : dict + A data dictionary as generated by main.py + data2 : dict + A data dictionary as generated by main.py + + Raises + ------ + KeyError + If a hyperparameter setting exists in data2 but not in data1. This + signals that the hyperparameter settings indices are most likely + different, so the hyperparameter index i in data1 does not correspond + to the same hyperparameter index in data2. In addition, all other + functions expect the number of runs to be consistent for each + hyperparameter setting, which would be violated in this case. 
+    """
+    for hp_setting in data1["experiment_data"]:
+        if hp_setting not in data2["experiment_data"]:
+            # Ensure consistent hyperparam settings indices
+            raise KeyError("hyperparameter settings are different " +
+                           "between the two experiments")
+
+        extra_runs = data2["experiment_data"][hp_setting]["runs"]
+        data1["experiment_data"][hp_setting]["runs"].extend(extra_runs)
+
+
+def get_returns(data, type_, ind, env_type="continuing"):
+    """
+    Gets the returns seen by an agent
+
+    Gets the online or offline returns seen by an agent trained with
+    hyperparameter settings index ind.
+
+    Parameters
+    ----------
+    data : dict
+        The Python data dictionary generated from running main.py
+    type_ : str
+        Whether to get the training or evaluation returns, one of 'train',
+        'eval'
+    ind : int
+        Gets the returns of the agent trained with this hyperparameter
+        settings index
+    env_type : str, optional
+        The type of environment, one of 'continuing', 'episodic'. By default
+        'continuing'
+
+    Returns
+    -------
+    array_like
+        The array of returns of the form (N, R, C), where N is the number of
+        runs, R is the number of times performance was measured, and C is the
+        number of returns generated each time performance was measured
+        (offline >= 1; online = 1). For the online setting, R is the number
+        of episodes and C = 1. For the offline setting, R is the number of
+        times offline evaluation was performed, and C is the number of
+        episodes run each time performance was evaluated offline.
+    """
+    if env_type == "episodic":
+        # data = reduce_episodes(data, ind, type_)
+        data = runs.expand_episodes(data, ind, type_)
+
+    returns = []
+    if type_ == "eval":
+        # Get the offline evaluation episode returns per run
+        for run in data["experiment_data"][ind]["runs"]:
+            returns.append(run["eval_episode_rewards"])
+        returns = np.stack(returns)
+
+    elif type_ == "train":
+        # Get the returns per episode per run
+        for run in data["experiment_data"][ind]["runs"]:
+            returns.append(run["train_episode_rewards"])
+        returns = np.expand_dims(np.stack(returns), axis=-1)
+
+    return returns
+
+
+def get_avg_returns(data, type_, ind, after=0, before=None):
+    """
+    Gets the average return over all episodes seen by an agent for each run
+
+    Gets the online or offline returns seen by an agent trained with
+    hyperparameter settings index ind, averaged within each run.
+
+    Parameters
+    ----------
+    data : dict
+        The Python data dictionary generated from running main.py
+    type_ : str
+        Whether to get the training or evaluation returns, one of 'train',
+        'eval'
+    ind : int
+        Gets the returns of the agent trained with this hyperparameter
+        settings index
+    after : int, optional
+        Only consider episodes (or evaluation intervals) from this index
+        onwards, by default 0
+    before : int, optional
+        Only consider episodes (or evaluation intervals) before this index,
+        by default None (use all remaining episodes)
+
+    Returns
+    -------
+    array_like
+        An array of shape (N,), where N is the number of runs, containing
+        the average return over the selected episodes for each run
+ """ + returns = [] + if type_ == "eval": + # Get the offline evaluation episode returns per run + for run in data["experiment_data"][ind]["runs"]: + if before is not None: + run_returns = run["eval_episode_rewards"][after:before] + else: + run_returns = run["eval_episode_rewards"][after:before] + returns.append(run_returns) + + returns = np.stack(returns).mean(axis=(-2, -1)) + + elif type_ == "train": + # Get the returns per episode per run + for run in data["experiment_data"][ind]["runs"]: + if before is not None: + run_returns = run["train_episode_rewards"][after:before] + else: + run_returns = run["train_episode_rewards"][after:] + returns.append(np.mean(run_returns)) + + returns = np.array(returns) + + return returns + + +def get_mean_returns_with_stderr_hp_varying(dir_, type_, hp_name, combo, + env_config, agent_config, after=0, + env_type="continuing"): + """ + Calculate mean and standard error of return for each hyperparameter value. + + Gets the mean returns for each variation of a single hyperparameter, + with all other hyperparameters remaining constant. Since there are + many different ways this can happen (the hyperparameter can vary + with all other remaining constant, but there are many combinations + of these constant hyperparameters), the combo argument cycles through + the combinations of constant hyperparameters. + + Given hyperparameters a, b, and c, let's say we want to get all + hyperparameter settings indices where a varies, and b and c are constant. + if a, b, and c can each be 1 or 2, then there are four ways that a can + vary with b and c remaining constant: + + [ + ((a=1, b=1, c=1), (a=2, b=1, c=1)), combo = 0 + ((a=1, b=2, c=1), (a=2, b=2, c=1)), combo = 1 + ((a=1, b=1, c=2), (a=2, b=1, c=2)), combo = 2 + ((a=1, b=2, c=2), (a=2, b=2, c=2)) combo = 3 + ] + + The combo argument indexes into this list of hyperparameter settings + + Parameters + ---------- + dir_ : str + The directory of data dictionaries generated from running main.py, + separated into one data dictionary per HP setting + type_ : str + Which type of data to plot, one of "eval" or "train" + hp_name : str + The name of the hyperparameter to plot the sensitivity curves of + combo : int + Determines the values of the constant hyperparameters. Given that + only one hyperparameter may vary, there are many different sets + having this hyperparameter varying with all others remaining constant + since each constant hyperparameter may take on many values. This + argument cycles through all sets of hyperparameter settings indices + that have only one hyperparameter varying and all others constant. 
+ env_config : dict + The environment configuration file as a Python dictionary + agent_config : dict + The agent configuration file as a Python dictionary + after : int + Only consider returns after this episode + """ + hp_combo = get_varying_single_hyperparam(env_config, agent_config, + hp_name)[combo] + + env_name = env_config["env_name"] + agent_name = agent_config["agent_name"] + filename = f"{env_name}_{agent_name}_hp-" + "{hp}.pkl" + + mean_returns = [] + stderr_returns = [] + hp_values = [] + for hp in hp_combo: + if hp is None: + continue + + with open(os.path.join(dir_, filename.format(hp=hp)), "rb") as in_file: + data = pickle.load(in_file) + + hp_returns = [] + return_type = f"{type_}_episode_rewards" + for run in data["experiment_data"][hp]["runs"]: + hp_returns.append(run[return_type]) + + if env_type == "episodic": + mean_return, stderr_return = \ + _calculate_mean_return_episodic(hp_returns, type_, after) + elif env_type == "continuing": + mean_return, stderr_return = \ + _calculate_mean_return_continuing(hp_returns, type_, after) + + mean_returns.append(mean_return) + stderr_returns.append(stderr_return) + hp_value = data["experiment_data"][hp]["agent_hyperparams"][hp_name] + hp_values.append(hp_value) + + del data + + # Get each hp value and sort all results by hp value + # hp_values = np.array(agent_config["parameters"][hp_name]) + hp_values = np.array(hp_values) + indices = np.argsort(hp_values) + + mean_returns = np.array(mean_returns)[indices] + stderr_returns = np.array(stderr_returns)[indices] + hp_values = hp_values[indices] + + return hp_values, mean_returns, stderr_returns + + +def get_mean_returns_with_conf_hp_varying(dir_, type_, hp_name, combo, + env_config, agent_config, after=0, + env_type="continuing", + significance=0.1): + """ + Calculate mean and standard error of return for each hyperparameter value. + + Gets the mean returns for each variation of a single hyperparameter, + with all other hyperparameters remaining constant. Since there are + many different ways this can happen (the hyperparameter can vary + with all other remaining constant, but there are many combinations + of these constant hyperparameters), the combo argument cycles through + the combinations of constant hyperparameters. + + Given hyperparameters a, b, and c, let's say we want to get all + hyperparameter settings indices where a varies, and b and c are constant. + if a, b, and c can each be 1 or 2, then there are four ways that a can + vary with b and c remaining constant: + + [ + ((a=1, b=1, c=1), (a=2, b=1, c=1)), combo = 0 + ((a=1, b=2, c=1), (a=2, b=2, c=1)), combo = 1 + ((a=1, b=1, c=2), (a=2, b=1, c=2)), combo = 2 + ((a=1, b=2, c=2), (a=2, b=2, c=2)) combo = 3 + ] + + The combo argument indexes into this list of hyperparameter settings + + Parameters + ---------- + dir_ : str + The directory of data dictionaries generated from running main.py, + separated into one data dictionary per HP setting + type_ : str + Which type of data to plot, one of "eval" or "train" + hp_name : str + The name of the hyperparameter to plot the sensitivity curves of + combo : int + Determines the values of the constant hyperparameters. Given that + only one hyperparameter may vary, there are many different sets + having this hyperparameter varying with all others remaining constant + since each constant hyperparameter may take on many values. This + argument cycles through all sets of hyperparameter settings indices + that have only one hyperparameter varying and all others constant. 
+ env_config : dict + The environment configuration file as a Python dictionary + agent_config : dict + The agent configuration file as a Python dictionary + after : int + Only consider returns after this episode + """ + hp_combo = get_varying_single_hyperparam(env_config, agent_config, + hp_name)[combo] + + env_name = env_config["env_name"] + agent_name = agent_config["agent_name"] + filename = f"{env_name}_{agent_name}_hp-" + "{hp}.pkl" + + mean_returns = [] + conf_returns = [] + hp_values = [] + for hp in hp_combo: + if hp is None: + continue + + with open(os.path.join(dir_, filename.format(hp=hp)), "rb") as in_file: + data = pickle.load(in_file) + + hp_returns = [] + return_type = f"{type_}_episode_rewards" + for run in data["experiment_data"][hp]["runs"]: + hp_returns.append(run[return_type]) + + if env_type == "episodic": + mean_return, conf_return = \ + _calculate_mean_return_episodic_conf(hp_returns, type_, + significance, after) + elif env_type == "continuing": + mean_return, conf_return = \ + _calculate_mean_return_continuing_conf(hp_returns, type_, + significance, after) + + mean_returns.append(mean_return) + conf_returns.append([conf_return.lower_bound, conf_return.upper_bound]) + hp_value = data["experiment_data"][hp]["agent_hyperparams"][hp_name] + hp_values.append(hp_value) + + del data + + # Get each hp value and sort all results by hp value + # hp_values = np.array(agent_config["parameters"][hp_name]) + hp_values = np.array(hp_values) + indices = np.argsort(hp_values) + + mean_returns = np.array(mean_returns)[indices] + conf_returns = np.array(conf_returns)[indices, :].transpose() + hp_values = hp_values[indices] + + return hp_values, mean_returns, conf_returns + + +def get_mean_err(data, type_, ind, smooth_over, error, + env_type="continuing", keep_shape=False, + err_args={}): + """ + Gets the timesteps, mean, and standard error to be plotted for + a given hyperparameter settings index + + Note: This function assumes that each run has an equal number of episodes. + This is true for continuing tasks. For episodic tasks, you will need to + cutoff the episodes so all runs have the same number of episodes. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : int + The hyperparameter settings index to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + error: function + The error function to compute the error with + env_type : str, optional + The type of environment the data was generated on + keep_shape : bool, optional + Whether or not the smoothed data should discard or keep the first + few data points before smooth_over. 
+ err_args : dict + A dictionary of keyword arguments to pass to the error function + + Returns + ------- + 3-tuple of list(int), list(float), list(float) + The timesteps, mean episodic returns, and standard errors of the + episodic returns + """ + timesteps = None # So the linter doesn't have a temper tantrum + + # Determine the timesteps to plot at + if type_ == "eval": + timesteps = \ + data["experiment_data"][ind]["runs"][0]["timesteps_at_eval"] + + elif type_ == "train": + timesteps_per_ep = \ + data["experiment_data"][ind]["runs"][0]["train_episode_steps"] + timesteps = get_cumulative_timesteps(timesteps_per_ep) + + # Get the mean over all episodes per evaluation step (for online + # returns, this axis will have length 1 so we squeeze it) + returns = get_returns(data, type_, ind, env_type=env_type) + returns = returns.mean(axis=-1) + + returns = smooth(returns, smooth_over, keep_shape=keep_shape) + + # Get the standard error of mean episodes per evaluation + # step over all runs + if error is not None: + err = error(returns, **err_args) + else: + err = None + + # Get the mean over all runs + print("RUNS:", returns.shape) + mean = returns.mean(axis=0) + + # Return only the valid portion of timesteps. If smoothing and not + # keeping the first data points, then the first smooth_over columns + # will not have any data + if not keep_shape: + end = len(timesteps) - smooth_over + 1 + timesteps = timesteps[:end] + + return timesteps, mean, err + + +def bootstrap_conf(runs, significance=0.01): + """ + THIS NEEDS TO BE UPDATED + + + Gets the bootstrap confidence interval of the distribution of mean return + per episode for a single hyperparameter setting. + + Note that this function assumes that there are an equal number of episodes + for each run. This is true for continuing environments. If using an + episodic environment, ensure that the episodes have been made consistent + across runs before running this function. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + array_like + An array with two rows and n columns. The first row denotes the lower + bound of the confidence interval and the second row denotes the upper + bound of the confidence interval. The number of columns, n, is the + number of episodes. 
+ """ + # return_type = type_ + "_episode_rewards" + # runs = [] + # for run in data["experiment_data"][hp]["runs"]: + # if type_ == "eval": + # runs.append(run[return_type].mean()) + # else: + # runs.append(run[return_type]) + + # Rows are the returns for the episode number == row number for each run + ep_conf = [] + run_returns = [] + for ep in range(runs.shape[1]): + ep_returns = [] + for run in range(runs.shape[0]): + ep_returns.append(np.mean(runs[run][ep])) + run_returns.append(ep_returns) + + run_returns = np.array(run_returns) + + conf_interval = [] + for ep in range(run_returns.shape[0]): + ep_conf = bs.bootstrap(run_returns[ep, :], stat_func=bs_stats.mean, + alpha=significance) + conf_interval.append([ep_conf.lower_bound, ep_conf.upper_bound]) + + return np.array(conf_interval).transpose() + + +def stderr(matrix, axis=0): + """ + Calculates the standard error along a specified axis + + Parameters + ---------- + matrix : array_like + The matrix to calculate standard error along the rows of + axis : int, optional + The axis to calculate the standard error along, by default 0 + + Returns + ------- + array_like + The standard error of each row along the specified axis + + Raises + ------ + np.AxisError + If an invalid axis is passed in + """ + if axis > len(matrix.shape) - 1: + raise np.AxisError(f"""axis {axis} is out of bounds for array with + {len(matrix.shape) - 1} dimensions""") + + samples = matrix.shape[axis] + return np.std(matrix, axis=axis) / np.sqrt(samples) + + +def smooth(matrix, smooth_over, keep_shape=False): + """ + Smooth the rows of returns + + Smooths the rows of returns by replacing the value at index i in a + row of returns with the average of the next smooth_over elements, + starting at element i. + + Parameters + ---------- + matrix : array_like + The array to smooth over + smooth_over : int + The number of elements to smooth over + keep_shape : bool, optional + Whether the smoothed array should have the same shape as + as the input array, by default True. If True, then for the first + few i < smooth_over columns of the input array, the element at + position i is replaced with the average of all elements at + positions j <= i. + + Returns + ------- + array_like + The smoothed over array + """ + if smooth_over > 1: + # Smooth each run separately + kernel = np.ones(smooth_over) / smooth_over + smoothed_matrix = _smooth(matrix, kernel, "valid", axis=1) + + # Smooth the first few episodes + if keep_shape: + beginning_cols = [] + for i in range(1, smooth_over): + # Calculate smoothing over the first i columns + beginning_cols.append(matrix[:, :i].mean(axis=1)) + + # Numpy will use each smoothed col as a row, so transpose + beginning_cols = np.array(beginning_cols).transpose() + else: + return matrix + + if keep_shape: + # Return the smoothed array + return np.concatenate([beginning_cols, smoothed_matrix], + axis=1) + else: + return smoothed_matrix + + +def _smooth(matrix, kernel, mode="valid", axis=0): + """ + Performs an axis-wise convolution of matrix with kernel + + Parameters + ---------- + matrix : array_like + The matrix to convolve + kernel : array_like + The kernel to convolve on each row of matrix + mode : str, optional + The mode of convolution, by default "valid". 
One of 'valid', + 'full', 'same' + axis : int, optional + The axis to perform the convolution along, by default 0 + + Returns + ------- + array_like + The convolved array + + Raises + ------ + ValueError + If kernel is multi-dimensional + """ + if len(kernel.shape) != 1: + raise ValueError("kernel must be 1D") + + def convolve(mat): + return np.convolve(mat, kernel, mode=mode) + + return np.apply_along_axis(convolve, axis=axis, arr=matrix) + + +def get_cumulative_timesteps(timesteps_per_episode): + """ + Creates an array of cumulative timesteps. + + Creates an array of timesteps, where each timestep is the cumulative + number of timesteps up until that point. This is needed for plotting the + training data, where the training timesteps are stored for each episode, + and we need to plot on the x-axis the cumulative timesteps, not the + timesteps per episode. + + Parameters + ---------- + timesteps_per_episode : list + A list where each element in the list denotes the amount of timesteps + for the corresponding episode. + + Returns + ------- + array_like + An array where each element is the cumulative number of timesteps up + until that point. + """ + timesteps_per_episode = np.array(timesteps_per_episode) + cumulative_timesteps = [timesteps_per_episode[:i].sum() + for i in range(timesteps_per_episode.shape[0])] + + return np.array(cumulative_timesteps) + + +def combine_data_dictionaries_by_hp(dir_, env, agent, num_hp_settings, + num_runs, save_dir=".", save_returns=True, + env_type="continuing", offset=0): + """ + Combines all data dictionaries by hyperparameter setting. + + Given a directory, combines all data dictionaries relating to the argument + agent and environment, grouped by hyperparameter settings index. This way, + each resulting data dictionary will contain all data of all runs for + a single hyperparameter setting. This function will save one data + dictionary, consisting of all runs, for each hyperparameter setting. + + This function looks for files named like + "env_agent_data_start_stop_step.pkl" in the argument directory and + combines all those whose start index refers to the same hyperparameter + settings index. + + Parameters + ---------- + dir_ : str + The directory containing the data files + env : str + The name of the environment the experiments were run on + agent : str + The name of the agent in the experiments + num_hp_settings : int + The total number of hyperparameter settings used in the experiment + num_runs : int + The number of runs in the experiment + save_dir : str, optional + The directory to save the combined data in, by default "." + save_returns : bool, optinal + Whether or not to save the mean training and evaluation returns over + all episodes and runs in a text file, by default True + env_type : str, optional + Whether the environment is continuing or episodic, one of 'continuing', + 'episodic'; by default 'continuing'. This determines how the average + return is calculated. For continuing environments, each episode's + performance is first averaged over runs and then over episodes. For + episodic environments, the average return is calculated by first + averaging over all episodes in each run, and then averaging over all + runs; this is required since each run may have a different number of + episodes. 
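+
+    Example
+    -------
+    An illustrative call only; the environment, agent, and counts below are
+    assumptions, not defaults used by this repository:
+
+    >>> combine_data_dictionaries_by_hp("./results", "Pendulum-v0", "SAC",
+    ...                                 num_hp_settings=36, num_runs=10,
+    ...                                 save_dir="./results/combined")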
+ """ + hp_returns = [] + + for hp_ind in range(num_hp_settings): + _, train_mean, eval_mean = \ + combine_data_dictionaries_single_hp(dir_, env, agent, hp_ind, + num_hp_settings, num_runs, + save_dir, save_returns, + env_type, offset=offset) + if save_returns: + hp_returns.append((hp_ind, train_mean, eval_mean)) + + # Write the mean training and evaluation returns to a file + if save_returns: + filename = f"{env}_{agent}_avg_returns.pkl" + with open(os.path.join(save_dir, filename), "wb") as out_file: + # out_file.write(f"{train_mean}, {eval_mean}") + pickle.dump(hp_returns, out_file) + + +def combine_data_dictionaries_single_hp(dir_, env, agent, hp_ind, + num_hp_settings, num_runs, + save_dir=".", calculate_returns=True, + env_type="continuing", offset=0): + filenames = f"{env}_{agent}_data_" + "{start}.pkl" + + hp_run_files = [] + hp_offset = offset * num_hp_settings + start = hp_ind + hp_offset + for j in range(start, start + num_hp_settings * num_runs, num_hp_settings): + filename = os.path.join(dir_, filenames.format(start=j)) + if os.path.exists(filename): + hp_run_files.append(filename) + data = combine_data_dictionaries(hp_run_files, True, save_dir=save_dir, + filename=f"{env}_{agent}_hp-{hp_ind}") + + if not calculate_returns: + return hp_ind, -1., -1. + + # Get the returns for each episode in each run + train_returns = [] + eval_returns = [] + for run in data["experiment_data"][hp_ind]["runs"]: + train_returns.append(run["train_episode_rewards"]) + eval_returns.append(run["eval_episode_rewards"]) + + # Get the mean performance + if env_type == "continuing": + train_mean, _ = _calculate_mean_return_continuing(train_returns, + "train") + eval_mean, _ = _calculate_mean_return_continuing(eval_returns, + "eval") + + elif env_type == "episodic": + train_mean, _ = _calculate_mean_return_episodic(train_returns, + "train") + eval_mean, _ = _calculate_mean_return_episodic(eval_returns, + "eval") + + return hp_ind, train_mean, eval_mean + + +def combine_data_dictionaries(files, save=True, save_dir=".", filename="data"): + """ + Combine data dictionaries given a list of filenames + + Given a list of paths to data dictionaries, combines each data dictionary + into a single one. 
+ + Parameters + ---------- + files : list of str + A list of the paths to data dictionary files to combine + save : bool + Whether or not to save the data + save_dir : str, optional + The directory to save the resulting data dictionaries in + filename : str, optional + The name of the file to save which stores the combined data, by default + 'data' + + Returns + ------- + dict + The combined dictionary + """ + # Use first dictionary as base dictionary + with open(files[0], "rb") as in_file: + data = pickle.load(in_file) + + # Add data from all other dictionaries + for file in files[1:]: + with open(file, "rb") as in_file: + # Read in the new dictionary + in_data = pickle.load(in_file) + + # Add experiment data to running dictionary + for key in in_data["experiment_data"]: + # Check if key exists + if key in data["experiment_data"]: + # Append data if existing + data["experiment_data"][key]["runs"] \ + .extend(in_data["experiment_data"][key]["runs"]) + + else: + # Key doesn't exist - add data to dictionary + data["experiment_data"][key] = \ + in_data["experiment_data"][key] + + if save: + with open(os.path.join(save_dir, f"{filename}.pkl"), "wb") as out_file: + pickle.dump(data, out_file) + + return data + + +def combine_data_dictionaries_by_dir(dir): + """ + Combines the many data dictionaries created during the concurrent + training procedure into a single data dictionary. The combined data is + saved as "data.pkl" in the argument dir. + + Parameters + ---------- + dir : str + The path to the directory containing all data dictionaries to combine + + Returns + ------- + dict + The combined dictionary + """ + files = glob(os.path.join(dir, "*.pkl")) + + combine_data_dictionaries(files) + + +if __name__ == "__main__": + f = open("results/MountainCarContinuous-v0_linearACresults" + + "/MountainCarContinuous-v0_linearAC_hp-12.pkl", "rb") + data = pickle.load(f) + f.close() + + # get_mean_stderr(data, "train", 12, 5) + r = get_returns(data, "train", 12, "episodic") + print(r.shape) + + +def detrend_linear(arr, axis=-1, type_="linear"): + """ + Detrends a matrix along an axis using linear model fitting + + Parameters + ---------- + arr : array_like + The array to detrend + axis : int, optional + The axis along which to detrend, by default -1 + type_ : str, optional + Whether to use the prediction of the linear model or the mean + generated by the linear model, by default "linear". One of "linear", + "mean" + + Returns + ------- + array_like + The array of detrended data + """ + return signal.detrend(arr, axis=axis, type=type_) + + +def detrend_difference(arr, axis=-1): + """ + Detrends a matrix along an axis using the method of differences + + Parameters + ---------- + arr : array_like + The array to detrend + axis : int, optional + The axis along which to detrend, by default -1 + + Returns + ------- + array_like + The array of detrended data + """ + return np.diff(arr, axis=axis) diff --git a/utils/hypers.py b/utils/hypers.py new file mode 100644 index 0000000..0c1061a --- /dev/null +++ b/utils/hypers.py @@ -0,0 +1,520 @@ +import numpy as np +from collections.abc import Iterable +from copy import deepcopy +from pprint import pprint +try: + from utils.runs import expand_episodes +except ModuleNotFoundError: + from runs import expand_episodes + + +CONTINIUING = "continuing" +EPISODIC = "episodic" +TRAIN = "train" +EVAL = "eval" + + +def sweeps(parameters, index): + """ + Gets the parameters for the hyperparameter sweep defined by the index. 
+
+    Each hyperparameter setting has a specific index number, and this function
+    will get the appropriate parameters for the argument index. In addition,
+    the indices will wrap around, so if there are a total of 10 different
+    hyperparameter settings, then the indices 0 and 10 will return the same
+    hyperparameter settings. This is useful when looping over runs.
+
+    For example, if you had 10 hyperparameter settings and you wanted to do
+    10 runs, then you could just call this for indices in range(0, 10*10). If
+    you only wanted to do runs for hyperparameter setting i, then you would
+    use indices in range(i, 10*10, 10).
+
+    Parameters
+    ----------
+    parameters : dict
+        The dictionary of parameters, as found in the agent's json
+        configuration file
+    index : int
+        The index of the hyperparameters configuration to return
+
+    Returns
+    -------
+    dict, int
+        The dictionary of hyperparameters to use for the agent and the total
+        number of combinations of hyperparameters (i.e. the number of
+        distinct hyperparameter settings)
+    """
+    # If the algorithm is a batch algorithm, ensure the batch size is no
+    # larger than the replay buffer size
+    if "batch_size" in parameters and "replay_capacity" in parameters:
+        batches = np.array(parameters["batch_size"])
+        replays = np.array(parameters["replay_capacity"])
+        legal_settings = []
+
+        # Calculate the legal combinations of batch sizes and replay capacities
+        for batch in batches:
+            legal = np.where(replays >= batch)[0]
+            legal_settings.extend(list(zip([batch] *
+                                           len(legal), replays[legal])))
+
+        # Replace the configs batch/replay combos with the legal ones
+        parameters["batch/replay"] = legal_settings
+        replaced_hps = ["batch_size", "replay_capacity"]
+    else:
+        replaced_hps = []
+
+    # Get the hyperparameters corresponding to the argument index
+    out_params = {}
+    accum = 1
+    for key in parameters:
+        if key in replaced_hps:
+            # Ignore the HPs that have been sanitized and replaced by a new
+            # set of HPs
+            continue
+
+        num = len(parameters[key])
+        if key == "batch/replay":
+            # Batch/replay must be treated differently
+            batch_replay_combo = parameters[key][(index // accum) % num]
+            out_params["batch_size"] = batch_replay_combo[0]
+            out_params["replay_capacity"] = batch_replay_combo[1]
+            accum *= num
+            continue
+
+        out_params[key] = parameters[key][(index // accum) % num]
+        accum *= num
+
+    return (out_params, accum)
+
+
+def total(parameters):
+    """
+    Similar to sweeps but only returns the total number of
+    hyperparameter combinations. This number is the total number of distinct
+    hyperparameter settings. If this function returns k, then there are k
+    distinct hyperparameter settings, and indices 0 and k refer to the same
+    distinct hyperparameter setting.
+
+    Parameters
+    ----------
+    parameters : dict
+        The dictionary of parameters, as found in the agent's json
+        configuration file
+
+    Returns
+    -------
+    int
+        The number of distinct hyperparameter settings
+    """
+    return sweeps(parameters, 0)[1]
+
+
+def satisfies(data, f):
+    """
+    Similar to hold_constant. Returns all hyperparameter settings
+    that result in f evaluating to True.
+
+    For each hyperparameter setting, the hyperparameter dictionary for that
+    setting is passed to f. If f returns True, then that setting is kept.
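+
+    For example (illustrative only; assumes the agent configuration sweeps a
+    hyperparameter named "critic_lr"):
+
+    >>> data = ...  # Some data from an experiment
+    >>> indices, new_hypers = satisfies(data,
+    ...                                 lambda h: h["critic_lr"] == 0.01)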
+ + Parameters + ---------- + data : dict + The data dictionary generate from running an experiment + f : f(dict) -> bool + A function mapping hyperparameter settings (in a dictionary) to a + boolean value + + Returns + ------- + tuple of list[int], dict + The list of hyperparameter settings satisfying the constraints + defined by constant_hypers and a dictionary of new hyperparameters + which satisfy these constraints + """ + indices = [] + + # Generate a new hyperparameter configuration based on the old + # configuration + new_hypers = deepcopy(data["experiment"]["agent"]["parameters"]) + # Clear the hyper configuration + for key in new_hypers: + if isinstance(new_hypers[key], list): + new_hypers[key] = set() + + for index in data["experiment_data"]: + hypers = data["experiment_data"][index]["agent_hyperparams"] + if not f(hypers): + continue + + # Track the hyper indices and the full hyper settings + indices.append(index) + for key in new_hypers: + if key not in data["experiment_data"][index]["agent_hyperparams"]: + # print(f"{key} not in agent hyperparameters, ignoring...") + continue + + if isinstance(new_hypers[key], set): + agent_val = data["experiment_data"][index][ + "agent_hyperparams"][key] + + # Convert lists to a hashable type + if isinstance(agent_val, list): + agent_val = tuple(agent_val) + + new_hypers[key].add(agent_val) + else: + if key in new_hypers: + value = new_hypers[key] + raise IndexError("clobbering existing hyper " + + f"{key} with value {value} with " + + f"new value {agent_val}") + new_hypers[key] = agent_val + + # Convert each set in new_hypers to a list + for key in new_hypers: + if isinstance(new_hypers[key], set): + new_hypers[key] = sorted(list(new_hypers[key])) + + return indices, new_hypers + + +def hold_constant(data, constant_hypers): + """ + Returns the hyperparameter settings indices and hyperparameter values + of the hyperparameter settings satisfying the constraints constant_hypers. + + Returns the hyperparameter settings indices in the data that + satisfy the constraints as well as a new dictionary of hypers which satisfy + the constraints. The indices returned are the hyper indices of the original + data and not the indices into the new hyperparameter configuration + returned. + + Parameters + ---------- + data: dict + The data dictionary generated from an experiment + + constant_hypers: dict[string]any + A dictionary mapping hyperparameters to a value that they should be + equal to. + + Returns + ------- + tuple of list[int], dict + The list of hyperparameter settings satisfying the constraints + defined by constant_hypers and a dictionary of new hyperparameters + which satisfy these constraints + + Example + ------- + >>> data = ... 
+ >>> contraints = {"stepsize": 0.8} + >>> hold_constant(data, constraints) + ( + [0, 1, 6, 7], + { + "stepsize": [0.8], + "decay": [0.0, 0.5], + "epsilon": [0.0, 0.1], + } + ) + """ + indices = [] + + # Generate a new hyperparameter configuration based on the old + # configuration + new_hypers = deepcopy(data["experiment"]["agent"]["parameters"]) + # Clear the hyper configuration + for key in new_hypers: + if isinstance(new_hypers[key], list): + new_hypers[key] = set() + + # Go through each hyperparameter index, checking if it satisfies the + # constraints + for index in data["experiment_data"]: + # Assume we hyperparameter satisfies the constraints + constraint_satisfied = True + + # Check to see if the agent hyperparameter satisfies the constraints + for hyper in constant_hypers: + constant_val = constant_hypers[hyper] + + # Ensure the constrained hyper exists in the data + if hyper not in data["experiment_data"][index][ + "agent_hyperparams"]: + raise IndexError(f"no such hyper {hyper} in agent hypers") + + agent_val = data["experiment_data"][index]["agent_hyperparams"][ + hyper] + + if agent_val != constant_val: + # Hyperparameter does not satisfy the constraints + constraint_satisfied = False + break + + # If the constraint is satisfied, then we will store the hypers + if constraint_satisfied: + indices.append(index) + + # Add the hypers to the configuration + for key in new_hypers: + if isinstance(new_hypers[key], set): + agent_val = data["experiment_data"][index][ + "agent_hyperparams"][key] + + if isinstance(agent_val, list): + agent_val = tuple(agent_val) + + new_hypers[key].add(agent_val) + else: + if key in new_hypers: + value = new_hypers[key] + raise IndexError("clobbering existing hyper " + + f"{key} with value {value} with " + + f"new value {agent_val}") + new_hypers[key] = agent_val + + # Convert each set in new_hypers to a list + for key in new_hypers: + if isinstance(new_hypers[key], set): + new_hypers[key] = sorted(list(new_hypers[key])) + + return indices, new_hypers + + +def renumber(data, hypers): + """ + Renumbers the hyperparameters in data to reflect the hyperparameter map + hypers. If any hyperparameter settings exist in data that do not exist in + hypers, then those data are discarded. + + Note that each hyperparameter listed in hypers must also be listed in data + and vice versa, but the specific hyperparameter values need not be the + same. For example if "decay" ∈ data[hypers], then it also must be in hypers + and vice versa. If 0.9 ∈ data[hypers][decay], then it need *not* be in + hypers[decay]. + + This function does not mutate the input data, but rather returns a copy of + the input data, appropriately mutated. + + Parameters + ---------- + data : dict + The data dictionary generated from running the experiment + hypers : dict + The new dictionary of hyperparameter values + + Returns + ------- + dict + The modified data dictionary + + Examples + -------- + >>> data = ... + >>> contraints = {"stepsize": 0.8} + >>> new_hypers = hold_constant(data, constraints)[1] + >>> new_data = renumber(data, new_hypers) + """ + data = deepcopy(data) + # Ensure each hyperparameter is in both hypers and data; hypers need not + # list every hyperparameter *value* that is listed in data, but it needs to + # have the same hyperparameters. E.g. if "decay" exists in data then it + # should also exist in hypers, but if 0.9 ∈ data[hypers][decay], this value + # need not exist in hypers. 
+ for key in data["experiment"]["agent"]["parameters"]: + if key not in hypers: + raise ValueError("data and hypers should have all the same " + + f"hyperparameters but {key} ∈ data but ∉ hypers") + + # Ensure each hyperparameter listed in hypers is also listed in data. If it + # isn't then it isn't clear which value of this hyperparamter the data in + # data should map to. E.g. if "decay" = [0.1, 0.2] ∈ hypers but ∉ data, + # which value should we set for the data in data when renumbering? 0.1 or + # 0.2? + for key in hypers: + if key not in data["experiment"]["agent"]["parameters"]: + raise ValueError("data and hypers should have all the same " + + f"hyperparameters but {key} ∈ hypers but ∉ data") + + new_data = {} + new_data["experiment"] = data["experiment"] + new_data["experiment"]["agent"]["parameters"] = hypers + new_data["experiment_data"] = {} + + total_hypers = total(hypers) + + for i in range(total_hypers): + setting = sweeps(hypers, i)[0] + + for j in data["experiment_data"]: + agent_hypers = data["experiment_data"][j]["agent_hyperparams"] + setting_in_data = True + + # For each hyperparameter value in setting, ensure that the + # corresponding agent hyperparameter is equal. If not, ignore that + # hyperparameter setting. + for key in setting: + # If the hyper setting is iterable, then check each value in + # the iterable to ensure it is equal to the corresponding + # value in the agent hyperparameters + if isinstance(setting[key], Iterable): + if len(setting[key]) != len(agent_hypers[key]): + setting_in_data = False + break + for k in range(len(setting[key])): + if setting[key][k] != agent_hypers[key][k]: + setting_in_data = False + break + + # Non-iterable data + elif setting[key] != agent_hypers[key]: + setting_in_data = False + break + + if setting_in_data: + new_data["experiment_data"][i] = data["experiment_data"][j] + + return new_data + + +def get_performance(data, hyper, type_=TRAIN, repeat=True): + """ + Returns the data for each run of key, optionally adjusting the runs' + data so that each run has the same number of data points. This is + accomplished by repeating each episode's performance by the number of + timesteps the episode took to complete + + Parameters + ---------- + data : dict + The data dictionary + hyper : int + The hyperparameter index to get the run data of + repeat : bool + Whether or not to repeat the runs data + + Returns + ------- + np.array + The array of performance data + """ + if type_ not in (TRAIN, EVAL): + raise ValueError(f"unknown type {type_}") + + key = type_ + "_episode_rewards" + + if repeat: + data = expand_episodes(data, hyper, type_) + + run_data = [] + for run in data["experiment_data"][hyper]["runs"]: + run_data.append(run[key]) + + return np.array(run_data) + + +def best(data, perf=TRAIN): + """ + Returns the hyperparameter index of the hyper setting which resulted in the + highest AUC of the learning curve. AUC is calculated by computing the AUC + for each run, then taking the average over all runs. 
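+
+    For example (an illustrative sketch; assumes data was produced by
+    main.py):
+
+    >>> data = ...  # Some data from an experiment
+    >>> best_ind, best_return = best(data, perf="train")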
+ + Parameters + ---------- + data : dict + The data dictionary + perf : str + The type of performance to evaluate, train or eval + + Returns + ------- + np.array[int], np.float32 + The hyper settings that resulted in the maximum return as well as the + maximum return + """ + max_hyper = int(np.max(list(data["experiment_data"].keys()))) + hypers = [np.finfo(np.float64).min] * (max_hyper + 1) + for hyper in data["experiment_data"]: + hyper_data = [] + for run in data["experiment_data"][hyper]["runs"]: + hyper_data.append(run[f"{perf}_episode_rewards"].mean()) + + hyper_data = np.array(hyper_data) + hypers[hyper] = hyper_data.mean() + + return np.argmax(hypers), np.max(hypers) + + +def get(data, ind): + """ + Gets the hyperparameters for hyperparameter settings index ind + + data : dict + The Python data dictionary generated from running main.py + ind : int + Gets the returns of the agent trained with this hyperparameter + settings index + + Returns + ------- + dict + The dictionary of hyperparameters + """ + return data["experiment_data"][ind]["agent_hyperparams"] + + +def which(data, hypers, equal_keys=False): + """ + Get the hyperparameter index at which all agent hyperparameters are + equal to those specified by hypers. + + Parameters + ---------- + data : dict + The data dictionary that resulted from running an experiment + hypers : dict[string]any + A dictionary of hyperparameters to the values that those + hyperparameters should take on + equal_keys : bool, optional + Whether or not all keys must be shared between the sets of agent + hyperparameters and the argument hypers. By default False. + + Returns + ------- + int, None + The hyperparameter index at which the agent had hyperparameters equal + to those specified in hypers. + + Examples + -------- + >>> data = ... # Some data from an experiment + >>> hypers = {"critic_lr": 0.01, "actor_lr": 1.0} + >>> ind = which(data, hypers) + >>> print(ind in data["experiment_data"]) + True + """ + for ind in data["experiment_data"]: + is_equal = True + agent_hypers = data["experiment_data"][ind]["agent_hyperparams"] + + # Ensure that all keys in each dictionary are equal + if equal_keys and set(agent_hypers.keys()) != set(hypers.keys()): + continue + + # For the current set of agent hyperparameters (index ind), check to + # see if all hyperparameters used by the agent are equal to those + # specified by hypers. If not, then break and check the next set of + # agent hyperparameters. + for h in hypers: + if h in agent_hypers and hypers[h] != agent_hypers[h]: + is_equal = False + break + + if is_equal: + return ind + + # No agent hyperparameters were found that coincided with the argument + # hypers + return None diff --git a/utils/max_time.py b/utils/max_time.py new file mode 100644 index 0000000..85d6f39 --- /dev/null +++ b/utils/max_time.py @@ -0,0 +1,27 @@ +#!/usr/bin/env python3 + +# This script looks through all runs of an experiment, over all hyper settings. +# It will return the runtime from the longest running experiment. 
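+#
+# Example (hypothetical path; any data.pkl produced by main.py works):
+#     python3 utils/max_time.py results/MountainCar-v0_linearACresults/data.pkl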
+
+import numpy as np
+import pickle
+import sys
+import os
+
+if len(sys.argv) != 2:
+    print(f"{sys.argv[0]}: checks the maximum runtime over all runs for an "
+          "experiment")
+    print("usage:")
+    print(f"\t{sys.argv[0]} path/to/data.pkl")
+    sys.exit(1)
+
+f = sys.argv[1]
+with open(f, "rb") as infile:
+    data = pickle.load(infile)
+
+time = []
+for hyper in data["experiment_data"]:
+    for run in data["experiment_data"][hyper]["runs"]:
+        total = run["train_time"] + run["eval_time"]
+        time.append(total)
+
+print("Maximum run time (hours):", np.max(time) / 3600)
diff --git a/utils/plot_hypers.py b/utils/plot_hypers.py
new file mode 100644
index 0000000..3423907
--- /dev/null
+++ b/utils/plot_hypers.py
@@ -0,0 +1,103 @@
+import pickle
+import functools
+from tqdm import tqdm
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import scipy
+import json
+import sys
+import seaborn as sns
+import plot_utils as plot
+import matplotlib as mpl
+import experiment_utils as exp
+import hypers
+
+
+# Place environment name with type of environment in type_map so that we know
+# how to plot/evaluate. This terrible code-style is due to legacy code which
+# badly needs to be fixed.
+CONTINUING = "continuing"
+EPISODIC = "episodic"
+type_map = {
+    "MinAtarBreakout": EPISODIC,
+    "MinAtarFreeway": EPISODIC,
+    "PendulumFixed-v0": EPISODIC,
+    "Acrobot-v1": EPISODIC,
+    "BipedalWalker-v3": EPISODIC,
+    "LunarLanderContinuous-v2": EPISODIC,
+    "Bimodal1DEnv": CONTINUING,
+    "Hopper-v2": EPISODIC,
+    "PuddleWorld-v1": EPISODIC,
+    "MountainCar-v0": EPISODIC,
+    "MountainCarContinuous-v0": EPISODIC,
+    "Pendulum-v0": CONTINUING,
+    "Pendulum-v1": CONTINUING,
+    "Walker2d": EPISODIC,
+    "Swimmer-v2": EPISODIC
+}
+
+if len(sys.argv) < 5:
+    print("invalid number of inputs:")
+    print(f"\t{sys.argv[0]} env_json save_dir hyper agent_json(s)")
+    sys.exit(1)
+
+env_json = sys.argv[1]
+DIR = sys.argv[2]
+HYPER = sys.argv[3]
+agent_json = sys.argv[4:]
+
+# Load configuration files
+with open(env_json, "r") as infile:
+    env_config = json.load(infile)
+if "gamma" not in env_config:
+    env_config["gamma"] = -1
+
+agent_configs = []
+for j in agent_json:
+    with open(j, "r") as infile:
+        agent_configs.append(json.load(infile))
+
+ENV = env_config["env_name"]
+ENV_TYPE = type_map[ENV]
+PERFORMANCE_METRIC_TYPE = "train"
+DATA_FILE = "data.pkl"
+
+
+# Script
+DATA_FILES = []
+for config in agent_configs:
+    agent = config["agent_name"]
+    if DIR:
+        DATA_FILES.append(f"./results/{DIR}/{ENV}_{agent}results")
+    else:
+        DATA_FILES.append(f"./results/{ENV}_{agent}results")
+
+DATA = []
+for f in DATA_FILES:
+    with open(os.path.join(f, DATA_FILE), "rb") as infile:
+        DATA.append(pickle.load(infile))
+
+# Generate labels for plots
+labels = []
+for ag in DATA:
+    labels.append([ag["experiment"]["agent"]["agent_name"]])
+colours = [["#003f5c"], ["#bc5090"], ["#ffa600"], ["#ff6361"], ["#58cfa1"]]
+
+# Plot the hyperparameter sensitivities
+all_fig, all_ax = plot.hyper_sensitivity(DATA, HYPER)
+
+# Adjust axis spines
+all_ax.spines['top'].set_visible(False)
+all_ax.spines['right'].set_visible(False)
+all_ax.spines['bottom'].set_linewidth(2)
+all_ax.spines['left'].set_linewidth(2)
+
+# Set title and legend
+all_ax.set_title(
+    HYPER + " " + os.path.splitext(os.path.basename(env_json))[0]
+)
+all_ax.legend()
+
+all_fig.savefig(
+    f"{os.path.expanduser('~')}/{ENV}_{HYPER}.png",
+    bbox_inches="tight",
+)
+exit(0)
diff --git a/utils/plot_mse.py b/utils/plot_mse.py
new file mode 100644
index 0000000..f705bf4
--- /dev/null
+++ b/utils/plot_mse.py
@@ -0,0 +1,117 @@
+import pickle
+import seaborn as sns
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import hypers
+import json
+import sys
+import plot_utils as plot
+import matplotlib as mpl
+mpl.rcParams["font.size"] = 24
+mpl.rcParams["svg.fonttype"] = "none"
+
+
+# Place environment name with type of environment in type_map so that we know
+# how to plot/evaluate. This terrible code-style is due to legacy code which
+# badly needs to be fixed.
+CONTINUING = "continuing"
+EPISODIC = "episodic"
+type_map = {
+    "MinAtarBreakout": EPISODIC,
+    "MinAtarFreeway": EPISODIC,
+    "LunarLanderContinuous-v2": EPISODIC,
+    "Bimodal3Env": CONTINUING,
+    "Bimodal2DEnv": CONTINUING,
+    "Bimodal1DEnv": CONTINUING,
+    "BipedalWalker-v3": EPISODIC,
+    "Hopper-v2": EPISODIC,
+    "PuddleWorld-v1": EPISODIC,
+    "MountainCar-v0": EPISODIC,
+    "MountainCarContinuous-v0": EPISODIC,
+    "PendulumFixed-v0": CONTINUING,
+    "Pendulum-v0": CONTINUING,
+    "Acrobot-v1": EPISODIC,
+    "Pendulum-v1": CONTINUING,
+    "Walker2d": EPISODIC,
+    "Swimmer-v2": EPISODIC
+    }
+
+if len(sys.argv) < 4:
+    raise ValueError("""invalid arguments, call ./plot_mse.py
+                     path/to/env_config results_subdir
+                     path/to/agent_config(s)
+                     """)
+env_json = sys.argv[1]
+DIR = sys.argv[2]
+agent_json = sys.argv[3:]
+
+# Load configuration files
+with open(env_json, "r") as infile:
+    env_config = json.load(infile)
+agent_configs = []
+for j in agent_json:
+    with open(j, "r") as infile:
+        agent_configs.append(json.load(infile))
+
+ENV = env_config["env_name"]
+ENV_TYPE = type_map[ENV]
+PERFORMANCE_METRIC_TYPE = "train"
+DATA_FILE = "data.pkl"
+
+
+# Script
+DATA_FILES = []
+for config in agent_configs:
+    agent = config["agent_name"]
+    if DIR:
+        DATA_FILES.append(f"./results/{DIR}/{ENV}_{agent}results")
+    else:
+        DATA_FILES.append(f"./results/{ENV}_{agent}results")
+
+DATA = []
+for f in DATA_FILES:
+    print(f"Opening file: {f}")
+    with open(os.path.join(f, DATA_FILE), "rb") as infile:
+        DATA.append(pickle.load(infile))
+
+# Find best hypers
+BEST_IND = []
+for agent in DATA:
+    best_hp = hypers.best(agent)[0]
+    BEST_IND.append(best_hp)
+
+# Generate labels for plots
+labels = []
+for ag in DATA:
+    labels.append([ag["experiment"]["agent"]["agent_name"]])
+
+CMAP = "tab10"
+colours = list(sns.color_palette(CMAP, 8).as_hex())
+colours = list(map(lambda x: [x], colours))
+plt.rcParams["axes.prop_cycle"] = mpl.cycler(color=sns.color_palette(CMAP))
+
+# Plot the mean + standard error
+print("=== Plotting mean with standard error")
+PLOT_TYPE = "train"
+SOLVED = 0
+TYPE = "online" if PLOT_TYPE == "train" else "offline"
+best_ind = list(map(lambda x: [x], BEST_IND))
+
+plot_labels = list(map(lambda x: x[0], labels))  # Adjust labels for plot
+fig, ax = plot.mean_with_stderr(
+    DATA,
+    PLOT_TYPE,
+    best_ind,
+    [5000]*len(best_ind),
+    plot_labels,
+    env_type="episodic",
+    figsize=(16, 16),
+    colours=colours,
+)
+ax.set_title(ENV)
+
+fig.savefig(
+    f"{os.path.expanduser('~')}/{ENV}.png",
+    bbox_inches="tight",
+)
diff --git a/utils/plot_runs_separate.py b/utils/plot_runs_separate.py
new file mode 100644
index 0000000..9259ad4
--- /dev/null
+++ b/utils/plot_runs_separate.py
@@ -0,0 +1,232 @@
+# Plot each separate run on a different sub-axis, ordered by AUC
+
+import pickle
+from math import ceil
+import seaborn as sns
+import functools
+from tqdm import tqdm
+import os
+import matplotlib.pyplot as plt
+import numpy as np
+import scipy
+import json
+import sys
+import plot_utils as plot
+import matplotlib as mpl
+mpl.rcParams["font.size"] = 24
+
+try:
+    import hypers
+    import runs
+except ModuleNotFoundError:
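+    # Fall back to package-qualified imports when this script is run from the
+    # repository root rather than from within utils/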
+    import utils.hypers as hypers
+    import utils.runs as runs
+
+# Set up plots
+params = {
+    'axes.labelsize': 8,
+    'axes.titlesize': 32,
+    'legend.fontsize': 16,
+    'xtick.labelsize': 24,
+    'ytick.labelsize': 24
+}
+plt.rcParams.update(params)
+
+plt.rc('text', usetex=False)  # You might want usetex=True to get DejaVu Sans
+plt.rc('font', **{'family': 'sans-serif', 'serif': ['DejaVu Sans']})
+plt.rcParams["font.family"] = "DejaVu Sans"
+plt.rcParams.update({'font.size': 32})
+plt.tick_params(top=False, right=False, labelsize=24)
+
+mpl.rcParams["svg.fonttype"] = "none"
+
+if len(sys.argv) != 4:
+    raise ValueError("""should run ./plot_runs_separate.py
+                     path/to/env_config save/dir path/to/agent_config
+                     """)
+
+env_json = sys.argv[1]
+DIR = sys.argv[2]
+agent_json = sys.argv[3]
+
+
+def get_y_bounds(env, per_env_tuning):
+    """
+    Get the bounds for the y-axis plots on `env` given that `per_env_tuning`
+    determines whether we are tuning per environment or across environments.
+    """
+    if per_env_tuning:
+        if "mountaincar" in env.lower():
+            return (-1000, -50)
+        elif "acrobot" in env.lower():
+            return (-1000, -50)
+        elif "pendulum" in env.lower():
+            return (-1000, 1000)
+    else:
+        if "mountaincar" in env.lower():
+            return (-1000, -50)
+        elif "acrobot" in env.lower():
+            return (-1000, -50)
+        elif "pendulum" in env.lower():
+            return (-1000, 950)
+
+    if "breakout" in env.lower():
+        return (0, 25)
+
+
+# Load configuration files
+with open(env_json, "r") as infile:
+    env_config = json.load(infile)
+
+with open(agent_json, "r") as infile:
+    agent_config = json.load(infile)
+agent = agent_config["agent_name"]
+
+ENV = env_config["env_name"]
+
+# The next lines are only needed if using the ICML data; comment them out
+# otherwise
+if agent == "GreedyAC":
+    agent = "cem"
+elif agent == "GreedyACSoftmax":
+    agent = "cem_softmax"
+
+if ENV == "Pendulum-v0":
+    env = "PendulumFixed-v0"
+else:
+    env = ENV
+
+if ENV == "MountainCarContinuous-v0":
+    env_config["env_name"] = "MountainCar-v0"
+    env_config["continuous"] = True
+    ENV = env_config["env_name"]
+
+# Script
+if DIR:
+    data_file = f"./results/{DIR}/{env}_{agent}results/data.pkl"
+else:
+    data_file = f"./results/{env}_{agent}results/data.pkl"
+with open(data_file, "rb") as infile:
+    data = pickle.load(infile)
+
+# Find best hypers
+# #################################
+# For new runs
+# #################################
+best_hp = hypers.best(data)[0]
+per_env_tuning = True
+
+# Expand data to ensure episodic environments have the same number of data
+# points per run
+if "pendulum" not in ENV.lower():
+    data = runs.expand_episodes(data, best_hp)
+low_x = 0
+if "pendulum" not in ENV.lower():
+    high_x = np.cumsum(
+        data["experiment_data"][best_hp]["runs"][0]["train_episode_steps"]
+    )[-1]
+else:
+    high_x = len(
+        data["experiment_data"][best_hp]["runs"][0]["train_episode_steps"]
+    )
+
+# Go through and get the list of run indices, ordered by AUC
+num_runs = list(range(len(data["experiment_data"][best_hp]["runs"])))
+auc = hypers.get_performance(data, best_hp, repeat=False).mean(axis=-1)
+order = np.argsort(auc)
+
+# Figure out the number of rows and columns for the subplots
+num_plots = len(num_runs)
+COLS = 4
+ROWS = max(1, ceil(num_plots / COLS))
+fig = plt.figure(figsize=(7 * COLS, 4.8 * ROWS), constrained_layout=True)
+spec = fig.add_gridspec(ROWS, COLS)
+
+# Plot
+low_y, high_y = get_y_bounds(ENV, per_env_tuning)
+returns = []
+for i, run_num in enumerate(order):
+    run = data["experiment_data"][best_hp]["runs"][run_num]
+
+    # Figure out which row and column of the subplots we are on
+    y = i //
COLS + x = i - y * COLS + ax = fig.add_subplot(spec[y, x]) + + if "pendulum" not in ENV.lower(): + # If an episodic environment, ignore the last episode, since it will be + # cut off. We actually are cutting off too much here, but the + # alternative is to iterate over the entire data set twice, since we + # need to find the maximum steps for the last episode. + cutoff = env_config["steps_per_episode"] + performance = run["train_episode_rewards"][:-cutoff] + else: + performance = run["train_episode_rewards"] + + ax.plot( + performance, + label=f"Run {i}", + linewidth=2.5, + color="#007bff", + ) + + # Only set x ticks for bottom row + if y == ROWS-1: + ax.set_xticks([low_x, high_x]) + else: + ax.set_xticks([]) + + # Only set y ticks for leftmost column + if x == 0: + ax.set_yticks(get_y_bounds(ENV, per_env_tuning)) + else: + ax.set_yticks([]) + + # Set axis title and bounds + ax.set_title(f"Run {i}") + ax.set_xlim(low_x, high_x) + ax.set_ylim(low_y-10, high_y+10) + + # Adjust axis spines + ax.spines['top'].set_visible(False) + ax.spines['right'].set_visible(False) + ax.spines['bottom'].set_linewidth(2) + ax.spines['left'].set_linewidth(2) + + returns.append(performance) + +# Calculate returns and stderr of returns +returns = np.array(returns) +mean = returns.mean(axis=0) +stderr = np.std(returns, axis=0, ddof=1) +stderr /= np.sqrt(returns.shape[0]) + +ax = fig.add_subplot(spec[:, COLS-1]) +ax.fill_between( + np.arange(mean.shape[-1]), + mean-stderr, + mean+stderr, + alpha=0.1, + color="#161c1e", +) +ax.plot(mean, label="Mean", linewidth=3.0, color="#161c1e") + +# Set title and axes limits +ax.set_title("Mean") +ax.set_xlim(low_x, high_x) +ax.set_ylim(low_y-10, high_y+10) +ax.set_yticks(get_y_bounds(ENV, per_env_tuning)) +ax.set_xticks([low_x, high_x]) + +# Adjust axis spines +ax.spines['top'].set_visible(False) +ax.spines['right'].set_visible(False) +ax.spines['bottom'].set_linewidth(2) +ax.spines['left'].set_linewidth(2) + +# Add the figure title +fig.suptitle(ENV) + +fig.savefig( + f"{os.path.expanduser('~')}/{ENV}_{agent}_runs.png", + bbox_inches="tight", +) diff --git a/utils/plot_utils.py b/utils/plot_utils.py new file mode 100644 index 0000000..c572561 --- /dev/null +++ b/utils/plot_utils.py @@ -0,0 +1,1456 @@ +# Import modules +import matplotlib.pyplot as plt +from matplotlib.lines import Line2D +from matplotlib import ticker, gridspec +import experiment_utils as exp +import numpy as np +from scipy import ndimage +import seaborn as sns +from collections.abc import Iterable +import pickle +import matplotlib as mpl +import hypers +import warnings +import runs + +TRAIN = "train" +EVAL = "eval" + + +# Set up plots +params = { + 'axes.labelsize': 48, + 'axes.titlesize': 36, + 'legend.fontsize': 16, + 'xtick.labelsize': 48, + 'ytick.labelsize': 48 +} +plt.rcParams.update(params) + +plt.rc('text', usetex=False) # You might want usetex=True to get DejaVu Sans +plt.rc('font', **{'family': 'sans-serif', 'serif': ['DejaVu Sans']}) +plt.rcParams["font.family"] = "DejaVu Sans" +plt.rcParams.update({'font.size': 15}) +plt.tick_params(top=False, right=False, labelsize=20) + +mpl.rcParams["svg.fonttype"] = "none" + + +# Constants +EPISODIC = "episodic" +CONTINUING = "continuing" + + +# Colours +CMAP = "tab10" +DEFAULT_COLOURS = list(sns.color_palette(CMAP, 6).as_hex()) +plt.rcParams["axes.prop_cycle"] = mpl.cycler(color=sns.color_palette(CMAP)) +OFFSET = 0 # The offset to start in DEFAULT_COLOURS + + +def episode_steps(data, type_, ind, labels, xlim=None, + ylim=None, colours=None, 
xlabel="episodes", + ylabel="steps to goal", figsize=(16, 9), + title="Steps to Goal", α=0.2): + """ + Plot steps per episode + + Parameters + ---------- + data : TODO + type_ : TODO + ind : TODO + smooth_over : TODO + labels : TODO + xlim : TODO, optional + ylim : TODO, optional + colours : TODO, optional + xlabel : TODO, optional + ylabel : TODO, optional + + Returns + ------- + TODO + + """ + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + fig, ax = _setup_fig(None, None, figsize, xlim=xlim, ylim=ylim, + xlabel=xlabel, ylabel=ylabel, title=title) + + # For a single dict, then many + for i in range(len(data)): + for j in range(len(ind[i])): + _episode_steps(data[i], type_, ind[i][j], colours[i][j], + labels[i], ax, α) + + ax.legend() + return fig, ax + + +def _episode_steps(data, type_, ind, colour, label, ax, α=0.2): + """ + Plot steps per episode + + Parameters + ---------- + data : TODO + type_ : TODO + ind : TODO + smooth_over : TODO + label : TODO + xlim : TODO, optional + ylim : TODO, optional + colours : TODO, optional + xlabel : TODO, optional + ylabel : TODO, optional + + Returns + ------- + TODO + + """ + key = type_ + "_episode_steps" + + # For a single dict, then many + steps_per_run = [] + lengths = [] + for run in data["experiment_data"][ind]["runs"]: + steps_per_run.append(run[key]) + lengths.append(len(steps_per_run[-1])) + + # Adjust the lengths of each run so that there are a consistent number of + # episodes in each row + min_length = min(lengths) + for i in range(len(steps_per_run)): + steps_per_run[i] = steps_per_run[i][0:min_length] + steps_per_run = np.array(steps_per_run) + + mean = steps_per_run.mean(axis=0) + std_err = np.std(steps_per_run, axis=0, ddof=1) / \ + np.sqrt(steps_per_run.shape[0]) + + print(f"Final steps to goal for {label}:", mean[-1]) + + _plot_shaded(ax, np.arange(mean.shape[0]), mean, std_err, colour, + label, α) + + +def hyper_sensitivity(data_dicts, hyper, type_=TRAIN, figsize=(16, 9), + labels=None, metric="return"): + """ + Plots the hyperparameter sensitivity curves + + Parameters + ---------- + data_dicts : list[dict] + A list of data dictionaries resulting from some experiments + hyper : str + The hyper to plot the sensitivity of + type_ : str + The type of data to plot, one of train or eval + figsize : tuple[int] + The figure size + labels : list[str] + A list of labels, of the same length as data_dicts. If None, then the + agent name is used + metric : str + The metric to gauge sensitivity by, one of 'return', 'steps' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes plotted on + """ + fig = plt.figure(figsize=figsize) + ax = fig.add_subplot() + + if type_ not in (TRAIN, EVAL): + raise ValueError(f"type_ must be one of '{TRAIN}', '{EVAL}'") + + metric = metric.lower() + if metric not in ("return", "steps"): + raise ValueError(f"metric must be one of 'return', 'steps'") + + key = type_ + "_episode_" + ("rewards" if metric == "return" else "steps") + + for j, ag in enumerate(data_dicts): + config = ag["experiment"]["agent"] + num_settings = hypers.total(config["parameters"]) + hps = config["parameters"][hyper] + + max_returns = [None] * len(hps) + max_inds = [-1] * len(hps) + runs = -1 + + for i in range(num_settings): + setting = hypers.sweeps(config["parameters"], i)[0] + ind = hps.index(setting[hyper]) + + # Get the average return for the run. 
If no such data exists, we + # assume that the agent diverged and we give it minimum performance + if i not in ag["experiment_data"].keys(): + avg_return = np.finfo(np.float64).min + continue + + # Store the total number of runs for each setting, which will be + # needed in the final loop + if len(ag["experiment_data"][i]["runs"]) > runs: + runs = len(ag["experiment_data"][i]["runs"]) + + avg_return = [] + for run in ag["experiment_data"][i]["runs"]: + avg_return.append(run[key]) + + avg_run_return = [np.mean(run) for run in avg_return] + avg_return = np.mean(avg_run_return) + + if max_returns[ind] is None or ( + metric == "return" and avg_return > max_returns[ind]) or ( + metric == "steps" and avg_return < max_returns[ind]): + max_inds[ind] = i + max_returns[ind] = avg_return + + # Go through each best hyper and get the mean performance + std err + # per run. If no data exists due to divergence, then just append nans + returns = [] + for index in max_inds: + if index not in ag["experiment_data"]: + returns.append([np.nan] * runs) + continue + + index_returns = [] + for run in ag["experiment_data"][index]["runs"]: + index_returns.append(run[key].mean()) + returns.append(index_returns) + + # Warn the user if some hyper setting does not have the expected + # number of runs + n = len(index_returns) + if n != runs: + warnings.warn(f"hyper setting {index} has only {n} " + + f"runs when {runs} runs expected") + + # To deal with hyper settings which don't have the full number of runs, + # we take each mean and standard error separately before adding to an + # array. + mean = np.array([np.mean(r) for r in returns]) + std_err = np.array([np.std(r, ddof=1) / len(r) for r in returns]) + + ag_name = ag["experiment"]["agent"]["agent_name"] + + # Any runs that failed due to invalid hypers and resulted in nans + # should have low performance. We make it 10 * lower than the lowest + # performance + std_err[np.where(np.isnan(std_err))] = 0 + min_ = np.min(mean[np.where(~np.isnan(mean))]) + mean[np.where(np.isnan(mean))] = min_ * (10 if min_ < 0 else 0.1) + + if not labels: + label = ag_name + else: + label = labels[j] + ax.plot(hps, mean, label=label) + ax.fill_between(hps, mean-std_err, mean+std_err, alpha=0.1) + + ylabel = "Steps to Goal" if metric == "steps" else "Average Return" + ax.set_ylabel(ylabel) + ax.set_xlabel(hyper) + + return fig, ax + + +def mean_with_bootstrap_conf(data, type_, ind, smooth_over, names, + fig=None, ax=None, figsize=(12, 6), + xlim=None, ylim=None, alpha=0.1, + colours=None, env_type="continuing", + significance=0.05, keep_shape=False, + xlabel=None, ylabel=None): + """ + Plots the average training or evaluation return over all runs with + confidence intervals. + + Given a list of data dictionaries of the form returned by main.py, this + function will plot each episodic return for the list of hyperparameter + settings ind each data dictionary. The ind argument is a list, where each + element is a list of hyperparameter settings to plot for the data + dictionary at the same index as this list. For example, if ind[i] = [1, 2], + then plots will be generated for the data dictionary at location i + in the data argument for hyperparameter settings ind[i] = [1, 2]. 
+ The smooth_over argument tells how many previous data points to smooth + over + + Parameters + ---------- + data : list of dict + The Python data dictionaries generated from running main.py for the + agents + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of iter of int + The list of lists of hyperparameter settings indices to plot for + each agent. For example [[1, 2], [3, 4]] means that the first agent + plots will use hyperparameter settings indices 1 and 2, while the + second will use 3 and 4. + smooth_over : list of int + The number of previous data points to smooth over for the agent's + plot for each data dictionary. Note that this is *not* the number of + timesteps to smooth over, but rather the number of data points to + smooth over. For example, if you save the return every 1,000 + timesteps, then setting this value to 15 will smooth over the last + 15 readings, or 15,000 timesteps. For example, [1, 2] will mean that + the plots using the first data dictionary will smooth over the past 1 + data points, while the second will smooth over the passed 2 data + points for each hyperparameter setting. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + names : list of str + The name of the agents, used for the legend + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of list of str + The colours to use for each hyperparameter settings plot for each data + dictionary + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + """ + fig, ax = _setup_fig(fig, ax, figsize, None, xlim, ylim, xlabel, ylabel) + + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Track the total timesteps per hyperparam setting over all episodes and + # the cumulative timesteps per episode per data dictionary (timesteps + # should be consistent between all hp settings in a single data dict) + total_timesteps = [] + cumulative_timesteps = [] + + for i in range(len(data)): + if type_ == "train": + cumulative_timesteps.append(exp.get_cumulative_timesteps(data[i] + ["experiment_data"][ind[i][0]]["runs"] + [0]["train_episode_steps"])) + elif type_ == "eval": + cumulative_timesteps.append(data[i]["experiment_data"][ind[i][0]] + ["runs"][0]["timesteps_at_eval"]) + else: + raise ValueError("type_ must be one of 'train', 'eval'") + total_timesteps.append(cumulative_timesteps[-1][-1]) + + # Find the minimum of total trained-for timesteps. 
Each plot will only + # be plotted on the x-axis until this value + min_timesteps = min(total_timesteps) + + # For each data dictionary, find the minimum index where the timestep at + # that index is >= minimum timestep + ind_ge_min_timesteps = [] + for cumulative_timesteps_per_data in cumulative_timesteps: + final_ind = np.where(cumulative_timesteps_per_data >= + min_timesteps)[0][0] + # Since indexing will stop right before the minimum, increment it + ind_ge_min_timesteps.append(final_ind + 1) + + # Plot all data for all HP settings, only up until the minimum index + # fig, ax = None, None + if env_type == "continuing": + plot_fn = _plot_mean_with_conf_continuing + else: + plot_fn = _plot_mean_with_conf_episodic + + for i in range(len(data)): + fig, ax = \ + plot_fn(data=data[i], type_=type_, + ind=ind[i], smooth_over=smooth_over[i], name=names[i], + fig=fig, ax=ax, figsize=figsize, xlim=xlim, ylim=ylim, + last_ind=ind_ge_min_timesteps[i], alpha=alpha, + colours=colours[i], significance=significance, + keep_shape=keep_shape) + + return fig, ax + + +def _plot_mean_with_conf_continuing(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), name="", + last_ind=-1, xlabel="Timesteps", + ylabel="Average Return", xlim=None, + ylim=None, alpha=0.1, colours=None, + significance=0.05, keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a single + data dictionary on a continuing environment. Bootstrap confidence intervals + are plotted as shaded regions. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + last_ind : int, optional + The index of the last element to plot in the returns list, + by default -1. This is useful if you want to plot many things on the + same axis, but all of which have a different number of elements. This + way, we can plot the first last_ind elements of each returns for each + agent. + timestep_multiply : int, optional + A value to multiply each timstep by, by default 1. This is useful if + your agent does multiple updates per timestep and you want to plot + performance vs. number of updates. + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. 
By default + 'continuing' + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + # This should be the exact same as the episodic version except without + # reducing the episodes. Follow the same structure as the episodic function + # and the continuing function with standard error. + raise NotImplementedError + + +def _plot_mean_with_conf_episodic(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), name="", + last_ind=-1, xlabel="Timesteps", + ylabel="Average Return", xlim=None, + ylim=None, alpha=0.1, colours=None, + significance=0.05, keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a single + data dictionary on an episodic environment. Bootstrap confidence intervals + are plotted as shaded regions. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + last_ind : int, optional + The index of the last element to plot in the returns list, + by default -1. This is useful if you want to plot many things on the + same axis, but all of which have a different number of elements. This + way, we can plot the first last_ind elements of each returns for each + agent. + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. 
By default + 'continuing' + significance : float, optional + The significance level for the confidence interval, by default 0.01 + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + if colours is not None and len(colours) != len(ind): + raise ValueError("must have one colour for each hyperparameter " + + "setting") + + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + # if ax is None and fig is None: + # fig = plt.figure(figsize=figsize) + # ax = fig.add_subplot() + + # if xlim is not None: + # ax.set_xlim(xlim) + # if ylim is not None: + # ax.set_ylim(ylim) + + conf_level = "{:.2f}".format(1-significance) + title = f"""Average {type_.title()} Return per Run with {conf_level} + Confidence Intervals""" + fig, ax = _setup_fig(fig, ax, figsize, title, xlim, ylim, xlabel, ylabel) + + # Plot with bootstrap confidence interval + for i in range(len(ind)): + data = runs.expand_episodes(data, ind[i], type_=type_) + + _, mean, conf = exp.get_mean_err(data, type_, ind[i], smooth_over, + exp.bootstrap_conf, + err_args={ + "significance": significance, + }, + keep_shape=keep_shape) + + mean = mean[:last_ind] + conf = conf[:, :last_ind] + + episodes = np.arange(mean.shape[0]) + + # Plot based on colours + label = f"{name}" + print(mean.shape, conf.shape, episodes.shape) + _plot_shaded(ax, episodes, mean, conf, colours[i], label, alpha) + + ax.legend() + conf_level = "{:.2f}".format(1-significance) + ax.set_title(f"""Average {type_.title()} Return per Run with {conf_level} + Confidence Intervals""") + # ax.set_ylabel(ylabel) + # ax.set_xlabel(xlabel) + + fig.show() + return fig, ax + + +def plot_mean_with_runs(data, type_, ind, smooth_over, names, colours=None, + figsize=(12, 6), xlim=None, ylim=None, alpha=0.1, + plot_avg=True, env_type="continuing", + keep_shape=False, fig=None, ax=None): + """ + Plots the mean return over all runs and the return for each run for a list + of data dictionaries and hyperparameter indices + + Plots both the mean return per episode (over runs) as well as the return + for each individual run (including "mini-runs", if a set of concurrent + episodes were run for all runs, e.g. multiple evaluation episodes per + run at set intervals) + + Note that this function takes in a list of data dictionaries and will + plot the runs for each ind (which is a list of lists, where each super-list + refers to a data dictionary and each sub-list refers to the indices for + that data dictionary to plot). + + Example + ------- + plot_mean_with_runs([sac_data, linear_data], "train", [[3439], [38, 3]], + smooth_over=[5, 2], names=["SAC", "LAC"], figsize=(12, 6), alpha=0.2, + plot_avg=True, env_type="episodic") + + will plot hyperparameter index 3439 for the sac_data, smoothing over the + last 5 episodes, and the label will have the term "SAC" in it; also plots + the mean and each individual run on the linear_data for hyperparameter + settings 38 and 3, smoothing over the last 2 episodes for each and with + the term "LAC" in the labels. + + Parameters + ---------- + data : list of dict + The Python data dictionaries generated from running main.py for the + agents + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of iter of int + The list of lists of hyperparameter settings indices to plot for + each agent. 
For example [[1, 2], [3, 4]] means that the first agent + plots will use hyperparameter settings indices 1 and 2, while the + second will use 3 and 4. + smooth_over : list of int + The number of previous data points to smooth over for the agent's + plot for each data dictionary. Note that this is *not* the number of + timesteps to smooth over, but rather the number of data points to + smooth over. For example, if you save the return every 1,000 + timesteps, then setting this value to 15 will smooth over the last + 15 readings, or 15,000 timesteps. For example, [1, 2] will mean that + the plots using the first data dictionary will smooth over the past 1 + data points, while the second will smooth over the passed 2 data + points for each hyperparameter setting. + figsize : tuple(int, int) + The size of the figure to plot + names : list of str + The name of the agents, used for the legend + colours : list of list of str, optional + The colours to use for each hyperparameter settings plot for each data + dictionary, by default None + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha to use for plots of the runs, by default 0.1 + plot_avg : bool, optional + If concurrent episodes are executed in each run (e.g. multiple + evaluation episodes are run at set intervals), then whether to plot the + performance of each separately or the average performance over all + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + fig : plt.Figure + The figure to plot on + ax : plt.Axes + The axis to plot on + + Returns + ------- + tuple of plt.figure, plt.Axes + The figure and axis plotted on + """ + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + # fig = plt.figure(figsize=figsize) + # ax = fig.add_subplot() + if env_type == "continuing": + xlabel = "Timesteps" + ylabel = "Average Reward" + else: + xlabel = "Timesteps" + ylabel = "Return" + title = "Mean Return with Runs" + + fig, ax = _setup_fig(fig, ax, figsize, title, xlim, ylim, xlabel, ylabel) + + # Plot for each data dictionary given + legend_lines = [] + legend_labels = [] + for i in range(len(data)): + for _ in range(len(ind[i])): + fig, ax, labels, lines = \ + _plot_mean_with_runs(data[i], type_, ind[i], smooth_over[i], + names[i], colours[i], figsize, xlim, ylim, + alpha, plot_avg, env_type, fig, ax, + keep_shape) + + legend_lines.extend(lines) + legend_labels.extend(labels) + + ax.legend(legend_lines, legend_labels) + fig.show() + + return fig, ax + + +def _plot_mean_with_runs(data, type_, ind, smooth_over, name, colours=None, + figsize=(12, 6), xlim=None, ylim=None, alpha=0.1, + plot_avg=True, env_type="continuing", fig=None, + ax=None, keep_shape=False): + """ + Plots the mean return over all runs and the return for each run for a + single data dictionary and for each in a list of hyperparameter settings. + + Similar to plot_mean_with_runs, except that this function takes in only + a single data dictionary. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot for + each agent. For example [1, 2] means that the agent plots will use + hyperparameter settings indices 1 and 2. 
+ smooth_over : int + The number of previous data points to smooth over for the agent's + plot. Note that this is *not* the number of timesteps to smooth over, + but rather the number of data points to smooth over. For example, + if you save the return every 1,000 timesteps, then setting this value + to 15 will smooth over the last 15 readings, or 15,000 timesteps. + figsize : tuple(int, int) + The size of the figure to plot + name : str + The name of the agents, used for the legend + colours : list of list of str, optional + The colours to use for each hyperparameter settings plot for each data + dictionary, by default None + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha to use for plots of the runs, by default 0.1 + plot_avg : bool, optional + If concurrent episodes are executed in each run (e.g. multiple + evaluation episodes are run at set intervals), then whether to plot the + performance of each separately or the average performance over all + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + tuple of plt.figure, plt.Axes, list of str, list of mpl.Lines2D + The figure and axis plotted on as well as the list of strings to use + as labels and the list of lines to include in the legend + """ + # Set up figure and axis + fig, ax = _setup_fig(fig, ax, figsize, None, xlim, ylim) + + if colours is None: + colours = _get_default_colours(ind) + + # Store the info to keep in the legend + legend_labels = [] + legend_lines = [] + + # Plot each selected hyperparameter setting in the data dictionary + for j in range(len(ind)): + fig, ax, labels, lines = \ + _plot_mean_with_runs_single_hp(data, type_, ind[j], smooth_over, + name, colours[j], figsize, xlim, + ylim, alpha, plot_avg, env_type, + fig, ax, keep_shape) + legend_labels.extend(labels) + legend_lines.extend(lines) + + return fig, ax, legend_labels, legend_lines + + +def _plot_mean_with_runs_single_hp(data, type_, ind, smooth_over, names, + colour=None, figsize=(12, 6), xlim=None, + ylim=None, alpha=0.1, plot_avg=True, + env_type="continuing", fig=None, ax=None, + keep_shape=False): + """ + Plots the mean return over all runs and the return for each run for a + single data dictionary and a single hyperparameter setting. + + Similar to _plot_mean_with_runs, except that this function takes in only + a single data dictionary and a single hyperparameter setting. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : int + The hyperparameter settings indices to plot for the agent. For example + 5 means that the agent plots will use hyperparameter settings index 5. + smooth_over : int + The number of previous data points to smooth over for the agent's + plot. Note that this is *not* the number of timesteps to smooth over, + but rather the number of data points to smooth over. For example, + if you save the return every 1,000 timesteps, then setting this value + to 15 will smooth over the last 15 readings, or 15,000 timesteps. 
+ figsize : tuple(int, int) + The size of the figure to plot + name : str + The name of the agents, used for the legend + colours : list of list of str, optional + The colours to use for each hyperparameter settings plot for each data + dictionary, by default None + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha to use for plots of the runs, by default 0.1 + plot_avg : bool, optional + If concurrent episodes are executed in each run (e.g. multiple + evaluation episodes are run at set intervals), then whether to plot the + performance of each separately or the average performance over all + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + tuple of plt.figure, plt.Axes, list of str, list of mpl.Lines2D + The figure and axis plotted on as well as the list of strings to use + as labels and the list of lines to include in the legend + """ + # if env_type == "episodic": + # data = runs.expand_episodes(data, ind, type_=type_) + + # Set up figure and axis + fig, ax = _setup_fig(fig, ax, figsize, None, xlim, ylim) + + if colour is None: + colour = _get_default_colours([ind])[0] + + # Determine the timesteps to plot at + if type_ == "eval": + timesteps = \ + data["experiment_data"][ind]["runs"][0]["timesteps_at_eval"] + + elif type_ == "train": + timesteps_per_ep = \ + data["experiment_data"][ind]["runs"][0]["train_episode_steps"] + timesteps = exp.get_cumulative_timesteps(timesteps_per_ep) + + # Plot the average reward + if env_type == "continuing": + episode_steps = data["experiment"]["environment"]["steps_per_episode"] + + # Get returns + all_returns = exp.get_returns(data, type_, ind, env_type) + + # If concurrent episodes are run in each run then average them if + # appropriate + if type_ == "eval" and plot_avg: + all_returns = all_returns.mean(axis=-1) + elif type_ == "eval" and not plot_avg: + all_returns = np.concatenate(all_returns, axis=1) + all_returns = all_returns.transpose() + elif type_ == "train": + all_returns = np.squeeze(all_returns) + + # Smooth returns + all_returns = exp.smooth(all_returns, smooth_over, keep_shape) + + # Plot the average reward + if env_type == "continuing": + episode_steps = data["experiment"]["environment"] + episode_steps = episode_steps["steps_per_episode"] + # all_returns /= episode_steps + + # Determine whether to plot episodes or timesteps on the x-axis, which is + # dependent on the environment type + if env_type == "episodic": + xvalues = np.arange(all_returns.shape[1]) # episodes + else: + xvalues = timesteps[:all_returns.shape[1]] + + # Plot each run + for run in range(all_returns.shape[0]): + print(all_returns[run].shape) + ax.plot(xvalues, all_returns[run], color=colour, linestyle="-", + alpha=alpha) + + # Plot the mean + mean_colour = "black" + # mean = all_returns.mean(axis=0) + # ax.plot(xvalues, mean, color=mean_colour) + + # Store legend identifiers for the run + legend_labels = [] + legend_lines = [] + legend_labels.append("Individual Runs") + legend_lines.append(Line2D([0], [0], color=colour, linestyle="--", + alpha=alpha)) + + # Set up the legend variables for the mean over all runs + label = f"{names}" + legend_labels.append(label) + legend_lines.append(Line2D([0], [0], color=mean_colour, linestyle="-")) + + return fig, ax, legend_labels, legend_lines + + +def mean_with_stderr(data, type_, ind, smooth_over, names, + fig=None, 
ax=None, figsize=(12, 6), + xlim=None, ylim=None, alpha=0.1, + colours=None, env_type="continuing", + keep_shape=False, xlabel="", ylabel=""): + """ + Plots the average training or evaluation return over all runs with standard + error. + + Given a list of data dictionaries of the form returned by main.py, this + function will plot each episodic return for the list of hyperparameter + settings ind each data dictionary. The ind argument is a list, where each + element is a list of hyperparameter settings to plot for the data + dictionary at the same index as this list. For example, if ind[i] = [1, 2], + then plots will be generated for the data dictionary at location i + in the data argument for hyperparameter settings ind[i] = [1, 2]. + The smooth_over argument tells how many previous data points to smooth + over + + Parameters + ---------- + data : list of dict + The Python data dictionaries generated from running main.py for the + agents + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of iter of int + The list of lists of hyperparameter settings indices to plot for + each agent. For example [[1, 2], [3, 4]] means that the first agent + plots will use hyperparameter settings indices 1 and 2, while the + second will use 3 and 4. + smooth_over : list of int + The number of previous data points to smooth over for the agent's + plot for each data dictionary. Note that this is *not* the number of + timesteps to smooth over, but rather the number of data points to + smooth over. For example, if you save the return every 1,000 + timesteps, then setting this value to 15 will smooth over the last + 15 readings, or 15,000 timesteps. For example, [1, 2] will mean that + the plots using the first data dictionary will smooth over the past 1 + data points, while the second will smooth over the passed 2 data + points for each hyperparameter setting. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + names : list of str + The name of the agents, used for the legend + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of list of str + The colours to use for each hyperparameter settings plot for each data + dictionary + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. 
By default + 'continuing' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + """ + # Set the colours to be default if not set + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + title = f"Average {type_.title()} Return per Run with Standard Error" + fig, ax = _setup_fig(fig, ax, figsize, xlim=xlim, ylim=ylim, xlabel=xlabel, + ylabel=ylabel, title=title) + + # Track the total timesteps per hyperparam setting over all episodes and + # the cumulative timesteps per episode per data dictionary (timesteps + # should be consistent between all hp settings in a single data dict) + total_timesteps = [] + cumulative_timesteps = [] + + for i in range(len(data)): + if type_ == "train": + cumulative_timesteps.append(exp.get_cumulative_timesteps(data[i] + ["experiment_data"][ind[i][0]]["runs"] + [0]["train_episode_steps"])) + elif type_ == "eval": + cumulative_timesteps.append(data[i]["experiment_data"][ind[i][0]] + ["runs"][0]["timesteps_at_eval"]) + else: + raise ValueError("type_ must be one of 'train', 'eval'") + total_timesteps.append(cumulative_timesteps[-1][-1]) + + # Find the minimum of total trained-for timesteps. Each plot will only + # be plotted on the x-axis until this value + min_timesteps = min(total_timesteps) + + # For each data dictionary, find the minimum index where the timestep at + # that index is >= minimum timestep + ind_ge_min_timesteps = [] + for cumulative_timesteps_per_data in cumulative_timesteps: + final_ind = np.where(cumulative_timesteps_per_data >= + min_timesteps)[0][0] + # Since indexing will stop right before the minimum, increment it + ind_ge_min_timesteps.append(final_ind + 1) + + # Plot all data for all HP settings, only up until the minimum index + # fig, ax = None, None + plot_fn = _plot_mean_with_stderr_continuing if env_type == "continuing" \ + else _plot_mean_with_stderr_episodic + for i in range(len(data)): + fig, ax = \ + plot_fn(data=data[i], type_=type_, + ind=ind[i], smooth_over=smooth_over[i], name=names[i], + fig=fig, ax=ax, figsize=figsize, xlim=xlim, ylim=ylim, + last_ind=ind_ge_min_timesteps[i], alpha=alpha, + colours=colours[i], keep_shape=keep_shape) + + return fig, ax + + +def _plot_mean_with_stderr_continuing(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), xlim=None, + ylim=None, xlabel=None, ylabel=None, + name="", last_ind=-1, + timestep_multiply=None, alpha=0.1, + colours=None, + keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a single + data dictionary on a continuing environment. Standard error + is plotted as shaded regions. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. 
If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + last_ind : int, optional + The index of the last element to plot in the returns list, + by default -1. This is useful if you want to plot many things on the + same axis, but all of which have a different number of elements. This + way, we can plot the first last_ind elements of each returns for each + agent. + timestep_multiply : array_like of float, optional + A value to multiply each timstep by, by default None. This is useful if + your agent does multiple updates per timestep and you want to plot + performance vs. number of updates. + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + if colours is not None and len(colours) != len(ind): + raise ValueError("must have one colour for each hyperparameter " + + "setting") + + if timestep_multiply is None: + timestep_multiply = [1] * len(ind) + + if ax is not None and fig is None: + raise ValueError("must pass figure when passing axis") + + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + if ax is None and fig is None: + title = f"Average {type_.title()} Return per Run with Standard Error" + fig, ax = _setup_fig(fig, ax, figsize, xlim=xlim, ylim=ylim, + xlabel=xlabel, ylabel=ylabel, title=title) + + episode_length = data["experiment"]["environment"]["steps_per_episode"] + + # Plot with the standard error + for i in range(len(ind)): + timesteps, mean, std = exp.get_mean_err(data, type_, ind[i], + smooth_over, exp.stderr, + keep_shape=keep_shape) + timesteps = np.array(timesteps[:last_ind]) * timestep_multiply[i] + # mean = mean[:last_ind] / episode_length + # std = std[:last_ind] / episode_length + + # Plot based on colours + label = f"{name}" + if colours is not None: + _plot_shaded(ax, timesteps, mean, std, colours[i], label, alpha) + else: + _plot_shaded(ax, timesteps, mean, std, None, label, alpha) + + ax.legend() + + fig.show() + return fig, ax + + +def _plot_mean_with_stderr_episodic(data, type_, ind, smooth_over, fig=None, + ax=None, figsize=(12, 6), name="", + last_ind=-1, xlabel="Timesteps", + ylabel="Average Return", xlim=None, + ylim=None, alpha=0.1, colours=None, + keep_shape=False): + """ + Plots the average training or evaluation return over all runs for a + single data dictionary on an episodic environment. Plots shaded retions + as standard error. + + Parameters + ---------- + data : dict + The Python data dictionary generated from running main.py + type_ : str + Which type of data to plot, one of "eval" or "train" + ind : iter of int + The list of hyperparameter settings indices to plot + smooth_over : int + The number of previous data points to smooth over. 
Note that this + is *not* the number of timesteps to smooth over, but rather the number + of data points to smooth over. For example, if you save the return + every 1,000 timesteps, then setting this value to 15 will smooth + over the last 15 readings, or 15,000 timesteps. + fig : plt.figure + The figure to plot on, by default None. If None, creates a new figure + ax : plt.Axes + The axis to plot on, by default None, If None, creates a new axis + figsize : tuple(int, int) + The size of the figure to plot + name : str, optional + The name of the agent, used for the legend + xlim : float, optional + The x limit for the plot, by default None + ylim : float, optional + The y limit for the plot, by default None + alpha : float, optional + The alpha channel for the plot, by default 0.1 + colours : list of str + The colours to use for each plot of each hyperparameter setting + env_type : str, optional + The type of environment, one of 'continuing', 'episodic'. By default + 'continuing' + + Returns + ------- + plt.figure, plt.Axes + The figure and axes of the plot + + Raises + ------ + ValueError + When an axis is passed but no figure is passed + When an appropriate number of colours is not specified to cover all + hyperparameter settings + """ + if colours is not None and len(colours) != len(ind): + raise ValueError("must have one colour for each hyperparameter " + + "setting") + + if ax is not None and fig is None: + raise ValueError("must pass figure when passing axis") + + if colours is None: + colours = _get_default_colours(ind) + + # Set up figure + if ax is None and fig is None: + fig = plt.figure(figsize=figsize) + ax = fig.add_subplot() + + if xlim is not None: + ax.set_xlim(xlim) + if ylim is not None: + ax.set_ylim(ylim) + + # Plot with the standard error + for i in range(len(ind)): + # data = exp.reduce_episodes(data, ind[i], type_=type_) + data = runs.expand_episodes(data, ind[i], type_=type_) + + # data has consistent # of episodes, so treat as env_type="continuing" + _, mean, std = exp.get_mean_err(data, type_, ind[i], smooth_over, + exp.stderr, keep_shape=keep_shape) + print(mean.shape, std.shape, "HERE") + episodes = np.arange(mean.shape[0]) + print(mean.shape, episodes[0], episodes[-1]) + + # Plot based on colours + label = f"{name}" + if colours is not None: + _plot_shaded(ax, episodes, mean, std, colours[i], label, alpha) + else: + _plot_shaded(ax, episodes, mean, std, None, label, alpha) + + ax.legend() + ax.set_title(f"Average {type_.title()} Return per Run with Standard Error") + ax.set_ylabel(ylabel) + ax.set_xlabel(xlabel) + + fig.show() + return fig, ax + + +def return_distribution(data, type_, hp_ind, bins, figsize=(12, 6), xlim=None, + ylim=None, after=0, before=-1): + """ + Plots the distribution of returns on either an episodic or continuing + environment + + Parameters + ---------- + data : dict + The data dictionary containing the runs of a single hyperparameter + setting + type_ : str, optional + The type of surface to plot, by default "surface". One of 'surface', + 'wireframe', or 'bar' + hp_ind : int, optional + The hyperparameter settings index in the data dictionary to use for + the plot, by default -1. If less than 0, then the first hyperparameter + setting in the dictionary is used. + bins : Iterable, int + The bins to use for the plot. If an Iterable, then each value in the + Iterable is considered as a cutoff for bins. 
+        If an integer, separates the returns into that many bins
+    figsize : tuple, optional
+        The size of the figure to plot, by default (12, 6)
+    xlim : 2-tuple of float, optional
+        The cutoff points for the x-axis to plot between, by default None
+    ylim : 2-tuple of float, optional
+        The cutoff points for the y-axis to plot between, by default None
+    after : int, optional
+        The index of the first episode return in each run to include when
+        computing that run's average return, by default 0
+    before : int, optional
+        The index at which to stop including episode returns from each run,
+        by default -1
+
+    Returns
+    -------
+    plt.figure, plt.Axes
+        The figure and axis plotted on
+    """
+    # Get the episode returns for each run
+    run_returns = []
+    return_type = type_ + "_episode_rewards"
+    for run in data["experiment_data"][hp_ind]["runs"]:
+        run_returns.append(np.mean(run[return_type][after:before]))
+
+    title = f"Learning Curve Distribution - HP Settings {hp_ind}"
+    return _return_histogram(run_returns, bins, figsize, title, xlim, ylim)
+
+
+def _return_histogram(run_returns, bins, figsize, title, xlim, ylim, kde=True):
+    fig = plt.figure(figsize=figsize)
+    ax = fig.add_subplot()
+
+    ax.set_title(title)
+    ax.set_xlabel("Average Return Per Run")
+    ax.set_ylabel("Relative Frequency")
+    _ = sns.histplot(run_returns, bins=bins, kde=kde)
+
+    if xlim is not None:
+        ax.set_xlim(xlim)
+    if ylim is not None:
+        ax.set_ylim(ylim)
+
+    # Plot relative frequency on the y-axis
+    ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos:
+                                 "{:.2f}".format(x / len(run_returns))))
+
+    fig.show()
+    return fig, ax
+
+
+def _get_default_colours(iter_):
+    """
+    Recursively turns elements of an Iterable into strings representing
+    colours.
+
+    This function will turn each element of an Iterable into strings that
+    represent colours, recursively. If the elements of an Iterable are
+    also Iterable, then this function will recursively descend all the way
+    through every Iterable until it finds an Iterable with non-Iterable
+    elements. These elements will be replaced by strings that represent
+    colours. In effect, this function keeps the data structure, but replaces
+    non-Iterable elements by strings representing colours. Note that this
+    function assumes that all elements of an Iterable are of the same type,
+    and so it only checks if the first element of an Iterable object is
+    Iterable or not to stop the recursion.
+
+    Parameters
+    ----------
+    iter_ : collections.Iterable
+        The top-level Iterable object to turn into an Iterable of strings of
+        colours, recursively.
+
+    Returns
+    -------
+    list of list of ... of strings
+        A data structure that has the same architecture as the input Iterable
+        but with all non-Iterable elements replaced by strings.
+    """
+    colours = []
+
+    # Calculate the number of lists at the current level to go through
+    paths = range(len(iter_))
+
+    # Return a list of colours if the elements of the list are not lists
+    if not isinstance(iter_[0], Iterable):
+        global OFFSET
+        col = [DEFAULT_COLOURS[(OFFSET + i) % len(DEFAULT_COLOURS)]
+               for i in paths]
+        OFFSET += len(paths)
+        return col
+
+    # For each list at the current level, get the colours corresponding to
+    # this level
+    for i in paths:
+        colours.append(_get_default_colours(iter_[i]))
+
+    return colours
+
+
+def _plot_shaded(ax, x, y, region, colour, label, alpha):
+    """
+    Plots a curve with a shaded region.
+
+    Parameters
+    ----------
+    ax : plt.Axes
+        The axis to plot on
+    x : Iterable
+        The points on the x-axis
+    y : Iterable
+        The points on the y-axis
+    region : list or array_like
+        The region to shade about the y points. The shaded region will be
+        y +/- region. If region is a list or 1D np.ndarray, then the region
+        is used both for the + and - portions.
+        If region is a 2D np.ndarray, then the first row will be used as the
+        lower bound (-) and the second row will be used for the upper bound
+        (+). That is, the region between (lower_bound, upper_bound) will be
+        shaded, and there will be no subtraction/adding of the y-values.
+    colour : str
+        The colour to plot with
+    label : str
+        The label to use for the plot
+    alpha : float
+        The alpha value for the shaded region
+    """
+    if colour is None:
+        colour = DEFAULT_COLOURS[0]
+
+    ax.plot(x, y, color=colour, label=label)
+    if type(region) == list:
+        ax.fill_between(x, y-region, y+region, alpha=alpha, color=colour)
+    elif type(region) == np.ndarray and len(region.shape) == 1:
+        ax.fill_between(x, y-region, y+region, alpha=alpha, color=colour)
+    elif type(region) == np.ndarray and len(region.shape) == 2:
+        ax.fill_between(x, region[0, :], region[1, :], alpha=alpha,
+                        color=colour)
+
+
+def _setup_fig(fig, ax, figsize=None, title=None, xlim=None, ylim=None,
+               xlabel=None, ylabel=None, xscale=None, yscale=None, xbase=None,
+               ybase=None):
+    if fig is None:
+        if ax is not None:
+            raise ValueError("Must specify figure when axis given")
+        if figsize is not None:
+            fig = plt.figure(figsize=figsize)
+        else:
+            fig = plt.figure()
+
+    if ax is None:
+        ax = fig.add_subplot()
+
+    if title is not None:
+        ax.set_title(title)
+
+    if xlabel is not None:
+        ax.set_xlabel(xlabel)
+
+    if ylabel is not None:
+        ax.set_ylabel(ylabel)
+
+    if xlim is not None:
+        ax.set_xlim(xlim)
+
+    if ylim is not None:
+        ax.set_ylim(ylim)
+
+    if xscale is not None:
+        if xbase is not None:
+            ax.set_xscale(xscale, base=xbase)
+        else:
+            ax.set_xscale(xscale)
+
+    if yscale is not None:
+        if ybase is not None:
+            ax.set_yscale(yscale, base=ybase)
+        else:
+            ax.set_yscale(yscale)
+
+    return fig, ax
+
+
+def reset():
+    """
+    Resets the colours offset
+    """
+    global OFFSET
+    OFFSET = 0
diff --git a/utils/runs.py b/utils/runs.py
new file mode 100644
index 0000000..f9e4d96
--- /dev/null
+++ b/utils/runs.py
@@ -0,0 +1,184 @@
+import numpy as np
+from copy import deepcopy
+
+TRAIN = "train"
+EVAL = "eval"
+
+def episodes_to(in_data, i, type_=TRAIN):
+    """
+    Restricts the number of `type_` episodes to be from episode 0 to the
+    episode right before episode i.
+    The input data dictionary is not changed. If `type_` is 'train', then the
+    training returns are restricted to be only from episodes 0 to i and the
+    'eval' episodes are restricted to reflect this. If `type_` is 'eval', then
+    the evaluation returns are restricted to be only from episode 0 to i and
+    the 'train' returns are restricted to reflect this.
+
+    By 'restricted to reflect this', we mean that the returns are
+    restricted so that the final return is at the same timestep (or
+    nearest timestep, rounding up to episode completion) as the
+    final timestep of episode i for the data of type `type_`.
+
+
+    Parameters
+    ----------
+    in_data : dict
+        The data dictionary
+    i : int
+        The episode to restrict values to
+    type_ : str
+        The type of data to restrict to be from episode 0 to i. One
+        of 'train', 'eval'.
+
+    Returns
+    -------
+    dict
+        The modified data dictionary
+    """
+    data = deepcopy(in_data)
+
+    if type_ not in (TRAIN, EVAL):
+        raise ValueError("type_ must be one of 'train', 'eval'")
+
+    key = type_
+    other = "eval" if key == "train" else "train"
+
+    for hyper in data["experiment_data"]:
+        for j in range(len(data["experiment_data"][hyper]["runs"])):
+            run_data = data["experiment_data"][hyper]["runs"][j]
+
+            if i > len(run_data[f"{key}_episode_rewards"]):
+                last = len(run_data[f"{key}_episode_rewards"])
+                raise IndexError(f"no such episode i={i}, largest episode " +
+                                 f"index is {last}")
+
+            # Adjust the data of type `type_`
+            run_data[f"{key}_episode_rewards"] = run_data[
+                f"{key}_episode_rewards"][:i]
+
+            run_data[f"{key}_episode_steps"] = run_data[
+                f"{key}_episode_steps"][:i]
+
+            # Figure out which timestep episode i happened on
+            last_step = np.cumsum(run_data[f"{key}_episode_steps"])[-1]
+
+            # Figure out which episodes to keep of the "other" type (if type_
+            # is 'train' then other is 'eval' and vice versa)
+            to_discard = np.cumsum(run_data[f"{other}_episode_steps"]) \
+                > last_step
+
+            if len(to_discard):
+                last_other_step = np.argmax(to_discard)
+
+                # Adjust the "other" data
+                run_data[f"{other}_episode_rewards"] = run_data[
+                    f"{other}_episode_rewards"][:last_other_step]
+
+                run_data[f"{other}_episode_steps"] = run_data[
+                    f"{other}_episode_steps"][:last_other_step]
+            else:
+                # Adjust the "other" data
+                run_data[f"{other}_episode_rewards"] = []
+
+                run_data[f"{other}_episode_steps"] = []
+
+    return data
+
+
+def expand_episodes(data, ind, type_='train'):
+    """
+    For each run, repeat each episode's performance measure by how many
+    timesteps that episode took to finish. This results in episodic
+    experiments having the same number of data readings per run, so that
+    performances can be averaged over runs and can be easily plotted.
+
+    This function will modify a single run's data such that if you plotted
+    only that run's data, then it would appear as a step plot. For example,
+    if we had the following episode performances:
+
+    [100, 110]
+
+    with the following number of timesteps for each episode:
+
+    [2, 3]
+
+    Then this function will modify the data so that it looks like:
+
+    [100, 100, 110, 110, 110]
+
+    Parameters
+    ----------
+    data : dict
+        The data dictionary generated by the experiment
+    ind : int
+        The hyperparameter index to adjust
+    type_ : str
+        Which data type to adjust, one of 'train', 'eval'
+
+    Returns
+    -------
+    dict
+        The modified data dictionary
+    """
+    data = deepcopy(data)
+    runs = data["experiment_data"][ind]["runs"]
+    if type_ == "train":
+        for i in range(len(runs)):
+            run_return = []
+            for j in range(len(runs[i]["train_episode_rewards"])):
+                run_return.extend([runs[i]["train_episode_rewards"][j] for _ in
+                                   range(runs[i]["train_episode_steps"][j])])
+            data["experiment_data"][ind]["runs"][i][
+                "train_episode_rewards"] = run_return
+
+    elif type_ == "eval":
+        for i in range(len(runs)):
+            run_return = []
+            for j in range(len(runs[i]["eval_episode_rewards"])):
+                run_return.extend([runs[i]["eval_episode_rewards"][j] for _ in
+                                   range(runs[i]["eval_episode_steps"][j])])
+            data["experiment_data"][ind]["runs"][i][
+                "eval_episode_rewards"] = run_return
+
+    else:
+        raise ValueError(f"unknown type {type_}")
+    return data
+
+
+def reduce_episodes(data, ind, type_):
+    """
+    Reduce the number of episodes in an episodic setting
+
+    Given a data dictionary, this function will reduce the number of episodes
+    seen on each run to the minimum among all runs for that hyperparameter
+    settings index.
+    This is needed to plot curves by episodic return.
+
+    Parameters
+    ----------
+    data : dict
+        The Python data dictionary generated from running main.py
+    ind : int
+        The hyperparameter settings index to reduce the episodes of
+    type_ : str
+        Whether to reduce the training or evaluation returns, one of 'train',
+        'eval'
+
+    Returns
+    -------
+    dict
+        The modified data dictionary
+    """
+    data = deepcopy(data)
+    runs = data["experiment_data"][ind]["runs"]
+    episodes = []
+    if type_ == "train":
+        for run in data["experiment_data"][ind]["runs"]:
+            episodes.append(len(run["train_episode_rewards"]))
+
+        min_ = np.min(episodes)
+        for i in range(len(runs)):
+            runs[i]["train_episode_rewards"] = \
+                runs[i]["train_episode_rewards"][:min_]
+
+    elif type_ == "eval":
+        for run in data["experiment_data"][ind]["runs"]:
+            episodes.append(run["eval_episode_rewards"].shape[0])
+
+        min_ = np.min(episodes)
+
+        for i in range(len(runs)):
+            runs[i]["eval_episode_rewards"] = \
+                runs[i]["eval_episode_rewards"][:min_, :]
+
+    return data
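+
+
+if __name__ == "__main__":
+    # Illustrative sketch only, not part of the experiment pipeline: it shows
+    # what expand_episodes does to a small hand-built dictionary that mirrors
+    # the documented data format (experiment_data -> hyperparameter setting
+    # index -> runs). The names below are purely for demonstration.
+    _demo = {
+        "experiment_data": {
+            0: {
+                "runs": [{
+                    "train_episode_rewards": [100, 110],
+                    "train_episode_steps": [2, 3],
+                }],
+            },
+        },
+    }
+    _expanded = expand_episodes(_demo, ind=0, type_="train")
+    # Episode 0 lasted 2 timesteps and episode 1 lasted 3, so this prints
+    # [100, 100, 110, 110, 110]
+    print(_expanded["experiment_data"][0]["runs"][0]["train_episode_rewards"])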