
Reinforcement Learning

This repository contains my implementations of several reinforcement learning algorithms, including:

  • Q-Learning
  • Double Q-Learning
  • Deep Q-Learning (DQN)
  • Double DQN

Future algorithms to be included:

  • Policy Gradient (PG) (REINFORCE)
  • Actor-Critic PG
  • Vanilla PG
  • Deep Deterministic Policy Gradient (DDPG)
  • Twin-Delayed DDPG (TD3)
  • Option-Critic

Comments: -

  • While all of the code was written by me and contains my own personal touches, it is influenced by the work of @philtabor and @lweitkamp
  • All code is written in Python using PyTorch
  • While every implementation learns to solve the control problem posed by its chosen environment, none of them is the ideal solution; further hyperparameter tuning is needed. Nonetheless, the solutions offered are acceptable
  • Please read "RL Algorithms_v2.pdf" for more information

Q-Learning: -

  • Estimate $Q_{\pi}(s, a)$ via function approximation

  • Cost Function:

    • $J(\theta)=E_{\pi}[(\delta_{TD} - \hat{Q}_{\theta}(s, a))^{2}]$
    • $\delta_{TD} = r + \gamma \max_{a^{'}} \hat{Q}_{\theta}(s^{'}, a^{'})$
  • Pseudocode (a PyTorch sketch follows the example below):
    Initialize $\hat{Q}_{\theta}(s, a)$ with random weights
    for $episode = 1, 2, 3, ..., N$ do
        Initialize environment $s_{0}$
        for $t = 0, 1, 2, ..., T$ do
            Select action $a_{t}$ randomly with probability $\epsilon$, otherwise
                $a_{t} = \arg\max_{a} \hat{Q}_{\theta}(s_{t}, a)$
            Execute action $a_{t}$ in the environment and observe $r_{t + 1}$, $s_{t + 1}$, and the terminal and truncate flags
            Set TD target $\delta_{TD} = r_{t + 1}$ if the terminal or truncate flag is true, otherwise
                $\delta_{TD} = r_{t + 1} + \gamma \max_{a_{t + 1}} \hat{Q}_{\theta}(s_{t + 1}, a_{t + 1})$
            Perform a gradient descent step on
                $J(\theta)=E_{\pi}[(\delta_{TD} - \hat{Q}_{\theta}(s_{t}, a_{t}))^{2}]$
            Set $s_{t + 1}$ as the current state
        end for
    end for

  • Example:

    • My solution for the "CartPole-v1" Gym environment:
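
  • PyTorch sketch:

    A minimal, illustrative sketch of the update in the pseudocode above, not code from this repository. It assumes a small MLP Q-network and a Gymnasium-style discrete-action environment; the names QNetwork, select_action, and q_learning_step are made up for this example.

    # Hypothetical helpers for illustration only
    import random

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Approximates Q_theta(s, a) for every discrete action at once."""
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs):
            return self.net(obs)

    def select_action(q_net, state, epsilon, n_actions):
        """Epsilon-greedy: random action with probability epsilon, otherwise argmax_a Q_theta(s, a)."""
        if random.random() < epsilon:
            return random.randrange(n_actions)
        with torch.no_grad():
            return q_net(state.unsqueeze(0)).argmax(dim=1).item()

    def q_learning_step(q_net, optimizer, state, action, reward, next_state, done, gamma=0.99):
        """One gradient descent step on J(theta) = (delta_TD - Q_theta(s_t, a_t))^2."""
        q_sa = q_net(state.unsqueeze(0))[0, action]               # Q_theta(s_t, a_t)
        with torch.no_grad():                                     # the TD target is treated as a constant
            delta_td = reward                                     # terminal/truncated: delta_TD = r_{t+1}
            if not done:                                          # otherwise add the bootstrapped term
                delta_td += gamma * q_net(next_state.unsqueeze(0)).max(dim=1).values.item()
        loss = (torch.tensor(delta_td) - q_sa) ** 2               # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    Inside the episode loop from the pseudocode, select_action would produce $a_{t}$ and q_learning_step would perform the gradient descent step on the observed transition, with an optimizer such as torch.optim.Adam over q_net.parameters().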

Double Q-Learning: -

  • Estimate $Q_{\pi}(s, a)$ via function approximation

  • Use two Q-networks to handle maximization bias

  • Cost Function:

    • $J(\theta_{i})=E_{\pi}[(\delta_{TD} - \hat{Q}_{\theta_{i}}(s, a))^{2}]$
    • $\delta_{TD} = r + \gamma \hat{Q}_{\theta_{3-i}}(s^{'}, \arg\max_{a^{'}} \hat{Q}_{\theta_{i}}(s^{'}, a^{'}))$, where $\theta_{3-i}$ denotes the weights of the other network
  • Pseudocode (a PyTorch sketch follows the example below):
    Initialize $\hat{Q}_{\theta_{1}}(s, a)$ and $\hat{Q}_{\theta_{2}}(s, a)$ with random weights
    for $episode = 1, 2, 3, ..., N$ do
        Initialize environment $s_{0}$
        for $t = 0, 1, 2, ..., T$ do
            Select action $a_{t}$ randomly with probability $\epsilon$, otherwise
                $a_{t} = \arg\max_{a} ((\hat{Q}_{\theta_{1}}(s_{t}, a) + \hat{Q}_{\theta_{2}}(s_{t}, a)) / 2)$
            Execute action $a_{t}$ in the environment and observe $r_{t + 1}$, $s_{t + 1}$, and the terminal and truncate flags
            Choose $i$ uniformly at random from $\{1, 2\}$ to decide which network to update
            Set TD target $\delta_{TD} = r_{t + 1}$ if the terminal or truncate flag is true, otherwise
                $a^{+}=\arg\max_{a_{t+1}} \hat{Q}_{\theta_{i}}(s_{t+1}, a_{t+1})$
                $\delta_{TD} = r_{t + 1} + \gamma \hat{Q}_{\theta_{3-i}}(s_{t + 1}, a^{+})$
            Perform a gradient descent step on
                $J(\theta_{i})=E_{\pi}[(\delta_{TD} - \hat{Q}_{\theta_{i}}(s_{t}, a_{t}))^{2}]$
            Set $s_{t + 1}$ as the current state
        end for
    end for

  • Example:

    • My solution for the "CartPole-v1" Gym environment:
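
  • PyTorch sketch:

    A minimal, illustrative sketch of the double update in the pseudocode above, not code from this repository. It assumes two Q-networks (for example, two instances of the QNetwork sketched in the Q-Learning section), each with its own optimizer; the names select_action_double and double_q_step are made up for this example.

    # Hypothetical helpers for illustration only; q1 and q2 are assumed to be two Q-networks
    import random

    import torch

    def select_action_double(q1, q2, state, epsilon, n_actions):
        """Epsilon-greedy on the average of the two value estimates."""
        if random.random() < epsilon:
            return random.randrange(n_actions)
        with torch.no_grad():
            q_avg = (q1(state.unsqueeze(0)) + q2(state.unsqueeze(0))) / 2
            return q_avg.argmax(dim=1).item()

    def double_q_step(q1, q2, opt1, opt2, state, action, reward, next_state, done, gamma=0.99):
        """Pick which network to update uniformly at random; the other one evaluates the greedy action."""
        if random.random() < 0.5:
            online, other, optimizer = q1, q2, opt1   # update theta_1, evaluate a+ with theta_2
        else:
            online, other, optimizer = q2, q1, opt2   # update theta_2, evaluate a+ with theta_1

        q_sa = online(state.unsqueeze(0))[0, action]  # Q_theta_i(s_t, a_t)
        with torch.no_grad():
            delta_td = reward                         # terminal/truncated: delta_TD = r_{t+1}
            if not done:
                a_plus = online(next_state.unsqueeze(0)).argmax(dim=1).item()         # a+ = argmax_a Q_theta_i(s_{t+1}, a)
                delta_td += gamma * other(next_state.unsqueeze(0))[0, a_plus].item()  # Q_theta_{3-i}(s_{t+1}, a+)
        loss = (torch.tensor(delta_td) - q_sa) ** 2   # squared TD error for the chosen network
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    Evaluating the greedy action with the network that is not being updated is what reduces the maximization bias mentioned above.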
