This repository contains my implementations of several reinforcement learning algorithms:
- Q-Learning
- Double Q-Learning
- Deep Q-Learning (DQN)
- Double DQN
Future algorithms to be included:
- Policy Gradient (PG) (REINFORCE)
- Actor-Critic PG
- Vanilla PG
- Deep Deterministic Policy Gradient (DDPG)
- Twin-Delayed DDPG (TD3)
- Option-Critic
Comments:
- While all the code was written by me and contains my own personal touches, it is influenced by the works of @philtabor and @lweitkamp
- All code is written in Python using PyTorch
- While every implementation learns to solve the control problem posed by its chosen environment, none of them is an ideal solution; further hyperparameter tuning is needed. Nonetheless, the solutions offered are acceptable
- Please read "RL Algorithms_v2.pdf" for more information
Q-Learning:
- Estimate $Q_{\pi}(s, a)$ via function approximation
- Cost function:
  $J(\theta) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta}(s, a))^{2}\big]$, where $\delta_{TD} = r + \gamma \max_{a'} \hat{Q}_{\theta}(s', a')$
- Pseudocode (a minimal PyTorch sketch follows this list):
  Initialize $\hat{Q}_{\theta}(s, a)$ with random weights
  for $episode = 1, 2, 3, ..., N$ do
    Initialize environment $s_{0}$
    for $t = 0, 1, 2, ..., T$ do
      Select action $a_{t}$ randomly with probability $\epsilon$, otherwise
      $a_{t} = \arg\max_{a} \hat{Q}_{\theta}(s_{t}, a)$
      Execute action $a_{t}$ in the environment and observe $r_{t + 1}$, $s_{t + 1}$, and the terminal and truncation flags
      Set TD target $\delta_{TD} = r_{t + 1}$ if the terminal or truncation flag is true, otherwise
      $\delta_{TD} = r_{t + 1} + \gamma \max_{a} \hat{Q}_{\theta}(s_{t + 1}, a)$
      Perform a gradient descent step on
      $J(\theta) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta}(s_{t}, a_{t}))^{2}\big]$
      Set $s_{t + 1}$ as the current state
    end for
  end for
- Example:
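
Below is a minimal PyTorch sketch of the Q-Learning update described above, for illustration only. The `QNetwork` class, the function names, and the hyperparameters (`hidden=64`, `gamma=0.99`) are assumptions made for this sketch, not the actual code in this repository:

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Simple MLP approximating Q_theta(s, .) for a discrete action space (illustrative)."""
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def select_action(q_net, s, epsilon, n_actions):
    """Epsilon-greedy action selection over Q_theta(s_t, .)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(s).argmax().item())

def q_learning_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One gradient descent step on (delta_TD - Q_theta(s_t, a_t))^2."""
    q_sa = q_net(s)[a]                                    # Q_theta(s_t, a_t)
    with torch.no_grad():
        # delta_TD = r if terminal/truncated, else r + gamma * max_a Q_theta(s_{t+1}, a)
        target = r if done else r + gamma * q_net(s_next).max()
        target = torch.as_tensor(target, dtype=q_sa.dtype)
    loss = (target - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, `select_action` and `q_learning_step` would be called once per environment transition inside the two loops of the pseudocode.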
Double Q-Learning:
- Estimate $Q_{\pi}(s, a)$ via function approximation
- Use two Q-networks to handle maximization bias
- Cost function:
  $J(\theta_{i}) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta_{i}}(s, a))^{2}\big]$, where $\delta_{TD} = r + \gamma\, \hat{Q}_{\theta_{j}}\big(s', \arg\max_{a'} \hat{Q}_{\theta_{i}}(s', a')\big)$ and $j \neq i$ denotes the other network
- Pseudocode (a minimal PyTorch sketch follows this list):
  Initialize $\hat{Q}_{\theta_{1}}(s, a)$ and $\hat{Q}_{\theta_{2}}(s, a)$ with random weights
  for $episode = 1, 2, 3, ..., N$ do
    Initialize environment $s_{0}$
    for $t = 0, 1, 2, ..., T$ do
      Select action $a_{t}$ randomly with probability $\epsilon$, otherwise
      $a_{t} = \arg\max_{a} \big((\hat{Q}_{\theta_{1}}(s_{t}, a) + \hat{Q}_{\theta_{2}}(s_{t}, a)) / 2\big)$
      Execute action $a_{t}$ in the environment and observe $r_{t + 1}$, $s_{t + 1}$, and the terminal and truncation flags
      Choose $i$ uniformly at random from $\{1, 2\}$ and update $\hat{Q}_{\theta_{i}}$:
        Set TD target $\delta_{TD} = r_{t + 1}$ if the terminal or truncation flag is true, otherwise
        $a^{+} = \arg\max_{a} \hat{Q}_{\theta_{i}}(s_{t + 1}, a)$
        $\delta_{TD} = r_{t + 1} + \gamma \hat{Q}_{\theta_{3 - i}}(s_{t + 1}, a^{+})$
        Perform a gradient descent step on
        $J(\theta_{i}) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta_{i}}(s_{t}, a_{t}))^{2}\big]$
      Set $s_{t + 1}$ as the current state
    end for
  end for
- Example:
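
Below is a minimal PyTorch sketch of the Double Q-Learning update described above, reusing the hypothetical `QNetwork` from the previous sketch. Here `q_nets` and `optimizers` are assumed to be lists holding the two networks and their two optimizers; the names are illustrative, not the repository's actual code:

```python
import random
import torch

def select_action_double(q_nets, s, epsilon, n_actions):
    """Epsilon-greedy selection on the average of the two Q estimates."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_avg = (q_nets[0](s) + q_nets[1](s)) / 2
        return int(q_avg.argmax().item())

def double_q_step(q_nets, optimizers, s, a, r, s_next, done, gamma=0.99):
    """Randomly pick network i to update; the other network evaluates the greedy action."""
    i = random.randrange(2)                  # network to update (theta_i)
    j = 1 - i                                # the other network (theta_{3-i} in the pseudocode)
    q_sa = q_nets[i](s)[a]                   # Q_theta_i(s_t, a_t)
    with torch.no_grad():
        if done:
            target = torch.as_tensor(r, dtype=q_sa.dtype)
        else:
            a_plus = q_nets[i](s_next).argmax()              # a+ chosen by theta_i
            target = r + gamma * q_nets[j](s_next)[a_plus]   # evaluated by theta_j
    loss = (target - q_sa) ** 2
    optimizers[i].zero_grad()
    loss.backward()
    optimizers[i].step()
    return loss.item()
```

Splitting action selection (under $\theta_{i}$) from evaluation (under $\theta_{j}$) is what removes the maximization bias of the single-network update.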

