This repository contains my implementations of several reinforcement learning algorithms:
- Q-Learning
- Double Q-Learning
- Deep Q-Learning (DQN)
- Double DQN
Future algorithms to be included:
- Policy Gradient (PG) (REINFORCE)
- Actor-Critic PG
- Vanilla PG
- Deep Deterministic Policy Gradient (DDPG)
- Twin-Delayed DDPG (TD3)
- Option-Critic
Comments:
- While all the code was written by me and contains my own personal touches, it is influenced by the works of @philtabor and @lweitkamp
- All code is written in Python using PyTorch
- While every implementation learns to solve the control problem posed by its chosen environment, none of them is an ideal solution; further hyperparameter tuning is needed. Nonetheless, the solutions offered are acceptable
- Please read "RL Algorithms_v2.pdf" for more information
Q-Learning:
- Estimate $Q_{\pi}(s, a)$ via function approximation
- Cost function:
  $J(\theta) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta}(s, a))^{2}\big]$, where $\delta_{TD} = r + \gamma \max_{a'} \hat{Q}_{\theta}(s', a')$
- Pseudocode (a minimal PyTorch sketch follows this list):
  Initialize $\hat{Q}_{\theta}(s, a)$ with random weights
  for $episode = 1, 2, 3, ..., N$ do
    Initialize environment $s_{0}$
    for $t = 0, 1, 2, ..., T$ do
      Select action $a_{t}$ randomly with probability $\epsilon$, otherwise
      $a_{t} = \arg\max_{a} \hat{Q}_{\theta}(s_{t}, a)$
      Execute action $a_{t}$ in the environment and observe $r_{t + 1}$, $s_{t + 1}$, and the terminal and truncation flags
      Set TD target $\delta_{TD} = r_{t + 1}$ if the terminal or truncation flag is true, otherwise
      $\delta_{TD} = r_{t + 1} + \gamma \max_{a} \hat{Q}_{\theta}(s_{t + 1}, a)$
      Perform a gradient descent step on
      $J(\theta) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta}(s_{t}, a_{t}))^{2}\big]$
      Set $s_{t + 1}$ as the current state
    end for
  end for
- Example:
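
Below is a minimal PyTorch sketch of the Q-Learning update described above, for illustration only. The `QNetwork` class, the function names, and the hyperparameters (`hidden=64`, `gamma=0.99`) are assumptions made for this sketch, not the actual code in this repository:

```python
import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Simple MLP approximating Q_theta(s, .) for a discrete action space (illustrative)."""
    def __init__(self, n_states, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_states, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, s):
        return self.net(s)

def select_action(q_net, s, epsilon, n_actions):
    """Epsilon-greedy action selection over Q_theta(s_t, .)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(s).argmax().item())

def q_learning_step(q_net, optimizer, s, a, r, s_next, done, gamma=0.99):
    """One gradient descent step on (delta_TD - Q_theta(s_t, a_t))^2."""
    q_sa = q_net(s)[a]                                    # Q_theta(s_t, a_t)
    with torch.no_grad():
        # delta_TD = r if terminal/truncated, else r + gamma * max_a Q_theta(s_{t+1}, a)
        target = r if done else r + gamma * q_net(s_next).max()
        target = torch.as_tensor(target, dtype=q_sa.dtype)
    loss = (target - q_sa) ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, `select_action` and `q_learning_step` would be called once per environment transition inside the two loops of the pseudocode.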
Double Q-Learning:
- Estimate $Q_{\pi}(s, a)$ via function approximation
- Use two Q-networks to handle maximization bias
- Cost function:
  $J(\theta_{i}) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta_{i}}(s, a))^{2}\big]$, where $\delta_{TD} = r + \gamma\, \hat{Q}_{\theta_{j}}\big(s', \arg\max_{a'} \hat{Q}_{\theta_{i}}(s', a')\big)$ and $j \neq i$ denotes the other network
- Pseudocode (a minimal PyTorch sketch follows this list):
  Initialize $\hat{Q}_{\theta_{1}}(s, a)$ and $\hat{Q}_{\theta_{2}}(s, a)$ with random weights
  for $episode = 1, 2, 3, ..., N$ do
    Initialize environment $s_{0}$
    for $t = 0, 1, 2, ..., T$ do
      Select action $a_{t}$ randomly with probability $\epsilon$, otherwise
      $a_{t} = \arg\max_{a} \big((\hat{Q}_{\theta_{1}}(s_{t}, a) + \hat{Q}_{\theta_{2}}(s_{t}, a)) / 2\big)$
      Execute action $a_{t}$ in the environment and observe $r_{t + 1}$, $s_{t + 1}$, and the terminal and truncation flags
      Choose $i$ uniformly at random from $\{1, 2\}$ and update $\hat{Q}_{\theta_{i}}$:
        Set TD target $\delta_{TD} = r_{t + 1}$ if the terminal or truncation flag is true, otherwise
        $a^{+} = \arg\max_{a} \hat{Q}_{\theta_{i}}(s_{t + 1}, a)$
        $\delta_{TD} = r_{t + 1} + \gamma \hat{Q}_{\theta_{3 - i}}(s_{t + 1}, a^{+})$
        Perform a gradient descent step on
        $J(\theta_{i}) = E_{\pi}\big[(\delta_{TD} - \hat{Q}_{\theta_{i}}(s_{t}, a_{t}))^{2}\big]$
      Set $s_{t + 1}$ as the current state
    end for
  end for
- Example:
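
Below is a minimal PyTorch sketch of the Double Q-Learning update described above, reusing the hypothetical `QNetwork` from the previous sketch. Here `q_nets` and `optimizers` are assumed to be lists holding the two networks and their two optimizers; the names are illustrative, not the repository's actual code:

```python
import random
import torch

def select_action_double(q_nets, s, epsilon, n_actions):
    """Epsilon-greedy selection on the average of the two Q estimates."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        q_avg = (q_nets[0](s) + q_nets[1](s)) / 2
        return int(q_avg.argmax().item())

def double_q_step(q_nets, optimizers, s, a, r, s_next, done, gamma=0.99):
    """Randomly pick network i to update; the other network evaluates the greedy action."""
    i = random.randrange(2)                  # network to update (theta_i)
    j = 1 - i                                # the other network (theta_{3-i} in the pseudocode)
    q_sa = q_nets[i](s)[a]                   # Q_theta_i(s_t, a_t)
    with torch.no_grad():
        if done:
            target = torch.as_tensor(r, dtype=q_sa.dtype)
        else:
            a_plus = q_nets[i](s_next).argmax()              # a+ chosen by theta_i
            target = r + gamma * q_nets[j](s_next)[a_plus]   # evaluated by theta_j
    loss = (target - q_sa) ** 2
    optimizers[i].zero_grad()
    loss.backward()
    optimizers[i].step()
    return loss.item()
```

Splitting action selection (under $\theta_{i}$) from evaluation (under $\theta_{j}$) is what removes the maximization bias of the single-network update.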

