Pseudocode and questions

Hey, thanks for sharing this work! I really appreciate the in-depth, beginner-friendly blog post. I was wondering if this pseudocode is
1. Correct
2. Helpful to anyone else trying to understand the code

If not, feel free to close. I would also appreciate it if you could help me understand a few parts of the code. Thanks!

## Questions
1. How come the environment reward [env_reward](https://github.com/Div99/IQ-Learn/blob/main/iq_learn/iq.py#L14) is unused and [reward](https://github.com/Div99/IQ-Learn/blob/main/iq_learn/train_iq.py#L231) is entirely dependent on the output of the model? Does this algorithm only learn from the expert and never take the environment reward into account?
2. Why is `value_loss` determined entirely from the model output? Wouldn't this cause the model to collapse?
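
To make question 1 concrete, here is a minimal sketch of how I understand the reward being recovered from Q alone, as r = Q(s, a) - gamma * V(s'), with V the log-sum-exp of Q. The tabular `q_table` stands in for the Q-network, and `soft_value`, `implied_reward`, and `gamma` are illustrative names and values of my own, not identifiers taken from the repo.

```python
import math


def soft_value(q_values):
    """Soft state value V(s) = log sum_a exp(Q(s, a)), computed stably."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))


def implied_reward(q_table, state, action, next_state, done, gamma=0.99):
    """Reward recovered from Q alone: r = Q(s, a) - gamma * V(s')."""
    # Terminal transitions have no next state, so V(s') is taken as 0
    next_v = 0.0 if done else soft_value(q_table[next_state])
    return q_table[state][action] - gamma * next_v


# Hypothetical 2-state, 2-action Q-table standing in for q_net
q_table = [[1.0, 0.0], [0.5, 0.5]]
r = implied_reward(q_table, state=0, action=0, next_state=1, done=False)
r_terminal = implied_reward(q_table, state=0, action=0, next_state=1, done=True)
```

Nothing in this computation reads the environment reward, which is what prompted the question.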

## Pseudocode

```python3

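from copy import deepcopy

import torch


def init_network(state_size, action_size):
    q_net = torch.nn.Linear(state_size, action_size)
    target_net = deepcopy(q_net)
    return q_net, target_net


def episode_step(env, state, q_net, critic_optimizer, memory, expert_memory):
    # Sample an action from the soft policy pi(a | s) = softmax(Q(s, .))
    with torch.no_grad():
        probs = torch.softmax(q_net(state), dim=-1)
    action = torch.multinomial(probs, num_samples=1).item()
    # Assumes env.step returns (next_state, reward, done) with tensor states;
    # reward is stored but never read by the update below (see question 1)
    next_state, reward, done = env.step(action)
    memory.append((state, next_state, action, reward, done))  # memory = collections.deque
    update_critic(q_net, critic_optimizer, memory, expert_memory)
    target_net = deepcopy(q_net)  # hard target update after every step
    return next_state, done, target_net


def column(memory, index, dtype):
    # Stack one field of every stored transition into a batch tensor
    return torch.stack([torch.as_tensor(t[index], dtype=dtype) for t in memory])


def update_critic(q_net, critic_optimizer, memory, expert_memory):
    # The idea here is that we backprop both the rewards for the expert's actions and the agent's actions.
    # The batch dimension contains the agent's examples first, then the expert's
    # (expert_memory is assumed to hold the same 5-tuples as memory)
    def both(index, dtype):
        return torch.cat((column(memory, index, dtype), column(expert_memory, index, dtype)))

    state, next_state = both(0, torch.float32), both(1, torch.float32)
    action = both(2, torch.long).unsqueeze(1)
    done = both(4, torch.float32).unsqueeze(1)
    is_expert = torch.arange(len(state)) >= len(memory)
    # v = soft value of the current state: log sum_a exp Q(s, a)
    v = torch.logsumexp(q_net(state), dim=1, keepdim=True)
    # next_v = soft value of state(t+1)
    next_v = torch.logsumexp(q_net(next_state), dim=1, keepdim=True)
    # q = Q-value predicted for the (state, action) pair that was taken
    q = q_net(state).gather(1, action)
    loss = iq_loss(q, v, next_v, done, is_expert)
    critic_optimizer.zero_grad()
    loss.backward()
    critic_optimizer.step()


def iq_loss(q, v, next_v, done, is_expert):
    # Terminal transitions have no next state, so their next_v is masked to 0
    next_v = (1 - done) * next_v
    expert_reward = (q - next_v)[is_expert]
    # Why is value_loss determined entirely from the model output? Wouldn't this cause the model to collapse?
    value_loss = (v - next_v).mean()
    # Why is this negative?
    expert_reward_loss = -expert_reward.mean()
    loss = expert_reward_loss + value_loss
    return loss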
```
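
As a sanity check on what the `logsumexp` terms mean, here is a small sketch, assuming the policy is the softmax of the Q-values. Under that assumption the log-sum-exp value equals the policy's expected Q-value plus its entropy, so it is a soft maximum rather than a plain sum of future rewards. The Q-values are made up for illustration and the helper names are my own.

```python
import math


def logsumexp(q_values):
    """Numerically stable log sum_a exp(Q(s, a))."""
    m = max(q_values)
    return m + math.log(sum(math.exp(q - m) for q in q_values))


def softmax(q_values):
    """Policy pi(a | s) proportional to exp(Q(s, a))."""
    m = max(q_values)
    exps = [math.exp(q - m) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


q_values = [2.0, 1.0, -1.0]  # made-up Q(s, .) for a single state
pi = softmax(q_values)
v = logsumexp(q_values)

# Expected Q-value under pi, and the entropy of pi
expected_q = sum(p * q for p, q in zip(pi, q_values))
entropy = -sum(p * math.log(p) for p in pi)

# Identity: v == expected_q + entropy, and v is never below max(q_values)
gap = abs(v - (expected_q + entropy))
```

The identity follows from log pi(a | s) = Q(s, a) - V(s) for a softmax policy.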
