Divergence Issue #13

@mw9385

Description

Hi, I opened this issue regarding divergence problems. I know you already suggested some solutions to this problem (#11), but some issues remain. I have spent a couple of months trying to handle this and still face the same problem. For a better discussion, I think I need to explain the training environment.

I am using your approach for a vision-based collision avoidance problem. If the agent reaches a goal point or collides, I restart the episode from a random position. The agent seems to learn well initially; however, the critic network then starts to diverge and is completely destroyed by the end. It looks as if the critic network believes the current states are very good.

Here are my training settings (see the config sketch after this list):

  • initial alpha value: 1e-3
  • actor learning rate: 3e-5
  • critic learning rate: 3e-4
  • CNN layers for actor and critic: DQN-structure layers
  • Actor network uses a Squashed Normal distribution
  • phi value (the value in lines 26~44 of iq.py): 1
  • SAC with a single critic, without updating alpha
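
For concreteness, here are the settings above written as a plain config dict; the key names are just my own labels, not identifiers from the repo:

    # Sketch of my training settings; key names are illustrative only.
    config = {
        "init_alpha": 1e-3,               # initial entropy temperature, kept fixed
        "actor_lr": 3e-5,
        "critic_lr": 3e-4,
        "encoder": "dqn_cnn",             # DQN-style CNN layers for actor and critic
        "actor_dist": "squashed_normal",  # actor outputs a Squashed Normal
        "phi": 1.0,                       # phi used in the chi^2 term of iq.py
        "single_critic": True,            # SAC with a single critic, alpha not updated
    }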

I have tried different initial alpha values and found that a higher initial alpha makes the network more stable but results in poor performance (the agent's behavior is far from the expert's). Am I using the method in a wrong way, or does it need more hyperparameter tuning?
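
For reference, this is the standard SAC automatic temperature update that I am currently skipping (a generic sketch, not code from this repo; log_alpha, target_entropy, and update_alpha are placeholder names):

    import torch

    # Standard SAC automatic temperature tuning; all names here are illustrative.
    log_alpha = torch.zeros(1, requires_grad=True)
    alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)
    target_entropy = -2.0  # commonly set to -|action_dim|

    def update_alpha(log_prob):
        """log_prob: log pi(a|s) for actions sampled from the current policy."""
        alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        return log_alpha.exp().item()  # alpha value to use in the actor/critic losses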

I attach loss figures for a clearer picture. Waiting for your response. Many thanks.
[Attached figures: rewards, regularized_loss, q_values, actor_loss]

This is the part of my iq.py that updates the critic.

    def q_update(self, current_Q, current_v, next_v, done_masks, is_expert):
        with torch.no_grad():
            # Bootstrap target; done_masks (1 at terminal transitions) cuts the bootstrap
            y = (1 - done_masks) * self.args.gamma * next_v

        # 1st loss term: maximize (Q - y) on expert transitions only
        reward = (current_Q - y)[is_expert]
        loss = -(reward).mean()

        # 2nd loss term: value difference over all sampled states
        value_loss = (current_v - y).mean()
        loss += value_loss

        # chi^2 divergence: regularization term for the IQ loss, computed on
        # both expert and policy states (works online)
        reward = current_Q - y
        chi2_loss = 1 / (4 * 0.5) * (reward ** 2).mean()
        loss += chi2_loss
        return loss
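
And this is roughly how I call it for each critic update (a simplified sketch; the batch fields and the critic/getV helper names are placeholders rather than exact identifiers from my code):

    import torch

    def critic_step(agent, policy_batch, expert_batch, critic_optimizer):
        # Stack policy and expert transitions and mark which rows are expert
        obs = torch.cat([policy_batch.obs, expert_batch.obs], dim=0)
        next_obs = torch.cat([policy_batch.next_obs, expert_batch.next_obs], dim=0)
        action = torch.cat([policy_batch.action, expert_batch.action], dim=0)
        done = torch.cat([policy_batch.done, expert_batch.done], dim=0)
        is_expert = torch.cat([torch.zeros(len(policy_batch.obs), dtype=torch.bool),
                               torch.ones(len(expert_batch.obs), dtype=torch.bool)], dim=0)

        current_Q = agent.critic(obs, action)   # Q(s, a) from the single critic
        current_v = agent.getV(obs)             # soft value V(s) under the current policy
        with torch.no_grad():
            next_v = agent.getV(next_obs)       # V(s') for the bootstrap target

        loss = agent.q_update(current_Q, current_v, next_v, done, is_expert)
        critic_optimizer.zero_grad()
        loss.backward()
        critic_optimizer.step()
        return loss.item()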
