[Maintenance] RL policy training #54

zqwerty · 2020-07-16T01:34:04Z

Describe the feature
I've noticed that there are a few issues (#8, #13, #15, #20, #40) mention that it's hard to train RL policy (PG, PPO, GDPL). Thanks all of you, we have fixed some bugs. To help discussion and debugging, I suggest we report the bugs all under this issue.

Since we have improved our user agenda policy (#31), the performance of RL policies in the README is out-of-date. However, as mentioned in this comment, you can still reproduce the result before the change of user policy.

Currently, we are working on training RL policies with the newest agenda policy. We greatly appreciate it If you could help!

Since the training of RL policy is unstable and sensitive to hyperparameters, here are some suggestions (and we welcome more):

Initialize the parameters by imitation learning (load parameters trained by MLE policy).
Save the model that performs best during training.
Try multiple runs with different random seeds.
Tuning hyperparameters.

sherlock1987 · 2020-07-16T03:08:45Z

GDPL still could not train, the loss is so big!

thenickben · 2020-07-16T12:18:39Z

Cool. I'll try later to set up a simple script where it's easy to change hyperparameters and seeds and get the performance, and will share it here and also do some runs!

ChrisGeishauser · 2020-07-21T15:13:13Z

Hey Nick,

regarding your question in #8 (comment)

I used the best_mle model as a starting point and did not train my own one. I managed to reproduce the results once but when I tried again, I failed somehow and do not know why.

In your code, you can also use the load-method provided by policy_sys and do not have to write your own. (but whatever feels better to you :D)

Apart from that, my code is exactly like yours, where after every epoch I evaluate my model on compele-rate and success-rate.

I dont know if you noticed it but they fixed a small bug in "convlab2/dialog_agent/env.py", where they added a line "self.sys_dst.state['user_action'] = dialog_act". This line is very important as it tells the system what was the last user action was.

When you pretrain the model on the multiwoz-data set, this information is provided and the system really exploits that knowledge to decide about the next action. I guess that PPO failed all the time because it was initialised with a model that has the knowledge about the last user-action, while it didnt got that info during training with the simulator.

I tried again with the bug fixed and now PPO trains well and reaches high performance of around 90% complete-rate and 76% success-rate after 10 epochs.

zqwerty · 2020-07-22T01:03:14Z

I dont know if you noticed it but they fixed a small bug in "convlab2/dialog_agent/env.py", where they added a line "self.sys_dst.state['user_action'] = dialog_act". This line is very important as it tells the system what was the last user action was.

Yes, please update this.

sherlock1987 · 2020-07-22T03:12:20Z

Hey, guys, I am using vanilla PPO with original reward to train. However, the evluation (success rate) is not good at all. It could not go higher when it meets 0.3, far from 0.74.

I know using MLE could help me go to 0.74, but the thing is, I have to use this vanilla PPO, since I am doing the work of reward function, my baseline is vanilla PPO.

Besides, this is the score after 40 epochs: [0.31500000000000006, 0.31875, 0.31375000000000003, 0.308125, 0.29875, 0.30000000000000004, 0.30500000000000005, 0.30125, 0.306875, 0.318125, 0.30562500000000004, 0.31249999999999994, 0.2975, 0.30062500000000003, 0.295, 0.30250000000000005, 0.29500000000000004, 0.29874999999999996, 0.30374999999999996, 0.2975, 0.3025, 0.29125, 0.28625, 0.2875, 0.28812499999999996, 0.28437500000000004, 0.284375, 0.29124999999999995, 0.28875, 0.286875, 0.289375, 0.303125, 0.30000000000000004, 0.3025, 0.29937499999999995, 0.301875, 0.313125, 0.30874999999999997, 0.31125, 0.30437500000000006, 0.29937499999999995, 0.295625, 0.298125, 0.30187499999999995, 0.30562500000000004, 0.30125, 0.29625, 0.29125, 0.3, 0.301875, 0.3025, 0.301875, 0.305625, 0.31499999999999995, 0.31250000000000006, 0.31125, 0.311875, 0.306875, 0.314375, 0.30875]

thenickben · 2020-07-23T03:15:46Z

Hey, guys, I am using vanilla PPO with original reward to train. However, the evluation (success rate) is not good at all. It could not go higher when it meets 0.3, far from 0.74.

I know using MLE could help me go to 0.74, but the thing is, I have to use this vanilla PPO, since I am doing the work of reward function, my baseline is vanilla PPO.

Besides, this is the score after 40 epochs: [0.31500000000000006, 0.31875, 0.31375000000000003, 0.308125, 0.29875, 0.30000000000000004, 0.30500000000000005, 0.30125, 0.306875, 0.318125, 0.30562500000000004, 0.31249999999999994, 0.2975, 0.30062500000000003, 0.295, 0.30250000000000005, 0.29500000000000004, 0.29874999999999996, 0.30374999999999996, 0.2975, 0.3025, 0.29125, 0.28625, 0.2875, 0.28812499999999996, 0.28437500000000004, 0.284375, 0.29124999999999995, 0.28875, 0.286875, 0.289375, 0.303125, 0.30000000000000004, 0.3025, 0.29937499999999995, 0.301875, 0.313125, 0.30874999999999997, 0.31125, 0.30437500000000006, 0.29937499999999995, 0.295625, 0.298125, 0.30187499999999995, 0.30562500000000004, 0.30125, 0.29625, 0.29125, 0.3, 0.301875, 0.3025, 0.301875, 0.305625, 0.31499999999999995, 0.31250000000000006, 0.31125, 0.311875, 0.306875, 0.314375, 0.30875]

I think the big question here would be "why is PPO not training if you don't pre-train it with MLE?". I think the answer has to be related to the complexity of this kind of environments, where without any sort of "expert knowledge" (e.g. taking info from demonstrations as in MLE) simple models like PPO will never train..

sherlock1987 · 2020-07-23T03:23:22Z

Y

Hey, guys, I am using vanilla PPO with original reward to train. However, the evluation (success rate) is not good at all. It could not go higher when it meets 0.3, far from 0.74.
I know using MLE could help me go to 0.74, but the thing is, I have to use this vanilla PPO, since I am doing the work of reward function, my baseline is vanilla PPO.
Besides, this is the score after 40 epochs: [0.31500000000000006, 0.31875, 0.31375000000000003, 0.308125, 0.29875, 0.30000000000000004, 0.30500000000000005, 0.30125, 0.306875, 0.318125, 0.30562500000000004, 0.31249999999999994, 0.2975, 0.30062500000000003, 0.295, 0.30250000000000005, 0.29500000000000004, 0.29874999999999996, 0.30374999999999996, 0.2975, 0.3025, 0.29125, 0.28625, 0.2875, 0.28812499999999996, 0.28437500000000004, 0.284375, 0.29124999999999995, 0.28875, 0.286875, 0.289375, 0.303125, 0.30000000000000004, 0.3025, 0.29937499999999995, 0.301875, 0.313125, 0.30874999999999997, 0.31125, 0.30437500000000006, 0.29937499999999995, 0.295625, 0.298125, 0.30187499999999995, 0.30562500000000004, 0.30125, 0.29625, 0.29125, 0.3, 0.301875, 0.3025, 0.301875, 0.305625, 0.31499999999999995, 0.31250000000000006, 0.31125, 0.311875, 0.306875, 0.314375, 0.30875]

I think the big question here would be "why is PPO not training if you don't pre-train it with MLE?". I think the answer has to be related to the complexity of this kind of environments, where without any sort of "expert knowledge" (e.g. taking info from demonstrations as in MLE) simple models like PPO will never train..

Yeah, I agree with that. I guess without expert trajectory, PPO will never goes to that highest point.

lalapo · 2021-01-12T06:11:40Z

is there any update about RuntimeError: CUDA error: device-side assert triggered issue?
I have trained my own RL agents, all of them were pretrained using MLE. However, when I do end-to-end evaluation using analyzer this error "RuntimeError: CUDA error: device-side assert triggered" appears. Is there any solution to this problem?

zqwerty · 2021-01-12T12:31:43Z

is there any update about RuntimeError: CUDA error: device-side assert triggered issue?
I have trained my own RL agents, all of them were pretrained using MLE. However, when I do end-to-end evaluation using analyzer this error "RuntimeError: CUDA error: device-side assert triggered" appears. Is there any solution to this problem?

this error may come from the mismatch of output dimension. You could add CUDA_LAUNCH_BLOCKING=1 args to see detail information

YenChen-Wu · 2021-05-23T05:23:04Z

ppo performance drops from 0.8 (40 epochs) to 0.4 (200 epochs)
When I train ppo without mle pretraining, the performance stuck at 0.1, while 6 months ago it reached 0.35. What has been updated in user simulator during these months? When I look into the trajectories, the rewards are weird. Sequences of rewards of 5 ruin the value estimation. e.g. -1,-1,5,5,5,5,5,5,5,5,5,5,5 (end)
Just curious does anyone have the same problem.

ChrisGeishauser · 2021-05-23T07:08:17Z

Hey Yen-Chen!

I am not sure what has been changed but maybe the following happens with the reward:

https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/evaluator/multiwoz_eval.py#L417

If you look here, you get a reward of 5 if one domain has been successfully completed. Once completed, I guess it views it as completed in every consecutive turn as well, so that you still get the reward in every turn and not just once. I don't know how you feel about that but I just changed it so that you only get rewarded for success/failure at the very end of the dialogue as we are used to it. Giving the 5 correctly would enhance learning a bit but at the moment, I skipped that.

zqwerty · 2021-05-24T01:20:35Z

Hey Yen-Chen!

I am not sure what has been changed but maybe the following happens with the reward:

https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/evaluator/multiwoz_eval.py#L417

If you look here, you get a reward of 5 if one domain has been successfully completed. Once completed, I guess it views it as completed in every consecutive turn as well, so that you still get the reward in every turn and not just once. I don't know how you feel about that but I just changed it so that you only get rewarded for success/failure at the very end of the dialogue as we are used to it. Giving the 5 correctly would enhance learning a bit but at the moment, I skipped that.

Thanks. But I think this is not the reason. We just move the reward given by the evaluator from https://github.com/thu-coai/ConvLab-2/blob/0bd551b5b3ad7ceb97b9d9a7e86e5b9bff8a9383/convlab2/dialog_agent/env.py to https://github.com/thu-coai/ConvLab-2/blob/master/convlab2/evaluator/multiwoz_eval.py#L417 . You can choose the reward function to use in env.py

YenChen-Wu · 2021-05-24T07:50:23Z

Thank you, Chris and zqwerty!

I think domain rewards of 5 should be removed or modified. Otherwise, the system agent is encouraged to stay in the completed domain and get rewards for free. I just experimented on PPO and both success rate and variance are improved after domain rewards are removed. (from 0.67 to 0.72)

Some other issues:
1.

ConvLab-2/convlab2/policy/evaluate.py

Line 249 in 3812629

reward_tot.append(np.mean(reward))

should be reward_tot.append(np.sum(reward)) but not mean.

ConvLab-2/convlab2/policy/evaluate.py

Line 193 in 3812629

env = Environment(None, simulator, None, dst_sys)

should be env = Environment(None, simulator, None, dst_sys, evaluator).
The evaluator should be included so that the testing rewards are the same as the training ones.

aaa123git · 2021-07-29T17:45:38Z

Thanks all of you. The discussions help a lot. I followed the instructions and trained a better PPO policy. The evaluation results are updated (PR #211 ). I'll share my experience.

Train an MLE policy by running python convlab2/policy/mle/multiwoz/train.py with the default config. The best model is saved automatically.
Train PPO policy by running python convlab2/policy/ppo/train.py --load_path PATH_TO_MLE --epoch 40. PATH_TO_MLE is the best MLE policy trained at step 1. Note that PATH_TO_MLE should be "xxxx/best_mle" instead of "xxxx/best_mle.pol.mdl".
Choose the best epoch by running python convlab2/policy/evaluate.py --load_path PATH_TO_PPO. I suggest you update the code and set calculate_reward=False. Otherwise, it may take you a while to find the success rate in the log.

zqwerty · 2021-07-30T03:38:40Z

@YenChen-Wu @thenickben @sherlock1987 @ChrisGeishauser
We have re-trained PPO and updated the results (PR #211), thanks @aaa123git !

ruleGreen · 2021-09-30T03:11:42Z

I am afraid that the evaluation results of PPO maybe not correct.
There are two levels of evaluation. The first one is policy/evaluate.py which is action-level, it follows the instructions by @aaa123git But the second one is tests/test_BERTNLU_xxxx.py which is sentence-level. These two results are not the same at all.

ChrisGeishauser · 2021-09-30T07:27:44Z

@ruleGreen

These two results are not the same at all.

This is reasonable as sentence-level creates a more difficult environment for the policy. On action-level, you basically have the ground-truth of the user simulator input, whereas with BERTNLU you have the incorporated error of the NLU component. Moreover, I guess the policy in the sentence-level pipeline is trained on action-level and only evaluated in the sentence-level. This creates a mismatch between training and testing. So a drop in performance is to be expected.

This was referenced Jul 16, 2020

Training PPO-algorithm #8

Closed

[BUG] for training part of policy gradient #13

Closed

[BUG] #15

Closed

[BUG] GDPL cound not train. #20

Closed

[BUG] Trained PPO from scratch won't work with Analyzer #40

Closed

zqwerty pinned this issue Jul 16, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Maintenance] RL policy training #54

[Maintenance] RL policy training #54

zqwerty commented Jul 16, 2020

sherlock1987 commented Jul 16, 2020

thenickben commented Jul 16, 2020

ChrisGeishauser commented Jul 21, 2020

zqwerty commented Jul 22, 2020

sherlock1987 commented Jul 22, 2020 •

edited

Loading

thenickben commented Jul 23, 2020

sherlock1987 commented Jul 23, 2020

lalapo commented Jan 12, 2021

zqwerty commented Jan 12, 2021

YenChen-Wu commented May 23, 2021

ChrisGeishauser commented May 23, 2021

zqwerty commented May 24, 2021 •

edited

Loading

YenChen-Wu commented May 24, 2021

aaa123git commented Jul 29, 2021

zqwerty commented Jul 30, 2021

ruleGreen commented Sep 30, 2021

ChrisGeishauser commented Sep 30, 2021

[Maintenance] RL policy training #54

[Maintenance] RL policy training #54

Comments

zqwerty commented Jul 16, 2020

sherlock1987 commented Jul 16, 2020

thenickben commented Jul 16, 2020

ChrisGeishauser commented Jul 21, 2020

zqwerty commented Jul 22, 2020

sherlock1987 commented Jul 22, 2020 • edited Loading

thenickben commented Jul 23, 2020

sherlock1987 commented Jul 23, 2020

lalapo commented Jan 12, 2021

zqwerty commented Jan 12, 2021

YenChen-Wu commented May 23, 2021

ChrisGeishauser commented May 23, 2021

zqwerty commented May 24, 2021 • edited Loading

YenChen-Wu commented May 24, 2021

aaa123git commented Jul 29, 2021

zqwerty commented Jul 30, 2021

ruleGreen commented Sep 30, 2021

ChrisGeishauser commented Sep 30, 2021

sherlock1987 commented Jul 22, 2020 •

edited

Loading

zqwerty commented May 24, 2021 •

edited

Loading