Questions about training #17
@ryancll I don't know what the teacher forcing strategy is. Could you describe it in more detail?
@seoungwugoh During training, instead of always feeding the previously predicted masks into memory, we can sometimes feed the ground-truth masks into memory to guide the training process. This strategy is widely used in NLP seq2seq tasks, but I'm not sure whether it is useful for STM.
@ryancll We did not use such a training technique in our work, but it seems like an interesting idea to try. I think it would be effective for very challenging training samples where the network fails to deliver good results on the first estimation.
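For readers unfamiliar with the idea, here is a minimal PyTorch-style sketch of teacher forcing applied to memory-based mask propagation. It is only an illustration of the strategy described above, not this repository's code: `model.memorize`, `model.segment`, and `tf_ratio` are hypothetical names for an STM-like memory write, memory read, and forcing probability.

```python
import random
import torch
import torch.nn.functional as F

def train_clip_teacher_forcing(model, frames, gt_masks, optimizer, tf_ratio=0.5):
    """Train on a short clip, sometimes writing ground-truth masks to memory.

    frames   : list of image tensors, one per time step
    gt_masks : list of ground-truth label maps (long tensors)
    tf_ratio : probability of feeding the ground truth into memory
               instead of the model's own (possibly poor) prediction.
    """
    # The first frame always uses the given ground-truth mask, as in STM.
    memory = [model.memorize(frames[0], gt_masks[0])]
    loss = 0.0
    for t in range(1, len(frames)):
        logits = model.segment(frames[t], memory)           # (N, C, H, W)
        loss = loss + F.cross_entropy(logits, gt_masks[t])  # (N, H, W) targets
        pred_mask = logits.argmax(dim=1)
        # Teacher forcing: with probability tf_ratio, write the ground
        # truth into memory rather than the prediction, so later frames
        # are conditioned on a clean history.
        mem_mask = gt_masks[t] if random.random() < tf_ratio else pred_mask
        memory.append(model.memorize(frames[t], mem_mask))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In seq2seq training, `tf_ratio` is often annealed from 1.0 toward 0.0 so the model gradually learns to rely on its own predictions.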
Based on your previous answers, I have two follow-up questions:
1. Why do you use only 3 frames for training? According to the paper, using more previous frames benefits model performance. Moreover, at inference time more than 3 previous frames are added to memory, which creates an inconsistency between training and testing. So why not just train on longer clips in the main training stage?
2. Is plain backpropagation (BP) or backpropagation through time (BPTT) used for gradient computation? For each sample, several frames are processed one by one, and each subsequent frame relies on the previous frames' activations and predictions. Are gradients computed each time a frame is forwarded (with previous activations detached), or only after all frames' losses have been accumulated? If it is the former, that is plain BP; otherwise it is BPTT, right?
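To make the distinction in question 2 concrete, here is a hedged sketch of the two schemes. `model.step` is a hypothetical helper (not the repo's API) that segments frame t given the running memory and returns the per-frame loss plus the new memory entry.

```python
import torch

def train_clip_bptt(model, frames, gt_masks, optimizer):
    """BPTT variant: accumulate losses over the whole clip, then run one
    backward pass, so gradients flow through earlier frames' memory."""
    memory, total_loss = [], 0.0
    for t in range(len(frames)):
        loss_t, mem_t = model.step(frames[t], gt_masks[t], memory)
        total_loss = total_loss + loss_t
        memory.append(mem_t)           # graph kept: gradients cross frames
    optimizer.zero_grad()
    total_loss.backward()              # one backward through the whole clip
    optimizer.step()

def train_clip_plain_bp(model, frames, gt_masks, optimizer):
    """Plain-BP variant: backward after every frame, with the memory entry
    detached, so no gradient flows across frame boundaries."""
    memory = []
    for t in range(len(frames)):
        loss_t, mem_t = model.step(frames[t], gt_masks[t], memory)
        optimizer.zero_grad()
        loss_t.backward()              # gradients for this frame only
        optimizer.step()
        memory.append(mem_t.detach())  # cut the graph at the frame boundary
```

The only mechanical difference is where `backward()` is called and whether the memory tensors are detached before being reused; that single `detach()` is what decides between plain BP and BPTT.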