Questions about training #17

masonwang513 opened this issue May 30, 2020 · 5 comments

@masonwang513

From your previous answers:

  1. We only use 3 sampled frames for training. That's why we sample frames from the videos.
  2. The gradient is computed based on 4 samples in the batch. Backpropagation is done after all the frames are processed.

I have two further questions about these two answers:

  1. Why do you only use 3 frames for training?
    According to your paper, more previous frames do benefit model performance. What's more, at inference time more than 3 previous frames are used and added to the memory, which causes an inconsistency between training and testing. So why not just use longer clips in the main training?

  2. Is BP or backpropagation-through-time (BPTT) used for gradient computation?
    For each sample, several frames are processed one by one, and each subsequent frame relies on the previous frames' activations and predictions. So are gradients computed each time a frame is forwarded (with previous activations detached), or only after the losses of all frames have been accumulated? If it is the former, it is simple BP; otherwise it is BPTT, right?

@seoungwugoh
Owner

Hi @masonwang513,
Here are my answers:

  1. Yes, there is an inconsistency between training and testing. The reason we use only 3 frames for training is to reduce computation and speed up training. We found that our model trained on very short clips performs well on long clips, because the attention mechanism we use is not sensitive to the size of the memory.

  2. We tried both but found no big difference (detaching vs. non-detaching). The important point is to make the second forward step use the output of the first step (so that the model adapts to its own output).
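
For concreteness, here is a minimal sketch of what such a training loop could look like. The model, module names, and shapes below are placeholders for illustration, not the actual code in this repository:

```python
import torch

# Placeholder stand-in for the real STM network; names and shapes are
# illustrative only, not the repository's actual API.
class DummySTM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # frame (3 ch) + mask (1 ch) -> 2-class logits
        self.conv = torch.nn.Conv2d(4, 2, 3, padding=1)

    def memorize(self, frame, mask):
        # The real model computes memory keys/values from (frame, mask);
        # here we just concatenate them as a stand-in.
        return torch.cat([frame, mask], dim=1)

    def segment(self, frame, memory):
        # The real model reads the memory with space-time attention;
        # this stand-in ignores the memory and segments the frame alone.
        dummy_mask = torch.zeros_like(frame[:, :1])
        return self.conv(torch.cat([frame, dummy_mask], dim=1))

model = DummySTM()
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# One mini-batch: 3 frames sampled from each video, batch size B = 4.
B, H, W = 4, 64, 64
frames = [torch.randn(B, 3, H, W) for _ in range(3)]
gt_masks = [torch.randint(0, 2, (B, H, W)) for _ in range(3)]

optimizer.zero_grad()
# Frame 0: the memory is initialized with the ground-truth mask.
memory = [model.memorize(frames[0], gt_masks[0].unsqueeze(1).float())]

total_loss = 0.0
for t in range(1, 3):
    logits = model.segment(frames[t], memory)
    total_loss = total_loss + criterion(logits, gt_masks[t])
    # Key point from the answer above: the next step memorizes the model's
    # own (soft) prediction, not the ground-truth mask.
    soft_mask = torch.softmax(logits, dim=1)[:, 1:2]
    # Optionally detach soft_mask here (truncated BP); keeping the graph
    # gives BPTT. The authors report no big difference between the two.
    memory.append(model.memorize(frames[t], soft_mask))

# Losses over all predicted frames are accumulated and backpropagated once.
total_loss.backward()
optimizer.step()
```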

@ryancll commented Jun 12, 2020

Hi @seoungwugoh,
For your answer 2, did you mean that a teacher forcing strategy is not suitable for training the STM model?

@seoungwugoh
Owner

@ryancll I don't know what the teacher forcing strategy is. Could you describe it in more detail?

@ryancll commented Jun 19, 2020

@seoungwugoh During training, instead of feeding the previously predicted masks into memory, we can sometimes feed the ground-truth masks into memory to guide the training process. This strategy is widely used in NLP seq2seq tasks, but I'm not sure whether it is useful for STM.
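
Roughly, the per-frame memory write could be chosen like this (a hypothetical sketch; `mask_for_memory` and `teacher_forcing_ratio` are made-up names for illustration, not part of the STM code):

```python
import random

def mask_for_memory(gt_mask, soft_pred, teacher_forcing_ratio=0.5):
    """Choose which mask gets written into the memory for the next frame.

    gt_mask:   (B, 1, H, W) ground-truth mask (float)
    soft_pred: (B, 1, H, W) the model's own soft prediction

    With probability `teacher_forcing_ratio` the ground truth is memorized;
    otherwise the model's own prediction is, as in the original training setup.
    """
    if random.random() < teacher_forcing_ratio:
        return gt_mask
    return soft_pred
```

A common variant (scheduled sampling) starts with a high ratio and decays it toward zero over training, so the network gradually learns to cope with its own imperfect estimates.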

@seoungwugoh
Owner

@ryancll We did not use such a training technique in our work, but it seems like an interesting idea to try. I think it would be effective for very challenging training samples where the network fails to deliver good results on the first estimation.
