Questions about training #17
@ryancll I don't know what the teacher forcing strategy is. Could you describe it in more detail?
@seoungwugoh During training, instead of always feeding the previously predicted masks into memory, we can sometimes feed the ground-truth masks into memory to guide the training process. This strategy is widely used in NLP seq2seq tasks, but I'm not sure whether it is useful for STM.
@ryancll We did not use such a training technique in our work, but it seems like an interesting idea to try. I think it would be effective for very challenging training samples where the network fails to deliver good results on the first estimation.
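For readers unfamiliar with the idea, here is a minimal PyTorch-style sketch of teacher forcing applied to memory-based mask propagation. It is only an illustration of the strategy described above, not this repository's code: `model.memorize`, `model.segment`, and `tf_ratio` are hypothetical names for an STM-like memory write, memory read, and forcing probability.

```python
import random
import torch
import torch.nn.functional as F

def train_clip_teacher_forcing(model, frames, gt_masks, optimizer, tf_ratio=0.5):
    """Train on a short clip, sometimes writing ground-truth masks to memory.

    frames   : list of image tensors, one per time step
    gt_masks : list of ground-truth label maps (long tensors)
    tf_ratio : probability of feeding the ground truth into memory
               instead of the model's own (possibly poor) prediction.
    """
    # The first frame always uses the given ground-truth mask, as in STM.
    memory = [model.memorize(frames[0], gt_masks[0])]
    loss = 0.0
    for t in range(1, len(frames)):
        logits = model.segment(frames[t], memory)           # (N, C, H, W)
        loss = loss + F.cross_entropy(logits, gt_masks[t])  # (N, H, W) targets
        pred_mask = logits.argmax(dim=1)
        # Teacher forcing: with probability tf_ratio, write the ground
        # truth into memory rather than the prediction, so later frames
        # are conditioned on a clean history.
        mem_mask = gt_masks[t] if random.random() < tf_ratio else pred_mask
        memory.append(model.memorize(frames[t], mem_mask))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In seq2seq training, `tf_ratio` is often annealed from 1.0 toward 0.0 so the model gradually learns to rely on its own predictions.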
Based on your previous answers, I have two follow-up questions:
1. Why do you use only 3 frames for training? According to the paper, using more previous frames benefits model performance. Moreover, at inference time more than 3 previous frames are added to memory, which creates an inconsistency between training and testing. So why not just train on longer clips in the main training stage?
2. Is plain backpropagation (BP) or backpropagation through time (BPTT) used for gradient computation? For each sample, several frames are processed one by one, and each subsequent frame relies on the previous frames' activations and predictions. Are gradients computed each time a frame is forwarded (with previous activations detached), or only after all frames' losses have been accumulated? If it is the former, that is plain BP; otherwise it is BPTT, right?
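To make the distinction in question 2 concrete, here is a hedged sketch of the two schemes. `model.step` is a hypothetical helper (not the repo's API) that segments frame t given the running memory and returns the per-frame loss plus the new memory entry.

```python
import torch

def train_clip_bptt(model, frames, gt_masks, optimizer):
    """BPTT variant: accumulate losses over the whole clip, then run one
    backward pass, so gradients flow through earlier frames' memory."""
    memory, total_loss = [], 0.0
    for t in range(len(frames)):
        loss_t, mem_t = model.step(frames[t], gt_masks[t], memory)
        total_loss = total_loss + loss_t
        memory.append(mem_t)           # graph kept: gradients cross frames
    optimizer.zero_grad()
    total_loss.backward()              # one backward through the whole clip
    optimizer.step()

def train_clip_plain_bp(model, frames, gt_masks, optimizer):
    """Plain-BP variant: backward after every frame, with the memory entry
    detached, so no gradient flows across frame boundaries."""
    memory = []
    for t in range(len(frames)):
        loss_t, mem_t = model.step(frames[t], gt_masks[t], memory)
        optimizer.zero_grad()
        loss_t.backward()              # gradients for this frame only
        optimizer.step()
        memory.append(mem_t.detach())  # cut the graph at the frame boundary
```

The only mechanical difference is where `backward()` is called and whether the memory tensors are detached before being reused; that single `detach()` is what decides between plain BP and BPTT.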