Concerns about Applying stride=2 in CausalConv and Padding Strategy for ASR
#9883
Hello, I have been exploring the implementation of
Reference:
>>> asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi")
>>> print(asr_model)
...
(encoder): ConformerEncoder(
(pre_encode): ConvSubsampling(
(out): Linear(in_features=2816, out_features=512, bias=True)
(conv): Sequential(
(0): CausalConv2D(1, 256, kernel_size=(3, 3), stride=(2, 2))
(1): ReLU(inplace=True)
(2): CausalConv2D(256, 256, kernel_size=(3, 3), stride=(2, 2), groups=256)
(3): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(4): ReLU(inplace=True)
(5): CausalConv2D(256, 256, kernel_size=(3, 3), stride=(2, 2), groups=256)
(6): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1))
(7): ReLU(inplace=True)
)
)
...
def forward(self, x,):
    x = F.pad(x, pad=(self._left_padding, self._right_padding, self._left_padding, self._right_padding))
    x = super().forward(x)
    return x

Looking forward to understanding the idea behind this.
Replies: 1 comment 4 replies
I'm not sure why strided convolutions would let the model see the future. A stride of 2 just moves the kernel by 2 positions instead of 1. Can you provide a simple example with a one-dimensional vector and a stride of 2 where a timestep is able to see the future? The following script simulates streaming, and it has shown that the outputs of the model in streaming mode (where no future context exists) are exactly the same as when you pass the whole audio at once. So it is very unlikely that such an issue exists.
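As a toy illustration of the comparison described above (this is not the NeMo streaming script, just a minimal pure-Python sketch), the snippet below runs a 1-D causal convolution with stride 2 twice: once over the whole sequence and once frame by frame. The function names, kernel values, and input are made up for the example, and it assumes left padding of `kernel_size - 1` zeros only; it does not model the right padding that appears in the `forward` quoted in the question.

```python
def causal_strided_conv1d(x, kernel, stride):
    """Offline: left-pad with len(kernel) - 1 zeros, then slide the kernel by `stride`."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(w * v for w, v in zip(kernel, padded[start:start + k]))
        for start in range(0, len(padded) - k + 1, stride)
    ]


def streaming_conv1d(frames, kernel, stride):
    """Streaming: feed one frame at a time, emit an output only when a window is complete."""
    k = len(kernel)
    buffer = [0.0] * (k - 1)  # the causal left padding
    outputs = []
    next_start = 0  # start of the next window, in padded coordinates
    for frame in frames:
        buffer.append(frame)
        # A window can be computed once its right edge (the current frame) has arrived.
        while next_start + k <= len(buffer):
            window = buffer[next_start:next_start + k]
            outputs.append(sum(w * v for w, v in zip(kernel, window)))
            next_start += stride
    return outputs


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]   # hypothetical input frames
kernel = [0.5, -1.0, 2.0]                 # hypothetical kernel weights

offline = causal_strided_conv1d(x, kernel, stride=2)
online = streaming_conv1d(x, kernel, stride=2)
print(offline)  # [2.0, 4.5, 7.5, 10.5]
print(online)   # [2.0, 4.5, 7.5, 10.5]
```

In this sketch the streaming loop never reads a frame that has not arrived yet, and it still reproduces the offline outputs. The stride only changes how often an output is emitted.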
The padding on the left (time axis) is needed to make the convolution causal. The padding on the top is added to make sure all channels are seen by the convolution. When you have an even number of channels (like 80 in most of our models) and use a convolution with a stride of 2 (the stride applies to all dimensions, not just time), the last channel may get skipped and never be seen by the convolution. By adding one padding on the top, we make the number of channels odd and ensure that all channels are seen. The extra padding does not affect the results, since the padded values are always zero and the model can easily account for them.
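The skipped-channel effect can be checked with simple index arithmetic. The sketch below is only an illustration: `covered_indices`, `pad_before`, and `pad_after` are hypothetical names, and which side the extra zero goes on is an assumption. It lists which of 80 feature indices are touched by at least one window of a kernel of size 3 with stride 2, with and without one extra zero.

```python
def covered_indices(num_features, kernel_size=3, stride=2, pad_before=0, pad_after=0):
    """Return the set of original feature indices touched by at least one window."""
    padded_len = pad_before + num_features + pad_after
    covered = set()
    for start in range(0, padded_len - kernel_size + 1, stride):
        for pos in range(start, start + kernel_size):
            original = pos - pad_before  # map back to unpadded coordinates
            if 0 <= original < num_features:
                covered.add(original)
    return covered


all_features = set(range(80))
print(sorted(all_features - covered_indices(80)))               # [79]
print(sorted(all_features - covered_indices(80, pad_after=1)))  # []
```

With 80 features and no padding, the last window starts at index 76 and ends at 78, so index 79 is never read. One extra zero makes the padded length 81, and the final window then reaches index 79.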
@VahidooX's point still holds with Test 2, @raman-r-4978. One way to explain your example is that, during inference, every alternate input frame has to wait until the next frame has arrived. Hence 2 is held until 3 arrives for further processing. So in your example of the second window [1, 2, 3], 3 is the current frame and not a future frame. With causal convolutions, the current input is always at the right end. When 2 was the current frame in the previous step, there was no real output.
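To make the window alignment concrete, here is a small sketch that lists the windows of a causal convolution with a kernel of size 3 and a stride of 2 over 1-based frame numbers. It assumes left padding of `kernel_size - 1` zeros, which reproduces the second window [1, 2, 3] from the example; `causal_windows` is a hypothetical helper, not part of NeMo.

```python
def causal_windows(num_frames, kernel_size=3, stride=2):
    """List each window as 1-based frame numbers; None marks a left-padding zero."""
    padded = [None] * (kernel_size - 1) + list(range(1, num_frames + 1))
    return [
        padded[start:start + kernel_size]
        for start in range(0, len(padded) - kernel_size + 1, stride)
    ]


for window in causal_windows(7):
    # The rightmost entry is the newest frame the window needs.
    print(window, "-> emitted when frame", window[-1], "arrives")
# [None, None, 1] -> emitted when frame 1 arrives
# [1, 2, 3] -> emitted when frame 3 arrives
# [3, 4, 5] -> emitted when frame 5 arrives
# [5, 6, 7] -> emitted when frame 7 arrives
```

No output is emitted when frames 2, 4, or 6 arrive. Each of them waits for the next frame, which then sits at the right end of its window as the current frame.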