Q: If the weight of a conv layer is zero, the gradient will also be zero, and the network will not learn anything. Why does "zero convolution" work?
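A: This is wrong. Let us consider a very simple case

$$y = wx + b$$

and we have

$$\frac{\partial y}{\partial w} = x, \quad \frac{\partial y}{\partial x} = w, \quad \frac{\partial y}{\partial b} = 1$$

and if $w = 0$ and $x \neq 0$, then

$$\frac{\partial y}{\partial w} \neq 0, \quad \frac{\partial y}{\partial x} = 0, \quad \frac{\partial y}{\partial b} \neq 0$$

which means as long as $x \neq 0$, one gradient descent iteration will make $w$ non-zero. Then

$$\frac{\partial y}{\partial x} \neq 0$$

so that the zero convolutions will progressively become a common conv layer with non-zero weights.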
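
The argument can be checked numerically. The following is a minimal sketch, not code from the project: it assumes a scalar layer `y = w * x + b` with a squared-error loss, and the function name `train_zero_init`, the input `x = 2.0`, the target `1.0`, and the learning rate `0.1` are all illustrative choices.

```python
def train_zero_init(x=2.0, target=1.0, lr=0.1, steps=3):
    """Gradient descent on y = w*x + b with w and b both starting at zero.

    The loss is 0.5 * (y - target)**2. All values are illustrative.
    Returns a list of (w, grad_w, grad_x) tuples recorded before each update.
    """
    w, b = 0.0, 0.0
    history = []
    for _ in range(steps):
        y = w * x + b
        dL_dy = y - target      # derivative of the loss with respect to y
        grad_w = dL_dy * x      # dy/dw = x, so non-zero whenever x != 0
        grad_b = dL_dy          # dy/db = 1
        grad_x = dL_dy * w      # dy/dx = w, so zero while w is still zero
        history.append((w, grad_w, grad_x))
        w -= lr * grad_w
        b -= lr * grad_b
    return history


if __name__ == "__main__":
    for step, (w, grad_w, grad_x) in enumerate(train_zero_init()):
        print(f"step {step}: w={w:.3f} grad_w={grad_w:.3f} grad_x={grad_x:.3f}")
```

At the first step the weight is zero and the gradient with respect to the input is zero, but the gradient with respect to the weight is not. After one update the weight is non-zero, and from the second step on the gradient with respect to the input is non-zero as well.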