Description
- In http://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder, the paper text says

  > the output of each sub-layer is LayerNorm(x+Sublayer(x))... We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

  but the code is

  ```python
  return x + self.dropout(sublayer(self.norm(x)))
  ```

  It seems it should be

  ```python
  return self.norm(x + self.dropout(sublayer(x)))
  ```

  instead.
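
  For reference, a minimal side-by-side sketch of the two variants as I read them (the class names and the use of `nn.LayerNorm` are mine, not taken from the post):

  ```python
  import torch.nn as nn

  class PreNormSublayerConnection(nn.Module):
      """What the post's code does: normalize the input first, then add the
      dropped-out sub-layer output to the raw residual (no norm on the output)."""
      def __init__(self, size, dropout):
          super().__init__()
          self.norm = nn.LayerNorm(size)
          self.dropout = nn.Dropout(dropout)

      def forward(self, x, sublayer):
          return x + self.dropout(sublayer(self.norm(x)))

  class PostNormSublayerConnection(nn.Module):
      """What the quoted text describes: apply dropout to the sub-layer output,
      add it to the input, then normalize, i.e. LayerNorm(x + Dropout(Sublayer(x)))."""
      def __init__(self, size, dropout):
          super().__init__()
          self.norm = nn.LayerNorm(size)
          self.dropout = nn.Dropout(dropout)

      def forward(self, x, sublayer):
          return self.norm(x + self.dropout(sublayer(x)))
  ```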
- In `Encoder` and `Decoder`, where does the extra `norm` on top of the stack come from?
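
  For context, this is the shape I mean (a paraphrased, self-contained sketch of the post's `Encoder`, not a verbatim copy; I assume `layer` exposes a `size` attribute and use `nn.LayerNorm` in place of the post's custom `LayerNorm`):

  ```python
  import copy
  import torch.nn as nn

  class Encoder(nn.Module):
      """Sketch: a stack of N identical layers, plus one extra LayerNorm
      applied on top of the whole stack."""
      def __init__(self, layer, N):
          super().__init__()
          self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(N)])
          self.norm = nn.LayerNorm(layer.size)  # <- the extra norm in question

      def forward(self, x, mask):
          for layer in self.layers:
              x = layer(x, mask)
          return self.norm(x)  # applied once, after the last layer
  ```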
- > In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by sqrt(d_model).

  This is described in http://nlp.seas.harvard.edu/2018/04/03/attention.html#additional-components-bpe-search-averaging, but it may be better to link that section from the quoted part; I couldn't find it initially.
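
  For what it's worth, a minimal sketch of what I understand that quote to mean (module and method names here are mine, not from the post):

  ```python
  import math
  import torch.nn as nn

  class TiedEmbeddingsAndGenerator(nn.Module):
      """Sketch of the weight sharing: one matrix is used by the embedding
      lookup (scaled by sqrt(d_model)) and by the pre-softmax projection."""
      def __init__(self, vocab_size, d_model):
          super().__init__()
          self.d_model = d_model
          self.lut = nn.Embedding(vocab_size, d_model)
          self.proj = nn.Linear(d_model, vocab_size, bias=False)
          self.proj.weight = self.lut.weight  # share the same weight matrix

      def embed(self, tokens):
          # Embedding lookup, multiplied by sqrt(d_model) as the quote says.
          return self.lut(tokens) * math.sqrt(self.d_model)

      def logits(self, hidden):
          # The shared matrix reused as the pre-softmax linear transformation.
          return self.proj(hidden)
  ```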
- Should http://nlp.seas.harvard.edu/2018/04/01/attention.html link to the updated version, http://nlp.seas.harvard.edu/2018/04/03/attention.html?