Description
- In http://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder, the paper text says

  > the output of each sub-layer is LayerNorm(x+Sublayer(x))... We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

  but the code is

  ```python
  return x + self.dropout(sublayer(self.norm(x)))
  ```

  It seems it should be

  ```python
  return self.norm(x + self.dropout(sublayer(x)))
  ```

  instead.
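
  For reference, a minimal side-by-side sketch of the two variants as I read them (the class names and the use of `nn.LayerNorm` are mine, not taken from the post):

  ```python
  import torch.nn as nn

  class PreNormSublayerConnection(nn.Module):
      """What the post's code does: normalize the input first, then add the
      dropped-out sub-layer output to the raw residual (no norm on the output)."""
      def __init__(self, size, dropout):
          super().__init__()
          self.norm = nn.LayerNorm(size)
          self.dropout = nn.Dropout(dropout)

      def forward(self, x, sublayer):
          return x + self.dropout(sublayer(self.norm(x)))

  class PostNormSublayerConnection(nn.Module):
      """What the quoted text describes: apply dropout to the sub-layer output,
      add it to the input, then normalize, i.e. LayerNorm(x + Dropout(Sublayer(x)))."""
      def __init__(self, size, dropout):
          super().__init__()
          self.norm = nn.LayerNorm(size)
          self.dropout = nn.Dropout(dropout)

      def forward(self, x, sublayer):
          return self.norm(x + self.dropout(sublayer(x)))
  ```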
- In `Encoder` and `Decoder`, where does the extra `norm` on top of the stack come from?
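
  For context, this is the shape I mean (a paraphrased, self-contained sketch of the post's `Encoder`, not a verbatim copy; I assume `layer` exposes a `size` attribute and use `nn.LayerNorm` in place of the post's custom `LayerNorm`):

  ```python
  import copy
  import torch.nn as nn

  class Encoder(nn.Module):
      """Sketch: a stack of N identical layers, plus one extra LayerNorm
      applied on top of the whole stack."""
      def __init__(self, layer, N):
          super().__init__()
          self.layers = nn.ModuleList([copy.deepcopy(layer) for _ in range(N)])
          self.norm = nn.LayerNorm(layer.size)  # <- the extra norm in question

      def forward(self, x, mask):
          for layer in self.layers:
              x = layer(x, mask)
          return self.norm(x)  # applied once, after the last layer
  ```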
- > In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by sqrt(d_model).

  This is described in http://nlp.seas.harvard.edu/2018/04/03/attention.html#additional-components-bpe-search-averaging, but it may be better to link that section from the quoted part; I couldn't find it initially.
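
  For what it's worth, a minimal sketch of what I understand that quote to mean (module and method names here are mine, not from the post):

  ```python
  import math
  import torch.nn as nn

  class TiedEmbeddingsAndGenerator(nn.Module):
      """Sketch of the weight sharing: one matrix is used by the embedding
      lookup (scaled by sqrt(d_model)) and by the pre-softmax projection."""
      def __init__(self, vocab_size, d_model):
          super().__init__()
          self.d_model = d_model
          self.lut = nn.Embedding(vocab_size, d_model)
          self.proj = nn.Linear(d_model, vocab_size, bias=False)
          self.proj.weight = self.lut.weight  # share the same weight matrix

      def embed(self, tokens):
          # Embedding lookup, multiplied by sqrt(d_model) as the quote says.
          return self.lut(tokens) * math.sqrt(self.d_model)

      def logits(self, hidden):
          # The shared matrix reused as the pre-softmax linear transformation.
          return self.proj(hidden)
  ```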
- Should http://nlp.seas.harvard.edu/2018/04/01/attention.html link to the updated version, http://nlp.seas.harvard.edu/2018/04/03/attention.html?