
Possible issues in "The Annotated Transformer" #6

Open
@alexeyr

Description

  1. In http://nlp.seas.harvard.edu/2018/04/03/attention.html#encoder, the paper text says

    the output of each sub-layer is LayerNorm(x+Sublayer(x))... We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

    but the code is

    return x + self.dropout(sublayer(self.norm(x)))
    

    It seems it should be

    return self.norm(x + self.dropout(sublayer(x)))
    

    instead (see the first sketch after this list).

  2. In Encoder and Decoder, where does the extra LayerNorm on top of the stack come from? It doesn't appear in the quoted formulation from the paper (the first sketch after this list touches on this).

  3. The post says

    In our model, we share the same weight matrix between the two embedding layers and the pre-softmax linear transformation, similar to (cite). In the embedding layers, we multiply those weights by sqrt(d_model).

    The sharing is described in http://nlp.seas.harvard.edu/2018/04/03/attention.html#additional-components-bpe-search-averaging, but it may be better to link that section from the quoted text; I couldn't find it initially (see the second sketch after this list).

  4. Should http://nlp.seas.harvard.edu/2018/04/01/attention.html link to the updated version http://nlp.seas.harvard.edu/2018/04/03/attention.html?
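
For reference, here is a minimal sketch of the two orderings discussed in points 1 and 2. It assumes a SublayerConnection-style module like the one in the post, but uses PyTorch's built-in nn.LayerNorm instead of the post's own LayerNorm class, so names and details are illustrative rather than a copy of the post's code:

    import torch.nn as nn

    class SublayerConnection(nn.Module):
        "Residual connection around a sub-layer, with dropout and layer normalization."
        def __init__(self, size, dropout):
            super(SublayerConnection, self).__init__()
            self.norm = nn.LayerNorm(size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer):
            # Ordering used by the post's code ("pre-norm"): normalize the input,
            # apply the sub-layer and dropout, then add the residual. The sum itself
            # is never normalized, which seems to be why Encoder and Decoder apply
            # one more LayerNorm on top of the stack (point 2).
            return x + self.dropout(sublayer(self.norm(x)))

        def forward_as_described_in_paper(self, x, sublayer):
            # Ordering described by the quoted paper text ("post-norm"):
            # LayerNorm(x + Dropout(Sublayer(x))). Here the stack output is already
            # normalized, so an extra norm on top would not be needed.
            return self.norm(x + self.dropout(sublayer(x)))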

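Similarly, a minimal sketch of the weight sharing in point 3. The Embeddings and Generator classes below are shaped like the ones in the post, but the attribute names (lut, proj) and other details are written from memory and may differ from the actual notebook:

    import math
    import torch.nn as nn
    import torch.nn.functional as F

    class Embeddings(nn.Module):
        def __init__(self, d_model, vocab):
            super(Embeddings, self).__init__()
            self.lut = nn.Embedding(vocab, d_model)
            self.d_model = d_model

        def forward(self, x):
            # Scale the (shared) embedding weights by sqrt(d_model).
            return self.lut(x) * math.sqrt(self.d_model)

    class Generator(nn.Module):
        "Pre-softmax linear transformation back to the vocabulary."
        def __init__(self, d_model, vocab):
            super(Generator, self).__init__()
            self.proj = nn.Linear(d_model, vocab)

        def forward(self, x):
            return F.log_softmax(self.proj(x), dim=-1)

    # Share one weight matrix between the two embedding layers and the
    # pre-softmax projection: nn.Embedding(vocab, d_model).weight and
    # nn.Linear(d_model, vocab).weight both have shape (vocab, d_model),
    # so they can point at the same Parameter.
    d_model, vocab = 512, 11000
    src_embed = Embeddings(d_model, vocab)
    tgt_embed = Embeddings(d_model, vocab)
    generator = Generator(d_model, vocab)
    tgt_embed.lut.weight = src_embed.lut.weight
    generator.proj.weight = src_embed.lut.weight

This kind of sharing only makes sense when source and target use a shared vocabulary, which, if I remember correctly, is the BPE setting the linked section describes.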