This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

[Dict] Add extra special tokens #2828

Merged
merged 19 commits into from
Jul 16, 2020

@emilydinan emilydinan commented Jul 9, 2020

Patch description
Add a utility for adding special tokens; currently only BytelevelBPE and the simple space/split tokenizers are supported. Special tokens are added via a CLI flag in Torch Agent, but can also be added to the dictionary manually. A couple of notes:

  • Previously, the HF tokenizer did not include special tokens in the string returned by decode by default. I added a flag to turn this on. Curious to hear if others think this should be on by default.
  • Additionally, due to the presence of FP16Pad tokens, the simple "offset" map between the indices of special tokens in the ParlAI dictionary and in the HF dictionary was not correct. I added a dict which explicitly stores this map to rectify this.
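
To illustrate the second note, here is a minimal sketch of why a constant offset breaks once padding tokens sit between the regular vocabulary and the special tokens, and how an explicit dict avoids the problem. The vocabularies, token names, and the helper `build_special_token_map` are hypothetical and chosen only for illustration; they are not the code from this PR.

```python
from typing import Dict, List


def build_special_token_map(
    parlai_tok2ind: Dict[str, int],
    hf_tok2ind: Dict[str, int],
    special_tokens: List[str],
) -> Dict[int, int]:
    """Return an explicit ParlAI-index -> HF-index map for the special tokens."""
    return {parlai_tok2ind[tok]: hf_tok2ind[tok] for tok in special_tokens}


# Hypothetical ParlAI-side dictionary: four built-in tokens, the BPE vocabulary,
# two FP16 padding entries, then the newly added special tokens.
parlai_tok2ind = {
    "__null__": 0, "__start__": 1, "__end__": 2, "__unk__": 3,
    "hello": 4, "world": 5,
    "__FP16_PAD_0__": 6, "__FP16_PAD_1__": 7,
    "<speaker1>": 8, "<speaker2>": 9,
}
# Hypothetical HF-side vocabulary: no built-in or padding entries.
hf_tok2ind = {"hello": 0, "world": 1, "<speaker1>": 2, "<speaker2>": 3}

special_tokens = ["<speaker1>", "<speaker2>"]

# The offset that is correct for ordinary BPE tokens...
offset = parlai_tok2ind["hello"] - hf_tok2ind["hello"]
# ...is wrong for the special tokens, because the padding entries shift them.
naive_idx = parlai_tok2ind["<speaker1>"] - offset

# The explicit map does not depend on any offset.
special_map = build_special_token_map(parlai_tok2ind, hf_tok2ind, special_tokens)
```

In this toy layout the naive offset sends `<speaker1>` to the wrong HF index, while the explicit map stays correct however many padding entries are present.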

I added a test to check that this works for bytelevelbpe.

I also added an implementation of resizing token embeddings for Transformer Generator Agent, a general utility to do this in Torch Agent, and a test.
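
As a rough illustration of what resizing token embeddings involves, the sketch below grows an embedding table so that existing rows are preserved and rows for newly added tokens are freshly initialized. It uses plain Python lists to stay self-contained; the function name `resize_embeddings`, the Gaussian initialization, and its default standard deviation are assumptions for this example, not the PR's implementation, which would presumably operate on the model's embedding layer instead.

```python
import random
from typing import List


def resize_embeddings(
    weights: List[List[float]],
    new_size: int,
    init_std: float = 0.02,
    seed: int = 0,
) -> List[List[float]]:
    """Return a copy of `weights` grown to `new_size` rows.

    Existing rows are copied unchanged; each new row is drawn from a
    zero-mean Gaussian with standard deviation `init_std` (an assumed choice).
    """
    if not weights:
        raise ValueError("cannot infer the embedding dimension from an empty table")
    old_size = len(weights)
    if new_size < old_size:
        raise ValueError("this sketch only grows the table, it never shrinks it")
    dim = len(weights[0])
    rng = random.Random(seed)
    # Copy old rows so the original table is left untouched.
    resized = [row[:] for row in weights]
    # Append one freshly initialized row per newly added token.
    for _ in range(new_size - old_size):
        resized.append([rng.gauss(0.0, init_std) for _ in range(dim)])
    return resized


# Two existing tokens with 3-dimensional embeddings; two special tokens are added.
old_table = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
new_table = resize_embeddings(old_table, new_size=4)
```

Keeping the old rows intact matters because a pretrained model's learned embeddings should survive the addition of new special tokens; only the new rows need training from scratch.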

Emily Dinan added 2 commits July 9, 2020 14:48
@emilydinan emilydinan marked this pull request as ready for review July 14, 2020 13:14
@emilydinan emilydinan requested a review from wyshi July 14, 2020 13:14
@emilydinan
Contributor Author

test

parlai/core/dict.py (outdated, resolved)
parlai/agents/special_tok/agents.py (outdated, resolved)
tests/test_dict.py (resolved)

@stephenroller stephenroller left a comment

A few suggestions, but lgtm. Thanks for iterating on this.

parlai/core/torch_agent.py (outdated, resolved)
return text

def add_special_tokens(self, dict_agent, special_tokens: List[str]):

Suggested change
- def add_special_tokens(self, dict_agent, special_tokens: List[str]):
+ def add_special_tokens(self, dict_agent, special_tokens: List[str]):
+     """
+     Add special tokens to the tokenizer and dict_agent.
+     """

tests/test_dict.py (resolved)

@klshuster klshuster left a comment

🎊 thanks for this, sorry for the several rewrites

parlai/core/dict.py (resolved)
parlai/core/torch_agent.py (outdated, resolved)
@emilydinan emilydinan merged commit 20cc87d into master Jul 16, 2020
@emilydinan emilydinan deleted the bpeStuFFff branch July 16, 2020 17:37
4 participants