Remove <unk> token and index from experimental Vocab #1027

Open

zhangguanheng66 wants to merge 50 commits into main
Conversation

@zhangguanheng66 (Contributor) commented Oct 9, 2020

This PR removes the default '<unk>' token, along with its index, from experimental.vocab. Fixes #1016

In the experimental vocabulary there are no special symbols or user-reserved symbols. Instead, we add a built-in default index for out-of-vocabulary lookups, and users are required to call the set_default_index function explicitly to set it. If it is not set, the vocabulary throws an error whenever a token that is not in the vocabulary is looked up. With set_default_index, users have the flexibility to use a default index or not. For special symbols (e.g. '<unk>', '<pad>'), users should insert the tokens with the existing method self.insert_token(token: str, index: int). Later, when users need the index of a special symbol, they can obtain it by calling the vocab instance. For example:

vocab.insert_token('<unk>', 0)
vocab.set_default_index(0)
vocab.insert_token('<pad>', 1)
print(vocab('not_in_vocab_token'), vocab('<pad>'))
>>> 0 1
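
To illustrate the error path described above, here is a minimal sketch. It assumes the experimental factory is importable as torchtext.experimental.vocab.vocab and that an out-of-vocabulary lookup raises a RuntimeError while no default index is set; both are assumptions for illustration, not guaranteed by this PR text.

from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

v = vocab(OrderedDict([('hello', 2), ('world', 1)]))

# No default index has been set yet, so looking up an unknown token fails.
try:
    v['not_in_vocab_token']
except RuntimeError:  # assumed error type
    print('OOV lookup raises until a default index is set')

# Insert '<unk>' and make its index the default; OOV lookups now resolve to it.
v.insert_token('<unk>', 0)
v.set_default_index(0)
print(v['not_in_vocab_token'])  # expected: 0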

@zhangguanheng66 (Contributor, Author)
cc @bentrevett Please review the PR and let us know if you have any other suggestions.

@bentrevett (Contributor)
@zhangguanheng66 All looks good to me.

@cpuhrsch (Contributor)
Maybe "default" is a better name than "fallback" since it's akin to the default kwarg passed to dict.get.

Review comment (Contributor) on:

self.assertEqual(v['not_in_it'], 0)
v.insert_token('not_in_it', 0)
v.set_default_index(0)
self.assertEqual(v.get_default_index(), 0)
You probably also want to check which indices these tokens correspond to:

        self.assertEqual(v['not_in_it'], 0)
        self.assertEqual(v['<unk>'], 0)

@cpuhrsch
Copy link
Contributor

Related to this would be a test that verifies the behavior of insert_token('<unk>') if '<unk>' is already part of the vocabulary.
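
A sketch of such a test, assuming (this thread does not settle it) that re-inserting an existing token raises a RuntimeError:

import unittest
from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

class TestInsertExisting(unittest.TestCase):
    def test_insert_existing_token(self):
        v = vocab(OrderedDict([('<unk>', 1), ('hello', 1)]))
        # '<unk>' is already in the vocabulary; this pins down what a second
        # insert does. Raising is one possible contract (an assumption, not
        # something decided in this thread).
        with self.assertRaises(RuntimeError):
            v.insert_token('<unk>', 0)

if __name__ == '__main__':
    unittest.main()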

Review comment (Contributor) on:

c = OrderedDict()
v = vocab(c)
self.assertEqual(v.get_default_index(), -1)
I think it's better if this were to return "None". You should be able to do this easily by using c10::optional<int64_t> instead of int64_t in the C++ code.
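
A sketch of the Python-facing behavior that change would imply; the None return is the suggestion here, not what the PR currently does:

from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

v = vocab(OrderedDict())

# With c10::optional<int64_t> on the C++ side, an unset default index would
# surface as None in Python instead of the -1 sentinel (proposed behavior).
assert v.get_default_index() is None

v.insert_token('<unk>', 0)
v.set_default_index(0)
assert v.get_default_index() == 0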

@cpuhrsch
Copy link
Contributor

As an aside, while we're introducing this for the Vocab we should probably as a follow-up also introduce the same concepts to the Vectors class

@cpuhrsch (Contributor) left a comment

Before we merge this we also need to support reassignment. A special token might show up in the dataset and end up inadvertently mapped to the wrong index. For example, '<unk>' might show up in the dataset used to build this Vocab, but the user really wants it mapped to index 0 (which is what Vocab currently does).
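
To make the request concrete, a sketch of the scenario; the reassignment call at the end is hypothetical, and defining its behavior is exactly what this comment asks for:

from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

# '<unk>' happened to appear in the corpus, so building from counts gives it
# whatever index its position implies rather than 0.
counts = OrderedDict([('the', 100), ('cat', 40), ('<unk>', 3)])
v = vocab(counts)
print(v['<unk>'])  # likely 2 here, not the 0 the user intended

# Desired: move the existing '<unk>' entry to index 0, i.e. reassignment
# semantics for insert_token (hypothetical, not defined in this PR yet):
# v.insert_token('<unk>', 0)
# assert v['<unk>'] == 0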

Successfully merging this pull request may close these issues.

[RFC] Special symbols in torchtext.experimental.Vocab
4 participants