Remove <unk> token and index from experimental Vocab #1027
cc @bentrevett Please review the PR and let us know if you have any other suggestions.
@zhangguanheng66 All looks good to me.
Maybe "default" is a better name than "fallback" since it's akin to the default kwarg passed to dict.get.
self.assertEqual(v['not_in_it'], 0)
v.insert_token('not_in_it', 0)
v.set_default_index(0)
self.assertEqual(v.get_default_index(), 0)
You probably also want to check what numbers these tokens correspond to
self.assertEqual(v['not_in_it'], 0)
self.assertEqual(v['<unk>'], 0)
Related to this would be a test that verifies the behavior of insert_token(
test/experimental/test_vocab.py
c = OrderedDict()
v = vocab(c)
self.assertEqual(v.get_default_index(), -1)
I think it's better if this were to return "None". You should be able to do this easily by using c10::optional<int64_t> instead of int64_t in the C++ code.
As an aside, while we're introducing this for the Vocab, we should probably, as a follow-up, also introduce the same concepts to the Vectors class.
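To illustrate the suggestion above, here is a minimal Python sketch of a default index that is unset until the user sets it, so the getter returns None rather than a -1 sentinel. `DefaultIndexHolder` is a hypothetical stand-in written for this example; it is not the torchtext Vocab class and says nothing about how the C++ side is actually implemented.

```python
from typing import Optional


class DefaultIndexHolder:
    """Hypothetical stand-in for illustration; not the torchtext Vocab class."""

    def __init__(self) -> None:
        # None means "no default index has been set", replacing a -1 sentinel.
        self._default_index: Optional[int] = None

    def set_default_index(self, index: int) -> None:
        self._default_index = index

    def get_default_index(self) -> Optional[int]:
        return self._default_index


holder = DefaultIndexHolder()
unset_value = holder.get_default_index()  # nothing has been set yet
holder.set_default_index(0)
set_value = holder.get_default_index()
```

With None as the unset state, a caller can distinguish "no default" from any valid index without reserving a magic number.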
82a14a9 to b40d8dd
Before we merge this, we also need to support reassignment. A special token might show up in the dataset and end up inadvertently mapped to the wrong index. For example, such a token might show up in the dataset used to build this Vocab, when really the user wants it mapped to index "0" (which is what Vocab currently does).
This PR removes the default '<unk>' token, along with its index, from experimental.vocab. Fix #1016

In the experimental vocabulary, there will be no special symbols or user-reserved symbols. Instead, we add a built-in index for the default scenario, and users are required to call the set_default_index func explicitly to reset the default index. If it is not reset, the vocabulary will raise an error in the default scenario. With the set_default_index function, users have the flexibility to have or not have a default index. For the special symbols (e.g. '<unk>', '<pad>'), users should insert the tokens with the existing method self.insert_token(token: str, index: int). Later on, when users need the index of the special symbols, they can obtain them by calling the vocab instance. For example:
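The following is a pure-Python sketch of the behavior described above, not the torchtext implementation. `SketchVocab` is a hypothetical class written for this example. It assumes that the "default scenario" means looking up a token that is not in the vocabulary, that indices follow the insertion order of the OrderedDict, and that errors surface as RuntimeError; none of these details are confirmed by the PR text.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


class SketchVocab:
    """Hypothetical stand-in that mimics the described API; not torchtext code."""

    def __init__(self, ordered_dict: "OrderedDict[str, int]") -> None:
        # Assumption: tokens are indexed in the order they appear in the dict.
        self._itos: List[str] = list(ordered_dict.keys())
        self._stoi: Dict[str, int] = {t: i for i, t in enumerate(self._itos)}
        self._default_index: Optional[int] = None  # unset until the user sets it

    def insert_token(self, token: str, index: int) -> None:
        if token in self._stoi:
            raise RuntimeError(f"token {token!r} is already in the vocab")
        if not 0 <= index <= len(self._itos):
            raise RuntimeError(f"index {index} is out of range")
        self._itos.insert(index, token)
        # Later tokens shift by one, so rebuild the token-to-index mapping.
        self._stoi = {t: i for i, t in enumerate(self._itos)}

    def set_default_index(self, index: int) -> None:
        self._default_index = index

    def get_default_index(self) -> Optional[int]:
        return self._default_index

    def __getitem__(self, token: str) -> int:
        if token in self._stoi:
            return self._stoi[token]
        if self._default_index is None:
            raise RuntimeError(f"token {token!r} not found and no default index set")
        return self._default_index


v = SketchVocab(OrderedDict([("hello", 4), ("world", 2)]))
v.insert_token("<unk>", 0)  # special symbols are ordinary tokens here
v.insert_token("<pad>", 1)
unk_index, pad_index = v["<unk>"], v["<pad>"]

try:
    v["missing"]  # no default index yet, so this lookup fails
    raised = False
except RuntimeError:
    raised = True

v.set_default_index(unk_index)  # opt in to a default index
missing_index = v["missing"]
```

In this sketch the special symbols get no special treatment: the user chooses where they live with insert_token and decides whether unknown tokens map anywhere at all with set_default_index.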