Creating new JITable Vocab class #854
Conversation
self.assertEqual(v['a'], 3)
self.assertEqual(v['b'], 4)

def test_vocab_set_item(self):
I took a look at torchtext.Vocab and there is no similar function. There is also no such function in pytext/ScriptVocabulary.
I think a reasonable scenario is that users can always build a new vocab with a custom order or new tokens. It might be too complicated for us to maintain such a capability.
I see, so you're suggesting we not keep the functionality that allows users to update an index once the vocab has been created?
I don't think it's necessary. Do you see this kind of function in other vocab implementations?
def test_vocab_basic(self):
    token_to_freq = {'hello': 4, 'world': 3, 'ᑌᑎIᑕOᗪᕮ_Tᕮ᙭T': 5, 'freq_too_low': 2}
    sorted_by_freq_tuples = sorted(token_to_freq.items(), key=lambda x: x[1], reverse=True)
What happens if there are tokens with the same frequency?
I believe the sorted function keeps them in the same order in which they appear. I can do some more testing to verify this. But the whole point is that it is up to the user to provide the ordering of tokens when building the OrderedDict. Once the dict has been passed into the Vocab class, we respect its ordering.
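For what it's worth, here is a minimal sketch of the stable-sort behavior I'm describing (the token names and frequencies are made up for illustration):

```python
from collections import OrderedDict

token_to_freq = {'a': 2, 'b': 2, 'c': 3}  # 'a' and 'b' tie on frequency

# Python's sorted() is stable, so 'a' stays ahead of 'b' when frequencies tie;
# the user controls this final ordering before handing it to Vocab.
sorted_by_freq_tuples = sorted(token_to_freq.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

print(list(ordered_dict))  # ['c', 'a', 'b']
```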
torchtext/csrc/vocab.cpp
Outdated
return unk_index_;
}

void addToken(const std::string &token) {
As per my comments above, we need more discussion of this function.
Are we going to support a similar function, like lookup_indices_1d?
Do we want to have a function for lookup_word?
Let's discuss this further; we can probably add these in follow-up PRs.
r"""Creates a vocab object which maps tokens to indices. | ||
|
||
Arguments: | ||
ordered_dict (collections.OrderedDict): object holding the frequencies of each token found in the data. |
I think we can be more explicit here. For example, what happens if there are tokens with the same frequency?
Hmm, see my comments above about respecting the ordering of the dictionary that is passed in.
Does this particular class documentation make that explicit to the user? Should it be made explicit?
Sure, I think that's a good idea. I just added two lines explaining this.
torchtext/experimental/vocab.py
Outdated
min_freq: The minimum frequency needed to include a token in the vocabulary.
    Values less than 1 will be set to 1. Default: 1.
specials: The tuple of special tokens (e.g., padding or eos) that will be prepended to the vocabulary.
    The first value should always be the unknown token. Default: ['<unk>', '<pad>']
This line is not clear. The specials will be prepended to the vocabulary list, so the order of the special tokens in `specials` matters, right? And what happens if `<unk>` is not in `specials`? What happens if users assign a custom unknown token?
Yes, I can specify that the ordering of the special tokens matters. The comment about the unk token is incorrect: I have a separate `unk_token` parameter that the user needs to pass in. I will update the comment to reflect this!
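As a sketch of how I picture this, based on the constructor signature in this PR (the exact index assignment in the comment is my assumption):

```python
from collections import OrderedDict

from torchtext.experimental.vocab import Vocab

ordered_dict = OrderedDict([('hello', 4), ('world', 3)])

# specials are prepended in the order given; the unknown token is passed
# separately via unk_token rather than being inferred from specials
v = Vocab(ordered_dict, min_freq=1, unk_token='<unk>',
          specials=('<unk>', '<pad>'), specials_first=True)

print(v['<unk>'], v['<pad>'], v['hello'])  # presumably 0 1 2 with specials first
```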
Are we going to support a similar function, like lookup_indices_1d?
Do we want to have a function for lookup_word?
Similar to the Vector class, once we get to optimizing the performance of the Vocab class, we need to decide on the return value of `__getitem__`: either a tensor or an index (i.e. an integer). The current Vocab returns an index, and the vocab transform is followed by a `totensor` transform.
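Roughly, the current flow looks like this sketch (it assumes the Vocab constructor from this PR, and shows the "to tensor" step as a plain `torch.tensor` call):

```python
from collections import OrderedDict

import torch
from torchtext.experimental.vocab import Vocab

v = Vocab(OrderedDict([('hello', 4), ('world', 3)]))

# __getitem__ currently returns a plain integer index per token...
indices = [v[token] for token in ['hello', 'world']]

# ...so a separate "to tensor" step follows the vocab transform
tensor = torch.tensor(indices, dtype=torch.long)
```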
Codecov Report

@@            Coverage Diff             @@
##           master     #854      +/-   ##
==========================================
+ Coverage   76.86%   77.24%   +0.37%
==========================================
  Files          42       43       +1
  Lines        2944     2993      +49
==========================================
+ Hits         2263     2312      +49
  Misses        681      681
==========================================

Continue to review the full report at Codecov.
return self.vocab[token]

@torch.jit.export
def insert_token(self, token: str, index: int) -> None:
I'd remove the `_token` part unless you have an explicit reason to keep it. It's more in line with Python's list.
Okay, I was trying to be consistent with the other function names, which also have `_token` at the end. One thing to keep in mind is that for some functions we have `lookup_tokens` vs `lookup_token`. I thought it was important to be explicit about whether we were adding one token or multiple tokens.
What do you think?
return self.vocab.lookup_indices(tokens)

@torch.jit.export
def get_stoi(self) -> Dict[str, int]:
Do we really still want these?
I think it made testing the functions a lot easier. I'm not sure how often the end user will need the list of tokens inside the vocab, though. Do you see any downsides to keeping this?
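For context, a hypothetical test along these lines is what I had in mind (the expected index layout is my assumption about the default specials being prepended):

```python
from collections import OrderedDict

from torchtext.experimental.vocab import Vocab

# comparing the whole mapping at once is simpler than looking up tokens one by one
v = Vocab(OrderedDict([('hello', 4), ('world', 3)]))
expected_stoi = {'<unk>': 0, '<pad>': 1, 'hello': 2, 'world': 3}
assert v.get_stoi() == expected_stoi
```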
def __init__(self, ordered_dict, min_freq=1, unk_token='<unk>', specials=('<unk>', '<pad>'), specials_first=True):
    super(Vocab, self).__init__()

    if not unk_token:
Are we expecting users to pass `None` to the keyword argument `unk_token` here?
We generally don't expect this. I was just being explicit here so that if the user did pass in `None`, we would fail gracefully.
self.assertEqual(len(v), 5)

def test_vocab_basic(self):
    token_to_freq = {'hello': 4, 'world': 3, 'ᑌᑎIᑕOᗪᕮ_Tᕮ᙭T': 5, 'freq_too_low': 2}
I might have missed it. Do we have a test case starting from a list of tokens?
What do you mean by starting off with a list of tokens? The Vocab class expects an OrderedDict as its input. So even if we did start with a list of tokens, we would have to build a dictionary of token frequencies and sort it, in much the same way I am doing now, to get an ordered list of tuples that is finally fed into an OrderedDict.
Does this make sense?
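In other words, starting from a raw token list the test would first have to do something like this sketch (import path per this PR):

```python
from collections import Counter, OrderedDict

from torchtext.experimental.vocab import Vocab

raw_tokens = ['hello', 'world', 'hello', 'hello', 'world']

# count frequencies, sort by frequency (sorted() is stable),
# then hand the result to Vocab as an OrderedDict
counter = Counter(raw_tokens)
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

v = Vocab(ordered_dict)
```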
return self.vocab[token]

@torch.jit.export
def insert_token(self, token: str, index: int) -> None:
Do we want to maintain the `insert_token` function in the vocab class? Is there any use case to support it?
Yes, I think Steven specifically requested this during our design meeting, i.e. they wanted to be able to insert special tokens at specific indices after constructing a Vocab.
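The use case would look roughly like this sketch, given the `insert_token` signature under review (the `'<mask>'` token is made up for illustration):

```python
from collections import OrderedDict

from torchtext.experimental.vocab import Vocab

v = Vocab(OrderedDict([('hello', 4), ('world', 3)]))

# insert an extra special token at a chosen position after construction
v.insert_token('<mask>', 0)
print(v['<mask>'])  # expected: 0
```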
Approved. We can merge this PR if Christian has no further comments.
    they are added to the end of the vocabulary. Default: True.

Raises:
    ValueError: if a default `unk_token` isn't provided.
I suggest adding two examples to the doc: building a vocab from a list of ordered unique tokens, and from a list of raw text tokens (with repeats in this case). Something similar to here.
I will go ahead and add the example for a list of ordered unique tokens. As for reading tokens from a raw text file, I will add that in the follow-up PR with the factory function vocab_from_file_object.
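Something like this sketch is what I have in mind for the ordered-unique-tokens example (the placeholder frequencies of 1 are my assumption):

```python
from collections import OrderedDict

from torchtext.experimental.vocab import Vocab

# tokens are already unique and in the desired order, so pair each one
# with a placeholder frequency of 1
unique_tokens = ['<bos>', '<eos>', 'hello', 'world']
v = Vocab(OrderedDict([(token, 1) for token in unique_tokens]))
```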
Yep, we can add that function in the follow-up PR.
Description

Future Scope
- Make stoi() and itos() private (similar to the FastText dictionary implementation)