
Creating new JITable Vocab class #854

Merged · 16 commits into pytorch:master on Jul 7, 2020

Conversation

@Nayef211 (Contributor) commented Jun 26, 2020

Description

  • Creating a new Vocab class in experimental that is JITable and more performant
  • Using a custom c++ class for the underlying vocab implementation

Future Scope

  • Making certain functions like stoi() and itos() private
  • Adding factory methods for reading from CSV file. Also include example of how to use this in the docstring
  • Creating benchmarking code to compare the new implementation with the existing Vocab implementation
  • Refactoring cpp Vocab class with optimized implementation (may involve using FastText dictionary implementation)
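For reference, here is a minimal pure-Python sketch of the intended token-to-index mapping. The real class in this PR is backed by a custom C++ TorchScript class for performance; `SimpleVocab` and its exact behavior are illustrative assumptions, not the actual torchtext API:

```python
from collections import OrderedDict

class SimpleVocab:
    """Illustrative stand-in for the experimental Vocab
    (not the real, C++-backed torchtext class)."""

    def __init__(self, ordered_dict, min_freq=1, unk_token='<unk>'):
        self.itos = [unk_token]       # index -> token
        self.stoi = {unk_token: 0}    # token -> index
        self.unk_index = 0
        # The insertion order of ordered_dict determines the indices.
        for token, freq in ordered_dict.items():
            if freq >= min_freq and token not in self.stoi:
                self.stoi[token] = len(self.itos)
                self.itos.append(token)

    def __getitem__(self, token):
        # Unknown tokens fall back to the unk index.
        return self.stoi.get(token, self.unk_index)

    def __len__(self):
        return len(self.itos)

ordered = OrderedDict([('hello', 4), ('world', 3), ('rare', 1)])
v = SimpleVocab(ordered, min_freq=2)
```

Tokens below `min_freq` are dropped, and lookups of dropped or unseen tokens return the unk index, which is the behavior discussed in the review threads below.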

@Nayef211 Nayef211 marked this pull request as ready for review June 29, 2020 14:32
self.assertEqual(v['a'], 3)
self.assertEqual(v['b'], 4)

def test_vocab_set_item(self):
Contributor:
I took a look at torchtext.Vocab and there is no similar func. There is also no such func in pytext/ScriptVocabulary.

I think a reasonable scenario is that users can always build a new vocab with a custom order or new tokens. It might be too complicated for us to maintain such a capability.

Contributor Author:

I see, so you're suggesting we not keep the functionality that allows users to update an index once the vocab has been created?

Contributor:

I think it's not necessary. Do you see this kind of func in other vocab implementations?


def test_vocab_basic(self):
token_to_freq = {'hello': 4, 'world': 3, 'ᑌᑎIᑕOᗪᕮ_Tᕮ᙭T': 5, 'freq_too_low': 2}
sorted_by_freq_tuples = sorted(token_to_freq.items(), key=lambda x: x[1], reverse=True)
Contributor:
What happens if there are tokens with the same frequency?

Contributor Author:

I believe the sorted function is stable, so tokens with equal frequency stay in the order in which they appear. I can do some more testing to verify. But the whole point is that it is up to the user to provide the ordering of tokens when building the OrderedDict. Once the dict has been passed into the Vocab class, we respect its ordering.
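Python's `sorted()` is in fact documented to be stable, so ties keep their input order even with `reverse=True`; a quick check of the pattern used in the test above:

```python
# Dicts preserve insertion order (Python 3.7+), and sorted() is stable,
# so tokens with equal frequency keep the order they were inserted in.
token_to_freq = {'hello': 4, 'world': 3, 'tie_a': 3, 'tie_b': 3}
sorted_by_freq_tuples = sorted(token_to_freq.items(),
                               key=lambda x: x[1], reverse=True)
```

Here `'world'`, `'tie_a'`, and `'tie_b'` all have frequency 3 and come out in their original relative order, after `'hello'`.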

return unk_index_;
}

void addToken(const std::string &token) {
Contributor:
As per my comments above, we need more discussion of this func.

Contributor Author:

> Are we going to support a similar func, like lookup_indices_1d?
>
> Do we want to have a func for lookup_word?

Let's discuss this further. We can probably add this in follow-up PRs.

r"""Creates a vocab object which maps tokens to indices.

Arguments:
ordered_dict (collections.OrderedDict): object holding the frequencies of each token found in the data.
Contributor:

I think we can be more explicit here. For example, what happens if there are tokens with the same frequency?

Contributor Author:

Hmm, refer to my comments above about respecting the ordering of the dictionary that is passed in.

@cpuhrsch (Contributor) commented Jul 6, 2020:

Does this particular class documentation make that explicit to the user? Should it be made explicit?

Contributor Author:

Sure, I think that's a good idea. I just added 2 lines explaining this.

min_freq: The minimum frequency needed to include a token in the vocabulary.
Values less than 1 will be set to 1. Default: 1.
specials: The tuple of special tokens (e.g., padding or eos) that will be prepended to the vocabulary.
The first value should always be the unknown token. Default: ('<unk>', '<pad>')
Contributor:

This line is not clear. The specials will be prepended to the vocabulary list, so the order of the special tokens in `specials` matters, right? And what happens if `'<unk>'` is not in `specials`? What happens if users assign a custom unknown token?

Contributor Author:

Yes, I can specify that the ordering of the special tokens matters. The comment about the unk token is incorrect: I have a separate unk_token parameter that the user needs to pass in. I will update the comment to reflect this!

@zhangguanheng66 (Contributor) left a comment:

Are we going to support a similar func, like lookup_indices_1d?

Do we want to have a func for lookup_word?

@zhangguanheng66 (Contributor) left a comment:

Similar to the Vector class, once we get to optimizing the performance of the Vocab class, we need to decide the return value of __getitem__: either a tensor or an index (i.e. an integer). The current Vocab returns an index, and the vocab transform is followed by a totensor transform.

@codecov bot commented Jul 2, 2020

Codecov Report

Merging #854 into master will increase coverage by 0.37%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #854      +/-   ##
==========================================
+ Coverage   76.86%   77.24%   +0.37%     
==========================================
  Files          42       43       +1     
  Lines        2944     2993      +49     
==========================================
+ Hits         2263     2312      +49     
  Misses        681      681              
Impacted Files Coverage Δ
torchtext/experimental/vocab.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b887daa...97c0807.

return self.vocab[token]

@torch.jit.export
def insert_token(self, token: str, index: int) -> None:
Contributor:

I'd remove the "_token" part unless you have an explicit reason. It's more in line with python's List.

@Nayef211 (Contributor Author) commented Jul 6, 2020:

Okay, I was trying to be consistent with the other function names, which also have _token at the end. One thing to keep in mind is that for some functions we have lookup_tokens vs. lookup_token. I thought it was important to be explicit about whether we were adding one token or multiple tokens.

What do you think?
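For comparison, Python's `list.insert` shifts every element at or after the target index; a vocab-level insert would presumably need to rebuild the token-to-index mapping the same way (a sketch of the semantics, not the PR's actual implementation):

```python
# list.insert shifts later elements up by one.
itos = ['<unk>', 'hello', 'world']
itos.insert(1, '<pad>')   # itos becomes ['<unk>', '<pad>', 'hello', 'world']

# The reverse mapping must be rebuilt to stay consistent with itos.
stoi = {token: i for i, token in enumerate(itos)}
```

This is why inserting into the middle of a vocab is more than a dict update: every index at or after the insertion point changes.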

return self.vocab.lookup_indices(tokens)

@torch.jit.export
def get_stoi(self) -> Dict[str, int]:
Contributor:

Do we really still want these?

Contributor Author:

I think it makes testing the functions a lot easier. That said, I'm not sure how often the list of tokens inside the vocab will be needed by the end user. Do you see any negatives in keeping this?

def __init__(self, ordered_dict, min_freq=1, unk_token='<unk>', specials=('<unk>', '<pad>'), specials_first=True):
super(Vocab, self).__init__()

if not unk_token:
Contributor:

Do we expect users to pass None to the keyword argument unk_token here?

Contributor Author:

We generally don't expect this. I was just being explicit here so that if the user did pass in None, we would fail gracefully.
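The fail-fast behavior described here can be sketched as follows (`check_unk_token` is a hypothetical helper mirroring the described behavior, not the actual code in the PR):

```python
def check_unk_token(unk_token):
    # Reject None (or any falsy value) up front with a clear error,
    # instead of letting later lookups break in confusing ways.
    if not unk_token:
        raise ValueError("A default unk_token wasn't provided.")
    return unk_token
```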

self.assertEqual(len(v), 5)

def test_vocab_basic(self):
token_to_freq = {'hello': 4, 'world': 3, 'ᑌᑎIᑕOᗪᕮ_Tᕮ᙭T': 5, 'freq_too_low': 2}
Contributor:

I might have missed it. Do we have a test case starting from a list of tokens?

Contributor Author:

What do you mean by starting off with a list of tokens? The Vocab class expects an OrderedDict as input. So even if we did start off with a list of tokens, we would probably have to build a dictionary of token frequencies and sort it in a similar manner to what I'm doing now, to get an ordered list of tuples that is finally fed into an OrderedDict.

Does this make sense?
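The path from a raw token list to the expected OrderedDict could look like this (a Counter-based sketch of the workflow described above):

```python
from collections import Counter, OrderedDict

tokens = ['hello', 'world', 'hello', 'hello', 'world', 'rare']
counter = Counter(tokens)  # token -> frequency
# sorted() is stable, so equal-frequency tokens keep the counter's order.
sorted_by_freq_tuples = sorted(counter.items(),
                               key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
```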

return self.vocab[token]

@torch.jit.export
def insert_token(self, token: str, index: int) -> None:
Contributor:

Do we want to maintain the insert_token func in the vocab class? Any use case that supports it?

Contributor Author:

Yes, I think Steven specifically requested this during our design meeting, i.e. they wanted to be able to insert special tokens at specific indices after constructing a Vocab.

@zhangguanheng66 (Contributor) left a comment:

Approved. We can merge this PR if Christian has no further comments.

they are added to the end of the vocabulary. Default: True.

Raises:
ValueError: if a default `unk_token` isn't provided.
@zhangguanheng66 (Contributor) commented Jul 6, 2020:

I suggest adding two examples in the doc showing how to build a vocab from a list of ordered unique tokens and from a list of raw text tokens (with repeats in that case). Something similar here

Contributor Author:

I will go ahead and add the example for a list of ordered unique tokens. As for reading tokens from a raw text file, I will add that in the follow-up PR with the factory function vocab_from_file_object.
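A possible shape for the "list of ordered unique tokens" docstring example (the dummy frequency of 1 is an assumption: with min_freq=1 only the order of the list matters, not the counts):

```python
from collections import OrderedDict

unique_tokens = ['<pad>', 'hello', 'world']
# Each token gets a dummy frequency of 1; with min_freq=1 the vocab
# simply respects the order of this list.
ordered_dict = OrderedDict((token, 1) for token in unique_tokens)
```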

Contributor:

Yep. We can add that func in the follow-up PR.

@Nayef211 Nayef211 merged commit 02679c3 into pytorch:master Jul 7, 2020