
Creating new JITable Vocab class #854

Merged · 16 commits into pytorch:master on Jul 7, 2020

Conversation

@Nayef211 (Contributor) commented Jun 26, 2020

Description

  • Creating a new Vocab class in experimental that is JITable and more performant
  • Using a custom c++ class for the underlying vocab implementation

Future Scope

  • Making certain functions like stoi() and itos() private
  • Adding factory methods for reading from CSV file. Also include example of how to use this in the docstring
  • Creating benchmarking code to compare the new implementation with the existing Vocab implementation
  • Refactoring cpp Vocab class with optimized implementation (may involve using FastText dictionary implementation)
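For reference, here is a minimal pure-Python sketch of the intended token-to-index mapping. The real class in this PR is backed by a custom C++ TorchScript class for performance; `SimpleVocab` and its exact behavior are illustrative assumptions, not the actual torchtext API:

```python
from collections import OrderedDict

class SimpleVocab:
    """Illustrative stand-in for the experimental Vocab
    (not the real, C++-backed torchtext class)."""

    def __init__(self, ordered_dict, min_freq=1, unk_token='<unk>'):
        self.itos = [unk_token]       # index -> token
        self.stoi = {unk_token: 0}    # token -> index
        self.unk_index = 0
        # The insertion order of ordered_dict determines the indices.
        for token, freq in ordered_dict.items():
            if freq >= min_freq and token not in self.stoi:
                self.stoi[token] = len(self.itos)
                self.itos.append(token)

    def __getitem__(self, token):
        # Unknown tokens fall back to the unk index.
        return self.stoi.get(token, self.unk_index)

    def __len__(self):
        return len(self.itos)

ordered = OrderedDict([('hello', 4), ('world', 3), ('rare', 1)])
v = SimpleVocab(ordered, min_freq=2)
```

Tokens below `min_freq` are dropped, and lookups of dropped or unseen tokens return the unk index, which is the behavior discussed in the review threads below.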

@Nayef211 Nayef211 marked this pull request as ready for review June 29, 2020 14:32
self.assertEqual(v['a'], 3)
self.assertEqual(v['b'], 4)

def test_vocab_set_item(self):
Contributor:
I took a look at torchtext.Vocab and there is no similar func. There is also no such func in pytext/ScriptVocabulary.

I think a reasonable scenario is that users can always build a new vocab with a custom order or new tokens. It might be too complicated for us to maintain such a capability.

Contributor Author:

I see, so you're suggesting we not keep the functionality that allows users to update an index once the vocab has been created?

Contributor:

I think it's not necessary. Do you see this kind of func in other vocab implementations?


def test_vocab_basic(self):
token_to_freq = {'hello': 4, 'world': 3, 'ᑌᑎIᑕOᗪᕮ_Tᕮ᙭T': 5, 'freq_too_low': 2}
sorted_by_freq_tuples = sorted(token_to_freq.items(), key=lambda x: x[1], reverse=True)
Contributor:
What happens if there are tokens with the same frequency?

Contributor Author:

I believe the sorted function is stable, so tokens with equal frequency stay in the order in which they appear. I can do some more testing to verify. But the whole point is that it is up to the user to provide the ordering of tokens when building the OrderedDict. Once the dict has been passed into the Vocab class, we respect its ordering.
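Python's `sorted()` is in fact documented to be stable, so ties keep their input order even with `reverse=True`; a quick check of the pattern used in the test above:

```python
# Dicts preserve insertion order (Python 3.7+), and sorted() is stable,
# so tokens with equal frequency keep the order they were inserted in.
token_to_freq = {'hello': 4, 'world': 3, 'tie_a': 3, 'tie_b': 3}
sorted_by_freq_tuples = sorted(token_to_freq.items(),
                               key=lambda x: x[1], reverse=True)
```

Here `'world'`, `'tie_a'`, and `'tie_b'` all have frequency 3 and come out in their original relative order, after `'hello'`.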

return unk_index_;
}

void addToken(const std::string &token) {
Contributor:
As per my comments above, we need more discussion of this func.

Contributor Author:

> Are we going to support a similar func, like lookup_indices_1d?
>
> Do we want to have a func for lookup_word?

Let's discuss this further. We can probably add this in follow-up PRs.

r"""Creates a vocab object which maps tokens to indices.

Arguments:
ordered_dict (collections.OrderedDict): object holding the frequencies of each token found in the data.
Contributor:

I think we can be more explicit here. For example, what happens if there are tokens with the same frequency?

Contributor Author:

Hmm, refer to my comments above about respecting the ordering of the dictionary that is passed in.

@cpuhrsch (Contributor) commented Jul 6, 2020:

Does this particular class documentation make that explicit to the user? Should it be made explicit?

Contributor Author:

Sure, I think that's a good idea. I just added 2 lines explaining this.

min_freq: The minimum frequency needed to include a token in the vocabulary.
Values less than 1 will be set to 1. Default: 1.
specials: The tuple of special tokens (e.g., padding or eos) that will be prepended to the vocabulary.
The first value should always be the unknown token. Default: ('<unk>', '<pad>')
Contributor:

This line is not clear. The specials will be prepended to the vocabulary list, so the order of the special tokens in `specials` matters, right? And what happens if `'<unk>'` is not in `specials`? What happens if users assign a custom unknown token?

Contributor Author:

Yes, I can specify that the ordering of the special tokens matters. The comment about the unk token is incorrect: I have a separate unk_token parameter that the user needs to pass in. I will update the comment to reflect this!

@zhangguanheng66 (Contributor) left a comment:

Are we going to support a similar func, like lookup_indices_1d?

Do we want to have a func for lookup_word?

@zhangguanheng66 (Contributor) left a comment:

Similar to the Vector class, once we get to optimizing the performance of the Vocab class, we need to decide the return value of __getitem__: either a tensor or an index (i.e. an integer). The current Vocab returns an index, and the vocab transform is followed by a totensor transform.

@codecov bot commented Jul 2, 2020

Codecov Report

Merging #854 into master will increase coverage by 0.37%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #854      +/-   ##
==========================================
+ Coverage   76.86%   77.24%   +0.37%     
==========================================
  Files          42       43       +1     
  Lines        2944     2993      +49     
==========================================
+ Hits         2263     2312      +49     
  Misses        681      681              
Impacted Files Coverage Δ
torchtext/experimental/vocab.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b887daa...97c0807.

return self.vocab[token]

@torch.jit.export
def insert_token(self, token: str, index: int) -> None:
Contributor:

I'd remove the "_token" part unless you have an explicit reason. It's more in line with python's List.

@Nayef211 (Contributor Author) commented Jul 6, 2020:

Okay, I was trying to be consistent with the other function names, which also have _token at the end. One thing to keep in mind is that for some functions we have lookup_tokens vs. lookup_token. I thought it was important to be explicit about whether we were adding one token or multiple tokens.

What do you think?
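For comparison, Python's `list.insert` shifts every element at or after the target index; a vocab-level insert would presumably need to rebuild the token-to-index mapping the same way (a sketch of the semantics, not the PR's actual implementation):

```python
# list.insert shifts later elements up by one.
itos = ['<unk>', 'hello', 'world']
itos.insert(1, '<pad>')   # itos becomes ['<unk>', '<pad>', 'hello', 'world']

# The reverse mapping must be rebuilt to stay consistent with itos.
stoi = {token: i for i, token in enumerate(itos)}
```

This is why inserting into the middle of a vocab is more than a dict update: every index at or after the insertion point changes.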

return self.vocab.lookup_indices(tokens)

@torch.jit.export
def get_stoi(self) -> Dict[str, int]:
Contributor:

Do we really still want these?

Contributor Author:

I think it makes testing the functions a lot easier. That said, I'm not sure how often the list of tokens inside the vocab will be needed by the end user. Do you see any negatives in keeping this?

def __init__(self, ordered_dict, min_freq=1, unk_token='<unk>', specials=('<unk>', '<pad>'), specials_first=True):
super(Vocab, self).__init__()

if not unk_token:
Contributor:

Do we expect users to pass None to the keyword argument unk_token here?

Contributor Author:

We generally don't expect this. I was just being explicit here so that if the user did pass in None, we would fail gracefully.
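The fail-fast behavior described here can be sketched as follows (`check_unk_token` is a hypothetical helper mirroring the described behavior, not the actual code in the PR):

```python
def check_unk_token(unk_token):
    # Reject None (or any falsy value) up front with a clear error,
    # instead of letting later lookups break in confusing ways.
    if not unk_token:
        raise ValueError("A default unk_token wasn't provided.")
    return unk_token
```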

self.assertEqual(len(v), 5)

def test_vocab_basic(self):
token_to_freq = {'hello': 4, 'world': 3, 'ᑌᑎIᑕOᗪᕮ_Tᕮ᙭T': 5, 'freq_too_low': 2}
Contributor:

I might have missed it. Do we have a test case starting from a list of tokens?

Contributor Author:

What do you mean by starting off with a list of tokens? The Vocab class expects an OrderedDict as input. So even if we did start off with a list of tokens, we would probably have to build a dictionary of token frequencies and sort it in a similar manner to what I'm doing now, to get an ordered list of tuples that is finally fed into an OrderedDict.

Does this make sense?
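The path from a raw token list to the expected OrderedDict could look like this (a Counter-based sketch of the workflow described above):

```python
from collections import Counter, OrderedDict

tokens = ['hello', 'world', 'hello', 'hello', 'world', 'rare']
counter = Counter(tokens)  # token -> frequency
# sorted() is stable, so equal-frequency tokens keep the counter's order.
sorted_by_freq_tuples = sorted(counter.items(),
                               key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
```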

return self.vocab[token]

@torch.jit.export
def insert_token(self, token: str, index: int) -> None:
Contributor:

Do we want to maintain the insert_token func in the vocab class? Any use case that supports it?

Contributor Author:

Yes, I think Steven specifically requested this during our design meeting, i.e. they wanted to be able to insert special tokens at specific indices after constructing a Vocab.

@zhangguanheng66 (Contributor) left a comment:

Approved. We can merge this PR if Christian has no further comments.

they are added to the end of the vocabulary. Default: True.

Raises:
ValueError: if a default `unk_token` isn't provided.
@zhangguanheng66 (Contributor) commented Jul 6, 2020:

I suggest adding two examples in the doc showing how to build a vocab from a list of ordered unique tokens and from a list of raw text tokens (with repeats in that case). Something similar here

Contributor Author:

I will go ahead and add the example for a list of ordered unique tokens. As for reading tokens from a raw text file, I will add that in the follow-up PR with the factory function vocab_from_file_object.
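A possible shape for the "list of ordered unique tokens" docstring example (the dummy frequency of 1 is an assumption: with min_freq=1 only the order of the list matters, not the counts):

```python
from collections import OrderedDict

unique_tokens = ['<pad>', 'hello', 'world']
# Each token gets a dummy frequency of 1; with min_freq=1 the vocab
# simply respects the order of this list.
ordered_dict = OrderedDict((token, 1) for token in unique_tokens)
```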

Contributor:

Yep. We can add that func in the follow-up PR.

@Nayef211 Nayef211 merged commit 02679c3 into pytorch:master Jul 7, 2020