Remove <unk> token and index from experimental Vocab #1027

Open

zhangguanheng66 wants to merge 50 commits into main
Conversation

@zhangguanheng66 (Contributor) commented Oct 9, 2020

This PR removes the default '<unk>' token, along with its index, from experimental.vocab. Fixes #1016

In the experimental vocabulary there are no special symbols or user-reserved symbols. Instead, we add a built-in default index for out-of-vocabulary lookups, and users are required to call the set_default_index function explicitly to set it. If it is not set, the vocabulary throws an error whenever a token that is not in the vocabulary is looked up. With set_default_index, users have the flexibility to use a default index or not. For special symbols (e.g. '<unk>', '<pad>'), users should insert the tokens with the existing method self.insert_token(token: str, index: int). Later, when users need the index of a special symbol, they can obtain it by calling the vocab instance. For example:

vocab.insert_token('<unk>', 0)
vocab.set_default_index(0)
vocab.insert_token('<pad>', 1)
print(vocab('not_in_vocab_token'), vocab('<pad>'))
>>> 0 1
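
To illustrate the error path described above, here is a minimal sketch. It assumes the experimental factory is importable as torchtext.experimental.vocab.vocab and that an out-of-vocabulary lookup raises a RuntimeError while no default index is set; both are assumptions for illustration, not guaranteed by this PR text.

from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

v = vocab(OrderedDict([('hello', 2), ('world', 1)]))

# No default index has been set yet, so looking up an unknown token fails.
try:
    v['not_in_vocab_token']
except RuntimeError:  # assumed error type
    print('OOV lookup raises until a default index is set')

# Insert '<unk>' and make its index the default; OOV lookups now resolve to it.
v.insert_token('<unk>', 0)
v.set_default_index(0)
print(v['not_in_vocab_token'])  # expected: 0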

@zhangguanheng66 (Contributor, Author)
cc @bentrevett Please review the PR and let us know if you have any other suggestions.

@bentrevett (Contributor)
@zhangguanheng66 All looks good to me.

@cpuhrsch (Contributor)
Maybe "default" is a better name than "fallback" since it's akin to the default kwarg passed to dict.get.

Review comment (Contributor) on:

self.assertEqual(v['not_in_it'], 0)
v.insert_token('not_in_it', 0)
v.set_default_index(0)
self.assertEqual(v.get_default_index(), 0)
You probably also want to check which indices these tokens correspond to:

        self.assertEqual(v['not_in_it'], 0)
        self.assertEqual(v['<unk>'], 0)

@cpuhrsch
Copy link
Contributor

Related to this would be a test that verifies the behavior of insert_token('<unk>') if '<unk>' is already part of the vocabulary.
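
A sketch of such a test, assuming (this thread does not settle it) that re-inserting an existing token raises a RuntimeError:

import unittest
from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

class TestInsertExisting(unittest.TestCase):
    def test_insert_existing_token(self):
        v = vocab(OrderedDict([('<unk>', 1), ('hello', 1)]))
        # '<unk>' is already in the vocabulary; this pins down what a second
        # insert does. Raising is one possible contract (an assumption, not
        # something decided in this thread).
        with self.assertRaises(RuntimeError):
            v.insert_token('<unk>', 0)

if __name__ == '__main__':
    unittest.main()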

Review comment (Contributor) on:

c = OrderedDict()
v = vocab(c)
self.assertEqual(v.get_default_index(), -1)
I think it's better if this were to return "None". You should be able to do this easily by using c10::optional<int64_t> instead of int64_t in the C++ code.
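
A sketch of the Python-facing behavior that change would imply; the None return is the suggestion here, not what the PR currently does:

from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

v = vocab(OrderedDict())

# With c10::optional<int64_t> on the C++ side, an unset default index would
# surface as None in Python instead of the -1 sentinel (proposed behavior).
assert v.get_default_index() is None

v.insert_token('<unk>', 0)
v.set_default_index(0)
assert v.get_default_index() == 0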

@cpuhrsch
Copy link
Contributor

As an aside, while we're introducing this for the Vocab we should probably as a follow-up also introduce the same concepts to the Vectors class

@cpuhrsch (Contributor) left a comment

Before we merge this we also need to support reassignment. A special token might show up in the dataset and end up inadvertently mapped to the wrong index. For example, '<unk>' might show up in the dataset used to build this Vocab, but the user really wants it mapped to index 0 (which is what Vocab currently does).
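
To make the request concrete, a sketch of the scenario; the reassignment call at the end is hypothetical, and defining its behavior is exactly what this comment asks for:

from collections import OrderedDict
from torchtext.experimental.vocab import vocab  # assumed import path

# '<unk>' happened to appear in the corpus, so building from counts gives it
# whatever index its position implies rather than 0.
counts = OrderedDict([('the', 100), ('cat', 40), ('<unk>', 3)])
v = vocab(counts)
print(v['<unk>'])  # likely 2 here, not the 0 the user intended

# Desired: move the existing '<unk>' entry to index 0, i.e. reassignment
# semantics for insert_token (hypothetical, not defined in this PR yet):
# v.insert_token('<unk>', 0)
# assert v['<unk>'] == 0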

Successfully merging this pull request may close these issues.

[RFC] Special symbols in torchtext.experimental.Vocab
4 participants