Merge branch 'master' into update-xcode
parmeet authored Jun 23, 2021
2 parents d84e16f + e35562a commit fb93e93
Showing 26 changed files with 147 additions and 2,305 deletions.
21 changes: 21 additions & 0 deletions .circleci/cached_datasets_list.txt
@@ -0,0 +1,21 @@
IMDB
AG_NEWS
SogouNews
DBpedia
YelpReviewPolarity
YelpReviewFull
YahooAnswers
AmazonReviewPolarity
AmazonReviewFull
UDPOS
CoNLL2000Chunking
Multi30k
IWSLT2016
IWSLT2017
WMT14
WikiText2
WikiText103
PennTreebank
SQuAD1
SQuAD2
EnWik9
4 changes: 3 additions & 1 deletion .circleci/config.yml
@@ -44,7 +44,9 @@ commands:
    steps:
      - run:
          name: Generate CCI cache key
          command: echo "$(date "+%D")" > .cachekey
          command: |
            echo "$(date "+%D")" > .cachekey
            cat cached_datasets_list.txt >> .cachekey
      - persist_to_workspace:
          root: .
          paths:
4 changes: 3 additions & 1 deletion .circleci/config.yml.in
@@ -44,7 +44,9 @@ commands:
    steps:
      - run:
          name: Generate CCI cache key
          command: echo "$(date "+%D")" > .cachekey
          command: |
            echo "$(date "+%D")" > .cachekey
            cat cached_datasets_list.txt >> .cachekey
      - persist_to_workspace:
          root: .
          paths:
23 changes: 17 additions & 6 deletions README.rst
@@ -15,20 +15,22 @@ This repository consists of:
* `torchtext.datasets <https://github.com/pytorch/text/tree/master/torchtext/datasets>`_: The raw text iterators for common NLP datasets
* `torchtext.data <https://github.com/pytorch/text/tree/master/torchtext/data>`_: Some basic NLP building blocks (tokenizers, metrics, functionals, etc.)
* `torchtext.nn <https://github.com/pytorch/text/tree/master/torchtext/nn>`_: NLP related modules
* `torchtext.vocab <https://github.com/pytorch/text/tree/master/torchtext/vocab.py>`_: Vocab and Vectors related classes and factory functions
* `examples <https://github.com/pytorch/text/tree/master/examples>`_: Example NLP workflows with PyTorch and the torchtext library.

Note: the legacy code discussed in `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. Those legacy code will not be maintained by the development team, and we plan to fully remove them in the future release. See `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.
Note: The legacy code discussed in the `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to the `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. That legacy code will not be maintained by the development team, and we plan to remove it entirely in a future release. See the `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.

Installation
============

We recommend Anaconda as Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the detail of PyTorch installation. The following is the corresponding ``torchtext`` versions and supported Python versions.
We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for details on installing PyTorch. The following table lists the corresponding ``torchtext`` versions and supported Python versions.

.. csv-table:: Version Compatibility
   :header: "PyTorch version", "torchtext version", "Supported Python version"
   :widths: 10, 10, 10

   nightly build, master, 3.6+
   1.9, 0.10, 3.6+
   1.8, 0.9, 3.6+
   1.7, 0.8, 3.6+
   1.6, 0.7, 3.6+
@@ -93,7 +95,7 @@ Datasets
The datasets module currently contains:

* Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
* Machine translation: IWSLT2016, IWSLT2017
* Machine translation: IWSLT2016, IWSLT2017, Multi30k
* Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
* Question answering: SQuAD1, SQuAD2
* Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
@@ -113,15 +115,22 @@ For example, to access the raw text from the AG_NEWS dataset:
>>> from torchtext.datasets import AG_NEWS
>>> from torch.utils.data import DataLoader
>>> train_iter = AG_NEWS(split='train')
>>> dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
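
Each element yielded by the raw iterator is a plain ``(label, text)`` tuple. As a minimal sketch of peeking at one sample (the comments are illustrative; the actual values come from the dataset):

>>> label, text = next(iter(train_iter))
>>> label  # an integer class id for the article
>>> text   # the raw news text as a single string
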
A tutorial for the end-to-end text classification workflow can be found in `PyTorch tutorial <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
Tutorials
=========

To get started with torchtext, users may refer to the following tutorials available on the PyTorch website.

* `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
* `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
* `Language modeling using transformers and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_


[Prototype] Experimental Code
=============================

We have re-written several building blocks under ``torchtext.experimental``:

* `Transforms <https://github.com/pytorch/text/blob/master/torchtext/experimental/transforms.py>`_: some basic data processing building blocks
* `Vocabulary <https://github.com/pytorch/text/blob/master/torchtext/experimental/vocab.py>`_: a vocabulary to numericalize tokens
* `Vectors <https://github.com/pytorch/text/blob/master/torchtext/experimental/vectors.py>`_: the vectors to convert tokens into tensors
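
The experimental transforms are TorchScript-compatible modules. As a minimal sketch (usage inferred from the benchmark script updated in this commit), tokenizing a line with ``basic_english_normalize`` looks roughly like:

>>> import torch
>>> from torchtext.experimental.transforms import basic_english_normalize
>>> tokenizer = basic_english_normalize()
>>> tokens = tokenizer('You can now install TorchText using pip!')  # normalized, lower-cased tokens
>>> jit_tokenizer = torch.jit.script(tokenizer)  # scriptable, as the benchmark does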

These prototype building blocks in the experimental folder are available in the nightly release only. The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command::
@@ -133,7 +142,7 @@ For more detailed instructions, please refer to `Install PyTorch <https://pytorc
[BC Breaking] Legacy
====================

In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library and the motivation has been discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:
In the v0.9.0 release, we moved the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library; the motivation is discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:

* ``torchtext.legacy.data.field``
* ``torchtext.legacy.data.batch``
@@ -144,6 +153,8 @@ In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https

We have a `migration tutorial <https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb>`_ to help users switch to the torchtext datasets in the ``v0.9.0`` release. Users who still want the legacy components can add ``legacy`` to the import path.

In the v0.10.0 release, we retired the Vocab class to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. Users can still access the legacy Vocab via ``torchtext.legacy.vocab``. The class has been replaced by a new Vocab module backed by an efficient C++ implementation that provides common functional APIs for NLP workflows.
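
As a quick sketch of the new API (built from the factory functions used elsewhere in this commit; the example tokens are illustrative):

>>> from torchtext.vocab import build_vocab_from_iterator
>>> vocab = build_vocab_from_iterator([['torchtext', 'is', 'handy'], ['vocab', 'is', 'fast']], specials=['<unk>', '<pad>'])
>>> vocab.set_default_index(vocab['<unk>'])  # out-of-vocabulary tokens map to '<unk>'
>>> vocab.lookup_indices(['vocab', 'unseen'])  # batch lookup backed by the C++ implementation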

Disclaimer on Datasets
======================

@@ -6,28 +6,40 @@
from timeit import default_timer as timer
from matplotlib import pyplot as plt
import torch
from torchtext.experimental.datasets import DATASETS
from torchtext.datasets import DATASETS
from torchtext.experimental.vocab_factory import (
    load_vocab_from_file,
    build_vocab_from_text_file
)
from torchtext.vocab import vocab as VocabExperimental
from torchtext.vocab import build_vocab_from_iterator
from torchtext.vocab import vocab as VocabNew
from torchtext.legacy.vocab import (
    Vocab,
    build_vocab_from_iterator
    build_vocab_from_iterator as build_vocab_from_iterator_legacy,
)
from torchtext.experimental.transforms import basic_english_normalize
from torchtext.experimental.transforms import (
    basic_english_normalize,
)
from torchtext.data.utils import get_tokenizer

def build_vocab(data, transforms):
    # Build a new-style Vocab from (label, line) pairs: tokenize each line with
    # `transforms`, then construct the vocabulary with '<unk>'/'<pad>' specials.
    def apply_transforms(data):
        for _, line in data:
            yield transforms(line)
    vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
    vocab.set_default_index(vocab['<unk>'])
    return vocab
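
# A hypothetical usage sketch of build_vocab (illustrative only; not executed by
# this benchmark): pair a raw dataset iterator with a tokenizer transform, e.g.
#   vocab = build_vocab(DATASETS['AG_NEWS'](split='train'), get_tokenizer('basic_english'))
#   vocab.lookup_indices(['hello', 'world'])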


def compare_legacy_and_experimental_batch_lookup():
def compare_legacy_and_new_batch_lookup():
    num_tokens = 1000
    num_letters = 6
    num_lines = 100000
    vocab = [''.join(random.sample(string.ascii_letters * num_letters, num_letters)) for _ in range(num_tokens)]
    counter = Counter()
    counter.update(vocab)
    legacy_vocab = Vocab(counter)
    experimental_vocab = VocabExperimental(counter)
    new_vocab = VocabNew(counter)
    speed_ups = []
    token_lengths = [i for i in range(2, 100)]
    for i in token_lengths:
@@ -39,12 +51,12 @@ def compare_legacy_and_new_batch_lookup():

        start_time = timer()
        for text in lines:
            experimental_vocab.lookup_indices(text)
            new_vocab.lookup_indices(text)

        experimental_time = timer() - start_time
        new_time = timer() - start_time

        speed_ups.append(legacy_time / experimental_time)
        print("speed-up={} for average length={}".format(legacy_time / experimental_time, i))
        speed_ups.append(legacy_time / new_time)
        print("speed-up={} for average length={}".format(legacy_time / new_time, i))
        del lines

    plt.close()
@@ -89,10 +101,10 @@ def token_iterator(lines):
            for token in tokenize(line):
                yield token

    return build_vocab_from_iterator(token_iterator(file_like_object))
    return build_vocab_from_iterator_legacy(token_iterator(file_like_object))


def benchmark_experimental_vocab_construction(vocab_file_path, is_raw_text=True, is_legacy=True, num_iters=1):
def benchmark_new_vocab_construction(vocab_file_path, is_raw_text=True, is_legacy=True, num_iters=1):
    f = open(vocab_file_path, 'r')
    t0 = time.monotonic()
    if is_raw_text:
@@ -107,15 +119,15 @@ def benchmark_experimental_vocab_construction(vocab_file_path, is_raw_text=True,
            for _ in range(num_iters):
                tokenizer = basic_english_normalize()
                jited_tokenizer = torch.jit.script(tokenizer)
                build_vocab_from_text_file(f, jited_tokenizer, num_cpus=1)
                build_vocab_from_text_file(vocab_file_path, jited_tokenizer, num_cpus=1)
            print("Construction time:", time.monotonic() - t0)
    else:
        for _ in range(num_iters):
            load_vocab_from_file(f)
        print("Construction time:", time.monotonic() - t0)


def benchmark_experimental_vocab_lookup(vocab_file_path=None, dataset='AG_NEWS'):
def benchmark_new_vocab_lookup(vocab_file_path=None, dataset='AG_NEWS'):
    def _run_benchmark_lookup(tokens, vocab):
        t0 = time.monotonic()
        # list lookup
@@ -132,15 +144,11 @@ def _run_benchmark_lookup(tokens, vocab):

    tokens = []
    tokens_lists = []

    train = DATASETS[dataset](split='train')
    vocab = train.get_vocab()
    for (_, text) in train:
        cur_tokens = []
        for id in text.tolist():
            cur_tokens.append(vocab.itos[id])
        tokens_lists.append(cur_tokens)
        tokens += cur_tokens
    tokenizer = get_tokenizer("basic_english")
    for (_, text) in DATASETS[dataset](split='train'):
        cur_tokens = tokenizer(text)
        tokens_lists.append(cur_tokens)
        tokens += cur_tokens

    if vocab_file_path:
        print("Loading Vocab from file {}".format(vocab_file_path))
@@ -153,14 +161,14 @@ def token_iterator(file_path):
        # existing Vocab construction
        print("Vocab")
        t0 = time.monotonic()
        v_existing = build_vocab_from_iterator(token_iterator(vocab_file_path))
        v_existing = build_vocab_from_iterator_legacy(token_iterator(vocab_file_path))
        print("Construction time:", time.monotonic() - t0)

        # experimental Vocab construction
        print("Vocab Experimental")
        # new Vocab construction
        print("Vocab New")
        t0 = time.monotonic()
        f = open(vocab_file_path, 'r')
        v_experimental = load_vocab_from_file(f)
        v_new = load_vocab_from_file(f)
        print("Construction time:", time.monotonic() - t0)
    else:
        print("Loading Vocab from {}".format(dataset))
@@ -174,31 +182,31 @@ def token_iterator(file_path):
        v_existing = Vocab(counter)
        print("Construction time:", time.monotonic() - t0)

        # experimental Vocab construction
        print("Vocab Experimental")
        # new Vocab construction
        print("Vocab New")
        t0 = time.monotonic()
        v_experimental = VocabExperimental(ordered_dict)
        v_new = VocabNew(ordered_dict)
        print("Construction time:", time.monotonic() - t0)
        jit_v_experimental = torch.jit.script(v_experimental)
        jit_v_new = torch.jit.script(v_new)

    # existing Vocab eager lookup
    print("Vocab - Eager Mode")
    _run_benchmark_lookup(tokens, v_existing)
    _run_benchmark_lookup([tokens], v_existing)
    _run_benchmark_lookup(tokens_lists, v_existing)

    # experimental Vocab eager lookup
    print("Vocab Experimental - Eager Mode")
    _run_benchmark_lookup(tokens, v_experimental)
    _run_benchmark_lookup([tokens], v_experimental)
    _run_benchmark_lookup(tokens_lists, v_experimental)
    # new Vocab eager lookup
    print("Vocab New - Eager Mode")
    _run_benchmark_lookup(tokens, v_new)
    _run_benchmark_lookup([tokens], v_new)
    _run_benchmark_lookup(tokens_lists, v_new)

    jit_v_experimental = torch.jit.script(v_experimental)
    # experimental Vocab jit lookup
    print("Vocab Experimental - Jit Mode")
    _run_benchmark_lookup(tokens, jit_v_experimental)
    _run_benchmark_lookup([tokens], jit_v_experimental)
    _run_benchmark_lookup(tokens_lists, jit_v_experimental)
    jit_v_new = torch.jit.script(v_new)
    # new Vocab jit lookup
    print("Vocab New - Jit Mode")
    _run_benchmark_lookup(tokens, jit_v_new)
    _run_benchmark_lookup([tokens], jit_v_new)
    _run_benchmark_lookup(tokens_lists, jit_v_new)


if __name__ == "__main__":
@@ -219,7 +227,7 @@ def token_iterator(file_path):

    if args.run_construction_benchmark:
        print("is_legacy", args.is_legacy)
        benchmark_experimental_vocab_construction(args.vocab_filename_construction,
        benchmark_new_vocab_construction(args.vocab_filename_construction,
                                         is_raw_text=args.is_raw_text, is_legacy=args.is_legacy)
    else:
        benchmark_experimental_vocab_lookup(args.vocab_filename_lookup, args.dataset)
        benchmark_new_vocab_lookup(args.vocab_filename_lookup, args.dataset)