Merge branch 'master' into update-xcode
parmeet authored Jun 23, 2021
2 parents d84e16f + e35562a commit fb93e93
Showing 26 changed files with 147 additions and 2,305 deletions.
21 changes: 21 additions & 0 deletions .circleci/cached_datasets_list.txt
@@ -0,0 +1,21 @@
IMDB
AG_NEWS
SogouNews
DBpedia
YelpReviewPolarity
YelpReviewFull
YahooAnswers
AmazonReviewPolarity
AmazonReviewFull
UDPOS
CoNLL2000Chunking
Multi30k
IWSLT2016
IWSLT2017
WMT14
WikiText2
WikiText103
PennTreebank
SQuAD1
SQuAD2
EnWik9
4 changes: 3 additions & 1 deletion .circleci/config.yml
@@ -44,7 +44,9 @@ commands:
    steps:
      - run:
          name: Generate CCI cache key
          command: echo "$(date "+%D")" > .cachekey
          command: |
            echo "$(date "+%D")" > .cachekey
            cat cached_datasets_list.txt >> .cachekey
      - persist_to_workspace:
          root: .
          paths:
4 changes: 3 additions & 1 deletion .circleci/config.yml.in
@@ -44,7 +44,9 @@ commands:
    steps:
      - run:
          name: Generate CCI cache key
          command: echo "$(date "+%D")" > .cachekey
          command: |
            echo "$(date "+%D")" > .cachekey
            cat cached_datasets_list.txt >> .cachekey
      - persist_to_workspace:
          root: .
          paths:
23 changes: 17 additions & 6 deletions README.rst
@@ -15,20 +15,22 @@ This repository consists of:
* `torchtext.datasets <https://github.com/pytorch/text/tree/master/torchtext/datasets>`_: The raw text iterators for common NLP datasets
* `torchtext.data <https://github.com/pytorch/text/tree/master/torchtext/data>`_: Some basic NLP building blocks (tokenizers, metrics, functionals, etc.)
* `torchtext.nn <https://github.com/pytorch/text/tree/master/torchtext/nn>`_: NLP related modules
* `torchtext.vocab <https://github.com/pytorch/text/tree/master/torchtext/vocab.py>`_: Vocab and Vectors related classes and factory functions
* `examples <https://github.com/pytorch/text/tree/master/examples>`_: Example NLP workflows with PyTorch and the torchtext library.

Note: the legacy code discussed in `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. Those legacy code will not be maintained by the development team, and we plan to fully remove them in the future release. See `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.
Note: The legacy code discussed in the `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to the `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. That legacy code will not be maintained by the development team, and we plan to remove it entirely in a future release. See the `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.

Installation
============

We recommend Anaconda as Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the detail of PyTorch installation. The following is the corresponding ``torchtext`` versions and supported Python versions.
We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for details on installing PyTorch. The following table lists the corresponding ``torchtext`` versions and supported Python versions.

.. csv-table:: Version Compatibility
   :header: "PyTorch version", "torchtext version", "Supported Python version"
   :widths: 10, 10, 10

   nightly build, master, 3.6+
   1.9, 0.10, 3.6+
   1.8, 0.9, 3.6+
   1.7, 0.8, 3.6+
   1.6, 0.7, 3.6+
@@ -93,7 +95,7 @@ Datasets
The datasets module currently contains:

* Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
* Machine translation: IWSLT2016, IWSLT2017
* Machine translation: IWSLT2016, IWSLT2017, Multi30k
* Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
* Question answering: SQuAD1, SQuAD2
* Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
@@ -113,15 +115,22 @@ For example, to access the raw text from the AG_NEWS dataset:
>>> from torchtext.datasets import AG_NEWS
>>> from torch.utils.data import DataLoader
>>> train_iter = AG_NEWS(split='train')
>>> dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
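
Each element yielded by the raw iterator is a plain ``(label, text)`` tuple. As a minimal sketch of peeking at one sample (the comments are illustrative; the actual values come from the dataset):

>>> label, text = next(iter(train_iter))
>>> label  # an integer class id for the article
>>> text   # the raw news text as a single string
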
A tutorial for the end-to-end text classification workflow can be found in `PyTorch tutorial <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
Tutorials
=========

To get started with torchtext, users may refer to the following tutorials available on the PyTorch website.

* `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
* `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
* `Language modeling using transformers and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_


[Prototype] Experimental Code
=============================

We have re-written several building blocks under ``torchtext.experimental``:

* `Transforms <https://github.com/pytorch/text/blob/master/torchtext/experimental/transforms.py>`_: some basic data processing building blocks
* `Vocabulary <https://github.com/pytorch/text/blob/master/torchtext/experimental/vocab.py>`_: a vocabulary to numericalize tokens
* `Vectors <https://github.com/pytorch/text/blob/master/torchtext/experimental/vectors.py>`_: the vectors to convert tokens into tensors
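
The experimental transforms are TorchScript-compatible modules. As a minimal sketch (usage inferred from the benchmark script updated in this commit), tokenizing a line with ``basic_english_normalize`` looks roughly like:

>>> import torch
>>> from torchtext.experimental.transforms import basic_english_normalize
>>> tokenizer = basic_english_normalize()
>>> tokens = tokenizer('You can now install TorchText using pip!')  # normalized, lower-cased tokens
>>> jit_tokenizer = torch.jit.script(tokenizer)  # scriptable, as the benchmark does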

These prototype building blocks in the experimental folder are available in the nightly release only. The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command::
@@ -133,7 +142,7 @@ For more detailed instructions, please refer to `Install PyTorch <https://pytorc
[BC Breaking] Legacy
====================

In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library and the motivation has been discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:
In the v0.9.0 release, we moved the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library; the motivation is discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:

* ``torchtext.legacy.data.field``
* ``torchtext.legacy.data.batch``
@@ -144,6 +153,8 @@ In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https

We have a `migration tutorial <https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb>`_ to help users switch to the torchtext datasets in the ``v0.9.0`` release. Users who still want the legacy components can add ``legacy`` to the import path.

In the v0.10.0 release, we retired the Vocab class to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. Users can still access the legacy Vocab via ``torchtext.legacy.vocab``. The class has been replaced by a new Vocab module backed by an efficient C++ implementation that provides common functional APIs for NLP workflows.
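
As a quick sketch of the new API (built from the factory functions used elsewhere in this commit; the example tokens are illustrative):

>>> from torchtext.vocab import build_vocab_from_iterator
>>> vocab = build_vocab_from_iterator([['torchtext', 'is', 'handy'], ['vocab', 'is', 'fast']], specials=['<unk>', '<pad>'])
>>> vocab.set_default_index(vocab['<unk>'])  # out-of-vocabulary tokens map to '<unk>'
>>> vocab.lookup_indices(['vocab', 'unseen'])  # batch lookup backed by the C++ implementation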

Disclaimer on Datasets
======================

@@ -6,28 +6,40 @@
from timeit import default_timer as timer
from matplotlib import pyplot as plt
import torch
from torchtext.experimental.datasets import DATASETS
from torchtext.datasets import DATASETS
from torchtext.experimental.vocab_factory import (
    load_vocab_from_file,
    build_vocab_from_text_file
)
from torchtext.vocab import vocab as VocabExperimental
from torchtext.vocab import build_vocab_from_iterator
from torchtext.vocab import vocab as VocabNew
from torchtext.legacy.vocab import (
    Vocab,
    build_vocab_from_iterator
    build_vocab_from_iterator as build_vocab_from_iterator_legacy,
)
from torchtext.experimental.transforms import basic_english_normalize
from torchtext.experimental.transforms import (
    basic_english_normalize,
)
from torchtext.data.utils import get_tokenizer

def build_vocab(data, transforms):
    # Build a new-style Vocab from (label, line) pairs: tokenize each line with
    # `transforms`, then construct the vocabulary with '<unk>'/'<pad>' specials.
    def apply_transforms(data):
        for _, line in data:
            yield transforms(line)
    vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
    vocab.set_default_index(vocab['<unk>'])
    return vocab
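
# A hypothetical usage sketch of build_vocab (illustrative only; not executed by
# this benchmark): pair a raw dataset iterator with a tokenizer transform, e.g.
#   vocab = build_vocab(DATASETS['AG_NEWS'](split='train'), get_tokenizer('basic_english'))
#   vocab.lookup_indices(['hello', 'world'])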


def compare_legacy_and_experimental_batch_lookup():
def compare_legacy_and_new_batch_lookup():
    num_tokens = 1000
    num_letters = 6
    num_lines = 100000
    vocab = [''.join(random.sample(string.ascii_letters * num_letters, num_letters)) for _ in range(num_tokens)]
    counter = Counter()
    counter.update(vocab)
    legacy_vocab = Vocab(counter)
    experimental_vocab = VocabExperimental(counter)
    new_vocab = VocabNew(counter)
    speed_ups = []
    token_lengths = [i for i in range(2, 100)]
    for i in token_lengths:
@@ -39,12 +51,12 @@ def compare_legacy_and_new_batch_lookup():

        start_time = timer()
        for text in lines:
            experimental_vocab.lookup_indices(text)
            new_vocab.lookup_indices(text)

        experimental_time = timer() - start_time
        new_time = timer() - start_time

        speed_ups.append(legacy_time / experimental_time)
        print("speed-up={} for average length={}".format(legacy_time / experimental_time, i))
        speed_ups.append(legacy_time / new_time)
        print("speed-up={} for average length={}".format(legacy_time / new_time, i))
        del lines

    plt.close()
@@ -89,10 +101,10 @@ def token_iterator(lines):
            for token in tokenize(line):
                yield token

    return build_vocab_from_iterator(token_iterator(file_like_object))
    return build_vocab_from_iterator_legacy(token_iterator(file_like_object))


def benchmark_experimental_vocab_construction(vocab_file_path, is_raw_text=True, is_legacy=True, num_iters=1):
def benchmark_new_vocab_construction(vocab_file_path, is_raw_text=True, is_legacy=True, num_iters=1):
    f = open(vocab_file_path, 'r')
    t0 = time.monotonic()
    if is_raw_text:
@@ -107,15 +119,15 @@ def benchmark_experimental_vocab_construction(vocab_file_path, is_raw_text=True,
            for _ in range(num_iters):
                tokenizer = basic_english_normalize()
                jited_tokenizer = torch.jit.script(tokenizer)
                build_vocab_from_text_file(f, jited_tokenizer, num_cpus=1)
                build_vocab_from_text_file(vocab_file_path, jited_tokenizer, num_cpus=1)
            print("Construction time:", time.monotonic() - t0)
    else:
        for _ in range(num_iters):
            load_vocab_from_file(f)
        print("Construction time:", time.monotonic() - t0)


def benchmark_experimental_vocab_lookup(vocab_file_path=None, dataset='AG_NEWS'):
def benchmark_new_vocab_lookup(vocab_file_path=None, dataset='AG_NEWS'):
    def _run_benchmark_lookup(tokens, vocab):
        t0 = time.monotonic()
        # list lookup
@@ -132,15 +144,11 @@ def _run_benchmark_lookup(tokens, vocab):

    tokens = []
    tokens_lists = []

    train = DATASETS[dataset](split='train')
    vocab = train.get_vocab()
    for (_, text) in train:
        cur_tokens = []
        for id in text.tolist():
            cur_tokens.append(vocab.itos[id])
        tokens_lists.append(cur_tokens)
        tokens += cur_tokens
    tokenizer = get_tokenizer("basic_english")
    for (_, text) in DATASETS[dataset](split='train'):
        cur_tokens = tokenizer(text)
        tokens_lists.append(cur_tokens)
        tokens += cur_tokens

    if vocab_file_path:
        print("Loading Vocab from file {}".format(vocab_file_path))
@@ -153,14 +161,14 @@ def token_iterator(file_path):
        # existing Vocab construction
        print("Vocab")
        t0 = time.monotonic()
        v_existing = build_vocab_from_iterator(token_iterator(vocab_file_path))
        v_existing = build_vocab_from_iterator_legacy(token_iterator(vocab_file_path))
        print("Construction time:", time.monotonic() - t0)

        # experimental Vocab construction
        print("Vocab Experimental")
        # new Vocab construction
        print("Vocab New")
        t0 = time.monotonic()
        f = open(vocab_file_path, 'r')
        v_experimental = load_vocab_from_file(f)
        v_new = load_vocab_from_file(f)
        print("Construction time:", time.monotonic() - t0)
    else:
        print("Loading Vocab from {}".format(dataset))
@@ -174,31 +182,31 @@ def token_iterator(file_path):
        v_existing = Vocab(counter)
        print("Construction time:", time.monotonic() - t0)

        # experimental Vocab construction
        print("Vocab Experimental")
        # new Vocab construction
        print("Vocab New")
        t0 = time.monotonic()
        v_experimental = VocabExperimental(ordered_dict)
        v_new = VocabNew(ordered_dict)
        print("Construction time:", time.monotonic() - t0)
        jit_v_experimental = torch.jit.script(v_experimental)
        jit_v_new = torch.jit.script(v_new)

    # existing Vocab eager lookup
    print("Vocab - Eager Mode")
    _run_benchmark_lookup(tokens, v_existing)
    _run_benchmark_lookup([tokens], v_existing)
    _run_benchmark_lookup(tokens_lists, v_existing)

    # experimental Vocab eager lookup
    print("Vocab Experimental - Eager Mode")
    _run_benchmark_lookup(tokens, v_experimental)
    _run_benchmark_lookup([tokens], v_experimental)
    _run_benchmark_lookup(tokens_lists, v_experimental)
    # new Vocab eager lookup
    print("Vocab New - Eager Mode")
    _run_benchmark_lookup(tokens, v_new)
    _run_benchmark_lookup([tokens], v_new)
    _run_benchmark_lookup(tokens_lists, v_new)

    jit_v_experimental = torch.jit.script(v_experimental)
    # experimental Vocab jit lookup
    print("Vocab Experimental - Jit Mode")
    _run_benchmark_lookup(tokens, jit_v_experimental)
    _run_benchmark_lookup([tokens], jit_v_experimental)
    _run_benchmark_lookup(tokens_lists, jit_v_experimental)
    jit_v_new = torch.jit.script(v_new)
    # new Vocab jit lookup
    print("Vocab New - Jit Mode")
    _run_benchmark_lookup(tokens, jit_v_new)
    _run_benchmark_lookup([tokens], jit_v_new)
    _run_benchmark_lookup(tokens_lists, jit_v_new)


if __name__ == "__main__":
@@ -219,7 +227,7 @@ def token_iterator(file_path):

    if args.run_construction_benchmark:
        print("is_legacy", args.is_legacy)
        benchmark_experimental_vocab_construction(args.vocab_filename_construction,
        benchmark_new_vocab_construction(args.vocab_filename_construction,
                                         is_raw_text=args.is_raw_text, is_legacy=args.is_legacy)
    else:
        benchmark_experimental_vocab_lookup(args.vocab_filename_lookup, args.dataset)
        benchmark_new_vocab_lookup(args.vocab_filename_lookup, args.dataset)