Commit
Import torchtext #1325 57a1df3
Reviewed By: NicolasHug

Differential Revision: D28994054

fbshipit-source-id: 4c679f56ef37b18f6d2acaaaed8518facbeaa41c
mthrok authored and facebook-github-bot committed Jun 9, 2021
1 parent e9d7593 commit c56bfbd
Showing 8 changed files with 47 additions and 52 deletions.
23 changes: 17 additions & 6 deletions README.rst
@@ -15,20 +15,22 @@ This repository consists of:
* `torchtext.datasets <https://github.com/pytorch/text/tree/master/torchtext/datasets>`_: The raw text iterators for common NLP datasets
* `torchtext.data <https://github.com/pytorch/text/tree/master/torchtext/data>`_: Some basic NLP building blocks (tokenizers, metrics, functionals etc.)
* `torchtext.nn <https://github.com/pytorch/text/tree/master/torchtext/nn>`_: NLP related modules
* `torchtext.vocab <https://github.com/pytorch/text/tree/master/torchtext/vocab.py>`_: Vocab and Vectors related classes and factory functions
* `examples <https://github.com/pytorch/text/tree/master/examples>`_: Example NLP workflows with PyTorch and the torchtext library.

-Note: the legacy code discussed in `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. Those legacy code will not be maintained by the development team, and we plan to fully remove them in the future release. See `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.
+Note: The legacy code discussed in the `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to the `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder. This legacy code will not be maintained by the development team, and we plan to fully remove it in a future release. See the `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_ folder for more details.

Installation
============

-We recommend Anaconda as Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for the detail of PyTorch installation. The following is the corresponding ``torchtext`` versions and supported Python versions.
+We recommend Anaconda as a Python package management system. Please refer to `pytorch.org <https://pytorch.org/>`_ for details on installing PyTorch. The following are the corresponding ``torchtext`` versions and supported Python versions.

.. csv-table:: Version Compatibility
:header: "PyTorch version", "torchtext version", "Supported Python version"
:widths: 10, 10, 10

nightly build, master, 3.6+
+1.9, 0.10, 3.6+
1.8, 0.9, 3.6+
1.7, 0.8, 3.6+
1.6, 0.7, 3.6+
@@ -93,7 +95,7 @@ Datasets
The datasets module currently contains:

* Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
-* Machine translation: IWSLT2016, IWSLT2017
+* Machine translation: IWSLT2016, IWSLT2017, Multi30k
* Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
* Question answering: SQuAD1, SQuAD2
* Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
@@ -113,15 +115,22 @@ For example, to access the raw text from the AG_NEWS dataset:
>>> from torchtext.datasets import AG_NEWS
>>> from torch.utils.data import DataLoader
>>> train_iter = AG_NEWS(split='train')
>>> dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
-A tutorial for the end-to-end text classification workflow can be found in `PyTorch tutorial <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
+Tutorials
+=========
+
+To get started with torchtext, users may refer to the following tutorials, available on the PyTorch website.
+
+* `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
+* `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
+* `Language modeling using transformers and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_


[Prototype] Experimental Code
=============================

We have re-written several building blocks under ``torchtext.experimental``:

* `Transforms <https://github.com/pytorch/text/blob/master/torchtext/experimental/transforms.py>`_: some basic data processing building blocks
* `Vocabulary <https://github.com/pytorch/text/blob/master/torchtext/experimental/vocab.py>`_: a vocabulary to numericalize tokens
* `Vectors <https://github.com/pytorch/text/blob/master/torchtext/experimental/vectors.py>`_: vectors to convert tokens into tensors

These prototype building blocks in the experimental folder are available in the nightly release only. The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command::
@@ -133,7 +142,7 @@ For more detailed instructions, please refer to `Install PyTorch <https://pytorc
[BC Breaking] Legacy
====================

-In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library and the motivation has been discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:
+In the v0.9.0 release, we moved the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. This is part of the work to revamp the torchtext library; the motivation is discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:

* ``torchtext.legacy.data.field``
* ``torchtext.legacy.data.batch``
@@ -144,6 +153,8 @@ In v0.9.0 release, we move the following legacy code to `torchtext.legacy <https

We have a `migration tutorial <https://colab.research.google.com/github/pytorch/text/blob/master/examples/legacy_tutorial/migration_tutorial.ipynb>`_ to help users switch to the torchtext datasets in the ``v0.9.0`` release. Users who still want the legacy components can add ``legacy`` to the import path.

+In the v0.10.0 release, we retired the Vocab class, moving it to `torchtext.legacy <https://github.com/pytorch/text/tree/master/torchtext/legacy>`_. Users can still access the legacy Vocab via ``torchtext.legacy.vocab``. It has been replaced by a new Vocab module backed by an efficient C++ implementation that provides common functional APIs for NLP workflows.
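
The snippet below is a minimal sketch of the new Vocab API next to the legacy import path. The tokens and counts are illustrative, and it assumes a torchtext 0.10 (or nightly) install::

    >>> from collections import OrderedDict
    >>> from torchtext.vocab import vocab  # new C++-backed Vocab factory
    >>> # from torchtext.legacy.vocab import Vocab  # the legacy class remains importable
    >>> v = vocab(OrderedDict([('hello', 4), ('world', 3)]))
    >>> v.insert_token('<unk>', 0)
    >>> v.set_default_index(v['<unk>'])  # out-of-vocabulary tokens map to <unk>
    >>> v['hello']  # known token -> its index
    1
    >>> v['foo']    # unknown token -> default index
    0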

Disclaimer on Datasets
======================

35 changes: 4 additions & 31 deletions test/data/test_builtin_datasets.py
@@ -207,23 +207,14 @@ def test_next_method_dataset(self):

def test_imdb(self):
from torchtext.experimental.datasets import IMDB
-from torchtext.legacy.vocab import Vocab
# smoke test to ensure imdb works properly
train_dataset, test_dataset = IMDB()
self._helper_test_func(len(train_dataset), 25000, train_dataset[0][1][:10],
[13, 1568, 13, 246, 35468, 43, 64, 398, 1135, 92])
self._helper_test_func(len(test_dataset), 25000, test_dataset[0][1][:10],
[13, 125, 1051, 5, 246, 1652, 8, 277, 66, 20])

-# Test API with a vocab input object
-old_vocab = train_dataset.get_vocab()
-new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
-new_train_data, new_test_data = IMDB(vocab=new_vocab)

# Add test for the subset of the standard datasets
-train_dataset = IMDB(split='train')
-self._helper_test_func(len(train_dataset), 25000, train_dataset[0][1][:10],
-[13, 1568, 13, 246, 35468, 43, 64, 398, 1135, 92])
train_iter, test_iter = torchtext.datasets.IMDB()
self._helper_test_func(len(train_iter), 25000, next(train_iter)[1][:25], 'I rented I AM CURIOUS-YEL')
self._helper_test_func(len(test_iter), 25000, next(test_iter)[1][:25], 'I love sci-fi and am will')
@@ -241,8 +232,8 @@ def test_iwslt2017(self):
de_vocab, en_vocab = train_dataset.get_vocab()

def assert_nth_pair_is_equal(n, expected_sentence_pair):
-de_sentence = [de_vocab.itos[index] for index in train_dataset[n][0]]
-en_sentence = [en_vocab.itos[index] for index in train_dataset[n][1]]
+de_sentence = [de_vocab.lookup_token(index) for index in train_dataset[n][0]]
+en_sentence = [en_vocab.lookup_token(index) for index in train_dataset[n][1]]

expected_de_sentence, expected_en_sentence = expected_sentence_pair

@@ -267,8 +258,8 @@ def test_iwslt2016(self):
de_vocab, en_vocab = train_dataset.get_vocab()

def assert_nth_pair_is_equal(n, expected_sentence_pair):
-de_sentence = [de_vocab.itos[index] for index in train_dataset[n][0]]
-en_sentence = [en_vocab.itos[index] for index in train_dataset[n][1]]
+de_sentence = [de_vocab.lookup_token(index) for index in train_dataset[n][0]]
+en_sentence = [en_vocab.lookup_token(index) for index in train_dataset[n][1]]
expected_de_sentence, expected_en_sentence = expected_sentence_pair

self.assertEqual(de_sentence, expected_de_sentence)
@@ -462,7 +453,6 @@ def test_conll_sequence_tagging(self):

def test_squad1(self):
from torchtext.experimental.datasets import SQuAD1
-from torchtext.legacy.vocab import Vocab
# smoke test to ensure imdb works properly
train_dataset, dev_dataset = SQuAD1()
context, question, answers, ans_pos = train_dataset[100]
@@ -472,16 +462,8 @@ def test_squad1(self):
self._helper_test_func(len(dev_dataset), 10570, (question, ans_pos[0]),
([42, 27, 669, 7438, 17, 2, 1950, 3273, 17252, 389, 16], [45, 48]))

-# Test API with a vocab input object
-old_vocab = train_dataset.get_vocab()
-new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
-new_train_data, new_test_data = SQuAD1(vocab=new_vocab)

# Add test for the subset of the standard datasets
-train_dataset = SQuAD1(split='train')
-context, question, answers, ans_pos = train_dataset[100]
-self._helper_test_func(len(train_dataset), 87599, (question[:5], ans_pos[0]),
-([7, 24, 86, 52, 2], [72, 72]))
train_iter, dev_iter = torchtext.datasets.SQuAD1()
self._helper_test_func(len(train_iter), 87599, next(train_iter)[0][:50],
'Architecturally, the school has a Catholic charact')
@@ -491,7 +473,6 @@ def test_squad2(self):

def test_squad2(self):
from torchtext.experimental.datasets import SQuAD2
-from torchtext.legacy.vocab import Vocab
# smoke test to ensure imdb works properly
train_dataset, dev_dataset = SQuAD2()
context, question, answers, ans_pos = train_dataset[200]
@@ -501,16 +482,8 @@
self._helper_test_func(len(dev_dataset), 11873, (question, ans_pos[0]),
([41, 29, 2, 66, 17016, 30, 0, 1955, 16], [40, 46]))

-# Test API with a vocab input object
-old_vocab = train_dataset.get_vocab()
-new_vocab = Vocab(counter=old_vocab.freqs, max_size=2500)
-new_train_data, new_test_data = SQuAD2(vocab=new_vocab)

# Add test for the subset of the standard datasets
-train_dataset = SQuAD2(split='train')
-context, question, answers, ans_pos = train_dataset[200]
-self._helper_test_func(len(train_dataset), 130319, (question[:5], ans_pos[0]),
-([84, 50, 1421, 12, 5439], [9, 9]))
train_iter, dev_iter = torchtext.datasets.SQuAD2()
self._helper_test_func(len(train_iter), 130319, next(train_iter)[0][:50],
'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-Y')
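
The main API shift in the test changes above is from the legacy ``itos`` list to the new ``lookup_token`` method. A minimal sketch of the two styles, assuming a vocab built with the torchtext 0.10 API::

    >>> from torchtext.vocab import build_vocab_from_iterator
    >>> v = build_vocab_from_iterator([['hello', 'world']], specials=['<unk>'])
    >>> v.set_default_index(v['<unk>'])
    >>> index = v['hello']
    >>> # legacy API: v.itos[index]
    >>> v.lookup_token(index)  # new API
    'hello'
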
6 changes: 4 additions & 2 deletions torchtext/experimental/datasets/language_modeling.py
@@ -1,7 +1,7 @@
import torch
import logging
from torchtext.data.utils import get_tokenizer
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
from torchtext import datasets as raw
from torchtext.experimental.datasets import raw as experimental_raw
from torchtext.data.datasets_utils import _check_default_set
@@ -15,7 +15,9 @@ def apply_transforms(data):
for line in data:
tokens = transforms(line)
yield tokens
-return build_vocab_from_iterator(apply_transforms(data), len(data))
+vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
+vocab.set_default_index(vocab['<unk>'])
+return vocab


class LanguageModelingDataset(torch.utils.data.Dataset):
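
Across the experimental dataset files in this commit, the replacement pattern is the same: the legacy ``build_vocab_from_iterator(iterator, num_lines)`` call becomes the new ``torchtext.vocab`` version with explicit ``specials``, followed by ``set_default_index`` so out-of-vocabulary lookups fall back to ``<unk>`` instead of raising. A small self-contained sketch with illustrative tokens::

    >>> from torchtext.vocab import build_vocab_from_iterator
    >>> def yield_tokens():
    ...     for line in ['the quick brown fox', 'the lazy dog']:
    ...         yield line.split()
    >>> vocab = build_vocab_from_iterator(yield_tokens(), specials=['<unk>', '<pad>'])
    >>> vocab.set_default_index(vocab['<unk>'])
    >>> [vocab[t] for t in 'the quick unicorn'.split()]  # 'unicorn' falls back to <unk>
    [2, 7, 0]
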
5 changes: 3 additions & 2 deletions torchtext/experimental/datasets/question_answer.py
@@ -1,7 +1,7 @@
import torch
import logging
from torchtext.data.utils import get_tokenizer
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
from torchtext import datasets as raw
from torchtext.data.datasets_utils import _check_default_set
from torchtext.data.datasets_utils import _wrap_datasets
@@ -81,7 +81,8 @@ def apply_transform(data):
tok_ans += text_transform(item)
yield text_transform(_context) + text_transform(_question) + tok_ans
logger_.info('Building Vocab based on train data')
-vocab = build_vocab_from_iterator(apply_transform(raw_data['train']), len(raw_data['train']))
+vocab = build_vocab_from_iterator(apply_transform(raw_data['train']), specials=['<unk>', '<pad>'])
+vocab.set_default_index(vocab['<unk>'])
logger_.info('Vocab has %d entries', len(vocab))
text_transform = sequential_transforms(text_transform, vocab_func(vocab), totensor(dtype=torch.long))
transforms = {'context': text_transform, 'question': text_transform,
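
The transform pipeline assembled here composes its steps left to right: tokenize, numericalize through the vocab, then convert to a tensor. A hedged sketch using the experimental helpers this module imports, with an illustrative sample text::

    >>> import torch
    >>> from torchtext.data.utils import get_tokenizer
    >>> from torchtext.experimental.functional import sequential_transforms, vocab_func, totensor
    >>> from torchtext.vocab import build_vocab_from_iterator
    >>> tokenizer = get_tokenizer('basic_english')
    >>> vocab = build_vocab_from_iterator([tokenizer('a sample context')], specials=['<unk>'])
    >>> vocab.set_default_index(vocab['<unk>'])
    >>> text_transform = sequential_transforms(tokenizer, vocab_func(vocab), totensor(dtype=torch.long))
    >>> text_transform('a sample context')
    tensor([1, 3, 2])
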
6 changes: 4 additions & 2 deletions torchtext/experimental/datasets/sequence_tagging.py
@@ -3,7 +3,7 @@
from torchtext.data.datasets_utils import _check_default_set
from torchtext.data.datasets_utils import _wrap_datasets
from torchtext import datasets as raw
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
from torchtext.experimental.functional import (
vocab_func,
totensor,
@@ -22,7 +22,9 @@ def build_vocab(data):
for idx, col in enumerate(line):
data_list[idx].append(col)
for it in data_list:
-vocabs.append(build_vocab_from_iterator(it, len(it)))
+vocab = build_vocab_from_iterator(it, specials=['<unk>', '<pad>'])
+vocab.set_default_index(vocab['<unk>'])
+vocabs.append(vocab)

return vocabs

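
Sequence-tagging data arrives as parallel columns (token, tag, and so on), and the loop above builds one vocab per column. A small sketch of that shape, with illustrative rows::

    >>> from torchtext.vocab import build_vocab_from_iterator
    >>> lines = [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
    >>> vocabs = []
    >>> for col in zip(*lines):  # ('The', 'dog', 'barks') and ('DT', 'NN', 'VBZ')
    ...     v = build_vocab_from_iterator([col], specials=['<unk>', '<pad>'])
    ...     v.set_default_index(v['<unk>'])
    ...     vocabs.append(v)
    >>> [len(v) for v in vocabs]
    [5, 5]
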
6 changes: 4 additions & 2 deletions torchtext/experimental/datasets/text_classification.py
@@ -1,7 +1,7 @@
import torch
import logging
from torchtext.data.utils import get_tokenizer
-from torchtext.legacy.vocab import build_vocab_from_iterator
+from torchtext.vocab import build_vocab_from_iterator
from torchtext import datasets as raw
from torchtext.data.datasets_utils import _check_default_set
from torchtext.data.datasets_utils import _wrap_datasets
@@ -19,7 +19,9 @@ def build_vocab(data, transforms):
def apply_transforms(data):
for _, line in data:
yield transforms(line)
-return build_vocab_from_iterator(apply_transforms(data), len(data))
+vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
+vocab.set_default_index(vocab['<unk>'])
+return vocab


class TextClassificationDataset(torch.utils.data.Dataset):
6 changes: 4 additions & 2 deletions torchtext/experimental/datasets/translation.py
@@ -4,7 +4,7 @@
from torchtext.data.datasets_utils import _wrap_datasets
from torchtext import datasets as raw
from torchtext.experimental.datasets import raw as experimental_raw
-from torchtext.legacy.vocab import Vocab, build_vocab_from_iterator
+from torchtext.vocab import Vocab, build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from ..functional import vocab_func, totensor, sequential_transforms

@@ -15,7 +15,9 @@ def build_vocab(data, transforms, index):
def apply_transforms(data):
for line in data:
yield transforms(line[index])
-return build_vocab_from_iterator(apply_transforms(data), len(data))
+vocab = build_vocab_from_iterator(apply_transforms(data), specials=['<unk>', '<pad>'])
+vocab.set_default_index(vocab['<unk>'])
+return vocab


def _setup_datasets(dataset_name,
12 changes: 7 additions & 5 deletions torchtext/vocab.py
@@ -258,14 +258,16 @@ def build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: O
counter = Counter()
for tokens in iterator:
counter.update(tokens)
-sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
-ordered_dict = OrderedDict(sorted_by_freq_tuples)

if specials is not None:
-for symbol in specials:
-if symbol in ordered_dict:
-del ordered_dict[symbol]
+for tok in specials:
+del counter[tok]

+sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[0])
+sorted_by_freq_tuples.sort(key=lambda x: x[1], reverse=True)
+ordered_dict = OrderedDict(sorted_by_freq_tuples)

if specials is not None:
if special_first:
specials = specials[::-1]
for symbol in specials:
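
The reworked sorting above makes the vocab order deterministic: entries are sorted alphabetically first, then stably re-sorted by frequency in descending order, so equal-frequency tokens come out in lexicographic order. A quick illustration of just that logic::

    >>> from collections import Counter, OrderedDict
    >>> counter = Counter({'pear': 2, 'fig': 5, 'apple': 2})
    >>> pairs = sorted(counter.items(), key=lambda x: x[0])
    >>> pairs.sort(key=lambda x: x[1], reverse=True)
    >>> OrderedDict(pairs)
    OrderedDict([('fig', 5), ('apple', 2), ('pear', 2)])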
