Releases: pytorch/text

v0.14.0

28 Oct 19:15
e2b27f9

Highlights

In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

  • Added CNN-DM dataset
  • Added support for RegexTokenizer
  • Added TorchArrow based examples for training RoBERTa model on SST2 classification dataset

Datasets

We increased the number of datasets in TorchText from 30 to 31 by adding the CNN-DM (paper) dataset. The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html

Tokenizers

TorchText has extended support for TorchScriptable tokenizers by adding a RegexTokenizer that enables splitting based on regular expressions. TorchScriptability support allows users to embed the RegexTokenizer natively in C++ without needing a Python runtime. As TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate regex tokenizers for deployment needs.
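
To make the idea of regex-based splitting concrete, here is a minimal pure-Python sketch. It is not the library's implementation: it assumes the tokenizer works by applying a list of (pattern, replacement) pairs in order and then splitting on whitespace, and it uses Python's re module, whose syntax may differ from the regex engine used natively. The function name regex_tokenize and the sample patterns are hypothetical.

```python
import re
from typing import List, Tuple


def regex_tokenize(line: str, patterns: List[Tuple[str, str]]) -> List[str]:
    """Apply each (pattern, replacement) pair in order, then split on whitespace."""
    for pattern, replacement in patterns:
        line = re.sub(pattern, replacement, line)
    return line.split()


# Hypothetical patterns: isolate apostrophes and commas, drop double quotes.
patterns_list = [(r"'", " ' "), (r'"', ""), (r",", " , ")]
tokens = regex_tokenize('Hello, "world" it\'s me', patterns_list)
print(tokens)
```

The order of the pairs matters, because each replacement runs on the output of the previous one.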

New Features

Transforms, Tokenizers, Ops

  • Migrate RegexTokenizer from experimental/transforms.py to transforms.py (#1763)
  • Migrate MaskTransform from internal to experimental/transforms.py (#1775)
  • Graduate MaskTransform from prototype (#1882)

Datasets

  • Add CNN-DM dataset to torchtext (#1789)
  • Resolve inconsistency in IMDB label output (#1914)
  • Cache CNNDM extraction and optimize reading in filenames (#1809)
  • Allow CNNDM to be imported from torchtext.datasets (#1884)

Improvements

Features

  • Convert TA transform module to preproc function (#1854)
  • Use TA functional for adding tokens to the beginning and end of input (#1820)
  • Add TA Tensor creation operation to the benchmark (#1836)
  • Add never_split feature to BERTTokenizer (#1898)
  • Adding benchmarks for add tokens operator (#1807)
  • Add benchmark for roberta preproc pipelines (#1684)
  • Adding Benchmark for TA ops (#1801)
  • Make BERT benchmark code more robust (#1871)
  • Define TORCHTEXT_API macro for visibility control (#1806)
  • Modify get_local_asset_path to take overwrite option and use it in BERTTokenizer (#1839)

Testing

  • Add test to compare encoder inference on input with and without padding (#1770)
  • Add m1 tagged build for TorchText (#1776)
  • Refactor TorchText version handling and adding first version of M1 builds (#1773)
  • Fix test execution in torchtext (#1889)
  • Add torchdata to testing requirements in requirements.txt (#1874)
  • Add missing None type hint to tests (#1868)
  • Create pytest fixture to auto delete model checkpoints within integration tests (#1886)
  • Disable test_vocab_from_raw_text_file on Linux (#1901)

Examples

  • Add libtorchtext cpp example (#1817)
  • Torcharrow based training using RoBERTa model and SST2 classification dataset (#1808)

Documentation

  • Add Datasets contribution guidelines (#1798)
  • Correct typo in SST-2 tutorial (#1865)
  • Update doc theme to the latest (#1899)
  • Tutorial on using T5 model for text summarization (#1864)
  • Fix docstring type (#1867)

Bug fixes

  • Fixing incorrect inputs to add eos and bos operators (#1810)
  • Add missing type hints (#1782)
  • Fix typo in nightly branch ref (#1783)
  • Sharing -> sharding (#1787)
  • Remove padding mask for input embeddings (#1799)
  • Fixed on_disk_cache issues (#1957)
  • Fix Multi30k dataset urls (#1816)
  • Add missing CMake file in tokenizer dir (#1908)
  • Fix OBO error for vocab files with empty lines (#1841)
  • Fixing build when CUDA enabled torch is installed (#1814)
  • Make comment paths dynamic (#1894)
  • Turn off mask checking for torchtext which is known to have a legal mask (#1906)
  • Fix push on release reference name (#1792)

Dependencies

  • Remove future dep from windows (#1838)
  • Remove dependency on the torch::jit::script::Module for mobile builds (#1885)
  • Add Torchdata as a requirement and remove conditional imports of Torchdata (#1962)
  • Remove sphinx_rtd_theme from requirements.txt (#1837)
  • Fix Sphinx-gallery display and pin sphinx-related packages (#1907)

Others

  • Resolve and remove TODO comments (#1912)
  • Refactor TorchText version handling and adding first version of M1 builds (#1773)
  • Update xcode version to 14.0 in CI (#1881)
  • CI: Use self hosted runners for build (#1851)
  • Move Spacy from Pip dependencies to Conda dependencies (#1890)
  • Update compatibility matrix for 0.13 release (#1802)
  • Update CircleCI Xcode image (#1818)
  • Avoid looping through the whole counter in bleu_score method (#1913)
  • Rename build_tools dir to tools dir (#1804)
  • Use setup-miniconda action for m1 build (#1897)
  • Making sure we build correctly against release branch (#1790)
  • Adding the conda builds for m1 (#1794)
  • Automatically initialize submodule (#1805)
  • Set MACOSX_DEPLOYMENT_TARGET=10.9 for binary job (#1835)

v0.13.1

05 Aug 22:18
330201f

This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements, and documentation updates. No new features are added.

Bug Fix

  • #1814 Fixing build when CUDA enabled torch is installed

For the full feature list of v0.13, please refer to the v0.13.0 release notes.

v0.13.0

28 Jun 16:47
35298c4

Highlights

In this release, we enriched our library with additional datasets and tokenizers while making improvements to our existing build system, documentation, and components.

  • Added all 9 GLUE benchmark’s datasets (#1710): CoLA, MRPC, QQP, STS-B, SST-2, MNLI, QNLI, RTE, WNLI
  • Added support for BERTTokenizer
  • Created native C++ binaries using a CMake based build system (#1644)

Datasets

We increased the number of datasets in TorchText from 22 to 30 by adding the remaining 8 datasets from the GLUE benchmark (SST-2 was already supported). The complete list of GLUE datasets is as follows:

  • CoLA (paper): Single sentence binary classification acceptability task
  • SST-2 (paper): Single sentence binary classification sentiment task
  • MRPC (paper): Dual sentence binary classification paraphrase task
  • QQP: Dual sentence binary classification paraphrase task
  • STS-B (paper): Single sentence to float regression sentence similarity task
  • MNLI (paper): Sentence ternary classification NLI task
  • QNLI (paper): Sentence binary classification QA and NLI tasks
  • RTE (paper): Dual sentence binary classification NLI task
  • WNLI (paper): Dual sentence binary classification coreference and NLI tasks

The datasets supported by TorchText use datapipes from the TorchData project, which is still in Beta status. This means that the datapipes API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata. For more details, refer to https://pytorch.org/text/stable/datasets.html

Tokenizers

TorchText has extended support for TorchScriptable tokenizers by adding the WordPiece tokenizer used in BERT. It is one of the most commonly used algorithms for splitting input text into sub-word units and was introduced in Japanese and Korean Voice Search (Schuster et al., 2012).

TorchScriptability support allows users to embed the BERT text-preprocessing natively in C++ without needing a Python runtime. As TorchText now supports the CMake build system to natively link TorchText binaries with application code, users can easily integrate BERT tokenizers for deployment needs.

For usage details, please refer to the corresponding documentation.
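
For readers unfamiliar with WordPiece, the following is a minimal sketch of the greedy longest-match-first splitting commonly associated with BERT. It is an illustration only, not the TorchText implementation: the function name, the toy vocabulary, the "##" continuation prefix and the "[UNK]" token are assumptions, and a real BERT tokenizer also performs basic pre-tokenization (whitespace and punctuation handling) before this step.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str],
                       unk: str = "[UNK]", prefix: str = "##") -> List[str]:
    """Split one word into the longest sub-word pieces found in vocab."""
    tokens: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # mark a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens


# Toy vocabulary, for illustration only.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("playing", toy_vocab))
```

The greedy strategy always prefers the longest matching piece at each position, which keeps frequent words intact while still covering rare words with smaller units.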

CMake Build System

TorchText has migrated its build system for C++ extension and third party libraries to use CMake rather than PyTorch’s CppExtension module. This allows end-users to integrate TorchText C++ binaries in their applications without having a dependency on libpython thus allowing them to use TorchText operators in a non-Python environment.

Refer to the GitHub issue for more details.

Backward Incompatible Changes

The RobertaModelBundle introduced in 0.12 release, which gets pre-trained RoBERTa/XLM-R models and builds custom models with similar architecture, has been renamed to RobertaBundle (#1653).

The default caching location (cache_dir) has been changed from os.path.expanduser("~/.TorchText/cache") to os.path.expanduser("~/.cache/torch/text"). Furthermore, the default root directory of datasets is now cache_dir/datasets (#1740). Users can now control the default cache location via the TORCH_HOME environment variable (#1741).
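
The sketch below shows how the new locations fit together, assuming the cache directory is the "text" subfolder of the torch home directory. The helper name sketch_cache_dir is hypothetical, and the real utility may consult additional environment variables that are ignored here.

```python
import os
from typing import Mapping, Optional, Tuple


def sketch_cache_dir(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (cache_dir, datasets_root) under the assumptions stated above."""
    env = os.environ if env is None else env
    # TORCH_HOME overrides the default ~/.cache/torch location.
    torch_home = os.path.expanduser(
        env.get("TORCH_HOME", os.path.join("~", ".cache", "torch"))
    )
    cache_dir = os.path.join(torch_home, "text")
    return cache_dir, os.path.join(cache_dir, "datasets")


print(sketch_cache_dir({}))                           # default location
print(sketch_cache_dir({"TORCH_HOME": "/data/torch"}))  # overridden location
```

Code that hard-coded the old ~/.TorchText/cache path will not find previously downloaded assets at the new location.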

New Features

Models

  • [fbsync] BetterTransformer support for TorchText (#1690) (#1694)
  • [fbsync] Killed to_better by having native load_from_state_dict and init (#1695)
  • [fbsync] Removed unneeded modules after using nn.Module for BetterTransformer (#1696)
  • [fbsync] Replaced TransformerEncoder in TorchText with better transformer (#1703)

Transforms, Tokenizers, Ops

  • Added pad transform, string to int transform (#1683)
  • Added support for Scriptable BERT tokenizer (#1707)
  • Added support for batch input in BERT Tokenizer with perf benchmark (#1745)

Datasets

Support for GLUE benchmark’s datasets added:

Others

  • Prepared datasets for new encoding kwarg. (#1616)
  • Added Shuffle and sharding datapipes to datasets (#1729)
  • For Datasets, refactored local functions to be global so that they can be pickled (#1726)
  • Updated TorchData DataPipe API usages (#1663)
  • Replaced lambda functions with regular functions in all datasets (#1718)

CMake Build System

  • [CMake 1/3] Updated C++ includes to use imports relative to root directory (#1666)
  • [CMake 2/3] Added CMake Build to TorchText to create single _TorchText library (#1673)
  • [CMake 3/3] Split source files with Python dependency to separate library (#1660)

Improvements

Features

  • [BC-breaking] Renamed Roberta Bundle (#1635)
  • Modified CLIPTokenizer to either infer number of merges from encoder json or take it in constructor (#1622)
  • Provided option to return split tokens (#1698)
  • Updated dataset code to avoid creating multiple iterators from a DataPipe (#1708)

Testing

  • Added unicode generation to IWSLT tests (followup to #1608) (#1642)
  • Added MacOS unit tests on CircleCI (#1672)
  • Added parameterized dataset pickling tests (#1732)
  • Added test to compare encoder inference on input with and without padding (#1770)
  • Added test for shuffle before shard (#1738)
  • Added more test coverage (#1653)
  • Enabled model testing in FBCode (#1720)
  • Fixed Windows builds with Python 3.10 by getting rid of ssize_t (#1627)
  • Built and tested py3.10 (#1625)
  • Making sure we build correctly against release branch (#1790)
  • Removed caching artifacts for datasets and fix it for vectors (#1674)
  • Installed torchdata from nightly release in CI (#1664)
  • Added m1 tagged build for TorchText (#1776)
  • Refactored TorchText version handling and adding first version of M1 builds (#1773)
  • Removed MACOSX_DEPLOYMENT_TARGET (#1728)

Examples

  • Added data pipelines for Roberta pre-processing (#1637)
  • Updated sst2 tutorial to replace lambda usage (#1722)

Documentation

  • Removed _add_docstring_header decorator from amazon review polarity (#1611)
  • Added missing quotation marks to CLIPTokenizer docs (#1610)
  • Updated README around installing LTS version (#1665)
  • Added contributing guidelines for third party and custom C++ operators (#1742)
  • Added recommendations regarding use of datapipes for multi-processing, shuffling, DDP, etc. (#1755)
  • Fixed roberta bundle example doc (#1648)
  • Updated doc conf (#1634)
  • Removed install instructions (#1641)
  • Updated README (#1652)
  • Updated requirements (#1675)
  • Fixed typo sharing -> sharding (#1787)
  • Fixed docs build (#1730)
  • Replaced git+git with git+https in requirements.txt (#1658)
  • Added header info for BERT tokenizer (#1754)
  • Fixed docstring for Tokenizers (#1739)
  • Fixed doc js initialization (#1736)
  • Added missing type hints (#1782)
  • Fixed SentencePiece Tokenizer doc-string (#1706)

Bug fixes

  • Fixed missed mask arg in TorchText transformer (#1758)
  • Fixed bug in RTE and WNLI testing (#1759)
  • Fixed bug in QNLI dataset and corresponding test (#1760)
  • Fixed STSB and WikiTexts tests (#1737)
  • Fixed smoke tests for linux (#1687)
  • Removed redundant dataname in test_shuffle_shard_wrapper (#1733)
  • Fixed non-deterministic test failures for IWSLT (#1699)
  • Fixed typo in nightly branch ref (#1783)
  • Fixed windows utils test (#1761)
  • Fixed test utils (#1757)
  • Fixed pad transform test (#1688)
  • Resolved issues in #1653 + sanitize test names generated by nested_params (#1667)
  • Fixed mock tests due to change in datasets directory (#1749)
  • Deleted prints in test_qqp.py (#1734)
  • Fixed logger issue (#1656)

Others

  • Pinned Jinja2 version to fix broken doc build (#1669)
  • Fixed formatting for all files using pre-commit (#1670)
  • Pinned setuptools to 58.0.4 on Windows (#1746)
  • Added post install script for pywin32 (#1748)
  • Pinned Utf8proc version (#1771)
  • Removed models from experimental (#1643)
  • Cleaned examples folder (#1647)
  • Cleaned stale code (#1654)
  • Took TORCH_HOME env variable into account while setting the cache dir (#1741)
  • Updated download hooks and datasets to import HttpReader and GDriveReader from download hooks (#1657)
  • Added Model benchmark (#1697)
  • Changed root directory for datasets (#1740)
  • Used _get_torch_home standard utility from torch hub (#1752)
  • Removed ticks (``) from the url under is_module_available (#1753)
  • Prepared repo for auto-formatters (#1546)
  • Fixed flake8 issues introduced from adding auto formatter (#1617)

v0.12.0

10 Mar 18:31
d7a34d6

Highlights

In this release, we have revamped the library to provide a more comprehensive experience for users to do NLP modeling using TorchText and PyTorch.

  • Migrated datasets to build on top of TorchData DataPipes
  • Added support for RoBERTa and XLM-RoBERTa pre-trained models
  • Added support for Scriptable tokenizers
  • Added support for composable transforms and functionals

Datasets

TorchText has modernized its datasets by migrating from older-style Iterable Datasets to TorchData’s DataPipes. TorchData is a library that provides modular/composable primitives, allowing users to load and transform data in performant data pipelines. These DataPipes work out-of-the-box with PyTorch DataLoader and would enable new functionalities like auto-sharding. Users can now easily do data manipulation and pre-processing using user-defined functions and transformations in a functional style programming. Datasets backed by DataPipes also enable standard flow-control like batching, collation, shuffling and bucketizing. Collectively, DataPipes provides a comprehensive experience for data preprocessing and tensorization needs in a pythonic and flexible way for model training.

from functools import partial
import torchtext.functional as F
import torchtext.transforms as T
from torch.hub import load_state_dict_from_url
from torch.utils.data import DataLoader
from torchtext.datasets import SST2

# Tokenizer to split input text into tokens
encoder_json_path = "https://download.pytorch.org/models/text/gpt2_bpe_encoder.json"
vocab_bpe_path = "https://download.pytorch.org/models/text/gpt2_bpe_vocab.bpe"
tokenizer = T.GPT2BPETokenizer(encoder_json_path, vocab_bpe_path)
# vocabulary converting tokens to IDs
vocab_path = "https://download.pytorch.org/models/text/roberta.vocab.pt"
vocab = T.VocabTransform(load_state_dict_from_url(vocab_path))
# Add BOS token to the beginning of sentence
add_bos = T.AddToken(token=0, begin=True)
# Add EOS token to the end of sentence
add_eos = T.AddToken(token=2, begin=False)

# Create SST2 dataset datapipe and apply pre-processing
batch_size = 32
train_dp = SST2(split="train")
train_dp = train_dp.batch(batch_size).rows2columnar(["text", "label"])
train_dp = train_dp.map(tokenizer, input_col="text", output_col="tokens")
train_dp = train_dp.map(partial(F.truncate, max_seq_len=254), input_col="tokens")
train_dp = train_dp.map(vocab, input_col="tokens")
train_dp = train_dp.map(add_bos, input_col="tokens")
train_dp = train_dp.map(add_eos, input_col="tokens")
train_dp = train_dp.map(partial(F.to_tensor, padding_value=1), input_col="tokens")
train_dp = train_dp.map(F.to_tensor, input_col="label")
# create DataLoader
dl = DataLoader(train_dp, batch_size=None)
batch = next(iter(dl))
model_input = batch["tokens"]
target = batch["label"]

TorchData is required in order to use these datasets. Please install it by following the instructions at https://github.com/pytorch/data

Models

We have added support for pre-trained RoBERTa and XLM-R models. The models are torchscriptable and hence can be employed for production use-cases. The modeling APIs let users attach custom task-specific heads with pre-trained encoders. The API also comes equipped with data pre-processing transforms to match the pre-trained weights and model configuration.

import torch, torchtext
from torchtext.functional import to_tensor
xlmr_base = torchtext.models.XLMR_BASE_ENCODER
model = xlmr_base.get_model()
transform = xlmr_base.transform()
input_batch = ["Hello world", "How are you!"]
model_input = to_tensor(transform(input_batch), padding_value=1)
output = model(model_input)
output.shape  # torch.Size([2, 6, 768])

# add classification head
import torch.nn as nn
class ClassificationHead(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.output_layer = nn.Linear(input_dim, num_classes)

    def forward(self, features):
        #get features from cls token
        x = features[:, 0, :]
        return self.output_layer(x)

binary_classifier = xlmr_base.get_model(head=ClassificationHead(input_dim=768, num_classes=2)) 
output = binary_classifier(model_input)
output.shape  # torch.Size([2, 2])

Transforms and tokenizers

We have revamped our transforms to provide composable building blocks to do text pre-processing. They support both batched and non-batched inputs. Furthermore, we have added support for a number of commonly used tokenizers including SentencePiece, GPT-2 BPE and CLIP.

import torchtext.transforms as T
from torch.hub import load_state_dict_from_url

padding_idx = 1
bos_idx = 0
eos_idx = 2
max_seq_len = 256
xlmr_vocab_path = r"https://download.pytorch.org/models/text/xlmr.vocab.pt"
xlmr_spm_model_path = r"https://download.pytorch.org/models/text/xlmr.sentencepiece.bpe.model"

text_transform = T.Sequential(
    T.SentencePieceTokenizer(xlmr_spm_model_path),
    T.VocabTransform(load_state_dict_from_url(xlmr_vocab_path)),
    T.Truncate(max_seq_len - 2),
    T.AddToken(token=bos_idx, begin=True),
    T.AddToken(token=eos_idx, begin=False),
)

text_transform(["Hello World", "How are you"])
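
To show what the truncate and add-token steps of such a pipeline do to a sequence of token IDs, here is a pure-Python sketch that needs neither TorchText nor the downloaded assets. The helper names are hypothetical and the index values simply mirror the constants used above; this is an analogue for illustration, not the library code.

```python
from typing import List


def truncate(ids: List[int], max_len: int) -> List[int]:
    """Keep at most max_len leading IDs."""
    return ids[:max_len]


def add_token(ids: List[int], token: int, begin: bool) -> List[int]:
    """Prepend or append a single special token ID."""
    return [token] + ids if begin else ids + [token]


def preprocess(ids: List[int], max_seq_len: int = 256,
               bos_idx: int = 0, eos_idx: int = 2) -> List[int]:
    # Reserve two positions so BOS and EOS still fit within max_seq_len.
    ids = truncate(ids, max_seq_len - 2)
    ids = add_token(ids, bos_idx, begin=True)
    return add_token(ids, eos_idx, begin=False)


def pad_batch(batch: List[List[int]], padding_idx: int = 1) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(seq) for seq in batch)
    return [seq + [padding_idx] * (width - len(seq)) for seq in batch]


long_input = list(range(10, 310))  # 300 hypothetical token IDs
processed = preprocess(long_input)
print(len(processed), processed[0], processed[-1])
print(pad_batch([[0, 5, 2], [0, 2]]))
```

Truncating to max_seq_len - 2 before adding the special tokens is what keeps the final sequence within the model's maximum length.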

Tutorial

We have added an end-to-end tutorial that performs SST-2 binary text classification with the pre-trained XLM-R base architecture and demonstrates the usage of the new datasets, transforms, and models.

Backward Incompatible changes

We have removed the legacy folder in this release which provided access to legacy datasets and abstractions. For additional information, please refer to the corresponding github issue (#1422) and PR (#1437)

New Features

Models

  • Add XLMR Base and Large pre-trained models and corresponding transformations (#1407)
  • Added option to specify whether to load pre-trained weights (#1424)
  • Added Option for freezing encoder weights (#1428)
  • Enable optional return of all states in transformer encoder (#1430)
  • Added support for RobertaModel to accept model configuration (#1431)
  • Allow inferred scaling in MultiheadSelfAttention for head_dim != 64 (#1432)
  • Added attention mask to transformer encoder modules (#1435)
  • Added builder method in Model Bundler to facilitate model creation with user-defined configuration and checkpoint (#1442)
  • Cleaned up Model API (#1452)
  • Fixed bool attention mask in transformer encoder (#1454)
  • Removed xlmr transform class and instead used sequential for model transforms composition (#1482)
  • Added support for pre-trained Roberta encoder for base and large architecture (#1491)

Transforms, Tokenizers, Ops

  • Added ToTensor and LabelToIndex transformations (#1415)
  • Added Truncate Transform (#1458)
  • Updated input annotation type to Any to support torch-scriptability during transform composability (#1453)
  • Added AddToken transform (#1463)
  • Added GPT-2 BPE pre-tokenizer operator leveraging re2 regex library (#1459)
  • Added Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models (#1462)
  • Migrated GPT-2 BPE tokenizer logic to C++ (#1469)
  • Fixed optionality of default arg in to_tensor (#1475)
  • Added scriptable sequential transform (#1481)
  • Removed optionality of dtype in ToTensor (#1492)
  • Fixed max sequence length for xlmr transform (#1495)
  • Added max_tokens kwarg to vocab factory (#1525)
  • Refactor vocab factory method to accept special tokens as a keyword argument (#1436)
  • Implemented ClipTokenizer that builds on top of GPT2BPETokenizer (#1541)

Datasets

Migration of datasets on top of datapipes

Newly added datasets

Misc

  • Fix split filter logic in AmazonReviewPolarity (#1505)
  • Used os.path.join for consistency (#1506)
  • Fixing dataset test failures due to incorrect caching mode in AG_NEWS (#1517)
  • Added caching for extraction datapipe for AmazonReviewPolarity (#1527)
  • Added caching for extraction datapipe for Yahoo (#1528)
  • Added caching for extraction datapipe for yelp full (#1529)
  • Added caching for extraction datapipe for yelp polarity (#1530)
  • Added caching for extraction datapipe for DBPedia (#1571)
  • Added caching for extraction datapipe for SogouNews and AmazonReviewFull (#1594)
  • Fixed issues with extraction caching (#1550, #1551, #1552)
  • Updating Conll2000Chunking dataset to be consistent with other datasets (#1590)
  • [BC-breaking] removed unnecessary split argument from datasets (#1591)

Improvements

Testing

Revamp TorchText dataset testing to use mocked data

Others

  • Fixed attention mask testing (#1439)
  • Fixed CircleCI download failures on windows for XLM-R unit tests (#1441)
  • Added unit tests for testing model training (#1449)
  • Parameterized XLMR and Roberta mo...

Minor release

27 Jan 22:34
92f4d15

This is a minor release compatible with PyTorch 1.10.2.

There is no feature change in torchtext from 0.11.1. For the full feature list of v0.11.1, please refer to the v0.11.1 release notes.

v0.11.0

21 Oct 17:20

torchtext 0.11.0 Release Notes

This is a relatively lightweight release while we are working on revamping the library. Users are encouraged to check various developments on the main branch.

Improvements

  • Refactored C++ codebase to fix clang-tidy warnings and using emplace_back for improving performance (#1327)
  • Updated sentencepiece to v0.1.95 to make it compilable on M1 (#1336)
  • Up the priority of numpy array comparison in self.assertEqual (#1341)
  • Removed mentions of conda-forge as it is no longer necessary to build on python 3.9 (#1345)
  • Separated experimental tests to help remove them easily during release cycles (#1348)
  • Split the pybind and torchbind registration into a separate file and refactored Vocab modules to allow vocab to be used in a pure C++ environment (#1352)
  • Changed the default root directory for downloaded datasets to avoid dirtying the working directory (#1361)
  • Added method for logging module usage in fbcode (#1367)
  • Updated bug report file (#1377)
  • Renamed default branch to main (#1378)
  • Enabled torchtext extension to work seamlessly between fbcode and open-source (#1382)
  • Migrated CircleCI docker image (#1393)

Docs

  • Fix tag build so that adding a tag will trigger a documentation build-and-upload (#1332)
  • Minor doc-string fix in Multi30K dataset (#1351)
  • Fixed example in doc-string of get_vec_by_tokens (#1383)
  • Updated docs to point to main instead of deprecated master branch (#1387)
  • Changed various README.md links to point to main instead of master branch (#1392)

Bug fix

  • Fixed benchmark code that compares performance of vocab (#1339)
  • Fixed text classification example broken due to removal of experimental datasets (#1347)
  • Fixed issue in IMDB dataset that resulted in all samples being positive depending on directory path (#1354)
  • Fixed doc building (#1365)

Minor bugfix release

27 Sep 04:44
0d670e0

This release depends on pytorch 1.9.1
No functional changes other than minor updates to CI rules.

torchtext 0.10.0 Release Notes

15 Jun 16:03
4da1de3

Highlights

In this release, we introduce a new Vocab module that replaces the current Vocab class. The new Vocab provides common functional APIs for NLP workflows. This module is backed by an efficient C++ implementation that reduces look-up time by up to ~85% for batch look-up (refer to the summaries of #1248 and #1290 for further information on benchmarks), and provides support for TorchScript. We provide accompanying factory functions that can be used to build the Vocab object either through a Python ordered dictionary or an Iterator that yields lists of tokens.

creating Vocab from text file

import io
from torchtext.vocab import build_vocab_from_iterator
# generator that yields lists of tokens
def yield_tokens(file_path):
    with io.open(file_path, encoding="utf-8") as f:
        for line in f:
            yield line.strip().split()
# get Vocab object (file_path is the path to a plain-text file)
vocab_obj = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])

creating Vocab through ordered dict

from torchtext.vocab import vocab
from collections import Counter, OrderedDict
counter = Counter(["a", "a", "b", "b", "b"])
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)
vocab_obj = vocab(ordered_dict)

common API usage

# look-up index
vocab_obj["a"]

# batch look-up indices
vocab_obj.lookup_indices(["a","b"])
# support forward API of PyTorch nn Modules
vocab_obj(["a","b"])

# batch look-up tokens
vocab_obj.lookup_tokens([0,1])

# set default index to return when token not found 
vocab_obj.set_default_index(0)
vocab_obj["out_of_vocabulary"]  # returns 0
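
The look-up semantics above can be mimicked with a small pure-Python class, shown here as a sketch only. ToyVocab is a hypothetical stand-in, not the C++-backed module, and raising RuntimeError for an unknown token when no default index is set is an assumption about the real behaviour.

```python
from collections import OrderedDict
from typing import List, Optional


class ToyVocab:
    """Illustrative token <-> index mapping with an optional default index."""

    def __init__(self, ordered_dict: "OrderedDict[str, int]") -> None:
        # Indices follow the insertion order of the ordered dictionary.
        self.itos: List[str] = list(ordered_dict.keys())
        self.stoi = {token: i for i, token in enumerate(self.itos)}
        self.default_index: Optional[int] = None

    def set_default_index(self, index: int) -> None:
        self.default_index = index

    def __getitem__(self, token: str) -> int:
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            raise RuntimeError(f"Token {token!r} not found and no default index set")
        return self.default_index

    def lookup_indices(self, tokens: List[str]) -> List[int]:
        return [self[token] for token in tokens]

    def lookup_tokens(self, indices: List[int]) -> List[str]:
        return [self.itos[i] for i in indices]

    def __call__(self, tokens: List[str]) -> List[int]:
        return self.lookup_indices(tokens)


toy = ToyVocab(OrderedDict([("b", 3), ("a", 2)]))
print(toy["b"], toy(["a", "b"]), toy.lookup_tokens([0, 1]))
toy.set_default_index(0)
print(toy["out_of_vocabulary"])
```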

Backward Incompatible changes

  • We have retired the old Vocab class into the legacy folder (#1289). Users relying on this class should be able to access it from torchtext.legacy. The Vocab module that replaces this class is not backward compatible. The most notable difference is that the Vectors object is not an attribute of the new Vocab object. We recommend that users use the build_vocab_from_iterator factory function to construct the new Vocab module, which provides initialization capabilities similar to those of the retired Vocab class.
# retired Vocab class 
from torchtext.legacy.vocab import Vocab as retired_vocab
from collections import Counter
tokens_list = ["a", "a", "b", "b", "b"]
counter = Counter(tokens_list)
vocab_obj = retired_vocab(counter, specials=["<unk>","<pad>"], specials_first=True)

# new Vocab Module
from torchtext.vocab import build_vocab_from_iterator
vocab_obj = build_vocab_from_iterator([tokens_list], specials=["<unk>","<pad>"], specials_first=True)
  • Removed legacy batch from the torchtext.data package (#1307), which was kept around for backward compatibility reasons. Users can still access batch from the torchtext.legacy.data package.

New Features

  • Introduced functional to convert Iterable-style to map-style datasets (#1299)
from torchtext.datasets import IMDB
from torchtext.data import to_map_style_dataset
train_iter = IMDB(split='train')
#convert iterator to map-style dataset
train_dataset = to_map_style_dataset(train_iter)
  • Introduced functional to filter raw wikipedia XML dumps (#1292)
from torchtext.data.functional import filter_wikipedia_xml
from torchtext.datasets import EnWik9
data_iter = EnWik9(split='train')
# filter data according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
filter_data_iter = filter_wikipedia_xml(data_iter)
# Added datasets for http://www.statmt.org/wmt16/multimodal-task.html#task1
from torchtext.datasets import Multi30k
train_data, valid_data, test_data = Multi30k()
next(train_data)
# prints following 
#('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n',
# 'Two young, White males are outside near many bushes.\n')
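
The conversion from an iterable-style to a map-style dataset mentioned above can be pictured as materializing the iterator into a list that supports len() and indexing. The class below is a minimal sketch of that idea, not the library's code; MapStyleSketch and fake_reviews are hypothetical names, and the sample data is made up.

```python
from typing import Iterable, Iterator, List, Tuple


class MapStyleSketch:
    """Materialize an iterable once so items can be indexed and counted."""

    def __init__(self, iterable: Iterable) -> None:
        self._data: List = list(iterable)  # consumes the iterator eagerly

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int):
        return self._data[idx]


def fake_reviews() -> Iterator[Tuple[str, str]]:
    """Hypothetical stand-in for a raw dataset iterator of (label, text)."""
    yield ("pos", "a delightful film")
    yield ("neg", "two hours I will never get back")


dataset = MapStyleSketch(fake_reviews())
print(len(dataset), dataset[1])
```

Because the whole iterator is read into memory, this approach trades memory for random access, which matters for large corpora.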

Improvements

  • Separated experimental and legacy tests into separate subfolders (#1285)
  • Stored md5 hash instead of raw text data for in-built datasets testing (#1261)
  • Cleaned up CircleCI cache handling and optimization of daily cache (#1236, #1238)
  • Fixed CircleCI caching issue when new dataset is added (#1314)
  • Organized datasets by names in root folder and moved common file reading functions into dataset_utils (#1233)
  • Added unit-test to verify raw datasets name property (#1234)
  • Fixed jinja2 environment autoescape to enable select extensions (#1277)
  • Added yaml.safe_load instead of yaml.load (#1278)
  • Added defusedxml to parse untrusted XML data (#1279)
  • Added CodeQL and Bandit security checks as GitHub Actions (#1266)
  • Added benchmark code to compare Vocab module with python dict for batch look-up time (#1290)

Documentation

  • Fixing doc for nn modules (#1267)
  • Store artifacts of rendered docs so that rendered docs can be checked on each PR (#1288)
  • Add Google Analytics support (#1287)

Bug Fix

  • Fixed import issue in text classification example (#1256)
  • Fixed and re-organized data pipeline example (#1250)

Performance

  • used c10::string_view and fast-text dictionary inside C++ kernel of Vocab module (#1248)

Torchtext 0.9.1 release note

25 Mar 17:19
4de31fc

Highlights

This is a minor release following pytorch 1.8.1. Please refer to torchtext 0.9.0 release note for more details.

Torchtext 0.9.0 release note

04 Mar 20:44

Highlights

In this release, we’re updating torchtext’s datasets to be compatible with the PyTorch DataLoader, and deprecating torchtext’s own DataLoading abstractions. We have published a full review of the legacy code and the new datasets in pytorch/text #664. These new datasets are simple string-by-string iterators over the data, rather than the previously custom set of abstractions such as Field. The legacy Datasets and abstractions have been moved into a new legacy folder to ease the migration, and will remain there for two more releases. For guidance about migrating from the legacy abstractions to use modern PyTorch data utilities, please refer to our migration guide (link).

The following raw text datasets are available as the replacement of the legacy datasets. Those datasets are iterators which yield the raw text data line-by-line. To apply those datasets in the NLP workflows, please refer to the end-to-end tutorial for the text classification task (link).

  • Language modeling: WikiText2, WikiText103, PennTreebank, EnWik9
  • Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • Sequence tagging: UDPOS, CoNLL2000Chunking
  • Translation: IWSLT2016, IWSLT2017
  • Question answer: SQuAD1, SQuAD2
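
As a minimal sketch of how such a line-by-line iterator might be consumed, the example below uses a hypothetical stand-in generator, fake_ag_news, that yields (label, text) pairs in place of a real dataset such as torchtext.datasets.AG_NEWS(split='train'), so it runs without downloading anything. The whitespace tokenizer, the build_vocab and batches helpers, and the sample sentences are illustrative assumptions and are not part of torchtext.

```python
from collections import Counter
from itertools import islice


def fake_ag_news():
    """Hypothetical stand-in for a raw dataset iterator (not a torchtext API)."""
    samples = [
        (3, "Wall St. bears claw back into the black"),
        (4, "New chip doubles battery life"),
        (3, "Oil prices climb again"),
    ]
    for label, text in samples:
        yield label, text


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_vocab(pairs, specials=("<unk>",)):
    # Count tokens across all lines, then assign ids by descending frequency.
    counter = Counter(tok for _, text in pairs for tok in tokenize(text))
    itos = list(specials) + [tok for tok, _ in counter.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}


def batches(pairs, vocab, batch_size):
    # Group the stream into lists of (label, token_ids) without loading it all.
    it = iter(pairs)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield [
            (label, [vocab.get(t, vocab["<unk>"]) for t in tokenize(text)])
            for label, text in chunk
        ]


# The iterator is single-pass, so a fresh one is created for each traversal.
vocab = build_vocab(fake_ag_news())
all_batches = list(batches(fake_ag_news(), vocab, batch_size=2))
```

With the real datasets, the hand-written batching loop would typically be replaced by a torch.utils.data.DataLoader with a collate_fn that performs the tokenization and numericalization.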

This release adds Python 3.9 support.

Backwards Incompatible

Current users of the legacy code will experience BC breakage because the legacy code has been retired (#1172, #1181, #1183). The legacy components have been moved to the torchtext.legacy.data folder as follows:

  • torchtext.data.Pipeline -> torchtext.legacy.data.Pipeline
  • torchtext.data.Batch -> torchtext.legacy.data.Batch
  • torchtext.data.Example -> torchtext.legacy.data.Example
  • torchtext.data.Field -> torchtext.legacy.data.Field
  • torchtext.data.Iterator -> torchtext.legacy.data.Iterator
  • torchtext.data.Dataset -> torchtext.legacy.data.Dataset

This means that all features are still available, but within torchtext.legacy instead of torchtext.
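
The renaming above can be sketched as a small path-rewriting helper, which may be useful when auditing a codebase for imports that need updating. The helper name to_legacy_path and the MOVED set are hypothetical; only the six names listed above are taken from this release note.

```python
# Names listed in this release note as moved from torchtext.data
# to torchtext.legacy.data.
MOVED = {"Pipeline", "Batch", "Example", "Field", "Iterator", "Dataset"}


def to_legacy_path(dotted_name):
    """Return the 0.9.0 location for a pre-0.9.0 dotted name (hypothetical helper)."""
    prefix = "torchtext.data."
    if dotted_name.startswith(prefix) and dotted_name[len(prefix):] in MOVED:
        return "torchtext.legacy.data." + dotted_name[len(prefix):]
    # Anything not in the list above is returned unchanged.
    return dotted_name


print(to_legacy_path("torchtext.data.Field"))
```

In most projects the change amounts to a one-line import update, for example replacing an import of Field from torchtext.data with an import from torchtext.legacy.data.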

Table 1: Summary of the legacy datasets and the replacements in 0.9.0 release

Category | Legacy | 0.9.0 release
Language Modeling | torchtext.legacy.datasets.WikiText2 | torchtext.datasets.WikiText2
Language Modeling | torchtext.legacy.datasets.WikiText103 | torchtext.datasets.WikiText103
Language Modeling | torchtext.legacy.datasets.PennTreebank | torchtext.datasets.PennTreebank
Language Modeling | torchtext.legacy.datasets.EnWik9 | torchtext.datasets.EnWik9
Text Classification | torchtext.legacy.datasets.AG_NEWS | torchtext.datasets.AG_NEWS
Text Classification | torchtext.legacy.datasets.SogouNews | torchtext.datasets.SogouNews
Text Classification | torchtext.legacy.datasets.DBpedia | torchtext.datasets.DBpedia
Text Classification | torchtext.legacy.datasets.YelpReviewPolarity | torchtext.datasets.YelpReviewPolarity
Text Classification | torchtext.legacy.datasets.YelpReviewFull | torchtext.datasets.YelpReviewFull
Text Classification | torchtext.legacy.datasets.YahooAnswers | torchtext.datasets.YahooAnswers
Text Classification | torchtext.legacy.datasets.AmazonReviewPolarity | torchtext.datasets.AmazonReviewPolarity
Text Classification | torchtext.legacy.datasets.AmazonReviewFull | torchtext.datasets.AmazonReviewFull
Text Classification | torchtext.legacy.datasets.IMDB | torchtext.datasets.IMDB
Text Classification | torchtext.legacy.datasets.SST | deferred
Text Classification | torchtext.legacy.datasets.TREC | deferred
Sequence Tagging | torchtext.legacy.datasets.UDPOS | torchtext.datasets.UDPOS
Sequence Tagging | torchtext.legacy.datasets.CoNLL2000Chunking | torchtext.datasets.CoNLL2000Chunking
Translation | torchtext.legacy.datasets.WMT14 | deferred
Translation | torchtext.legacy.datasets.Multi30k | deferred
Translation | torchtext.legacy.datasets.IWSLT | torchtext.datasets.IWSLT2016, torchtext.datasets.IWSLT2017
Natural Language Inference | torchtext.legacy.datasets.XNLI | deferred
Natural Language Inference | torchtext.legacy.datasets.SNLI | deferred
Natural Language Inference | torchtext.legacy.datasets.MultiNLI | deferred
Question Answer | torchtext.legacy.datasets.BABI20 | deferred

Improvements

  • Enable importing metrics/utils/functional from torchtext.legacy.data (#1229)
  • Set up daily caching mechanism with Master job (#1219)
  • Reset the functions in datasets_utils.py as private (#1224)
  • Resolve the download folder for some raw datasets (#1213)
  • Store the hash of the extracted CoNLL2000Chunking files so the extraction step will be skipped if the extracted files are detected (#1204)
  • Fix the total number of lines in doc strings of the datasets (#1200)
  • Extend CI tests to cover all the datasets (#1197, #1201, #1171)
  • Document the number of lines in the dataset splits (#1196)
  • Add hashes to skip the slow extraction if the extracted files are available (#1195)
  • Use decorator to loop over the split argument in the datasets (#1194)
  • Remove offset option from torchtext.datasets, and move torchtext.datasets.common to torchtext.data.dataset_utils (#1188, #1145)
  • Remove the step to clean up the cache in test_iwslt() (#1192)
  • Split IWSLT dataset into IWSLT2016 and IWSLT2017 dataset and re-organize the parameters in the constructors (#1191, #1209)
  • Move the prototype datasets in torchtext.experimental.datasets.raw folder to torchtext.datasets folder (#1182, #1202, #1207, #1211, #1212)
  • Add a decorator add_docstring_header() to generate docstring (#1185)
  • Add EnWik9 dataset (#1184)
  • Avoid unnecessary downloads and extraction for some raw datasets, and add more logging (#1178)
  • Split raw datasets into individual files (#1156, #1173, #1174, #1175, #1176)
  • Extend the unittest coverage for all the raw datasets (#1157, #1149)
  • Define the relative path of the datasets in the download_from_url() func and skip unnecessary download if the downloaded files are detected (#1158, #1155)
  • Add MD5 and NUM_LINES as the meta information in the __init__ file of torchtext.datasets folder (#1155)
  • Standardize the text dataset doc strings and argument order. (#1151)
  • Report the “exceeds quota” error for the datasets using Google Drive links (#1150)
  • Add support for the string-typed split values to the text datasets (#1147)
  • Re-name the argument from data_select to split in the dataset constructor (#1143)
  • Add Python 3.9 support across Linux, MacOS, and Windows platforms (#1139)
  • Switch to the new URL for the IWSLT dataset (#1115)
  • Extend the language shortcut in the torchtext.data.utils.get_tokenizer func with the full name when spaCy tokenizers are loaded (#1140)
  • Fix broken CI tests due to the spaCy 3.0 release (#1138)
  • Pass an embedding layer to the constructor of the BertModel class in the BERT example (#1135)
  • Fix test warnings by switching to assertEqual() in PyTorch TestCase class (#1086)
  • Improve CircleCI tests and conda package (#1128, #1121, #1120, #1106)
  • Simplify TorchScript registration by adopting TORCH_LIBRARY_FRAGMENT macro (#1102)
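
Several items above (#1204, #1195, #1158) rely on the same idea: record a hash of a downloaded or extracted file and skip the slow step when a matching file is already on disk. The following is a minimal sketch of that idea using only the standard library. It is not torchtext's implementation; the helper names file_md5 and needs_extraction and the file name train.txt are hypothetical.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    # Hash the file in chunks so large archives need not fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_extraction(path, expected_md5):
    # Extract only when the file is missing or its hash does not match.
    return not (os.path.exists(path) and file_md5(path) == expected_md5)


# Demonstration with a temporary file standing in for an extracted dataset file.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "train.txt")
    missing = needs_extraction(target, "irrelevant")  # file does not exist yet
    with open(target, "wb") as f:
        f.write(b"hello\n")
    expected = hashlib.md5(b"hello\n").hexdigest()
    cached = needs_extraction(target, expected)  # hash matches, so skip
    stale = needs_extraction(target, "0" * 32)  # hash differs, so redo
```

Comparing hashes rather than only checking that the file exists also catches truncated or corrupted files left behind by an interrupted run.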

Bug Fixes

  • Fix the total number of returned lines in setup_iter() func in RawTextIterableDataset (#1142)

Docs

  • Add number of classes to doc strings for text classification data (#1230)
  • Remove Lato font for pytorch/text website (#1227)
  • Add the migration tutorial (#1203, #1216, #1222)
  • Remove the legacy examples on pytorch/text website (#1206)
  • Update README file for 0.9.0 release (#1198)
  • Add CI check to detect undocumented parameters (#1167)
  • Add a static text link for the package version in the doc website (#1161)
  • Fix sphinx warnings and turn warnings into errors (#1163)
  • Add the text datasets to torchtext website (#1153)
  • Add the constructor document for IMDB and SST datasets (#1118)
  • Fix typos in the README file (#1089)
  • Rename "Arguments" to "Args" in the doc strings (#1110)
  • Build docs and push to gh-pages on nightly basis (#1105, #1111, #1112)