update README #1652

Merged
merged 1 commit into from
Mar 11, 2022
54 changes: 22 additions & 32 deletions README.rst
@@ -13,12 +13,12 @@ torchtext
This repository consists of:

* `torchtext.datasets <https://github.com/pytorch/text/tree/main/torchtext/datasets>`_: The raw text iterators for common NLP datasets
* `torchtext.data <https://github.com/pytorch/text/tree/main/torchtext/data>`_: Some basic NLP building blocks (tokenizers, metrics, functionals etc.)
* `torchtext.nn <https://github.com/pytorch/text/tree/main/torchtext/nn>`_: NLP related modules
* `torchtext.data <https://github.com/pytorch/text/tree/main/torchtext/data>`_: Some basic NLP building blocks
* `torchtext.transforms <https://github.com/pytorch/text/tree/main/torchtext/transforms>`_: Basic text-processing transformations
* `torchtext.models <https://github.com/pytorch/text/tree/main/torchtext/models>`_: Pre-trained models
* `torchtext.vocab <https://github.com/pytorch/text/tree/main/torchtext/vocab>`_: Vocab and Vectors related classes and factory functions
* `examples <https://github.com/pytorch/text/tree/main/examples>`_: Example NLP workflows with PyTorch and torchtext library.

Note: The legacy code discussed in the `torchtext v0.7.0 release note <https://github.com/pytorch/text/releases/tag/v0.7.0-rc3>`_ has been retired to the `torchtext.legacy <https://github.com/pytorch/text/tree/release/0.9/torchtext/legacy>`_ folder. That legacy code will no longer be maintained by the development team, and we plan to remove it fully in a future release. See the `torchtext.legacy <https://github.com/pytorch/text/tree/release/0.9/torchtext/legacy>`_ folder for more details.

Installation
============
@@ -30,6 +30,7 @@ We recommend Anaconda as a Python package management system. Please refer to `py
:widths: 10, 10, 10

nightly build, main, ">=3.7, <=3.9"
1.11.0, 0.12.0, ">=3.6, <=3.9"
1.10.0, 0.11.0, ">=3.6, <=3.9"
1.9.1, 0.10.1, ">=3.6, <=3.9"
1.9, 0.10, ">=3.6, <=3.9"
@@ -103,48 +104,37 @@ The datasets module currently contains:
* Machine translation: IWSLT2016, IWSLT2017, Multi30k
* Sequence tagging (e.g. POS/NER): UDPOS, CoNLL2000Chunking
* Question answering: SQuAD1, SQuAD2
* Text classification: AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
* Text classification: SST2, AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
* Model pre-training: CC-100

For example, to access the raw text from the AG_NEWS dataset:
Models
======

.. code-block:: python
The library currently consists of the following pre-trained models:

   >>> from torchtext.datasets import AG_NEWS
   >>> train_iter = AG_NEWS(split='train')
   >>> # Iterate with for loop
   >>> for (label, line) in train_iter:
   ...     print(label, line)
   >>> # Or send to DataLoader
   >>> from torch.utils.data import DataLoader
   >>> train_iter = AG_NEWS(split='train')
   >>> dataloader = DataLoader(train_iter, batch_size=8, shuffle=False)
* RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/roberta#pre-trained-models>`_
* XLM-RoBERTa: `Base and Large Architecture <https://github.com/pytorch/fairseq/tree/main/examples/xlmr#pre-trained-models>`_
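
A pre-trained model is only useful together with the exact text preprocessing it was trained with, so weights and tokenization are distributed as a matched pair. As a rough illustration of that pairing idea only (the ``ModelBundle`` name and the toy "encoder" below are hypothetical stand-ins, not the torchtext API), a minimal sketch:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass(frozen=True)
class ModelBundle:
    """Pairs a model constructor with the text transform it was trained
    with, so users cannot accidentally mix mismatched preprocessing and
    weights."""
    build_model: Callable[[], Callable[[List[str]], List[int]]]
    build_transform: Callable[[], Callable[[str], List[str]]]

# Hypothetical stand-ins for a real encoder and its tokenizer
toy_bundle = ModelBundle(
    # "encoder": just reports the sequence length
    build_model=lambda: (lambda tokens: [len(tokens)]),
    # "tokenizer": naive whitespace split
    build_transform=lambda: (lambda text: text.split()),
)

transform = toy_bundle.build_transform()
model = toy_bundle.build_model()
print(model(transform("hello world")))  # → [2]
```

Because both factories live on one frozen object, downstream code can pass the bundle around as a single unit and never choose the transform independently of the model.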

Tokenizers
==========

The transforms module currently supports the following scriptable tokenizers:

* `SentencePiece <https://github.com/google/sentencepiece>`_
* `GPT-2 BPE <https://github.com/openai/gpt-2/blob/master/src/encoder.py>`_
* `CLIP <https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py>`_
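
BPE-style tokenizers such as the GPT-2 one work by repeatedly merging the adjacent symbol pair with the highest-priority learned merge rule until no known pair remains. A minimal pure-Python sketch of that greedy merge loop (illustrative only; the toy merge table below is made up, not a learned one):

```python
def bpe_tokenize(word, merges):
    """Greedy BPE: repeatedly merge the best-ranked adjacent symbol pair.
    `merges` maps a symbol pair to its priority (lower = merged earlier)."""
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; unknown pairs get infinite rank.
        ranked = [(merges.get((a, b), float("inf")), i)
                  for i, (a, b) in enumerate(zip(symbols, symbols[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):
            break  # no learned merge applies
        # Replace the pair with its concatenation.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Toy merge table (real tokenizers learn thousands of merges from data)
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_tokenize("lower", merges))  # → ['low', 'er']
```

Rare words fall apart into frequent subwords ("low" + "er") while any string remains representable, since unmerged characters survive as single-symbol tokens.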

Tutorials
=========

To get started with torchtext, users may refer to the following tutorials available on PyTorch website.
To get started with torchtext, users may refer to the following tutorial available on PyTorch website.
Contributor:
nit: why did we change "tutorials" to tutorial?


* `SST-2 binary text classification using XLM-R pre-trained model <https://pytorch.org/text/stable/tutorials/sst2_classification_non_distributed.html>`_
* `Text classification with AG_NEWS dataset <https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html>`_
* `Translation trained with Multi30k dataset using transformers and torchtext <https://pytorch.org/tutorials/beginner/translation_transformer.html>`_
* `Language modeling using transforms and torchtext <https://pytorch.org/tutorials/beginner/transformer_tutorial.html>`_


[BC Breaking] Legacy
====================

In the v0.9.0 release, we moved the following legacy code to `torchtext.legacy <https://github.com/pytorch/text/tree/release/0.9/torchtext/legacy>`_. This is part of the work to revamp the torchtext library and the motivation has been discussed in `Issue #664 <https://github.com/pytorch/text/issues/664>`_:

* ``torchtext.legacy.data.field``
* ``torchtext.legacy.data.batch``
* ``torchtext.legacy.data.example``
* ``torchtext.legacy.data.iterator``
* ``torchtext.legacy.data.pipeline``
* ``torchtext.legacy.datasets``

We have a `migration tutorial <https://colab.research.google.com/github/pytorch/text/blob/release/0.9/examples/legacy_tutorial/migration_tutorial.ipynb>`_ to help users switch to the torchtext datasets in the ``v0.9.0`` release. Users who still want the legacy components can add ``legacy`` to the import path.

In the v0.10.0 release, we retired the Vocab class to `torchtext.legacy <https://github.com/pytorch/text/tree/release/0.9/torchtext/legacy>`_. Users can still access the legacy Vocab via ``torchtext.legacy.vocab``. That class has been replaced by a Vocab module that is backed by an efficient C++ implementation and provides common functional APIs for NLP workflows.
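
The replacement Vocab behaves like an ordered token-to-index mapping with a default index for out-of-vocabulary tokens. A minimal pure-Python sketch of that lookup contract (illustrative only; ``MiniVocab`` is a hypothetical toy, not the C++-backed implementation):

```python
from collections import Counter

class MiniVocab:
    """Toy token-to-index mapping mirroring a vocab's lookup contract:
    ordered insertion, integer ids, and a default index for unknowns."""
    def __init__(self, tokens, specials=("<unk>",)):
        self.itos = list(specials) + [t for t in tokens if t not in specials]
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.default_index = self.stoi[specials[0]]  # id for OOV tokens

    def __getitem__(self, token):
        return self.stoi.get(token, self.default_index)

    def lookup_indices(self, tokens):
        return [self[t] for t in tokens]

# Build from token frequencies, most frequent first
counter = Counter("the cat sat on the mat the end".split())
ordered = [tok for tok, _ in counter.most_common()]
vocab = MiniVocab(ordered)
print(vocab.lookup_indices(["the", "cat", "unseen"]))  # → [1, 2, 0]
```

Frequency-sorted insertion keeps the smallest ids for the most common tokens, and the default index means lookups never raise on unseen text.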

Disclaimer on Datasets
======================
