Steps to retire legacy code and release new building blocks in torchtext #985

Open
zhangguanheng66 opened this issue Sep 16, 2020 · 0 comments

A new abstraction was described in the 0.5.0 release notes. We are now working on retiring some of the legacy code in torchtext over the next few releases. This issue tracks the progress of that work. Here are the steps users can expect:

Step 1: Retire legacy code in torchtext.data and torchtext.datasets

The following components will be removed from the source code soon. We added deprecation warnings for them in the 0.7.0 release (link). Users will still be able to find them in torchtext.legacy, and the original constructors will raise an error when called.

  • torchtext.data.field - RawField, Field, ReversibleField, SubwordField, NestedField, LabelField
  • torchtext.data.iterator - BucketIterator, Iterator, BPTTIterator
  • torchtext.data.dataset - Dataset, TabularDataset
  • torchtext.data.example - Example
  • torchtext.data.pipeline - Pipeline
  • torchtext.data.batch - Batch
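
For code that still relies on these components, the move to torchtext.legacy mostly means changing import paths. The following is a minimal sketch of that rewrite, not an official migration tool. It assumes that torchtext.legacy.data exposes the retired names unchanged; `rewrite_import` and `LEGACY_DATA_NAMES` are hypothetical helpers introduced here for illustration, and only single-line `from torchtext.data import ...` statements are handled.

```python
import re

# Names listed above as retired from torchtext.data.
LEGACY_DATA_NAMES = {
    "RawField", "Field", "ReversibleField", "SubwordField", "NestedField",
    "LabelField", "BucketIterator", "Iterator", "BPTTIterator",
    "Dataset", "TabularDataset", "Example", "Pipeline", "Batch",
}

# Matches single-line imports such as "from torchtext.data import Field, Batch".
_IMPORT_RE = re.compile(r"^(\s*)from torchtext\.data import (.+)$")


def rewrite_import(line):
    """Point a `from torchtext.data import ...` line at torchtext.legacy.data.

    Assumption: torchtext.legacy.data exposes the retired names unchanged.
    Pass lines without a trailing newline (for example from str.splitlines()).
    Lines that import any name outside LEGACY_DATA_NAMES are returned as-is,
    so imports of non-retired helpers are never redirected.
    """
    match = _IMPORT_RE.match(line)
    if match is None:
        return line
    indent, names = match.groups()
    # Strip optional "as alias" parts before checking each imported name.
    imported = [n.strip().split(" as ")[0].strip() for n in names.split(",")]
    if not all(name in LEGACY_DATA_NAMES for name in imported):
        return line
    return f"{indent}from torchtext.legacy.data import {names}"
```

Under these assumptions, `rewrite_import("from torchtext.data import Field, BucketIterator")` yields the same statement with `torchtext.legacy.data` as the source module, while a line that mixes retired and non-retired names is left untouched so it can be split by hand.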

At the same time, the datasets in torchtext.datasets are built on the legacy code above, so they will be moved to the legacy folder as well:

  • language_modeling - LanguageModelingDataset, WikiText2, WikiText103, PennTreebank
  • nli - SNLI, MultiNLI, XNLI
  • sst - SST
  • translation - TranslationDataset, Multi30k, IWSLT, WMT14
  • sequence_tagging - SequenceTaggingDataset, UDPOS, CoNLL2000Chunking
  • trec - TREC
  • imdb - IMDB
  • babi - BABI20

Step 2: Release the new datasets

Some of the legacy datasets above have been rewritten and are currently available in torchtext.experimental.datasets. They will be released to the core library:

  • language_modeling - LanguageModelingDataset, WikiText2, WikiText103, PennTreebank, WMTNewsCrawl
  • text_classification - AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull, IMDB
  • sequence_tagging - UDPOS, CoNLL2000Chunking
  • translation - Multi30k, IWSLT, WMT14
  • question_answer - SQuAD1, SQuAD2

Step 3: Retire legacy vocab/vector and release the new data processing building blocks

We have also rewritten the vocabulary and word vectors as high-performance building blocks with JIT support. We will retire the following components:

  • torchtext.vocab.Vocab
  • torchtext.vocab.Vectors, along with GloVe, FastText, and CharNGram

After this, the new vocabulary and vector building blocks in the experimental folder will be moved to the core library.

  • torchtext.experimental.vectors
  • torchtext.experimental.vocab

We also have some transforms that will be released to the core library.

  • torchtext.experimental.transforms

We understand that this is a special time for the torchtext library, because we have to handle the legacy code and the new building blocks at the same time. We really appreciate the efforts of the OSS community. Users should use the code in the three categories with the following expectations:

  • legacy folder - we will accept bug fixes but not new features.
  • torchtext main folder - officially supported through the stable release, with backward-compatibility (BC) breaking changes handled carefully.
  • experimental folder - experimental components available through the nightly release channel. Users might experience BC-breaking changes without warning messages.
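
As a rough illustration of the three categories, the sketch below classifies a dotted module path by its second component. `support_tier` is a hypothetical helper, not part of torchtext, and it assumes the folder names `legacy` and `experimental` sit directly under the `torchtext` package, as in the paths mentioned in this issue.

```python
def support_tier(module_path):
    """Classify a dotted torchtext module path into one of the three tiers.

    Returns "legacy", "experimental", or "stable" (the main folder).
    """
    parts = module_path.split(".")
    if parts[0] != "torchtext":
        raise ValueError(f"not a torchtext module path: {module_path!r}")
    if len(parts) > 1 and parts[1] == "legacy":
        return "legacy"        # bug fixes only, no new features
    if len(parts) > 1 and parts[1] == "experimental":
        return "experimental"  # nightly channel, may break without warning
    return "stable"            # supported through the stable release
```

A project could run a check like this over its own imports to see which of its dependencies on torchtext fall outside the stable tier.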