[Proposal][Experimental] Further decouple Vocab from TextClassification datasets.

## 🚀 Feature

I suggest we change the current behavior to return a Dataset which iterates over the raw text (standardized to UTF-8) instead of building a Vocab by default.

**Motivation**

We currently build a Vocab [by default](https://github.com/pytorch/text/blob/a5880a3da7928dd7dd529507eec943a307204de7/torchtext/datasets/text_classification.py#L126) if it wasn't provided to a TextClassification dataset. However, this causes at least two issues:
- We are starting to pass [flags to this Vocab creation](https://github.com/pytorch/text/blob/a5880a3da7928dd7dd529507eec943a307204de7/torchtext/datasets/text_classification.py#L116), which couples the datasets to Vocab.
- For datasets that don't come with a default validation set, such as IMDB, it's not possible to split train into train and validation (https://github.com/pytorch/text/issues/690) without first reconstructing the original text, then doing the split, and then rebuilding the Vocab on the training portion only.
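
To illustrate the second point, here is a minimal sketch of what the split could look like if the dataset yielded raw text. Everything in it is hypothetical: `raw` stands in for a raw-text dataset of `(label, text)` pairs, and `split_raw` is an illustrative helper, not an existing torchtext function.

```python
import random
from collections import Counter

# Hypothetical stand-in for a raw-text dataset: (label, text) pairs.
raw = [(i % 2, f"review number {i}") for i in range(10)]

def split_raw(raw_dataset, valid_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed and split into train / validation."""
    indices = list(range(len(raw_dataset)))
    random.Random(seed).shuffle(indices)
    n_valid = int(len(indices) * valid_fraction)
    valid = [raw_dataset[i] for i in indices[:n_valid]]
    train = [raw_dataset[i] for i in indices[n_valid:]]
    return train, valid

raw_train, raw_valid = split_raw(raw)

# Token counts (the basis for a Vocab) come from the training portion only,
# so nothing from the validation examples leaks into the Vocab.
train_tokens = Counter(tok for _, text in raw_train for tok in text.split())
```

Because the split happens on raw text, there is no need to reconstruct the text from an already numericalized dataset.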

**Pitch**

I suggest we change the current behavior to return a Dataset which iterates over the raw text instead of building a Vocab. We can further remove the constraint of accepting a "Vocab" object for the vocab flag and instead only assume that it maps a UTF-8 string, or possibly a list of UTF-8 strings (tokens), to a list or tensor of integers.
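
As a rough sketch of the relaxed constraint, the vocab argument could be any callable from text to integers. The names below (`byte_vocab`, `numericalize`) are assumptions made for illustration and are not part of torchtext; the byte-level mapping is just one simple example of such a callable.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def byte_vocab(text: str) -> List[int]:
    """A trivial 'vocab': map a string to its UTF-8 byte values."""
    return list(text.encode("utf-8"))

def numericalize(
    raw_dataset: Iterable[Tuple[int, str]],
    vocab: Callable[[str], List[int]],
) -> Iterator[Tuple[int, List[int]]]:
    """Apply any text-to-integers callable to a raw (label, text) dataset."""
    for label, text in raw_dataset:
        yield label, vocab(text)

# Hypothetical raw dataset of (label, text) pairs.
raw = [(0, "hi"), (1, "é")]
numericalized = list(numericalize(raw, byte_vocab))
```

A Vocab-backed callable that tokenizes and looks up indices would plug into `numericalize` the same way, since the dataset only relies on the call signature.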

**Downsides**

This will force users to create a Vocab themselves. If a vocab is not passed, the created datasets will yield the raw text. So, for most datasets, this will then involve a multi-stage process: a) create the raw Dataset, b) build the Vocab, c) numericalize the raw dataset / create datasets using the Vocab.

It'll make the datasets a bit harder to use (three lines of code instead of one), but it'll make datasets and Vocab more orthogonal.
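
The three stages could look roughly like the following pure-Python sketch. It does not use the torchtext API: `raw_train`, `tokenize`, `build_vocab` and `numericalize` are hypothetical names, and the whitespace tokenizer and `<unk>` token are assumptions chosen only to keep the example self-contained.

```python
from collections import Counter

# a) Raw dataset: hypothetical (label, text) pairs standing in for the raw Dataset.
raw_train = [
    (1, "a great movie"),
    (0, "a dull movie"),
    (1, "great acting"),
]

def tokenize(text):
    # Whitespace tokenizer, chosen only for illustration.
    return text.split()

def build_vocab(raw_dataset, specials=("<unk>",)):
    """b) Map tokens to indices: specials first, then by frequency, ties alphabetical."""
    counter = Counter(tok for _, text in raw_dataset for tok in tokenize(text))
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ordered]
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(raw_dataset, stoi):
    """c) Replace each token with its index; unknown tokens map to <unk>."""
    unk = stoi["<unk>"]
    return [
        (label, [stoi.get(tok, unk) for tok in tokenize(text)])
        for label, text in raw_dataset
    ]

stoi = build_vocab(raw_train)
train = numericalize(raw_train, stoi)
```

The same `stoi` can then be reused to numericalize a validation or test split, which is the part that is awkward when the Vocab is built inside the dataset constructor.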

**Alternatives**

We introduce a "raw_text" flag that returns IterableDatasets which yield raw text, while still providing Datasets with default Vocabs by default for ease of use.
