Description
🚀 Feature
I suggest we change the current behavior to return a Dataset which iterates over the raw text (standardized to UTF-8) instead of building a Vocab by default.
Motivation
We currently build a Vocab by default if it wasn't provided to a TextClassification dataset. However, this causes at least two issues:
- We are starting to pass flags to this Vocab creation, which couples datasets to Vocab.
- It's not possible to split train into train and validation (see #690, "Adding validation splits to (experimental) text_classification datasets that do not have vocabulary built over them") without first reconstructing the original text, then doing the split, and then rebuilding the Vocab on top of the training data. This affects datasets that don't come with a default validation dataset, such as IMDB.
Pitch
I suggest we change the current behavior to return a Dataset which iterates over the raw text instead of building a Vocab. We can further remove the constraint of accepting a "Vocab" object for the vocab flag and instead simply assume that it will map UTF-8, or possibly lists of UTF-8 (tokens), to lists / tensors of integers etc.
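To make the relaxed contract concrete, here is a minimal sketch in plain Python. It is not torchtext code: `iterate_dataset` and `byte_numericalizer` are hypothetical names invented for illustration, and the only assumption is that `vocab` is any callable mapping a UTF-8 string to a list of integers.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple, Union

RawExample = Tuple[int, str]               # (label, UTF-8 text)
Numericalizer = Callable[[str], List[int]]  # anything that maps text to ids


def iterate_dataset(
    raw: Iterable[RawExample],
    vocab: Optional[Numericalizer] = None,
) -> Iterator[Tuple[int, Union[str, List[int]]]]:
    """Hypothetical dataset iterator: yields raw text unless a callable is given."""
    for label, text in raw:
        if vocab is None:
            yield label, text          # no vocab passed: raw text comes back
        else:
            yield label, vocab(text)   # any callable works, not only a Vocab object


def byte_numericalizer(text: str) -> List[int]:
    """Toy stand-in for a Vocab: maps text to its UTF-8 byte values."""
    return list(text.encode("utf-8"))


raw = [(1, "good film"), (0, "bad")]
raw_out = list(iterate_dataset(raw))
ids_out = list(iterate_dataset(raw, vocab=byte_numericalizer))
```

Under this sketch the dataset never needs to know how the mapping was built, which is the decoupling the pitch is after.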
Downsides
This will force users to create a Vocab themselves. If a vocab is not passed, the created datasets will yield the raw text. So, for most datasets, this will involve a multi-stage process: a) create the raw Dataset, b) build the Vocab, c) numericalize the raw dataset / create datasets using the Vocab.
It'll make the datasets a bit harder to use (three lines of code instead of one), but it'll make datasets and Vocab more orthogonal.
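The three stages could look roughly like the sketch below. It uses plain Python only; `raw_dataset`, `build_vocab`, and `numericalize` are hypothetical helpers, not existing torchtext functions, and whitespace tokenization is an assumption made to keep the example short. It also shows the train/validation split from the Motivation section being done on the raw text before the vocab is built.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

RawExample = Tuple[int, str]  # (label, UTF-8 text)


def raw_dataset() -> List[RawExample]:
    """Hypothetical stand-in for a dataset that yields raw text."""
    return [
        (1, "a great movie"),
        (0, "a dull movie"),
        (1, "great fun"),
        (0, "dull plot"),
    ]


def build_vocab(examples: Sequence[RawExample],
                specials: Sequence[str] = ("<unk>",)) -> Dict[str, int]:
    """Build a token-to-index map, most frequent tokens first."""
    counts = Counter(tok for _, text in examples for tok in text.split())
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}


def numericalize(examples: Sequence[RawExample],
                 stoi: Dict[str, int]) -> List[Tuple[int, List[int]]]:
    """Map each token to its index, falling back to <unk>."""
    unk = stoi["<unk>"]
    return [(label, [stoi.get(tok, unk) for tok in text.split()])
            for label, text in examples]


# a) create the raw dataset and split it while it is still text
raw = raw_dataset()
train_raw, valid_raw = raw[:3], raw[3:]

# b) build the vocab on the training split only
stoi = build_vocab(train_raw)

# c) numericalize both splits with the same vocab
train = numericalize(train_raw, stoi)
valid = numericalize(valid_raw, stoi)
```

Because the split happens on raw text, nothing has to be reconstructed from indices, and tokens seen only in validation (here "plot") simply map to `<unk>`.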
Alternatives
We introduce a "raw_text" flag that will return IterableDatasets which yield raw text and still, by default, provide Datasets with default Vocabs for ease of use.