You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I suggest we change the current behavior to return a Dataset which iterates over the raw text (standardized to UTF-8) instead of building a Vocab by default.
Motivation
We currently build a Vocab by default if it wasn't provided to a TextClassification dataset. However, this causes at least two issues:
I suggest we change the current behavior to return a Dataset which iterates over the raw text instead of building a Vocab. We can further remove the constraint of accepting a "Vocab" object for the vocab flag and instead simply assume that it will map UTF-8, or possibly lists of UTF-8 (tokens), to lists / tensors of integers etc.
Downsides
This will force a user to create a Vocab by default. If a vocab is not passed the created datasets will yield the raw text. So, for most datasets, this will then involved a multi-stage process. a) Create raw Dataset b) Build Vocab c) Numericalize raw dataset / create datasets using Vocab.
It'll make it a bit harder to use (one versus three lines of code), but it'll cause more orthogonality.
Alternatives
We introduce a "raw_text" flag that will return IterableDatasets which yield raw text and still, by default, provide Datasets with default Vocabs for ease of use.
The text was updated successfully, but these errors were encountered:
🚀 Feature
I suggest we change the current behavior to return a Dataset which iterates over the raw text (standardized to UTF-8) instead of building a Vocab by default.
Motivation
We currently build a Vocab by default if it wasn't provided to a TextClassification dataset. However, this causes at least two issues:
Pitch
I suggest we change the current behavior to return a Dataset which iterates over the raw text instead of building a Vocab. We can further remove the constraint of accepting a "Vocab" object for the vocab flag and instead simply assume that it will map UTF-8, or possibly lists of UTF-8 (tokens), to lists / tensors of integers etc.
Downsides
This will force a user to create a Vocab by default. If a vocab is not passed the created datasets will yield the raw text. So, for most datasets, this will then involved a multi-stage process. a) Create raw Dataset b) Build Vocab c) Numericalize raw dataset / create datasets using Vocab.
It'll make it a bit harder to use (one versus three lines of code), but it'll cause more orthogonality.
Alternatives
We introduce a "raw_text" flag that will return IterableDatasets which yield raw text and still, by default, provide Datasets with default Vocabs for ease of use.
The text was updated successfully, but these errors were encountered: