[Proposal][Experimental] Further decouple Vocab from TextClassification datasets. #691

Open
cpuhrsch opened this issue Feb 5, 2020 · 0 comments

cpuhrsch commented Feb 5, 2020

🚀 Feature

I suggest we change the current behavior to return a Dataset which iterates over the raw text (standardized to UTF-8) instead of building a Vocab by default.

Motivation

We currently build a Vocab by default if it wasn't provided to a TextClassification dataset. However, this causes at least two issues:

Pitch

I suggest we change the current behavior to return a Dataset which iterates over the raw text instead of building a Vocab. We can further remove the constraint of accepting a "Vocab" object for the vocab flag and instead simply assume that whatever is passed will map UTF-8 strings, or possibly lists of UTF-8 strings (tokens), to lists or tensors of integers.
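A minimal sketch of what the relaxed vocab flag could look like, in plain Python rather than torchtext itself. The names `text_classification_dataset` and `byte_vocab` are hypothetical and only illustrate the idea that anything mapping text to integers would be accepted:

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple, Union

# Hypothetical stand-ins for illustration; these are not torchtext classes.
Example = Tuple[int, str]              # (label, raw UTF-8 text)
Transform = Callable[[str], List[int]]  # anything that maps text to integers


def text_classification_dataset(
    examples: Iterable[Example],
    vocab: Optional[Transform] = None,
) -> Iterator[Tuple[int, Union[str, List[int]]]]:
    """Yield raw text when no vocab is given, otherwise vocab(text)."""
    for label, text in examples:
        yield (label, text if vocab is None else vocab(text))


def byte_vocab(text: str) -> List[int]:
    """A trivial 'vocab' that is not a Vocab object: one integer per UTF-8 byte."""
    return list(text.encode("utf-8"))


data = [(1, "good movie"), (0, "bad")]
raw = list(text_classification_dataset(data))
numericalized = list(text_classification_dataset(data, vocab=byte_vocab))
```

Here the dataset never inspects the type of `vocab`; it only calls it, which is the decoupling the pitch is after.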

Downsides

This will force a user to create a Vocab explicitly. If a vocab is not passed, the created datasets will yield the raw text. So, for most datasets, this will then involve a multi-stage process: a) create the raw Dataset, b) build the Vocab, c) numericalize the raw dataset / create datasets using the Vocab.

It'll make it a bit harder to use (one versus three lines of code), but it'll make the components more orthogonal.
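The three stages might look roughly like the following sketch. It uses only the Python standard library, and `raw_dataset`, `build_vocab`, and `numericalize` are hypothetical helpers rather than existing torchtext functions; whitespace tokenization and the `<unk>` index are likewise assumptions made for the example:

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def raw_dataset() -> List[Tuple[int, str]]:
    # a) Hypothetical raw dataset: (label, raw text) pairs.
    return [(1, "a good movie"), (0, "a bad movie")]


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    # b) Build a token -> index mapping; index 0 is reserved for unknown tokens.
    counts = Counter(tok for text in texts for tok in text.split())
    vocab = {"<unk>": 0}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab


def numericalize(
    dataset: Iterable[Tuple[int, str]], vocab: Dict[str, int]
) -> List[Tuple[int, List[int]]]:
    # c) Replace each token by its index, falling back to the <unk> index.
    return [
        (label, [vocab.get(tok, 0) for tok in text.split()])
        for label, text in dataset
    ]


# The "three lines of code" from the user's point of view:
raw = raw_dataset()
vocab = build_vocab(text for _, text in raw)
train = numericalize(raw, vocab)
```

Each stage can be swapped out independently, for example by replacing `build_vocab` with a pretrained mapping, which is the orthogonality being traded for the extra lines.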

Alternatives

We introduce a "raw_text" flag that will return IterableDatasets which yield raw text and still, by default, provide Datasets with default Vocabs for ease of use.
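As a rough illustration of this alternative, the sketch below uses a hypothetical `load_dataset` function with a `raw_text` flag. A plain iterator stands in for an IterableDataset, and the default vocab construction is an assumption made for the example:

```python
from typing import Dict, Iterator, List, Tuple, Union

# Hypothetical in-memory samples standing in for a real dataset file.
_SAMPLES: List[Tuple[int, str]] = [(1, "a good movie"), (0, "a bad movie")]


def load_dataset(
    raw_text: bool = False,
) -> Union[Iterator[Tuple[int, str]], List[Tuple[int, List[int]]]]:
    if raw_text:
        # Opt-in path: yield (label, raw text) without building any vocab.
        return iter(_SAMPLES)
    # Default path: build a vocab implicitly, in order of first appearance.
    vocab: Dict[str, int] = {}
    for _, text in _SAMPLES:
        for tok in text.split():
            vocab.setdefault(tok, len(vocab))
    return [(label, [vocab[t] for t in text.split()]) for label, text in _SAMPLES]


numericalized = load_dataset()           # one line, default vocab
raw = list(load_dataset(raw_text=True))  # raw text on request
```

The default call keeps the current one-line usage, while `raw_text=True` exposes the raw iteration for users who want to supply their own numericalization.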
