[Proposal][Experimental] Further decouple Vocab from TextClassification datasets. #691

Open
@cpuhrsch

🚀 Feature

I suggest we change the current behavior to return a Dataset which iterates over the raw text (standardized to UTF-8) instead of building a Vocab by default.

Motivation

We currently build a Vocab by default if it wasn't provided to a TextClassification dataset. However, this causes at least two issues:

Pitch

I suggest we change the current behavior to return a Dataset which iterates over the raw text instead of building a Vocab. We can further remove the constraint of accepting a "Vocab" object for the vocab flag and instead simply assume that it will map UTF-8, or possibly lists of UTF-8 (tokens), to lists / tensors of integers etc.
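To illustrate the relaxed contract, here is a minimal sketch in plain Python that does not use torchtext at all. The names `RAW_SAMPLES`, `numericalized`, `dict_transform` and `byte_transform` are hypothetical and exist only for this example. The point is that anything mapping a UTF-8 string to a list of integers could be passed where a Vocab object is required today.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical raw samples: (label, UTF-8 text) pairs, as a raw dataset might yield them.
RAW_SAMPLES: List[Tuple[int, str]] = [(0, "good movie"), (1, "bad movie")]


def numericalized(
    samples: Iterable[Tuple[int, str]],
    transform: Callable[[str], List[int]],
) -> Iterator[Tuple[int, List[int]]]:
    """Apply any text -> list-of-ints callable to each raw sample."""
    for label, text in samples:
        yield label, transform(text)


# Option 1: a dict-backed lookup standing in for a Vocab-like object.
lookup = {"good": 0, "bad": 1, "movie": 2}


def dict_transform(text: str) -> List[int]:
    return [lookup[token] for token in text.split()]


# Option 2: a byte-level transform that needs no vocabulary at all.
def byte_transform(text: str) -> List[int]:
    return list(text.encode("utf-8"))


dict_result = list(numericalized(RAW_SAMPLES, dict_transform))
byte_result = list(numericalized(RAW_SAMPLES, byte_transform))
```

Both transforms satisfy the same contract, so the dataset code would not need to know which one it was given.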

Downsides

This will force a user to create a Vocab by default. If a vocab is not passed, the created datasets will yield the raw text. So, for most datasets, this will then involve a multi-stage process: a) create the raw Dataset, b) build the Vocab, c) numericalize the raw dataset / create datasets using the Vocab.

It'll make the datasets a bit harder to use (three lines of code instead of one), but it'll keep the dataset and the Vocab more orthogonal.
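As a rough sketch of what the three stages could look like from the user's side, here is a plain-Python version. It is not the torchtext API: `raw_dataset` and `build_vocab` are hypothetical helpers, whitespace tokenization is assumed, and the vocabulary is a simple dict rather than a Vocab object.

```python
from collections import Counter


def raw_dataset():
    """Stage a) hypothetical raw dataset yielding (label, text) pairs."""
    yield 1, "the cat sat"
    yield 0, "the dog ran"


def build_vocab(samples, tokenizer=str.split):
    """Stage b) count tokens and assign indices, most frequent first."""
    counter = Counter()
    for _, text in samples:
        counter.update(tokenizer(text))
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counter, key=lambda token: (-counter[token], token))
    return {token: index for index, token in enumerate(ordered)}


vocab = build_vocab(raw_dataset())

# Stage c) numericalize the raw dataset using the vocabulary built above.
numericalized = [
    (label, [vocab[token] for token in text.split()])
    for label, text in raw_dataset()
]
```

The raw dataset is iterated twice here, once to build the vocabulary and once to numericalize, which is part of the extra cost this proposal accepts.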

Alternatives

We introduce a "raw_text" flag that will return IterableDatasets which yield raw text and still, by default, provide Datasets with default Vocabs for ease of use.
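A minimal sketch of how such a flag might behave, again in plain Python rather than torchtext. The factory name `text_classification_dataset`, its signature, and the insertion-order default vocabulary are all assumptions made for illustration.

```python
SAMPLES = [(0, "good movie"), (1, "bad movie")]


def text_classification_dataset(samples, raw_text=False, vocab=None):
    """Hypothetical factory: raw iterator if raw_text is set, else numericalized data."""
    if raw_text:
        # Yield (label, text) pairs untouched; no vocabulary is built.
        return iter(samples)
    if vocab is None:
        # Default behaviour kept for ease of use: build a vocabulary on the fly,
        # assigning indices in order of first appearance.
        vocab = {}
        for _, text in samples:
            for token in text.split():
                vocab.setdefault(token, len(vocab))
    return [(label, [vocab[token] for token in text.split()]) for label, text in samples]


default_result = text_classification_dataset(SAMPLES)
raw_result = list(text_classification_dataset(SAMPLES, raw_text=True))
```

With this alternative the one-line default path keeps working, and users who want full control opt in to the raw iterator.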
