
Text classification datasets with new torchtext dataset abstraction #701

Merged: 48 commits merged into pytorch:master on Apr 21, 2020

Conversation

@zhangguanheng66 (Contributor) commented Feb 28, 2020

A new dataset abstraction that decouples data and vocab/tokenizer.

To load the raw text dataset:

from torchtext.experimental.datasets import RawAG_NEWS
train, test = RawAG_NEWS()

# Process text data
from torchtext.experimental.datasets.text_classification import build_vocab
from torchtext.experimental.transforms import TokenizerTransform, VocabTransform, ToTensor
from torchtext.data.utils import get_tokenizer
from torch.nn import Sequential

vocab = build_vocab(train, TokenizerTransform(get_tokenizer('basic_english')))
text_transform = Sequential(TokenizerTransform(get_tokenizer('basic_english')), VocabTransform(vocab), ToTensor())
label_transform = ToTensor()
for (label, txt) in train[:10]:
    print(label_transform(label), text_transform(txt))

Or wrap everything above up and load the processed dataset with one command:

from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS()

@zhangguanheng66 (Contributor, Author)

@fmassa @cpuhrsch @vincentqb

@zhangguanheng66 (Contributor, Author) commented Mar 3, 2020

VocabTransform is scriptable for a Dict vocab.

import torch
import torchtext
vocab = {'here': 1, 'we': 2, 'are': 3}
vocab_transform = torchtext.experimental.datasets.text_classification.VocabTransform(vocab)
jit_method = torch.jit.script(vocab_transform)
print(vocab_transform(['here', 'we', 'are']) == jit_method(['here', 'we', 'are']))

TokenizerTransform is scriptable for a split tokenizer.

import torch
import torchtext
from torchtext.data.utils import get_tokenizer
token_transform = torchtext.experimental.datasets.text_classification.TokenizerTransform(get_tokenizer(None))
token_transform('here we are')
jit_method = torch.jit.script(token_transform)
print(token_transform('here we are') == jit_method('here we are'))

Tokenizer + vocab

text_transform = torchtext.experimental.datasets.text_classification.TextSequential(token_transform, vocab_transform)
text_transform('here we are')
jit_method = torch.jit.script(text_transform)
print(text_transform('here we are') == jit_method('here we are'))

return data


def build_vocab(dataset, transform):
Contributor:

Should this take a kwargs argument that will pass arguments to the vocab constructor inside build_vocab_from_iterator? This will allow us to do something like:

train, test = AG_NEWS()
transform1 = TokenizerTransform('basic_english')
train, valid = torch.utils.data.random_split(train, [90_000, 10_000]) #not exact numbers
vocab = build_vocab(train, transform1, max_size = 25_000, min_freq = 2)

It would also mean build_vocab_from_iterator needs to be modified to accept kwargs.

@zhangguanheng66 (Contributor, Author) Mar 4, 2020:

This could be added. However, the wrapper around building the vocab is pretty simple, and users now have the flexibility to do that themselves (see the sketch below).
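
A minimal sketch of how a user could do that themselves, assuming the counter-based torchtext.vocab.Vocab constructor; the helper name build_vocab_with_kwargs is hypothetical and not part of this PR:

from collections import Counter
from torchtext.vocab import Vocab

def build_vocab_with_kwargs(dataset, transform, **vocab_kwargs):
    # Count tokens over the raw (label, text) pairs and forward any extra
    # arguments (e.g. max_size, min_freq) to the Vocab constructor.
    counter = Counter()
    for (label, txt) in dataset:
        counter.update(transform(txt))
    return Vocab(counter, **vocab_kwargs)

# vocab = build_vocab_with_kwargs(train, transform1, max_size=25_000, min_freq=2)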

@vincentqb (Contributor) commented Mar 11, 2020

import torchtext
from torchtext.experimental.datasets import AG_NEWS
from torchtext.experimental.transforms import TokenizerTransform, VocabTransform, Compose
from torchtext.experimental.datasets.new_text_classification import build_vocab

# Import raw text strings
train, test = AG_NEWS()

# Build tokenizer transform
transform1 = TokenizerTransform('basic_english')

class PROCESSED(AG_NEWS):  # Wrap as map-like function

    def __getitem__(self, n):
        item = super().__getitem__(n)
        return transform1(item)

    def __next__(self):
        item = super().__next__()
        return transform1(item)

transformed_train = PROCESSED()

# Build vocab transform
vocab = VocabTransform(transformed_train)

# A new dataset with raw text strings + label/string transforms
from torchtext.experimental.datasets.new_text_classification import TextClassificationDataset
new_train = TextClassificationDataset(train.data, [int, transform1, vocab])  # transforms can be wrapped similarly

@cpuhrsch (Contributor) commented Mar 11, 2020

How about

import torchtext
from torchtext.experimental.datasets import AG_NEWS
from torchtext.experimental.transforms import TokenizerTransform, VocabTransform, Compose
from torchtext.experimental.datasets.new_text_classification import build_vocab

# Import raw text strings
train, _ = AG_NEWS()

def tokenizer(raw_text):
    splits = raw_text.split()
    label, row = int(splits[0]), splits[1:]
    return label, TokenizerTransform('basic_english')(row) #Assuming this has 0 init cost

tokenized_train = map(tokenizer, train) # For lack of a better function name

# Build vocab transform
# EDIT: Needs function to have tokenized_train to only return text part
vocab = build_vocab(map(lambda item: item[1], tokenized_train))

# Assuming datasets don't need to be reset after consumed
# EDIT: Need to take care of label
new_train = map(lambda item: (item[0], vocab(item[1])), tokenized_train)

@vincentqb (Contributor) commented Mar 11, 2020

@fmassa points to tf and lua

@zhangguanheng66 (Contributor, Author)

Some offline discussions:

  • Transforms are in general callable objects used to map inputs according to a "contract". For example, a tokenizer transform defines the "contract" of converting a string to a list of tokens (see the sketch after this list).
  • Vocab transform contains a "dictionary" object which maps tokens to ids.
  • A separate folder is created to save the raw text datasets. For the existing datasets in the torchtext library, some standard transforms and the raw text datasets are wrapped together to support "one-command" data loading.
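
As an illustration of that contract (this is not the library's TokenizerTransform implementation, just a hypothetical stand-in):

import torch.nn as nn

class WhitespaceTokenizerTransform(nn.Module):
    # Illustrative transform: the "contract" is a string in, a list of tokens out.
    def forward(self, line: str):
        return line.split()

tokens = WhitespaceTokenizerTransform()('here we are')  # ['here', 'we', 'are']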

@hudeven (Contributor) commented Apr 6, 2020

@zhangguanheng66 the new APIs look great! I have a few questions:

  1. How to handle dense/categorical features with the new API?
  2. The data for a language model could be too large to fit in memory, so we might have to use IterableDataset. Could you also provide a demo for IterableDataset?
  3. How to support custom batching logic and custom sampling logic for IterableDataset?

import io
from torchtext.utils import download_from_url, extract_archive, unicode_csv_reader

URLS = {
Contributor:

In torchaudio there have recently been issues with corrupted downloads. Either within this PR or at least as a follow-up item, we should look into md5 verification of downloads to make sure the user is actually getting the correct data.
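
A minimal sketch of such a check, assuming the expected md5 digests would be stored alongside the URLs (the helper name is illustrative):

import hashlib

def verify_md5(path, expected_md5, chunk_size=1024 * 1024):
    # Hash the downloaded archive in chunks and compare against the expected digest.
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5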


train_data = _create_data_from_csv(train_csv_path)
test_data = _create_data_from_csv(test_csv_path)
return (RawTextDataset(train_data),
Contributor:

We should check whether the returned objects here actually carry the doc string of the calling class. I'll look into this a bit.

Contributor (Author):

Just check that object.__doc__ carries the doc string of the calling class (a.k.a. RawTextDataset in this case).
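
For example, a quick check along these lines (using the raw AG_NEWS dataset from this PR):

from torchtext.experimental.datasets.raw import AG_NEWS as RawAG_NEWS

train, test = RawAG_NEWS()
print(train.__doc__)  # should print the doc string of the returned dataset class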

return data_set


def IMDB(root='.data'):
Contributor:

This is the only function that accepts an explicit "root" argument.

In general, let's think about how this could be something more generic than a path. In the future, users might want to pass other file-like objects as a source for dataset construction (see the sketch below).
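
A hedged sketch of one way to accept either a path or a file-like object; the helper name is illustrative and not part of this PR:

import io

def _open_source(source, encoding='utf8'):
    # Accept anything that already behaves like a readable file;
    # otherwise treat it as a path and open it.
    if hasattr(source, 'read'):
        return source
    return io.open(source, encoding=encoding)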

return data


class RawTextDataset(torch.utils.data.Dataset):
Contributor:

In light of multiprocessing, data parallelism, etc., I wonder if it's worthwhile looking into making this an IterableDataset after all and introducing a convenience function that creates a map-style dataset by simply exhausting the iterator and writing out the result. After all, this is the raw dataset, so that would provide an even more general interface. Let's also look into JIT-ability of that.
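
A minimal sketch of such a convenience function, assuming the raw iterable dataset yields (label, text) pairs; the function name is illustrative:

import torch

def to_map_style(iterable_dataset):
    # Exhaust the iterator once and materialize the rows in memory.
    rows = list(iterable_dataset)

    class _MapStyleDataset(torch.utils.data.Dataset):
        def __len__(self):
            return len(rows)

        def __getitem__(self, idx):
            return rows[idx]

    return _MapStyleDataset()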

Contributor (Author):

Switched to IterableDataset for the raw text dataset.

self.start = 0
self.num_lines = None

def setup_iter(self, start=0, num_lines=None):
@cpuhrsch (Contributor) Apr 14, 2020:

What's the expected user interface for this? Why not make this part of the constructor or a factory function?

Contributor (Author):

The default behavior of the raw datasets is to load all the text strings and labels (no complicated APIs). If users would like to cache a chunk of the data (for example, in the DataLoader workers), they have to explicitly call the setup_iter function and set up the iterator before caching. IMO, this keeps the API at the raw dataset level very simple and clean.
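
A hedged sketch of what that could look like for DataLoader workers, assuming the total number of lines is known up front (the splitting logic here is illustrative):

import torch

def make_worker_init_fn(total_num_lines, num_workers):
    # Each DataLoader worker restricts its iterator to its own chunk via setup_iter.
    def worker_init_fn(worker_id):
        info = torch.utils.data.get_worker_info()
        per_worker = total_num_lines // num_workers
        info.dataset.setup_iter(start=worker_id * per_worker, num_lines=per_worker)
    return worker_init_fn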

data_select=('train', 'test')):
tokenizer=None, data_select=('train', 'test')):
text_transform = []
if not tokenizer:
Contributor:

"if tokenizer is None" is more precise. This will also succeed for empty lists and bool values, which isn't the default.

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.vocab import Vocab
from torchtext.datasets import TextClassificationDataset
from torchtext.experimental.datasets.raw import AG_NEWS as RawAG_NEWS
Contributor:

Instead of this, we should be able to just import "raw" and then index into it using the name of the dataset you're calling this from.
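
A hedged sketch of that pattern, assuming the raw datasets stay accessible as attributes of the raw module:

from torchtext.experimental.datasets import raw

DATASET_NAME = 'AG_NEWS'  # e.g. derived from the name of the calling dataset
train, test = getattr(raw, DATASET_NAME)()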

@cpuhrsch (Contributor) left a comment

I think this is good to go.

@zhangguanheng66 merged commit 041f5a5 into pytorch:master on Apr 21, 2020
@zhangguanheng66 (Contributor, Author)

fix #690

@thvasilo

I'm wondering if I can use the RawTextIterableDataset with a CSV input.

I see it inherits from IterableDataset, while TabularDataset inherits from torchtext.data.dataset.Dataset -> torch.utils.data.Dataset.

So currently, since I'm reading my data from a CSV, I have to go the old route of creating text and label fields (for a binary classification task).

I'm wondering if I can replace that with the new RawTextIterableDataset, and whether you have any examples of how to do that.

@zhangguanheng66 (Contributor, Author)

RawTextIterableDataset actually sets up the iterator to read a CSV file. Then, TextIterableDataset caches the iterator for labels and text strings. If you combine those two steps, it will be very similar to https://github.com/pytorch/text/blob/master/torchtext/datasets/text_classification.py.
In addition to the raw text strings, you have to set up the tokenizer/vocab transforms. With the new dataset abstraction, we hope to get rid of the old utils (like fields).
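
As a rough sketch of the first step for a custom CSV, mirroring the pattern in torchtext/datasets/text_classification.py (the helper name and the (label, text) row format are assumptions):

import io
from torchtext.utils import unicode_csv_reader

def my_csv_iterator(csv_path):
    # Yield (label, text) pairs from a CSV where column 0 is the label
    # and the remaining columns hold the text.
    with io.open(csv_path, encoding='utf8') as f:
        for row in unicode_csv_reader(f):
            yield int(row[0]), ' '.join(row[1:])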
