
Adding validation splits to (experimental) text_classification datasets that do not have vocabulary built over them #690

Closed
bentrevett opened this issue Feb 5, 2020 · 8 comments

@bentrevett
Contributor

🚀 Feature

The experimental text_classification datasets should have a way to build a validation set from them, without the vocabulary being built over the validation set.

Motivation

In ML, you should always have training, validation, and test sets. In NLP, the vocabulary should be built from the training set only, never from the validation or test sets.

The current experimental text classification (IMDB) dataset does not have a validation set and automatically builds the vocabulary whilst loading the train/test sets. After loading the train and test sets, we would need to construct a validation set with torch.utils.data.random_split, but by that point the vocabulary has already been built over the examples that end up in the validation set. There is currently no way to avoid this.
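
For concreteness, here is a minimal sketch of the current workflow and why it leaks (it follows the experimental IMDB API as used elsewhere in this issue; exact signatures may differ):

import torch
from torchtext.experimental import datasets

# Loading already builds the vocabulary over the full training set.
train_data, test_data = datasets.IMDB(data_select=('train', 'test'))

# Carving out a validation set afterwards...
n_train = int(len(train_data) * 0.8)
train_data, valid_data = torch.utils.data.random_split(
    train_data, [n_train, len(train_data) - n_train])

# ...cannot undo the fact that the vocabulary has already seen every
# example that now sits in valid_data.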

Pitch

'valid' should be accepted as a data_select argument and should split off a validation set before the vocabulary is built, so that the vocabulary only covers the remaining training examples. As the IMDB dataset does not have a standardized validation split, we can do something like taking the last 20% of the training set.

I am proposing something like the following after the iters_group is created here:

from itertools import islice, tee

if 'valid' in iters_group.keys():
    # Duplicate the training iterator: one copy to count examples,
    # one for the validation slice, one for the new training slice.
    train_iter_a, train_iter_b, train_iter_c = tee(iters_group['train'], 3)
    len_train = int(sum(1 for _ in train_iter_a) * 0.8)
    # The last 20% of the training examples become the validation set.
    iters_group['valid'] = islice(train_iter_b, len_train, None)
    # The first 80% remain the training set.
    iters_group['train'] = islice(train_iter_c, 0, len_train)
    # The vocab iterator is cut to the same first 80%, so the vocabulary
    # never sees the validation examples.
    iters_group['vocab'] = islice(iters_group['vocab'], 0, len_train)

tee duplicates an iterator and islice takes a slice of an iterator. We need three copies of the training iterator because it is used three times: the first copy is consumed to count the training examples, which tells us where to split (the validation set is the last 20%); islice then takes the last 20% of the second copy to form the validation set and the first 80% of the third copy to form the new training set. Finally, we take the first 80% of the 'vocab' iterator so that it matches the new training set, since that is what the vocabulary is built from.

We can now correctly load train, validation, and test sets with the vocabulary built only over the training set:

from torchtext.experimental import datasets

train_data, valid_data, test_data = datasets.IMDB(data_select=('train', 'valid', 'test'))

We can also load a custom vocabulary built from the original vocabulary, like so (note that 'valid' needs to be in data_select when building the original vocabulary):

import os

from torchtext import vocab
from torchtext.experimental import datasets

def get_IMDB(root, tokenizer, vocab_max_size, vocab_min_freq):
    os.makedirs(root, exist_ok=True)

    # Build the full vocabulary over the training split only
    # ('valid' must be in data_select so the split happens before the vocab is built).
    train_data, _ = datasets.IMDB(tokenizer=tokenizer,
                                  data_select=('train', 'valid'))
    old_vocab = train_data.get_vocab()

    # Rebuild the vocabulary with a size cap and a minimum frequency.
    new_vocab = vocab.Vocab(old_vocab.freqs,
                            max_size=vocab_max_size,
                            min_freq=vocab_min_freq)

    # Reload all three splits, numericalized with the new vocabulary.
    train_data, valid_data, test_data = datasets.IMDB(tokenizer=tokenizer,
                                                      vocab=new_vocab,
                                                      data_select=('train', 'valid', 'test'))

    return train_data, valid_data, test_data

Happy to make the PR if this is given the go-ahead.

@zhangguanheng66
Contributor

zhangguanheng66 commented Feb 5, 2020

I'm pretty sure all the vocab objects are built on the train dataset (link). If you use torch.utils.data.random_split to split the test dataset into valid/test sub-datasets, the vocab object should have nothing to do with the valid sub-dataset.

@bentrevett
Contributor Author

bentrevett commented Feb 5, 2020

Yes, the vocabulary is always built over the training set. The issue comes when you split the training set into training and validation sets. Your validation set has been numericalized from a vocabulary that has already “seen” all of these validation examples when they were part of the training set. This means information leaks from the training set into the validation set, giving inflated validation scores.

The validation set should not be taken from the test set. When comparing results on a dataset everyone should be using the exact same test set. Creating a validation set from the test set violates this, is extremely bad practice and causes information to leak from the test set into the validation set.

@zhangguanheng66
Contributor

I agree with your point that splitting the train dataset will leak information through the vocab and result in inflated validation scores. I also agree that, ideally, people should split the train data into train/valid subsets.

However, I don't think splitting the test set here is "extremely bad practice" :). The purpose of a test set, IMO, is to have a separate dataset that is never touched during the training process. On that point, if we split the test dataset into test/valid sub-datasets, we use the valid sub-dataset across epochs but never touch the test sub-dataset. In that sense, we just have a smaller dataset for final testing. The drawback of this method is that the valid dataset never changes across epochs. The same is true for the word language modeling datasets (like WikiText-2), where we have a fixed valid dataset.

@bentrevett
Contributor Author

bentrevett commented Feb 5, 2020

I disagree and believe it is bad practice. If you release a paper with results showing X% accuracy over the test set, the only way I can compare a new method is if I use the exact same test set as you.

If you have used 100% of the test set to calculate your test accuracy and I have used 80% of the test set (as I've used 20% of it for my validation set), then these results are incomparable as we haven't used the exact same test sets.

Test sets should never be touched, in any way, including splitting them to form validation sets.

@cpuhrsch
Contributor

cpuhrsch commented Feb 5, 2020

I agree that the test set should never be touched. However, I do not agree that we should introduce an option that generates a validation dataset if such a validation dataset has not been defined by the dataset creators.

The point here is to provide a reference that will yield the train and test datasets as described by the dataset creators. Some people may then split train into a training and validation dataset with an 80/20 split. Or maybe they'll use a 70/30 split. Or maybe they'll use a fixed seed to pull a "random" subset from the training dataset, etc.

So, if someone wants to do this train/validation split, and they surely will, we should have an abstraction that makes it easy to do, but we should not make a choice for them by default. Otherwise we diverge from the idea that this dataset implementation should do one thing well: provide a reference for the dataset as described by the dataset creators.
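
For illustration, the kind of user-side helper this implies could look like the sketch below (not an existing torchtext API; the name, default ratio, and seed handling are all hypothetical):

import random

def split_train_valid(examples, valid_ratio=0.2, seed=0):
    # Split raw training examples into train/valid subsets with a fixed seed,
    # before any vocabulary is built; the user owns both the ratio and the seed.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n_valid = int(len(examples) * valid_ratio)
    return examples[n_valid:], examples[:n_valid]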

@cpuhrsch
Contributor

cpuhrsch commented Feb 5, 2020

One way of dealing with this would be to modify text classification to return the raw text instead of building a vocab if one doesn't exist. That way you'd get a training and testing dataset that yields the lines of text (in UTF-8 format), which could then be fed into a vocab factory.

We could, for now, add a flag that will cause the raw text to be returned and then later on decide whether we want to make that the default.
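
A rough sketch of what that could enable on the user side (the raw_text flag is hypothetical, following the proposal above; get_tokenizer and build_vocab_from_iterator are existing torchtext helpers, though exact signatures may differ):

from torchtext.data.utils import get_tokenizer
from torchtext.experimental import datasets
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer('basic_english')

# Hypothetical flag: the dataset yields (label, raw UTF-8 text) pairs
# and no vocabulary is built while loading.
raw_train, raw_test = datasets.IMDB(data_select=('train', 'test'), raw_text=True)

# Split the raw training examples however the user likes,
# before any vocabulary exists.
raw_train = list(raw_train)
n_train = int(len(raw_train) * 0.8)
raw_train, raw_valid = raw_train[:n_train], raw_train[n_train:]

# Feed only the training split into a "vocab factory".
train_vocab = build_vocab_from_iterator(tokenizer(text) for _, text in raw_train)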

@bentrevett
Contributor Author

One way of dealing with this would be to modify text classification to return the raw text instead of building a vocab if one doesn't exist. That way you'd get a training and testing dataset that yields the lines of text (in UTF-8 format), which could then be fed into a vocab factory.

I would prefer this over my proposed solution.

@zhangguanheng66
Contributor

fixed in #701
