TEXT.build_vocab for two datasets #648

antgr · 2019-11-20T15:11:36Z

❓ Questions and Help

Description

Hi, I want to do multitask learning using two datasets.
I want to create a common vocabulary from both of them, for example from IMDB and SNLI. How should I use TEXT.build_vocab to achieve this?

zhangguanheng66 · 2019-11-20T15:15:34Z

@mttk Can we pass a pretrained vocab object to build a torchtext dataset?

mttk · 2019-11-22T15:37:17Z

You can build vocab from multiple datasets (the argument for TEXT.build_vocab can be a sequence of Dataset instances).
Illustratively:

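from torchtext.data import Field
from torchtext.datasets import IMDB, SNLI

TEXT = Field(...)  # one Field instance shared by both datasets
imdb_train, imdb_test = IMDB.splits(text_field=TEXT, ...)
snli_train, snli_valid, snli_test = SNLI.splits(text_field=TEXT, ...)
TEXT.build_vocab(imdb_train, snli_train)  # counts tokens from both training sets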

AFAIK there is no way to pass a pretrained vocab to build a dataset without some hacking, but this should work in this case.
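To make the idea of a common vocab concrete, here is a minimal pure-Python sketch of counting tokens across several datasets and then assigning one index per token. It is not torchtext's implementation: `build_shared_vocab`, the toy `imdb_like` and `snli_like` lists, and the specials tuple are hypothetical stand-ins for illustration only.

```python
from collections import Counter


def build_shared_vocab(*datasets, min_freq=1, specials=("<unk>", "<pad>")):
    """Count tokens across every dataset, then assign one index per token.

    Hypothetical helper: each dataset is a list of token lists.
    """
    counter = Counter()
    for dataset in datasets:
        for tokens in dataset:
            counter.update(tokens)

    itos = list(specials)
    # Descending frequency, ties broken alphabetically for determinism.
    for token, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in specials:
            itos.append(token)

    stoi = {token: index for index, token in enumerate(itos)}
    return itos, stoi


# Toy stand-ins for two tokenized training sets.
imdb_like = [["a", "great", "movie"], ["a", "dull", "movie"]]
snli_like = [["a", "man", "sleeps"], ["a", "movie", "plays"]]

itos, stoi = build_shared_vocab(imdb_like, snli_like, min_freq=2)
```

Because the counts are merged before any index is assigned, a token gets the same index no matter which dataset it came from.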

antgr · 2019-11-22T16:09:10Z

Does that mean I could not use, for example, GloVe vectors?
For example, something like:

TEXT.build_vocab(imdb_train, snli_train,
                 min_freq=MIN_FREQ,
                 vectors="glove.6B.300d",
                 unk_init=torch.Tensor.normal_)

mttk · 2019-11-22T16:12:50Z

That code should work with GloVe. The Field has to be shared between the Datasets (as it is in the example above), and you will get a common vocab. The vector assignment is done after the vocabulary is determined.

Concretely, the word frequencies are computed here from all the Dataset instances: https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L304
And after that point, the Vectors class doesn't know how many source datasets were used to obtain the frequencies.
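The ordering described above (vocabulary first, vectors second) can be sketched in plain Python. This is only an illustration under assumptions, not the torchtext Vectors code: `assign_vectors`, `normal_init`, and the tiny `pretrained` table are hypothetical names, and the real `unk_init` in torchtext operates on tensors rather than lists.

```python
import random


def assign_vectors(itos, pretrained, dim, unk_init):
    """Return one vector per vocab entry, in vocab order.

    Hypothetical helper: tokens missing from `pretrained` get unk_init(dim).
    """
    vectors = []
    for token in itos:
        if token in pretrained:
            vectors.append(list(pretrained[token]))
        else:
            vectors.append(unk_init(dim))
    return vectors


rng = random.Random(0)


def normal_init(dim):
    # Stand-in for an unk_init that samples from a normal distribution.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


# Toy pretrained table; a real one would be loaded from a GloVe file.
pretrained = {"movie": [0.1, 0.2, 0.3], "man": [0.4, 0.5, 0.6]}

# The vocab is already fixed here; its source datasets no longer matter.
itos = ["<unk>", "movie", "man", "zzyzx"]
vectors = assign_vectors(itos, pretrained, dim=3, unk_init=normal_init)
```

The lookup only sees the final token list, which is why it makes no difference how many datasets contributed to the frequencies.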

antgr · 2019-11-22T17:25:56Z

OK, thank you! And what if I would like to use a custom dataset instead of, for example, IMDB? How could I use TEXT there? Can I use splits (with text_field as an argument) on a custom dataset as well?

mttk · 2019-11-22T18:27:57Z

You have to load that dataset yourself, e.g. by using TabularDataset and assigning TEXT as a field in the constructor. Then you can use the built-in dataset.split method, which gives you your custom dataset splits.

Then, you can pass any instance of Dataset as an argument to TEXT.build_vocab.
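As a rough picture of the load-then-split flow for a custom dataset, here is a pure-Python stand-in that reads tab-separated rows, tokenizes the text column, and splits the examples by ratio. It does not use torchtext: `load_examples`, `split_examples`, `RAW_TSV`, and the whitespace tokenizer are hypothetical and only mimic what TabularDataset and dataset.split do conceptually.

```python
import csv
import io
import random

# Toy tab-separated data standing in for a custom dataset file.
RAW_TSV = (
    "text\tlabel\n"
    "A great movie\tpos\n"
    "A dull movie\tneg\n"
    "Loved every minute\tpos\n"
    "Fell asleep twice\tneg\n"
)


def load_examples(tsv_text):
    """Parse rows and tokenize the text column by lowercasing and splitting."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {"text": row["text"].lower().split(), "label": row["label"]}
        for row in reader
    ]


def split_examples(examples, split_ratio=0.7, seed=0):
    """Shuffle a copy of the examples and cut it into train and test parts."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


examples = load_examples(RAW_TSV)
train, test = split_examples(examples, split_ratio=0.75)

# Token lists from the custom training split, ready for vocabulary counting
# alongside token lists from any other dataset.
train_tokens = [example["text"] for example in train]
```

The training split of a custom dataset is then treated exactly like a built-in one when the vocabulary is counted.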

zhangguanheng66 mentioned this issue Dec 6, 2019

Overview of issues in torchtext and the plan for revamping #664

Open
