TEXT.build_vocab for two datasets #648
Comments
@mttk Can we pass a pretrained vocab object to build a torchtext dataset?
You can build a vocab from multiple datasets (build_vocab accepts any number of datasets as positional arguments):
TEXT = Field(...)
imdb_train, imdb_test = IMDB.splits(text_field=TEXT, ...)
snli_train, snli_valid, snli_test = SNLI.splits(text_field=TEXT, ...)
TEXT.build_vocab(imdb_train, snli_train)
AFAIK there is no way to pass a pretrained vocab to build a dataset without some hacking, but this should work in this case.
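To illustrate what sharing one Field across datasets achieves, here is a minimal pure-Python sketch of the idea: token counts from every dataset are merged into a single counter before indices are assigned. This is not torchtext's implementation. The function name `build_shared_vocab`, the special tokens, the tie-breaking order, and the toy data are all assumptions made for illustration.

```python
from collections import Counter


def build_shared_vocab(*datasets, min_freq=1, specials=("<unk>", "<pad>")):
    """Build one token-to-index mapping from several tokenized datasets.

    Hypothetical helper: each dataset is an iterable of token lists.
    """
    counter = Counter()
    for dataset in datasets:
        for tokens in dataset:
            counter.update(tokens)

    # Specials come first, then tokens by descending frequency
    # (ties broken alphabetically so the result is deterministic).
    itos = list(specials)
    for token, freq in sorted(counter.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and token not in specials:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, itos


# Toy stand-ins for two tokenized training sets.
imdb_like = [["a", "great", "movie"], ["a", "dull", "movie"]]
snli_like = [["a", "man", "sleeps"], ["a", "dog", "runs"]]

stoi, itos = build_shared_vocab(imdb_like, snli_like)
```

Because both datasets feed the same counter, a token that appears in either one gets exactly one index, which is what a multitask model with a shared embedding layer needs.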
You mean that I could not use, for example, GloVe vectors, right?
That code should work with GloVe. The Field has to be shared between the Datasets (as it is in the example above), and you will get a common vocab. The vector assignment is done after the vocabulary is determined. Concretely, the word frequencies are computed here from all the [...]
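The ordering described above (vocabulary first, vectors second) can be sketched in plain Python. The example below assumes the vocabulary is already fixed as a list of tokens and looks up one pretrained vector per entry. The helper `assign_vectors`, the two-dimensional toy vectors, and the choice to use zeros for tokens without a pretrained vector are assumptions of this sketch, not a description of torchtext internals.

```python
def assign_vectors(itos, pretrained, dim):
    """Return one vector per vocabulary entry, in vocabulary order.

    Hypothetical helper: tokens missing from `pretrained` get a zero vector.
    """
    return [list(pretrained.get(token, [0.0] * dim)) for token in itos]


# The vocabulary is determined first, from all datasets.
itos = ["<unk>", "<pad>", "movie", "dog"]

# Toy stand-in for a pretrained embedding table such as GloVe.
pretrained = {"movie": [0.1, 0.2], "dog": [0.3, 0.4], "cat": [0.5, 0.6]}

vectors = assign_vectors(itos, pretrained, dim=2)
```

Since the lookup only depends on the final token list, it does not matter how many datasets contributed to the vocabulary.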
OK, thank you! And what if I would like to use a custom dataset instead of, for example, IMDB? How could I use TEXT there? Can I use [...]
You have to load that dataset, e.g. by using TabularDataset and assigning TEXT as a field in the constructor; then you can use the builtin [...] Then, you can pass any instance of Dataset as an argument to [...]
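As a rough, library-free illustration of the same idea, the sketch below reads a custom tab-separated file into tokenized examples and counts its tokens together with another dataset. It does not use TabularDataset. The helper `load_tabular`, the column names, the lowercase whitespace tokenizer, and the in-memory file are all assumptions made for the example.

```python
import csv
import io
from collections import Counter


def load_tabular(handle, text_column, tokenize=str.split):
    """Read a TSV file with a header row and return one token list per row.

    Hypothetical helper standing in for a tabular dataset loader.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [tokenize(row[text_column].lower()) for row in reader]


# In-memory stand-in for a custom TSV file on disk.
custom_file = io.StringIO("text\tlabel\nGood plot\tpos\nBad plot\tneg\n")
custom_examples = load_tabular(custom_file, text_column="text")

# Once tokenized, the custom data is counted together with any other dataset.
other_examples = [["good", "acting"]]
counter = Counter()
for dataset in (custom_examples, other_examples):
    for tokens in dataset:
        counter.update(tokens)
```

The point of the sketch is that a custom dataset only needs to produce examples in the same shape as the built-in ones to take part in a common vocabulary.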
❓ Questions and Help
Description
Hi, I want to do multitask learning with two datasets.
I want to create a common vocabulary from both of them, for example from IMDB and SNLI. How should I use TEXT.build_vocab to achieve this?