How to load a word embedding dictionary using torchtext #722

Open
nawshad opened this issue Apr 4, 2020 · 2 comments

nawshad commented Apr 4, 2020

Hi,

I have a custom pre-trained embedding (not created through gensim) stored as a Python dictionary. I tried writing it out in gensim's word2vec format and then loading it, but that throws an error about string-to-float conversion. Is there a standard way to load such a dictionary using torchtext?

Thanks,

zhangguanheng66 (Contributor) commented

@bentrevett @mttk Any ideas for this issue? I think we support pretrained word vectors in torchtext.

bentrevett (Contributor) commented Apr 6, 2020

There is a way to load custom embeddings from a file, so you can write your dictionary to a file and then read it with TorchText.

import torchtext.vocab as vocab

# Read the embeddings from a plain-text file, one token per line
custom_embeddings = vocab.Vectors(name='custom_embeddings.txt')

The format of your custom_embeddings.txt file needs to be the token followed by the values of each of the embedding's dimensions, all separated by a single space. For example, here are three tokens with 20-dimensional embeddings (all ones, just as an example):

good 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
great 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
awesome 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
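
If your embeddings live in a Python dictionary, here's a minimal sketch of dumping it into that format, assuming the dictionary maps token strings to lists (or arrays) of floats; the names here are just illustrative:

# Hypothetical embeddings dictionary: token -> list of floats
embeddings_dict = {
    'good': [1.0] * 20,
    'great': [1.0] * 20,
    'awesome': [1.0] * 20,
}

# Write one token per line: the token followed by its values,
# everything separated by single spaces
with open('custom_embeddings.txt', 'w') as f:
    for token, vector in embeddings_dict.items():
        f.write(token + ' ' + ' '.join(str(v) for v in vector) + '\n')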

You then align these with your vocabulary when you call build_vocab on the desired Field:

TEXT.build_vocab(train_data, vectors=custom_embeddings)
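
One caveat: vocabulary tokens that are missing from custom_embeddings.txt get zero vectors by default. If you'd rather initialize them randomly, Vectors accepts an unk_init callable, along these lines:

import torch
import torchtext.vocab as vocab

# unk_init is applied to vectors for tokens not found in the file;
# the default leaves them as zeros
custom_embeddings = vocab.Vectors(name='custom_embeddings.txt',
                                  unk_init=torch.Tensor.normal_)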

Then you actually load these pre-trained embeddings into your model with:

model.embedding.weight.data.copy_(TEXT.vocab.vectors)
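
For context, a minimal sketch of the model side, assuming an nn.Embedding layer whose shape matches your vocab size and embedding dimensionality (20 in the example above):

import torch.nn as nn

embedding_dim = 20  # must match the width of the vectors in the file

# Embedding layer sized to the vocabulary built above
embedding = nn.Embedding(len(TEXT.vocab), embedding_dim)

# Copy the aligned pre-trained vectors into the layer's weights
embedding.weight.data.copy_(TEXT.vocab.vectors)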
