How to get the word embedding after pre-training? #60

Closed
mfxss opened this issue Nov 6, 2018 · 9 comments

Comments

@mfxss

mfxss commented Nov 6, 2018

Hi,
I am excited about this great model, and I want to get the word embeddings. Where can I find them in the output files, or do I need to change the code to do this?
Thanks,
Yuguang

@jacobdevlin-google
Contributor

If you want to get the contextual embeddings (like ELMo) see the section here.

If you want the actual word embeddings, the word->id mapping is just the index of the row in vocab.txt, and the embedding matrix is in bert_model.ckpt with the variable name bert/embeddings/word_embeddings.
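As a minimal sketch of the lookup described above: the token id is the 0-based line number in vocab.txt, and the matrix is read from the checkpoint by variable name. The helper names (`build_vocab`, `load_word_embeddings`, `embedding_for`) and the `[UNK]` fallback are illustrative assumptions, not part of the BERT repository; `tf.train.load_variable` is used here as one way to read a single variable from a checkpoint.

```python
def build_vocab(lines):
    """Map each token to its row index (0-based line number in vocab.txt)."""
    return {line.strip(): index for index, line in enumerate(lines)}


def load_word_embeddings(checkpoint_path):
    """Read the word embedding matrix from a BERT checkpoint as a NumPy array.

    TensorFlow is imported lazily so build_vocab works without it installed.
    """
    import tensorflow as tf

    return tf.train.load_variable(
        checkpoint_path, "bert/embeddings/word_embeddings"
    )


def embedding_for(token, vocab, matrix, unk_token="[UNK]"):
    """Return the matrix row for `token`, falling back to the unk_token row."""
    return matrix[vocab.get(token, vocab[unk_token])]


# Hypothetical usage (paths depend on where the released model was unpacked):
#   with open("chinese_L-12_H-768_A-12/vocab.txt", encoding="utf-8") as f:
#       vocab = build_vocab(f)
#   matrix = load_word_embeddings("chinese_L-12_H-768_A-12/bert_model.ckpt")
#   vector = embedding_for("[MASK]", vocab, matrix)
```

These are the static (non-contextual) input embeddings, so a token gets the same vector regardless of the sentence it appears in.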

@mfxss
Author

mfxss commented Nov 6, 2018

I downloaded your released model chinese_L-12_H-768_A-12. In vocab.txt, I found some tokens such as
[unused1] [CLS] [SEP] [MASK] <S> <T>.
What do these tokens mean?

@jacobdevlin-google
Contributor

The [CLS], [SEP] and [MASK] tokens are used as described in the paper and README. The [unused] tokens were not used in our model and are randomly initialized.

@mfxss
Author

mfxss commented Nov 6, 2018

What is the training data for chinese_L-12_H-768_A-12? And what is its size?

@jacobdevlin-google
Contributor

It's Chinese wikipedia with both Traditional and Simplified characters.

@imgarylai

Hello @mfxss ,
Not sure if you are still having trouble getting word embeddings from BERT. I implemented a BERT embedding library that lets you get word embeddings programmatically.

https://github.com/imgarylai/bert-embedding

Because I work closely with the mxnet & gluonnlp team, my implementation uses mxnet and gluonnlp. However, I am trying to implement it in other frameworks as well.

I hope my work helps.

@rainorangelemon

rainorangelemon commented Jan 4, 2020

Hey guys, if you don't want to install an extra module, here is an example:

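import tensorflow as tf

# Path to the BERT SavedModel directory (HOME_DIR is a placeholder)
BERT_PATH = 'HOME_DIR/bert_en_uncased_L-12_H-768_A-12'

imported = tf.saved_model.load(BERT_PATH)

# Find the word embedding matrix by its variable name
embeddings = None
for variable in imported.trainable_variables:
    if variable.name == 'bert_model/word_embeddings/embeddings:0':
        embeddings = variable
        break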

And embeddings is the word embedding tensor that you want!

@arjunrajanna

arjunrajanna commented Aug 19, 2020

Hi @jacobdevlin-google Thanks for the pointers. I see the output with the extract_features.py gives subword representations. I'm sure to be missing something but my question is how can we get a word (not subword) representation instead? Thanks in advance for your help!

@mathshangw

Hi @jacobdevlin-google Thanks for the pointers. I see the output with the extract_features.py gives subword representations. I'm sure to be missing something but my question is how can we get a word (not subword) representation instead? Thanks in advance for your help!

Excuse me, did you find a solution for getting word (not subword) representations, please?
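The thread does not give an answer to this. One common approach, offered here only as a sketch and not as the maintainers' recommendation, is to average the vectors of the WordPiece tokens that make up each word. It assumes continuation pieces carry the "##" prefix that BERT's WordPiece tokenizer produces, and that you already have one vector per token (for example from extract_features.py). The function name `pool_subwords` is hypothetical. Special tokens such as [CLS] and [SEP] are treated as ordinary words here, so filter them out first if you do not want them.

```python
def pool_subwords(tokens, vectors):
    """Average WordPiece vectors into one vector per word.

    tokens:  WordPiece tokens; continuation pieces start with "##"
    vectors: one list of floats per token, in the same order as tokens
    """
    groups = []  # each entry is (word_so_far, [vectors of its pieces])
    for token, vector in zip(tokens, vectors):
        if token.startswith("##") and groups:
            # Continuation piece: append to the previous word
            word, pieces = groups[-1]
            groups[-1] = (word + token[2:], pieces + [vector])
        else:
            # A new word starts here
            groups.append((token, [vector]))

    words = [word for word, _ in groups]
    pooled = [
        [sum(column) / len(pieces) for column in zip(*pieces)]
        for _, pieces in groups
    ]
    return words, pooled
```

Averaging is only one choice; taking the first piece's vector or summing the pieces are alternatives, and which works best depends on the downstream task.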


6 participants