Domain Specific Pre-training Model #4

Closed
abhinandansrivastava opened this issue Mar 25, 2019 · 13 comments

Comments

@abhinandansrivastava

Hi,

I have run the run_pretraining.py script on my domain-specific data.

It seems that only checkpoints are saved. I got two files, 0000020.params and 0000020.states.

How can I save the model, or load a model from the .params and .states files in the checkpoint folder, so that I can use it to get contextual embeddings?

Can someone please help me with this?

@jhyuklee
Collaborator

Hi,

The run_pretraining.py script is exactly the same as the one in https://github.com/google-research/bert, so you can get help there. To save our models, we used a modified version of the script (not shared) that handles multi-GPU and server-specific issues, so our results might differ considerably from what you get with the original script.

Thank you.

@abhinandansrivastava
Author

abhinandansrivastava commented Mar 27, 2019

Hi,
After running the BERT model, I get an embedding for each word in a sentence, but I need a sentence embedding. How can I get that?

Thanks

@Sriyella

Sriyella commented Mar 27, 2019

Hi,

Is there any way to load this model in tensorflow hub.module()? If not, how can we use the model to get the embeddings?

Please suggest a way forward.

@jhyuklee
Collaborator

Hi, @abhinandansrivastava,
You can use the [CLS] token for sentence embedding or classification. Thanks.
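
As a rough illustration of this suggestion (not the maintainers' code), the sketch below picks the vector aligned with the [CLS] token out of a list of per-token embeddings. The function name `cls_sentence_embedding` and the toy 3-dimensional vectors are assumptions made for the example; real BERT-Base vectors have 768 dimensions.

```python
from typing import List, Sequence


def cls_sentence_embedding(tokens: Sequence[str],
                           token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return the vector aligned with the [CLS] token.

    `tokens` and `token_vectors` are assumed to be parallel sequences
    produced by whatever feature-extraction step you already run.
    """
    if len(tokens) != len(token_vectors):
        raise ValueError("tokens and token_vectors must have the same length")
    if "[CLS]" not in tokens:
        raise ValueError("no [CLS] token found in the tokenized input")
    cls_index = list(tokens).index("[CLS]")
    return list(token_vectors[cls_index])


# Toy data for illustration only (hypothetical values).
tokens = ["[CLS]", "aspirin", "reduces", "fever", "[SEP]"]
token_vectors = [
    [0.1, 0.2, 0.3],   # [CLS]
    [0.4, 0.5, 0.6],   # aspirin
    [0.7, 0.8, 0.9],   # reduces
    [1.0, 1.1, 1.2],   # fever
    [1.3, 1.4, 1.5],   # [SEP]
]

sentence_vector = cls_sentence_embedding(tokens, token_vectors)
print(sentence_vector)
```

The BERT tokenizer places [CLS] first, so indexing position 0 would also work; looking the token up by name simply guards against input that was not tokenized as expected.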

@jhyuklee
Collaborator

Hi @Sriyella,
We haven't tried using hub.module(). You can just take the last-layer outputs of BERT (or BioBERT) and save them.
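
One possible way to do this is sketched below. It assumes the per-token features are already stored one JSON object per line, in the layout written by extract_features.py in google-research/bert (a `features` list whose items carry a `token` and a `layers` list of `index`/`values` pairs). That layout is an assumption here and should be checked against your own output file; the helper name `last_layer_vectors` and the sample record are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, List


def last_layer_vectors(jsonl_line: str, layer_index: int = -1) -> Dict[str, List]:
    """Collect tokens and their vectors for one layer from one JSON line."""
    record = json.loads(jsonl_line)
    tokens, vectors = [], []
    for feature in record["features"]:
        for layer in feature["layers"]:
            if layer["index"] == layer_index:
                tokens.append(feature["token"])
                vectors.append(layer["values"])
                break
    return {"tokens": tokens, "vectors": vectors}


# Hand-written toy record in the assumed layout (2-dimensional values for brevity).
sample_line = json.dumps({
    "features": [
        {"token": "[CLS]", "layers": [{"index": -1, "values": [0.5, -0.5]},
                                      {"index": -2, "values": [0.0, 0.0]}]},
        {"token": "fever", "layers": [{"index": -1, "values": [0.25, 0.75]},
                                      {"index": -2, "values": [0.0, 0.0]}]},
        {"token": "[SEP]", "layers": [{"index": -1, "values": [1.0, 1.0]},
                                      {"index": -2, "values": [0.0, 0.0]}]},
    ]
})

extracted = last_layer_vectors(sample_line)

# Save the last-layer vectors so they can be reused without rerunning the model.
out_path = os.path.join(tempfile.mkdtemp(), "last_layer.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(extracted, f)
```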

@jhyuklee
Collaborator

If your question is not related to the pre-trained weights of BioBERT, please report BioBERT-related issues at https://github.com/dmis-lab/biobert, or BERT-related issues at https://github.com/google-research/bert.

@pyturn

pyturn commented Mar 29, 2019

Hi,

Is there any way to load this model in tensorflow hub.module()? If not, how can we use the model to get the embeddings?

Please suggest a way forward.

I am also looking for the same thing: how to use the pre-trained weights to get the embeddings.

@jhyuklee
Collaborator

This might help!
google-research/bert#60

@abhinandansrivastava
Author

Hi @jhyuklee ,
Thanks for the reply.

Do we need to create our own vocab.txt after pre-training a domain-specific model? The model saved after pre-training does not include the vocab.txt and bert_config.json files.

If yes, then how?

Thanks

@jhyuklee
Collaborator

Hi @abhinandansrivastava,

You don't have to create your own vocab.txt if you used the same vocab.txt and bert_config.json while pre-training. See #1.
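
When reusing these two files, a quick sanity check is that they are consistent with each other: every token id must index into the embedding table, whose size is set by the `vocab_size` field of bert_config.json. The sketch below is one way to check that, assuming vocab.txt holds one token per line; the helper name `vocab_fits_config` and the toy files are made up for the example.

```python
import json
import os
import tempfile


def vocab_fits_config(vocab_path: str, config_path: str) -> bool:
    """Return True if vocab.txt has no more tokens than config's vocab_size."""
    with open(vocab_path, encoding="utf-8") as f:
        num_tokens = sum(1 for _ in f)  # one token per line
    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return num_tokens <= config["vocab_size"]


# Toy files for illustration only; real files ship with the pre-trained model.
workdir = tempfile.mkdtemp()
vocab_path = os.path.join(workdir, "vocab.txt")
config_path = os.path.join(workdir, "bert_config.json")

with open(vocab_path, "w", encoding="utf-8") as f:
    f.write("\n".join(["[PAD]", "[UNK]", "[CLS]", "[SEP]", "fever"]) + "\n")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump({"vocab_size": 5}, f)

files_are_consistent = vocab_fits_config(vocab_path, config_path)
print(files_are_consistent)
```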

Thanks.

@jhyuklee
Collaborator

jhyuklee commented Apr 1, 2019

Embedding-related issues are tracked at dmis-lab/biobert#23. Closing this issue.

@jhyuklee closed this as completed Apr 1, 2019

@abhinandansrivastava
Author

Hi @jhyuklee ,
The BioBERT vocab.txt file and the BERT uncased vocab.txt file are different. How did you add new tokenized words to the BioBERT vocab.txt file? It contains some tokens that differ from those in the uncased BERT-Base vocab.txt.

@jhyuklee
Collaborator

jhyuklee commented Apr 4, 2019

Hi @abhinandansrivastava ,
We used the BERT-Base Cased vocabulary, as uppercase often matters in biomedical texts. Thanks.
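
To see how a cased and an uncased vocabulary differ, the two token lists can be compared directly. The sketch below is a minimal example on hand-written toy lists, not the actual contents of the released vocab.txt files; the helper name `compare_vocabs` is hypothetical. With real files, each list would be read from vocab.txt, one token per line.

```python
from typing import Iterable, Set, Tuple


def compare_vocabs(vocab_a: Iterable[str],
                   vocab_b: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (tokens only in vocab_a, tokens only in vocab_b)."""
    a, b = set(vocab_a), set(vocab_b)
    return a - b, b - a


# Toy vocabularies for illustration only.
cased_vocab = ["[CLS]", "the", "The", "DNA", "##ase"]
uncased_vocab = ["[CLS]", "the", "dna", "##ase"]

only_in_cased, only_in_uncased = compare_vocabs(cased_vocab, uncased_vocab)

# Tokens that differ from an uncased entry only by letter case.
case_variants = {t for t in only_in_cased if t.lower() in set(uncased_vocab)}

print(sorted(only_in_cased))
print(sorted(only_in_uncased))
print(sorted(case_variants))
```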
