
Input embedding matrix must match size: 250000 x 100, found torch.Size([100000, 300]) #15

Open · KhushbooMundada-tomtom opened this issue Nov 23, 2022 · 3 comments


KhushbooMundada-tomtom commented Nov 23, 2022

I want to train a NER model with custom labels, and the following is the command I'm using to train it:

!python -m stanza.utils.training.run_ner en_sample --pretrain_max_vocab 250000 --word_emb_dim 300 --max_steps 100

However, while loading the model with

stanza.Pipeline('en', processor='ner', ner_model_path='/content/saved_models/ner/en_sample_nertagger.pt')

I get an embedding matrix error.

Full error:

INFO:stanza:Loading these models for language: en (English):

| Processor    | Package                  |
|--------------|--------------------------|
| tokenize     | combined                 |
| pos          | combined                 |
| lemma        | combined                 |
| depparse     | combined                 |
| sentiment    | sstplus                  |
| constituency | wsj                      |
| ner          | /content/s...rtagger.pt  |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: constituency
INFO:stanza:Loading: ner
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']

  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

AssertionError Traceback (most recent call last)
in
1 import stanza
2
----> 3 stanza.Pipeline('en', processor='ner', ner_model_path ='/content/saved_models/ner/en_sample_nertagger.pt')

6 frames
/usr/local/lib/python3.7/dist-packages/stanza/models/ner/model.py in init_emb(self, emb_matrix)
119 dim = self.args['word_emb_dim']
120 assert emb_matrix.size() == (vocab_size, dim), \
--> 121 "Input embedding matrix must match size: {} x {}, found {}".format(vocab_size, dim, emb_matrix.size())
122 self.word_emb.weight.data.copy_(emb_matrix)
123

AssertionError: Input embedding matrix must match size: 250000 x 100, found torch.Size([100000, 300])

@AngledLuffa (Contributor) commented:

The pipeline is loading a word embedding of a different size from the one the model was trained with. You can fix this by creating the Pipeline with the flag ner_pretrain_path=<path>, where <path> is the word embedding file used to create the NER model. The log from the NER training should have a line in it which looks like this:

2022-11-21 18:07:14 DEBUG: Loaded pretrain from /home/john/stanza_resources/sd/pretrain/adamw_50E_200D.pt

That will tell you the exact path it used.
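
For example, a minimal sketch of the loading call (the ner_pretrain_path value below is a placeholder; substitute the path reported in your own training log):

```python
import stanza

# ner_pretrain_path must point at the same pretrain file the NER model was trained with.
# The path below is only illustrative; use the one printed in your NER training log.
nlp = stanza.Pipeline(
    'en',
    processors='tokenize,ner',
    ner_model_path='/content/saved_models/ner/en_sample_nertagger.pt',
    ner_pretrain_path='/content/stanza_resources/en/pretrain/example_pretrain.pt',
)
```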

I suspect what happened is that the default pretrain for the default NER model is different from the default pretrain used for English in general. I can update that so the example uses the same pretrain as the English NER model.
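
If you are unsure which pretrain file matches, you can inspect a pretrain's embedding shape directly; a rough sketch, assuming the Pretrain helper in stanza.models.common.pretrain and a placeholder path:

```python
from stanza.models.common.pretrain import Pretrain

# Placeholder path: point this at the pretrain file you plan to pass to the Pipeline.
pt = Pretrain('/content/stanza_resources/en/pretrain/example_pretrain.pt')
# The shape should match the (vocab_size, word_emb_dim) the NER model expects,
# e.g. the "250000 x 100" reported in the assertion above.
print(pt.emb.shape)
```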

@AngledLuffa (Contributor) commented:

Some of the annotators try to quietly add their dependencies, but apparently openie is not one of them. Try this:

annotators='tokenize,pos,lemma,depparse,natlog,ner,coref,openie'
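
For reference, that annotators list is for the CoreNLP client interface rather than the native stanza Pipeline; a minimal sketch, assuming a local CoreNLP installation with CORENLP_HOME set:

```python
from stanza.server import CoreNLPClient

# Sketch only: requires a local CoreNLP install, with CORENLP_HOME pointing at it.
# openie's dependencies are listed explicitly since it does not add them on its own.
with CoreNLPClient(annotators='tokenize,pos,lemma,depparse,natlog,ner,coref,openie',
                   memory='6G', be_quiet=True) as client:
    ann = client.annotate('Barack Obama was born in Hawaii.')
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(triple.subject, '|', triple.relation, '|', triple.object)
```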

@AngledLuffa (Contributor) commented:

Whoopsie, wrong thread. Well, hopefully my earlier comment helps, and I updated some of the documentation to account for it.
