
Input embedding matrix must match size: 250000 x 100, found torch.Size([100000, 300]) #15

Open · KhushbooMundada-tomtom opened this issue Nov 23, 2022 · 3 comments


KhushbooMundada-tomtom commented Nov 23, 2022

I want to train a NER model with custom labels, and the following is the command I'm using to train it:

!python -m stanza.utils.training.run_ner en_sample --pretrain_max_vocab 250000 --word_emb_dim 300 --max_steps 100

However, while loading the model with

stanza.Pipeline('en', processor='ner', ner_model_path='/content/saved_models/ner/en_sample_nertagger.pt')

I get an embedding matrix error.

Full error:

INFO:stanza:Loading these models for language: en (English):

| Processor    | Package                  |
|--------------|--------------------------|
| tokenize     | combined                 |
| pos          | combined                 |
| lemma        | combined                 |
| depparse     | combined                 |
| sentiment    | sstplus                  |
| constituency | wsj                      |
| ner          | /content/s...rtagger.pt  |

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: pos
INFO:stanza:Loading: lemma
INFO:stanza:Loading: depparse
INFO:stanza:Loading: sentiment
INFO:stanza:Loading: constituency
INFO:stanza:Loading: ner
Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']

  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

AssertionError Traceback (most recent call last)
in
1 import stanza
2
----> 3 stanza.Pipeline('en', processor='ner', ner_model_path ='/content/saved_models/ner/en_sample_nertagger.pt')

6 frames
/usr/local/lib/python3.7/dist-packages/stanza/models/ner/model.py in init_emb(self, emb_matrix)
119 dim = self.args['word_emb_dim']
120 assert emb_matrix.size() == (vocab_size, dim), \
--> 121 "Input embedding matrix must match size: {} x {}, found {}".format(vocab_size, dim, emb_matrix.size())
122 self.word_emb.weight.data.copy_(emb_matrix)
123

AssertionError: Input embedding matrix must match size: 250000 x 100, found torch.Size([100000, 300])

@AngledLuffa (Contributor) commented:

The pipeline is loading a word embedding of a different size from the one the model was trained with. You can fix this by creating the Pipeline with the flag ner_pretrain_path=<path>, where <path> is the word embedding file used to create the NER model. The log from the NER training should have a line in it which looks like this:

2022-11-21 18:07:14 DEBUG: Loaded pretrain from /home/john/stanza_resources/sd/pretrain/adamw_50E_200D.pt

That will tell you the exact path it used.
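
For example, a minimal sketch of the loading call (the ner_pretrain_path value below is a placeholder; substitute the path reported in your own training log):

```python
import stanza

# ner_pretrain_path must point at the same pretrain file the NER model was trained with.
# The path below is only illustrative; use the one printed in your NER training log.
nlp = stanza.Pipeline(
    'en',
    processors='tokenize,ner',
    ner_model_path='/content/saved_models/ner/en_sample_nertagger.pt',
    ner_pretrain_path='/content/stanza_resources/en/pretrain/example_pretrain.pt',
)
```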

I suspect what happened is that the default pretrain for the default NER model is different from the default pretrain used for English in general. I can update that so the example uses the same pretrain as the English NER model.
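
If you are unsure which pretrain file matches, you can inspect a pretrain's embedding shape directly; a rough sketch, assuming the Pretrain helper in stanza.models.common.pretrain and a placeholder path:

```python
from stanza.models.common.pretrain import Pretrain

# Placeholder path: point this at the pretrain file you plan to pass to the Pipeline.
pt = Pretrain('/content/stanza_resources/en/pretrain/example_pretrain.pt')
# The shape should match the (vocab_size, word_emb_dim) the NER model expects,
# e.g. the "250000 x 100" reported in the assertion above.
print(pt.emb.shape)
```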

@AngledLuffa (Contributor) commented:

Some of the annotators try to quietly add their dependencies, but apparently openie is not one of them. Try this:

annotators='tokenize,pos,lemma,depparse,natlog,ner,coref,openie'
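
For reference, that annotators list is for the CoreNLP client interface rather than the native stanza Pipeline; a minimal sketch, assuming a local CoreNLP installation with CORENLP_HOME set:

```python
from stanza.server import CoreNLPClient

# Sketch only: requires a local CoreNLP install, with CORENLP_HOME pointing at it.
# openie's dependencies are listed explicitly since it does not add them on its own.
with CoreNLPClient(annotators='tokenize,pos,lemma,depparse,natlog,ner,coref,openie',
                   memory='6G', be_quiet=True) as client:
    ann = client.annotate('Barack Obama was born in Hawaii.')
    for sentence in ann.sentence:
        for triple in sentence.openieTriple:
            print(triple.subject, '|', triple.relation, '|', triple.object)
```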

@AngledLuffa (Contributor) commented:

Whoopsie, wrong thread. Well, hopefully my earlier comment helps, and I updated some of the documentation to account for it.
