
What to do about this warning message: "Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification" #5421

Closed
ohmeow opened this issue Jul 1, 2020 · 57 comments

Comments

@ohmeow
Contributor

ohmeow commented Jul 1, 2020

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

returns this warning message:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

This just started popping up with v3, so I'm not sure what the recommended action is here. Please advise if you can. Basically, all of my code using the AutoModelFor<X> classes is now throwing this warning.

Thanks.

@julien-c
Member

julien-c commented Jul 1, 2020

Not sure what's happening with the multiple duplicate opened issues, @ohmeow?

Is GitHub flaky again? :)

@fliptrail

I am also encountering the same warning.

When loading the model:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.

When attempting to fine-tune it:

WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.

Is the model fine-tuning correctly? Are the pre-trained model weights also getting updated (fine-tuned), or are only the layers outside (above) the pre-trained model changing their weights during training?

@ohmeow
Contributor Author

ohmeow commented Jul 1, 2020

Not sure what's happening with the multiple duplicate opened issues, @ohmeow?

Is GitHub flaky again? :)

I noticed the same thing. Not sure what is going on ... but I swear I only opened this one :)

@LysandreJik
Member

@ohmeow you're loading the bert-base-uncased checkpoint (which was trained using an architecture similar to BertForPreTraining) in a BertForSequenceClassification model.

This means that:

  • The layers that BertForPreTraining has, but BertForSequenceClassification does not have will be discarded
  • The layers that BertForSequenceClassification has but BertForPreTraining does not have will be randomly initialized.

This is expected, and tells you that you won't have good performance with your BertForSequenceClassification model before you fine-tune it 🙂.
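If you want to see which weights those are programmatically rather than by reading the log, here is a minimal sketch using the output_loading_info flag of from_pretrained (the printed lists mirror the two halves of the warning):

from transformers import BertForSequenceClassification

# Also return a dict describing which weights were (not) loaded from the checkpoint.
model, loading_info = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", output_loading_info=True
)

# Checkpoint weights this architecture has no use for (the pre-training heads).
print(loading_info["unexpected_keys"])
# Weights this architecture needs but the checkpoint lacks (the new classifier head).
print(loading_info["missing_keys"])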

@fliptrail this warning means that during your training, you're not using the pooler in order to compute the loss. I don't know how you're finetuning your model, but if you're not using the pooler layer then there's no need to worry about that warning.

@fliptrail

@LysandreJik Thank you for your response.
I am using the code:

def main_model():
  encoder = ppd.TFBertModel.from_pretrained("bert-base-uncased")
  input_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
  token_type_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
  attention_mask = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)

  embedding = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0]

  pooling = tf.keras.layers.GlobalAveragePooling1D()(embedding)
  normalization = tf.keras.layers.BatchNormalization()(pooling)
  dropout = tf.keras.layers.Dropout(0.1)(normalization)

  out = tf.keras.layers.Dense(1, activation="sigmoid", name="final_output_bert")(dropout)

  model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=out)

  loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
  optimizer = tf.keras.optimizers.Adam(lr=2e-5)
  metrics=['accuracy', tf.keras.metrics.FalseNegatives(), tf.keras.metrics.FalsePositives()]

  model.compile(optimizer=optimizer, loss=loss, metrics=metrics)
  return model

model = main_model()
model.summary()

I am only using the TFBertModel.from_pretrained("bert-base-uncased") pre-built class. I am not initializing it from any other class. Still, I am encountering the warning. From what I can understand, this should only appear when initializing the given pre-trained model inside another class.
Am I fine-tuning correctly? Are the BERT layer weights also getting updated?

Warning while loading model:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['nsp___cls', 'mlm___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the ckeckpoint was trained on, you can already use TFBertModel for predictions without further training.

While attempting to train:

WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model_1/bert/pooler/dense/kernel:0', 'tf_bert_model_1/bert/pooler/dense/bias:0'] when minimizing the loss.

This warning only started to appear yesterday, in all of my code and in other sample code as well.

@VaibhavBhatnagar17

Hello everyone,
I also started getting this warning today; before today it was working fine. Were there any changes in Colab?
This is the code I am using:

!pip install transformers
import tensorflow as tf
import transformers
from transformers import TFBertForSequenceClassification, BertConfig
tokenizer = transformers.BertTokenizer('gdrive/My Drive/Colab Notebooks/vocab.txt', do_lower_case=True)

max_seq_length = 128

bert = 'bert-large-uncased'
config = BertConfig.from_pretrained('bert-large-uncased', output_hidden_states=True, hidden_dropout_prob=0.2, 
attention_probs_dropout_prob=0.2)

transformer_model = TFBertForSequenceClassification.from_pretrained(bert, config=config)

input_ids_in = tf.keras.layers.Input(shape=(max_seq_length,), name='input_token', dtype='int32')
input_masks_in = tf.keras.layers.Input(shape=(max_seq_length,), name='masked_token', dtype='int32')
input_segments_in = tf.keras.layers.Input(shape=(max_seq_length,), name='segment_ids', dtype='int32') 

embedding_layer = transformer_model(input_ids_in, attention_mask=input_masks_in, token_type_ids=input_segments_in)

I have been using this same code for more than two weeks with no problem until yesterday.
If anyone finds a solution, please share it.
Thank you

@ohmeow
Contributor Author

ohmeow commented Jul 1, 2020

Thanks @LysandreJik

This is expected, and tells you that you won't have good performance with your BertForSequenceClassification model before you fine-tune it

Makes sense.

Now, how do we know what checkpoints are available that were trained on BertForSequenceClassification?

@LysandreJik
Member

@fliptrail in your code you have the following:

embedding = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0]

which means you're only getting the first output of the model, and using that to compute the loss. The first output of the model is the hidden states:

https://github.com/huggingface/transformers/blob/master/src/transformers/modeling_tf_bert.py#L716-L738

    Returns:
        :obj:`tuple(tf.Tensor)` comprising various elements depending on the configuration (:class:`~transformers.BertConfig`) and inputs:
        last_hidden_state (:obj:`tf.Tensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`):
            Sequence of hidden-states at the output of the last layer of the model.
        pooler_output (:obj:`tf.Tensor` of shape :obj:`(batch_size, hidden_size)`):
            Last layer hidden-state of the first token of the sequence (classification token)
            further processed by a Linear layer and a Tanh activation function. The Linear
            layer weights are trained from the next sentence prediction (classification)
            objective during Bert pretraining. This output is usually *not* a good summary
            of the semantic content of the input, you're often better with averaging or pooling
            the sequence of hidden-states for the whole input sequence.
        hidden_states (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_hidden_states=True`` is passed or when ``config.output_hidden_states=True``):
            tuple of :obj:`tf.Tensor` (one for the output of the embeddings + one for the output of each layer)
            of shape :obj:`(batch_size, sequence_length, hidden_size)`.
            Hidden-states of the model at the output of each layer plus the initial embedding outputs.
        attentions (:obj:`tuple(tf.Tensor)`, `optional`, returned when ``output_attentions=True`` is passed or when ``config.output_attentions=True``):
            tuple of :obj:`tf.Tensor` (one for each layer) of shape
            :obj:`(batch_size, num_heads, sequence_length, sequence_length)`:
            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
        """

You're ignoring the second value which is the pooler output. The warnings are normal in your case.
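If you actually wanted to use the pooler (which would also silence the gradient warning), a minimal variation of the earlier Keras snippet is sketched below; max_seq_len is assumed from that snippet, and index [1] is the pooler output described above:

import tensorflow as tf
from transformers import TFBertModel

max_seq_len = 128  # assumed, matching the earlier snippet

encoder = TFBertModel.from_pretrained("bert-base-uncased")
input_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
token_type_ids = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)
attention_mask = tf.keras.layers.Input(shape=(max_seq_len,), dtype=tf.int32)

# [1] is the pooler output: the [CLS] hidden state passed through a dense + tanh layer.
pooled = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[1]
out = tf.keras.layers.Dense(1, activation="sigmoid", name="final_output_bert")(pooled)

model = tf.keras.Model(inputs=[input_ids, token_type_ids, attention_mask], outputs=out)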

@LysandreJik
Member

@VaibhavBhatnagar17, these are warnings, not errors. What exact warning are you not understanding?

@LysandreJik
Member

@ohmeow that really depends on what you want to do! Sequence classification is a large subject, with many different tasks. Here's a list of all available checkpoints fine-tuned on sequence classification (not all are for BERT, though!)

Please be aware that if you have a specific task in mind, you should fine-tune your model to that task.
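As a concrete illustration, loading a checkpoint that already contains a fine-tuned classification head (here distilbert-base-uncased-finetuned-sst-2-english, one of the sentiment-analysis checkpoints on the hub) does not produce the "newly initialized" half of the warning. A sketch, assuming a recent transformers version:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

inputs = tokenizer("This library is great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # e.g. POSITIVE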

@VaibhavBhatnagar17

@LysandreJik Hey, what I am not able to understand is that I was using this code for more than two weeks and no warning came up until yesterday. I haven't changed anything, so it's confusing that this warning suddenly appeared.
I am not getting the same output dimension as before and am not able to complete my project.

@LysandreJik
Member

The warning came up yesterday because version 3.0.0 was released yesterday. It's weird that you saw the output dimension change since yesterday. What's the error you get?

@tarskiandhutch

I see this same warning when initializing BertForMaskedLM, pasted in below for good measure. As other posters have mentioned, this warning began appearing only after upgrading to v3.0.0.

Some weights of the model checkpoint at bert-large-uncased-whole-word-masking were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMaskedLM were not initialized from the model checkpoint at bert-large-uncased-whole-word-masking and are newly initialized: ['cls.predictions.decoder.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

Note that my module imports/initializations essentially duplicate the snippet demonstrating cloze task usage at https://huggingface.co/bert-large-uncased-whole-word-masking?text=Paris+is+the+%5BMASK%5D+of+France.

from transformers import BertTokenizer, BertForMaskedLM

_tokenizer = BertTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking')
_model = BertForMaskedLM.from_pretrained(
    'bert-large-uncased-whole-word-masking')

Am I correct in assuming that nothing has changed in the behavior of the relevant model, but that perhaps this warning should have been being printed all along?

@LysandreJik
Member

You're right, this has always been the behavior of the models. It wasn't clear enough before, so we've clarified it with this warning.

@tarskiandhutch

Thanks, @LysandreJik .

@ehalit

ehalit commented Sep 25, 2020

Does anyone know how to suppress this warning? I am aware that the model needs fine-tuning and I am fine-tuning it, so it becomes annoying to see this over and over again.

@LysandreJik
Member

You can manage the warnings with the logging utility introduced in version 3.1.0:

from transformers import logging

logging.set_verbosity_warning()

@ehalit

ehalit commented Sep 25, 2020

@LysandreJik Thanks for the rapid response; I set it with set_verbosity_error().

@s4sarath

s4sarath commented Oct 12, 2020

@LysandreJik - So, by default, bert-base-uncased loaded via TFBertModel has 199 variables [3 embeddings + 2 layer norms + (16 x 12 layers) + 2 (pooler kernel and bias)].

But when loading from TFBertForMaskedLM, it has 204 variables. Below are the 5 extra variables:

tf_bert_for_masked_lm_1/mlm___cls/predictions/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/kernel:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/gamma:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/beta:0

So that means these 5 variables are randomly initializing, right?
Are these 5 variables required for MLM (is this how it is done in the official TensorFlow models)?

OR

can we take the output token embeddings (before passing to mlm___cls), of shape (batch x sequence x embedding_dimension), multiply them with the word embedding matrix to produce (batch x sequence x vocab_size), and then use that for the MLM loss?
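For what it's worth, here is a minimal PyTorch sketch of the idea described above (multiplying the final hidden states by the input word-embedding matrix to get vocabulary logits). Note this skips the extra dense + LayerNorm transform that the official MLM head applies, so it will not reproduce the pretrained MLM predictions:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs)[0]                  # (batch, sequence, hidden)
embedding_matrix = model.get_input_embeddings().weight  # (vocab_size, hidden)
vocab_logits = hidden_states @ embedding_matrix.T       # (batch, sequence, vocab_size)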

@veronica320

@LysandreJik I'm having a slightly different issue here - I'm loading a sequence classification checkpoint into an AutoModelForSequenceClassification model. But I still get the warning. Here's my code:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('roberta-large-mnli')

Output:

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

I believe it's NOT expected because I'm indeed initializing from a model that I expect to be exactly identical.

I'm only starting to get this warning after upgrading to transformers v3 as well. I'm using 3.3.1 currently. Could you please help? Thanks!

@LysandreJik
Member

@s4sarath I'm not sure I understand your question.

@veronica320, the pooler layer is not used when doing sequence classification, so there's nothing to be worried about.

The pooler is the second output of the RobertaModel:
https://github.com/huggingface/transformers/blob/v3.4.0/src/transformers/modeling_roberta.py#L691

But only the first output is used in the sequence classification model:
https://github.com/huggingface/transformers/blob/v3.4.0/src/transformers/modeling_roberta.py#L1002
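A quick way to confirm this is that the sequence classification model has no pooler parameters at all (on recent versions it builds the encoder with add_pooling_layer=False), so the checkpoint's pooler weights simply have nowhere to go. A minimal check:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
# No parameter name contains "pooler", which is why the checkpoint's pooler
# weights are reported as unused.
print(any("pooler" in name for name, _ in model.named_parameters()))  # False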

@veronica320

Thanks a lot!

@s4sarath

@LysandreJik - Sorry for the confusion.

tf_bert_for_masked_lm_1/mlm___cls/predictions/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/kernel:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/gamma:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/beta:0

The above five variables are randomly initializing, right? That means they were not a part of official BERT.
Am I right?

@LysandreJik
Member

Thank you for your explanation.

Actually these five variables shouldn't be initialized randomly, as they're part of BERT. The official BERT checkpoints contain two heads: the MLM head and the NSP head.

You can see it here:

>>> from transformers import TFBertForMaskedLM
>>> model = TFBertForMaskedLM.from_pretrained("bert-base-cased")

Among the logging, you should find this:

Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertForMaskedLM: ['nsp___cls']
- This IS expected if you are initializing TFBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing TFBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-cased.

This tells you two things:

  • Some layers of the checkpoint are not used. These are ['nsp___cls'], corresponding to the NSP (next sentence prediction) head. Since we're using a ***ForMaskedLM, it makes sense not to use the NSP head.
  • All the layers of the model were initialized from the model checkpoint, as both the transformer layers and the MLM head were present in the checkpoint.

If you're getting those variables randomly initialized:

tf_bert_for_masked_lm_1/mlm___cls/predictions/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/kernel:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/dense/bias:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/gamma:0
tf_bert_for_masked_lm_1/mlm___cls/predictions/transform/LayerNorm/beta:0

then it means you're using a checkpoint that does not contain these variables. These are the MLM layers, so you're probably loading a checkpoint that was saved using an architecture that does not contain these layers. This can happen if you do the following:

>>> from transformers import TFBertModel, TFBertForMaskedLM
>>> model = TFBertModel.from_pretrained("bert-base-cased")
>>> model.save_pretrained(directory)
>>> mlm_model = TFBertForMaskedLM.from_pretrained(directory)

I hope this answers your question!

@s4sarath

s4sarath commented Oct 27, 2020 via email

@TingNLP

TingNLP commented Apr 13, 2021

Hi, is there any solution?
I have the same problem.

The warning is as below:

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMultiLabelSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForMultiLabelSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForMultiLabelSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForMultiLabelSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

The learner can still fit and predict, but the predictions are not consistent between runs.

@s4sarath

s4sarath commented Apr 13, 2021 via email

@rkunani

rkunani commented Apr 13, 2021

All of the BertForXXX models consist of a BERT model followed by some head which is task-specific. For sequence classification tasks, the head is just a linear layer which maps the BERT transformer hidden state vector to a vector of length num_labels, where num_labels is the number of classes for your classification task (for example, positive/negative sentiment analysis has 2 labels). If you're familiar with logits, this final vector contains the logits.

In the transformers source code, you can see this linear layer (assigned to self.classifier) initialized in the constructor for BertForSequenceClassification:

class BertForSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

Since self.classifier is not part of the pre-trained BERT model, its parameters must be initialized randomly (done automatically by the nn.Linear constructor).

@s4sarath Anytime you use code like model = BertForSequenceClassification.from_pretrained("bert-base-cased"), the self.classifier linear layer will have to be initialized randomly.

@TingNLP You are getting different predictions each time because each time you instantiate the model using .from_pretrained(), the self.classifier parameters will be different.
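If you want the randomly initialized head to at least be reproducible across runs, a minimal sketch using transformers.set_seed (the head still needs fine-tuning before its predictions mean anything):

from transformers import BertForSequenceClassification, set_seed

set_seed(42)  # seeds Python, NumPy and PyTorch RNGs before the head is initialized
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# The classifier head is still untrained, but it is now identical on every run.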

@s4sarath

s4sarath commented Apr 13, 2021 via email

@TingNLP

TingNLP commented Apr 14, 2021

OK... so the problem is the parameters.
Is it possible for us to fix their values?

I think if they can be fixed, the predictions will not be inconsistent every time.

@s4sarath

s4sarath commented Apr 14, 2021 via email

@TingNLP

TingNLP commented Apr 14, 2021

@s4sarath Thanks for your immediate reply.
I am still a little confused.
If the prediction is different each time, is that still a reasonable result?

@s4sarath

s4sarath commented Apr 14, 2021 via email

@Mark-Brass

It is said that BERT is a pre-trained model. Why, then, does it need to be trained again?

@tarskiandhutch

It does not need to be trained again to be used for a task that it was trained on: e.g., masked language modeling over a very large, general corpus of books and web text in the case of BERT. However, to perform more specific tasks like classification and question answering, such a model must be re-trained, which is called fine-tuning. Since many popular tasks fall in this latter category, it is assumed that most developers will be fine-tuning the models, and hence the Hugging Face developers included this warning message to ensure users are aware when the model does not appear to have been fine-tuned.

See Advantages of Fine-Tuning at this tutorial: https://mccormickml.com/2019/07/22/BERT-fine-tuning/#12-installing-the-hugging-face-library

Or check out this page from the documentation: https://huggingface.co/transformers/training.html
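As a concrete illustration of that fine-tuning step, a minimal Trainer sketch (the IMDB dataset, the small subset and the hyperparameters are placeholders for illustration, not recommendations):

from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

# Small shuffled subset just to keep the example fast.
encoded = dataset["train"].shuffle(seed=42).select(range(2000)).map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert-imdb", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded,
)
trainer.train()  # after this, the classifier head is no longer randomly initialized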

@Mark-Brass

Thank you. Now it is a bit more clear.
I am using finBERT for sentiment analysis, and downloaded the model from the official finBERT Git repository. Do I need, then, to train the model anew?

@ishandutta0098

I am facing a similar warning while creating an entity extraction model using bert-base-uncased. Here is the code for my model:

import config
import torch
import transformers
import torch.nn as nn

def loss_fn(output, target, mask, num_labels):

    lfn = nn.CrossEntropyLoss()
    active_loss = mask.view(-1) == 1
    active_logits = output.view(-1, num_labels)
    active_labels = torch.where(
        active_loss,
        target.view(-1),
        torch.tensor(lfn.ignore_index).type_as(target)
    )
    loss = lfn(active_logits, active_labels)
    return loss

class EntityModel(nn.Module):
    def __init__(self, num_tag, num_pos):
        super(EntityModel, self).__init__()

        self.num_tag = num_tag
        self.num_pos = num_pos
        self.bert = transformers.BertModel.from_pretrained(config.BASE_MODEL_PATH)
        self.bert_drop_1 = nn.Dropout(p = 0.3)
        self.bert_drop_2 = nn.Dropout(p = 0.3)
        self.out_tag = nn.Linear(768, self.num_tag)
        self.out_pos = nn.Linear(768, self.num_pos)

    def forward(self, ids, mask, token_type_ids, target_pos, target_tag):
        o1, _ = self.bert(ids, 
                          attention_mask = mask,
                          token_type_ids = token_type_ids)

        bo_tag = self.bert_drop_1(o1)
        bo_pos = self.bert_drop_2(o1)

        tag = self.out_tag(bo_tag)
        pos = self.out_pos(bo_pos)

        loss_tag = loss_fn(tag, target_tag, mask, self.num_tag)
        loss_pos = loss_fn(pos, target_pos, mask, self.num_pos)

        loss = (loss_tag + loss_pos) / 2

        return tag, pos, loss 

Warning:
Some weights of the model checkpoint at D:\Transformers\bert-entity-extraction\input\bert-base-uncased_L-12_H-768_A-12 were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']

  • This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

How do I resolve this?

@bver

bver commented Jan 19, 2022

@veronica320, the pooler layer is not used when doing sequence classification, so there's nothing to be worried about.

Note that this warning is sensitive to the Transformers version used for model training vs. the version used for inference.
For instance, a RoBERTa model fine-tuned with 4.9.1 produces this warning when loaded for RobertaForSequenceClassification inference under 4.15.0, but a model fine-tuned with 4.15.0 does not.

@lijiazheng99

lijiazheng99 commented Apr 19, 2022

An interesting edge case -- when I created and fine-tuned my custom classification model BertXXXSequenceClassification, inherited from BertPreTrainedModel, I found out that I can't name a layer self.beta_layer. Otherwise, I get a warning saying that beta_layer is newly initialised, and it won't load its weights and bias from saved checkpoints.
I didn't know what caused this conflict; refactoring it to self.bate_layer saved me in the end. I used ver 4.15.0.

@Birch-san

Birch-san commented Nov 16, 2022

I've been suppressing the warning with this helper:

from transformers import CLIPTextModel, logging

class log_level:
  orig_log_level: int
  log_level: int
  def __init__(self, log_level: int):
    self.log_level = log_level
    self.orig_log_level = logging.get_verbosity()
  def __enter__(self):
    logging.set_verbosity(self.log_level)
  def __exit__(self):
    logging.set_verbosity(self.orig_log_level)

with log_level(logging.ERROR):
  text_encoder: CLIPTextModel = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14')

@marcospgp

marcospgp commented Dec 19, 2022

Coming here from Google, this was happening when I called AutoModel.from_pretrained("EleutherAI/gpt-neo-125M").

I figured out that you can get the correct model type using the pipeline API instead:

[screenshot: pipeline output showing the checkpoint resolved to its task-specific model class]

In this case, this means I could also use AutoModelForCausalLM, but not AutoModel as that generated a model of a different type.
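A sketch of that check (the printed class name is what I would expect for this checkpoint):

from transformers import pipeline, AutoModelForCausalLM

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")
print(type(generator.model).__name__)  # e.g. GPTNeoForCausalLM

# Loading the matching task class directly, as described above.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")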

@saranpan

For those who want to suppress the warning for the latest transformers version, try this, hope this helps :D

import logging
logging.getLogger("transformers.modeling_utils").setLevel(logging.ERROR)

@layumi

layumi commented Jul 16, 2023

I guess the simple solution is to use AutoModelForMaskedLM instead of AutoModel.

from transformers import AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained('deps/distilbert-base-uncased')

@Muhammad-Asad-Arshed

This means you are not using the pooler; you need to set add_pooling_layer=True.

@Maximo-Rulli

I am confused after reading the whole issue. Is it possible to load a custom transformer architecture using HF or not? If not, is it possible to load a custom HF architecture (and weights) from a trainer checkpoint using torch?

@ArthurZucker
Collaborator

Hey @Maximo-Rulli, you can of course load a custom architecture using trust_remote_code=True (following this tutorial).
You just need to make sure you are instantiating the correct class: if the model that you push to the hub is a XXXForMaskedLM and you load it with XXXModel, you will simply get a warning saying that, for example, the lm_head was not loaded (which is expected!).
Also feel free to ask questions like this on our forum, as you'll probably get a lot of input from the community! 🤗
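A minimal sketch of that pattern; the repo id is a hypothetical placeholder for a hub repo that ships its own modeling code:

from transformers import AutoModel

# "your-username/your-custom-model" is a hypothetical repo containing its own
# configuration_*.py and modeling_*.py files.
model = AutoModel.from_pretrained("your-username/your-custom-model", trust_remote_code=True)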

@ddasfdsg34343

    if args.bert == 'base':
        med_config = '/Users/dengdeng/Desktop/M2KT-vit/config/med_config_blip.json'
    elif args.bert == 'sci':
        med_config = '/Users/dengdeng/Desktop/M2KT-vit/config/med_config_sci.json'
    elif args.bert == 'cli':
        med_config = '/Users/dengdeng/Desktop/M2KT-vit/config/med_config_cli.json'
    decoder_config = BertConfig.from_json_file(med_config)
    decoder_config.encoder_width = vision_width
    if args.bert == 'base':
        self.text_decoder = BertLMHeadModel.from_pretrained('bert-base-uncased', config=decoder_config)

    elif args.bert == 'sci':
        self.text_decoder = BertLMHeadModel.from_pretrained('allenai/scibert_scivocab_uncased',
                                                            config=decoder_config)
    elif args.bert == 'cli':
        self.text_decoder = BertLMHeadModel.from_pretrained('emilyalsentzer/Bio_ClinicalBERT',
                                                            config=decoder_config)
    self.text_decoder.resize_token_embeddings(len(self.tokenizer))

#################################
class BertLMHeadModel(BertPreTrainedModel):

_keys_to_ignore_on_load_unexpected = [r"pooler"]
_keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]

def __init__(self, config):
    super().__init__(config)

    self.bert = BertModel(config, add_pooling_layer=False)
    self.cls = BertOnlyMLMHead(config)

    self.init_weights()

def get_output_embeddings(self):
    return self.cls.predictions.decoder

def set_output_embeddings(self, new_embeddings):
    self.cls.predictions.decoder = new_embeddings

def forward(
    self,
    input_ids=None,
    attention_mask=None,
    position_ids=None,
    head_mask=None,
    inputs_embeds=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    labels=None,
    past_key_values=None,
    use_cache=None,
    output_attentions=None,
    output_hidden_states=None,
    return_dict=None,
    return_logits=False,            
    is_decoder=True,
    reduction='mean',
    mode='multimodal', 
):
    r"""
    encoder_hidden_states  (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
        Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
        the model is configured as a decoder.
    encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
        Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
        the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
        - 1 for tokens that are **not masked**,
        - 0 for tokens that are **masked**.
    labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
        Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
        ``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are
        ignored (masked), the loss is only computed for the tokens with labels n ``[0, ..., config.vocab_size]``
    past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
        Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
        If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
        (those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
        instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
    use_cache (:obj:`bool`, `optional`):
        If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
        decoding (see :obj:`past_key_values`).
    Returns:
    Example::
        >>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
        >>> import torch
        >>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
        >>> config = BertConfig.from_pretrained("bert-base-cased")
        >>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
        >>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
        >>> outputs = model(**inputs)
        >>> prediction_logits = outputs.logits
    """
    return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    if labels is not None:
        use_cache = False

    outputs = self.bert(
        input_ids,
        attention_mask=attention_mask,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        past_key_values=past_key_values,
        use_cache=use_cache,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
        is_decoder=is_decoder,
        mode=mode,
    )
    
    sequence_output = outputs[0]
    prediction_scores = self.cls(sequence_output)
    
    if return_logits:
        return prediction_scores[:, :-1, :].contiguous()  

    lm_loss = None
    if labels is not None:
        # we are doing next-token prediction; shift prediction scores and input ids by one
        shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
        labels = labels[:, 1:].contiguous()
        loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
        # loss_fct = CrossEntropyLoss(reduction=reduction)
        lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
        if reduction=='none':
            lm_loss = lm_loss.view(prediction_scores.size(0),-1).sum(1)               

    if not return_dict:
        output = (prediction_scores,) + outputs[2:]
        return ((lm_loss,) + output) if lm_loss is not None else output

    return CausalLMOutputWithCrossAttentions(
        loss=lm_loss,
        logits=prediction_scores,
        past_key_values=outputs.past_key_values,
        hidden_states=outputs.hidden_states,
        attentions=outputs.attentions,
        cross_attentions=outputs.cross_attentions,
    )

def prepare_inputs_for_generation(self, input_ids, past=None, attention_mask=None, **model_kwargs):
    input_shape = input_ids.shape
    # if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
    if attention_mask is None:
        attention_mask = input_ids.new_ones(input_shape)

    # cut decoder_input_ids if past is used
    if past is not None:
        input_ids = input_ids[:, -1:]

    return {
        "input_ids": input_ids, 
        "attention_mask": attention_mask, 
        "past_key_values": past,
        "encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
        "encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
        "is_decoder": True,
    }

def _reorder_cache(self, past, beam_idx):
    reordered_past = ()
    for layer_past in past:
        reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
    return reordered_past

When I was loading the pretrained weights, this appeared: Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertLMHeadModel: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'bert.embeddings.token_type_embeddings.weight']

This IS expected if you are initializing BertLMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertLMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertLMHeadModel were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['bert.encoder.layer.5.crossattention.self.query.bias', 'bert.encoder.layer.10.crossattention.self.value.bias', 'bert.encoder.layer.10.crossattention.self.query.weight', 'bert.encoder.layer.2.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.0.crossattention.output.dense.bias', 'bert.encoder.layer.5.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.5.crossattention.self.key.bias', 'bert.encoder.layer.0.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.key.bias', 'bert.encoder.layer.10.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.3.crossattention.self.query.weight', 'bert.encoder.layer.11.crossattention.output.dense.weight', 'bert.encoder.layer.4.crossattention.self.query.weight', 'bert.encoder.layer.11.crossattention.output.dense.bias', 'bert.encoder.layer.8.crossattention.output.dense.weight', 'bert.encoder.layer.9.crossattention.output.dense.weight', 'bert.encoder.layer.4.crossattention.output.dense.weight', 'bert.encoder.layer.6.crossattention.self.query.weight', 'bert.encoder.layer.3.crossattention.self.query.bias', 'bert.encoder.layer.10.crossattention.self.query.bias', 'bert.encoder.layer.4.crossattention.self.key.bias', 'bert.encoder.layer.9.crossattention.self.key.bias', 'bert.encoder.layer.10.crossattention.self.key.weight', 'bert.encoder.layer.10.crossattention.output.dense.bias', 'bert.encoder.layer.7.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.8.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.7.crossattention.self.query.bias', 'bert.encoder.layer.7.crossattention.self.value.bias', 'bert.encoder.layer.4.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.2.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.3.crossattention.self.key.weight', 'bert.encoder.layer.6.crossattention.self.value.weight', 'bert.encoder.layer.6.crossattention.self.key.bias', 'bert.encoder.layer.1.crossattention.self.value.weight', 'bert.encoder.layer.7.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.9.crossattention.self.query.bias', 'bert.encoder.layer.8.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.query.bias', 'bert.encoder.layer.6.crossattention.output.dense.weight', 'bert.encoder.layer.5.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.8.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.0.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.9.crossattention.self.value.weight', 'bert.encoder.layer.0.crossattention.self.query.weight', 'bert.encoder.layer.3.crossattention.self.value.weight', 'bert.encoder.layer.5.crossattention.self.value.weight', 'bert.encoder.layer.3.crossattention.self.key.bias', 'bert.encoder.layer.1.crossattention.self.value.bias', 'bert.encoder.layer.10.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.7.crossattention.self.key.bias', 'bert.encoder.layer.1.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.self.query.bias', 'bert.encoder.layer.7.crossattention.output.dense.bias', 'bert.encoder.layer.2.crossattention.self.key.weight', 'bert.encoder.layer.11.crossattention.self.value.weight', 'bert.encoder.layer.0.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.6.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.self.key.weight', 
'bert.encoder.layer.6.crossattention.output.dense.bias', 'bert.encoder.layer.11.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.10.crossattention.output.dense.weight', 'bert.encoder.layer.0.crossattention.self.value.bias', 'bert.encoder.layer.11.crossattention.self.query.bias', 'bert.encoder.layer.4.crossattention.self.value.bias', 'bert.encoder.layer.6.crossattention.self.key.weight', 'bert.encoder.layer.11.crossattention.self.value.bias', 'bert.encoder.layer.0.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.8.crossattention.self.query.weight', 'bert.encoder.layer.5.crossattention.self.value.bias', 'bert.encoder.layer.7.crossattention.self.key.weight', 'bert.encoder.layer.3.crossattention.output.dense.bias', 'bert.encoder.layer.1.crossattention.self.key.weight', 'bert.encoder.layer.8.crossattention.output.dense.bias', 'bert.encoder.layer.4.crossattention.self.value.weight', 'bert.encoder.layer.5.crossattention.self.query.weight', 'bert.encoder.layer.11.crossattention.self.query.weight', 'bert.encoder.layer.2.crossattention.self.query.bias', 'bert.encoder.layer.2.crossattention.output.dense.bias', 'bert.encoder.layer.7.crossattention.output.dense.weight', 'cls.predictions.decoder.weight', 'bert.encoder.layer.7.crossattention.self.value.weight', 'bert.encoder.layer.9.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.6.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.2.crossattention.self.key.bias', 'bert.encoder.layer.5.crossattention.output.dense.bias', 'bert.encoder.layer.3.crossattention.output.dense.weight', 'bert.encoder.layer.1.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.11.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.5.crossattention.output.dense.weight', 'bert.encoder.layer.11.crossattention.self.key.bias', 'bert.encoder.layer.11.crossattention.self.key.weight', 'bert.encoder.layer.6.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.5.crossattention.self.key.weight', 'bert.encoder.layer.3.crossattention.output.LayerNorm.bias', 'bert.encoder.layer.3.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.8.crossattention.self.key.bias', 'bert.encoder.layer.2.crossattention.output.dense.weight', 'bert.encoder.layer.8.crossattention.self.key.weight', 'bert.encoder.layer.9.crossattention.self.query.weight', 'bert.encoder.layer.4.crossattention.self.key.weight', 'bert.encoder.layer.4.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.2.crossattention.self.value.bias', 'bert.encoder.layer.1.crossattention.self.key.bias', 'bert.encoder.layer.9.crossattention.output.LayerNorm.weight', 'bert.encoder.layer.7.crossattention.self.query.weight', 'bert.encoder.layer.8.crossattention.self.value.weight', 'bert.encoder.layer.9.crossattention.self.value.bias', 'bert.encoder.layer.2.crossattention.self.query.weight', 'bert.encoder.layer.1.crossattention.output.dense.bias', 'bert.encoder.layer.10.crossattention.self.value.weight', 'bert.encoder.layer.1.crossattention.self.query.weight', 'bert.encoder.layer.9.crossattention.output.dense.bias', 'bert.encoder.layer.8.crossattention.self.query.bias', 'bert.encoder.layer.4.crossattention.output.dense.bias', 'bert.encoder.layer.2.crossattention.self.value.weight', 'bert.encoder.layer.6.crossattention.self.query.bias', 'bert.encoder.layer.3.crossattention.self.value.bias', 'bert.encoder.layer.9.crossattention.self.key.weight', 'bert.encoder.layer.4.crossattention.self.query.bias', 'bert.encoder.layer.10.crossattention.self.key.bias'],how to deal with 
it, please? 🥹

@ArthurZucker
Collaborator

Hey 🤗 thanks for the follow up, could you ask your question on the forum instead? I'm sure the community will be of help and we can't debug your custom code for you.

@AssmaGht

AssmaGht commented Apr 7, 2024

Hello,
I have the same warning.
This is my code with the problem:

import pandas as pd
import numpy as np
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer
train_df = pd.read_csv('train.csv', encoding='latin-1')
test_df = pd.read_csv('test.csv', encoding='latin-1')

model = TFAutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_text(text):
    return tokenizer(text, padding=True, truncation=True, return_tensors='tf')

train_inputs = preprocess_text(train_df['STATUS'].tolist())
test_inputs = preprocess_text(test_df['STATUS'].tolist())

label_mapping = {'n': 0, 'y': 1}
train_labels = train_df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']].replace(label_mapping).values.astype(np.float32)
test_labels = test_df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']].replace(label_mapping).values.astype(np.float32)

train_dataset = tf.data.Dataset.from_tensor_slices(({
    'input_ids': train_inputs['input_ids'],
    'token_type_ids': train_inputs['token_type_ids'],
    'attention_mask': train_inputs['attention_mask']
}, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices(({
    'input_ids': test_inputs['input_ids'],
    'token_type_ids': test_inputs['token_type_ids'],
    'attention_mask': test_inputs['attention_mask']
}, test_labels))

BATCH_SIZE = 64
train_dataset = train_dataset.shuffle(len(train_df)).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)

class BERTForClassification(tf.keras.Model):

    def __init__(self, bert_model, num_classes):
        super().__init__()
        self.bert = bert_model
        self.dropout = tf.keras.layers.Dropout(0.1)
        self.fc = tf.keras.layers.Dense(num_classes, activation='softmax')

    def call(self, inputs, training=False):
        outputs = self.bert(inputs, training=training)[0]  # Use all hidden states
        pooled_output = outputs[:, 0, :]  # Use the [CLS] token representation
        pooled_output = self.dropout(pooled_output, training=training)
        return self.fc(pooled_output)

num_classes = len(train_df[['cEXT', 'cNEU', 'cAGR', 'cCON', 'cOPN']].columns)
classifier = BERTForClassification(model, num_classes=num_classes)

loss = tf.keras.losses.CategoricalCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5)
classifier.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])

history = classifier.fit(
    train_dataset,
    epochs=3
)

classifier.evaluate(test_dataset)

Output:

/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.
Epoch 1/3
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model/bert/pooler/dense/kernel:0', 'tf_bert_model/bert/pooler/dense/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model/bert/pooler/dense/kernel:0', 'tf_bert_model/bert/pooler/dense/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model/bert/pooler/dense/kernel:0', 'tf_bert_model/bert/pooler/dense/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?
WARNING:tensorflow:Gradients do not exist for variables ['tf_bert_model/bert/pooler/dense/kernel:0', 'tf_bert_model/bert/pooler/dense/bias:0'] when minimizing the loss. If you're using `model.compile()`, did you forget to provide a `loss` argument?

@Sumi19

Sumi19 commented Apr 23, 2024

tokenizer_config.json: 100%1.27k/1.27k [00:00<00:00, 32.0kB/s]
vocab.txt: 100%232k/232k [00:00<00:00, 3.49MB/s]
tokenizer.json: 100%711k/711k [00:00<00:00, 3.66MB/s]
special_tokens_map.json: 100%125/125 [00:00<00:00, 1.57kB/s]
config.json: 100%913/913 [00:00<00:00, 18.6kB/s]
model.safetensors: 100%436M/436M [00:08<00:00, 52.1MB/s]
Some weights of BertForMaskedLM were not initialized from the model checkpoint at judithrosell/BC5CDR_BlueBERT_NER and are newly initialized: ['cls.predictions.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
[<ipython-input-10-6a4cfd3451b6>](https://localhost:8080/#) in <cell line: 78>()
     76 
     77 # Create datasets
---> 78 train_dataset = create_dataset_from_file("/content/drive/My Drive/Model_Training/fold2/train.tsv", tokenizer)
     79 eval_dataset = create_dataset_from_file("/content/drive/My Drive/Model_Training/fold2/dev.tsv", tokenizer)
     IndexError: list index out of range

What does this mean? How do I correct it?

@ArthurZucker
Collaborator

Sorry, but I think you should ask your question on the forum; you are trying to use custom code, and it does not seem related to this thread 😉

@davies-w
Contributor

I've been using suppressing the warning with this helper:

from transformers import CLIPTextModel, logging

class log_level:
  orig_log_level: int
  log_level: int
  def __init__(self, log_level: int):
    self.log_level = log_level
    self.orig_log_level = logging.get_verbosity()
  def __enter__(self):
    logging.set_verbosity(self.log_level)
  def __exit__(self):
    logging.set_verbosity(self.orig_log_level)

with log_level(logging.ERROR):
  text_encoder: CLIPTextModel = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14')

This works, except you need to give __exit__ four parameters,
(self, exception_type, exception_value, traceback):

and from transformers import CLIPTextModel, logging should reflect that logging has moved to transformers.utils.logging
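For reference, here is the helper with those fixes applied (same behaviour, just a runnable sketch):

from transformers import CLIPTextModel, logging

class log_level:
  orig_log_level: int
  log_level: int
  def __init__(self, log_level: int):
    self.log_level = log_level
    self.orig_log_level = logging.get_verbosity()
  def __enter__(self):
    logging.set_verbosity(self.log_level)
  def __exit__(self, exception_type, exception_value, traceback):
    # Restore the original verbosity even if an exception was raised.
    logging.set_verbosity(self.orig_log_level)

with log_level(logging.ERROR):
  text_encoder: CLIPTextModel = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14')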
