'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, #32382

Closed
4 tasks
xiaocao opened this issue Aug 1, 2024 · 5 comments
xiaocao commented Aug 1, 2024

System Info

datasets==2.20.0
torch==1.10.2+cu111
torchvision==0.11.3+cu111
transformers==4.42.4
detectron2==0.6
opencv-contrib-python==4.10.0.84
seqeval==1.2.2
accelerate==0.32.1
wandb
sentencepiece
easyocr
setuptools==59.5.0
python-bidi==0.4.2

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am porting the relation extraction head for LayoutLMv2 from transformers==4.6 to the latest transformers.
I use the original re.py code together with the LayoutLMv2 implementation built into transformers.
However, the model never trains correctly.
The output is as follows:
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.33}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.34}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.35}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.36}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.37}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.38}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.39}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.4}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.41}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.43}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.44}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.45}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.47}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.48}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.49}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.5}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.51}
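Seeing `grad_norm: nan` alongside a zero loss suggests the backward pass is already producing non-finite values. A generic way to locate the affected parameters (plain PyTorch, not part of the original script; the toy model below is purely illustrative):

```python
import torch
import torch.nn as nn

def find_nan_grads(model: nn.Module) -> list:
    """Call after loss.backward(); lists parameters whose grads are NaN/Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

# Toy demo: an exploded weight poisons the gradients of its own layer.
m = nn.Linear(2, 1)
with torch.no_grad():
    m.weight.fill_(float("inf"))
loss = (m(torch.ones(1, 2)) ** 2).sum()
loss.backward()
print(find_nan_grads(m))  # both 'weight' and 'bias' are reported
```

Calling this right after the first backward pass narrows the problem down to specific layers before the optimizer and fp16 loss scaling obscure it.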
The configuration is:
"--standalone",
"--nnodes=1",
"--nproc_per_node=1",
"--master_port=12098",
"examples/run_xfun_re_inf.py",
"--model_name_or_path=/home/mypath/layoutxlm_base",
"--output_dir=/home/myproj/output",
"--do_train",
"--do_eval",
"--lang=zh",
"--max_steps=5000",
"--per_device_train_batch_size=2",
"--warmup_ratio=0.1",
"--fp16",
"--learning_rate=5e-5",
"--logging_steps=1",

Could someone please help solve this problem? Thank you very much!!

Expected behavior

Normal training, i.e. a non-zero loss and a finite gradient norm.

@xiaocao xiaocao added the bug label Aug 1, 2024
@amyeroberts (Collaborator) commented:
Hi @xiaocao, thanks for raising an issue!

Could you link to or share the re.py code you're referring to? LayoutLMv2 is available in the most recent version of transformers, so you shouldn't need to do anything specific in terms of conversion to the general model. Looking at old issues, it seems this task-specific head was never added for this model: #15451, #19120

xiaocao commented Aug 2, 2024

Thank you for your reply. @amyeroberts
The source code of re.py is at: https://github.com/microsoft/unilm/blob/master/layoutlmft/layoutlmft/modules/decoders/re.py

In my attempt, I tried to test the "LayoutLMv2ForRelationExtraction" task.
The code is available at:
https://github.com/microsoft/unilm/blob/master/layoutlmft/layoutlmft/models/layoutlmv2/modeling_layoutlmv2.py

I simply transferred the LayoutLMv2ForRelationExtraction class and the re.py code from transformers==4.6 to 4.41 (and the latest version), and got the result above.

I have debugged the code and found that the weights are abnormal.
For example, the weights of the bilinear layer in BiaffineAttention (the code in re.py) were initialized to extreme values: a few of them were less than 1e-44, and most were 0.
Manually initializing the weights does not help.

I replaced the LayoutLMv2 model with LayoutLMv3, and the result is similar.
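The check described above can be automated. A minimal sketch (plain PyTorch, independent of the LayoutLM code; the threshold `tiny` is an arbitrary cutoff for denormal-scale values) that flags parameters which are all-zero or contain such extreme values:

```python
import torch
import torch.nn as nn

def report_suspicious_weights(model: nn.Module, tiny: float = 1e-30) -> list:
    """Names of parameters that are all-zero or contain denormal-scale values."""
    suspicious = []
    for name, param in model.named_parameters():
        abs_vals = param.detach().abs()
        if torch.all(abs_vals == 0) or torch.any((abs_vals > 0) & (abs_vals < tiny)):
            suspicious.append(name)
    return suspicious

# Toy demo: a deliberately zeroed weight matrix is flagged.
layer = nn.Linear(4, 4)
with torch.no_grad():
    layer.weight.zero_()
print(report_suspicious_weights(layer))  # 'weight' is flagged
```

Running this over the full model right after loading shows at a glance which submodules never received a proper initialization.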

@amyeroberts (Collaborator) commented:
@xiaocao Thanks for sharing links to the code. As this involves custom code and classes, this question is best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

For example, the weights of the bilinear layer in BiaffineAttention (the code in re.py) were initialized to extreme values: a few of them were less than 1e-44, and most were 0.

Is this loading the model with from_pretrained? In that case, and given these values, it would indicate to me that the layers in the custom classes are not covered by the PreTrainedModel subclass's _init_weights method.
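A minimal sketch of that pattern in plain PyTorch (the class and module names here are illustrative stand-ins, not the transformers or unilm implementations): every custom submodule must be matched in `_init_weights`, otherwise layers missing from the checkpoint keep whatever memory they were allocated with, which can look exactly like the zero/denormal values described above.

```python
import torch
import torch.nn as nn

class BiaffineAttention(nn.Module):
    """Simplified stand-in for the decoder head in re.py (illustrative only)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.bilinear = nn.Bilinear(in_features, in_features, out_features, bias=False)
        self.linear = nn.Linear(2 * in_features, out_features)

class ReModel(nn.Module):
    """Mimics the PreTrainedModel pattern of applying _init_weights to all submodules."""
    def __init__(self):
        super().__init__()
        self.decoder = BiaffineAttention(8, 2)
        self.apply(self._init_weights)  # roughly what post_init() does in transformers

    def _init_weights(self, module):
        # Crucially, nn.Bilinear is listed here too, not just nn.Linear:
        # otherwise the biaffine weights are never initialized.
        if isinstance(module, (nn.Linear, nn.Bilinear)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()

model = ReModel()  # all decoder weights now have a sane scale
```

In a real port, the equivalent fix would be to extend the `_init_weights` of the PreTrainedModel subclass to cover the custom decoder layers before calling from_pretrained.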

@Muhammad-Hamza-Jadoon commented:

Hi @xiaocao,

Were you able to debug and solve the problem?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
