'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, #32382

Closed
4 tasks
xiaocao opened this issue Aug 1, 2024 · 5 comments
xiaocao commented Aug 1, 2024

System Info

datasets==2.20.0
torch==1.10.2+cu111
torchvision==0.11.3+cu111
transformers==4.42.4
detectron2==0.6
opencv-contrib-python==4.10.0.84
seqeval==1.2.2
accelerate==0.32.1
wandb
sentencepiece
easyocr
setuptools==59.5.0
python-bidi==0.4.2

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am porting the relation extraction head for LayoutLMv2 from transformers==4.6 to the latest transformers.
I use the original re.py code together with the LayoutLMv2 implementation built into transformers.
However, the model never trains correctly.
The output is as follows:
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.33}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.34}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.35}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.36}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.37}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.38}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.39}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.4}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.41}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.43}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.44}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.45}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.47}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.48}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.49}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.5}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0, 'epoch': 5.51}
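Seeing `grad_norm: nan` alongside a zero loss suggests the backward pass is already producing non-finite values. A generic way to locate the affected parameters (plain PyTorch, not part of the original script; the toy model below is purely illustrative):

```python
import torch
import torch.nn as nn

def find_nan_grads(model: nn.Module) -> list:
    """Call after loss.backward(); lists parameters whose grads are NaN/Inf."""
    return [
        name
        for name, p in model.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]

# Toy demo: an exploded weight poisons the gradients of its own layer.
m = nn.Linear(2, 1)
with torch.no_grad():
    m.weight.fill_(float("inf"))
loss = (m(torch.ones(1, 2)) ** 2).sum()
loss.backward()
print(find_nan_grads(m))  # both 'weight' and 'bias' are reported
```

Calling this right after the first backward pass narrows the problem down to specific layers before the optimizer and fp16 loss scaling obscure it.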
The configuration is:
"--standalone",
"--nnodes=1",
"--nproc_per_node=1",
"--master_port=12098",
"examples/run_xfun_re_inf.py",
"--model_name_or_path=/home/mypath/layoutxlm_base",
"--output_dir=/home/myproj/output",
"--do_train",
"--do_eval",
"--lang=zh",
"--max_steps=5000",
"--per_device_train_batch_size=2",
"--warmup_ratio=0.1",
"--fp16",
"--learning_rate=5e-5",
"--logging_steps=1",

Could someone please help solve this problem? Thank you very much!!

Expected behavior

Normal training, i.e. a non-zero loss and a finite gradient norm.

@xiaocao xiaocao added the bug label Aug 1, 2024
@amyeroberts (Collaborator) commented:
Hi @xiaocao, thanks for raising an issue!

Could you link to or share the re.py code you're referring to? LayoutLMv2 is available in the most recent version of transformers, so you shouldn't need to do anything specific in terms of conversion to the general model. Looking at old issues, it seems this task-specific head was never added for this model: #15451, #19120

xiaocao commented Aug 2, 2024

Thank you for your reply. @amyeroberts
The source code of re.py is at: https://github.com/microsoft/unilm/blob/master/layoutlmft/layoutlmft/modules/decoders/re.py

In my attempt, I tried to test the "LayoutLMv2ForRelationExtraction" task.
The code is available at:
https://github.com/microsoft/unilm/blob/master/layoutlmft/layoutlmft/models/layoutlmv2/modeling_layoutlmv2.py

I simply transferred the LayoutLMv2ForRelationExtraction class and the re.py code from transformers==4.6 to 4.41 (and the latest version), and got the result above.

I have debugged the code and found that the weights are abnormal.
For example, the weights of the bilinear layer in BiaffineAttention (the code in re.py) were initialized to extreme values: a few of them were less than 1e-44, and most were 0.
Manually initializing the weights does not help.

I replaced the LayoutLMv2 model with LayoutLMv3, and the result is similar.
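The check described above can be automated. A minimal sketch (plain PyTorch, independent of the LayoutLM code; the threshold `tiny` is an arbitrary cutoff for denormal-scale values) that flags parameters which are all-zero or contain such extreme values:

```python
import torch
import torch.nn as nn

def report_suspicious_weights(model: nn.Module, tiny: float = 1e-30) -> list:
    """Names of parameters that are all-zero or contain denormal-scale values."""
    suspicious = []
    for name, param in model.named_parameters():
        abs_vals = param.detach().abs()
        if torch.all(abs_vals == 0) or torch.any((abs_vals > 0) & (abs_vals < tiny)):
            suspicious.append(name)
    return suspicious

# Toy demo: a deliberately zeroed weight matrix is flagged.
layer = nn.Linear(4, 4)
with torch.no_grad():
    layer.weight.zero_()
print(report_suspicious_weights(layer))  # 'weight' is flagged
```

Running this over the full model right after loading shows at a glance which submodules never received a proper initialization.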

@amyeroberts (Collaborator) commented:
@xiaocao Thanks for sharing links to the code. As this involves custom code and classes, this question is best placed in our forums. We try to reserve the GitHub issues for feature requests and bug reports.

For example, the weights of the bilinear layer in BiaffineAttention (the code in re.py) were initialized to extreme values: a few of them were less than 1e-44, and most were 0.

Is this loading the model with from_pretrained? In that case, and given these values, it would indicate to me that the layers in the custom classes are not covered by the PreTrainedModel subclass's _init_weights method.
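A minimal sketch of that pattern in plain PyTorch (the class and module names here are illustrative stand-ins, not the transformers or unilm implementations): every custom submodule must be matched in `_init_weights`, otherwise layers missing from the checkpoint keep whatever memory they were allocated with, which can look exactly like the zero/denormal values described above.

```python
import torch
import torch.nn as nn

class BiaffineAttention(nn.Module):
    """Simplified stand-in for the decoder head in re.py (illustrative only)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.bilinear = nn.Bilinear(in_features, in_features, out_features, bias=False)
        self.linear = nn.Linear(2 * in_features, out_features)

class ReModel(nn.Module):
    """Mimics the PreTrainedModel pattern of applying _init_weights to all submodules."""
    def __init__(self):
        super().__init__()
        self.decoder = BiaffineAttention(8, 2)
        self.apply(self._init_weights)  # roughly what post_init() does in transformers

    def _init_weights(self, module):
        # Crucially, nn.Bilinear is listed here too, not just nn.Linear:
        # otherwise the biaffine weights are never initialized.
        if isinstance(module, (nn.Linear, nn.Bilinear)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()

model = ReModel()  # all decoder weights now have a sane scale
```

In a real port, the equivalent fix would be to extend the `_init_weights` of the PreTrainedModel subclass to cover the custom decoder layers before calling from_pretrained.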

@Muhammad-Hamza-Jadoon commented:

Hi @xiaocao,

Were you able to debug and solve the problem?


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
