Table recognition: bug encountered during eval while training #13268

Closed
liuzhipengchd opened this issue Jul 5, 2024 · 11 comments · Fixed by #13276
Labels
bug Something isn't working Code PR is needed This issue could inspire a code PR

Comments

@liuzhipengchd

structure_ids = paddle.zeros(

[Hint: Expected data_type == phi::DataType::FLOAT16 || data_type == phi::DataType::BFLOAT16 == true, but received data_type == phi::DataType::FLOAT16 || data_type == phi::DataType::BFLOAT16:0 != true:1.] (at ../paddle/fluid/imperative/amp_auto_cast.cc:190)

During evaluation, the data type of structure_ids does not match that of pre_chars, which causes the bug.

@GreatV
Collaborator

GreatV commented Jul 5, 2024

What is your runtime environment? Which paddle version and which paddleocr version are you using?

@liuzhipengchd
Author

> What is your runtime environment? Which paddle version and which paddleocr version are you using?

paddleocr is 2.8, paddlepaddle-gpu == 0.0.0.post112
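For reports like this one, the installed package versions can be collected with the standard library alone. The following is a small sketch; the helper name `installed_version` is hypothetical, and the package names are the pip distribution names assumed from the reply above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as they would appear in `pip list` (assumed).
for name in ("paddlepaddle-gpu", "paddlepaddle", "paddleocr"):
    print(f"{name}: {installed_version(name)}")
```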

@GreatV
Collaborator

GreatV commented Jul 5, 2024

Which GPU are you using? Are you training in AMP mode? Does it run without AMP?

@liuzhipengchd
Author

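> Which GPU are you using? Are you training in AMP mode? Does it run without AMP?

A 4090, in the normal mode. It runs once I change the data types to a consistent int32. Isn't that the bug?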

@GreatV
Collaborator

GreatV commented Jul 5, 2024

OK @liuzhipengchd, thanks for the feedback. Could you open a PR to fix it?

@GreatV GreatV added bug Something isn't working Code PR is needed This issue could inspire a code PR labels Jul 5, 2024
@GreatV
Collaborator

GreatV commented Jul 5, 2024

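> A 4090, in the normal mode. It runs once I change the data types to a consistent int32. Isn't that the bug?

I asked because the bug here looks like it is caused by AMP being enabled.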

@liuzhipengchd
Author

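> > A 4090, in the normal mode. It runs once I change the data types to a consistent int32. Isn't that the bug?

> I asked because the bug here looks like it is caused by AMP being enabled.

OK. I have two more questions. 1. For training a table recognition model, which gives better results, SLANet_lcnetv2.yml or SLANet_ch.yml (and what if I choose scale 2.5)? 2. In SLANet_lcnetv2, how should I configure it if I want to use PPLCNetV2_large?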

@GreatV
Collaborator

GreatV commented Jul 5, 2024

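> I have two more questions. 1. For training a table recognition model, which gives better results, SLANet_lcnetv2.yml or SLANet_ch.yml (and what if I choose scale 2.5)? 2. In SLANet_lcnetv2, how should I configure it if I want to use PPLCNetV2_large?

SLANet_lcnetv2.yml should be somewhat better. For the details, please ask @invictuszhao.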

@GreatV
Collaborator

GreatV commented Jul 6, 2024

@liuzhipengchd I can't reproduce your problem.

 python3 tools/train.py -c configs/table/SLANet.yml -o Train.loader.batch_size_per_card=16 Eval.loader.batch_size_per_card=16
[2024/07/06 07:43:04] ppocr INFO: epoch: [1/100], global_step: 1000, lr: 0.001000, acc: 0.000000, loss: 0.083020, structure_loss: 0.055820, loc_loss: 0.026588, avg_reader_cost: 0.00075 s, avg_batch_cost: 0.28300 s, avg_samples: 16.0, ips: 56.53801 samples/s, eta: 10 days, 9:10:27, max_mem_reserved: 8722 MB, max_mem_allocated: 8043 MB
eval model:: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 570/570 [01:45<00:00,  5.41it/s]
[2024/07/06 07:44:49] ppocr INFO: cur metric, acc: 0.026001097089851773, fps: 94.59463124211499
[2024/07/06 07:44:49] ppocr INFO: save best model is to ./output/SLANet/best_accuracy
[2024/07/06 07:44:49] ppocr INFO: best metric, acc: 0.026001097089851773, is_float16: False, fps: 94.59463124211499, best_epoch: 1
[2024/07/06 07:44:55] ppocr INFO: epoch: [1/100], global_step: 1020, lr: 0.001000, acc: 0.000000, loss: 0.078646, structure_loss: 0.050355, loc_loss: 0.032018, avg_reader_cost: 0.00104 s, avg_batch_cost: 0.27847 s, avg_samples: 16.0, ips: 57.45583 samples/s, eta: 10 days, 9:01:56, max_mem_reserved: 8722 MB, max_mem_allocated: 8043 MB
python3 tools/train.py -c configs/table/SLANet.yml -o Train.loader.batch_size_per_card=16 Eval.loader.batch_size_per_card=16 Global.use_amp=True
[2024/07/06 07:59:55] ppocr INFO: epoch: [1/100], global_step: 1000, lr: 0.001000, acc: 0.000000, loss: 0.084489, structure_loss: 0.057010, loc_loss: 0.028551, avg_reader_cost: 0.00082 s, avg_batch_cost: 0.26592 s, avg_samples: 16.0, ips: 60.16862 samples/s, eta: 9 days, 18:05:22, max_mem_reserved: 4516 MB, max_mem_allocated: 4079 MB
eval model:: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 570/570 [01:59<00:00,  4.78it/s]
[2024/07/06 08:01:55] ppocr INFO: cur metric, acc: 0.005485463521065774, fps: 82.6024714859491
[2024/07/06 08:01:55] ppocr INFO: save best model is to ./output/SLANet/best_accuracy
[2024/07/06 08:01:55] ppocr INFO: best metric, acc: 0.005485463521065774, is_float16: False, fps: 82.6024714859491, best_epoch: 1
[2024/07/06 08:02:00] ppocr INFO: epoch: [1/100], global_step: 1020, lr: 0.001000, acc: 0.000000, loss: 0.078040, structure_loss: 0.051105, loc_loss: 0.030460, avg_reader_cost: 0.00108 s, avg_batch_cost: 0.26121 s, avg_samples: 16.0, ips: 61.25244 samples/s, eta: 9 days, 17:56:58, max_mem_reserved: 4691 MB, max_mem_allocated: 4079 MB
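To put the eval results of the two runs above side by side, the `cur metric` lines can be extracted from the logs. The following is a minimal sketch that assumes the log format shown above; the helper name `parse_cur_metrics` is hypothetical, and the two sample lines are copied from the logs.

```python
import re

# Matches lines such as "... cur metric, acc: 0.0260..., fps: 94.59..."
METRIC_RE = re.compile(
    r"cur metric, acc: (?P<acc>[0-9.eE+-]+), fps: (?P<fps>[0-9.eE+-]+)"
)


def parse_cur_metrics(log_text):
    """Return a list of (acc, fps) tuples, one per 'cur metric' line in log_text."""
    return [(float(m["acc"]), float(m["fps"])) for m in METRIC_RE.finditer(log_text)]


# One line from the default run and one from the Global.use_amp=True run.
fp32_log = "[2024/07/06 07:44:49] ppocr INFO: cur metric, acc: 0.026001097089851773, fps: 94.59463124211499"
amp_log = "[2024/07/06 08:01:55] ppocr INFO: cur metric, acc: 0.005485463521065774, fps: 82.6024714859491"

print("default:", parse_cur_metrics(fp32_log))
print("amp:    ", parse_cur_metrics(amp_log))
```

Both evaluations were taken at global step 1000 of the first epoch, so the accuracy values mainly confirm that eval completes in either mode rather than indicating final model quality.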

@GreatV GreatV linked a pull request Jul 6, 2024 that will close this issue
@liuzhipengchd
Author

> @liuzhipengchd I can't reproduce your problem.

[Screenshot: WeChat6f359efa0925112819dc9dadc4e44b62]

I can reproduce the problem on my side.
Keeping the types the same fixes it. (Will making the types the same affect the training results?)

@GreatV
Collaborator

GreatV commented Jul 8, 2024

@liuzhipengchd It won't affect the results. This has already been fixed.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2024

2 participants