Latexocr paddle #13401

Merged
merged 15 commits into from
Jul 22, 2024

liuhongen1234567
Contributor

Add the LaTeX OCR model to PaddleOCR.

@CLAassistant

CLAassistant commented Jul 16, 2024

CLA assistant check
All committers have signed the CLA.

@@ -0,0 +1,127 @@
Global:
  use_gpu: True
  epoch_num: 500

Collaborator

Can you confirm that the total number of epochs is really 500?

Contributor Author

Yes. The full 500 epochs are needed for ExpRate to reach the accuracy of the released PyTorch model. The evaluation results at different training epochs are shown in the table below:

Contributor Author

[Image: 1721125010720]

Head:
name: LaTeXOCRHead
pad_value: 0
ignore_index: -100

Collaborator

Can the index ever be negative?

Contributor Author

No. The LaTeX dictionary is encoded starting from 0.

Contributor Author

[Image: 1721134188547]

Contributor Author

The LaTeX code does not actually use the ignore_index parameter, so it has been removed.

keep_keys: ['image']
loader:
shuffle: True
batch_size_per_card: 1

Collaborator

Can the training batch size only be 1?

Contributor Author

The batch size is set through batch_size_per_pair.

Contributor Author

[Image: 1721125346165]

Contributor Author

@liuhongen1234567 liuhongen1234567 Jul 16, 2024

Here batch_size_per_card is the number of lists of image/formula-text pairs. It is written this way because the original LaTeX-OCR dataset first groups images by size and then takes batch_size_per_pair images from the different groups for training. The smallest sampling unit is therefore batch_size_per_pair images, not a single image. If the smallest unit were a single image, there would be no guarantee that all images in a batch have the same size.
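
The size-grouped batching described in this comment can be sketched roughly as follows. This is a minimal illustration under assumptions, not the dataset code from this PR: `build_size_bucketed_batches` and the `(width, height, formula)` sample layout are hypothetical names chosen for the example, and `batch_size` plays the role of batch_size_per_pair.

```python
import random
from collections import defaultdict


def build_size_bucketed_batches(samples, batch_size, seed=0):
    """Group samples by image size, then cut each group into batches.

    `samples` is a list of (width, height, formula) tuples (a hypothetical
    layout for this sketch). Every returned batch holds images of one size.
    """
    buckets = defaultdict(list)
    for width, height, formula in samples:
        buckets[(width, height)].append((width, height, formula))

    rng = random.Random(seed)
    batches = []
    for size in sorted(buckets):
        group = buckets[size]
        rng.shuffle(group)  # shuffle inside one size group
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])

    rng.shuffle(batches)  # shuffle the order of the pre-built batches
    return batches


samples = [
    (64, 32, "a"), (64, 32, "b"), (64, 32, "c"),
    (128, 32, "d"), (128, 32, "e"),
]
batches = build_size_bucketed_batches(samples, batch_size=2)
```

With the loader's own batch size left at 1, each loader step would then hand over one of these pre-built batches, which is consistent with the `batch_size_per_card: 1` setting under discussion.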

# The static graph model currently supports a maximum output length of 512
```
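**Note:**
- If you trained the model on your own dataset and changed the dictionary file, check that `character_dict_path` in the configuration file points to the dictionary file you need.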

Collaborator

Is character_dict_path also the parameter used here? I don't see it in the configuration file.

Contributor Author

To stay consistent with the parameter used later for export, it has been changed to rec_char_dict_path.

"""

def __call__(self, batch):
# print(len(batch))

Collaborator

Please delete the commented-out line.

Contributor Author

Fixed.

def __call__(self, batch):
    # print(len(batch))
    images, labels, attention_mask = batch[0]
    return images, labels, attention_mask

Collaborator

So you don't want automatic batching here?

Contributor Author

@liuhongen1234567 liuhongen1234567 Jul 16, 2024

Yes. The batches are already assembled once the dataset is initialized, so this only needs to return the pre-assembled batch. With conventional batching, images of different sizes would inevitably end up in the same batch, and the other images would have to be padded to the largest image size. That produces a large number of redundant image tokens and slows down both training and inference.
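
The padding cost mentioned in this comment can be made concrete with a little arithmetic. The sketch below is only an illustration: `padding_overhead` is a hypothetical helper, and it assumes every image in a batch is padded to the largest width and height in that batch.

```python
def padding_overhead(sizes):
    """Fraction of pixels that are padding when a batch is padded to its
    largest width and height. `sizes` is a list of (width, height) tuples."""
    max_w = max(w for w, _ in sizes)
    max_h = max(h for _, h in sizes)
    padded = max_w * max_h * len(sizes)   # pixels after padding
    real = sum(w * h for w, h in sizes)   # pixels that carry image content
    return 1.0 - real / padded


# A batch that mixes a 32x32 and a 64x64 image wastes part of its pixels,
# while a batch of equally sized images wastes none.
mixed = padding_overhead([(32, 32), (64, 64)])    # 0.375
grouped = padding_overhead([(64, 64), (64, 64)])  # 0.0
```

In the mixed batch, 3072 of the 8192 padded pixels are filler, while the size-grouped batch needs no padding at all.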

Contributor Author

[Image: 1721128712706(1)]

if np.random.random() < self.bitmap_prob:
img[img != 255] = 0
img = self.train_transform(image=img)["image"]
# print(img.shape)

Collaborator

Please delete the leftover comment.

Contributor Author

Fixed.

infer_img: doc/datasets/pme_demo/0000013.png
infer_mode: False
use_space_char: False
fast_tokenizer_file: ppocr/utils/dict/latex_ocr_tokenizer.json

Collaborator

The parameter name is inconsistent with the one used at export time. Please unify them and use either fast_tokenizer_file or rec_char_dict_path.

Contributor Author

@liuhongen1234567 liuhongen1234567 Jul 16, 2024

Fixed. The name is now rec_char_dict_path everywhere.

import math
import cv2
import numpy as np
import albumentations as alb

Collaborator

The official albumentations examples all use import albumentations as A. Wouldn't it be better to match the official convention here?

https://github.com/albumentations-team/albumentations/blob/main/tests/test_augmentations.py

Contributor Author

Fixed.

Collaborator

@tink2123 tink2123 left a comment

LGTM

@GreatV
Collaborator

GreatV commented Jul 17, 2024

The CLA still needs to be signed.

@liuhongen1234567
Contributor Author

> The CLA still needs to be signed.

I have already signed it, but for some reason the check keeps saying it is unsigned.
[Image: 1721198026369]

@GreatV
Collaborator

GreatV commented Jul 17, 2024

Could it be that your email isn't configured?

@jzhang533
Collaborator

> Could it be that your email isn't configured?

It is because he changed his gitconfig when committing locally, so two authors now show up: liuhongen and liuhongen1234567.

@liuhongen1234567
Contributor Author

liuhongen1234567 commented Jul 19, 2024 via email

@GreatV GreatV merged commit cf26f23 into PaddlePaddle:main Jul 22, 2024
3 checks passed
@GreatV
Collaborator

GreatV commented Jul 23, 2024

Training reports list index out of range @liuhongen1234567

python3 tools/train.py -c configs/rec/rec_latex_ocr.yml
W0723 05:12:14.900127 23183 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
list index out of range
[2024/07/23 05:12:15] ppocr INFO: train dataloader has 15556 iters
[2024/07/23 05:12:15] ppocr INFO: valid dataloader has 716 iters
[2024/07/23 05:12:15] ppocr INFO: train from scratch
[2024/07/23 05:12:15] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 60000 iterations
list index out of range
[2024/07/23 05:12:21] ppocr INFO: epoch: [1/500], global_step: 100, lr: 0.000100, edit distance: 0.878932, exp_rate: 0.000000, exp_rate<=1 : 0.443004, exp_rate<=2 : 0.641043, exp_rate<=3 : 0.791905, loss: 3.814032, avg_reader_cost: 0.00380 s, avg_batch_cost: 0.06194 s, avg_samples: 10.0, ips: 161.45872 samples/s, eta: 5 days, 13:48:46, max_mem_reserved: 3137 MB, max_mem_allocated: 2822 MB
[2024/07/23 05:12:26] ppocr INFO: epoch: [1/500], global_step: 200, lr: 0.000100, edit distance: 0.837422, exp_rate: 0.000000, exp_rate<=1 : 0.233597, exp_rate<=2 : 0.355908, exp_rate<=3 : 0.451972, loss: 2.453111, avg_reader_cost: 0.00003 s, avg_batch_cost: 0.04902 s, avg_samples: 10.0, ips: 204.00151 samples/s, eta: 4 days, 23:51:31, max_mem_reserved: 3137 MB, max_mem_allocated: 2822 MB

@liuhongen1234567
Contributor Author

> Training reports list index out of range @liuhongen1234567

This should not affect training. Here are my training logs:
paddle_latex_train.txt
paddle_latex_train_300_500.txt

@liuhongen1234567
Contributor Author

> Training reports list index out of range @liuhongen1234567

[Image: 1721714875146]
Found the source of the problem: it is this code block. Because LaTeX OCR has no data augmentation such as MakeBorderMap or MakeShrinkMap, the index obtained by the loop is None, and the code then takes index 0, which cannot be found. So in theory this error does not affect training.
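
The code in question is only visible in the screenshot, so the following is a sketch under assumptions rather than the actual PaddleOCR source. It assumes the lookup searches the configured transforms for a name and takes element 0 of the matches, which raises the reported message when nothing matches. `find_transform_index`, `set_transform_epoch`, and the `SomeLatexTransform` entry are hypothetical names used only for illustration.

```python
def find_transform_index(transforms, name):
    """Return the position of the transform called `name`, or None if absent."""
    return next((i for i, t in enumerate(transforms) if name in t), None)


def set_transform_epoch(transforms, name, epoch):
    """Set `epoch` on the named transform; return False when it is not configured."""
    index = find_transform_index(transforms, name)
    if index is None:
        return False  # nothing to update, and no exception is raised
    transforms[index][name]["epoch"] = epoch
    return True


# Hypothetical transform lists: one with MakeBorderMap, one without it.
det_transforms = [{"DecodeImage": {}}, {"MakeBorderMap": {"epoch": 0}}]
latex_transforms = [{"DecodeImage": {}}, {"SomeLatexTransform": {}}]

# Taking element 0 of an empty match list reproduces the logged message.
try:
    [i for i, t in enumerate(latex_transforms) if "MakeBorderMap" in t][0]
    message = None
except IndexError as err:
    message = str(err)

updated_det = set_transform_epoch(det_transforms, "MakeBorderMap", 3)
updated_latex = set_transform_epoch(latex_transforms, "MakeBorderMap", 3)
```

Checking for a missing transform before indexing would avoid the message entirely, instead of relying on the exception being caught and printed.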
