Latexocr paddle #13401

liuhongen1234567 · 2024-07-16T08:42:01Z

Add the LaTeX-OCR model to PaddleOCR.

CLAassistant · 2024-07-16T08:42:13Z

All committers have signed the CLA.

tink2123 · 2024-07-16T09:21:05Z

configs/rec/rec_latex_ocr.yml

@@ -0,0 +1,127 @@
+Global:
+  use_gpu: True
+  epoch_num: 500


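Please confirm: is the total number of epochs really 500?

Yes. It takes the full 500 epochs for ExpRate to reach the accuracy of the model published for the PyTorch version. The evaluation results at different training epochs are shown in the table below: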

tink2123 · 2024-07-16T09:21:50Z

configs/rec/rec_latex_ocr.yml

+  Head:
+    name: LaTeXOCRHead
+    pad_value: 0
+    ignore_index: -100


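Can the index ever be negative?

No, the LaTeX dictionary is encoded starting from 0.

The LaTeX-OCR code does not actually use the ignore_index parameter, so it has been removed.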

tink2123 · 2024-07-16T09:22:17Z

configs/rec/rec_latex_ocr.yml

+          keep_keys: ['image']
+  loader:
+    shuffle: True
+    batch_size_per_card: 1


Does the training batch size have to be 1?

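The batch size is set through batch_size_per_pair.

Here batch_size_per_card refers to the number of lists of image/formula-text pairs. It is written this way because the original LaTeX-OCR dataset first groups images by size and then draws batch_size_per_pair images from each group for training. The smallest sampling unit is therefore a group of batch_size_per_pair images, not a single image. If the smallest unit were a single image, there would be no guarantee that all images in a batch have the same size.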
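
The size-grouped batching described above can be sketched roughly as follows. This is a minimal illustration and not the dataset code from this PR: the function name `build_size_buckets`, the `(width, height, formula)` tuple layout, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def build_size_buckets(samples, batch_size_per_pair):
    """Group samples by image size, then cut each group into batches.

    samples: list of (width, height, formula) tuples (hypothetical layout).
    Returns a list of batches; every batch holds at most
    batch_size_per_pair samples, all sharing the same (width, height).
    """
    groups = defaultdict(list)
    for width, height, formula in samples:
        groups[(width, height)].append((width, height, formula))

    batches = []
    for size in sorted(groups):
        group = groups[size]
        # Slice each same-size group into chunks of batch_size_per_pair.
        for start in range(0, len(group), batch_size_per_pair):
            batches.append(group[start:start + batch_size_per_pair])
    return batches


# Hypothetical image sizes and formulas, for illustration only.
samples = [
    (64, 32, "x^2"),
    (128, 32, "\\frac{a}{b}"),
    (64, 32, "y_1"),
    (64, 32, "a+b"),
    (128, 32, "\\sqrt{x}"),
]
batches = build_size_buckets(samples, batch_size_per_pair=2)
```

Under this scheme each element of `batches` is already a complete batch, which is consistent with the loader being configured with batch_size_per_card: 1.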

tink2123 · 2024-07-16T09:25:12Z

doc/doc_ch/algorithm_rec_latex_ocr.md

+# 目前的静态图模型支持的最大输出长度为512
+```
+**注意：**
+- 如果您是在自己的数据集上训练的模型，并且调整了字典文件，请注意修改配置文件中的`character_dict_path`是否是所需要的字典文件。


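Is the character_dict_path parameter used here as well? I don't see it in the config file.

To stay consistent with the parameters used later for export, it has been renamed to rec_char_dict_path.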

tink2123 · 2024-07-16T09:26:13Z

ppocr/data/collate_fn.py

+    """
+
+    def __call__(self, batch):
+        # print(len(batch))


Remove the commented-out code.

tink2123 · 2024-07-16T09:32:22Z

ppocr/data/collate_fn.py

+    def __call__(self, batch):
+        # print(len(batch))
+        images, labels, attention_mask = batch[0]
+        return images, labels, attention_mask


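So you don't want automatic batching here?

Yes. The batches are already assembled when the dataset is initialized, so this function only needs to return the pre-assembled batch. With conventional batching, images of different sizes would inevitably end up in the same batch and would have to be padded to the largest image size, producing a large number of redundant image tokens and slowing down training and inference.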
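
To make the padding argument concrete, here is a small back-of-the-envelope sketch. The image sizes are made up for illustration and the helper names `padded_area` and `real_area` are hypothetical; the sketch only compares how many pixels are pure padding when a batch mixes sizes versus when all images share one size.

```python
def padded_area(sizes):
    """Total pixels after padding every (width, height) to the batch maximum."""
    max_w = max(w for w, _ in sizes)
    max_h = max(h for _, h in sizes)
    return max_w * max_h * len(sizes)


def real_area(sizes):
    """Total pixels that actually carry image content."""
    return sum(w * h for w, h in sizes)


# Hypothetical batches: one mixing sizes, one with a single size.
mixed_batch = [(64, 32), (512, 128), (96, 32), (448, 64)]
uniform_batch = [(64, 32)] * 4

# Fraction of each padded batch that is padding rather than image content.
mixed_waste = 1 - real_area(mixed_batch) / padded_area(mixed_batch)
uniform_waste = 1 - real_area(uniform_batch) / padded_area(uniform_batch)
```

With these example sizes, more than half of the mixed batch is padding, while the same-size batch needs none.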

tink2123 · 2024-07-16T09:33:23Z

ppocr/data/imaug/latex_ocr_aug.py

+        if np.random.random() < self.bitmap_prob:
+            img[img != 255] = 0
+        img = self.train_transform(image=img)["image"]
+        # print(img.shape)


Remove the leftover comment.

tink2123 · 2024-07-16T09:39:40Z

configs/rec/rec_latex_ocr.yml

+  infer_img: doc/datasets/pme_demo/0000013.png
+  infer_mode: False
+  use_space_char: False
+  fast_tokenizer_file:  ppocr/utils/dict/latex_ocr_tokenizer.json


The parameter name differs from the one used at export time. Please unify on either fast_tokenizer_file or rec_char_dict_path.

Fixed. It is now rec_char_dict_path everywhere.

GreatV · 2024-07-16T11:28:01Z

ppocr/data/imaug/latex_ocr_aug.py

+import math
+import cv2
+import numpy as np
+import albumentations as alb


The official albumentations examples all use import albumentations as A. Would it be better to follow that convention here?

https://github.com/albumentations-team/albumentations/blob/main/tests/test_augmentations.py

tink2123

LGTM

GreatV · 2024-07-17T06:20:42Z

The CLA needs to be signed.

liuhongen1234567 · 2024-07-17T06:34:01Z

The CLA needs to be signed.

I have already signed it, but for some reason it keeps reporting that I haven't.

GreatV · 2024-07-17T06:42:49Z

Could it be that your email isn't configured?

jzhang533 · 2024-07-17T06:44:32Z

Could it be that your email isn't configured?

It's because he changed his gitconfig when committing locally, so two authors now show up: liuhongen and liuhongen1234567.

GreatV · 2024-07-19T08:00:37Z

@liuhongen1234567 please follow this guide to fix it: https://docs.github.com/en/pull-requests/committing-changes-to-your-project/troubleshooting-commits/why-are-my-commits-linked-to-the-wrong-user

liuhongen1234567 · 2024-07-19T08:03:31Z

OK. I'm away at a team-building event; I'll fix it this Sunday or next Monday.

GreatV · 2024-07-23T05:15:17Z

Training prints "list index out of range" @liuhongen1234567

python3 tools/train.py -c configs/rec/rec_latex_ocr.yml

W0723 05:12:14.900127 23183 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
list index out of range
[2024/07/23 05:12:15] ppocr INFO: train dataloader has 15556 iters
[2024/07/23 05:12:15] ppocr INFO: valid dataloader has 716 iters
[2024/07/23 05:12:15] ppocr INFO: train from scratch
[2024/07/23 05:12:15] ppocr INFO: During the training process, after the 0th iteration, an evaluation is run every 60000 iterations
list index out of range
[2024/07/23 05:12:21] ppocr INFO: epoch: [1/500], global_step: 100, lr: 0.000100, edit distance: 0.878932, exp_rate: 0.000000, exp_rate<=1 : 0.443004, exp_rate<=2 : 0.641043, exp_rate<=3 : 0.791905, loss: 3.814032, avg_reader_cost: 0.00380 s, avg_batch_cost: 0.06194 s, avg_samples: 10.0, ips: 161.45872 samples/s, eta: 5 days, 13:48:46, max_mem_reserved: 3137 MB, max_mem_allocated: 2822 MB
[2024/07/23 05:12:26] ppocr INFO: epoch: [1/500], global_step: 200, lr: 0.000100, edit distance: 0.837422, exp_rate: 0.000000, exp_rate<=1 : 0.233597, exp_rate<=2 : 0.355908, exp_rate<=3 : 0.451972, loss: 2.453111, avg_reader_cost: 0.00003 s, avg_batch_cost: 0.04902 s, avg_samples: 10.0, ips: 204.00151 samples/s, eta: 4 days, 23:51:31, max_mem_reserved: 3137 MB, max_mem_allocated: 2822 MB

liuhongen1234567 · 2024-07-23T05:41:16Z

This should not affect training. Here are my training logs:
paddle_latex_train.txt
paddle_latex_train_300_500.txt

liuhongen1234567 · 2024-07-23T06:10:46Z

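Found the source of the problem: it is this code block. Because LaTeX-OCR does not use data augmentations such as MakeBorderMap or MakeShrinkMap, the loop yields None for the index, and the code then accesses index 0, which cannot be found. So in theory this error does not affect training.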
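
The failure mode described here can be reproduced in isolation. The sketch below is only an assumption about the shape of the code in question, since the referenced block is not reproduced in this thread: it looks up the position of a transform by name in a list of single-key dicts, indexes the result with [0], and prints any exception instead of raising it. The function name `find_transform_index`, the transform lists, and the `LatexAug` entry are hypothetical.

```python
def find_transform_index(transforms, name):
    """Return the position of the transform called `name`, or None.

    Illustrates the lookup-then-[0] pattern: when no transform matches,
    the list of candidate indices is empty, [0] raises IndexError, and the
    error is printed rather than propagated.
    """
    try:
        return [i for i, t in enumerate(transforms) if name in t][0]
    except IndexError as err:
        print(err)  # the message is "list index out of range"
        return None


# Hypothetical pipelines: one with MakeBorderMap, one without.
det_transforms = [{"DecodeImage": {}}, {"MakeBorderMap": {}}, {"MakeShrinkMap": {}}]
latex_transforms = [{"DecodeImage": {}}, {"LatexAug": {}}]

det_index = find_transform_index(det_transforms, "MakeBorderMap")
latex_index = find_transform_index(latex_transforms, "MakeBorderMap")
```

In a pattern like this the message is noise rather than a failure: the lookup simply finds nothing to update, and the caller carries on, which matches the observation that training proceeds normally after the message is printed.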

tink2123 reviewed Jul 16, 2024


GreatV reviewed Jul 16, 2024


tink2123 approved these changes Jul 17, 2024


liuhongen1234567 closed this Jul 17, 2024

liuhongen1234567 reopened this Jul 17, 2024

liuhongen1234567 added 15 commits July 22, 2024 03:10

commit_test

249c3c3

modified: configs/rec/rec_latex_ocr.yml

f5e9ec9

deleted: ppocr/modeling/backbones/rec_resnetv2.py

ntuple_solve

2614722

style

3125368

style

2d07f91

style

ca6d4c8

style

67bfe23

style

195f6d8

style

4e5fb6a

style

ea1162a

style

f66d37f

style

715027d

delete comment

beb1dde

cla_email

ac5d6c1

Merge branch 'main' of https://github.com/PaddlePaddle/PaddleOCR into…

357bdf3

… latexocr_paddle

liuhongen1234567 force-pushed the latexocr_paddle branch from 002fac8 to 357bdf3 Compare July 22, 2024 03:31

GreatV merged commit cf26f23 into PaddlePaddle:main Jul 22, 2024
3 checks passed
