Latexocr paddle #13401
@@ -0,0 +1,127 @@
Global:
  use_gpu: True
  epoch_num: 500
Please confirm: is the total number of epochs really 500?
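Yes. Training has to run for the full 500 epochs before ExpRate reaches the accuracy of the model published for the PyTorch version. The evaluation results at different training epochs are shown in the table below.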
configs/rec/rec_latex_ocr.yml
Outdated
Head:
  name: LaTeXOCRHead
  pad_value: 0
  ignore_index: -100
Can the index ever be negative?
No. The LaTeX dictionary is encoded starting from 0.
The LaTeX OCR code does not actually use the ignore_index parameter, so it has been removed.
keep_keys: ['image']
loader:
  shuffle: True
  batch_size_per_card: 1
Can the training batch size only be 1?
The batch size is set through batch_size_per_pair.
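Here batch_size_per_card is the number of image/formula-text pair lists. It is written this way because the original LaTeX-OCR dataset first groups images by size and then draws batch_size_per_pair images from a group for training. The smallest sampling unit is a group of batch_size_per_pair images, not a single image. If the smallest unit were a single image, there would be no guarantee that every image in a batch has the same size.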
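To make the grouping above concrete, here is a minimal sketch in plain Python. It is not the dataset code from this PR: the helper name `build_size_buckets` and the `(width, height, label)` sample layout are assumptions made for illustration. It only shows why the sampling unit ends up being a pre-built group of same-size pairs.

```python
from collections import defaultdict


def build_size_buckets(samples, batch_size_per_pair):
    """Group samples by image size, then split each group into batches.

    Hypothetical helper: `samples` is a list of (width, height, label)
    tuples, which is an assumed layout, not the PR's actual data format.
    """
    by_size = defaultdict(list)
    for width, height, label in samples:
        by_size[(width, height)].append(label)

    batches = []
    for size, labels in by_size.items():
        # Every batch drawn from one group shares a single image size.
        for start in range(0, len(labels), batch_size_per_pair):
            batches.append((size, labels[start:start + batch_size_per_pair]))
    return batches


samples = [
    (128, 32, "x^2"),
    (256, 64, "\\frac{a}{b}"),
    (128, 32, "y_1"),
    (128, 32, "\\alpha"),
]
batches = build_size_buckets(samples, batch_size_per_pair=2)
```

Under this reading, batch_size_per_card: 1 means the loader hands over one such pre-built batch per step, while batch_size_per_pair controls how many image/formula pairs that batch holds.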
# The static graph model currently supports a maximum output length of 512
``` | ||
**Note:**
- If you trained the model on your own dataset and changed the dictionary file, check that `character_dict_path` in the configuration file points to the dictionary file you need.
Is character_dict_path really the parameter used here? I don't see it in the config file.
To stay consistent with the parameter used later at export, it has been changed to rec_char_dict_path.
ppocr/data/collate_fn.py
Outdated
"""

def __call__(self, batch):
    # print(len(batch))
Remove the commented-out line.
Fixed.
def __call__(self, batch):
    # print(len(batch))
    images, labels, attention_mask = batch[0]
    return images, labels, attention_mask
So automatic batching is not wanted here?
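Correct. The batches are already assembled when the dataset is initialized, so this only needs to return the assembled batch. If batches were built the conventional way, images of different sizes would inevitably end up together and every image would have to be padded to the largest size in the batch. That produces a large number of redundant image tokens and slows down both training and inference.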
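As a rough illustration of the two points in this reply, the sketch below pairs a pass-through collator with a small calculation of how much of a batch is padding. The names `PassThroughCollator` and `padding_overhead` are hypothetical and the inputs are plain Python stand-ins rather than real tensors, so treat it as a sketch under those assumptions, not the code in ppocr/data/collate_fn.py.

```python
class PassThroughCollator:
    """Return the single pre-built batch unchanged (illustrative stand-in)."""

    def __call__(self, batch):
        # With a loader batch size of 1, `batch` is a list holding exactly
        # one item, and that item is already a complete batch.
        images, labels, attention_mask = batch[0]
        return images, labels, attention_mask


def padding_overhead(shapes):
    """Fraction of pixels that are padding when every (height, width)
    image is padded to the largest height and width in the batch."""
    max_h = max(h for h, _ in shapes)
    max_w = max(w for _, w in shapes)
    real = sum(h * w for h, w in shapes)
    padded = max_h * max_w * len(shapes)
    return 1.0 - real / padded


# Placeholder objects standing in for image, label and mask tensors.
pre_built = (["img_a", "img_b"], ["x^2", "y_1"], [[1, 1], [1, 1]])
out = PassThroughCollator()([pre_built])

mixed = padding_overhead([(32, 128), (64, 512)])  # sizes differ
same = padding_overhead([(32, 128), (32, 128)])   # sizes match
```

In this toy case the mixed-size batch is 43.75% padding, while the same-size batch has none, which is the saving that size-grouped batching is after.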
ppocr/data/imaug/latex_ocr_aug.py
Outdated
if np.random.random() < self.bitmap_prob:
    img[img != 255] = 0
img = self.train_transform(image=img)["image"]
# print(img.shape)
Remove the leftover commented-out line.
Fixed.
configs/rec/rec_latex_ocr.yml
Outdated
infer_img: doc/datasets/pme_demo/0000013.png
infer_mode: False
use_space_char: False
fast_tokenizer_file: ppocr/utils/dict/latex_ocr_tokenizer.json
The parameter name differs from the one used at export. Please standardize on either fast_tokenizer_file or rec_char_dict_path.
Fixed. It now uses rec_char_dict_path everywhere.
ppocr/data/imaug/latex_ocr_aug.py
Outdated
import math
import cv2
import numpy as np
import albumentations as alb
The official albumentations examples all use import albumentations as A. Wouldn't it be better to follow the official convention here?
Fixed.
LGTM
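The CLA needs to be signed.
Could it be that no email address is configured?
It is because he changed his gitconfig when committing locally, so two authors now show up: liuhongen and liuhongen1234567.
OK. I am away at a team-building event and will sort this out this Sunday or next Monday.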
---Original---
From: Wang
Date: Fri, Jul 19, 2024, 16:01
Subject: Re: [PaddlePaddle/PaddleOCR] Latexocr paddle (PR #13401)
@liuhongen1234567 Please follow this guide to resolve it: https://docs.github.com/en/pull-requests/committing-changes-to-your-project/troubleshooting-commits/why-are-my-commits-linked-to-the-wrong-user
deleted: ppocr/modeling/backbones/rec_resnetv2.py
002fac8 to 357bdf3
Training prints a message when run with python3 tools/train.py -c configs/rec/rec_latex_ocr.yml
This should not affect training. This is my training log.
add the LaTeX OCR model into PaddleOCR