Error when recognizing a PDF file #10466

Closed
archerbj opened this issue Jul 24, 2023 · 9 comments

Comments

@archerbj

Please provide the following information to quickly locate the problem

  • System Environment: Mac M1
  • Version: Paddle: (not provided), PaddleOCR: 2.6.0.1. Related components: PDF recognition
  • Command Code:
from paddleocr import PaddleOCR, draw_ocr

# Paddleocr supports Chinese, English, French, German, Korean and Japanese.
# You can set the parameter `lang` as `ch`, `en`, `fr`, `german`, `korean`, `japan`
# to switch the language model in order.
ocr = PaddleOCR(use_angle_cls=True, type="ocr", lang="ch", show_log=True, page_num=2)  # need to run only once to download and load model into memory
img_path = './data/super.pdf'
result = ocr.ocr(img_path, cls=True)
for idx in range(len(result)):
    res = result[idx]
    for line in res:
        print(line)

# draw result
import fitz
from PIL import Image
import cv2
import numpy as np
imgs = []
with fitz.open(img_path) as pdf:
    for pg in range(0, pdf.page_count):
        page = pdf[pg]
        mat = fitz.Matrix(2, 2)
        pm = page.get_pixmap(matrix=mat, alpha=False)
        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)

        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)
for idx in range(len(result)):
    res = result[idx]
    image = imgs[idx]
    boxes = [line[0] for line in res]
    txts = [line[1][0] for line in res]
    scores = [line[1][1] for line in res]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result_page_{}.jpg'.format(idx))

  • Complete Error Message:

$ python feijiang_pdf.py
[2023/07/25 03:04:48] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='/Users/baixinjiang/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_fce_box_type='poly', det_limit_side_len=960, det_limit_type='max', det_model_dir='/Users/baixinjiang/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_box_type='quad', det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, fourier_degree=5, gpu_mem=500, help='==SUPPRESS==', image_dir=None, image_orientation=False, ir_optim=True, kie_algorithm='LayoutXLM', label_list=['0', '180'], lang='ch', layout=True, layout_dict_path=None, layout_model_dir=None, layout_nms_threshold=0.5, layout_score_threshold=0.5, max_batch_size=10, max_text_length=25, merge_no_span_structure=True, min_subgraph_size=15, mode='structure', ocr=True, ocr_order_method=None, ocr_version='PP-OCRv3', output='./output', page_num=2, precision='fp32', process_id=0, rec=True, rec_algorithm='SVTR_LCNet', rec_batch_num=6, rec_char_dict_path='/Users/baixinjiang/coding/python3/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddleocr/ppocr/utils/ppocr_keys_v1.txt', rec_image_shape='3, 48, 320', rec_model_dir='/Users/baixinjiang/.paddleocr/whl/rec/ch/ch_PP-OCRv3_rec_infer', recovery=False, save_crop_res=False, 
save_log_path='./log_output/', save_pdf=False, scales=[8, 16, 32], ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ser_model_dir=None, shape_info_filename=None, show_log=True, sr_batch_num=1, sr_image_shape='3, 32, 128', sr_model_dir=None, structure_version='PP-Structurev2', table=True, table_algorithm='TableAttn', table_char_dict_path=None, table_max_len=488, table_model_dir=None, total_process_num=1, type='structure', use_angle_cls=True, use_dilation=False, use_gpu=False, use_mp=False, use_onnx=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, use_xpu=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False)
[2023/07/25 03:04:48] ppocr ERROR: error in loading image:./data/super.pdf
Traceback (most recent call last):
  File "feijiang_pdf.py", line 8, in <module>
    result = ocr.ocr(img_path, cls=True)
  File "/Users/baixinjiang/coding/python3/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddleocr/paddleocr.py", line 524, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls)
  File "/Users/baixinjiang/coding/python3/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddleocr/tools/infer/predict_system.py", line 70, in __call__
    ori_im = img.copy()
AttributeError: 'NoneType' object has no attribute 'copy'

@ToddBear
Collaborator

It looks like the image was not read. Please check whether the image path is correct.
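
One way to tell a wrong path apart from an unsupported input type is to inspect the file before calling OCR. The sketch below is illustrative only: `classify_input` is a hypothetical helper, not part of PaddleOCR, and it assumes a well-formed PDF that starts with the `%PDF-` marker.

```python
import os


def classify_input(path):
    """Return 'missing', 'pdf', or 'other' for a candidate OCR input path."""
    if not os.path.isfile(path):
        return "missing"
    # Well-formed PDF files begin with the bytes "%PDF-".
    with open(path, "rb") as f:
        header = f.read(5)
    if header == b"%PDF-":
        return "pdf"
    return "other"
```

If this returns `'pdf'` for `./data/super.pdf`, the path is fine and the failure is more likely in how the installed version handles PDF input.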

@archerbj
Author

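The path should be correct; I get the same error after changing it to an absolute path. To be clear, I want to recognize a PDF. The code is the example given in the documentation, which reads the PDF directly rather than an image. Has the ability to read PDFs directly been removed?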

@ToddBear
Collaborator

ToddBear commented Jul 25, 2023

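Reading PDFs is also supported; see section 2.1.6 of https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/quickstart.md
The command for layout recovery with a PDF as input is:
paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --use_pdf2docx_api=true

If you want to use OCR to parse a PDF from the images of its pages, you need to convert the PDF to images first, then follow the tutorial at the link above for layout recovery.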
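
The "convert the PDF to images first" step could look roughly like the sketch below. It assumes PyMuPDF (`fitz`) and NumPy are installed, reuses the 2x zoom and 2000-pixel limit from the script in the original post, and the helper names `pick_zoom`, `render_pdf_pages` and `ocr_pdf` are hypothetical, not PaddleOCR APIs. The shape of what `ocr.ocr` returns differs between PaddleOCR versions, so the per-page results are returned unmodified.

```python
def pick_zoom(width_pt, height_pt, zoom=2.0, max_side=2000):
    """Keep `zoom` unless the rendered page would exceed `max_side` pixels."""
    if width_pt * zoom > max_side or height_pt * zoom > max_side:
        return 1.0
    return zoom


def render_pdf_pages(pdf_path, zoom=2.0, max_side=2000):
    """Yield each PDF page as a BGR numpy array (the channel order OpenCV uses)."""
    # Third-party imports stay local so pick_zoom works without them.
    import fitz  # PyMuPDF
    import numpy as np

    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            z = pick_zoom(page.rect.width, page.rect.height, zoom, max_side)
            pm = page.get_pixmap(matrix=fitz.Matrix(z, z), alpha=False)
            rgb = np.frombuffer(pm.samples, dtype=np.uint8)
            rgb = rgb.reshape(pm.height, pm.width, pm.n)
            yield np.ascontiguousarray(rgb[:, :, ::-1])  # RGB -> BGR


def ocr_pdf(pdf_path, lang="ch"):
    """Run PaddleOCR on every rendered page; returns one raw result per page."""
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang=lang)
    return [ocr.ocr(img, cls=True) for img in render_pdf_pages(pdf_path)]
```

Passing arrays instead of a file path means PaddleOCR never has to open the PDF itself, which sidesteps the `error in loading image` step shown in the log.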

@archerbj
Author

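If you want to use OCR to parse a PDF from the images of its pages, you need to convert the PDF to images first, then follow the tutorial at the link above for layout recovery.

Understood. So to extract the content of a PDF as txt, I need to convert the PDF into images one page at a time and then run recognition on them, right?

Also, the command above fails with an error: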

paddleocr: error: unrecognized arguments: --use_pdf2docs_api=true

@ToddBear
Collaborator

Right. To recognize with the OCR approach, you need to convert the PDF into images page by page first.
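
For the PDF-to-txt goal, the recognized lines still have to be flattened into plain text after each page image is recognized. Here is a minimal sketch, assuming each page result is a list of `[box, (text, score)]` entries as indexed in the script from the original post (`line[1][0]` for text, `line[1][1]` for score). `result_to_text` is a hypothetical helper, and the handling of empty pages is an assumption, since what an empty page returns may vary by version.

```python
def result_to_text(pages, min_score=0.0):
    """Join recognized text from per-page OCR results into one string."""
    lines = []
    for page in pages:
        if not page:  # assumed: an empty page may come back as None or []
            continue
        for _box, (text, score) in page:
            if score >= min_score:
                lines.append(text)
    return "\n".join(lines)
```

Writing the returned string to disk with `open("out.txt", "w", encoding="utf-8")` keeps Chinese text intact on platforms whose default encoding is not UTF-8.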

@ToddBear
Collaborator

For questions about using a PDF as input, see the following document; it seems to require a paddleocr version greater than 2.6:
https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/recovery/README_ch.md
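
Since the answer depends on the installed release, it can help to print the exact version before debugging further. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). `version_tuple` and `installed_paddleocr` are hypothetical helpers, and the thread does not settle which release first accepts PDF input, so no threshold is hard-coded.

```python
from importlib.metadata import PackageNotFoundError, version


def version_tuple(text):
    """Turn '2.6.1.3' into (2, 6, 1, 3); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def installed_paddleocr():
    """Return the installed paddleocr version string, or None if absent."""
    try:
        return version("paddleocr")
    except PackageNotFoundError:
        return None
```

Tuples compare element by element, so `version_tuple("2.6.0.1") < version_tuple("2.6.1")` holds, which a plain string comparison does not guarantee in general.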

@liugddx

liugddx commented Aug 10, 2023

For questions about using a PDF as input, see the following document; it seems to require a paddleocr version greater than 2.6: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/recovery/README_ch.md

The same problem still occurs with paddleocr==2.6.1, using the example from https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/quickstart.md#12

Contributor

github-actions bot commented Jan 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@ATP-BME

ATP-BME commented Mar 6, 2024

The error still occurs after converting the PDF to images.
` 14 img = cv2.imread(img_path)
---> 15 result = table_engine(img)
16 save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
18 for line in result:

File d:\apps\miniconda3\envs\GNN\lib\site-packages\paddleocr\paddleocr.py:759, in PPStructure.__call__(self, img, return_ocr_result_in_table, img_idx)
757 def __call__(self, img, return_ocr_result_in_table=False, img_idx=0):
758 img = check_img(img)
--> 759 res, _ = super().__call__(
760 img, return_ocr_result_in_table, img_idx=img_idx)
761 return res

File d:\apps\miniconda3\envs\GNN\lib\site-packages\paddleocr\ppstructure\predict_system.py:110, in StructureSystem.__call__(self, img, return_ocr_result_in_table, img_idx)
108 time_dict['image_orientation'] = toc - tic
109 if self.mode == 'structure':
--> 110 ori_im = img.copy()
111 if self.layout_predictor is not None:
112 layout_res, elapse = self.layout_predictor(img)

AttributeError: 'NoneType' object has no attribute 'copy'`

Code being run:
`import os
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"  # #4613
import cv2
from paddleocr import PPStructure, save_structure_res
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# Chinese test image
table_engine = PPStructure(recovery=True, table=False)

# English test image
table_engine = PPStructure(recovery=True, lang='en')

save_folder = './output'
img_path = "E:/project/LM/知识注入/pdf/test/牛津精神病学教科书(第五版)/导出页面自 [牛津精神病学教科书(第五版)].(Shorter.Oxford.Textbook.of.Psychiatry).格尔德.扫描版_Page1.png" #"E:/project/LM/知识注入/pdf/test/牛津精神病学教科书(第五版)"
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

h, w, _ = img.shape
res = sorted_layout_boxes(result, w)
convert_info_docx(img, res, save_folder, os.path.basename(img_path).split('.')[0])`
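
In the traceback above, `img` is already `None` when it reaches `img.copy()`, and `cv2.imread` returns `None` instead of raising when it cannot read a file. One possible cause, not confirmed by anyone in this thread, is that `cv2.imread` can fail on Windows paths containing non-ASCII characters such as the Chinese directory names used here. A commonly used workaround is to read the bytes with NumPy and decode them with `cv2.imdecode`. The helper names below are hypothetical.

```python
def has_non_ascii(path):
    """True if the path contains characters outside the ASCII range."""
    return not path.isascii()


def imread_unicode(path):
    """Load an image from a path that may contain non-ASCII characters."""
    # Imports stay local so has_non_ascii works without OpenCV or NumPy.
    import cv2
    import numpy as np

    data = np.fromfile(path, dtype=np.uint8)  # raw file bytes
    img = cv2.imdecode(data, cv2.IMREAD_COLOR)  # decode to a BGR array
    if img is None:
        # Fail here with a clear message instead of deep inside PaddleOCR.
        raise ValueError(f"could not decode image: {path}")
    return img
```

Replacing `cv2.imread(img_path)` with `imread_unicode(img_path)` would at least turn a silent `None` into an explicit error at the point where loading fails.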
