Error when recognizing a PDF file #10466

archerbj · 2023-07-24T19:07:45Z

Please provide the following information to quickly locate the problem

System Environment: Mac M1
Version: Paddle: PaddleOCR: 2.6.0.1
Related components: PDF recognition
Command Code:

from paddleocr import PaddleOCR, draw_ocr

# Paddleocr supports Chinese, English, French, German, Korean and Japanese.
# You can set the parameter `lang` as `ch`, `en`, `fr`, `german`, `korean`, `japan`
# to switch the language model in order.
ocr = PaddleOCR(use_angle_cls=True, type="ocr",lang="ch", show_log=True,page_num=2)  # need to run only once to download and load model into memory
img_path = './data/super.pdf'
result = ocr.ocr(img_path, cls=True)
for idx in range(len(result)):
    res = result[idx]
    for line in res:
        print(line)

# draw result
import fitz
from PIL import Image
import cv2
import numpy as np
imgs = []
with fitz.open(img_path) as pdf:
    for pg in range(0, pdf.pageCount):
        page = pdf[pg]
        mat = fitz.Matrix(2, 2)
        pm = page.getPixmap(matrix=mat, alpha=False)
        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.getPixmap(matrix=fitz.Matrix(1, 1), alpha=False)

        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)
for idx in range(len(result)):
    res = result[idx]
    image = imgs[idx]
    boxes = [line[0] for line in res]
    txts = [line[1][0] for line in res]
    scores = [line[1][1] for line in res]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result_page_{}.jpg'.format(idx))

Complete Error Message:


$ python feijiang_pdf.py
[2023/07/25 03:04:48] ppocr DEBUG: Namespace(alpha=1.0, benchmark=False, beta=1.0, cls_batch_num=6, cls_image_shape='3, 48, 192', cls_model_dir='/Users/baixinjiang/.paddleocr/whl/cls/ch_ppocr_mobile_v2.0_cls_infer', cls_thresh=0.9, cpu_threads=10, crop_res_save_dir='./output', det=True, det_algorithm='DB', det_db_box_thresh=0.6, det_db_score_mode='fast', det_db_thresh=0.3, det_db_unclip_ratio=1.5, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_east_score_thresh=0.8, det_fce_box_type='poly', det_limit_side_len=960, det_limit_type='max', det_model_dir='/Users/baixinjiang/.paddleocr/whl/det/ch/ch_PP-OCRv3_det_infer', det_pse_box_thresh=0.85, det_pse_box_type='quad', det_pse_min_area=16, det_pse_scale=1, det_pse_thresh=0, det_sast_nms_thresh=0.2, det_sast_polygon=False, det_sast_score_thresh=0.5, draw_img_save_dir='./inference_results', drop_score=0.5, e2e_algorithm='PGNet', e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_limit_side_len=768, e2e_limit_type='max', e2e_model_dir=None, e2e_pgnet_mode='fast', e2e_pgnet_score_thresh=0.5, e2e_pgnet_valid_set='totaltext', enable_mkldnn=False, fourier_degree=5, gpu_mem=500, help='==SUPPRESS==', image_dir=None, image_orientation=False, ir_optim=True, kie_algorithm='LayoutXLM', label_list=['0', '180'], lang='ch', layout=True, layout_dict_path=None, layout_model_dir=None, layout_nms_threshold=0.5, layout_score_threshold=0.5, max_batch_size=10, max_text_length=25, merge_no_span_structure=True, min_subgraph_size=15, mode='structure', ocr=True, ocr_order_method=None, ocr_version='PP-OCRv3', output='./output', page_num=2, precision='fp32', process_id=0, rec=True, rec_algorithm='SVTR_LCNet', rec_batch_num=6, rec_char_dict_path='/Users/baixinjiang/coding/python3/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddleocr/ppocr/utils/ppocr_keys_v1.txt', rec_image_shape='3, 48, 320', rec_model_dir='/Users/baixinjiang/.paddleocr/whl/rec/ch/ch_PP-OCRv3_rec_infer', recovery=False, save_crop_res=False, 
save_log_path='./log_output/', save_pdf=False, scales=[8, 16, 32], ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ser_model_dir=None, shape_info_filename=None, show_log=True, sr_batch_num=1, sr_image_shape='3, 32, 128', sr_model_dir=None, structure_version='PP-Structurev2', table=True, table_algorithm='TableAttn', table_char_dict_path=None, table_max_len=488, table_model_dir=None, total_process_num=1, type='structure', use_angle_cls=True, use_dilation=False, use_gpu=False, use_mp=False, use_onnx=False, use_pdserving=False, use_space_char=True, use_tensorrt=False, use_xpu=False, vis_font_path='./doc/fonts/simfang.ttf', warmup=False)
[2023/07/25 03:04:48] ppocr ERROR: error in loading image:./data/super.pdf
Traceback (most recent call last):
  File "feijiang_pdf.py", line 8, in <module>
    result = ocr.ocr(img_path, cls=True)
  File "/Users/baixinjiang/coding/python3/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddleocr/paddleocr.py", line 524, in ocr
    dt_boxes, rec_res, _ = self.__call__(img, cls)
  File "/Users/baixinjiang/coding/python3/miniconda3/envs/paddle_env/lib/python3.8/site-packages/paddleocr/tools/infer/predict_system.py", line 70, in __call__
    ori_im = img.copy()
AttributeError: 'NoneType' object has no attribute 'copy'


ToddBear · 2023-07-25T02:03:04Z

It looks like the image was not read. Please check whether the image path is correct.
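Before calling the OCR engine, it can help to separate a wrong path from an input type the loader does not handle. The following is a minimal sketch; `classify_input` and its extension list are hypothetical helpers for illustration and are not part of PaddleOCR.

```python
import os

# Hypothetical extension list for illustration; not taken from PaddleOCR.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff"}


def classify_input(path):
    """Return 'missing', 'pdf', 'image', or 'unknown' for a local path."""
    if not os.path.isfile(path):
        return "missing"
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return "pdf"
    if ext in IMAGE_EXTS:
        return "image"
    return "unknown"


# Path taken from the report above; the result depends on the local filesystem.
print(classify_input("./data/super.pdf"))
```

If this reports `pdf` and the engine still logs `error in loading image`, the path itself is probably fine and the problem is more likely in how the installed version handles PDF input.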

archerbj · 2023-07-25T02:09:40Z

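The path should be correct; I get the same error after switching to an absolute path. To clarify, I want to recognize a PDF. The code is the example from the documentation, which reads the PDF directly rather than an image. Has the ability to read PDFs directly been removed?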

ToddBear · 2023-07-25T02:31:01Z

Reading PDFs is also supported; see section 2.1.6 of https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/quickstart.md
The command for layout recovery with a PDF as input is:
paddleocr --image_dir=ppstructure/recovery/UnrealText.pdf --type=structure --recovery=true --use_pdf2docx_api=true

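If you want to use OCR to parse the PDF from the images of its pages, you first need to convert the PDF to images and then follow the tutorial linked above for layout recovery.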

archerbj · 2023-07-25T02:42:33Z

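> If you want to use OCR to parse the PDF from the images of its pages, you first need to convert the PDF to images and then follow the tutorial linked above for layout recovery.

Understood. So to extract the content of a PDF as txt, I need to convert the PDF into images page by page first and then run recognition on them, right?

Also, the command above fails with: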

paddleocr: error: unrecognized arguments: --use_pdf2docs_api=true

ToddBear · 2023-07-25T02:44:02Z

Yes. Recognition with the OCR approach requires converting the PDF to images page by page first.
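The page-by-page conversion can be sketched roughly as follows, assuming a recent PyMuPDF release that provides the snake_case method `get_pixmap`. The helper names `pick_zoom` and `pdf_to_bgr_images` are hypothetical; the 2x zoom and the 2000-pixel cap mirror the drawing code in the original report.

```python
def pick_zoom(width_pt, height_pt, zoom=2.0, max_side=2000):
    """Return `zoom` unless the rendered page would exceed `max_side` pixels."""
    if width_pt * zoom > max_side or height_pt * zoom > max_side:
        return 1.0
    return zoom


def pdf_to_bgr_images(pdf_path, zoom=2.0, max_side=2000):
    """Render every page of a PDF to a BGR numpy array (OpenCV channel order)."""
    # Lazy imports: only needed when a PDF is actually rendered.
    import fitz  # PyMuPDF
    import numpy as np

    images = []
    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            z = pick_zoom(page.rect.width, page.rect.height, zoom, max_side)
            pix = page.get_pixmap(matrix=fitz.Matrix(z, z), alpha=False)
            # With alpha=False and the default RGB colorspace, samples hold 3 bytes per pixel.
            rgb = np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                pix.height, pix.width, 3)
            images.append(rgb[:, :, ::-1].copy())  # RGB -> BGR
    return images


# Usage sketch (requires paddleocr and PyMuPDF to be installed):
# from paddleocr import PaddleOCR
# ocr = PaddleOCR(use_angle_cls=True, lang="ch")
# for page_img in pdf_to_bgr_images("./data/super.pdf"):
#     result = ocr.ocr(page_img, cls=True)
```

Passing arrays instead of a file path sidesteps the loader's own file handling, so a failure at this stage points to the rendering step and not to the OCR engine.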

ToddBear · 2023-07-25T02:49:57Z

For questions about using a PDF as input, see the following document. It appears to require a paddleocr version newer than 2.6:
https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/recovery/README_ch.md
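Because this advice depends on the installed version, a quick check from Python may be useful. This is a minimal sketch: the helper names are hypothetical, and the baseline `2.6` is only the threshold suggested in this thread, not a documented requirement.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def version_tuple(text):
    """Turn '2.6.0.1' into (2, 6, 0, 1), keeping only each part's leading digits."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def is_newer(installed, baseline):
    """True if `installed` is strictly newer than `baseline` (zero-padded compare)."""
    a, b = version_tuple(installed), version_tuple(baseline)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a > b


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("paddleocr")
if found is None:
    print("paddleocr is not installed")
else:
    print(found, "newer than 2.6:", is_newer(found, "2.6"))
```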

liugddx · 2023-08-10T03:13:34Z

> For questions about using a PDF as input, see the following document. It appears to require a paddleocr version newer than 2.6: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/recovery/README_ch.md

paddleocr==2.6.1 still has the same problem, using the example from https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/doc/doc_ch/quickstart.md#12

github-actions · 2024-01-03T02:42:07Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

ATP-BME · 2024-03-06T05:52:21Z

The error still occurs after converting the PDF to images
` 14 img = cv2.imread(img_path)
---> 15 result = table_engine(img)
16 save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
18 for line in result:

File d:\apps\miniconda3\envs\GNN\lib\site-packages\paddleocr\paddleocr.py:759, in PPStructure.__call__(self, img, return_ocr_result_in_table, img_idx)
757 def __call__(self, img, return_ocr_result_in_table=False, img_idx=0):
758 img = check_img(img)
--> 759 res, _ = super().__call__(
760 img, return_ocr_result_in_table, img_idx=img_idx)
761 return res

File d:\apps\miniconda3\envs\GNN\lib\site-packages\paddleocr\ppstructure\predict_system.py:110, in StructureSystem.__call__(self, img, return_ocr_result_in_table, img_idx)
108 time_dict['image_orientation'] = toc - tic
109 if self.mode == 'structure':
--> 110 ori_im = img.copy()
111 if self.layout_predictor is not None:
112 layout_res, elapse = self.layout_predictor(img)

AttributeError: 'NoneType' object has no attribute 'copy'`

Code that was run:
`import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE" # #4613
import cv2
from paddleocr import PPStructure, save_structure_res
from paddleocr.ppstructure.recovery.recovery_to_doc import sorted_layout_boxes, convert_info_docx

# Chinese test image
table_engine = PPStructure(recovery=True, table=False)
# English test image
# table_engine = PPStructure(recovery=True, lang='en')

save_folder = './output'
img_path = "E:/project/LM/知识注入/pdf/test/牛津精神病学教科书(第五版)/导出页面自 [牛津精神病学教科书(第五版)].(Shorter.Oxford.Textbook.of.Psychiatry).格尔德.扫描版_Page1.png" #"E:/project/LM/知识注入/pdf/test/牛津精神病学教科书(第五版)"
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

h, w, _ = img.shape
res = sorted_layout_boxes(result, w)
convert_info_docx(img, res, save_folder, os.path.basename(img_path).split('.')[0])`
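One possible cause here, not confirmed anywhere in this thread, is that `cv2.imread` returns `None` instead of raising when it cannot read a file, and on Windows this commonly happens with paths containing non-ASCII characters. A minimal sketch of a workaround reads the bytes with NumPy and decodes them with `cv2.imdecode`; the helper names `require_image` and `imread_any_path` are hypothetical.

```python
def require_image(img, path):
    """Fail early with a readable message instead of a later AttributeError."""
    if img is None:
        raise FileNotFoundError(f"could not decode an image from: {path!r}")
    return img


def imread_any_path(path):
    """Read an image without handing the path string to OpenCV."""
    # Lazy imports: only needed when the reader is actually called.
    import cv2
    import numpy as np

    # NumPy opens the file itself, so OpenCV only ever sees raw bytes.
    data = np.fromfile(path, dtype=np.uint8)
    return require_image(cv2.imdecode(data, cv2.IMREAD_COLOR), path)


# Usage sketch, replacing `img = cv2.imread(img_path)`:
# img = imread_any_path(img_path)
# result = table_engine(img)
```

Even if the path turns out not to be the cause, the explicit check makes the failure show up where the image is loaded, not deep inside `predict_system.py`.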

ToddBear mentioned this issue Aug 23, 2023

🏅️ PaddlePaddle Toolkit Happy Open Source Regular Season #10223

Closed

github-actions bot added the stale label Jan 3, 2024

github-actions bot closed this as completed Jan 24, 2024

paddle-bot bot added the status/close label Jan 24, 2024
