Occasional error where an image from the PDF cannot be found, after which the program exits #674

WXpiero · 2024-09-28T23:01:35Z

Description of the bug

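My machine is weak, with only 40 GB of RAM, and I worry about running out of memory partway through parsing. So when parsing PDFs of 5,000+ pages, I first split the PDF into smaller files of 80 pages each and then parse those with MAGIC-PDF. Across the large batch of files, the console output occasionally shows an image-not-found error like the one in the log below. Whenever this error occurs, no layout or markdown files are output for that PDF.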
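For reference, the splitting step can be sketched roughly as follows. This is a minimal sketch, not the exact tool used for this report: it assumes PyMuPDF (`fitz`) is installed, and the names `chunk_ranges`, `split_pdf`, and the `_partNNN` file suffix are hypothetical.

```python
from pathlib import Path


def chunk_ranges(page_count: int, chunk_size: int = 80) -> list[tuple[int, int]]:
    """Return inclusive, zero-based (first, last) page ranges of at most chunk_size pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, page_count) - 1)
        for start in range(0, page_count, chunk_size)
    ]


def split_pdf(src: Path, out_dir: Path, chunk_size: int = 80) -> list[Path]:
    """Write src as consecutive chunk files into out_dir and return their paths."""
    import fitz  # PyMuPDF (assumed installed); imported lazily so chunk_ranges works without it

    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(str(src)) as doc:
        for index, (first, last) in enumerate(chunk_ranges(doc.page_count, chunk_size)):
            part = fitz.open()  # new, empty PDF
            part.insert_pdf(doc, from_page=first, to_page=last)
            # Hypothetical naming scheme; a short suffix keeps later output paths short.
            target = out_dir / f"{src.stem}_part{index:03d}.pdf"
            part.save(str(target))
            part.close()
            written.append(target)
    return written
```

A 5,000-page book with `chunk_size=80` yields 63 chunk files, the last one holding 40 pages.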
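I don't know whether this is caused by my splitting the PDF. I split one book into 3 batches of small PDFs, and in some cases none of the PDFs in one batch produce any output.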
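One thing worth ruling out is path length. Splitting makes the file names longer, and the failing image path in the log below is very long. Windows has a classic 260-character path limit (MAX_PATH) unless long-path support is enabled, and exceeding it can make `open()` fail with `[Errno 2]`. This is only a hypothesis, not a confirmed cause. The sketch below estimates the length of an image path, assuming the layout seen in the log (output directory, PDF name, `auto`, `images`, then a 64-character hash plus `.jpg`). The function names are hypothetical.

```python
from pathlib import PureWindowsPath

# Classic Windows limit; it includes the terminating NUL, so 259 usable characters.
WINDOWS_MAX_PATH = 260

# Image names in the log are a 64-character hex digest plus ".jpg".
PLACEHOLDER_IMAGE_NAME = "0" * 64 + ".jpg"


def image_path_length(output_dir: str, pdf_stem: str, method: str = "auto") -> int:
    """Length of output_dir / pdf_stem / method / images / <hash>.jpg as a Windows path."""
    path = PureWindowsPath(output_dir, pdf_stem, method, "images", PLACEHOLDER_IMAGE_NAME)
    return len(str(path))


def exceeds_max_path(output_dir: str, pdf_stem: str, method: str = "auto") -> bool:
    """True if the estimated image path would not fit within the classic limit."""
    return image_path_length(output_dir, pdf_stem, method) >= WINDOWS_MAX_PATH
```

If `exceeds_max_path` returns `True` for the chunks that fail and `False` for the ones that succeed, a shorter output directory or shorter chunk file names would be a cheap experiment.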

How to reproduce the bug

0: 1888x1472 (no detections), 5348.0ms
Speed: 31.3ms preprocess, 5348.0ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-29 06:12:10.477 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0
2024-09-29 06:12:44.351 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 33.87
2024-09-29 06:13:05.565 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 21.21

0: 1888x1472 (no detections), 5281.3ms
Speed: 47.3ms preprocess, 5281.3ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-29 06:13:10.893 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 0, mfr time: 0.0
2024-09-29 06:13:28.802 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 17.91
2024-09-29 06:13:49.682 | INFO | magic_pdf.model.pdf_extract_kit:call:259 - layout detection cost: 20.88

0: 1888x1472 1 embedding, 5265.1ms
Speed: 31.3ms preprocess, 5265.1ms inference, 0.0ms postprocess per image at shape (1, 3, 1888, 1472)
2024-09-29 06:13:57.233 | INFO | magic_pdf.model.pdf_extract_kit:call:289 - formula nums: 1, mfr time: 2.24
2024-09-29 06:14:37.120 | INFO | magic_pdf.model.pdf_extract_kit:call:372 - ocr cost: 39.87
2024-09-29 06:14:37.120 | INFO | magic_pdf.model.doc_analyze_by_custom_model:doc_analyze:136 - doc analyze cost: 4823.803959131241
2024-09-29 06:14:37.827 | INFO | magic_pdf.pdf_parse_union_core:pdf_parse_union:242 - page_id: 0, last_page_cost_time: 0.0
2024-09-29 06:14:37.859 | ERROR | magic_pdf.user_api:parse_pdf:91 - [Errno 2] No such file or directory: 'C:\download\pdf\11\222\output\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2_1total\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2_split_9\auto\images\e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg'
Traceback (most recent call last):

File "C:\Users\wxpie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
│ │ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "c:\ai\pdf_mark\venv\Scripts\m...
│ └ <code object at 0x0000016A15988240, file "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe_main.py", line 1>
└ <function _run_code at 0x0000016A1595E560>

File "C:\Users\wxpie\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
exec(code, run_globals)
│ └ {'name': 'main', 'doc': None, 'package': '', 'loader': <zipimporter object "c:\ai\pdf_mark\venv\Scripts\m...
└ <code object at 0x0000016A15988240, file "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe_main.py", line 1>

File "c:\ai\pdf_mark\venv\Scripts\magic-pdf.exe\__main__.py", line 7, in <module>
sys.exit(cli())
│ │ └
│ └
└ <module 'sys' (built-in)>

File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1157, in __call__
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x0000016A15DE9E10>
└

File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1078, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x0000016A159C4EB0>
│ └ <function Command.invoke at 0x0000016A15DEA8C0>
└

File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
│ │ │ │ │ └ {'path': 'C:\download\pdf\11\222\test\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2', 'outp...
│ │ │ │ └ <click.core.Context object at 0x0000016A159C4EB0>
│ │ │ └ <function cli at 0x0000016A5C782200>
│ │ └
│ └ <function Context.invoke at 0x0000016A15DE9630>
└ <click.core.Context object at 0x0000016A159C4EB0>

File "c:\ai\pdf_mark\venv\lib\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
│ └ {'path': 'C:\download\pdf\11\222\test\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2', 'outp...
└ ()

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\cli.py", line 100, in cli
parse_doc(doc_path)
│ └ WindowsPath('C:/download/pdf/11/222/test/Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2/Braunwald_...
└ <function cli.<locals>.parse_doc at 0x0000016A159CCB80>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\cli.py", line 84, in parse_doc
do_parse(
└ <function do_parse at 0x0000016A5C781990>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\tools\common.py", line 85, in do_parse
pipe.pipe_parse()
│ └ <function UNIPipe.pipe_parse at 0x0000016A5C781BD0>
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pipe\UNIPipe.py", line 38, in pipe_parse
self.pdf_mid_data = parse_union_pdf(self.pdf_bytes, self.model_list, self.image_writer,
│ │ │ │ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
│ │ │ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
│ │ │ │ │ │ └ [{'layout_dets': [{'category_id': 0, 'poly': [10.797889709472656, 1376.1629638671875, 579.42041015625, 1376.1629638671875, 57...
│ │ │ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
│ │ │ │ └ b'%PDF-1.7\n%\xc2\xb5\xc2\xb6\n\n1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n\n2 0 obj\n<</Type/Pages/Count 80/Kids[37 0 ...
│ │ │ └ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>
│ │ └ <function parse_union_pdf at 0x0000016A5C7811B0>
│ └ None
└ <magic_pdf.pipe.UNIPipe.UNIPipe object at 0x0000016A7CC3F280>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\user_api.py", line 101, in parse_union_pdf
pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
│ └ <function parse_pdf_by_ocr at 0x0000016A3A3B9D80>
└ <function parse_union_pdf.<locals>.parse_pdf at 0x0000016A9EC7BC70>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\user_api.py", line 82, in parse_pdf
return method(
└ <function parse_pdf_by_ocr at 0x0000016A3A3B9D80>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pdf_parse_by_ocr.py", line 11, in parse_pdf_by_ocr
return pdf_parse_union(pdf_bytes,
│ └ b'%PDF-1.7\n%\xc2\xb5\xc2\xb6\n\n1 0 obj\n<</Type/Catalog/Pages 2 0 R>>\nendobj\n\n2 0 obj\n<</Type/Pages/Count 80/Kids[37 0 ...
└ <function pdf_parse_union at 0x0000016A5C780EE0>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pdf_parse_union_core.py", line 249, in pdf_parse_union
page_info = parse_page_core(pdf_docs, magic_model, page_id, pdf_bytes_md5, imageWriter, parse_mode)
│ │ │ │ │ │ └ 'ocr'
│ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
│ │ │ │ └ 'B7BDFC834A92FCBCFEB8ACF7B733A395'
│ │ │ └ 0
│ │ └ <magic_pdf.model.magic_model.MagicModel object at 0x0000016A791465F0>
│ └ Document('', <memory, doc# 49>)
└ <function parse_page_core at 0x0000016A5C780E50>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pdf_parse_union_core.py", line 128, in parse_page_core
spans = ocr_cut_image_and_table(spans, pdf_docs[page_id], page_id, pdf_bytes_md5, imageWriter)
│ │ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>
│ │ │ │ │ └ 'B7BDFC834A92FCBCFEB8ACF7B733A395'
│ │ │ │ └ 0
│ │ │ └ 0
│ │ └ Document('', <memory, doc# 49>)
│ └ [{'bbox': [3, 64, 596, 133], 'score': 0.993672788143158, 'type': 'table'}, {'bbox': [298, 538, 326, 553], 'score': 0.87, 'con...
└ <function ocr_cut_image_and_table at 0x0000016A5C766290>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\pre_proc\cut_image.py", line 22, in ocr_cut_image_and_table
span['image_path'] = cut_image(span['bbox'], page_id, page, return_path=return_path('tables'),
│ │ │ │ │ └ <function ocr_cut_image_and_table.<locals>.return_path at 0x0000016AA39440D0>
│ │ │ │ └ page 0 of <memory, doc# 49>
│ │ │ └ 0
│ │ └ {'bbox': [3, 64, 596, 133], 'score': 0.993672788143158, 'type': 'table'}
│ └ <function cut_image at 0x0000016A5C766440>
└ {'bbox': [3, 64, 596, 133], 'score': 0.993672788143158, 'type': 'table'}

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\libs\pdf_image_tools.py", line 31, in cut_image
imageWriter.write(byte_data, img_hash256_path, AbsReaderWriter.MODE_BIN)
│ │ │ │ │ └ 'binary'
│ │ │ │ └ <class 'magic_pdf.rw.AbsReaderWriter.AbsReaderWriter'>
│ │ │ └ 'e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg'
│ │ └ b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x01\x00\x00\x00\x00\xff\xdb\x00C\x00\x02\x01\x01\x01\x01\x01\x02\x01\x01\x01\x02...
│ └ <function DiskReaderWriter.write at 0x0000016A17D7D120>
└ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x0000016A791456F0>

File "c:\ai\pdf_mark\venv\lib\site-packages\magic_pdf\rw\DiskReaderWriter.py", line 41, in write
with open(abspath, "wb") as f:
└ 'C:\download\pdf\11\222\output\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2_1total\Braunw...

FileNotFoundError: [Errno 2] No such file or directory: 'C:\download\pdf\11\222\output\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2_1total\Braunwald_s_Heart_Disease__A_Textbook_of_Cardiovascular_Medicine__split_2_split_9\auto\images\e6a42f9b1b5b49c9f8f6810f6a1f8e562f0c407c3c2b88e94166e9c1839b83b8.jpg'
2024-09-29 06:14:37.923 | ERROR | magic_pdf.tools.cli:parse_doc:96 - Both parse_pdf_by_txt and parse_pdf_by_ocr failed.
Traceback (most recent call last):