
fix: Fix images in old Word documents not being recognized correctly #1533 #1559


Merged
merged 1 commit into from
Nov 6, 2024
16 changes: 7 additions & 9 deletions apps/common/handle/impl/doc_split_handle.py
@@ -14,9 +14,9 @@
 from typing import List
 
 from docx import Document, ImagePart
+from docx.oxml import ns
 from docx.table import Table
 from docx.text.paragraph import Paragraph
-from docx.oxml import ns
 
 from common.handle.base_split_handle import BaseSplitHandle
 from common.util.split_model import SplitModel
@@ -33,11 +33,8 @@
 combine_nsmap = {**ns.nsmap, **old_docx_nsmap}
 
 
-def image_to_mode(image, doc: Document, images_list, get_image_id, is_new_docx=True):
-    if is_new_docx:
-        image_ids = image.xpath('.//a:blip/@r:embed')
-    else:
-        image_ids = image.xpath('.//v:imagedata/@r:id', namespaces=combine_nsmap)
+def image_to_mode(image, doc: Document, images_list, get_image_id):
+    image_ids = image['get_image_id_handle'](image.get('image'))
     for img_id in image_ids:  # iterate over the image ids
         part = doc.part.related_parts[img_id]  # look up the image part for this id
         if isinstance(part, ImagePart):
@@ -49,14 +46,15 @@ def image_to_mode(image, doc: Document, images_list, get_image_id, is_new_docx=T
 
 
 def get_paragraph_element_images(paragraph_element, doc: Document, images_list, get_image_id):
-    images_xpath_list = [".//pic:pic", ".//w:pict"]
+    images_xpath_list = [(".//pic:pic", lambda img: img.xpath('.//a:blip/@r:embed')),
+                         (".//w:pict", lambda img: img.xpath('.//v:imagedata/@r:id', namespaces=combine_nsmap))]
     images = []
-    for images_xpath in images_xpath_list:
+    for images_xpath, get_image_id_handle in images_xpath_list:
         try:
             _images = paragraph_element.xpath(images_xpath)
             if _images is not None and len(_images) > 0:
                 for image in _images:
-                    images.append(image)
+                    images.append({'image': image, 'get_image_id_handle': get_image_id_handle})
         except Exception as e:
             pass
     return images
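
The change above pairs each image XPath with the function that extracts relationship ids from the matched element, so `image_to_mode` no longer needs an `is_new_docx` flag. The following is a minimal, standalone sketch of that pairing, not the project's code: it uses the standard library's `xml.etree.ElementTree` instead of python-docx/lxml, and the names `NS`, `IMAGE_HANDLERS`, `blip_ids`, `imagedata_ids`, `collect_image_ids` and the sample XML are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix map standing in for the PR's combine_nsmap (assumption: the usual
# OOXML namespaces plus the VML namespace used by legacy documents).
NS = {
    "w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
    "a": "http://schemas.openxmlformats.org/drawingml/2006/main",
    "pic": "http://schemas.openxmlformats.org/drawingml/2006/picture",
    "r": "http://schemas.openxmlformats.org/officeDocument/2006/relationships",
    "v": "urn:schemas-microsoft-com:vml",
}
R = "{%s}" % NS["r"]  # ElementTree spells namespaced attributes as {uri}name


def blip_ids(pic):
    """Relationship ids of DrawingML images (<a:blip r:embed="...">)."""
    return [blip.get(R + "embed") for blip in pic.findall(".//a:blip", NS)]


def imagedata_ids(pict):
    """Relationship ids of legacy VML images (<v:imagedata r:id="...">)."""
    return [data.get(R + "id") for data in pict.findall(".//v:imagedata", NS)]


# Each container path travels with its own id extractor, so callers never
# need to know which document flavour produced the element.
IMAGE_HANDLERS = [(".//pic:pic", blip_ids), (".//w:pict", imagedata_ids)]


def collect_image_ids(paragraph):
    """Return the relationship ids of every image found in a paragraph."""
    ids = []
    for path, get_ids in IMAGE_HANDLERS:
        for element in paragraph.findall(path, NS):
            ids.extend(i for i in get_ids(element) if i is not None)
    return ids


# Hypothetical paragraph holding one image of each kind.
XMLNS = " ".join('xmlns:%s="%s"' % (prefix, uri) for prefix, uri in NS.items())
SAMPLE = (
    "<w:p %s>"
    '<w:r><w:drawing><pic:pic><pic:blipFill><a:blip r:embed="rId5"/>'
    "</pic:blipFill></pic:pic></w:drawing></w:r>"
    '<w:r><w:pict><v:shape><v:imagedata r:id="rId7"/></v:shape></w:pict></w:r>'
    "</w:p>" % XMLNS
)

paragraph = ET.fromstring(SAMPLE)
print(collect_image_ids(paragraph))  # ['rId5', 'rId7']
```

Supporting a further image container in this sketch would mean appending one `(path, extractor)` pair to `IMAGE_HANDLERS` rather than adding another branch.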


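This code mainly handles Word images (both Blip images and image data) in a function named image_to_mode. Revision notes as of the current date:

  1. All import statements are placed at the top.
  2. An incorrect comment on the document splitter base class was provided and has been corrected to a more accurate functional description.

Overall, this change mainly improves code readability and ensures that its syntax and style follow good practice. Before running actual tests, make sure you have updated to the latest version of the codebase and of your other development tools. If you have any questions or need further help, feel free to contact me!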
