It is desirable to preserve the header structure of the parsed md #32

Open
202030481266 opened this issue Aug 20, 2024 · 3 comments

@202030481266

Describe the bug

I used magic-doc to convert a .docx document to Markdown, but the headings in the output lost their # markers. How can I preserve the Markdown heading structure?

To Reproduce

from magic_doc.docconv import DocConverter, S3Config
converter = DocConverter(s3_config=None)
file_path = '/myfile/path'
markdown_content, time_cost = converter.convert(file_path, conv_timeout=300)
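
A quick way to confirm the symptom is to count ATX heading lines in the converted text. This is a minimal sketch that assumes `markdown_content` is the string returned above; `count_atx_headings` is a hypothetical helper, not part of magic-doc, and it does not skip `#` lines inside fenced code blocks.

```python
import re

# An ATX heading: up to three leading spaces, one to six '#' characters,
# then whitespace or the end of the line.
ATX_HEADING = re.compile(r"^ {0,3}#{1,6}(?:[ \t]+|$)", re.MULTILINE)


def count_atx_headings(markdown_text: str) -> int:
    """Return the number of lines that look like ATX headings."""
    return len(ATX_HEADING.findall(markdown_text))
```

If `count_atx_headings(markdown_content)` returns 0 for a document that clearly contains headings, the heading structure was lost during conversion.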

Expected behavior

2.2 Overview of Nonclinical Studies / Completed Clinical Study Results

2.2.1 Nonclinical Study Results

2.2.1.1 Preclinical Pharmacodynamic Studies

Env

  • Ubuntu 20.04
  • Python 3.10.14
  • fairy-doc 0.1.44
@icecraft
Collaborator

Currently, there is no plan to preserve headings.

@icecraft added the enhancement label on Aug 22, 2024
@202030481266
Author

I actually found that python-docx works quite well. It can identify not only the headings but also the tables, and I can add the # markers myself, so that I can then split the result with LangChain's MarkdownHeaderTextSplitter.

import os
from docx import Document


def read_doc(file_path, image_folder):
    doc = Document(file_path)
    content_list = []

    # Create the image folder if it does not exist yet
    os.makedirs(image_folder, exist_ok=True)

    # To track heading hierarchy and associate tables with titles
    current_heading = None
    current_heading_level = 0

    for para in doc.paragraphs:
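        # Handle headings: built-in styles are named 'Heading 1' to 'Heading 9'.
        # Other styles that merely start with 'Heading' fall through to paragraphs.
        level_text = para.style.name.removeprefix('Heading').strip()
        if para.style.name.startswith('Heading') and level_text.isdigit():
            # Markdown supports at most six heading levels
            heading_level = min(int(level_text), 6)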
            current_heading = para.text.strip()
            current_heading_level = heading_level
            markdown_heading = '#' * heading_level + ' ' + current_heading
            content_list.append({'type': 'heading', 'content': markdown_heading})

        # Handle regular paragraphs
        elif para.text.strip():
            content_list.append({'type': 'paragraph', 'content': para.text.strip()})

    # Handle tables separately since they are not part of paragraphs
    for table in doc.tables:
        table_data = []
        for row in table.rows:
            row_data = []
            for cell in row.cells:
                # Clean up cell content by removing newlines and extra spaces
                cell_text = " ".join(cell.text.split())
                row_data.append(cell_text)
            table_data.append(row_data)

        # Skip tables without rows, then convert to Markdown (first row is the header)
        if not table_data:
            continue
        markdown_table = '| ' + ' | '.join(table_data[0]) + ' |\n'
        markdown_table += '| ' + ' | '.join(['---'] * len(table_data[0])) + ' |\n'
        for row in table_data[1:]:
            markdown_table += '| ' + ' | '.join(row) + ' |\n'

        # Use the current heading as the table title. Caveat: tables are processed
        # after all paragraphs, so this is always the last heading in the document.
        if current_heading:
            content_list.append({'type': 'table', 'title': current_heading, 'content': markdown_table})
        else:
            content_list.append({'type': 'table', 'content': markdown_table})

    # Handle images: save every embedded (non-external) image part to image_folder
    for rel in doc.part.rels.values():
        if not rel.is_external and "image" in rel.reltype:
            img_bin = rel.target_part.blob
            img_ext = rel.target_part.partname.split(".")[-1]
            filename = f'{image_folder}/image{len(content_list) + 1}.{img_ext}'
            with open(filename, 'wb') as f:
                f.write(img_bin)
            content_list.append({'type': 'image', 'path': filename})

    return content_list
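
One caveat with the function above: `doc.tables` is walked only after every paragraph has been seen, so `current_heading` already holds the last heading in the document when the tables are processed. The sketch below shows one way to pair each table with the nearest preceding heading by reading the body in document order. It swaps python-docx for the standard library (`zipfile` plus `xml.etree`), reading `word/document.xml` directly. `tables_with_headings` is a hypothetical helper, and it assumes heading style ids of the form `Heading1`, `Heading2`, and so on, which localized documents may not use.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in Clark notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _text(element):
    """Concatenate the text of all w:t runs below an element."""
    return "".join(t.text or "" for t in element.iter(f"{W}t")).strip()


def tables_with_headings(docx_file):
    """Return (heading_or_None, rows) for each top-level table, in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{W}body")

    current_heading = None
    result = []
    for child in body:  # direct children of w:body are in document order
        if child.tag == f"{W}p":
            style = child.find(f"{W}pPr/{W}pStyle")
            style_id = style.get(f"{W}val", "") if style is not None else ""
            # Assumption: heading style ids look like 'Heading1', 'Heading2', ...
            if style_id.startswith("Heading") and _text(child):
                current_heading = _text(child)
        elif child.tag == f"{W}tbl":
            rows = [
                [_text(cell) for cell in row.findall(f"{W}tc")]
                for row in child.findall(f"{W}tr")
            ]
            result.append((current_heading, rows))
    return result
```

Each returned `rows` list can be passed through the same Markdown table formatting used in `read_doc`.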

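To feed the result into a header-based splitter, the entries returned by `read_doc` first have to be joined into one Markdown string. The sketch below does that and then splits on headings with a small standard-library stand-in, so it runs without LangChain installed. `to_markdown` and `split_by_headings` are hypothetical helpers, not part of python-docx or LangChain, and the splitter does not treat fenced code blocks specially.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def to_markdown(content_list):
    """Join read_doc-style entries into one Markdown string."""
    parts = []
    for item in content_list:
        if item["type"] == "image":
            parts.append(f"![image]({item['path']})")
        else:
            parts.append(item["content"].strip())
    return "\n\n".join(parts) + "\n"


def split_by_headings(markdown_text):
    """Split Markdown into chunks, each tagged with the headings above it."""
    chunks = []
    path = {}    # heading level -> title of the currently open heading
    buffer = []  # body lines collected since the last heading

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"headings": [path[k] for k in sorted(path)], "text": text})
        buffer.clear()

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new heading closes every open heading at the same or a deeper level
            for key in [k for k in path if k >= level]:
                del path[key]
            path[level] = match.group(2).strip()
        else:
            buffer.append(line)
    flush()
    return chunks
```

The string produced by `to_markdown` is also the kind of input that LangChain's MarkdownHeaderTextSplitter, mentioned above, is designed to split.
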
@icecraft
Collaborator

Well done.
