Word Document table conversion issue #20

Open
He-Huang opened this issue Dec 14, 2024 · 5 comments
Labels
enhancement New feature or request open for contribution Invites open-source developers to contribute to the project.

Comments

@He-Huang

This library is great. It would be even more useful if table conversion were accurate for tables with merged cells.

With the table inside this docx file
Image

I got the parsing results as below:

| 1 | 2 | 3 | 4 | 5 | 6 | |
| --- | --- | --- | --- | --- | --- | --- |
| A | b | c | d | e | f | value |
| I | J | K | L | M |
| P | Q | R | S | T |

After rendering in markdown, it's like

1 2 3 4 5 6
A b c d e f value
I J K L M
P Q R S T
@mick-net

Markdown tables seem to lack support for complex tables. Maybe complex tables can be converted to HTML tables instead, since LLMs and Markdown viewers can usually handle those too.
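
A minimal sketch of that idea, for illustration only: emit an HTML table when any cell is merged, and a plain pipe table otherwise. The `Cell` type and every function name here are hypothetical and not part of markitdown, and treating the first row as the header is an assumption.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Cell:
    """One table cell; rowspan/colspan > 1 means the cell is merged."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def has_merged_cells(rows):
    """True if any cell spans more than one row or column."""
    return any(c.rowspan > 1 or c.colspan > 1 for row in rows for c in row)


def render_html(rows):
    """Render rows as an HTML table, keeping rowspan/colspan attributes."""
    out = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if i == 0 else "td"  # assumption: first row is the header
        cells = []
        for c in row:
            attrs = ""
            if c.rowspan > 1:
                attrs += f' rowspan="{c.rowspan}"'
            if c.colspan > 1:
                attrs += f' colspan="{c.colspan}"'
            cells.append(f"<{tag}{attrs}>{escape(c.text)}</{tag}>")
        out.append("  <tr>" + "".join(cells) + "</tr>")
    out.append("</table>")
    return "\n".join(out)


def render_pipe(rows):
    """Render rows as a Markdown pipe table (only valid without merged cells)."""
    header, *body = [[c.text for c in row] for row in rows]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


def render_table(rows):
    """Fall back to HTML only when Markdown cannot express the layout."""
    return render_html(rows) if has_merged_cells(rows) else render_pipe(rows)


# A small table with one cell merged across three rows, and a simple one.
merged = [
    [Cell("1"), Cell("2")],
    [Cell("A", rowspan=3), Cell("b")],
    [Cell("I")],
    [Cell("P")],
]
simple = [[Cell("x"), Cell("y")], [Cell("1"), Cell("2")]]

print(render_table(merged))
print(render_table(simple))
```

Simple tables stay readable as Markdown this way, and only the tables that would otherwise lose structure are emitted as HTML.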

@gagb added the enhancement, help wanted, and open for contribution labels, then removed the help wanted label on Dec 14, 2024
@brc-dd
Contributor

brc-dd commented Dec 15, 2024

pandoc's output is a bit better:

1 2 3 4 5 6
A b c d e f value
I J K L M
P Q R S T

mammoth's HTML output is also correct (except that it doesn't detect header rows properly, but it's still usable; ref: mwilliamson/mammoth.js#126).

The issue seems to be this - matthewwithanm/python-markdownify#121

We can use the mentioned workaround. We already have pandas in deps. It will give output like this:

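| 1 | 2 | 3 | 4 | 5 | 6 | 6.1 |
| --- | --- | --- | --- | --- | --- | --- |
| A | b | c | d | e | f | value |
| A | I | J | K | L | M | value |
| A | P | Q | R | S | T | value |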
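
To make the effect of that workaround concrete, here is a rough standard-library sketch of the underlying technique: expand every `rowspan`/`colspan` into a rectangular grid by repeating the merged cell's text, then print a pipe table. It is not the code from the linked markdownify issue and it does not use pandas. The sample HTML is an assumption reconstructed from the output above, and all names are hypothetical. The `6.1` header above is presumably pandas renaming a duplicated column label; this sketch simply repeats `6`.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the cells of one table as (text, rowspan, colspan) per row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # list of rows, each a list of cell tuples
        self._cell = None   # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan", 1)), int(a.get("colspan", 1))]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_spans(rows):
    """Return a rectangular grid with merged cells repeated in each covered slot."""
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # skip slots already claimed by a rowspan above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def html_table_to_markdown(html):
    """Parse one HTML table and render it as a pipe table with spans expanded."""
    parser = TableCellParser()
    parser.feed(html)
    header, *body = expand_spans(parser.rows)
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# Assumed structure of the table from this issue (inferred, not the real file).
SAMPLE_HTML = """
<table>
  <tr><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th colspan="2">6</th></tr>
  <tr><td rowspan="3">A</td><td>b</td><td>c</td><td>d</td><td>e</td><td>f</td>
      <td rowspan="3">value</td></tr>
  <tr><td>I</td><td>J</td><td>K</td><td>L</td><td>M</td></tr>
  <tr><td>P</td><td>Q</td><td>R</td><td>S</td><td>T</td></tr>
</table>
"""

print(html_table_to_markdown(SAMPLE_HTML))
```

The trade-off is that the merge information itself is lost: every row comes out with the same number of columns, but a reader can no longer tell that `A` and `value` were single merged cells.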

@He-Huang
Author

Thanks @brc-dd! Pandoc works perfectly in my trial.
Hopefully it will be integrated into markitdown so that we can get embedded HTML tables within the Markdown output.

@Utopiah

Utopiah commented Dec 16, 2024

> pandoc's output is bit better

Not to hijack this issue but can you please explain what's the difference in terms of quality and features (so excluding programming language, funding institution and community) between this tool and Pandoc? I discovered it recently and my first thought was indeed, what does it do better than what I already know and use, namely pandoc or soffice, and why are those not contributions to such existing FLOSS projects?

Edit: seems there is audio transcription, not sure what's the use case in this context though.

@LCorleone

Markdown is not suitable for complex tables; maybe XML is a good choice.


6 participants