Convert DOCX to Markdown: Ordered List in Markdown keeps increasing across headings #10258

Closed
ThachNgocTran opened this issue Oct 3, 2024 · 9 comments

Comments

@ThachNgocTran

Explain the problem

In a Word document (DOCX), create an ordered (numbered) list under each new heading (in this specific case: Heading 4). In the document, the numbering restarts under each heading, as expected. But after converting the DOCX to Markdown with Pandoc, the numbering does not restart: it keeps increasing across headings. This is unexpected.

Expectation: When a new heading starts, the ordered list numbering should restart in the Markdown output, consistent with the DOCX version.


Reproducibility

  • Input file: input.docx
  • Output file: output.md
  • Command line to convert Input to Output: pandoc --wrap=none --extract-media=./ -f docx -t gfm input.docx -o output.md
  • Pandoc Version: pandoc-3.4-windows-x86_64
  • OS: Windows 10 x64 (22H2)
  • Office Editor: LibreOffice 24.8.1.2
@jgm
Owner

jgm commented Oct 3, 2024

When one creates this sort of file with Word, it uses a different numId for each of the two lists, and pandoc expects that. It seems that LibreOffice uses the same numId for both. So I'm actually puzzled about why Word doesn't display the lists as parts of a single list, which is how pandoc interprets them.

There may be something I'm not understanding right about how these lists are to be encoded, or maybe there's an issue with LibreOffice.

See also #7895.

@jgm
Owner

jgm commented Oct 3, 2024

OK, I think I see what is happening in this case.

The list with numId 2 is being used for BOTH the enumerated lists and the headings (the headings are "level 0", <w:ilvl w:val="0"/>, and the lists are level 1).

I think that's why the number resets: Word interprets this as a new sublist of a higher-level list (the headings).
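To check whether headings and list items share a numId in a given file, one can list the numbering properties of each paragraph. The following is a minimal sketch, not part of pandoc: the function names are made up for illustration, and it only reports numbering set directly on a paragraph (`w:pPr/w:numPr`), not numbering inherited from a style.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(element):
    """Return the w:val attribute of an element, or None if the element is missing."""
    return element.get(f"{{{W}}}val") if element is not None else None


def numbering_of_paragraphs(document_xml):
    """Return (style, numId, ilvl, text) for each paragraph with a direct w:numPr.

    Hypothetical helper for inspection only; style-inherited numbering is ignored.
    """
    root = ET.fromstring(document_xml)
    rows = []
    for p in root.iter(f"{{{W}}}p"):
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            continue  # paragraph carries no direct numbering information
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        rows.append((
            _val(p.find("w:pPr/w:pStyle", NS)),
            _val(num_pr.find("w:numId", NS)),
            _val(num_pr.find("w:ilvl", NS)),
            text,
        ))
    return rows


def numbering_of_docx(path):
    """Read word/document.xml from a .docx file and list its numbered paragraphs."""
    with zipfile.ZipFile(path) as archive:
        return numbering_of_paragraphs(archive.read("word/document.xml"))
```

Running `numbering_of_docx("input.docx")` on a file like the one attached above would show whether the heading rows and the list-item rows report the same numId at different ilvl values.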

@jgm
Owner

jgm commented Oct 3, 2024

Word doesn't seem to use numId for headings in this way, when headings aren't numbered. (And even when they are, the inner list isn't using the same numId, though maybe this would happen if I entered the document differently.)

@jgm
Owner

jgm commented Oct 3, 2024

Anyway, the fix should involve modifying the code in the docx reader that tracks continuing lists and sets the start number accordingly, and making sure that the number is reset when a higher-level list item in the same numId series is encountered.
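The reset rule described here can be sketched as follows. This is an illustration in Python under simplifying assumptions, not pandoc's actual Haskell code: the paragraph tuples, the `"heading"`/`"item"` kinds, and the function name are all hypothetical.

```python
def assign_numbers(paragraphs):
    """Assign list numbers, restarting deeper levels when a higher level recurs.

    `paragraphs` is a list of (kind, num_id, level, text) tuples, where kind is
    "heading" or "item" and num_id is None for unnumbered paragraphs.
    Returns a list of (text, number), with number None for non-items.
    """
    counters = {}  # (num_id, level) -> last number used at that level
    result = []
    for kind, num_id, level, text in paragraphs:
        if num_id is None:
            result.append((text, None))
            continue
        # A paragraph at `level` restarts every deeper level of the same numId,
        # whether it is a list item or a heading that carries numbering info.
        stale = [key for key in counters if key[0] == num_id and key[1] > level]
        for key in stale:
            del counters[key]
        counters[(num_id, level)] = counters.get((num_id, level), 0) + 1
        number = counters[(num_id, level)] if kind == "item" else None
        result.append((text, number))
    return result
```

With two level-0 headings that share a numId with their level-1 list items, the items under the second heading start again at 1 instead of continuing from the first list.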

@ThachNgocTran
Author

@jgm Thank you for the quick response. I originally ran into this issue with a Google Docs document: I exported it to DOCX, converted that to Markdown with Pandoc, and got the issue. I then used LibreOffice Writer (latest version) to remove the irrelevant parts of the DOCX and make a reproducible file for the bug.

So I guess this issue is not specific to LibreOffice.

@jgm
Owner

jgm commented Oct 3, 2024

We do have code for list items that does what I say above:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx.hs#L758-L762

The problem is that we don't have anything similar for headings, which we don't parse with these fields.

The heading element is created here:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx.hs#L695

Somewhere in the code path leading up to this, we need to modify docxListState. This may require some changes in the types used in Docx parsing.

jgm added a commit that referenced this issue Oct 4, 2024
+ Remove ListItem constructor from BodyPart.
+ Changed numbered field of ParagraphStyle to a Maybe Number.
+ Add Number type to store numbering information.

This makes sense because headings can have numbering information,
and we sometimes need to know what it is (#10258).
jgm added a commit that referenced this issue Oct 4, 2024
@jgm
Owner

jgm commented Oct 4, 2024

Some ideas are in the issue10258 branch.

@ThachNgocTran
Author

ThachNgocTran commented Oct 4, 2024

@jgm Hope this issue is simple and can get fixed in the next iteration of Pandoc! 😀

My original intention was to export a Google Docs document to Markdown (GitHub flavor) in order to import it into Obsidian. Along the way, there were some minor issues. For example, without the --wrap=none flag, Pandoc can break the text in a table cell across lines, so Obsidian's Markdown renderer cannot draw the table correctly (one should use <br/> instead of \n for a newline in a table cell). Apart from that, the bug in this post is the only major issue left.

@jgm jgm closed this as completed in 93d7457 Oct 4, 2024
@jgm
Owner

jgm commented Oct 4, 2024

I think it is now fixed (the fix will be in the next release).

For the table issue you mention, try setting --columns to a very high value when writing markdown.
