Convert DOCX to Markdown: Ordered List in Markdown keeps increasing across headings #10258

ThachNgocTran · 2024-10-03T17:16:20Z

Explain the problem

In Microsoft Word (DOCX), when I start an ordered (numbered) list under a new heading (in this specific case, Heading 4), the numbering restarts, as expected. But when I convert the DOCX to Markdown with Pandoc, the numbering continues across headings instead of restarting. This is unexpected.

Expectation: the ordered list should restart at each new heading in the Markdown output, consistent with the DOCX version.

Reproducibility

Input file: input.docx
Output file: output.md
Command line to convert Input to Output: pandoc --wrap=none --extract-media=./ -f docx -t gfm input.docx -o output.md
Pandoc Version: pandoc-3.4-windows-x86_64
OS: Windows 10 x64 (22H2)
Office Editor: LibreOffice 24.8.1.2

jgm · 2024-10-03T17:41:00Z

When one creates this sort of file with Word, it uses different numId for the two lists, and pandoc expects that. It seems that LibreOffice uses the same numId. So I'm actually puzzled about why Word doesn't display the lists as all part of one list, as pandoc interprets it.

There may be something I'm not understanding right about how these lists are to be encoded, or maybe there's an issue with LibreOffice.

See also #7895.

jgm · 2024-10-03T17:44:47Z

OK, I think I see what is happening in this case.

The list with numId 2 is being used for BOTH the enumerated lists and the headings (the headings are "level 0", <w:ilvl w:val="0"/>, and the lists are level 1).

I think that's why the number resets: Word interprets this as a new sublist of a higher-level list (the headings).
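To check whether a given DOCX shares one numId between headings and list items, the numbering properties of each paragraph can be listed. The following is a minimal sketch in Python, not part of pandoc; the function names `paragraph_numbering` and `read_docx` are made up for illustration. It assumes the numbering is set directly on the paragraph via `w:numPr`, so numbering inherited from a style in styles.xml will not show up.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(elem):
    """Return the w:val attribute of an element, or None if it is absent."""
    return elem.get(f"{{{W}}}val") if elem is not None else None


def paragraph_numbering(document_xml):
    """Return (style, numId, ilvl, text) for every paragraph.

    Hypothetical helper: only numbering set directly on the paragraph
    (w:pPr/w:numPr) is reported.
    """
    root = ET.fromstring(document_xml)
    rows = []
    for p in root.iter(f"{{{W}}}p"):
        style = _val(p.find("w:pPr/w:pStyle", NS))
        num_id = _val(p.find("w:pPr/w:numPr/w:numId", NS))
        ilvl = _val(p.find("w:pPr/w:numPr/w:ilvl", NS))
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        rows.append((style, num_id, ilvl, text))
    return rows


def read_docx(path):
    """Read word/document.xml from a .docx file and list its numbering."""
    with zipfile.ZipFile(path) as z:
        return paragraph_numbering(z.read("word/document.xml"))
```

Calling `read_docx("input.docx")` on a file like the one in this issue should show the headings and the list items with the same numId but different ilvl values, if the numbering is encoded as described above.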

jgm · 2024-10-03T17:48:49Z

Word doesn't seem to use numId for headings in this way, when headings aren't numbered. (And even when they are, the inner list isn't using the same numId, though maybe this would happen if I entered the document differently.)

jgm · 2024-10-03T17:50:19Z

Anyway, the fix should involve modifying the code in the docx reader that tracks continuing lists and sets the start number accordingly, and making sure that the number is reset when a higher-level list item in the same numId series is encountered.
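The reset rule described here can be modelled in a few lines. This is a simplified Python sketch of the idea only, not pandoc's Haskell implementation; the function name `assign_numbers` and the `(kind, num_id, ilvl)` tuples are assumptions made for illustration.

```python
def assign_numbers(paragraphs):
    """Assign a number to each list item, resetting deeper levels.

    paragraphs: iterable of (kind, num_id, ilvl), where kind is
    "heading" or "item" and num_id is None for unnumbered paragraphs.
    Returns the numbers given to the "item" paragraphs, in order.
    """
    last = {}  # (num_id, ilvl) -> last number used at that level
    numbers = []
    for kind, num_id, ilvl in paragraphs:
        if num_id is None:
            continue  # no numbering information, nothing to track
        # A paragraph at level ilvl restarts every deeper level
        # of the same numId series.
        for key in [k for k in last if k[0] == num_id and k[1] > ilvl]:
            del last[key]
        if kind == "item":
            n = last.get((num_id, ilvl), 0) + 1
            last[(num_id, ilvl)] = n
            numbers.append(n)
    return numbers


# Headings carry numId "2" at level 0, list items use level 1.
with_heading_info = [
    ("heading", "2", 0), ("item", "2", 1), ("item", "2", 1),
    ("heading", "2", 0), ("item", "2", 1), ("item", "2", 1),
]
# The same document if the heading numbering is ignored.
without_heading_info = [
    (kind, None if kind == "heading" else num_id, ilvl)
    for kind, num_id, ilvl in with_heading_info
]

print(assign_numbers(with_heading_info))     # [1, 2, 1, 2]
print(assign_numbers(without_heading_info))  # [1, 2, 3, 4]
```

The second call reproduces the behaviour reported in this issue: when the heading's numbering information is not taken into account, nothing triggers the reset and the list keeps counting.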

ThachNgocTran · 2024-10-03T18:02:11Z

@jgm Thank you for the quick response. I originally hit this issue with a Google Doc: I exported the document to DOCX, converted it to Markdown with Pandoc, and got the problem. I then used LibreOffice Writer (latest version) to strip the irrelevant parts of the DOCX and produce a reproducible file for this bug.

So I guess this issue is not LibreOffice's.

jgm · 2024-10-03T22:03:18Z

We do have code for list items that does what I say above:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx.hs#L758-L762

The problem is that we don't have anything similar for headings, which we don't parse with these fields.

The heading element is created here:
https://github.com/jgm/pandoc/blob/main/src/Text/Pandoc/Readers/Docx.hs#L695

Somewhere in the code path leading up to this, we need to modify docxListState. This may require some changes in the types used in Docx parsing.

+ Remove ListItem constructor from BodyPart.
+ Changed numbered field of ParagraphStyle to a Maybe Number.
+ Add Number type to store numbering information.

This makes sense because headings can have numbering information, and we sometimes need to know what it is (#10258).

See #10258.

jgm · 2024-10-04T01:24:32Z

Some ideas are in the issue10258 branch.

ThachNgocTran · 2024-10-04T09:56:45Z

@jgm Hope this issue is simple and can get fixed in the next iteration of Pandoc! 😀

My original intention was to export a Google Doc to Markdown (GitHub flavor) in order to import it into Obsidian. Along the way there were some minor issues. For example, without the --wrap=none flag, Pandoc can break the text in a table cell across lines, which makes Obsidian's Markdown renderer unable to draw the table correctly (a line break inside a table cell should be <br/> rather than \n). Apart from that, the bug in this issue is the only major one left.

jgm · 2024-10-04T15:32:26Z

I think it is now fixed (the fix will be in the next release).

For the table issue you mention, try setting --columns to a very high value when writing markdown.

ThachNgocTran added the bug label Oct 3, 2024

jgm added format:Docx reader labels Oct 3, 2024

jgm added a commit that referenced this issue Oct 4, 2024

Reset docxListState if header has list information.

e6a1bb4

See #10258.

jgm closed this as completed in 93d7457 Oct 4, 2024
