Trivial docx file fails to be parsed with 'couldn't parse docx file' error #5277

danielrbrowne · 2019-02-06T11:28:34Z

Pandoc fails to parse the attached trivial docx file (generated from Microsoft Word Online). Microsoft Word for Mac v16.21, Word Online (i.e. part of Office 365 online) and Pages for Mac all open the file without reporting any errors.

Given how trivial the file's contents are, I would expect Pandoc to parse this file without a problem.

trivial.docx

Pandoc version: 2.6 (also reproducible with 1.15.0.6, which is an old version that is used by a codebase I work on)

pandoc 2.6
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2, skylighting 0.7.5
Default user data directory: /Users/danbrowne/.pandoc
Copyright (C) 2006-2019 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

Command used with v2.6:

pandoc --extract-media=/Users/admin/Documents --from=docx --to=html --email-obfuscation=none --standalone +RTS -K128m -RTS --wrap=none ~/Documents/trivial.docx

Command-line output (v2.6):

couldn't parse docx file

Command used with v1.15.0.6:

pandoc --extract-media=/Users/admin/Documents --from=docx --to=html --email-obfuscation=none --standalone +RTS -K128m -RTS --no-wrap ~/Documents/trivial.docx

Command-line output (v1.15.0.6):

pandoc: couldn't parse docx file

mb21 · 2019-02-06T12:39:54Z

We should probably do better error reporting here: https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/Docx.hs#L119

agusmba · 2019-02-06T15:01:27Z

I can confirm the problem with the file on Windows 7 (pandoc 2.6):

$ pandoc --from=docx --to=native ./trivial.docx
couldn't parse docx file

If I open the docx in Word and save it, pandoc works:

$ pandoc --from=docx --to=native ./trivial2.docx
[Header 1 ("test",[],[]) [Str "Test"]
,Para [Str "This",Space,Str "is",Space,Emph [Str "italic"],Str ",",Space,Strong [Str "bold"],Str ",",Space,Span ("",["underline"],[]) [Str "underlined"],Str ",",Space,Emph [Span ("",["underline"],[]) [Str "italic",Space,Str "underlined"]],Str ",",Space,Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "underlined"]],Str ",",Space,Emph [Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "italic",Space,Str "underlined"]]],Str "."]]

I'll try to see if there is any glaring difference at the xml level between the two...

I don't know whether pandoc is hardcoded to read document.xml, but the test file uses document2.xml.

danielrbrowne · 2019-02-06T15:18:46Z

Thanks for the updates on this - I just double-checked and I got my wires crossed with another dev that reported this issue internally...

The attached docx file on this ticket is actually from Microsoft Word Online (part of Office 365 online). It seems Word Online generates docx files with an internal document2.xml, whereas Microsoft Word for Mac generates docx files with an internal document.xml file.

I've never seen a document2.xml inside a docx file - so it seems even Microsoft themselves are not consistent across their own separate versions of Word!

(I verified the internal difference by simply saving a docx with the word "Test" in it on both Word Online and Word for Mac)
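The same check can be done without unzipping by hand. The following is a minimal Python sketch, not part of pandoc; the function name `document_parts` is hypothetical, and it assumes the main part always matches `word/document<digits>.xml`, which is only what the two files in this thread show.

```python
import re
import zipfile


def document_parts(docx):
    """List zip entries that look like a main document part.

    `docx` is a file path or a binary file object. The pattern
    word/document<digits>.xml is an assumption based on the two
    names seen in this thread (document.xml and document2.xml).
    """
    with zipfile.ZipFile(docx) as zf:
        return sorted(
            name
            for name in zf.namelist()
            if re.fullmatch(r"word/document\d*\.xml", name)
        )
```

Calling `document_parts("trivial.docx")` on a Word Online file would be expected to list `word/document2.xml`, while a file saved from desktop Word would list `word/document.xml`.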

danielrbrowne · 2019-02-06T15:38:17Z

@agusmba This comment might be helpful for a method of avoiding a hard-coded check for document.xml or document2.xml (e.g. if Microsoft introduce some other naming scheme in the future): ankushshah89/python-docx2txt#16 (comment)

agusmba · 2019-02-06T16:13:31Z

Confirmed: I renamed document2.xml and document2.xml.rels inside the archive, updated the references in [Content_Types].xml and _rels/.rels, and the corrected docx is parsed by pandoc:

$ pandoc --from=docx --to=native ./trivial0/trivial0_b.docx
[Header 1 ("test",[],[]) [Str "Test"]
,Para [Str "This",Space,Str "is",Space,Emph [Str "italic"],Str ",",Space,Strong [Str "bold"],Str ",",Space,Span ("",["underline"],[]) [Str "underlined"],Str ",",Space,Emph [Span ("",["underline"],[]) [Str "italic",Space,Str "underlined"]],Str ",",Space,Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "underlined"]],Str ",",Space,Emph [Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "italic",Space,Str "underlined"]]],Str "."]]

So if this is the default naming for Office365 (it seems to be, I just created a small online document that exhibits this issue), we'd need to support this variability in the filenames.

agusmba · 2019-02-06T16:44:50Z

> This comment might be helpful for a method of avoiding a hard-coded check for document.xml or document2.xml (e.g. if Microsoft introduce some other naming scheme in the future): ankushshah89/python-docx2txt#16 (comment)

It seems they have the same problem. However, their solution is not complete yet (I commented on it). I guess we'd need to start from _rels/.rels, get the name of the document[x].xml from it, and construct the name of word/_rels/document[x].xml.rels from that, supposing that its name is correlated and that pandoc uses it.
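The lookup described above can be sketched in Python as follows. This is an illustration only, not pandoc's Haskell implementation: the function name `main_document_paths` is hypothetical, matching the relationship type by its `/officeDocument` suffix is a simplifying assumption, and the sibling `_rels/<name>.rels` naming is the correlation supposed in the comment above.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the package relationships file (_rels/.rels).
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def main_document_paths(docx):
    """Return (document_path, document_rels_path) found via _rels/.rels.

    `docx` is a file path or a binary file object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("_rels/.rels"))
    for rel in root.iter(RELS_NS + "Relationship"):
        # Assumption: the main part is the relationship whose Type
        # ends in "/officeDocument".
        if rel.get("Type", "").endswith("/officeDocument"):
            # Zip entry names carry no leading slash, so drop one if
            # the target is written relative to the package root.
            target = rel.get("Target", "").lstrip("/")
            directory, name = posixpath.split(target)
            # Assumption: the part's own relationships live in a
            # sibling _rels directory under <name>.rels.
            rels = posixpath.join(directory, "_rels", name + ".rels")
            return target, rels
    raise ValueError("no officeDocument relationship in _rels/.rels")
```

Zip entry names always use forward slashes, which is why the sketch uses `posixpath` rather than the platform-dependent `os.path`.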

jgm · 2019-02-06T16:54:02Z

Interesting. I wonder why online Office does that?
There are two places in src/Text/Pandoc/Readers/Docx/Parse.hs where these paths are hardcoded:

Parse.hs
366:  entry <- maybeToD $ findEntryByPath "word/document.xml" zf
480:filePathToRelType "word/_rels/document.xml.rels"  = Just InDocument

For a quick fix we could just have it check in both places for document2.xml as well.
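As an illustration of the quick fix, here is a minimal Python sketch of trying a fixed list of candidate names in order. It is not the actual Haskell change; the names `CANDIDATES` and `find_document_entry` are hypothetical.

```python
import zipfile

# Candidate paths for the main document part, tried in order.
# Only these two names have been observed in this thread.
CANDIDATES = ("word/document.xml", "word/document2.xml")


def find_document_entry(docx):
    """Return the first candidate present in the archive, or None.

    `docx` is a file path or a binary file object.
    """
    with zipfile.ZipFile(docx) as zf:
        names = set(zf.namelist())
    for candidate in CANDIDATES:
        if candidate in names:
            return candidate
    return None
```

The obvious limitation is that any third naming scheme would fail in the same way as before.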

For a more robust fix, we should get these names from _rels/.rels.

jgm · 2019-02-06T16:58:22Z

I'll implement the quick fix now.
@agusmba do you want to work on the more robust fix?

For some reason, Word in Office 365 Online uses `document2.xml` for the content, instead of `document.xml`. This causes pandoc not to be able to parse docx. This quick fix has the parser check for both `document.xml` and `document2.xml`. Addresses #5277, but a more robust solution would be to get the name of the main document dynamically (who knows whether it might change again?).

agusmba · 2019-02-06T17:38:47Z

I could take a look, but with some big caveats: I need to study some Haskell first (I'm not too proud of my latest PR code), and I don't have as much free time anymore (I had some unexpected free time in January).
On the plus side, if I were able to understand the docx reader, I could later on try to improve the reading of properties...
I could put this on my radar, but I wouldn't expect anything soon, so if anyone else wants to take a go at it, [s]he'd be welcome.

jgm · 2019-02-06T22:55:14Z

@jkr wrote the docx reader and might be able to fix it easily.

jkr · 2019-02-07T00:18:14Z

Okay, done. Will upload as soon as stack is done stacking.

agusmba · 2019-02-07T08:27:55Z

Thanks jkr!
I had a question on the commit, but it may be nothing.

jkr · 2019-02-07T11:06:28Z

@agusmba : Good point -- thanks! Fixed with 9ff4042

jgm · 2019-02-07T18:51:18Z

@jkr we're getting a "couldn't parse docx file" failure in the test suite on appveyor.
https://ci.appveyor.com/project/jgm/pandoc/build/job/q2gmkgiic8lj8xm5
The same commit seems to build on linux. Can you think of any reason why this would be happening?
Something Windows specific?

jkr · 2019-02-07T19:00:17Z

If there's (a) an error only on Windows, and (b) the change had to do with zip path extraction, I'd guess it has to do with OS-specific file-path separators (filepath on Windows wants backslashes, but zip files use forward slashes regardless of OS). In particular, I'd guess that it has to do with takeFileName at Parse.hs line 508.

I can't test on Windows though. Should I try doing a more simple-minded path separation and upload it to see what appveyor says?

jkr · 2019-02-07T19:05:20Z

If that is the problem, it would probably also work to do a qualified import of System.FilePath.Posix and use the takeFileName from there.

I'm focusing on that function, because I think it's the only separator-related function from System.FilePath that I introduced here.

jkr · 2019-02-07T19:51:53Z

I just pushed the Posix version, and will see how appveyor likes it.

jkr · 2019-02-07T20:07:29Z

Hmmm... that wasn't it. Reverted.

jgm · 2019-02-08T07:36:33Z

Reopening this so we can track it until the Windows issue is fixed...

mb21 · 2019-02-08T09:36:05Z

@jkr you can also make a pull request to get an appveyor build, so you don't have to push to master every time ;-)

jkr · 2019-02-08T11:32:08Z

Okay -- I'm mystified. This complaint starts appearing with 1847bdb.

Which is just the initial fix in 4cce0ef (no path manipulation) and the test cases.

I'm curious about whether it's something weird with the test case file. I'll try just reverting the test case commit and submitting that as a PR...

agusmba · 2019-02-08T12:44:06Z

@jkr I can do tests on windows, let me know if I can help.
When I get some free time at home I could try to download the appveyor build and try to execute your test case, if needed I could also build pandoc locally just in case.

jkr · 2019-02-08T13:30:43Z

@mb21: actually, is that true? I submitted #5284 and I don't see an appveyor build on there.

@agusmba: that would be great. Could you try #5284?

agusmba · 2019-02-08T13:48:56Z

Yeah, I think appveyor only kicks in on jgm:master.
Unfortunately, it seems appveyor only saves artifacts if the tests pass, so I'd need to build pandoc locally to test your changes.

I have tested with jgm's fix and it works:

$ pandoc -t native alternate_document_path.docx
[Header 1 ("test",[],[]) [Str "Test"]
,Para [Str "This",Space,Str "is",Space,Emph [Str "italic"],Str ",",Space,Strong [Str "bold"],Str ",",Space,Span ("",["underline"],[]) [Str "underlined"],Str ",",Space,Emph [Span ("",["underline"],[]) [Str "italic",Space,Str "underlined"]],Str ",",Space,Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "underlined"]],Str ",",Space,Emph [Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "italic",Space,Str "underlined"]]],Str "."]]

I'll post again once I build and test with your PR.

jkr · 2019-02-08T14:00:39Z

@agusmba: great -- thanks! If #5284 succeeds, we're good.

If it fails, could you try with #5283?

agusmba · 2019-02-08T14:49:23Z

Good news, @jkr: #5284 works on my box (I also confirmed that current master doesn't, to make sure).

jkr · 2019-02-08T15:16:29Z

@jgm: there was still a failure in the appveyor build, but it seems independent of this issue. x86_64 passed, but i386 failed on building basement.

jkr · 2019-02-09T17:47:17Z

Actually, @jgm, do you think this (second) fix should be moved to zip-archive? I've poked around a bit and I've seen the rare reference to it as a bug under Windows' handling of paths with a leading slash: e.g. https://superuser.com/a/415972. I can put a bug report or PR up there if you'd like.

jgm · 2019-02-09T22:17:23Z

Jesse Rosenthal <notifications@github.com> writes:

> Actually, @jgm, do you think this (second) fix should be moved to `zip-archive`? I've poked around a bit and I've seen the rare reference to it as a bug under Windows' handling of paths with a leading slash: e.g. https://superuser.com/a/415972. I can put a bug report or PR up there if you'd like.

Sure, if it affects every zip file, then it's probably worth putting there.

kstar971 · 2019-07-16T11:05:21Z

Hi all, thanks for the ideas.

I had the same issue with pandoc 2.7.3 (compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1) on Windows 10 x64.

The issue was due to two different footers in the docx file. What worked for me:

1. Change the .docx extension to .zip
2. Open the archive with Windows File Explorer
3. Delete the footer2.xml file and save
4. Restore the .docx extension

After that, the pandoc conversion no longer gives an error :)

Is this an issue with managing more than one XML file of the same type (document[x].xml, footer[x].xml)?

Kstar

mb21 added format:Docx reader labels Feb 6, 2019

jkr closed this as completed in 4cce0ef Feb 7, 2019

jgm reopened this Feb 8, 2019

jkr closed this as completed in b3d015e Feb 8, 2019

rcragun mentioned this issue Nov 14, 2021

Couldn’t parse docx (strict) #7691

Open
