Trivial docx file fails to be parsed with 'couldn't parse docx file' error #5277

Closed
danielrbrowne opened this issue Feb 6, 2019 · 30 comments

Comments

@danielrbrowne

danielrbrowne commented Feb 6, 2019

Pandoc fails to parse the attached trivial docx file (generated from Microsoft Word Online). Microsoft Word for Mac v16.21, Word Online (i.e. part of Office 365 online) and Pages for Mac all open the file without reporting any errors.

Given how trivial the file's contents are, I would expect Pandoc to parse this file without a problem.

trivial.docx

Pandoc version: 2.6 (also reproducible with 1.15.0.6, which is an old version that is used by a codebase I work on)

pandoc 2.6
Compiled with pandoc-types 1.17.5.4, texmath 0.11.2, skylighting 0.7.5
Default user data directory: /Users/danbrowne/.pandoc
Copyright (C) 2006-2019 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

Command used with v2.6:

pandoc --extract-media=/Users/admin/Documents --from=docx --to=html --email-obfuscation=none --standalone +RTS -K128m -RTS --wrap=none ~/Documents/trivial.docx

Command-line output (v2.6):

couldn't parse docx file

Command used with v1.15.0.6:

pandoc --extract-media=/Users/admin/Documents --from=docx --to=html --email-obfuscation=none --standalone +RTS -K128m -RTS --no-wrap ~/Documents/trivial.docx

Command-line output (v1.15.0.6):

pandoc: couldn't parse docx file
@mb21
Collaborator

mb21 commented Feb 6, 2019

We should probably do better error reporting here: https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/Docx.hs#L119

@agusmba
Contributor

agusmba commented Feb 6, 2019

I can confirm the problem with the file on Windows 7 (pandoc 2.6):

$ pandoc --from=docx --to=native ./trivial.docx
couldn't parse docx file

If I open the docx in word and save it, pandoc works:

$ pandoc --from=docx --to=native ./trivial2.docx
[Header 1 ("test",[],[]) [Str "Test"]
,Para [Str "This",Space,Str "is",Space,Emph [Str "italic"],Str ",",Space,Strong [Str "bold"],Str ",",Space,Span ("",["underline"],[]) [Str "underlined"],Str ",",Space,Emph [Span ("",["underline"],[]) [Str "italic",Space,Str "underlined"]],Str ",",Space,Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "underlined"]],Str ",",Space,Emph [Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "italic",Space,Str "underlined"]]],Str "."]]

I'll try to see if there is any glaring difference at the xml level between the two...

image

I don't know if pandoc is hardcoded to read document.xml, the test file is using document2.xml

@danielrbrowne
Author

danielrbrowne commented Feb 6, 2019

Thanks for the updates on this - I just double-checked and I got my wires crossed with another dev that reported this issue internally...

The attached docx file on this ticket is actually from Microsoft Word Online (part of Office 365 online). It seems Word Online generates docx files with an internal document2.xml, whereas Microsoft Word for Mac generates docx files with an internal document.xml file.

I've never seen a document2.xml inside a docx file - so it seems even Microsoft themselves are not consistent across their own separate versions of Word!

(I verified the internal difference by simply saving a docx with the word "Test" in it on both Word Online and Word for Mac)
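
A minimal sketch of that check, assuming Python's standard `zipfile` module is an acceptable stand-in for unzipping by hand. The helper names `list_word_parts` and `make_fake_docx` are hypothetical, and the stand-in archive is not a valid Word file; it only carries enough entries to show the listing.

```python
import io
import zipfile
from typing import List


def list_word_parts(docx_bytes: bytes) -> List[str]:
    """Return the archive entries under word/ in sorted order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return sorted(n for n in zf.namelist() if n.startswith("word/"))


def make_fake_docx(main_part: str) -> bytes:
    """Build a tiny stand-in archive (hypothetical contents, for listing only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("[Content_Types].xml", "<Types/>")
        zf.writestr("_rels/.rels", "<Relationships/>")
        zf.writestr(main_part, "<document/>")
    return buf.getvalue()


if __name__ == "__main__":
    # Compare the two naming schemes described in this comment.
    for part in ("word/document.xml", "word/document2.xml"):
        print(list_word_parts(make_fake_docx(part)))
```

For a real file, reading its bytes with `open(path, "rb").read()` and passing them to `list_word_parts` should show which name the main part uses.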

@danielrbrowne
Author

@agusmba This comment might be helpful for a method of avoiding a hard-coded check for document.xml or document2.xml (e.g. if Microsoft introduce some other naming scheme in the future): ankushshah89/python-docx2txt#16 (comment)

@agusmba
Contributor

agusmba commented Feb 6, 2019

Confirmed. I renamed document2.xml and document2.xml.rels inside the archive to document.xml and document.xml.rels, updated the references in [Content_Types].xml and _rels/.rels, and pandoc parses the corrected docx:

$ pandoc --from=docx --to=native ./trivial0/trivial0_b.docx
[Header 1 ("test",[],[]) [Str "Test"]
,Para [Str "This",Space,Str "is",Space,Emph [Str "italic"],Str ",",Space,Strong [Str "bold"],Str ",",Space,Span ("",["underline"],[]) [Str "underlined"],Str ",",Space,Emph [Span ("",["underline"],[]) [Str "italic",Space,Str "underlined"]],Str ",",Space,Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "underlined"]],Str ",",Space,Emph [Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "italic",Space,Str "underlined"]]],Str "."]]

So if this is the default naming for Office365 (it seems to be, I just created a small online document that exhibits this issue), we'd need to support this variability in the filenames.

@agusmba
Contributor

agusmba commented Feb 6, 2019

This comment might be helpful for a method of avoiding a hard-coded check for document.xml or document2.xml (e.g. if Microsoft introduce some other naming scheme in the future): ankushshah89/python-docx2txt#16 (comment)

It seems they have the same problem. However, their solution is not complete yet (I commented on it). I guess we'd need to start from _rels/.rels, get the name of the document[x].xml, and construct from that the name of word/_rels/document[x].xml.rels, supposing that its name is correlated and that pandoc uses it.
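
The lookup described above can be sketched in Python as follows. This is not pandoc's implementation: `main_document_paths` and `fake_docx` are hypothetical names, and the sketch assumes (as the comment does) that the `.rels` file name mirrors the name of the main part. The two namespace URIs are the standard package-relationships namespace and the `officeDocument` relationship type.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from typing import Tuple

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC_TYPE = (
    "http://schemas.openxmlformats.org/officeDocument/2006/"
    "relationships/officeDocument"
)


def main_document_paths(zf: zipfile.ZipFile) -> Tuple[str, str]:
    """Return (main part, its .rels file) as zip entry names."""
    root = ET.fromstring(zf.read("_rels/.rels"))
    for rel in root.iter("{%s}Relationship" % REL_NS):
        if rel.get("Type") == OFFICE_DOC_TYPE:
            # Zip entry names carry no leading slash, so drop one if present.
            target = rel.get("Target", "").lstrip("/")
            directory, filename = posixpath.split(target)
            # Assumption: the .rels name is derived from the part name.
            rels = posixpath.join(directory, "_rels", filename + ".rels")
            return target, rels
    raise ValueError("no officeDocument relationship in _rels/.rels")


def fake_docx(target: str) -> bytes:
    """Build a stand-in archive whose _rels/.rels points at `target`."""
    rels_xml = (
        '<Relationships xmlns="%s">'
        '<Relationship Id="rId1" Type="%s" Target="%s"/>'
        "</Relationships>" % (REL_NS, OFFICE_DOC_TYPE, target)
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels_xml)
        zf.writestr(target.lstrip("/"), "<document/>")
    return buf.getvalue()


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(fake_docx("word/document2.xml"))) as zf:
        print(main_document_paths(zf))
```

Using `posixpath` rather than `os.path` keeps the derived names in forward-slash form on every platform, which is what zip entry names use.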

@jgm
Owner

jgm commented Feb 6, 2019

Interesting. I wonder why online Office does that?
There are two places in src/Text/Pandoc/Readers/Docx/Parse.hs where these paths are hardcoded:

Parse.hs
366:  entry <- maybeToD $ findEntryByPath "word/document.xml" zf
480:filePathToRelType "word/_rels/document.xml.rels"  = Just InDocument

For a quick fix we could just have it check in both places for document2.xml as well.

For a more robust fix, we should get these names from _rels/.rels.
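
As an illustration only (in Python rather than the Haskell of Parse.hs), the quick fix amounts to trying a fixed list of candidate names in order. `CANDIDATES`, `find_main_document`, and `archive_with` are hypothetical names for this sketch.

```python
import io
import zipfile
from typing import Iterable, Optional

# Names to try, in order of preference (the two seen in this thread).
CANDIDATES = ("word/document.xml", "word/document2.xml")


def find_main_document(zf: zipfile.ZipFile) -> Optional[str]:
    """Return the first candidate present in the archive, or None."""
    names = set(zf.namelist())
    for candidate in CANDIDATES:
        if candidate in names:
            return candidate
    return None


def archive_with(names: Iterable[str]) -> zipfile.ZipFile:
    """Build an in-memory stand-in archive containing the given entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "<x/>")
    return zipfile.ZipFile(io.BytesIO(buf.getvalue()))


if __name__ == "__main__":
    print(find_main_document(archive_with(["word/document2.xml"])))
```

The limitation is the one already noted: any third naming scheme would fall through to `None`, which is why reading the name from _rels/.rels is the more robust route.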

@jgm
Owner

jgm commented Feb 6, 2019

I'll implement the quick fix now.
@agusmba do you want to work on the more robust fix?

jgm added a commit that referenced this issue Feb 6, 2019
For some reason, Word in Office 365 Online uses `document2.xml`
for the content, instead of `document.xml`.  This causes pandoc
not to be able to parse docx.

This quick fix has the parser check for both `document.xml`
and `document2.xml`.

Addresses #5277, but a more robust solution would be to
get the name of the main document dynamically (who knows
whether it might change again?).
@agusmba
Contributor

agusmba commented Feb 6, 2019

I could take a look, but with some big caveats: I need to study some Haskell first (I'm not too proud of my latest PR code), and I don't have as much free time anymore (I had some unexpected free time in January).
On the plus side, if I were able to understand the docx reader, I could later on try to improve the reading of properties...
I could put this on my radar, but I wouldn't expect anything soon, so if anyone else wants to have a go at it, [s]he'd be welcome.

@jgm
Owner

jgm commented Feb 6, 2019

@jkr wrote the docx reader and might be able to fix it easily.

@jkr
Collaborator

jkr commented Feb 7, 2019

Okay, done. Will upload as soon as stack is done stacking.

@jkr jkr closed this as completed in 4cce0ef Feb 7, 2019
@agusmba
Contributor

agusmba commented Feb 7, 2019

Thanks jkr!
I had a question on the commit, but it may be nothing.

@jkr
Collaborator

jkr commented Feb 7, 2019

@agusmba : Good point -- thanks! Fixed with 9ff4042

@jgm
Owner

jgm commented Feb 7, 2019

@jkr we're getting a "couldn't parse docx file" failure in the test suite on appveyor.
https://ci.appveyor.com/project/jgm/pandoc/build/job/q2gmkgiic8lj8xm5
The same commit seems to build on linux. Can you think of any reason why this would be happening?
Something Windows specific?

@jkr
Collaborator

jkr commented Feb 7, 2019

If there's (a) an error only on Windows, and (b) the change had to do with zip path extraction, I'd guess it has to do with OS-specific file-path separators (filepath on Windows wants backslashes, but zip files use forward slashes regardless of OS). In particular, I'd guess that it has to do with takeFileName at Parse.hs line 508.

I can't test on Windows though. Should I try a more simple-minded path separation and upload it to see what appveyor says?

@jkr
Collaborator

jkr commented Feb 7, 2019

If that is the problem, it would probably also work to do a qualified import of System.FilePath.Posix and use the takeFileName from there.

I'm focusing on that function, because I think it's the only separator-related function from System.FilePath that I introduced here.
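
To illustrate the separator concern in general terms, here is a small Python analogy. Python's `ntpath` and `posixpath` modules stand in for Windows-style and POSIX-style path handling; this says nothing definite about how Haskell's `System.FilePath` behaves, and the function names are hypothetical.

```python
import ntpath
import posixpath


def rels_entry_posix(part: str) -> str:
    """Derive the .rels entry name with forward slashes (zip convention)."""
    directory, filename = posixpath.split(part)
    return posixpath.join(directory, "_rels", filename + ".rels")


def rels_entry_windows(part: str) -> str:
    """Same derivation with Windows path rules; joins use backslashes."""
    directory, filename = ntpath.split(part)
    return ntpath.join(directory, "_rels", filename + ".rels")


if __name__ == "__main__":
    part = "word/document2.xml"
    print(rels_entry_posix(part))
    print(rels_entry_windows(part))
    # Taking the file name agrees under both rule sets, because
    # ntpath accepts '/' as a separator when splitting.
    print(posixpath.basename(part) == ntpath.basename(part))
```

In this analogy, splitting off the file name gives the same result under both rule sets; it is joining components that introduces backslashes, which would not match a zip entry name.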

@jkr
Collaborator

jkr commented Feb 7, 2019

I just pushed the Posix version, and will see how appveyor likes it.

@jkr
Collaborator

jkr commented Feb 7, 2019

Hmmm... that wasn't it. Reverted.

@jgm jgm reopened this Feb 8, 2019
@jgm
Owner

jgm commented Feb 8, 2019

Reopening this so we can track it until the Windows issue is fixed...

@mb21
Collaborator

mb21 commented Feb 8, 2019

@jkr you can also make a pull request to get an appveyor build, so you don't have to push to master every time ;-)

@jkr
Collaborator

jkr commented Feb 8, 2019

Okay -- I'm mystified. This complaint starts appearing with 1847bdb.

Which is just the initial fix in 4cce0ef (no path manipulation) and the test cases.

I'm curious about whether it's something weird with the test case file. I'll try just reverting the test case commit and submitting that as a PR...

@agusmba
Contributor

agusmba commented Feb 8, 2019

@jkr I can do tests on windows, let me know if I can help.
When I get some free time at home I could try to download the appveyor build and try to execute your test case, if needed I could also build pandoc locally just in case.

@jkr
Collaborator

jkr commented Feb 8, 2019

@mb21: actually, is that true? I submitted #5284 and I don't see an appveyor build on there.

@agusmba: that would be great. Could you try #5284?

@agusmba
Contributor

agusmba commented Feb 8, 2019

yeah, I think appveyor only kicks in on jgm:master.
And unfortunately it seems appveyor only saves artifacts if the tests are ok, so I'd need to build pandoc locally to test your changes.

I have tested with jgm's fix and it works:

$ pandoc -t native alternate_document_path.docx
[Header 1 ("test",[],[]) [Str "Test"]
,Para [Str "This",Space,Str "is",Space,Emph [Str "italic"],Str ",",Space,Strong [Str "bold"],Str ",",Space,Span ("",["underline"],[]) [Str "underlined"],Str ",",Space,Emph [Span ("",["underline"],[]) [Str "italic",Space,Str "underlined"]],Str ",",Space,Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "underlined"]],Str ",",Space,Emph [Strong [Span ("",["underline"],[]) [Str "bold",Space,Str "italic",Space,Str "underlined"]]],Str "."]]

I'll post again once I build and test with your PR.

@jkr
Collaborator

jkr commented Feb 8, 2019

@agusmba: great -- thanks! If #5284 succeeds, we're good.

If it fails, could you try with #5283?

@agusmba
Contributor

agusmba commented Feb 8, 2019

Good news, @jkr: #5284 works on my box (I also confirmed that current master doesn't, to make sure).

@jkr jkr closed this as completed in b3d015e Feb 8, 2019
@jkr
Collaborator

jkr commented Feb 8, 2019

@jgm: there was still a failure in the appveyor build, but it seems independent of this issue. x86_64 passed, but i386 failed on building basement.

@jkr
Collaborator

jkr commented Feb 9, 2019

Actually, @jgm, do you think this (second) fix should be moved to zip-archive? I've poked around a bit and I've seen the rare reference to it as a bug under Windows' handling of paths with a leading slash: e.g. https://superuser.com/a/415972. I can put a bug report or PR up there if you'd like.
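
A minimal Python sketch of the leading-slash mismatch, under the assumption that a relationship target is written as `/word/document2.xml` while the zip entry itself is stored as `word/document2.xml`. `read_part` and `sample_archive` are hypothetical helpers, and Python's `zipfile` is used here purely as an illustration, not as a model of zip-archive.

```python
import io
import zipfile


def read_part(zf: zipfile.ZipFile, target: str) -> bytes:
    """Read a part whose target may start with '/', which zip names lack."""
    return zf.read(target.lstrip("/"))


def sample_archive() -> bytes:
    """Build a stand-in archive with one entry stored without a leading slash."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document2.xml", "<document/>")
    return buf.getvalue()


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(sample_archive())) as zf:
        try:
            zf.read("/word/document2.xml")  # exact-name lookup fails
        except KeyError as err:
            print("raw lookup failed:", err)
        print(read_part(zf, "/word/document2.xml"))
```

Whether the normalisation belongs in the caller or in the archive library is the design question raised in this comment; the sketch only shows the caller-side variant.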

@jgm
Owner

jgm commented Feb 9, 2019 via email

@kstar971

Hi all, thanks for the ideas.

I had the same issue with pandoc 2.7.3 - Compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1 on win10 x64

Issue was due to two different footers in the docx file.

  1. Changing the .docx to .zip extension
  2. opening with windows file explorer
  3. deleting the footer2.xml file + save
  4. restoring docx extension
  5. pandoc conversion gives no more error :)
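
The manual steps above could be scripted roughly as follows. This is a sketch only: `copy_without` and `sample_docx` are hypothetical names, and the copy simply drops one entry, so any relationship that still points at the removed part is left dangling, just as with the manual edit.

```python
import io
import zipfile


def copy_without(docx_bytes: bytes, drop: str) -> bytes:
    """Return a copy of the archive with the entry named `drop` left out."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename != drop:
                dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue()


def sample_docx() -> bytes:
    """Build a stand-in archive with two footer parts (not a valid docx)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("word/document.xml", "word/footer1.xml", "word/footer2.xml"):
            zf.writestr(name, "<x/>")
    return buf.getvalue()


if __name__ == "__main__":
    trimmed = copy_without(sample_docx(), "word/footer2.xml")
    with zipfile.ZipFile(io.BytesIO(trimmed)) as zf:
        print(zf.namelist())
```

The entry path `word/footer2.xml` is an assumption about where the footer lives in the archive; the comment only names the file.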

Issue with managing more than one xml file of same type (document[x].xml, footer[x].xml) ??

Kstar
