Trivial docx file fails to be parsed with 'couldn't parse docx file' error #5277
Comments
We should probably do better error reporting here: https://github.com/jgm/pandoc/blob/master/src/Text/Pandoc/Readers/Docx.hs#L119 |
I can confirm the problem with the file on Windows 7 (pandoc 2.6):
If I open the docx in word and save it, pandoc works:
I'll try to see if there is any glaring difference at the xml level between the two... I don't know if pandoc is hardcoded to read |
Thanks for the updates on this - I just double-checked and I got my wires crossed with another dev that reported this issue internally... The attached docx file on this ticket is actually from Microsoft Word Online (part of Office 365 online). It seems Word Online generates docx files with an internal I've never seen a (I verified the internal difference by simply saving a docx with the word "Test" in it on both Word Online and Word for Mac) |
@agusmba This comment might be helpful for a method of avoiding a hard-coded check for |
Confirmed, I renamed internally
So if this is the default naming for Office365 (it seems to be, I just created a small online document that exhibits this issue), we'd need to support this variability in the filenames. |
it seems they have the same problem. However their solution is not complete yet (I commented on it). I guess we'd need to start from |
Interesting. I wonder why online Office does that?
For a quick fix we could just have it check in both places for For a more robust fix, we should get these names from |
I'll implement the quick fix now. |
For some reason, Word in Office 365 Online uses `document2.xml` for the content, instead of `document.xml`. This causes pandoc not to be able to parse docx. This quick fix has the parser check for both `document.xml` and `document2.xml`. Addresses #5277, but a more robust solution would be to get the name of the main document dynamically (who knows whether it might change again?).
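To make the two options concrete, here is a minimal sketch in Python (not pandoc's Haskell code). The quick fix tries both known names. The more robust variant assumes the file follows the Open Packaging Conventions, where `_rels/.rels` holds a relationship of type `officeDocument` pointing at the main part. The function names and the `make_package` fixture are hypothetical and exist only for illustration.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Relationship type that marks the main document part in an OPC package.
OFFICE_DOCUMENT_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"
)
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def find_main_document_quick(archive):
    """Quick fix: try the two names seen so far, in order."""
    names = set(archive.namelist())
    for candidate in ("word/document.xml", "word/document2.xml"):
        if candidate in names:
            return candidate
    return None


def find_main_document_from_rels(archive):
    """More robust: ask the package relationships which part is the main one."""
    try:
        root = ET.fromstring(archive.read("_rels/.rels"))
    except KeyError:
        return None  # no package-level relationships file
    for rel in root.iter(RELS_NS + "Relationship"):
        target = rel.get("Target")
        if rel.get("Type") == OFFICE_DOCUMENT_REL and target:
            # Targets may carry a leading slash; zip entry names do not.
            return target.lstrip("/")
    return None


def make_package(main_part, target):
    """Hypothetical fixture: an in-memory package with one main part."""
    rels = (
        '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOCUMENT_REL}" Target="{target}"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr(main_part, "<document/>")
    buf.seek(0)
    return zipfile.ZipFile(buf)
```

With a package whose main part has some third name, the quick check returns `None` while the relationship lookup still finds it, which is the "who knows whether it might change again?" case.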
I could take a look, but with some big caveats: I need to study some Haskell first (I'm not too proud of my latest PR code), and I don't have as much free time anymore (I had some unexpected free time in January). |
@jkr wrote the docx reader and might be able to fix it easily. |
Okay, done. Will upload as soon as stack is done stacking. |
Thanks jkr! |
@jkr we're getting a "couldn't parse docx file" failure in the test suite on appveyor. |
If there's (a) an error only on windows, and (b) the change had to do with zip path extraction, I'd guess it has to do with os-specific file-path separators. (filepath on windows wants backslashes, but zip files use forward slashes regardless of OS). In particular, I'd guess that it has to do with I can't test on Windows though. Should I try doing a more simple-minded path separation and upload it to see what appveyor says? |
If that is the problem, it would probably also work to do a qualified import of System.FilePath.Posix and use the I'm focusing on that function, because I think it's the only separator-related function from System.FilePath that I introduced here. |
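For readers who want to see the separator hypothesis in isolation, here is a small Python sketch (an analogy, not the pandoc code). Python's `posixpath` and `ntpath` modules play roughly the role of `System.FilePath.Posix` and the Windows variant: both are importable on any OS, so the mismatch can be reproduced without a Windows machine. The helper names and the entry name are made up for the example.

```python
import io
import ntpath
import posixpath
import zipfile


def build_archive(names):
    """Hypothetical helper: an in-memory zip with empty entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    buf.seek(0)
    return zipfile.ZipFile(buf)


def has_entry(archive, directory, name, join):
    """Look up directory/name, building the path with the given join function."""
    return join(directory, name) in archive.namelist()


archive = build_archive(["word/document.xml"])
# Zip entry names always use forward slashes, so the POSIX join matches.
print(has_entry(archive, "word", "document.xml", posixpath.join))  # True
# The Windows join produces "word\\document.xml", which is not an entry name.
print(has_entry(archive, "word", "document.xml", ntpath.join))     # False
```

This only shows why OS-native joins are risky for paths inside an archive; it does not establish that this was the cause of the appveyor failure.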
I just pushed the Posix version, and will see how appveyor likes it. |
Hmmm... that wasn't it. Reverted. |
Reopening this so we can track it until the Windows issue is fixed... |
@jkr you can also make a pull request to get an appveyor build, so you don't have to push to master every time ;-) |
@jkr I can do tests on windows, let me know if I can help. |
yeah, I think appveyor only kicks in on jgm:master. I have tested with jgm's fix and it works:
I'll post again once I build and test with your PR. |
@jgm: there was still a failure in the appveyor build, but it seems independent of this issue. x86_64 passed, but i386 failed on building basement. |
Actually, @jgm, do you think this (second) fix should be moved to |
Jesse Rosenthal <notifications@github.com> writes:
Actually, @jgm, do you think this (second) fix should be moved to `zip-archive`? I've poked around a bit and I've seen the rare reference to it as a bug under Windows' handling of paths with a leading slash: e.g. https://superuser.com/a/415972. I can put a bug report or PR up there if you'd like.
Sure, if it affects every zip file, then it's probably worth putting there. |
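As a rough illustration of the leading-slash problem, here is a Python sketch (again an analogy, under the assumption that the path in question looks like `/word/document2.xml`, which is a hypothetical example). Normalizing such a path with Windows rules rewrites the separators, so stripping a leading `/` afterwards no longer yields a valid zip entry name. Normalizing with POSIX rules on every platform avoids that. `to_entry_name` is an illustrative name, not a `zip-archive` function.

```python
import ntpath
import posixpath


def to_entry_name(path):
    """Turn a package-absolute path into a zip entry name, using POSIX rules
    regardless of the host OS."""
    return posixpath.normpath(path).lstrip("/")


path = "/word/document2.xml"  # hypothetical path with a leading slash

# POSIX normalization keeps forward slashes, so the slash can be stripped.
print(to_entry_name(path))                # word/document2.xml
# Windows normalization turns every "/" into "\", so the result can no
# longer be matched against entry names inside the archive.
print(repr(ntpath.normpath(path)))        # '\\word\\document2.xml'
```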
Hi all, thanks for the ideas. I had the same issue with pandoc 2.7.3 (compiled with pandoc-types 1.17.5.4, texmath 0.11.2.2, skylighting 0.8.1) on Win10 x64. The issue was due to two different footers in the docx file.
Is there an issue with handling more than one XML file of the same type (document[x].xml, footer[x].xml)? Kstar |
Pandoc fails to parse the attached trivial docx file (generated from Microsoft Word Online). Microsoft Word for Mac v16.21, Word Online (i.e. part of Office 365 online) and Pages for Mac all open the file without reporting any errors.
Given how trivial the file's contents are, I would expect Pandoc to parse this file without a problem.
trivial.docx
Pandoc version: 2.6 (also reproducible with 1.15.0.6, which is an old version that is used by a codebase I work on)
Command used with v2.6:
Command-line output (v2.6):
Command used with v1.15.0.6:
Command-line output (v1.15.0.6):