Editing reference.docx with Word for Mac 2011 means pandoc generates malformed docx #3322

Closed
rbubley opened this issue Dec 20, 2016 · 23 comments


@rbubley

rbubley commented Dec 20, 2016

I created a reference.docx with
pandoc --print-default-data-file reference.docx > reference.docx.
Using this to generate a document with pandoc works fine.

I then opened the reference.docx file, dirtied it (by replacing the 'e' in Hello World with another 'e'), and resaved it. (reference.docx)
Now generating a document with pandoc produces a malformed docx (test.docx): when opening in Word it says, "The Open-XML file test.docx cannot be opened because there are problems with the contents or the file name might contain invalid characters (for example, /). Details: Microsoft Office cannot open this file because some parts are missing or invalid."

@jgm
Owner

jgm commented Dec 20, 2016 via email

@rbubley
Author

rbubley commented Dec 21, 2016

pandoc 1.19.1
Compiled with pandoc-types 1.17.0.4, texmath 0.9, highlighting-kate 0.6.3

@richarddb

I have a variation on the same problem I think.

$ pandoc --version
pandoc 1.19.2.1
Compiled with pandoc-types 1.17.0.5, texmath 0.9, skylighting 0.1.1.4

Using Microsoft Word for Mac, Version 15.32 (the latest).

Opening the generated file ("attempt to recover"), nothing seems obviously amiss.

I extracted the reference.docx contents before/after to compare. Unfortunately, all files essentially have changed (beyond whitespace), so it's not obvious to me what's going on in the contents. I did see that Word removed the file "footnotes.xml.rels", but added "endnotes.xml" instead.

I've attached the modified reference.docx.
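
The before/after comparison described in this comment can be automated. What follows is only a minimal sketch, assuming both versions of reference.docx are on disk: a .docx is a ZIP archive, so the standard `zipfile` module can list which members were removed, added, or changed. The function name `diff_docx_members` is hypothetical and is not part of pandoc.

```python
import zipfile


def diff_docx_members(before_path, after_path):
    """Compare two .docx archives member by member.

    Returns three sorted lists: members only in the first archive (removed),
    members only in the second (added), and members present in both whose
    CRC-32 checksums differ (changed).
    """
    with zipfile.ZipFile(before_path) as before, zipfile.ZipFile(after_path) as after:
        before_names = set(before.namelist())
        after_names = set(after.namelist())
        # Compare checksums stored in the ZIP directory; no extraction needed.
        changed = sorted(
            name
            for name in before_names & after_names
            if before.getinfo(name).CRC != after.getinfo(name).CRC
        )
    removed = sorted(before_names - after_names)
    added = sorted(after_names - before_names)
    return removed, added, changed
```

Since Word rewrites nearly every part on save, the `changed` list will probably be long, as reported here; the `removed` and `added` lists are likely the more useful signal.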

@jgm
Owner

jgm commented Mar 31, 2017

@richarddb You shouldn't use footnotes or anything complex in your reference.docx.
Try using a file with simple contents ("hello world") and modify the styles only.
[EDIT: Never mind: I see that even the basic one has footnotes.xml.]

@jgm
Owner

jgm commented Mar 31, 2017

@richarddb I tried creating a document using your attached reference.docx as a reference doc. The document pandoc produced opened fine with MS Word for Mac 15.31.
Can you be more explicit about how to reproduce the problem you're seeing?

@jgm
Owner

jgm commented Mar 31, 2017

I couldn't reproduce what @rbubley reports, either.

@richarddb

Hmmmm... Mysterious. I've created a short video capture of the process I used, and done so on some isolated example files that may help you reproduce. I've put it all on Dropbox here: https://www.dropbox.com/sh/7yg4r5dqkhna047/AADFvmb4YkUcvwHzKB25YTipa?dl=0

Please give me a shout if there's anything you'd like me to do to debug further.

@richarddb

I also tried opening the file in Word for Windows. That's at least more helpful in giving an error message. See the screenshot Windows Word Error Message.png -- but it's basically complaining about the "Endnotes 1".
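
One way to narrow down a complaint like this is to check whether every part the package declares actually exists in the archive. The sketch below is a diagnostic under stated assumptions, not the confirmed cause of this issue: it reads `[Content_Types].xml` (part of the Open Packaging Conventions used by .docx) and reports each `Override` entry whose `PartName` has no matching ZIP member. The function name `missing_declared_parts` is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of [Content_Types].xml in the Open Packaging Conventions.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"


def missing_declared_parts(docx_path):
    """Return declared part names that have no corresponding ZIP member."""
    with zipfile.ZipFile(docx_path) as archive:
        names = set(archive.namelist())
        root = ET.fromstring(archive.read("[Content_Types].xml"))
    missing = []
    for override in root.iter(CT_NS + "Override"):
        part = override.get("PartName", "")
        # Part names are written with a leading slash; ZIP members are not.
        if part.lstrip("/") not in names:
            missing.append(part)
    return sorted(missing)
```

An empty result does not prove the file is valid; it only rules out one class of "parts are missing" problems.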

@andrewderrington

andrewderrington commented Apr 5, 2017

I have the same problem: if I edit reference.docx MSWord claims that the .docx file produced by pandoc is corrupt. It will then say that there is readable content in the file and I can open the file, rename it and save it and work on it.

I want to edit reference.docx to change all the styles but a very simple edit - removing the period from "Hello World" produces the same effect.
I am using pandoc 1.19.2.1 and Word 15.32. I attach two versions of reference.docx: the original works fine; the other (no period) produces a corrupt output file.

reference.docx
reference.docx.original.docx

@andrewderrington

andrewderrington commented Apr 9, 2017

Version of MS Word is not critical, nor is changing the file.

I get the same behaviour with MS Word 2011 (Version 14.7.2) and the latest MS Word (Version 15.32).
Forcing a save without having made any changes generates the corrupt reference.docx.

I attach the two corrupt reference.docx files.

reference.docx.Word14.7.2.docx
reference.docx.Word15.32.docx

@jgm
Owner

jgm commented Apr 9, 2017

@andrewderrington I can use your "corrupt" reference.docx to produce a docx using pandoc that Word opens without problems. Perhaps it matters what is in the file you're converting using this reference docx? (I tried it on the pandoc MANUAL.txt.)
(Word 14.7.1 for Mac, pandoc 1.19.2.1 + 1.17.2 + dev version)

@andrewderrington

andrewderrington commented Apr 9, 2017 via email

@jgm
Owner

jgm commented Apr 9, 2017

@andrewderrington The file I converted was MANUAL.txt from the pandoc repository.
If you'd like to link to or send the input file you used, the full pandoc command line, and the result, perhaps that would help to diagnose this.

@andrewderrington

I will try manual.txt

Here's my command line, my input file and my output file.

pandoc --filter pandoc-citeproc -s -S --normalize -f markdown -t docx -o zzz.docx zzz.txt
zzz.txt
zzz.docx

@andrewderrington

andrewderrington commented Apr 9, 2017 via email

@andrewderrington

andrewderrington commented Apr 9, 2017 via email

@andrewderrington

andrewderrington commented Apr 9, 2017 via email

@jgm
Owner

jgm commented Apr 9, 2017

@andrewderrington when I try that command

/usr/local/bin/pandoc --filter /usr/local/bin/pandoc-citeproc  -s -S --normalize  -f markdown -t docx -o zzz.docx zzz.txt

on zzz.txt with the contents at the end of your post above, I get zzz.docx which opens without problems in Word. Note that your command doesn't at any point call for using an alternative reference.docx. But when I also added

--reference-docx ~/Downloads/reference.docx.Word15.32.docx

using the "corrupt" reference.docx you uploaded earlier, I also got a docx which opened without problems in Word.

I could not open the zzz.docx you attached, however.

@andrewderrington

andrewderrington commented Apr 10, 2017

OK, I can reproduce that behaviour and my behaviour.

With the reference.docx files I sent you, which have been saved but not edited, if I refer to them explicitly by including "--reference-docx pathname" in the command, I get an openable .docx file, but if I rename them as reference.docx and place them in the pandoc directory I get errors.

It turns out that if the reference.docx file is called reference.docx and is stored in the pandoc directory I get an unreadable output file whether or not I refer to it explicitly. It's OK to call it reference.docx if it's stored elsewhere.

Here are the commands:-

These ones produce readable .docx files:-
pandoc -s -S --normalize -f markdown -t docx -o zzz.14.7.2.docx zzz.txt --reference-docx ~/.pandoc/reference.docx.Word14.7.2.docx
pandoc -s -S --normalize -f markdown -t docx -o zzz.15.32.docx zzz.txt --reference-docx ~/.pandoc/reference.docx.Word15.32.docx
pandoc -s -S --normalize -f markdown -t docx -o zzz.reference-elsewhere.docx zzz.txt --reference-docx ~/Dogs/reference.docx

And these ones produce unreadable .docx files.
pandoc -s -S --normalize -f markdown -t docx -o zzz.15.32.noref.docx zzz.txt
pandoc -s -S --normalize -f markdown -t docx -o zzz.14.7.2.noref.docx zzz.txt
pandoc -s -S --normalize -f markdown -t docx -o zzz.reference-explicit.docx zzz.txt --reference-docx ~/.pandoc/reference.docx

zzz.14.7.2.noref.docx
zzz.15.32.noref.docx
zzz.15.32.docx
zzz.14.7.2.docx
zzz.reference-elsewhere.docx
zzz.reference-explicit.docx
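
Because the trigger reported here is the mere presence of a file named reference.docx in the pandoc directory, it can help to check for that file before converting. This is a minimal sketch, assuming the user data directory is `~/.pandoc` as in the commands above; the function name `default_reference_docx` is hypothetical and the check does not call pandoc at all.

```python
from pathlib import Path


def default_reference_docx(data_dir=None):
    """Return the path to reference.docx in the pandoc user data directory.

    Returns None when no such file exists. `data_dir` defaults to ~/.pandoc,
    which is an assumption based on the paths used in this thread.
    """
    base = Path(data_dir) if data_dir is not None else Path.home() / ".pandoc"
    candidate = base / "reference.docx"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = default_reference_docx()
    if found is None:
        print("No reference.docx in the default data directory.")
    else:
        print(f"Found {found}; it will be picked up even without --reference-docx.")
```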

@andrewderrington

andrewderrington commented Apr 10, 2017

It took me a long time to work the above out because there seems to be a memory effect. Once I have produced an unreadable output file, commands and reference.docx files that had previously produced readable .docx files produce unreadable ones. It seems that I can restore the ability to produce readable .docx files by running pandoc without referring to a reference.docx file and without a file called reference.docx in the pandoc directory.

@jgm
Owner

jgm commented Apr 10, 2017

That's very helpful, I think I finally see what is going on here.

@jgm jgm closed this as completed in d4e5fe0 Apr 10, 2017
@ghost

ghost commented Aug 22, 2017

Further to this discussion and resolving this bug:

If I place a modified docx template 'reference.docx' into the .pandoc directory, the resulting Word file will be corrupted. This corruption of the generated output occurs even if I concurrently place another docx template in another directory, and explicitly direct Pandoc to use that template (reference-docx='/Users/MyName/.pandoc/templates/reference.docx').

My current workaround is not to place a 'reference.docx' file into the default directory at all. Rather, I define a 'reference.docx' in a different location. It is fine, for example, in a subdirectory named "Templates" within the .pandoc default directory, or in the same directory as the document being converted by Pandoc. The important thing is not to have a 'reference.docx' in the ~/.pandoc directory.

I hope this is of help.
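
The workaround in this comment could be scripted roughly as follows. It is a sketch only, assuming the layout described here: it moves `reference.docx` out of the top level of the data directory into a subdirectory, so the file then has to be passed explicitly with `--reference-docx`. The function name `relocate_reference_docx` and the default subdirectory name are hypothetical choices.

```python
import shutil
from pathlib import Path


def relocate_reference_docx(data_dir, subdir="templates"):
    """Move data_dir/reference.docx into data_dir/subdir/.

    Returns the new path, or None if there was nothing to move.
    Refuses to overwrite an existing file at the destination.
    """
    data_dir = Path(data_dir)
    source = data_dir / "reference.docx"
    if not source.is_file():
        return None
    target_dir = data_dir / subdir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "reference.docx"
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting")
    shutil.move(str(source), str(target))
    return target
```

For example, `relocate_reference_docx(Path.home() / ".pandoc")` would return the relocated path, which can then be given to pandoc on the command line.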

@mb21
Collaborator

mb21 commented Aug 23, 2017

@talazem This is probably fixed in pandoc 2.0, which is currently only available in the nightly-builds.
