Pandoc fails to extract image from docx #7881

Closed
aeb-dev opened this issue Feb 1, 2022 · 24 comments
Labels: format:Docx, platform:windows, status:resolved?, writer

Comments


aeb-dev commented Feb 1, 2022

Explain the problem.
Include the exact command line you used and all inputs necessary to reproduce the issue. Please create as minimal an example as possible, to help the maintainers isolate the problem. Explain the output you received and how it differs from what you expected.
This issue persists

Example file
test.docx

Pandoc version?
What version of pandoc are you using, on what OS?
pandoc.exe 2.17.1.1
Windows 10

aeb-dev added the bug label Feb 1, 2022
Collaborator

tarleb commented Feb 1, 2022

I cannot reproduce this with the given example. pandoc test.docx --extract-media=media adds the image to folder media, as expected.

If this is not what you meant then please provide the exact command used, as stated in the issue template.

Author

aeb-dev commented Feb 1, 2022

This is the command I am using:

pandoc --toc --self-contained --extract-media ./Docs -s -o README.md -f docx -t gfm+gfm_auto_identifiers ./Docs/test.docx

The command generates the md file without failing or giving any warning.

Collaborator

tarleb commented Feb 1, 2022

This works for me as well. What is it that you expect to happen, and what is happening instead?

Owner

jgm commented Feb 1, 2022

Note that --self-contained does not work with markdown output (only HTML).

tarleb added the status:resolved? label and removed the bug label Feb 1, 2022
Author

aeb-dev commented Feb 1, 2022

@tarleb I do not see the image when I render the markdown. I also do not see any extracted image. I should be able to see the extracted images under Docs/media.

@jgm I removed it; that does not change the output.

Author

aeb-dev commented Feb 1, 2022

I ran a couple more tests. The first example file was downloaded from Word for the web. I created the same example in desktop Word, and when I run the same command on it, it works as expected.

Here is the file created by the desktop version:
test2.docx

Author

aeb-dev commented Feb 1, 2022

Test

<img src="./Docs//media/image.png"
style="width:2.89583in;height:2.51042in" />

test

Pandoc generates the above markdown file for the first example file. Please check the src attribute of the img element.

Owner

jgm commented Feb 1, 2022

With your original test.docx, I just tried this with 2.17.1.1 and got:

Test

<img src="./Docs/4403abea626ca1620ed5a0e21c8a2253440788d0.png"
style="width:2.89583in;height:2.51042in" />

test

The image file is present in ./Docs. Everything looks good.
I wonder if you're actually running an old version?

Author

aeb-dev commented Feb 2, 2022

This is the output for pandoc -v
image

Why is the file name of the image so different on your output?
Also are you guys on Windows?

Owner

jgm commented Feb 2, 2022

No, I'm not on Windows; this is likely a subtle issue about Windows paths.

Owner

jgm commented Feb 2, 2022

With your test2.docx, if I just do

pandoc test2.docx --extract-media=foo

then I get

<p>test</p>
<p><img src="foo/media/image1.png"
style="width:2.89583in;height:2.51042in" /></p>
<p>test</p>

and we find image1.png in foo/media.
Can you try with that simple experiment? I want to rule out issues arising from the fact that the source file is in a subdirectory. (That is likely why we got the SHA1 hash above; pandoc tries to preserve original path names when extracting media, but when it encounters .. it does not do this for security reasons.)
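
The naming rule described in this comment can be sketched roughly as follows. This is a minimal illustration in Python, not pandoc's actual (Haskell) implementation. The helper name extracted_name, the treatment of absolute paths as unsafe, and hashing the file contents are all assumptions made for the example; only the general idea (keep the original path when it is safe, otherwise use a SHA1-based name) comes from the comment.

```python
import hashlib
from pathlib import PurePosixPath


def extracted_name(original_path: str, contents: bytes) -> str:
    """Hypothetical helper: pick the name an extracted media file is written under."""
    path = PurePosixPath(original_path)
    # Assumption: absolute paths and any ".." component count as unsafe.
    unsafe = path.is_absolute() or ".." in path.parts
    if unsafe:
        # Fall back to a flat name: SHA1 of the bytes plus the original extension.
        return hashlib.sha1(contents).hexdigest() + path.suffix
    # Safe relative path: keep it, so subdirectories such as media/ survive.
    return str(path)


print(extracted_name("media/image1.png", b"example bytes"))
print(extracted_name("../image.png", b"example bytes"))
```

Under this sketch a fallback name never contains a directory part, which is consistent with the flat, hash-named file reported in ./Docs earlier in the thread.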

Author

aeb-dev commented Feb 2, 2022

test2.docx already works correctly for me, whether it is in a subdirectory or not. I tried test.docx with the following command:

pandoc test.docx --extract-media=foo

No image is extracted and I get the following output:

<p>Test</p>
<p><img src="foo//media/image.png"
style="width:2.89583in;height:2.51042in" /></p>
<p>test</p>

Owner

jgm commented Feb 2, 2022

Oh, sorry, I misunderstood. I understand now that your real issue is with test.docx.

Author

aeb-dev commented Feb 4, 2022

No worries. I am not a native speaker so it could be my writing :)

Do you need any information from me to identify the problem?

My guess is that the double slash in foo//media is the problem.
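
One way a doubled slash like this can arise is plain string joining when the second part already starts with a slash. The snippet below only illustrates that mechanism in Python. The variable names and the leading-slash value of target are assumptions for the example, and it makes no claim about how pandoc actually builds the path.

```python
import posixpath

out_dir = "foo"              # value passed to --extract-media
target = "/media/image.png"  # hypothetical media path with a leading slash

# Naive joining with a separator keeps both slashes.
naive = out_dir + "/" + target
print(naive)  # foo//media/image.png

# Stripping the leading slash first leaves a single separator.
cleaned = posixpath.join(out_dir, target.lstrip("/"))
print(cleaned)  # foo/media/image.png
```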

Owner

jgm commented Feb 4, 2022

I need to get myself a Windows VM to debug this on.

jgm added a commit that referenced this issue Feb 4, 2022
Previously if the directory argument ended in slash,
we'd get a doubled slash in the path.  This may help
with #7881.
Owner

jgm commented Feb 4, 2022

I just pushed a change that may fix the double /. If you could test, that would be great. If you don't have a Haskell toolchain, you can wait for the Windows nightly to be built.

Author

aeb-dev commented Feb 4, 2022

I can test on my computer with the nightly, but your last commit failed.

Owner

jgm commented Feb 4, 2022

I noticed something about test.docx and test2.docx.
In test.docx, in word/_rels/document.xml.rels we have

<Relationship Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="/media/image.png" Id="R7ec72d0878e54e71" />

Note the leading / on the Target.
In test2.docx we have a relative path instead, "media/image1.png".
The leading / is the reason I'm getting a SHA1 hash filename on recent pandoc: pandoc is interpreting it as an absolute path. It's probably also the reason for the doubled //.
So I think my last commit was irrelevant, and I'll revert it.
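
A docx file is a zip archive, so the same check can be repeated on any document without unpacking it by hand. The following is a minimal sketch using only the Python standard library. The function name image_targets is made up for this example, and the script is only a diagnostic aid, not part of pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

RELS_PATH = "word/_rels/document.xml.rels"
# Namespace of the Relationships part in the Open Packaging Conventions.
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
IMAGE_TYPE = "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image"


def image_targets(docx):
    """Return the Target of every image relationship in a .docx (path or file object)."""
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read(RELS_PATH))
    return [
        rel.get("Target")
        for rel in root.iter(RELS_NS + "Relationship")
        if rel.get("Type") == IMAGE_TYPE
    ]


# Example use (file names are placeholders):
# for name in ("test.docx", "test2.docx"):
#     for target in image_targets(name):
#         kind = "leading slash" if target.startswith("/") else "relative"
#         print(name, target, kind)
```

A target that starts with / would be flagged as "leading slash", which is the difference between the two example files noted above.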

Owner

jgm commented Feb 4, 2022

This answers the question of why Windows pandoc isn't using the SHA1 hash.

Prelude System.FilePath.Windows> isRelative "/media/image.png"
True
Prelude System.FilePath.Posix> isRelative "/media/image.png"
False
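
For readers without GHCi at hand, Python's pathlib draws a similar line and can serve as an analogy. This is not the code pandoc runs (pandoc uses Haskell's System.FilePath, as in the session above). It only shows that a path with a leading slash but no drive letter counts as absolute under POSIX rules and not under Windows rules.

```python
from pathlib import PurePosixPath, PureWindowsPath

target = "/media/image.png"

# POSIX rules: a leading slash is enough to make the path absolute.
posix_absolute = PurePosixPath(target).is_absolute()

# Windows rules: without a drive (such as C:) the path is rooted but not absolute.
windows_absolute = PureWindowsPath(target).is_absolute()

print(posix_absolute, windows_absolute)  # True False
```

On Windows such a path is resolved against the current drive, which is why path libraries do not treat it as fully absolute.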

Author

aeb-dev commented Feb 4, 2022

That leading / was my initial thought as well. That is why I referred to issue #7511.

Prelude System.FilePath.Windows> isRelative "/media/image.png"
True

This is also very strange. As someone who develops on Windows, I am pretty sure that path is not relative.

Owner

jgm commented Feb 4, 2022

Well I think I can see how to fix this now!

jgm closed this as completed in d402368 Feb 4, 2022
jgm reopened this Feb 4, 2022
Owner

jgm commented Feb 4, 2022

OK, this still needs testing on your end, but I suspect it will fix the problem.

Author

aeb-dev commented Feb 5, 2022

It is working.

As a side note, when I run the pandoc test.docx --extract-media=foo command, normally it was creating a media folder automatically under foo. Currently it does not do this. It does not matter for me, just fyi.

You can see the difference in behavior by testing test.docx and test2.docx: test.docx puts images under foo, while test2.docx puts them under foo/media.

Owner

jgm commented Feb 5, 2022

normally it was creating a media folder automatically under foo. Currently it does not do this. It does not matter for me, just fyi.

When we use a path based on a sha1 hash, as in this case, it won't have a subdirectory. You only get the subdirectory when we're preserving the original path in the docx container. So this is expected. Glad it's working!

jgm closed this as completed Feb 5, 2022