Wrong image size for pdf images in docx #4322

Closed
timtroendle opened this issue Jan 29, 2018 · 13 comments

Comments

@timtroendle

PDF images in a markdown -> docx workflow using pandoc 1.19.2 end up with the wrong image size and, worse, the wrong aspect ratio. This issue seems closely related to #2720 and #3798.

Assuming there is a PNG and a PDF version of the same image, this simple Markdown file is enough to reproduce the issue:

![PNG for reference.](test.png)

![PDF version of the image.](test.pdf)

Running pandoc test.md -t docx -o test.docx produces a Word document in which the PNG version has the correct aspect ratio and scale, but the PDF version is distorted. (In fact, the PDF version gets the default image size of 300/200 defined in ImageSize.hs.)

I assume this is because image size detection is not yet implemented for PDF images.

I have not found a workaround so far, so I'd be grateful for any hint on improving the current situation, which requires manually correcting all image sizes.

@jgm
Owner

jgm commented Jan 29, 2018 via email

@jgm
Owner

jgm commented Jan 29, 2018

Width and Height in the above seem to be pixels.

Tried it with another PDF generated by latex/tikz -- not originally a bitmap -- and found that this file contains no Image object at all, and no Width or Height. The best bet for that file seems to be the first

/BBox [0 0 355.898 355.026]

Not sure what the units are here. If anyone knows the PDF format and could help, what we're looking for is a simple way to extract image dimensions.

@timtroendle
Author

I had a quick look at my PDF, which was generated by matplotlib 2.1.0. It does not contain any subtype /Image, /Width, /Height, or /BBox, but it does contain a /MediaBox [ 0 0 576 216 ], which might define the size.

%PDF-1.4
%¨‹ ´∫
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
8 0 obj
<< /ExtGState 4 0 R /Font 3 0 R /Pattern 5 0 R
/ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /Shading 6 0 R
/XObject 7 0 R >>
endobj
10 0 obj
<< /Annots [ ] /Contents 9 0 R
/Group << /CS /DeviceRGB /S /Transparency /Type /Group >>
/MediaBox [ 0 0 576 216 ] /Parent 2 0 R /Resources 8 0 R /Type /Page >>
endobj
9 0 obj
<< /Filter /FlateDecode /Length 11 0 R >>

@timtroendle
Author

By the way, in contrast to #2350, I do not see any error message. I think such an error message would be helpful. At the least, it would point to #2350 more directly.

@njbart

njbart commented Jan 30, 2018

Not sure what the units are here.

“The desktop publishing point (DTP point) or PostScript point is defined as 1⁄72 or 0.0138̄ of the international inch, making it equivalent to 352.7̄ µm.” (Wikipedia)
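To make the unit concrete, here is a minimal sketch that converts PDF points to inches and to EMU (English Metric Units, the unit DrawingML uses for image extents in docx files, at 914400 per inch). The function names are made up for illustration, and whether pandoc performs the conversion this way is an assumption. The sample numbers come from the /MediaBox [ 0 0 576 216 ] reported earlier in the thread.

```python
from fractions import Fraction

POINTS_PER_INCH = 72      # PostScript/DTP point: 1/72 of an inch
EMU_PER_INCH = 914_400    # English Metric Units per inch (DrawingML)


def points_to_inches(pt):
    """Convert a length in points to inches as an exact fraction."""
    return Fraction(pt) / POINTS_PER_INCH


def points_to_emu(pt):
    """Convert a length in points to whole EMU (truncating any remainder)."""
    return int(Fraction(pt) * EMU_PER_INCH / POINTS_PER_INCH)


# Size taken from the matplotlib sample: /MediaBox [ 0 0 576 216 ]
width_pt, height_pt = 576, 216
print(points_to_inches(width_pt), points_to_inches(height_pt))  # 8 3
print(points_to_emu(width_pt), points_to_emu(height_pt))        # 7315200 2743200
```

So that figure is 8 in by 3 in, an aspect ratio of 8:3, whereas the 300/200 default has a ratio of 3:2. That mismatch would account for the distortion described in the report.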

pdfinfo and identify (from ImageMagick) can output the size of a PDF on the command line (see https://unix.stackexchange.com/questions/39464/how-to-query-pdf-page-size-from-the-command-line).
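As a rough illustration of the pdfinfo route, the sketch below shells out to pdfinfo and reads its "Page size" line. It assumes pdfinfo (from poppler or xpdf) is on the PATH and that the line has the usual "Page size: W x H pts" shape. The helper names are hypothetical, and the sample output string is illustrative rather than captured from a real run.

```python
import re
import shutil
import subprocess

# pdfinfo normally reports e.g. "Page size:      576 x 216 pts"
PAGE_SIZE = re.compile(r"^Page size:\s+([\d.]+) x ([\d.]+) pts", re.MULTILINE)


def parse_page_size(pdfinfo_output):
    """Extract (width, height) in points from pdfinfo's text output."""
    m = PAGE_SIZE.search(pdfinfo_output)
    if m is None:
        return None
    return (float(m.group(1)), float(m.group(2)))


def pdf_page_size(path):
    """Run pdfinfo on `path`; return None if the tool is not installed."""
    if shutil.which("pdfinfo") is None:
        return None
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_page_size(result.stdout)


# Illustrative output for a 576 x 216 pt page (not from a real run).
sample_output = "Pages:          1\nPage size:      576 x 216 pts\n"
print(parse_page_size(sample_output))
```

This only helps as a manual workaround, for example to compute explicit image sizes outside pandoc; it does not change what pandoc itself detects.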

Note that there are a number of PDF Boxes that might be relevant: mediabox, cropbox, bleedbox, trimbox, artbox (explained here). I guess mediabox might be the one that comes closest to “the” size of a pdf – but your mileage may vary.

jgm added the enhancement label Feb 2, 2018
@jgm
Owner

jgm commented Feb 2, 2018

We could try a rough and ready approach: scan the PDF for the first line like

/MediaBox [x1 y1 x2 y2]

and take the image size to be (x2 - x1, y2 - y1) in points.
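The idea can be sketched in a few lines of Python. This is not pandoc's Haskell implementation, only an illustration of the scan described above; the function name is hypothetical and the pattern deliberately sticks to the literal single-space shape, allowing only an optional padding space inside the brackets.

```python
import re

NUM = rb"(-?\d+(?:\.\d+)?)"
# Literal shape "/MediaBox [x1 y1 x2 y2]" with optional padding inside brackets.
MEDIABOX_LINE = re.compile(
    rb"/MediaBox \[ ?" + NUM + rb" " + NUM + rb" " + NUM + rb" " + NUM + rb" ?\]"
)


def first_mediabox_size(pdf_bytes):
    """Return (width, height) in points from the first matching line, else None."""
    for line in pdf_bytes.splitlines():
        m = MEDIABOX_LINE.search(line)
        if m:
            x1, y1, x2, y2 = (float(g.decode("ascii")) for g in m.groups())
            return (x2 - x1, y2 - y1)
    return None


# Excerpt of the matplotlib-generated PDF quoted earlier in the thread.
sample = (
    b"<< /Annots [ ] /Contents 9 0 R\n"
    b"/MediaBox [ 0 0 576 216 ] /Parent 2 0 R /Resources 8 0 R /Type /Page >>\n"
)
print(first_mediabox_size(sample))  # (576.0, 216.0)
```

A scan like this reads the file as raw bytes, so it can only see a /MediaBox that is stored uncompressed, and it takes the first page's box as the size of the whole image.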

jgm closed this as completed in eeafb3f Feb 2, 2018
@timtroendle
Author

Thanks for the work on this! Unfortunately, this does not seem to solve my issue. I used the nightly build of c74d2064a, which yields exactly the same result as pandoc 2.2.1.

I created a repository with my test case and uploaded an example pdf file as well.

@timtroendle
Author

@jgm, could you please have a quick look at this ^ ?

jgm added a commit that referenced this issue Feb 16, 2018
@jgm
Owner

jgm commented Feb 16, 2018

@timtroendle I see what is happening with your case, and I was able to make the image size extraction more robust, so it is now handled properly. Thanks!

@timtroendle
Author

@jgm yes, that fixes the issue. Thanks a lot!

@richarddavis
Contributor

@jgm I am having the same issue in pandoc 2.7.1. I've created a minimal markdown file and PDF image that should allow you to replicate the bug.

pandoc-docx-pdf-bug.zip

If you run this command inside the folder it should compile the docx with the PDF image at the wrong size:

pandoc -f markdown -t docx -o pdf-bug.docx ./docx-pdf-bug.txt

What's strange is the PDF images are the correct size when I output a PDF file instead of a DOCX.

@jgm
Owner

jgm commented Mar 20, 2019

@richarddavis In this case the imageSize function returns Left "could not determine PDF size".
I think I've fixed it now so that it handles your image.

@richarddavis
Contributor

@jgm This is working great with PDF images created by Illustrator. I also tried it with some other PDF images that I created using Figma, but those had the same issue. I was able to get these PDF sizes recognized by ignoring all of the whitespace in the /MediaBox command. You can see the changes I made in this pull request. Everything is working great on my end now, thanks a bunch!

jgm pushed a commit that referenced this issue Mar 21, 2019
…command (#5383)

This fix ignores all whitespace in the PDF /MediaBox line so that a wider range of PDF sizes can be read. This improves the fix for #4322.
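A whitespace-insensitive match can be sketched as follows. This is a Python illustration of the idea, not the Haskell code from the pull request; the function name is hypothetical, and the spacing variants in the list are made-up examples, since the thread does not show what Figma actually emits.

```python
import re

NUM = rb"([+-]?(?:\d+\.?\d*|\.\d+))"
WS = rb"\s*"  # any run of ASCII whitespace, including newlines, or none
MEDIABOX_ANY_WS = re.compile(
    rb"/MediaBox" + WS + rb"\[" + WS
    + NUM + rb"\s+" + NUM + rb"\s+" + NUM + rb"\s+" + NUM
    + WS + rb"\]"
)


def mediabox_size(pdf_bytes):
    """Return (width, height) in points from the first /MediaBox, else None."""
    m = MEDIABOX_ANY_WS.search(pdf_bytes)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g.decode("ascii")) for g in m.groups())
    return (x2 - x1, y2 - y1)


# Hypothetical spacing variants of the same box.
variants = [
    b"/MediaBox [0 0 576 216]",
    b"/MediaBox[0 0 576 216]",
    b"/MediaBox [ 0 0 576 216 ]",
    b"/MediaBox\n[\n0 0\n576 216\n]",
]
for v in variants:
    print(mediabox_size(v))
```

Searching the whole byte string instead of going line by line is what lets the entry span several lines; the numbers themselves still need at least one whitespace character between them.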