Wrong image size for pdf images in docx #4322

timtroendle · 2018-01-29T17:42:31Z

PDF images in a Markdown -> docx workflow using pandoc 1.19.2 end up with the wrong image size and, unfortunately, even the wrong aspect ratio. This issue seems closely related to #2720 and #3798.

Assuming there is a PNG and a PDF version of the same image, this simple Markdown file is enough to reproduce the issue:

![PNG for reference.](test.png)

![PDF version of the image.](test.pdf)

Running pandoc test.md -t docx -o test.docx on this file produces a Word document in which the PNG version has the correct aspect ratio and scale, but the PDF version is distorted. (In fact, the PDF version gets the default image size of 300/200 defined in ImageSize.hs.)

I assume this is due to the fact that the image size detection is not yet implemented for PDF images.

Since I haven't found a workaround so far, I'd be grateful for any hint on improving the current situation, which requires manually correcting all image sizes.


jgm · 2018-01-29T22:42:06Z

> I assume this is due to the fact that the image size detection is not yet implemented for PDF images.

No doubt that's true. See #2350. I tried converting lalune.jpg to PDF using ImageMagick and looking at the PDF in a text editor. It begins as follows:

```
%PDF-1.3
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
2 0 obj
<< /Type /Pages /Kids [ 3 0 R ] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /Resources << /XObject << /Im0 8 0 R >> /ProcSet 6 0 R >> /MediaBox [0 0 150 150] /CropBox [0 0 150 150] /Contents 4 0 R /Thumb 11 0 R >>
endobj
4 0 obj
<< /Length 5 0 R >>
stream
q
150 0 0 150 0 0 cm
/Im0 Do
Q
endstream
endobj
5 0 obj
31
endobj
6 0 obj
[ /PDF /Text /ImageC ]
endobj
7 0 obj
<< >>
endobj
8 0 obj
<< /Type /XObject /Subtype /Image /Name /Im0 /Filter [ /DCTDecode ] /Width 250 /Height 250 /ColorSpace 10 0 R /BitsPerComponent 8 /Length 9 0 R >>
stream
```

The PDF contains multiple objects -- and of course, in general, a PDF might contain many images -- but for these purposes perhaps it would work if we just scanned forward for the first object with subtype /Image and grabbed the Width and Height? (What are the units here?)
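The scan suggested here can be sketched in a few lines of Python. This is only an illustration, not pandoc's code: the function name `first_image_size` is hypothetical, and the sketch assumes the image dictionary is stored uncompressed, with literal integer /Width and /Height entries that come after /Subtype /Image, as in the ImageMagick output quoted in this comment.

```python
import re
from typing import Optional, Tuple


def first_image_size(pdf_bytes: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) of the first image XObject, or None.

    Hypothetical helper: assumes an uncompressed dictionary in which
    /Width and /Height follow /Subtype /Image as literal integers.
    """
    marker = re.search(rb"/Subtype\s*/Image\b", pdf_bytes)
    if marker is None:
        return None
    # Only inspect the rest of this dictionary, i.e. up to the stream data.
    end = pdf_bytes.find(b"stream", marker.end())
    tail = pdf_bytes[marker.end(): end if end != -1 else len(pdf_bytes)]
    width = re.search(rb"/Width\s+(\d+)", tail)
    height = re.search(rb"/Height\s+(\d+)", tail)
    if width is None or height is None:
        return None
    return int(width.group(1)), int(height.group(1))


# Object 8 from the dump quoted in this comment.
sample = (
    b"8 0 obj << /Type /XObject /Subtype /Image /Name /Im0 "
    b"/Filter [ /DCTDecode ] /Width 250 /Height 250 "
    b"/ColorSpace 10 0 R /BitsPerComponent 8 /Length 9 0 R >> stream"
)
print(first_image_size(sample))  # (250, 250)
```

A PDF with no image object at all makes this sketch return None, so a real implementation would still need another source for the size.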

jgm · 2018-01-29T23:14:57Z

Width and Height in the above seem to be pixels.

Tried it with another PDF generated by latex/tikz -- not originally a bitmap -- and found that this file contains no Image object at all, and no Width or Height. The best bet for that file seems to be the first

/BBox [0 0 355.898 355.026]

Not sure what the units are here. If anyone knows the PDF format and could help, what we're looking for is a simple way to extract image dimensions.

timtroendle · 2018-01-30T09:11:40Z

I had a quick look at my PDF, which is generated by matplotlib 2.1.0. It does not contain any subtype /Image, /Width, /Height, or /BBox, but it does contain a /MediaBox [ 0 0 576 216 ], which might define the size.

```
%PDF-1.4
%¨‹ ´∫
1 0 obj
<< /Pages 2 0 R /Type /Catalog >>
endobj
8 0 obj
<< /ExtGState 4 0 R /Font 3 0 R /Pattern 5 0 R
/ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /Shading 6 0 R
/XObject 7 0 R >>
endobj
10 0 obj
<< /Annots [ ] /Contents 9 0 R
/Group << /CS /DeviceRGB /S /Transparency /Type /Group >>
/MediaBox [ 0 0 576 216 ] /Parent 2 0 R /Resources 8 0 R /Type /Page >>
endobj
9 0 obj
<< /Filter /FlateDecode /Length 11 0 R >>
```

timtroendle · 2018-01-30T09:13:22Z

By the way, in contrast to #2350, I do not see any error message. I think such a message would be helpful. At the very least, it would point users to #2350 more directly.

njbart · 2018-01-30T12:30:14Z

> Not sure what the units are here.

“The desktop publishing point (DTP point) or PostScript point is defined as 1⁄72 or 0.0138̄ of the international inch, making it equivalent to 352.7̄ µm.” (wikipedia)

pdfinfo and identify (from imagemagick) can output the size of a pdf on the command line (see https://unix.stackexchange.com/questions/39464/how-to-query-pdf-page-size-from-the-command-line).

Note that there are a number of PDF Boxes that might be relevant: mediabox, cropbox, bleedbox, trimbox, artbox (explained here). I guess mediabox might be the one that comes closest to “the” size of a pdf – but your mileage may vary.
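As a worked example of the unit, the snippet below converts the /MediaBox [ 0 0 576 216 ] reported earlier in this thread from points to inches, and to English Metric Units (EMU), the unit Office Open XML drawings use for extents (914400 per inch). The helper names are made up for illustration and are not pandoc functions.

```python
POINTS_PER_INCH = 72      # 1 PostScript point = 1/72 inch
EMU_PER_INCH = 914_400    # English Metric Units per inch


def points_to_inches(points: float) -> float:
    return points / POINTS_PER_INCH


def points_to_emu(points: float) -> int:
    # 914400 / 72 = 12700 EMU per point
    return round(points * EMU_PER_INCH / POINTS_PER_INCH)


# /MediaBox [ 0 0 576 216 ] from the matplotlib PDF quoted in this thread
width_pt, height_pt = 576, 216
print(points_to_inches(width_pt), points_to_inches(height_pt))  # 8.0 3.0
print(points_to_emu(width_pt), points_to_emu(height_pt))        # 7315200 2743200
```

So that figure measures 8 in by 3 in, an aspect ratio of about 2.67. A 300/200 fallback box has a ratio of 1.5, which would account for the distortion described in the original report.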

jgm · 2018-02-02T17:09:14Z

We could try a rough and ready approach: scan the PDF for the first line like

/MediaBox [x1 y1 x2 y2]

and take the image size to be (x2 - x1, y2 - y1) in points.
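A minimal Python sketch of this rough-and-ready scan follows. It is an illustration only, not the code that went into pandoc: the function name `mediabox_size` is hypothetical, and the sketch assumes the page dictionary is uncompressed and that the whole /MediaBox entry sits on one line.

```python
from typing import Optional, Tuple


def mediabox_size(pdf_bytes: bytes) -> Optional[Tuple[float, float]]:
    """Size in points from the first '/MediaBox [x1 y1 x2 y2]' line, else None."""
    for line in pdf_bytes.splitlines():
        start = line.find(b"/MediaBox")
        if start == -1:
            continue
        open_br = line.find(b"[", start)
        close_br = line.find(b"]", open_br)
        if open_br == -1 or close_br == -1:
            continue
        fields = line[open_br + 1:close_br].decode("latin-1").split()
        try:
            x1, y1, x2, y2 = (float(f) for f in fields)
        except ValueError:
            continue  # not exactly four numbers; keep scanning
        return (x2 - x1, y2 - y1)
    return None


# Lines taken from the two PDFs quoted in this thread.
print(mediabox_size(b"/MediaBox [0 0 150 150]"))  # (150.0, 150.0)
print(mediabox_size(
    b"/MediaBox [ 0 0 576 216 ] /Parent 2 0 R /Resources 8 0 R /Type /Page >>"
))  # (576.0, 216.0)
```

Taking the first match means a multi-page PDF is sized by whichever /MediaBox appears first in the file, which is acceptable for single-figure PDFs but is a simplification.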

timtroendle · 2018-02-05T11:29:45Z

Thanks for the work on this! Unfortunately, it does not seem to solve my issue. I used the nightly build of c74d2064a, which yields exactly the same result as pandoc 2.2.1.

I created a repository with my test case and uploaded an example pdf file as well.

timtroendle · 2018-02-16T19:01:35Z

@jgm, could you please have a quick look at this ^ ?


jgm · 2018-02-16T21:46:12Z

@timtroendle I see what is happening with your case, and I was able to make the image size extraction more robust, so it is now handled properly. Thanks!

timtroendle · 2018-02-20T08:29:38Z

@jgm yes, that fixes the issue. Thanks a lot!

richarddavis · 2019-03-20T08:19:45Z

@jgm I am having the same issue in pandoc 2.7.1. I've created a minimal markdown file and PDF image that should allow you to replicate the bug.

pandoc-docx-pdf-bug.zip

If you run this command inside the folder it should compile the docx with the PDF image at the wrong size:

pandoc -f markdown -t docx -o pdf-bug.docx ./docx-pdf-bug.txt

What's strange is the PDF images are the correct size when I output a PDF file instead of a DOCX.

jgm · 2019-03-20T19:25:26Z

@richarddavis In this case the imageSize function returns Left "could not determine PDF size".
I think I've fixed it now so that it handles your image.


richarddavis · 2019-03-21T00:32:09Z

@jgm This is working great with PDF images created by Illustrator. I also tried it with some other PDF images that I created using Figma, but those had the same issue. I was able to get these PDF sizes recognized by ignoring all of the whitespace in the /MediaBox command. You can see the changes I made in this pull request. Everything is working great on my end now, thanks a bunch!
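To illustrate why whitespace handling matters, here is a hedged Python sketch of a whitespace-tolerant /MediaBox match. It is not the code from the pull request: the name `mediabox_size_tolerant` is hypothetical, and the spacing variants in the list are made up for illustration rather than taken from real Figma or Illustrator files.

```python
import re
from typing import Optional, Tuple

NUMBER = rb"([-+]?\d*\.?\d+)"
# Any amount of whitespace (including newlines) may separate the tokens,
# and none is required between "/MediaBox" and "[".
MEDIABOX = re.compile(
    rb"/MediaBox\s*\[\s*" + rb"\s+".join([NUMBER] * 4) + rb"\s*\]"
)


def mediabox_size_tolerant(pdf_bytes: bytes) -> Optional[Tuple[float, float]]:
    m = MEDIABOX.search(pdf_bytes)
    if m is None:
        return None
    x1, y1, x2, y2 = (float(g.decode("ascii")) for g in m.groups())
    return (x2 - x1, y2 - y1)


# Made-up spacing variants of the same box.
variants = [
    b"/MediaBox [0 0 150 150]",
    b"/MediaBox[0 0 150 150]",
    b"/MediaBox [ 0 0 150 150 ]",
    b"/MediaBox\n[ 0\n0\n150\n150 ]",
    b"/MediaBox [0.0 0.0 150.00 150.00]",
]
for v in variants:
    print(mediabox_size_tolerant(v))  # (150.0, 150.0) each time
```

A matcher that expects one exact spacing would accept only some of these variants, which is the kind of failure described in this comment.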


jgm added the enhancement label Feb 2, 2018

jgm closed this as completed in eeafb3f Feb 2, 2018

jgm added a commit that referenced this issue Feb 16, 2018

Make image size detection for PDFs more robust.

c75740e

See #4322.

jgm added a commit that referenced this issue Mar 20, 2019

Improve pdfSize in ImageSize.

9573141

Improves fix to #4322.

richarddavis mentioned this issue Mar 21, 2019

Improve pdfSize in ImageSize by ignoring all whitespace in /MediaBox command #5383

Merged

jgm pushed a commit that referenced this issue Mar 21, 2019

Improve pdfSize in ImageSize by ignoring all whitespace in /MediaBox …

567a43a

…command (#5383) This fix ignores all whitespace in the PDF /MediaBox line so that a wider range of PDF sizes can be read. This improves fix to #4322.
