Wrong image size for pdf images in docx #4322
Comments
I assume this is due to the fact that the image size detection is
not yet implemented for PDF images.
No doubt that's true. See #2350.
I tried converting lalune.jpg to pdf using ImageMagick, and
looking at the pdf in a text editor. It begins as follows:
```
%PDF-1.3
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
2 0 obj
<<
/Type /Pages
/Kids [ 3 0 R ]
/Count 1
>>
endobj
3 0 obj
<<
/Type /Page
/Parent 2 0 R
/Resources <<
/XObject << /Im0 8 0 R >>
/ProcSet 6 0 R >>
/MediaBox [0 0 150 150]
/CropBox [0 0 150 150]
/Contents 4 0 R
/Thumb 11 0 R
>>
endobj
4 0 obj
<<
/Length 5 0 R
>>
stream
q
150 0 0 150 0 0 cm
/Im0 Do
Q
endstream
endobj
5 0 obj
31
endobj
6 0 obj
[ /PDF /Text /ImageC ]
endobj
7 0 obj
<<
>>
endobj
8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
/Filter [ /DCTDecode ]
/Width 250
/Height 250
/ColorSpace 10 0 R
/BitsPerComponent 8
/Length 9 0 R
>>
stream
```
The PDF contains multiple objects -- and of course, in
general, a PDF might contain many images -- but for these
purposes perhaps it would work if we just scanned forward
for the first object with subtype /Image and grabbed the
Width and Height? (What are the units here?)
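A minimal Python sketch of that scan, for illustration only: it assumes the image dictionary is stored uncompressed, as in the ImageMagick output above, and that `/Width` and `/Height` are direct integers. The function name `first_image_size` is made up here and is not part of pandoc. In an image XObject these two values count samples (pixels), not points.

```python
import re
from typing import Optional, Tuple


def first_image_size(pdf_bytes: bytes) -> Optional[Tuple[int, int]]:
    """Return (width, height) in pixels of the first image XObject, or None."""
    marker = re.search(rb"/Subtype\s*/Image\b", pdf_bytes)
    if marker is None:
        return None
    # Restrict the search to the dictionary holding the marker: from the
    # nearest preceding "<<" up to the "stream" keyword that follows it.
    start = pdf_bytes.rfind(b"<<", 0, marker.start())
    end = pdf_bytes.find(b"stream", marker.end())
    chunk = pdf_bytes[max(start, 0): end if end != -1 else len(pdf_bytes)]
    width = re.search(rb"/Width\s+(\d+)", chunk)
    height = re.search(rb"/Height\s+(\d+)", chunk)
    if width is None or height is None:
        return None
    return int(width.group(1)), int(height.group(1))


# Object 8 from the dump above, reduced to the relevant keys.
sample = (
    b"8 0 obj\n<<\n/Type /XObject\n/Subtype /Image\n/Name /Im0\n"
    b"/Width 250\n/Height 250\n/Length 9 0 R\n>>\nstream\n"
)
print(first_image_size(sample))  # (250, 250)
```

This returns `None` for PDFs with no bitmap image at all, so it cannot be the only strategy.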
Width and Height in the above seem to be pixels. Tried it with another PDF generated by latex/tikz -- not originally a bitmap -- and found that this file contains no Image object at all, and no Width or Height. The best bet for that file seems to be the first
Not sure what the units are here. If anyone knows the PDF format and could help, what we're looking for is a simple way to extract image dimensions.
I had a quick look at my PDF, which is generated from
“The desktop publishing point (DTP point) or PostScript point is defined as 1⁄72 or 0.0138̄ of the international inch, making it equivalent to 352.7̄ µm.” (wikipedia)
Note that there are a number of PDF boxes that might be relevant: MediaBox, CropBox, BleedBox, TrimBox, ArtBox (explained here). I guess MediaBox might be the one that comes closest to “the” size of a PDF, but your mileage may vary.
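To make the unit concrete, here is a small Python sketch of the point arithmetic. The helper names are made up for illustration, and the `dpi` argument is an assumption: the PDF itself does not say at what resolution a box should be rendered.

```python
POINTS_PER_INCH = 72.0  # PostScript/DTP point: 1/72 of an inch
MM_PER_INCH = 25.4      # international inch, exact by definition


def points_to_inches(points: float) -> float:
    return points / POINTS_PER_INCH


def points_to_mm(points: float) -> float:
    return points_to_inches(points) * MM_PER_INCH


def points_to_pixels(points: float, dpi: float) -> float:
    # dpi is chosen by the caller; it is not stored in the box itself.
    return points_to_inches(points) * dpi


# A 150 x 150 pt MediaBox, like the one in the dump earlier in this thread.
side_pt = 150
print(points_to_inches(side_pt), points_to_mm(side_pt), points_to_pixels(side_pt, 96))
```

So a 150 pt box is a little over two inches wide, whatever the pixel dimensions of any bitmap embedded in it.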
We could try a rough and ready approach: scan the PDF for the first line like
and take the image size to be (x2 - x1, y2 - y1) in points.
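A rough Python sketch of this idea, for illustration only. It assumes the line in question is a `/MediaBox [x1 y1 x2 y2]` entry on its own line with single spaces, as in the ImageMagick dump; the name `mediabox_size` is hypothetical, and this is not pandoc's actual code.

```python
import re
from typing import Optional, Tuple

NUMBER = r"[-+]?(?:\d+\.?\d*|\.\d+)"
# Assumed line shape: /MediaBox [x1 y1 x2 y2], single spaces, nothing else.
MEDIABOX_LINE = re.compile(
    r"^/MediaBox \[(%s) (%s) (%s) (%s)\]$" % ((NUMBER,) * 4)
)


def mediabox_size(pdf_bytes: bytes) -> Optional[Tuple[float, float]]:
    """Return (width, height) in points from the first MediaBox line, or None."""
    # latin-1 maps every byte to a character, so binary stream data
    # cannot raise a decoding error.
    for line in pdf_bytes.decode("latin-1").splitlines():
        match = MEDIABOX_LINE.match(line.strip())
        if match:
            x1, y1, x2, y2 = map(float, match.groups())
            return (x2 - x1, y2 - y1)
    return None


page = b"/Type /Page\n/MediaBox [0 0 150 150]\n/CropBox [0 0 150 150]\n"
print(mediabox_size(page))  # (150.0, 150.0)
```

Taking the first match means a multi-page PDF is sized by its first page, which seems acceptable for single-image files.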
Thanks for the work on this! Unfortunately this does not seem to solve my issue. I used the nightly build of I created a repository with my test case and uploaded an example pdf file as well.
@jgm, could you please have a quick look at this ^ ?
@timtroendle I see what is happening with your case, and I was able to make the image size extraction more robust, so it is now handled properly. Thanks!
@jgm yes, that fixes the issue. Thanks a lot!
@jgm I am having the same issue in pandoc 2.7.1. I've created a minimal markdown file and PDF image that should allow you to replicate the bug. If you run this command inside the folder it should compile the docx with the PDF image at the wrong size:
What's strange is the PDF images are the correct size when I output a PDF file instead of a DOCX.
@richarddavis In this case the
@jgm This is working great with PDF images created by Illustrator. I also tried it with some other PDF images that I created using Figma, but those had the same issue. I was able to get these PDF sizes recognized by ignoring all of the whitespace in the /MediaBox command. You can see the changes I made in this pull request. Everything is working great on my end now, thanks a bunch!
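For illustration, a whitespace-tolerant match could look like the Python sketch below. This is not the code from the pull request; the name `mediabox_size_loose` and the layout variants are hypothetical examples, not captured Figma or Illustrator output.

```python
import re
from typing import Optional, Tuple

NUM = rb"[-+]?(?:\d+\.?\d*|\.\d+)"
# Any amount of whitespace (including newlines) is allowed around the
# bracket and between the four numbers.
MEDIABOX_ANY_WS = re.compile(
    rb"/MediaBox\s*\[\s*(" + NUM + rb")\s+(" + NUM + rb")\s+("
    + NUM + rb")\s+(" + NUM + rb")\s*\]"
)


def mediabox_size_loose(pdf_bytes: bytes) -> Optional[Tuple[float, float]]:
    """Return (width, height) in points from the first MediaBox, or None."""
    match = MEDIABOX_ANY_WS.search(pdf_bytes)
    if match is None:
        return None
    x1, y1, x2, y2 = (float(g.decode("ascii")) for g in match.groups())
    return (x2 - x1, y2 - y1)


# Hypothetical layouts that different PDF writers might emit.
variants = [
    b"/MediaBox [0 0 612 792]",
    b"/MediaBox[0 0 612 792]",
    b"/MediaBox [ 0 0 612 792 ]",
    b"/MediaBox\n[0\n0\n612\n792]",
]
for variant in variants:
    print(mediabox_size_loose(variant))
```

All four variants parse to the same size, whereas a pattern that expects exactly one space after `/MediaBox` would miss the last three.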
PDF images in a markdown -> docx workflow using pandoc 1.19.2 end up with the wrong image size, unfortunately even with the wrong aspect ratio. This issue seems closely related to #2720 and #3798.
Assuming there is a PNG and a PDF version of the same image, this simple Markdown file is enough to reproduce the issue:
Running this with `pandoc test.md -t docx -o test.docx` produces a Word document in which the PNG version has the correct aspect ratio and scale, but the PDF version is distorted. (Actually, the PDF version has the default image size of 300/200 defined in ImageSize.hs.)
I assume this is due to the fact that the image size detection is not yet implemented for PDF images.
As I couldn't find a workaround so far, I'd be happy for any hint in improving the current situation which requires manual correction of all image sizes.