Embed Base64-Encoded Images Inline #99
Codecov Report
@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   66.40%   66.83%   +0.42%
==========================================
  Files          23       24       +1
  Lines        2566     2593      +27
==========================================
+ Hits         1704     1733      +29
+ Misses        862      860       -2
What happens today for a PNG? Adobe goes with the alternative you mention in the issue (saving it as a file, and pointing to that instead), which seems to work fairly well.
Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will support a wide variety of images.
Seems reasonable to me.
Yes, this approach can support a wide variety of image types. I like the fact that hOCR with embedded images is more portable than hOCR with multiple images.
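To make the MIME type point concrete, here is a minimal sketch of how an image could be embedded as a base64 data URI. The names `MIME_TYPES`, `to_data_uri`, and `to_img_tag` are hypothetical and are not taken from this PR; the sketch only illustrates why adding a format comes down to supplying its MIME type.

```python
import base64

# Hypothetical mapping, not taken from this PR. Supporting another
# format would only require adding its MIME type here.
MIME_TYPES = {
    "jpeg": "image/jpeg",
    "bmp": "image/bmp",
    "png": "image/png",
}


def to_data_uri(image_bytes: bytes, image_format: str) -> str:
    """Encode raw image bytes as a base64 data URI."""
    mime = MIME_TYPES[image_format]
    payload = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{payload}"


def to_img_tag(image_bytes: bytes, image_format: str) -> str:
    """Wrap the data URI in an <img> element for inline embedding."""
    return f'<img src="{to_data_uri(image_bytes, image_format)}" />'
```

The output file then carries its images with it, at the cost of roughly a third more bytes per image from the base64 encoding.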
I don't know about portable vs not portable, but I think embedding them seems just fine. I assume I could download the embedded image as a file anyways, right?
@lukehsiao Yes, you can open an exported hOCR file and save an embedded image as a separate file.
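The same can be done programmatically. The following is a rough sketch, assuming the images are embedded as `<img>` elements whose `src` is a base64 data URI; `extract_embedded_images` is a hypothetical helper and not part of pdftotree.

```python
import base64
from html.parser import HTMLParser


class _ImgCollector(HTMLParser):
    """Collect (mime_type, raw_bytes) for every base64 data-URI <img>."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        # Skip images that point to external files instead of inline data.
        if not src.startswith("data:") or ";base64," not in src:
            return
        header, payload = src.split(",", 1)
        mime = header[len("data:"):].split(";", 1)[0]
        self.images.append((mime, base64.b64decode(payload)))


def extract_embedded_images(hocr_text: str):
    """Return a list of (mime_type, image_bytes) found in the hOCR text."""
    collector = _ImgCollector()
    collector.feed(hocr_text)
    collector.close()
    return collector.images
```

Each returned pair can be written to disk with `open(path, "wb")`, choosing the file extension from the MIME type.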
Sounds great. Feel free to merge whenever you'd like.
For future reference, this PR relies on pdfminer.six for image type detection, which sometimes cannot detect the image type.
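One possible fallback, which this PR does not implement, would be to inspect the leading bytes of the image data for a well-known file signature. The sketch below uses a hypothetical helper, `sniff_image_mime`, and covers only a few formats; it returns `None` when no signature matches, which may well happen for image streams stored without a file header.

```python
from typing import Optional

# Well-known leading byte signatures ("magic numbers") for common formats.
_SIGNATURES = (
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"BM", "image/bmp"),
)


def sniff_image_mime(data: bytes) -> Optional[str]:
    """Guess the MIME type from the leading bytes, or None if unknown."""
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    return None
```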
Description of the problems or issues
Is your pull request related to a problem? Please describe.
See #88
Does your pull request fix any issue?
Close #88
Description of the proposed changes
Embed base64-encoded images inline. Initial support covers JPEG and BMP.
Test plan
Run pdftotree on PDFs and check that JPEG and BMP images are extracted.
Checklist