Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed Base64-Encoded Images Inline #99

Merged
merged 10 commits into from
Oct 26, 2020
Merged

Conversation

HiromuHota
Copy link
Contributor

@HiromuHota HiromuHota commented Oct 22, 2020

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #88

Does your pull request fix any issue.

Close #88

Description of the proposed changes

Embed base64-encoded images inline. Support starting with JPEG and BMP.

Test plan

Apply pdftotree to pdfs and see if JPEG and BMP images are extracted.

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.md accordingly.

@codecov-io
Copy link

codecov-io commented Oct 22, 2020

Codecov Report

Merging #99 into master will increase coverage by 0.42%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   66.40%   66.83%   +0.42%     
==========================================
  Files          23       24       +1     
  Lines        2566     2593      +27     
==========================================
+ Hits         1704     1733      +29     
+ Misses        862      860       -2     
Flag Coverage Δ
#unittests 66.83% <100.00%> (+0.42%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pdftotree/TreeExtract.py 89.53% <100.00%> (+0.79%) ⬆️
pdftotree/ml/features.py 64.58% <100.00%> (ø)
pdftotree/utils/bbox_utils.py 21.91% <100.00%> (+4.52%) ⬆️
pdftotree/utils/pdf/node.py 45.39% <100.00%> (+0.39%) ⬆️
pdftotree/utils/pdf/pdf_parsers.py 92.61% <100.00%> (+0.11%) ⬆️
tests/test_figures.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4e5dc8...1cfa226. Read the comment docs.

@HiromuHota HiromuHota requested a review from lukehsiao October 22, 2020 22:02
@HiromuHota HiromuHota marked this pull request as ready for review October 22, 2020 22:30
@lukehsiao
Copy link
Contributor

lukehsiao commented Oct 23, 2020

What happens today for a PNG?

Adobe goes with the alternative you mention in the issue (saving it as a file, and pointing to that instead), which seems to work fairly well.

@lukehsiao
Copy link
Contributor

lukehsiao commented Oct 23, 2020

Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will be support a wide variety of images.

Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable to me.

@HiromuHota
Copy link
Contributor Author

HiromuHota commented Oct 23, 2020

Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will be support a wide variety of images.

Yes, this approach can support a wide variety of image types.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs
In fact, JPEG and BMP are just an initial set of supported types. We will add PNG and other types too when we have such an example PDF containing different types of image.

I like the fact that hOCR with embedded images is more portable than hOCR with multiple images.
Please let me know your thoughts and experiences.

@lukehsiao
Copy link
Contributor

I like the fact that hOCR with embedded images is more portable than hOCR with multiple images.
Please let me know your thoughts and experiences.

I don't know about portable vs not portable, but I think embedding them seems just fine. I assume I could download the embedded image as a file anyways, right?

@HiromuHota
Copy link
Contributor Author

@lukehsiao Yes, you can open an exported hocr and save an embedded image as a separate file.
In the future, pdftotree might have options like --embed-image <0|1> (Default: 1) as in pdf2htmlEX.
https://github.com/coolwanglu/pdf2htmlEX/wiki/Command-Line-Options
With this, you can control whether images are embedded or saved as a separate file.

@lukehsiao
Copy link
Contributor

Sounds great. Feel free to merge whenever you'd like.

@HiromuHota
Copy link
Contributor Author

For future reference, this PR relies on pdfminer.six for image type detection, which cannot detect an image type sometimes.
This PR embeds images only when their image types are detected and does not embed the image of unknown type.
To detect an image type on broader cases, we have to

  • Enhance and contribute to pdfminer.six, or
  • Use a different library or implement it on our own.

@HiromuHota HiromuHota merged commit c7e38be into HazyResearch:master Oct 26, 2020
@HiromuHota HiromuHota deleted the fix/88 branch October 26, 2020 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Embed Base64-Encoded Images Inline In HTML?
3 participants