Embed Base64-Encoded Images Inline #99

HiromuHota · 2020-10-22T05:32:40Z

Description of the problems or issues

Is your pull request related to a problem? Please describe.

See #88

Does your pull request fix any issue.

Close #88

Description of the proposed changes

Embed base64-encoded images inline. Support starting with JPEG and BMP.

Test plan

Apply pdftotree to pdfs and see if JPEG and BMP images are extracted.

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.
I have updated the CHANGELOG.md accordingly.

codecov-io · 2020-10-22T20:16:59Z

Codecov Report

Merging #99 into master will increase coverage by 0.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #99      +/-   ##
==========================================
+ Coverage   66.40%   66.83%   +0.42%     
==========================================
  Files          23       24       +1     
  Lines        2566     2593      +27     
==========================================
+ Hits         1704     1733      +29     
+ Misses        862      860       -2

Flag	Coverage Δ
#unittests	`66.83% <100.00%> (+0.42%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
pdftotree/TreeExtract.py	`89.53% <100.00%> (+0.79%)`	⬆️
pdftotree/ml/features.py	`64.58% <100.00%> (ø)`
pdftotree/utils/bbox_utils.py	`21.91% <100.00%> (+4.52%)`	⬆️
pdftotree/utils/pdf/node.py	`45.39% <100.00%> (+0.39%)`	⬆️
pdftotree/utils/pdf/pdf_parsers.py	`92.61% <100.00%> (+0.11%)`	⬆️
tests/test_figures.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d4e5dc8...1cfa226. Read the comment docs.

lukehsiao · 2020-10-23T03:08:23Z

What happens today for a PNG?

Adobe goes with the alternative you mention in the issue (saving it as a file, and pointing to that instead), which seems to work fairly well.

lukehsiao · 2020-10-23T03:12:24Z

Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will be support a wide variety of images.

lukehsiao

Seems reasonable to me.

HiromuHota · 2020-10-23T03:35:57Z

Ah, I see now that it looks like you can just specify the MIME type, so we should be able to support a wide variety of image formats this way, right? Basically, I want to check that this approach will be support a wide variety of images.

Yes, this approach can support a wide variety of image types.
https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Data_URIs
In fact, JPEG and BMP are just an initial set of supported types. We will add PNG and other types too when we have such an example PDF containing different types of image.

I like the fact that hOCR with embedded images is more portable than hOCR with multiple images.
Please let me know your thoughts and experiences.

lukehsiao · 2020-10-23T08:06:59Z

I like the fact that hOCR with embedded images is more portable than hOCR with multiple images.
Please let me know your thoughts and experiences.

I don't know about portable vs not portable, but I think embedding them seems just fine. I assume I could download the embedded image as a file anyways, right?

HiromuHota · 2020-10-23T21:26:14Z

@lukehsiao Yes, you can open an exported hocr and save an embedded image as a separate file.
In the future, pdftotree might have options like --embed-image <0|1> (Default: 1) as in pdf2htmlEX.
https://github.com/coolwanglu/pdf2htmlEX/wiki/Command-Line-Options
With this, you can control whether images are embedded or saved as a separate file.

lukehsiao · 2020-10-23T21:29:11Z

Sounds great. Feel free to merge whenever you'd like.

HiromuHota · 2020-10-23T23:11:54Z

For future reference, this PR relies on pdfminer.six for image type detection, which cannot detect an image type sometimes.
This PR embeds images only when their image types are detected and does not embed the image of unknown type.
To detect an image type on broader cases, we have to

Enhance and contribute to pdfminer.six, or
Use a different library or implement it on our own.

Hiromu Hota added 8 commits October 22, 2020 12:47

Add type hints

572c071

Remove unused arguments from merge_nodes()

75bc8d6

Change args of get_figures()

c6eaa18

Remove unused return value from merge_nodes()

c5be519

Add type hints and docstrings

1bed2d5

Refactor

ee1f41a

Embed img if it is jpeg

37eead8

Add a test

2c91583

HiromuHota force-pushed the fix/88 branch from 519f38c to 2c91583 Compare October 22, 2020 19:47

Hiromu Hota added 2 commits October 22, 2020 14:17

Support BMP images

7356e89

Update CHANGELOG

1cfa226

HiromuHota requested a review from lukehsiao October 22, 2020 22:02

HiromuHota added the enhancement label Oct 22, 2020

HiromuHota marked this pull request as ready for review October 22, 2020 22:30

lukehsiao approved these changes Oct 23, 2020

View reviewed changes

HiromuHota merged commit c7e38be into HazyResearch:master Oct 26, 2020

HiromuHota deleted the fix/88 branch October 26, 2020 16:20

HiromuHota mentioned this pull request Oct 26, 2020

unable to read images in the pdf file HazyResearch/fonduer#525

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Embed Base64-Encoded Images Inline #99

Embed Base64-Encoded Images Inline #99

HiromuHota commented Oct 22, 2020 •

edited

Loading

codecov-io commented Oct 22, 2020 •

edited

Loading

lukehsiao commented Oct 23, 2020 •

edited

Loading

lukehsiao commented Oct 23, 2020 •

edited

Loading

lukehsiao left a comment

HiromuHota commented Oct 23, 2020 •

edited

Loading

lukehsiao commented Oct 23, 2020

HiromuHota commented Oct 23, 2020

lukehsiao commented Oct 23, 2020

HiromuHota commented Oct 23, 2020

Embed Base64-Encoded Images Inline #99

Embed Base64-Encoded Images Inline #99

Conversation

HiromuHota commented Oct 22, 2020 • edited Loading

Description of the problems or issues

Description of the proposed changes

Test plan

Checklist

codecov-io commented Oct 22, 2020 • edited Loading

Codecov Report

lukehsiao commented Oct 23, 2020 • edited Loading

lukehsiao commented Oct 23, 2020 • edited Loading

lukehsiao left a comment

Choose a reason for hiding this comment

HiromuHota commented Oct 23, 2020 • edited Loading

lukehsiao commented Oct 23, 2020

HiromuHota commented Oct 23, 2020

lukehsiao commented Oct 23, 2020

HiromuHota commented Oct 23, 2020

HiromuHota commented Oct 22, 2020 •

edited

Loading

codecov-io commented Oct 22, 2020 •

edited

Loading

lukehsiao commented Oct 23, 2020 •

edited

Loading

lukehsiao commented Oct 23, 2020 •

edited

Loading

HiromuHota commented Oct 23, 2020 •

edited

Loading