Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PI: Avoid string concatenation with large embedded base64-encoded images #1350

Merged

Conversation

mergezalot
Copy link
Contributor

Certain PDF libraries do embed images as base64 strings. This causes performance issues in read_string_from_stream due to incremental string concatenation, byte by byte.

PDF Lib in our case is

<xmp:CreatorTool>Canon iR-ADV C256  PDF</xmp:CreatorTool>
<pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer>

@mergezalot
Copy link
Contributor Author

The correct behaviour is tested by existing unit tests. I am not sure if you want performance tests on large files in your code base or not. Runtime on a 4mb PDF with a single embedded base64 image was several minutes with the old code, whereas with the new code it is roughly a second.

Sadly i did not yet manage to create a sample PDF yet, and the one PDF i have contains sensitive data.

@mergezalot mergezalot force-pushed the fix-read_string_from_stream-performance branch from 01c1956 to a41c497 Compare September 16, 2022 11:42
Certain PDF libraries do embed images as base64 strings. This causes performance issues
in `read_string_from_stream` due to incremental string concatenation, byte by byte.

PDF Lib in our case is
```
<xmp:CreatorTool>Canon iR-ADV C256  PDF</xmp:CreatorTool>
<pdf:Producer>PDF Annotator 8.0.0.826 [Adobe PSL 1.3e for Canon</pdf:Producer>

```
@mergezalot mergezalot force-pushed the fix-read_string_from_stream-performance branch from a41c497 to 28306de Compare September 16, 2022 11:43
@codecov
Copy link

codecov bot commented Sep 16, 2022

Codecov Report

Base: 94.63% // Head: 94.63% // No change to project coverage 👍

Coverage data is based on head (28306de) compared to base (7c96d13).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1350   +/-   ##
=======================================
  Coverage   94.63%   94.63%           
=======================================
  Files          30       30           
  Lines        5140     5140           
  Branches     1058     1058           
=======================================
  Hits         4864     4864           
  Misses        164      164           
  Partials      112      112           
Impacted Files Coverage Δ
PyPDF2/generic/_utils.py 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@mergezalot mergezalot changed the title Fix performance issues with large embedded base64 images BUG: fix performance issues with large embedded base64 images Sep 16, 2022
@mergezalot mergezalot changed the title BUG: fix performance issues with large embedded base64 images BUG: fix performance issues with large embedded base64 Sep 16, 2022
@MartinThoma
Copy link
Member

Thank you for the PR ❤️

@MartinThoma
Copy link
Member

I am not sure if you want performance tests on large files in your code base or not.

Yes, I want that! We have the https://github.com/py-pdf/sample-files repository included as a git submodule to ensure the PyPDF2 repository itself remains small (that is important if people install directly from source).

I also want tests that are expected to be slow to be marked with a pytest marker. Currently we only have external (code), but we could add a slow marker as well.

Additionally, we have a benchmark over several PDF libraries: https://github.com/py-pdf/benchmarks
It might make sense to add a few "special" cases there (PRs are welcome :-) )

@MartinThoma MartinThoma changed the title BUG: fix performance issues with large embedded base64 PERF: Avoid repeated string concatenation with large embedded base64 images Sep 17, 2022
@MartinThoma MartinThoma changed the title PERF: Avoid repeated string concatenation with large embedded base64 images PERF: Avoid string concatenation with large embedded base64-encoded images Sep 17, 2022
@MartinThoma
Copy link
Member

https://stackoverflow.com/a/65019934/562769 is a micro-benchmark that essentially shows this improvement.

@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Sep 17, 2022
@MartinThoma MartinThoma changed the title PERF: Avoid string concatenation with large embedded base64-encoded images PI: Avoid string concatenation with large embedded base64-encoded images Sep 17, 2022
@MartinThoma MartinThoma merged commit 3be01fd into py-pdf:main Sep 17, 2022
@MartinThoma
Copy link
Member

Thank you! It is merged to main and will be part of the next release to PyPI (likely today or tomorrow)

@mergezalot
Copy link
Contributor Author

Thank you for the speedy merge @MartinThoma. We did not yet manage to create offending testdata, but will give it another try.

MartinThoma added a commit that referenced this pull request Sep 18, 2022
New Features (ENH):
-  Add rotation property and transfer_rotate_to_content (#1348)

Performance Improvements (PI):
-  Avoid string concatenation with large embedded base64-encoded images (#1350)

Bug Fixes (BUG):
-  Format floats using their intrinsic decimal precision (#1267)

Robustness (ROB):
-  Fix merge_page for pages without resources (#1349)

Full Changelog: 2.10.8...2.10.9
@mergezalot
Copy link
Contributor Author

Note to self: found out how the original PDF with embedded base64 image was created, will file a test later.
base64image.pdf

mergezalot added a commit to mergezalot/PyPDF2 that referenced this pull request Sep 20, 2022
There is a saftey margin of a factor of 10 in both directions,
so the test should be fairly stable. Tests py-pdf#1350.
MartinThoma added a commit to py-pdf/sample-files that referenced this pull request Sep 24, 2022
Source: py-pdf/pypdf#1350 (comment)

Co-authored-by: Michael Karlen <michael.karlen@gmail.com>
@MartinThoma
Copy link
Member

I've added the example file: https://github.com/py-pdf/sample-files

MartinThoma pushed a commit that referenced this pull request Sep 25, 2022
There is a saftey margin of a factor of 10 in both directions,
so the test should be fairly stable.

Tests #1350.

Co-authored-by: Michael Karlen <michael.karlen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nf-performance Non-functional change: Performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants