BUG: Fix incorrect tm_matrix in call to visitor_text #2060

Merged into py-pdf:main on Aug 13, 2023 (4 commits)


@troethe
Contributor

troethe commented Aug 3, 2023

Supply the old tm_matrix when flushing out text to the visitor_text
in crlf_space_check. The new one might already have been changed and
be unrelated to the current text.

Also add a test for the tm_matrix and cm_matrix that are given to
visitor_text when extracting text.
The test computes the coordinates of three letters in different
parts of a test page based on the matrices and checks whether they are
roughly where they should be.

Fixes #2059
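
To illustrate the kind of check the test performs, here is a minimal sketch of a visitor that turns the two matrices into page coordinates. It assumes both matrices arrive as 6-element PDF affine matrices `[a, b, c, d, e, f]` and that the position is the translation part of tm multiplied by cm. The helper names `mult`, `make_collector` and `roughly_at` are hypothetical and only serve this example; the actual test in the PR may look different.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[float]  # PDF affine matrix [a, b, c, d, e, f]


def mult(m: Matrix, n: Matrix) -> List[float]:
    """Return m x n for two PDF affine matrices (m is applied first, then n)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


def make_collector(found: List[Tuple[str, float, float]]) -> Callable[..., None]:
    """Build a visitor that records (text, x, y) for every non-blank chunk."""

    def visitor_text(text, cm, tm, font_dict, font_size) -> None:
        if not text.strip():
            return  # skip whitespace-only chunks
        combined = mult(tm, cm)  # text space -> user space (assumption, see lead-in)
        found.append((text, combined[4], combined[5]))

    return visitor_text


def roughly_at(pos: Tuple[float, float], expected: Tuple[float, float],
               tol: float = 2.0) -> bool:
    """True if pos lies within tol units of expected on both axes."""
    return abs(pos[0] - expected[0]) <= tol and abs(pos[1] - expected[1]) <= tol


# Usage with pypdf (not executed here):
#   found = []
#   page.extract_text(visitor_text=make_collector(found))
#   then compare the recorded positions with roughly_at(...)
```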

Supply the old tm_matrix when flushing out `text` to the `visitor_text`
in `crlf_space_check`.
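
The idea behind this commit can be shown with a toy model. The sketch below is not pypdf's `crlf_space_check`; the class `TextFlusher` and its methods are hypothetical. It only demonstrates the principle: buffered text is reported together with the matrix that was active when the text was collected, not with the matrix that triggered the flush.

```python
from typing import Callable, List, Optional

Matrix = List[float]  # PDF affine matrix [a, b, c, d, e, f]


class TextFlusher:
    """Toy model: buffer text and flush it with the matrix it was shown under."""

    def __init__(self, visitor: Callable[[str, Matrix], None]) -> None:
        self.visitor = visitor
        self.buffer = ""
        self.buffer_tm: Optional[Matrix] = None

    def show_text(self, text: str, current_tm: Matrix) -> None:
        if not self.buffer:
            # Remember the matrix that belongs to the buffered text.
            self.buffer_tm = list(current_tm)
        self.buffer += text

    def move_to(self, new_tm: Matrix) -> None:
        # The text matrix is about to change: flush what we have first.
        # new_tm is deliberately NOT passed on; it is unrelated to the old text.
        self.flush()

    def flush(self) -> None:
        if self.buffer and self.buffer_tm is not None:
            self.visitor(self.buffer, self.buffer_tm)
        self.buffer = ""
        self.buffer_tm = None


calls = []
flusher = TextFlusher(lambda text, tm: calls.append((text, tm)))
flusher.show_text("Hello", [1, 0, 0, 1, 10, 700])
flusher.move_to([1, 0, 0, 1, 10, 500])
flusher.show_text("World", [1, 0, 0, 1, 10, 500])
flusher.flush()
```

After this runs, `calls` pairs "Hello" with the matrix at y = 700 and "World" with the matrix at y = 500. Passing `new_tm` to the visitor in `move_to` would reproduce the kind of mismatch described above.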
@codecov

codecov bot commented Aug 3, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.10% 🎉

Comparison is base (c04a6bb) 94.17% compared to head (b7544be) 94.28%.
Report is 16 commits behind head on main.

❗ Current head b7544be differs from pull request most recent head 2e42262. Consider uploading reports for the commit 2e42262 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2060      +/-   ##
==========================================
+ Coverage   94.17%   94.28%   +0.10%     
==========================================
  Files          41       41              
  Lines        7333     7365      +32     
  Branches     1442     1451       +9     
==========================================
+ Hits         6906     6944      +38     
+ Misses        266      262       -4     
+ Partials      161      159       -2     
| Files Changed | Coverage Δ |
| --- | --- |
| pypdf/_page.py | 93.65% <100.00%> (+0.04%) ⬆️ |
| pypdf/_text_extraction/__init__.py | 93.27% <100.00%> (+0.05%) ⬆️ |

... and 3 files with indirect coverage changes


@MartinThoma changed the title from "Fix incorrect tm_matrix in call to visitor_text" to "BUG: Fix incorrect tm_matrix in call to visitor_text" on Aug 3, 2023
@pubpub-zz
Collaborator

@troethe
Can you complete your PR with a test (reusing your PDF is a good idea) to ensure the issue does not come back?

Check the coordinates computed from the tm_matrix and cm_matrix
that are given to `visitor_text` when extracting text.
The test computes the coordinates of three letters in different
parts of a test page based on the matrices given to the `visitor_text`
function and checks whether they are roughly where they should be.
@troethe
Contributor Author

troethe commented Aug 7, 2023

@pubpub-zz : All right, I added a new test on a simplified test document where the correct positioning is a bit more apparent. Let me know if you have any feedback.

Edit: Also note that my original PR was wrong, because I passed the tm_prev matrix as tm_matrix to the visitor, even though tm_prev wasn't actually the tm_matrix but tm_matrix * cm_matrix. I fixed this and updated the PR.
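
To make the difference concrete, here is a small numeric sketch. It assumes 6-element PDF affine matrices and a visitor that combines tm and cm itself; the helper `mult` and the sample values are made up for illustration. If the visitor receives the already combined matrix in place of tm, cm ends up being applied twice.

```python
from typing import List, Sequence

Matrix = Sequence[float]  # PDF affine matrix [a, b, c, d, e, f]


def mult(m: Matrix, n: Matrix) -> List[float]:
    """Return m x n for two PDF affine matrices (m is applied first, then n)."""
    return [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]


tm = [1, 0, 0, 1, 100, 700]   # text matrix: translate to (100, 700)
cm = [0.5, 0, 0, 0.5, 0, 0]   # current transformation matrix: scale by 0.5

tm_prev = mult(tm, cm)        # what was wrongly handed over as "tm_matrix"

# A visitor that is given the real tm and combines it with cm:
correct = mult(tm, cm)
correct_pos = (correct[4], correct[5])

# The same visitor when it is handed tm_prev instead of tm: cm is applied twice.
double_applied = mult(tm_prev, cm)
wrong_pos = (double_applied[4], double_applied[5])
```

With these sample values `correct_pos` is (50.0, 350.0) while `wrong_pos` is (25.0, 175.0), so the error only shows up on pages whose cm is not the identity matrix.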

@MartinThoma
Member

@troethe From my side we are good to go :-) Please adjust the top comment of the PR in such a way that it can be the commit message of the squash-merge. For example:

Supply the old tm_matrix when flushing out `text` to the `visitor_text`
in `crlf_space_check`.

Check the coordinates computed from the tm_matrix and cm_matrix
that are given to `visitor_text` when extracting text.
The test computes the coordinates of three letters in different
parts of a test page based on the matrices given to the `visitor_text`
function and checks whether they are roughly where they should be.

Fixes #2059

One last question:

> A preliminary fix to #2059

What do you mean by "preliminary"? Doesn't this fix the mentioned issue?

@MartinThoma added the "soon" label (PRs that are almost ready to be merged, issues that get solved pretty soon) on Aug 13, 2023
@MartinThoma
Member

@pubpub-zz I guess this PR is fine from your perspective?

@troethe
Contributor Author

troethe commented Aug 13, 2023

> What do you mean by "preliminary"? Doesn't this fix the mentioned issue?

Yes, it does, but when I originally submitted this, I didn't quite understand all of the code yet and I had not thoroughly tested it, as shown by the fact that it ended up not being quite right. I guess I should have called it a WIP/draft instead.

@MartinThoma
Member

Ok, then please let me know when you think I can merge :-)

troethe added a commit to troethe/pypdf that referenced this pull request Aug 13, 2023
Supply the old tm_matrix when flushing out `text` to the `visitor_text`
in `crlf_space_check`. The new one might already be changed and
unrelated to the current text.

Also add a test for the tm_matrix and cm_matrix that are given to
`visitor_text` when extracting text.
The test computes the coordinates of three letters in different
parts of a test page based on the matrices and checks whether they are
roughly where they should be.

Fixes py-pdf#2059
@troethe force-pushed the main branch 2 times, most recently from 7c4c1c4 to 2e42262, on August 13, 2023 09:22
@troethe
Contributor Author

troethe commented Aug 13, 2023

@MartinThoma Ok, I think it's good now!
Sorry for those last unnecessary pushes. I misread and thought you wanted me to update the top commit's message instead of the PR's top comment.

@MartinThoma MartinThoma merged commit 4458dc6 into py-pdf:main Aug 13, 2023
26 checks passed
@MartinThoma
Member

No problem :-)

Thank you for the contribution 🙏 I will make a release that contains it today

@MartinThoma
Member

If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

MartinThoma added a commit that referenced this pull request Aug 13, 2023
## What's new

### Performance Improvements (PI)
-  optimize _decode_png_prediction (#2068)

### Bug Fixes (BUG)
-  Fix incorrect tm_matrix in call to visitor_text (#2060)
-  Writing German characters into form fields (#2047)
-  Prevent stall when accessing image in corrupted pdf (#2081)
-  append() fails when articles do not have /T (#2080)

### Robustness (ROB)
-  Cope with xref not followed by separator (#2083)

[Full Changelog](3.15.0...3.15.1)
@troethe
Contributor Author

troethe commented Aug 13, 2023

> If you want, I can add you to https://pypdf.readthedocs.io/en/latest/meta/CONTRIBUTORS.html :-)

Sure, why not. :) Thank you!

@MartinThoma removed the "soon" label (PRs that are almost ready to be merged, issues that get solved pretty soon) on Sep 18, 2023
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Sep 19, 2023
Development

Successfully merging this pull request may close these issues.

BUG: Incorrect text matrix passed to visitor_text in page.extract_text
3 participants