Missing spaces in extract_text() method #1328

Closed
Sunguru opened this issue Sep 6, 2022 · 3 comments · Fixed by #2868
Labels
is-bug: From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a users perspective, text extraction is the affected feature/workflow

Comments

Sunguru commented Sep 6, 2022

Missing spaces in extract_text() method.
See attached PDFs.
The text itself is extracted correctly, but the spaces are missing in almost all fields.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted from Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.

0000.pdf

Yes, you may add these PDFs to the tests. They are public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!

MartinThoma added the workflow-text-extraction label Sep 6, 2022
MartinThoma added the is-bug label Sep 24, 2022
MartinThoma added the whitespace label Jan 14, 2023

tpcgold commented Sep 11, 2023

Is there any workaround for this so far?
I ran into the exact same issue with pypdf.

@ssjkamei
Contributor

Hi.

The problem seems to be a mismatch between the space size retrieved from the font and the actual gap in the PDF.
In the area in question, the pieces of text in the PDF were -277.75 apart, whereas the size retrieved from the font was 278.0.
In an area with a larger font size that I checked, the gap was also -277.75, but the size retrieved from the font was 361.0.

If the actual value is rounded up, I think it will work.
However, I am not familiar with how PDFs and fonts work, so I cannot tell whether that is the correct approach.

Is this helpful?

Try rounding up abs(float(op)) in this comparison:

(abs(float(op)) >= _space_width)
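To make the numbers above concrete, here is a minimal sketch of that comparison with and without rounding up. The function name `is_space_gap` and the `round_up` flag are hypothetical and only for illustration; this is not pypdf's actual code, only the `abs(float(op)) >= _space_width` check quoted above.

```python
import math


def is_space_gap(op, space_width, round_up=False):
    """Return True if a displacement is at least as large as a space.

    `op` is the displacement value from the content stream and
    `space_width` is the space size retrieved from the font.
    Hypothetical helper, not part of pypdf.
    """
    gap = abs(float(op))
    if round_up:
        # Suggested tweak: round the measured gap up before comparing.
        gap = math.ceil(gap)
    return gap >= space_width


# Values reported in this thread.
print(is_space_gap(-277.75, 278.0))                  # False: 277.75 < 278.0
print(is_space_gap(-277.75, 278.0, round_up=True))   # True: 278 >= 278.0
print(is_space_gap(-277.75, 361.0, round_up=True))   # False: 278 < 361.0
```

Rounding up is enough for the 278.0 case, but not for the area where the font reports 361.0.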

@ssjkamei
Contributor

Sorry, I have one addition.
It seems that the code does not extract space characters; instead, it judges whether two pieces of text are separated by more than the size of a space.
I think that makes it difficult to detect a space set in a small font size between text in larger font sizes.
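As a rough illustration of that behavior, the sketch below joins a list of text fragments and numeric displacements (similar in shape to the operand of a PDF `TJ` operator) and inserts a space only when a displacement reaches the threshold. The function `join_fragments` and the sample data are assumptions made up for this example; they are not taken from the attached PDFs or from pypdf's implementation.

```python
def join_fragments(items, space_width):
    """Join text fragments, adding a space for each large displacement.

    `items` mixes strings (text) and numbers (displacements between
    fragments). No space character exists in the input; a space is only
    inferred when abs(displacement) >= space_width.
    Hypothetical helper for illustration only.
    """
    parts = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif abs(float(item)) >= space_width:
            parts.append(" ")
        # Smaller displacements are treated as kerning and ignored.
    return "".join(parts)


# Made-up fragment list: -15 is kerning, -277.75 is a word gap.
fragments = ["Re", -15, "porting", -277.75, "crude", -277.75, "oil", -277.75, "leak."]

print(join_fragments(fragments, 278.0))   # Reportingcrudeoilleak.
print(join_fragments(fragments, 277.75))  # Reporting crude oil leak.
print(join_fragments(fragments, 361.0))   # Reportingcrudeoilleak.
```

With a threshold taken from a larger font (361.0 here), the same gaps are never recognized as spaces, which matches the difficulty described above.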

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024
pubpub-zz pushed a commit that referenced this issue Sep 24, 2024
* BUG: Missing spaces in extract_text() method (#1328)

* Revert "BUG: Missing spaces in extract_text() method (#1328)"

This reverts commit 5400f5a.

* BUG: Missing spaces in extract_text() method (#1328)

* BUG: Missing spaces in extract_text() method (#1328) add test

* Revert "BUG: Missing spaces in extract_text() method (#1328)"

This reverts commit 5400f5a.

BUG: Missing spaces in extract_text() method (#1328)

BUG: Missing spaces in extract_text() method (#1328) add test

* BUG: Missing spaces in extract_text() method (#1328) Convert font size comparison to ratio

* Correction to new file URL.

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

* BUG: Missing spaces in extract_text() method (#1328) calculation efficiency

* BUG: Missing spaces in extract_text() method (#1328) Simplify the assertion process

---------

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 29, 2024
## Version 5.0.1, 2024-09-29

### New Features (ENH)
- Add `full` parameter to PdfWriter constructor (py-pdf#2865)

### Bug Fixes (BUG)
- Update pyproject.toml with minimum Python version of 3.8 (py-pdf#2859)
- Cope with unbalanced delimiters in dictionary object (py-pdf#2878)
- Cope with encoding with too many differences (py-pdf#2873)
- Missing spaces in extract_text() method (py-pdf#1328) (py-pdf#2868)
- Tolerate truncated files and no warning when jumping startxref (py-pdf#2855)

### Robustness (ROB)
- Repair PDF with invalid Root object (py-pdf#2880)
- Continue parsing dictionary object when error is detected (py-pdf#2872)
- Merge documents with invalid pages in named destinations (py-pdf#2857)
- Tolerate comments(%) in arrays (py-pdf#2856)

### Documentation (DOC), Testing (TST), Code Style (STY), Developer Experience (DEV), Maintenance (MAINT)

- (py-pdf#2844), (py-pdf#2862), (py-pdf#2863), (py-pdf#2847), (py-pdf#2860), (py-pdf#2867), (py-pdf#2874), (py-pdf#2879)

[Full Changelog](py-pdf/pypdf@5.0.0...5.0.1)
@pubpub-zz pubpub-zz mentioned this issue Sep 29, 2024
pubpub-zz added a commit that referenced this issue Sep 29, 2024
## Version 5.0.1, 2024-09-29

### New Features (ENH)
- Add `full` parameter to PdfWriter constructor (#2865)

### Bug Fixes (BUG)
- Update pyproject.toml with minimum Python version of 3.8 (#2859)
- Cope with unbalanced delimiters in dictionary object (#2878)
- Cope with encoding with too many differences (#2873)
- Missing spaces in extract_text() method (#1328) (#2868)
- Tolerate truncated files and no warning when jumping startxref (#2855)

### Robustness (ROB)
- Repair PDF with invalid Root object (#2880)
- Continue parsing dictionary object when error is detected (#2872)
- Merge documents with invalid pages in named destinations (#2857)
- Tolerate comments in arrays (#2856)

### Developer Experience (DEV)
- Use latest Python version for benchmarking (#2879)

### Maintenance (MAINT)
- Add tests to source distributions (#2874)
- Refactor _update_field_annotation (#2862)

[Full Changelog](5.0.0...5.0.1)