Missing spaces in extract_text() method #1328

Sunguru · 2022-09-06T15:51:16Z

extract_text() returns text with missing spaces.
See the attached PDFs.
The text itself is extracted correctly, but the spaces are missing in almost all fields.

Environment

$ python -c "import pypdf;print(pypdf.__version__)"
pypdf==3.14.0

Code + PDF

PDF: 0004.pdf

from pypdf import PdfReader, __version__

print(f"pypdf=={__version__}")

reader = PdfReader("0004.pdf")

page = reader.pages[0]
extracted = page.extract_text().split("Description:")[1].split("8/11/22")[0]
print(extracted)

gives:

 Reportingcrudeoilleak.
Leakwasisolatedtowell
pad.Segmentoflinewas
immediatelyisolated,now
estimatedat5barrelsofoil
spilt.Rootcausestill
unknownatthistime.

expected (copy-pasted with Google Chrome):

Reporting crude oil leak.
Leak was isolated to well
pad. Segment of line was
immediately isolated, now
estimated at 5 barrels of oil
spilt. Root cause still
unknown at this time.

0000.pdf

Yes, you may add to the tests. It is public data from here: https://northdakota.hazconnect.com/ListIncidentPublic.aspx

P.S. Thank you for the great package!


tpcgold · 2023-09-11T12:41:47Z

Is there any workaround for this so far?
I ran into the exact same issue with pypdf.

ssjkamei · 2024-09-21T05:42:14Z

Hi.

The problem seems to be the difference between the size value retrieved from the font and the actual spacing in the PDF.
In the affected area, the words in the PDF are -277.75 apart, whereas the size retrieved from the font is 278.0.
In the area with the larger font size that I checked, the gap is also -277.75, but the size retrieved from the font is 361.0.

If the actual value is rounded up, I think it will work.
However, I am not familiar with how PDFs and fonts work, so I cannot tell whether that is the correct approach.

Is this helpful?

Try rounding up abs(float(op)).

pypdf/pypdf/_page.py

Line 1992 in 8dd9fcb

(abs(float(op)) >= _space_width)
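To make the numbers above concrete, here is a minimal sketch of the comparison, using only the values quoted in this comment. The helper names are hypothetical and this is not pypdf's code; it only illustrates why -277.75 fails a strict check against 278.0 and what rounding up would change.

```python
import math


def is_space_strict(offset: float, space_width: float) -> bool:
    # Same shape as the quoted line: the gap must reach the full space width.
    return abs(offset) >= space_width


def is_space_rounded_up(offset: float, space_width: float) -> bool:
    # Suggested tweak: round the measured gap up before comparing.
    return math.ceil(abs(offset)) >= space_width


offset = -277.75  # gap reported in the PDF

print(is_space_strict(offset, 278.0))      # False: 277.75 < 278.0
print(is_space_rounded_up(offset, 278.0))  # True: ceil(277.75) == 278
print(is_space_rounded_up(offset, 361.0))  # False: the larger font still fails
```

Rounding up only rescues the case where the gap is a fraction below the width; it does nothing for the 361.0 case.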

ssjkamei · 2024-09-21T06:20:13Z

Sorry, I have an addition.
It seems that the code does not extract spaces directly; instead, it judges whether two pieces of text are separated by more than the width of a space.
I think that makes it difficult to recognize spaces set in a small font size when they sit between text in a larger font size.
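The idea of judging separation rather than extracting space characters can be sketched as follows. This is an illustration only: the array layout mimics a PDF TJ operand (strings interleaved with numeric offsets), the offsets reuse the -277.75 value from the earlier comment, and both the `join_tj` helper and the `ratio` tolerance of 0.9 are assumptions, not what pypdf implements.

```python
from typing import List, Union


def join_tj(items: List[Union[str, float]], space_width: float, ratio: float = 1.0) -> str:
    """Join TJ-style items, emitting a space when a gap reaches ratio * space_width."""
    parts: List[str] = []
    for item in items:
        if isinstance(item, str):
            parts.append(item)
        elif abs(item) >= space_width * ratio:
            # The gap is wide enough to be treated as a word separator.
            parts.append(" ")
    return "".join(parts)


tj = ["Reporting", -277.75, "crude", -277.75, "oil", -277.75, "leak."]

print(join_tj(tj, 278.0))             # Reportingcrudeoilleak.
print(join_tj(tj, 278.0, ratio=0.9))  # Reporting crude oil leak.
print(join_tj(tj, 361.0, ratio=0.9))  # Reportingcrudeoilleak.
```

A tolerance helps when the gap is just under the font's space width, but as the last call shows, a gap measured against a much larger font's width is still missed.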


* BUG: Missing spaces in extract_text() method (#1328)
* Revert "BUG: Missing spaces in extract_text() method (#1328)"

  This reverts commit 5400f5a.
* BUG: Missing spaces in extract_text() method (#1328)
* BUG: Missing spaces in extract_text() method (#1328) add test
* Revert "BUG: Missing spaces in extract_text() method (#1328)"

  This reverts commit 5400f5a. BUG: Missing spaces in extract_text() method (#1328) BUG: Missing spaces in extract_text() method (#1328) add test
* BUG: Missing spaces in extract_text() method (#1328) Convert font size comparison to ratio
* Correction to new file URL.

  Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
* BUG: Missing spaces in extract_text() method (#1328) calculation efficiency
* BUG: Missing spaces in extract_text() method (#1328) Simplify the assertion process

---------

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

## Version 5.0.1, 2024-09-29

### New Features (ENH)
- Add `full` parameter to PdfWriter constructor (py-pdf#2865)

### Bug Fixes (BUG)
- Update pyproject.toml with minimum Python version of 3.8 (py-pdf#2859)
- Cope with unbalanced delimiters in dictionary object (py-pdf#2878)
- Cope with encoding with too many differences (py-pdf#2873)
- Missing spaces in extract_text() method (py-pdf#1328) (py-pdf#2868)
- Tolerate truncated files and no warning when jumping startxref (py-pdf#2855)

### Robustness (ROB)
- Repair PDF with invalid Root object (py-pdf#2880)
- Continue parsing dictionary object when error is detected (py-pdf#2872)
- Merge documents with invalid pages in named destinations (py-pdf#2857)
- Tolerate comments(%) in arrays (py-pdf#2856)

### Documentation (DOC), Testing (TST), Code Style (STY), Developer Experience (DEV), Maintenance (MAINT)
- (py-pdf#2844), (py-pdf#2862), (py-pdf#2863), (py-pdf#2847), (py-pdf#2860), (py-pdf#2867), (py-pdf#2874), (py-pdf#2879)

[Full Changelog](py-pdf/pypdf@5.0.0...5.0.1)

## Version 5.0.1, 2024-09-29

### New Features (ENH)
- Add `full` parameter to PdfWriter constructor (#2865)

### Bug Fixes (BUG)
- Update pyproject.toml with minimum Python version of 3.8 (#2859)
- Cope with unbalanced delimiters in dictionary object (#2878)
- Cope with encoding with too many differences (#2873)
- Missing spaces in extract_text() method (#1328) (#2868)
- Tolerate truncated files and no warning when jumping startxref (#2855)

### Robustness (ROB)
- Repair PDF with invalid Root object (#2880)
- Continue parsing dictionary object when error is detected (#2872)
- Merge documents with invalid pages in named destinations (#2857)
- Tolerate comments in arrays (#2856)

### Developer Experience (DEV)
- Use latest Python version for benchmarking (#2879)

### Maintenance (MAINT)
- Add tests to source distributions (#2874)
- Refactor _update_field_annotation (#2862)

[Full Changelog](5.0.0...5.0.1)

MartinThoma added the workflow-text-extraction label (From a users perspective, text extraction is the affected feature/workflow) Sep 6, 2022

MartinThoma added the is-bug label (From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF) Sep 24, 2022

MartinThoma added the whitespace label (While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.) Jan 14, 2023

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (py-pdf#1328)

5400f5a

ssjkamei mentioned this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (#1328) #2868

Merged

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

Revert "BUG: Missing spaces in extract_text() method (py-pdf#1328)"

aac0436

This reverts commit 5400f5a.

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (py-pdf#1328)

64b1c92

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (py-pdf#1328) add test

70e9b38

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (py-pdf#1328) Convert fo…

f6dcb43

…nt size comparison to ratio

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (py-pdf#1328) calculatio…

2873b9e

…n efficiency

ssjkamei added a commit to r-kamei/pypdf that referenced this issue Sep 24, 2024

BUG: Missing spaces in extract_text() method (py-pdf#1328) Simplify t…

7597704

…he assertion process

pubpub-zz closed this as completed in #2868 Sep 24, 2024

pubpub-zz mentioned this issue Sep 29, 2024

REL: 5.0.1 #2884

Merged
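Since the release notes above list this fix under version 5.0.1, one way to check whether an installed pypdf is new enough is to compare version numbers. The sketch below is a naive comparison under stated assumptions: `includes_fix` is a hypothetical helper, and it only looks at the first three numeric components of the version string, ignoring any pre-release suffix.

```python
import re
from typing import Tuple


def includes_fix(version: str, fixed_in: Tuple[int, int, int] = (5, 0, 1)) -> bool:
    """Return True if `version` is at least `fixed_in` (naive numeric comparison)."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))  # pad short versions such as "5.1"
    return tuple(numbers) >= fixed_in


print(includes_fix("3.14.0"))  # False: the version from the original report
print(includes_fix("5.0.0"))   # False
print(includes_fix("5.0.1"))   # True
```

With pypdf installed, the same check can be run as `includes_fix(pypdf.__version__)`, using the `__version__` attribute shown in the original report.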
