Add spaces to page.extract_text() output concatenation #1830

Closed
damasch opened this issue May 4, 2023 · 3 comments

damasch commented May 4, 2023

When the method page.extract_text() is used, the extracted text is missing whitespace between text fragments.

Actual output from the sample.pdf:

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.

Whitespace is missing at the marked positions:

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.
               ^                ^

Expecting:

Text Formatting 
Inline formatting 
Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Or, at minimum:

Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Environment

$ python -m platform
macOS-13.2.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.0.1

Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")

# Extract and print the text of every page
for page in reader.pages:
    text = page.extract_text()
    print(text)

Sample

sample.pdf

Solution

In the following lines:

output += text

output += text

output += text

output += text # .translate(cmap)

output += text

output += text

output += text # just in case of

the text is concatenated without any whitespace:

output += text 

If you change the code to this:

output += " " + text 

The new output will be:

      Text Formatting   Inline formatting   Here, we demonstrate various types of inline text formatting and the use of    embedded fonts.

Maybe also change the return value from:

return output

to:

return output.strip()

The text result will be:

Text Formatting   Inline formatting   Here, we demonstrate various types of inline text formatting and the use of    embedded fonts.

There is still some unnecessary whitespace, but the result is better than before.
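
The leftover runs of spaces could be collapsed after extraction. This is only a minimal sketch, assuming that post-processing the returned string is acceptable; collapse_spaces is a hypothetical helper and not part of the library. It removes extra spaces but cannot restore spaces that are missing.

```python
import re


def collapse_spaces(text: str) -> str:
    """Collapse runs of spaces/tabs to one space and trim each line."""
    lines = text.splitlines()
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip() for line in lines)


# Sample string shaped like the output shown above
raw = "      Text Formatting   Inline formatting   Here, we demonstrate"
print(collapse_spaces(raw))
```

Line breaks are kept as they are, so multi-line output keeps its structure.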

pubpub-zz commented May 20, 2023

The modification is not so simple, because it will add extra spaces in numerous documents. The criterion for adding a space should rely on a text width calculation.
For information, a benchmark measures the quality of the text extraction:
https://github.com/py-pdf/benchmarks
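
To illustrate what a width-based criterion could look like, here is a rough sketch. It is not the library's implementation: the Fragment type, the join_fragments function, and the space_width and ratio parameters are all hypothetical, and it assumes each fragment on a line is known by its left x position and rendered width.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment: string, left x position, rendered width."""
    text: str
    x: float
    width: float


def join_fragments(fragments, space_width=2.5, ratio=0.5):
    """Join fragments of one line, adding a space only where the gap is wide."""
    out = []
    prev_end = None
    for frag in sorted(fragments, key=lambda f: f.x):
        if prev_end is not None:
            gap = frag.x - prev_end
            # Skip the extra space if one is already present at the boundary
            already_spaced = out[-1].endswith(" ") or frag.text.startswith(" ")
            if gap > ratio * space_width and not already_spaced:
                out.append(" ")
        out.append(frag.text)
        prev_end = frag.x + frag.width
    return "".join(out)


frags = [
    Fragment("Text", 0, 20),
    Fragment("Format", 23, 30),  # gap of 3 units: treated as a word break
    Fragment("ting", 53, 18),    # no gap: same word, no space added
]
print(join_fragments(frags))
```

With a fixed `" " + text` the word "Formatting" in this example would be split in two, which is the kind of extra space the comment above warns about.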

pubpub-zz commented

Whitespace is addressed in issue #1507.
I will add the test to group all the possible tests.

pubpub-zz commented

@MartinThoma
I propose closing this issue as not planned.

pubpub-zz closed this as not planned on Jun 11, 2023