Add spaces to page.extract_text() output concatenation #1830
Comments
The modification is not so simple, because it will add extra spaces in numerous documents. The criterion for adding a space should rely on a text-width calculation.
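To illustrate what a width-based criterion could look like, here is a minimal sketch. It is not pypdf's implementation: the `Fragment` class, the `join_fragments` function, and the `threshold` parameter are hypothetical names made up for this example, and it assumes each fragment's horizontal start and end positions and the width of a space in the current font are already known.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its horizontal extent (not a pypdf type)."""
    text: str
    x_start: float
    x_end: float


def join_fragments(fragments, space_width, threshold=0.5):
    """Join fragments of one line, adding a space only when the gap is wide enough.

    `threshold` is an assumed tuning factor: a gap of at least
    threshold * space_width is treated as a word boundary.
    """
    pieces = []
    previous = None
    for frag in fragments:
        if previous is not None:
            gap = frag.x_start - previous.x_end
            # Do not add a space if one of the fragments already carries it.
            already_spaced = previous.text.endswith(" ") or frag.text.startswith(" ")
            if gap >= threshold * space_width and not already_spaced:
                pieces.append(" ")
        pieces.append(frag.text)
        previous = frag
    return "".join(pieces)


line = [
    Fragment("Hello", 0.0, 25.0),
    Fragment("World", 28.0, 55.0),  # gap of 3.0 is a word boundary
    Fragment("!", 55.2, 58.0),      # gap of 0.2 is just glyph spacing
]
print(join_fragments(line, space_width=5.0))  # Hello World!
```

With this kind of check, fragments that merely continue a word are joined directly, and only gaps comparable to a space width produce a space. A suitable threshold would likely depend on the font and the document.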
Whitespace is addressed in issue #1507.
@MartinThoma
If the method page.extract_text() is used, the extracted text has no whitespace.
Actual output from sample.pdf:
Whitespace is missing.
Expected:
Or, at minimum:
Expected:
Environment
$ python -m platform
macOS-13.2.1-x86_64-i386-64bit
$ python -c "import pypdf;print(pypdf.__version__)"
3.0.1
Code + PDF
Sample
sample.pdf
Solution
In the following lines of pypdf/pypdf/_page.py (commit 23d81ff):
Line 1658
Line 1664
Line 1696
Line 1720
Line 1833
Line 1854
Line 1868
the text is concatenated without whitespace.
If you change the code into this:
The new output will be:
Maybe change the return value to:
pypdf/pypdf/_page.py
Line 1871 in 23d81ff
The text result will be:
There are already some unnecessary whitespaces, but the result is better than before.
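The trade-off described above can be reproduced with plain strings. This is only an illustration with made-up fragments, not the code from _page.py; `concat_plain` and `concat_spaced` are hypothetical helper names.

```python
def concat_plain(fragments):
    """Join fragments directly, as in concatenation without whitespace."""
    return "".join(fragments)


def concat_spaced(fragments):
    """Join fragments with a single space between each pair, unconditionally."""
    return " ".join(fragments)


# Made-up fragments; one of them already ends with a space.
words = ["The", "quick ", "brown", "fox"]
print(repr(concat_plain(words)))   # 'Thequick brownfox'
print(repr(concat_spaced(words)))  # 'The quick  brown fox'

# A word that happens to be split into two fragments gets broken apart.
split_word = ["Hel", "lo"]
print(repr(concat_spaced(split_word)))  # 'Hel lo'
```

Joining without a separator loses word boundaries, while always adding a space produces doubled spaces and splits words that are stored as several fragments. That is why an unconditional space improves some documents and degrades others.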