Random whitespaces are inserted when using page.extract_text() #1507

Open
einelson opened this issue Dec 18, 2022 · 18 comments
Labels
whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@einelson

einelson commented Dec 18, 2022

I am trying to extract text from various PDF documents to use in an NLP project. While using page.extractText() random whitespace is appearing in the outputted words when there are no spaces in the pdf document.

Environment

Using VS code and running via command prompt.

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1

Code + PDF

This is a minimal, complete example that shows the issue:

test_doc.pdf
(PDF was generated using default settings in Microsoft word). It looks like this:

image

The code is:

import os

from PyPDF2 import PdfReader, __version__

pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))

print(f"PyPDF2=={__version__}")

text = ""
for page in pdf.pages:
    page_content = page.extract_text()
    text = text + page_content
print(text)

Output

PyPDF2==2.12.1
This is a test document by Ethan Nelson.  
 
Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for 
testing purposes : 341 Maple st Paytonville Maine 45681.  
Anyway, there are random whitespaces here . 
@MartinThoma MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Dec 18, 2022
@MartinThoma
Member

@einelson Thank you for creating an example and sharing the issue!

Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document.

See https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard

@MartinThoma MartinThoma changed the title Random whitespace when using page.extractText() Random whitespace when using page.extract_text() Dec 18, 2022
@MartinThoma
Member

You can decode the PDF using

mutool clean -daf test_doc.pdf test_doc_clean.pdf

Then you can see the text streams like this:

4 0 obj
<<
  /Length 3473
>>
stream
 /P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 709.54 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(This is a )9(te)-3(st)9( do)-4(cu)13(m)-4(en)12(t )-3(b)3(y)-3( )9(Et)-2(h)3(an)4( Nels)13(o)-5(n)3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 252.89 709.54 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 1>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 687.1 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 664.54 Tm
0 g
0 G
[(Tuesday )8(was )-3(a)12( go)7(o)-5(d)3( t)-3(i)13(m)-4(e)9( t)7(o)-5( c)-2(all)4( )9(\()] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 220.01 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 236.81 664.54 Tm
0 g
0 G
 -0.105 Tc[(\) )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 242.57 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 259.37 664.54 Tm
0 g
0 G
[(-)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 262.61 664.54 Tm
0 g
0 G
[(0)-3(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 285.05 664.54 Tm
0 g
0 G
 -0.142 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 290.21 664.54 Tm
0 g
0 G
[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 380.71 664.54 Tm
0 g
0 G
[(m)-4(b)3(er)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 404.71 664.54 Tm
0 g
0 G
 -0.0221 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 409.87 664.54 Tm
0 g
0 G
[(This i)13(s a )-3(ran)5(d)3(o)5(m)-4( add)5(re)10(ss f)9(o)-5(r )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 650.02 Tm
0 g
0 G
[(te)-3(sting)6( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 105.26 650.02 Tm
0 g
0 G
[(p)3(u)3(rp)4(o)-5(s)11(es)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 146.3 650.02 Tm
0 g
0 G
[(:)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 149.18 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 151.7 650.02 Tm
0 g
0 G
[(3)7(4)-3(1)7( M)-5(ap)4(l)13(e )-3(st )6(P)-4(a)12(y)-3(t)9(o)-5(n)3(v)-4(il)3(le)11( M)-5(ain)6(e)9( 4)5(5)-3(6)7(8)-3(1)-3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 325.61 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 3>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 627.58 Tm
0 g
0 G
 -0.0322 Tc[(Any)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 89.184 627.58 Tm
0 g
0 G
[(way)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 107.42 627.58 Tm
0 g
0 G
[(,)11( t)-3(h)3(er)10(e )-3(are)11( rand)6(o)5(m)-4( )9(whitespac)12(es )-4(h)3(er)10(e)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 272.33 627.58 Tm
0 g
0 G
[(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 274.97 627.58 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC 
endstream
endobj

@MartinThoma
Member

Let's focus on an example where PyPDF2 added an extra whitespace: phone became ph one.

In the PDF, the part "This is his phone mu" is represented as:

[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
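
To make the numbers inside these TJ arrays less mysterious, here is a minimal sketch that splits one array into its string pieces and its numeric adjustments. In a TJ array, each number is expressed in thousandths of a unit of text space and is subtracted from the current horizontal position, so it acts as kerning rather than as a word gap. The helper name `parse_tj` is made up for illustration and is not part of PyPDF2. The regular expression only handles simple literal strings (no octal escapes, no nested parentheses), and the shift calculation assumes 100% horizontal scaling.

```python
import re

# One token is either a literal string "(...)" or a number.
# Backslash-escaped characters such as \( and \) are allowed inside strings.
TOKEN = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d+(?:\.\d+)?)")


def parse_tj(array_source, font_size):
    """Return (text, total_shift_in_points) for one TJ array (hypothetical helper)."""
    pieces = []
    shift_pt = 0.0
    for string, number in TOKEN.findall(array_source):
        if number:
            # Adjustments are in 1/1000 of a text space unit; a positive
            # value moves the next glyph to the left.
            shift_pt -= float(number) / 1000 * font_size
        else:
            # Drop the backslash from simple escapes like \( and \).
            pieces.append(re.sub(r"\\(.)", r"\1", string))
    return "".join(pieces), shift_pt


# The two arrays from the excerpt above, at the 11.04 font size set by Tf.
first = parse_tj("[(This i)13(s his p)3(h)]", 11.04)
second = parse_tj("[(o)-5(n)3(e )7(m)-4(u)]", 11.04)
print(first)
print(second)
```

Both arrays shift the text by well under a fifth of a point in total, so the adjustments inside the arrays cannot account for a visible space between "ph" and "one".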

@MartinThoma MartinThoma changed the title Random whitespace when using page.extract_text() Random whitespaces are inserted when using page.extract_text() Dec 18, 2022
@pubpub-zz
Collaborator

Let's focus on an example where PyPDF2 added an extra whitespace: phone became ph one.

In the PDF, the part "This is his phone mu" is represented as:

[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ

Here I would guess that PyPDF2 inserted a whitespace because of the 1 0 0 1 346.63 664.54 Tm sequence: it resets the 'cursor' to an absolute position, both in X (horizontal) and Y (vertical). I would guess the vertical position was detected as unchanged (otherwise a line break would have been emitted), but we currently do not, and cannot without increasing calculation time, compute the horizontal position (this requires a major change identified in the roadmap). Because of that, inserting a whitespace is treated as the most common case.
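
As a rough illustration of what computing the horizontal position could buy, the sketch below decides whether a new Tm position leaves a real gap after the previous text run. It is not how PyPDF2 works. The function name `space_needed`, the `min_gap_ratio` threshold, and above all the run width passed in are assumptions: obtaining a reliable width requires the font's glyph metrics, which is exactly the expensive part mentioned above.

```python
def space_needed(prev_x, prev_width, next_x, font_size, min_gap_ratio=0.2):
    """Guess whether a space separates two runs on the same line.

    prev_x        x position of the previous run (from its Tm operator)
    prev_width    rendered width of the previous run in points (assumed known)
    next_x        x position of the next run (from its Tm operator)
    min_gap_ratio hypothetical threshold, as a fraction of the font size
    """
    gap = next_x - (prev_x + prev_width)
    return gap > min_gap_ratio * font_size


# x positions taken from the stream above; the widths are invented for the example.
print(space_needed(290.21, 56.4, 346.63, 11.04))  # run ends right where the next one starts
print(space_needed(290.21, 50.0, 346.63, 11.04))  # run ends several points earlier
```

With a width close to the distance between the two Tm positions, the heuristic reports no space, which would keep "phone" in one piece. The threshold would still need tuning against real documents.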

@einelson
Author

Thank you for the quick replies and the examples!

I apologize since I am not very familiar with PDF encodings. So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction? I might be looking at it very simply but it looks like you can parse the text from the 'tuples' in the list style objects in the stream to extract the raw-unformatted text. Is there a method that I can use to access the PDF encoding stream to attempt to do this?

-Sorry, this bit is off topic-
My end goal is to do some text replacement in a PDF. Assuming I could figure out how to parse the text encodings myself and extract the text, these were some solutions I was looking at duplicating. However I am not 100% sure if editing the stream is considered best practice either and would guess that the encodings matter here for formatting purposes.

Thank you!
Ethan N

@MartinThoma
Member

So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction?

PyPDF2 tries to give a useful text extraction. I have shown you the pure "text" data from above. If you want that without any interpretation, you can get it like this:

from PyPDF2 import PdfReader
from PyPDF2.generic import ContentStream

reader = PdfReader("example.pdf")
stream = ContentStream(reader.pages[0]["/Contents"].get_object(), reader)
print(stream.operations)

Give it a shot and let us know how it works :-)
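
If it helps, here is a small sketch of what joining the string operands by hand could look like. It assumes that `stream.operations` is a list of `(operands, operator)` pairs, that the text-showing operators are `Tj` and `TJ`, and that string operands behave like `str` or `bytes`. The function name `raw_text_from_operations` is made up for this example. It ignores font encodings entirely, so for many real PDFs the result will not be readable text, and any space that exists only as a positioning move is lost.

```python
def _as_text(value):
    # Byte strings are decoded naively; a real font encoding may differ.
    return value.decode("latin-1") if isinstance(value, bytes) else str(value)


def raw_text_from_operations(operations):
    """Concatenate the string operands of Tj and TJ operations (hypothetical helper)."""
    chunks = []
    for operands, operator in operations:
        op = operator.decode() if isinstance(operator, bytes) else operator
        if op == "Tj":
            chunks.append(_as_text(operands[0]))
        elif op == "TJ":
            # A TJ operand is one array mixing strings and numeric adjustments.
            chunks.extend(
                _as_text(item)
                for item in operands[0]
                if isinstance(item, (str, bytes))
            )
    return "".join(chunks)


# Hand-written stand-in for stream.operations, mirroring the "phone" excerpt.
sample = [
    ([["This i", 13, "s his p", 3, "h"]], b"TJ"),
    ([1, 0, 0, 1, 346.63, 664.54], b"Tm"),
    ([["o", -5, "n", 3, "e ", 7, "m", -4, "u"]], b"TJ"),
]
print(raw_text_from_operations(sample))
```

On a real document you would pass `stream.operations` from the snippet above instead of `sample`.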

My end goal is to do some text replacement in a PDF.

This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.

@MartinThoma
Member

Don't forget that mutool clean heavily simplified the PDF + your PDF is already pretty simple. In contrast, PyPDF2 needs to support all kinds of PDFs from the wild.

@einelson
Author

Give it a shot and let us know how it works :-)

Thank you! I can see the encoding stream here and can definitely see how confusing it is to make sense of it! I'll give parsing it a shot and see if I can pull out the text without the whitespaces.

This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.

That is good to know. Are there any resources or helpful documents on word replacement within a PDF that I could look into?

@pubpub-zz
Collaborator

@einelson
an introduction to PDF format is available here:
http://preserve.mactech.com/articles/mactech/Vol.15/15.09/PDFIntro/index.html

the pdf standard is available here:
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf

@brockenspectre

I'm having the same issue with random whitespace additions and it's making regex matching nearly impossible. I'd like to +1 a fix for this even if the computation time increases. Thanks for all your work!

@MartinThoma
Member

@brockenspectre This is not an issue of putting more computational power into the problem.

The issue is figuring out what is correct. And not only for a single PDF, but for all PDFs one could find in the wild.

@MartinThoma MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Jan 14, 2023
@JaWas2019

I would like to add something wild I encountered to this issue. Unfortunately, I know too little about PDFs to make sense of it myself, but hopefully you lovely people can :)

Crypto n' Stocks - LinkedIn Teaser_reduced.pdf

This is a report teaser that was created by our designer in Figma.

This is how the text comes out after using pypdf:

image

Now, I was pretty quick in blaming Figma for probably creating a shitty file, but opening the same file in Acrobat Pro and copying any random section leads to perfectly usable text:

image

I'm curious to hear what the reason might be! Other PDFs are working fine as well.

Thanks for the work you are doing for all of us and have a good one!

@renanzulian

I'm not experienced with PDFs, but it's looking hard to solve.

Unfortunately, this problem has me stuck. I noticed that some libraries like pdfminer.six and pdfplumber don't have this problem. We could check how they deal with it.

@pubpub-zz
Collaborator

pubpub-zz commented May 20, 2023

From #1830:
When the method page.extract_text() is used, the extracted text has no whitespace.

Actual output from the sample.pdf:

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.

There are missing whitespaces.

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.
^ ^

Expecting:

Text Formatting
Inline formatting
Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Or Minimum:

Expecting:

Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Environment

$ python -m platform
macOS-13.2.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf.__version__)"
3.0.1

Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")

for page_num in range(len(reader.pages)):
    page = reader.pages[page_num]
    text = page.extract_text()

Sample

sample.pdf

@thatperson42

Would it be possible to have a configurable argument to tune the sensitivity to whitespace? I've tried setting .extract_text(space_width=...) to different values but have not gotten different results in some of the examples shared above.
page.extract_text(space_width=2000)

@hoehermann

My end goal is to do some text replacement in a PDF.

I am trying to achieve something similar. I came up with this. It works for some cases, but working with PDF is much more convoluted than I envisioned. @einelson, have you found a more robust solution?

@gregdingle

I found that pymupdf did not have the random white space problem.

@ssjkamei
Contributor

ssjkamei commented Oct 3, 2024

#1507 (comment)
I believe #2882 addresses all but the blanks mentioned above.
The newly specified PDF has a different unit calculation than the other PDFs.
The other PDFs have 1/1000pt per unit, while the PDF mentioned in the comments seems to have 1pt per unit. I do not understand where this difference is taken from.

The description of the units can be found in PDF 1.7, Table 111, Width, etc.
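
To show why the unit matters, here is a small sketch of the conversion from a font's width value to points. For the common case, the spec defines glyph widths so that 1000 units correspond to 1 unit of text space. The function name `width_in_points` and the `units_per_text_unit` parameter are made up for illustration, and the width values are invented rather than taken from either PDF.

```python
def width_in_points(width, font_size, units_per_text_unit=1000.0):
    """Convert a glyph width from font units to points (hypothetical helper)."""
    return width / units_per_text_unit * font_size


# A width of 500 in the usual 1/1000 convention, at font size 11.04.
print(width_in_points(500, 11.04))

# The same physical width stored with 1 unit per text space unit.
print(width_in_points(0.5, 11.04, units_per_text_unit=1))

# Reading the second value with the wrong convention is off by a factor of 1000.
print(width_in_points(0.5, 11.04))
```

If an extractor applies the 1/1000 convention to widths stored in the other unit, every glyph looks almost zero-width, and any gap threshold derived from those widths will misjudge where spaces belong.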
