Random whitespaces are inserted when using page.extract_text() #1507

einelson · 2022-12-18T02:52:37Z

I am trying to extract text from various PDF documents to use in an NLP project. While using page.extract_text(), random whitespace appears inside the extracted words where the PDF document has no spaces.

Environment

Using VS code and running via command prompt.

$ python -m platform
Windows-10-10.0.22621-SP0

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1

Code + PDF

This is a minimal, complete example that shows the issue:

test_doc.pdf
(The PDF was generated using the default settings in Microsoft Word.) It looks like this:

The code is:

import os

from PyPDF2 import PdfReader, __version__

pdf = PdfReader(os.path.join(os.getcwd(), "test_doc.pdf"))

print(f"PyPDF2=={__version__}")

text = ""
for page in pdf.pages:
    page_content = page.extract_text()
    text = text + page_content
print(text)

Output

PyPDF2==2.12.1
This is a test document by Ethan Nelson.  
 
Tuesday was a good time to call ( 000) 000-0000 . This is his ph one mu mber . This is a random address for 
testing purposes : 341 Maple st Paytonville Maine 45681.  
Anyway, there are random whitespaces here .


MartinThoma · 2022-12-18T09:23:48Z

@einelson Thank you for creating an example and sharing the issue!

Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document.

See https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard

MartinThoma · 2022-12-18T09:32:14Z

You can decode the PDF using

mutool clean -daf test_doc.pdf test_doc_clean.pdf

Then you can see the text streams like this:

4 0 obj
<<
  /Length 3473
>>
stream
 /P <</MCID 0>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 709.54 Tm
/GS7 gs
0 g
/GS8 gs
0 G
[(This is a )9(te)-3(st)9( do)-4(cu)13(m)-4(en)12(t )-3(b)3(y)-3( )9(Et)-2(h)3(an)4( Nels)13(o)-5(n)3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 252.89 709.54 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 1>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 687.1 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 2>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 664.54 Tm
0 g
0 G
[(Tuesday )8(was )-3(a)12( go)7(o)-5(d)3( t)-3(i)13(m)-4(e)9( t)7(o)-5( c)-2(all)4( )9(\()] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 220.01 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 236.81 664.54 Tm
0 g
0 G
 -0.105 Tc[(\) )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 242.57 664.54 Tm
0 g
0 G
[(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 259.37 664.54 Tm
0 g
0 G
[(-)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 262.61 664.54 Tm
0 g
0 G
[(0)-3(0)7(0)-3(0)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 285.05 664.54 Tm
0 g
0 G
 -0.142 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 290.21 664.54 Tm
0 g
0 G
[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 380.71 664.54 Tm
0 g
0 G
[(m)-4(b)3(er)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 404.71 664.54 Tm
0 g
0 G
 -0.0221 Tc[(. )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 409.87 664.54 Tm
0 g
0 G
[(This i)13(s a )-3(ran)5(d)3(o)5(m)-4( add)5(re)10(ss f)9(o)-5(r )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 650.02 Tm
0 g
0 G
[(te)-3(sting)6( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 105.26 650.02 Tm
0 g
0 G
[(p)3(u)3(rp)4(o)-5(s)11(es)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 146.3 650.02 Tm
0 g
0 G
[(:)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 149.18 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 151.7 650.02 Tm
0 g
0 G
[(3)7(4)-3(1)7( M)-5(ap)4(l)13(e )-3(st )6(P)-4(a)12(y)-3(t)9(o)-5(n)3(v)-4(il)3(le)11( M)-5(ain)6(e)9( 4)5(5)-3(6)7(8)-3(1)-3(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 325.61 650.02 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC  /P <</MCID 3>> BDC q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 72.024 627.58 Tm
0 g
0 G
 -0.0322 Tc[(Any)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 89.184 627.58 Tm
0 g
0 G
[(way)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 107.42 627.58 Tm
0 g
0 G
[(,)11( t)-3(h)3(er)10(e )-3(are)11( rand)6(o)5(m)-4( )9(whitespac)12(es )-4(h)3(er)10(e)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 272.33 627.58 Tm
0 g
0 G
[(.)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 274.97 627.58 Tm
0 g
0 G
[( )] TJ
ET
Q
 EMC 
endstream
endobj

MartinThoma · 2022-12-18T09:35:14Z

Let's focus on an example where PyPDF2 added an extra whitespace: phone became ph one.

In the PDF, the part "This is his phone mu" is represented as:

[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ
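
To make the structure of these TJ arrays concrete, here is a minimal sketch in plain Python (no PyPDF2) that splits a decoded TJ line into its string fragments and its numeric adjustments. The helper names parse_tj and tj_text are made up for this illustration, and the regex only handles simple literal strings with backslash-escaped characters, not hex strings or octal escapes.

```python
import re

# One token is either a literal string "(...)" or a number.
_TOKEN = re.compile(r"\(((?:\\.|[^\\()])*)\)|(-?\d+(?:\.\d+)?)")


def parse_tj(line):
    """Split a '[(..)n(..)] TJ' line into str fragments and float adjustments."""
    body = line[line.index("[") + 1 : line.rindex("]")]
    items = []
    for match in _TOKEN.finditer(body):
        text, number = match.groups()
        if number is not None:
            items.append(float(number))
        else:
            # Drop the backslash from escapes such as \( and \).
            items.append(re.sub(r"\\(.)", r"\1", text))
    return items


def tj_text(line):
    """Join only the string fragments, ignoring the adjustments."""
    return "".join(item for item in parse_tj(line) if isinstance(item, str))


first = tj_text(r"[(This i)13(s his p)3(h)] TJ")
second = tj_text(r"[(o)-5(n)3(e )7(m)-4(u)] TJ")
print(repr(first), repr(second))
print(first + second)
```

Neither array contains a space between "ph" and "one"; the two fragments are simply separate TJ operators with a Tm operator in between.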

pubpub-zz · 2022-12-19T09:12:20Z

Let's focus on an example where PyPDF2 added an extra whitespace: phone became ph one.

In the PDF, the part "This is his phone mu" is represented as:
[(This i)13(s his p)3(h)] TJ
ET
Q
q
0.00000912 0 612 792 re
W* n
BT
/F1 11.04 Tf
1 0 0 1 346.63 664.54 Tm
0 g
0 G
[(o)-5(n)3(e )7(m)-4(u)] TJ

Here I would guess that PyPDF2 inserted a whitespace because of the 1 0 0 1 346.63 664.54 Tm sequence: it resets the 'cursor' to an absolute position, both in X (horizontal) and Y (vertical). I would guess the vertical position was detected as unchanged (otherwise a line break would have been emitted), but we currently do not compute the horizontal position, and cannot without increasing the calculation time (this requires a major change identified in the roadmap). Because of that, inserting a whitespace is treated as the most common case.
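
The idea of taking the horizontal position into account can be sketched as follows. This is not PyPDF2's implementation; it is a hypothetical heuristic that assumes the x coordinate where the previous run ended is already known (which in practice requires the glyph widths of the font). The function name and the 0.2 em threshold are arbitrary choices for illustration.

```python
def should_insert_space(prev_end_x, next_start_x, font_size, min_gap_em=0.2):
    """Return True if the horizontal gap between two runs looks like a space.

    prev_end_x:   x coordinate where the previous text run ended (assumed known)
    next_start_x: x coordinate set by the following Tm operator
    font_size:    font size from the Tf operator
    min_gap_em:   hypothetical threshold, as a fraction of the font size
    """
    gap = next_start_x - prev_end_x
    return gap > min_gap_em * font_size


# The run starting with "one mu" is placed at x = 346.63 with an 11.04 font.
# The two end positions below are invented to show both outcomes.
print(should_insert_space(346.50, 346.63, 11.04))  # run continues the word
print(should_insert_space(340.00, 346.63, 11.04))  # visible gap before the run
```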

einelson · 2022-12-19T16:01:48Z

Thank you for the quick replies and the examples!

I apologize since I am not very familiar with PDF encodings. So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction? I might be looking at it very simply but it looks like you can parse the text from the 'tuples' in the list style objects in the stream to extract the raw-unformatted text. Is there a method that I can use to access the PDF encoding stream to attempt to do this?

-Sorry, this bit is off topic-
My end goal is to do some text replacement in a PDF. Assuming I could figure out how to parse the text encodings myself and extract the text, these were some solutions I was looking at duplicating. However I am not 100% sure if editing the stream is considered best practice either and would guess that the encodings matter here for formatting purposes.

Thank you!
Ethan N

MartinThoma · 2022-12-19T21:14:51Z

So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction?

PyPDF2 tries to give a useful text extraction. I have shown you the pure "text" data from above. If you want that without any interpretation, you can get it like this:

from PyPDF2 import PdfReader
from PyPDF2.generic import ContentStream

reader = PdfReader("example.pdf")
stream = ContentStream(reader.pages[0]["/Contents"].get_object(), reader)
print(stream.operations)

Give it a shot and let us know how it works :-)

My end goal is to do some text replacement in a PDF.

This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.
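
One kind of pointer is the cross-reference table, which records the byte offset at which each object starts. A toy sketch (the byte string below is not a valid PDF, and object_at is a helper made up for this illustration) shows how a length-changing replacement makes such offsets stale:

```python
# Toy "PDF body": two objects laid out back to back (not a valid PDF file).
body = (
    b"1 0 obj\n(phone)\nendobj\n"
    b"2 0 obj\n(number)\nendobj\n"
)

# A cross-reference table stores the byte offset where each object starts.
xref = {1: body.index(b"1 0 obj"), 2: body.index(b"2 0 obj")}

# Replace a word with a longer one without updating the table.
edited = body.replace(b"(phone)", b"(telephone)")


def object_at(data, offset, number):
    """Check whether object `number` really starts at `offset` in `data`."""
    return data[offset:].startswith(b"%d 0 obj" % number)


print(object_at(body, xref[2], 2))    # offsets match the original bytes
print(object_at(edited, xref[2], 2))  # same offset checked after the edit
```

Stream objects have the same problem through their /Length entry, so a safe edit has to regenerate these values when the file is written.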

MartinThoma · 2022-12-19T21:15:59Z

Don't forget that mutool clean heavily simplified the PDF, and your PDF is already pretty simple. In contrast, PyPDF2 needs to support all kinds of PDFs from the wild.

einelson · 2022-12-20T20:37:13Z

Give it a shot and let us know how it works :-)

Thank you! I can see the encoding stream here and can definitely see how confusing it is to make sense of it! I'll give parsing it a shot and see if I can pull out the text without the whitespaces.
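
A minimal sketch of such a parser, assuming the (operands, operator) pairs that stream.operations yields in the snippet above. The function name collect_text is made up for this illustration. It only looks at the Tj and TJ operators (not ' or "), ignores all positioning, and treats the operands as already readable text, which will not hold for fonts with custom encodings.

```python
def collect_text(operations):
    """Concatenate the string operands of Tj and TJ operators, in stream order."""
    parts = []
    for operands, operator in operations:
        # PyPDF2 reports operators as bytes; accept str as well.
        if isinstance(operator, bytes):
            operator = operator.decode("ascii")
        if operator == "Tj":
            items = operands[:1]
        elif operator == "TJ":
            items = operands[0]
        else:
            continue
        for item in items:
            if isinstance(item, bytes):
                parts.append(item.decode("latin-1"))
            elif isinstance(item, str):
                parts.append(item)
            # Numbers are kerning adjustments and are skipped.
    return "".join(parts)


# Hand-written stand-in for stream.operations, based on the dump above.
sample = [
    ([], b"BT"),
    ([["This i", 13, "s his p", 3, "h"]], b"TJ"),
    ([1, 0, 0, 1, 346.63, 664.54], b"Tm"),
    ([["o", -5, "n", 3, "e ", 7, "m", -4, "u"]], b"TJ"),
    ([], b"ET"),
]
print(collect_text(sample))
```

With a real file, the call would be collect_text(stream.operations). Because positioning is ignored, this removes the extra spaces but also loses every line break and any space that the PDF encodes only as a gap.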

This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless.

That is good to know. Are there any resources for word replacement within a PDF that I could look into, or any other helpful documents?

pubpub-zz · 2022-12-20T21:16:17Z

@einelson
An introduction to the PDF format is available here:
http://preserve.mactech.com/articles/mactech/Vol.15/15.09/PDFIntro/index.html

The PDF standard is available here:
https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf

brockenspectre · 2023-01-03T03:37:40Z

I'm having the same issue with random whitespace additions and it's making regex matching nearly impossible. I'd like to +1 a fix for this even if the computation time increases. Thanks for all your work!

MartinThoma · 2023-01-03T16:06:45Z

@brockenspectre This is not an issue of putting more computational power into the problem.

The issue is figuring out what is correct. And not only for a single PDF, but for all PDFs one could find in the wild.

JaWas2019 · 2023-05-09T15:13:15Z

I would like to add something wild I encountered to this issue. Unfortunately, I know too little about PDFs to make sense of it myself, but hopefully you lovely people can :)

Crypto n' Stocks - LinkedIn Teaser_reduced.pdf

This is a report teaser that was created by our designer in Figma.

This is how the text comes out after using pypdf:

Now, I was pretty quick in blaming Figma for probably creating a shitty file, but opening the same file in Acrobat Pro and copying any random section leads to perfectly usable text:

I'm curious to hear what the reason might be! Other PDFs are working fine as well.

Thanks for the work you are doing for all of us and have a good one!

renanzulian · 2023-05-12T13:54:44Z

I'm not experienced with PDFs, but it's looking hard to solve.

Unfortunately, this problem has me stuck. I noticed that some libraries like pdfminer.six and pdfplumber don't have this problem. We could check how they deal with it.

pubpub-zz · 2023-05-20T15:14:35Z

From #1830:
When the method page.extract_text() is used, the extracted text has no whitespace between text blocks.

Actual output from the sample.pdf:

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.

There are missing whitespaces.

Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.
^ ^

Expecting:

Text Formatting
Inline formatting
Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Or Minimum:

Expecting:

Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.

Environment

$ python -m platform
macOS-13.2.1-x86_64-i386-64bit

$ python -c "import pypdf;print(pypdf.version)"
3.0.1

Code + PDF

from PyPDF2 import PdfReader

reader = PdfReader("sample.pdf")

for page_num in range(len(reader.pages)):
    page = reader.pages[page_num]
    text = page.extract_text()

Sample

sample.pdf

thatperson42 · 2023-06-14T18:21:38Z

Would it be possible to have a configurable argument to tune the sensitivity to whitespace? I've tried setting .extract_text(space_width=...) to different values but have not gotten different results in some of the examples shared above.
page.extract_text(space_width=2000)

hoehermann · 2024-02-24T21:36:37Z

My end goal is to do some text replacement in a PDF.

I am trying to achieve something similar. I came up with this. It works for some cases, but working with PDF is much more convoluted than I envisioned. @einelson, have you found a more robust solution?

gregdingle · 2024-02-25T01:39:05Z

I found that pymupdf did not have the random white space problem.

ssjkamei · 2024-10-03T04:32:49Z

#1507 (comment)
I believe #2882 addresses all but the blanks mentioned above.
The newly specified PDF has a different unit calculation than the other PDFs.
The other PDFs have 1/1000pt per unit, while the PDF mentioned in the comments seems to have 1pt per unit. I do not understand where this difference is taken from.

The description of the units can be found in PDF 1.7, Table 111, Width, etc.
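
For reference, in the PDF text model the numbers inside a TJ array are expressed in thousandths of a text space unit (the same convention as glyph widths of simple fonts) and are subtracted from the current position. A minimal sketch of that conversion, ignoring horizontal scaling; the function name is made up for this illustration:

```python
def tj_displacement(adjustment, font_size):
    """Horizontal shift caused by one TJ number, in text space units.

    A positive adjustment moves the next glyph to the left (for horizontal
    writing), so the result is negative. Horizontal scaling (Tz) is ignored.
    """
    return -adjustment / 1000 * font_size


# Adjustments taken from [(This i)13(s his p)3(h)] TJ at font size 11.04.
for adjustment in (13, 3):
    print(adjustment, tj_displacement(adjustment, 11.04))
```

With an 11.04 font, an adjustment of 13 shifts the next glyph by roughly 0.14 units, which is far smaller than a typical space. If a PDF's values were instead interpreted as whole units, the same number would look like a large gap.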

MartinThoma added the workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow label Dec 18, 2022

MartinThoma changed the title ~~Random whitespace when using page.extractText()~~ Random whitespace when using page.extract_text() Dec 18, 2022

MartinThoma changed the title ~~Random whitespace when using page.extract_text()~~ Random whitespaces are inserted when using page.extract_text() Dec 18, 2022

MartinThoma added the whitespace While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard. label Jan 14, 2023

pubpub-zz mentioned this issue May 20, 2023

Add spaces to page.extract_text() output concatination #1830

Closed

thunderbug1 mentioned this issue Jun 6, 2023

more robust default pdf parser run-llama/llama_index#6108

Closed

8 tasks

MartinThoma mentioned this issue Jul 29, 2023

ENH: Extract LaTeX characters #2016

Merged

jzohrab mentioned this issue Jan 9, 2024

Add PDF import support for books (Issue #93) LuteOrg/lute-v3#119

Merged
