Random whitespaces are inserted when using page.extract_text() #1507
Comments
@einelson Thank you for creating an example and sharing the issue! Getting whitespaces right is notoriously hard. @pubpub-zz is the expert in that topic; I'll leave it to him to decide if we should leave this issue open. The issue is that PDF does not (necessarily) represent the words as words internally. In the worst case, it just gives the absolute position of each character in the document. See https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard |
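To illustrate why this is hard, here is a minimal sketch of a gap-based heuristic. It is not PyPDF2's actual algorithm; the `Glyph` class, the `join_glyphs` function, the `gap_ratio` parameter, and the coordinates are all made up for illustration. It assumes we only know each character's position and width, and it inserts a space whenever the gap to the previous character is "large enough".

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge in points (hypothetical data)
    width: float  # advance width in points (hypothetical data)


def join_glyphs(glyphs, gap_ratio=0.5):
    """Rebuild one line of text from positioned glyphs.

    A space is inserted when the horizontal gap to the previous glyph
    exceeds gap_ratio times the previous glyph's width.
    """
    out = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                out.append(" ")
        out.append(g.char)
        prev = g
    return "".join(out)


# "is", a real word gap, then "his" with slightly loose spacing before the "s".
glyphs = [
    Glyph("i", 0.0, 3.0), Glyph("s", 3.0, 5.0),
    Glyph("h", 11.0, 6.0), Glyph("i", 17.0, 3.0), Glyph("s", 21.0, 5.0),
]

for ratio in (0.25, 0.5, 1.0):
    print(ratio, repr(join_glyphs(glyphs, ratio)))
```

With this toy data, a low threshold splits "his" into "hi s", a high one merges the two words, and only the middle value gives "is his". A threshold that works for one PDF can easily be wrong for the next one.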
You can decode the PDF using
Then you can see the text streams like this:
|
Let's focus on an example where PyPDF2 added an extra whitespace: In the PDF, the part "This is his phone mu" is represented as:
|
Here I would guess that PyPDF2 has inserted a whitespace because of the |
Thank you for the quick replies and the examples! I apologize since I am not very familiar with PDF encodings. So rather than just read the text in the PDF document, the extract_text() function tries to make sense of the encodings? Is there a reason that PyPDF2 tries to do that if this is just text extraction? I might be looking at it very simply but it looks like you can parse the text from the 'tuples' in the list style objects in the stream to extract the raw-unformatted text. Is there a method that I can use to access the PDF encoding stream to attempt to do this? -Sorry, this bit is off topic-
Thank you! |
PyPDF2 tries to give a useful text extraction. I have shown you the pure "text" data from above. If you want that without any interpretation, you can get it like this:
Give it a shot and let us know how it works :-)
This is not as easy as it might look. PDF documents have pointers inside. If you change the length of anything, the pointers break. That very easily renders the complete PDF useless. |
Don't forget that mutool clean heavily simplified the PDF, and your PDF is already pretty simple. In contrast, PyPDF2 needs to support all kinds of PDFs from the wild. |
Thank you! I can see the encoding stream here and can definitely see how confusing it is to make sense of it! I'll give parsing it a shot and see if I can pull out the text without the whitespaces.
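For anyone attempting the same, a rough sketch of pulling the literal strings out of the TJ arrays in an already decoded content stream could look like the following. The function name `raw_text_from_tj` and the sample stream are made up for illustration. It assumes the simple case only: literal `(...)` strings, no hex strings, no nested unescaped parentheses, no `]` inside strings, and a font encoding where the bytes are already readable text. Real PDFs frequently violate all of these assumptions.

```python
import re

# An array operand followed by the TJ operator, e.g. [(Th)-1(is)] TJ
TJ_ARRAY = re.compile(r"\[(.*?)\]\s*TJ", re.DOTALL)
# A literal string: parentheses around escaped pairs or plain characters
LITERAL = re.compile(r"\(((?:\\.|[^\\()])*)\)")
# Only the three simplest escapes are handled here: \( \) and \\
ESCAPE = re.compile(r"\\([()\\])")


def raw_text_from_tj(stream: str) -> list[str]:
    """Return the concatenated literal strings of every TJ array.

    The numbers between the strings (position adjustments) are ignored,
    so no whitespace is ever guessed from the layout.
    """
    chunks = []
    for array in TJ_ARRAY.finditer(stream):
        parts = [
            ESCAPE.sub(r"\1", m.group(1))
            for m in LITERAL.finditer(array.group(1))
        ]
        chunks.append("".join(parts))
    return chunks


# Hypothetical, heavily simplified content stream
sample = "BT /F1 12 Tf [(Th)-1(is is)3( his p)-2(hone)] TJ ET"
print(raw_text_from_tj(sample))
```

Note that this only keeps spaces that are physically present in the strings. In PDFs that position words without any space characters, the output would have no spaces between those words at all.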
That is good to know. Are there any resources on word replacement within a PDF that I could look into, or any helpful documents? |
@einelson the pdf standard is available here: |
I'm having the same issue with random whitespace additions and it's making regex matching nearly impossible. I'd like to +1 a fix for this even if the computation time increases. Thanks for all your work! |
@brockenspectre This is not an issue of putting more computational power into the problem. The issue is figuring out what is correct. And not only for a single PDF, but for all PDFs one could find in the wild. |
I would like to add something wild I encountered to this issue. Unfortunately, I know too little about PDFs to make sense of it myself, but hopefully you lovely people can :)
Crypto n' Stocks - LinkedIn Teaser_reduced.pdf
This is a report teaser that was created by our designer in Figma. This is how the text comes out after using pypdf:
Now, I was pretty quick in blaming Figma for probably creating a shitty file, but opening the same file in Acrobat Pro and copying any random section leads to perfectly usable text:
I'm curious to hear what the reason might be! Other PDFs are working fine as well. Thanks for the work you are doing for all of us and have a good one! |
I'm not experienced with PDFs, but this looks hard to solve. Unfortunately, this problem has me stuck. I noticed that some libraries, like pdfminer.six and pdfplumber, don't have this problem. We could check how they deal with it. |
** from #1830
Actual output from the sample.pdf:
Text FormattingInline formattingHere, we demonstrate various types of inline text formatting and the use of embedded fonts.
There are missing whitespaces.
Expecting:
Text Formatting
Or Minimum:
Expecting: Text Formatting Inline formatting Here, we demonstrate various types of inline text formatting and the use of embedded fonts.
Environment
$ python -m platform
$ python -c "import pypdf;print(pypdf.__version__)"
Code + PDF
from PyPDF2 import PdfReader
reader = PdfReader("sample.pdf")
for page_num in range(len(reader.pages)):
Sample |
Would it be possible to have a configurable argument to tune the sensitivity to whitespace? I've tried setting |
I found that pymupdf did not have the random white space problem. |
#1507 (comment) The description of the units can be found in PDF 1.7, Table 111, Width, etc. |
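As a rough worked example of those units: for most font types, glyph widths and the numbers inside a TJ array are given in thousandths of a text space unit. The PDF 1.7 specification gives the horizontal displacement after a glyph as tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th. The sketch below just evaluates that formula; the function and parameter names are my own, and the widths are made-up values, not taken from any real font.

```python
def horizontal_displacement(
    width_glyph_units: float,
    tj_adjust: float = 0.0,
    font_size: float = 12.0,
    char_spacing: float = 0.0,
    word_spacing: float = 0.0,
    h_scale: float = 1.0,
) -> float:
    """Evaluate tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th.

    width_glyph_units: glyph width from the font's Widths array (1/1000 units)
    tj_adjust:         number from a TJ array (1/1000 units, subtracted)
    font_size:         Tfs
    char_spacing:      Tc
    word_spacing:      Tw (per the spec, only applies to the space character)
    h_scale:           Th, horizontal scaling as a fraction (1.0 = 100%)
    """
    w0 = width_glyph_units / 1000.0
    return ((w0 - tj_adjust / 1000.0) * font_size + char_spacing + word_spacing) * h_scale


# A 500-unit-wide glyph at 12 pt advances 6 pt.
print(horizontal_displacement(500))
# A TJ adjustment of -250 widens the advance to 9 pt without any space character.
print(horizontal_displacement(500, tj_adjust=-250))
```

A negative TJ number therefore moves the next glyph further away. Whether such a gap is a word break or just kerning is exactly the judgment call a text extractor has to make.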
I am trying to extract text from various PDF documents to use in an NLP project. While using page.extractText() random whitespace is appearing in the outputted words when there are no spaces in the pdf document.
Environment
Using VS code and running via command prompt.
$ python -m platform
Windows-10-10.0.22621-SP0
$ python -c "import PyPDF2;print(PyPDF2.__version__)"
2.12.1
Code + PDF
This is a minimal, complete example that shows the issue:
test_doc.pdf
(The PDF was generated using default settings in Microsoft Word.) It looks like this:
The code is:
Output