Inconsistent hyphenation (and lost blanks) #2262

Closed
BrainAnnex opened this issue Oct 22, 2023 · 9 comments · Fixed by #2890
Labels
whitespace: While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.
workflow-text-extraction: From a user's perspective, text extraction is the affected feature/workflow.

Comments

@BrainAnnex

BrainAnnex commented Oct 22, 2023

I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.

Some PDF files are parsed fine, but others (such as the attached "Ocean Currents.pdf") are disasters! Here's an example of the parsed text from the second page of the document:

the current flows in the op-\nposite direction to the surface current. This shift of currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity with depth, is called the Ekman spiral .\nThevelocityofthesurfacecurrentisthesumoftheve-\nlocitiesoftheEkman,geostrophic,tidal,andothercurrents.The Ekman surface current or wind drift current dependsuponthespeedofthewind,itsconstancy,thelengthoftimeit has blown, and other factors. In general, however, winddriftcurrentisabout2percentofthewindspeed,oralittleless,indeepwaterwherethewindhasbeenblowingsteadi-ly for at least 12 hours.\n3203. Currents Related To Density Differences\nThe density of water varies with salinity, temperature,\nand pressure. At any given depth, the differences in densityaredueonlytodifferencesintemperatureandsalinity.With\nsufficientdata,mapsshowinggeographicaldensitydistribu-tion at a certain depth can be drawn, with lines connectingpoints of equal density. These lines would be similar to iso-bars on a weather map

Notice 2 problems:

  1. Many words are run together, with the blank spaces lost. Example:
currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity
  2. Hyphenation is rendered inconsistently.
    For example (see screenshot below):
op-
posite

is extracted as op-\nposite (with a newline),
while:

iso-
bars

is extracted as iso-bars (no newline!)

[Screenshot of the source page]

Code + PDF

Ocean Currents.pdf
(full document attached; please add to your tests)

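# pip install pypdf   (version 3.16.4 was used)
from pypdf import PdfReader

pdf_name = "Ocean Currents.pdf"
reader = PdfReader(pdf_name)
page = reader.pages[1]  # second page of the document (0-based index)
print(repr(page.extract_text()))  # repr() makes the \n characters visible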

Thoughts

I suspect you'll say that the attached PDF is malformed. Maybe it is... but another library, PyMuPDF, parses it just fine.

In fact, I have noticed that the lost spaces are far more prevalent in extractions by pypdf, compared to PyMuPDF - BUT for some files it's the opposite, and pypdf does far better.

Empirically, I've noticed an intriguing complementarity between pypdf and PyMuPDF: for files where one messes up badly, the other one does well, and vice versa. Maybe they use a different threshold for detecting blank spaces in sentences?
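A rough way to exploit that complementarity is to score each extraction by how many suspiciously long tokens it contains and keep the cleaner one. This is only a sketch: `run_together_score`, `pick_cleaner`, the 20-character cutoff, and the `extractor_a`/`extractor_b` names are all made up for illustration, and the second sample string is a hand-spaced version of the first, not real output from either library.

```python
def run_together_score(text: str, max_word_len: int = 20) -> float:
    """Fraction of whitespace-separated tokens longer than max_word_len."""
    tokens = text.split()
    if not tokens:
        return 0.0
    long_tokens = sum(1 for token in tokens if len(token) > max_word_len)
    return long_tokens / len(tokens)


def pick_cleaner(candidates: dict) -> str:
    """Return the name of the extraction with the lowest run-together score."""
    return min(candidates, key=lambda name: run_together_score(candidates[name]))


# Hypothetical inputs: the same line with and without its blanks.
candidates = {
    "extractor_a": "Thevelocityofthesurfacecurrentisthesumoftheve-",
    "extractor_b": "The velocity of the surface current is the sum of the ve-",
}
print(pick_cleaner(candidates))  # extractor_b
```

The cutoff is a guess and would need tuning per language; text with long URLs or chemical names would be penalized unfairly.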

But the inconsistent hyphenation I mentioned at the beginning is another issue that seriously gets in the way of word extraction...

Thanks!
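Until the extraction itself is consistent, the `op-\nposite` case can be handled in post-processing. A minimal sketch, assuming that a hyphen directly followed by a newline always marks a word split across lines; `join_line_break_hyphens` is a hypothetical helper, not part of pypdf:

```python
import re

# Hypothetical helper (not a pypdf API): matches a hyphen that sits
# between two word characters and is immediately followed by a newline.
_HYPHEN_BREAK = re.compile(r"(\w)-\n(\w)")


def join_line_break_hyphens(text: str) -> str:
    """Merge 'op-\\nposite' into 'opposite'; leave in-line hyphens alone."""
    return _HYPHEN_BREAK.sub(r"\1\2", text)


sample = "flows in the op-\nposite direction, similar to iso-bars on a map"
print(join_line_break_hyphens(sample))
# flows in the opposite direction, similar to iso-bars on a map
```

The `iso-bars` case (no newline) is deliberately left untouched: without a dictionary it cannot be told apart from a genuine compound such as "well-known".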

@stefan6419846
Collaborator

Please see the corresponding docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard To summarize: text extraction is hard and involves quite a bit of guessing. By default you only have individual character positions; all remaining steps use heuristics to form words and so on, so the results are not always correct. (Speaking of (Py)MuPDF: they provide commercial solutions as well and thus might have better general results.)
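To illustrate the kind of guessing involved, here is a toy word-forming heuristic: a space is inserted whenever the horizontal gap between two glyphs exceeds a fraction of the font size. This is a sketch for discussion only, not pypdf's actual algorithm; the `Glyph` tuple layout, `glyphs_to_text`, and the `gap_ratio` parameter are all assumptions.

```python
from typing import List, Tuple

# (character, x_start, x_end) in page units; an illustrative input format.
Glyph = Tuple[str, float, float]


def glyphs_to_text(glyphs: List[Glyph], font_size: float, gap_ratio: float = 0.25) -> str:
    """Insert a space when the gap to the previous glyph exceeds gap_ratio * font_size."""
    out = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        if prev_end is not None and (x_start - prev_end) > gap_ratio * font_size:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)


# "of" and "the" set tightly: only a 2-unit gap at font size 10.
glyphs = [("o", 0, 5), ("f", 5, 8), ("t", 10, 13), ("h", 13, 18), ("e", 18, 23)]
print(glyphs_to_text(glyphs, font_size=10, gap_ratio=0.25))  # ofthe
print(glyphs_to_text(glyphs, font_size=10, gap_ratio=0.15))  # of the
```

With tightly set text, the same glyph positions yield either "ofthe" or "of the" depending on the threshold, which is why no single value works for every document.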

@MartinThoma added the workflow-text-extraction and whitespace labels on Oct 23, 2023
@MartinThoma
Member

Thanks for sharing the file and some examples! This helps a lot to refine our heuristics.

I agree with everything @stefan6419846 said. There is little hope of ever solving this completely for all PDF documents.

Do you own the license of that file or is it public domain? I'm always interested in refining my benchmark for text extraction

@BrainAnnex
Author

It'd be nice to have a user-settable threshold, for situations (not super-common, but not exactly rare, either - in my tests) when the words are not spaced enough for the algorithm to make the right choice.

Does such a setting exist?

@BrainAnnex
Author

Also, the inconsistent hyphenation (sometimes leading to extracted text with a newline and sometimes without) is a separate issue altogether. Maybe I ought to have started 2 separate discussion threads...

@BrainAnnex
Author

Do you own the license of that file or is it public domain? I'm always interested in refining my benchmark for text extraction

@MartinThoma - it's the PDF version of a book I used to own. I don't know if it's public domain. Doesn't a page extracted for technical tests qualify for "fair use"?

@MartinThoma
Member

it's the PDF version of a book I used to own. I don't know if it's public domain

In that case I would advise against sharing it publicly. Private sharing might be OK, but I'm not a lawyer and I don't want to get into / cause issues 😅

@ssjkamei
Contributor

ssjkamei commented Oct 3, 2024

Line breaks are not detected correctly in this PDF.
The cause seems to be a unit error.

Going by the TL entry quoted below, the operand needs to be multiplied by the font size when TL is processed.
Note that the value is given in text space units, not in pt.

PDF 1.7
Table 105 - Text state operators (continued)
Set the text leading, T_l, to leading, which shall be a number expressed in unscaled text space units. Text leading shall be used only by "T*", ', and " operators. Initial value: 0.
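For reference, the same specification defines `T*` as having the effect of `0 -T_l Td`, and `Td` as a translation applied in front of the text line matrix. The sketch below follows that definition to show how a leading given in text space units turns into a distance on the page. The function names and the sample matrix are illustrative assumptions, and the six-number `[a, b, c, d, e, f]` layout is the usual PDF matrix shorthand; this is not pypdf code.

```python
import math
from typing import Sequence, Tuple


def next_line_origin(tlm: Sequence[float], leading: float) -> Tuple[float, float]:
    """Apply `0 -leading Td` to a text line matrix [a, b, c, d, e, f].

    Returns the translation part (e, f) of the resulting matrix.
    """
    a, b, c, d, e, f = tlm
    tx, ty = 0.0, -leading
    return (tx * a + ty * c + e, tx * b + ty * d + f)


def line_advance(tlm: Sequence[float], leading: float) -> float:
    """Distance between the current and the next line origin."""
    new_e, new_f = next_line_origin(tlm, leading)
    return math.hypot(new_e - tlm[4], new_f - tlm[5])


# Assumed example: a text matrix that scales text space by 10.
tlm = [10.0, 0.0, 0.0, 10.0, 72.0, 700.0]
print(next_line_origin(tlm, 1.2))  # x stays at 72, y drops by about 12
print(line_advance(tlm, 1.2))      # about 12
```

A leading of 1.2 therefore moves the line by roughly 12 units once the matrix scaling is taken into account, so comparing the raw operand against page-level distances would miss the line break.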

@ssjkamei
Contributor

ssjkamei commented Oct 4, 2024

To make this modification I just need to make the following changes to _page.py.
I think I should make the change, but is this something you can't accept without a test case?

before:

            elif operator == b"TL":
                check_crlf_space = True
                TL = float(operands[0])

after:

            elif operator == b"TL":
                check_crlf_space = True
                scale_x = math.sqrt(tm_matrix[0]**2 + tm_matrix[2]**2)
                TL = float(operands[0]) * font_size * scale_x

@stefan6419846
Collaborator

I think I should make the change, but is this something you can't accept without a test case?

Yes, this deserves a corresponding test case, unless it is already covered by the existing tests.
