Inconsistent hyphenation (and lost blanks) #2262

BrainAnnex · 2023-10-22T22:30:14Z

I'm trying to extract text from PDF documents, to isolate individual words and create an indexing system.

Some PDF files are parsed fine, but others (such as the attached "Ocean Currents.pdf") are disasters! Here's an example of the parsed text from the second page of the document:

the current flows in the op-\nposite direction to the surface current. This shift of currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity with depth, is called the Ekman spiral .\nThevelocityofthesurfacecurrentisthesumoftheve-\nlocitiesoftheEkman,geostrophic,tidal,andothercurrents.The Ekman surface current or wind drift current dependsuponthespeedofthewind,itsconstancy,thelengthoftimeit has blown, and other factors. In general, however, winddriftcurrentisabout2percentofthewindspeed,oralittleless,indeepwaterwherethewindhasbeenblowingsteadi-ly for at least 12 hours.\n3203. Currents Related To Density Differences\nThe density of water varies with salinity, temperature,\nand pressure. At any given depth, the differences in densityaredueonlytodifferencesintemperatureandsalinity.With\nsufficientdata,mapsshowinggeographicaldensitydistribu-tion at a certain depth can be drawn, with lines connectingpoints of equal density. These lines would be similar to iso-bars on a weather map

Notice 2 problems:

many words are run together, with the blank spaces lost; for example:

currentdirectionswithdepth,combinedwiththedecreaseinveloc-ity

hyphenation is rendered inconsistently.
For example (see screenshot below):

op-
posite

is extracted as op-\nposite (with a newline),
while:

iso-
bars

is extracted as iso-bars (no newline!)
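
A minimal sketch of how the newline variant could be repaired after extraction, assuming the split always appears as a hyphen followed directly by a newline. The helper name `dehyphenate` is hypothetical and not part of pypdf:

```python
import re

# Hypothetical post-processing helper, not a pypdf API.
# Matches a word character, a hyphen, a newline, then another word character.
_LINE_END_HYPHEN = re.compile(r"(\w)-\n(\w)")


def dehyphenate(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line."""
    return _LINE_END_HYPHEN.sub(r"\1\2", text)


sample = "flows in the op-\nposite direction, similar to iso-bars"
print(dehyphenate(sample))
```

This turns op-\nposite into opposite but leaves iso-bars alone: without the newline, a line-break hyphen cannot be told apart from a genuine hyphenated compound, which is why the inconsistency gets in the way.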

Code + PDF

Ocean Currents.pdf
(full document attached; please add to your tests)

# Installed with: pip install pypdf   (using version 3.16.4)
from pypdf import PdfReader

pdf_name = "Ocean Currents.pdf"
reader = PdfReader(pdf_name)
p = reader.pages[1]    # second page of the document
p.extract_text()

Thoughts

I suspect you'll say that the attached PDF is malformed. Maybe it is... but another library, PyMuPDF, parses it just fine.

In fact, I have noticed that the lost spaces are far more prevalent in extractions by pypdf, compared to PyMuPDF - BUT for some files it's the opposite, and pypdf does far better.

Empirically, I've noticed an intriguing complementarity between pypdf and PyMuPDF: for files where one messes up badly, the other one does well - and vice versa. Maybe they use different thresholds for detecting blank spaces in sentences?

But the inconsistent hyphenation I mentioned at the beginning is another issue that seriously gets in the way of word extraction...

Thanks!


stefan6419846 · 2023-10-23T06:01:50Z

Please see the corresponding docs as well: https://pypdf.readthedocs.io/en/latest/user/extract-text.html#why-text-extraction-is-hard

To summarize: Text extraction is hard and involves quite some guessing. By default you only have individual character positions; all remaining steps rely on heuristics to form words etc., so they are not always correct. (Speaking of (Py)MuPDF: They provide commercial solutions as well and thus might have better general results.)
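
To illustrate the kind of guessing involved, here is a toy sketch of a gap-based space heuristic. It is not pypdf's actual algorithm; the glyph tuples, the function name `join_glyphs` and the `threshold` parameter are all assumptions made up for this example:

```python
from typing import List, Tuple

# Hypothetical data model: (character, x_start, x_end) in arbitrary units.
Glyph = Tuple[str, float, float]


def join_glyphs(glyphs: List[Glyph], font_size: float, threshold: float = 0.2) -> str:
    """Insert a space when the gap to the previous glyph exceeds threshold * font_size."""
    out = []
    prev_end = None
    for char, x_start, x_end in glyphs:
        if prev_end is not None and x_start - prev_end > threshold * font_size:
            out.append(" ")
        out.append(char)
        prev_end = x_end
    return "".join(out)


# A tightly set "of it": the gap between "f" and "i" is only 1.5 units.
glyphs = [("o", 0.0, 5.0), ("f", 5.0, 8.0), ("i", 9.5, 11.0), ("t", 11.0, 14.0)]
print(join_glyphs(glyphs, font_size=10.0, threshold=0.2))  # gap 1.5 is below 2.0
print(join_glyphs(glyphs, font_size=10.0, threshold=0.1))  # gap 1.5 is above 1.0
```

With the looser threshold the two words are merged, with the tighter one they are separated. The same trade-off applied to a whole page explains why one fixed value works for some documents and fails for others.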

MartinThoma · 2023-10-23T07:53:59Z

Thanks for sharing the file and some examples! This helps a lot to refine our heuristics.

I agree with everything @stefan6419846 said. There is little hope of ever solving this completely for all PDF documents.

Do you own the license of that file or is it public domain? I'm always interested in refining my benchmark for text extraction

BrainAnnex · 2023-10-24T23:09:26Z

It'd be nice to have a user-settable threshold, for situations (not super-common, but not exactly rare, either - in my tests) when the words are not spaced enough for the algorithm to make the right choice.

Does such a setting exist?

BrainAnnex · 2023-10-24T23:11:58Z

Also, the inconsistent hyphenation (sometimes leading to extracted text with a newline and sometimes without) is a separate issue altogether. Maybe I ought to have started 2 separate discussion threads...

BrainAnnex · 2023-10-24T23:15:16Z

> Do you own the license of that file or is it public domain? I'm always interested in refining my benchmark for text extraction

@MartinThoma - it's the PDF version of a book I used to own. I don't know if it's public domain. Doesn't a page extracted for technical tests qualify for "fair use"?

MartinThoma · 2023-10-29T11:55:56Z

> it's the PDF version of a book I used to own. I don't know if it's public domain

In that case I would advise against sharing it publicly. Private sharing might be OK, but I'm not a lawyer and I don't want to get into / cause issues 😅

ssjkamei · 2024-10-03T13:49:23Z

Line breaks were not being generated for this PDF.
The cause appears to be a unit error in how the text leading (TL) is handled.

According to the TL entry in the spec quoted below, the operand has to be multiplied by the font size when TL is processed.
Note that the value is expressed in unscaled text space units, not pt.

PDF 1.7
Table 105 - Text state operators (continued)
Set the text leading, Tl, to leading, which shall be a number expressed in unscaled text space units. Text leading shall be used only by the T*, ', and " operators. Initial value: 0.

ssjkamei · 2024-10-04T02:32:38Z

The fix only requires the following change to _page.py.
I think I should make the change, but is this something you can't accept without a test case?

before:

            elif operator == b"TL":
                check_crlf_space = True
                TL = float(operands[0])

after:

            elif operator == b"TL":
                check_crlf_space = True
                scale_x = math.sqrt(tm_matrix[0]**2 + tm_matrix[2]**2)
                TL = float(operands[0]) * font_size * scale_x
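
As a rough standalone illustration of the arithmetic in the change above (not the actual pypdf code path): the helper name `scaled_leading` and the six-element matrix layout `[a, b, c, d, e, f]` are assumptions made for this sketch.

```python
import math
from typing import Sequence


def scaled_leading(tl_operand: float, font_size: float, tm_matrix: Sequence[float]) -> float:
    """Convert a TL operand (unscaled text space units) using the same formula as the snippet above.

    tm_matrix is assumed to be a six-element text matrix [a, b, c, d, e, f].
    """
    # Same scale factor as in the proposed change.
    scale_x = math.sqrt(tm_matrix[0] ** 2 + tm_matrix[2] ** 2)
    return float(tl_operand) * font_size * scale_x


# Identity text matrix, 10 pt font, TL operand of 1.2.
print(scaled_leading(1.2, 10.0, [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]))
```

With the unmodified code the leading would be taken as 1.2, roughly a tenth of the 10-unit font size, which is likely too small for the line-break check to register a new line.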

stefan6419846 · 2024-10-04T07:47:11Z

> I think I should make the change, but is this something you can't accept without a test case?

Yes, this deserves a corresponding test case if it is not already covered by the existing tests.

MartinThoma added the workflow-text-extraction and whitespace labels Oct 23, 2023

ssjkamei mentioned this issue Oct 4, 2024

BUG: Line breaks are not generated due to incorrect calculation at the text leading #2890

Merged

stefan6419846 closed this as completed in #2890 Oct 4, 2024

stefan6419846 closed this as completed in fcb103a Oct 4, 2024
