BUG: Added line-breaks at dashes #234

rnzucker · 2015-11-12T19:41:55Z

I've been trying out PyPDF2 and encountered cases where it is skipping text. It has no problem with one file (https://github.com/rnzucker/MadLib/blob/master/test-1.pdf), beyond adding newlines at 80 characters. But with another one (https://github.com/rnzucker/MadLib/blob/master/test-2.pdf, the beginning of a newspaper editorial), it starts with the "-time" from "prime-time" in the first line. It also skipped other text in the file. My code is very simple:

from PyPDF2 import PdfReader

reader = PdfReader("test-1.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text()
print(text)

JeremyMMulcahey · 2016-02-09T14:32:10Z

I'm having the same issue with transcripts. Some sections of dialogue are missing the first 1-3 lines when the speakers alternate in a conversation.

The conversational format is:
Speaker1:
Speaker2:
Speaker1:

Has there been any progress on this issue?
I'll poke around the package and see if anything jumps out.

MartinThoma · 2022-04-24T13:05:44Z

@rnzucker Would it be OK with you if I added those files to PyPDF2 (Resources) so that we can keep testing? (Under the package's BSD license)

rnzucker · 2022-04-24T17:03:38Z

Totally fine. They are just snippets of newspaper articles.

MartinThoma · 2022-06-26T08:29:39Z

Note to myself: The test-2 causes a newline where it shouldn't be. No text is missing (anymore).

The test-2.pdf is the following article of the New York Times from 2015: https://www.nytimes.com/2015/11/12/opinion/waiting-for-the-republican-shakeout.html -- I'm uncertain if we may add it.
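Until the extraction itself is fixed, the stray newline around dashes can be patched up after the fact. The following is only a minimal post-processing sketch, not part of PyPDF2: `rejoin_dash_breaks` is a hypothetical helper name, and it assumes that a line break sitting directly before or after a hyphen between two word characters is unwanted. That assumption does not hold for words that were genuinely hyphenated at the end of a line, which would keep their hyphen.

```python
import re

# Hypothetical helper, not a PyPDF2 API: repairs line breaks that land
# directly before or after a hyphen inside a compound word.

# "prime\n-time": the newline sits before the dash
_BREAK_BEFORE_DASH = re.compile(r"(\w)[ \t]*\n[ \t]*-(?=\w)")
# "attention-\ngetting": the newline sits after the dash
_BREAK_AFTER_DASH = re.compile(r"(\w)-[ \t]*\n[ \t]*(?=\w)")


def rejoin_dash_breaks(text: str) -> str:
    """Join word parts that were split by a newline next to a hyphen."""
    text = _BREAK_BEFORE_DASH.sub(r"\1-", text)
    text = _BREAK_AFTER_DASH.sub(r"\1-", text)
    return text
```

It would be applied to the string returned by `page.extract_text()`. Ordinary line breaks between words are left alone.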

pubpub-zz · 2022-07-10T08:54:02Z

These are the results with PR #1084 for test-2:

Watching Tuesday’s Republican presidential debate, with the eight prime -time contenders 
talking over and past one another, the question arises: Should the party show a fe w of these 
candidates the door?  
Some fret that this mash -up lacks seriousness. The Republican National Committee says it won’t 
intervene. It is relying on voters to usher also -rans off the national stage , and that may be a good 
thing.  
Americans won’t pay full attention to the presidenti al campaign for weeks. By the time they do, 
debates and media exposure will have made for worthy vetting of these candidates’ attention -
getting but illogical tax plans, their dubious statements, and that most symbolic but ridiculous of 
qualifications, thei r early biographies. Gov. Scott Walker’s exit suggests that fears of “super 
PAC” money’s keeping flawed candida tes afloat may not materialize.  
A number of conservative thinkers believe the shedding of vestigial candidates will happen soon 
enough. In a com ing book, Henry Olsen of the Ethics and Public Policy Center in Washington 
divides the Republican electorate into “four discrete factions that are based primarily on 
ideology, with elements of class and religious background tempering that focus.”

The extra spaces are introduced by Tm repositioning. I currently don't have an easy way to identify such a move as a 'simple' text repositioning that should not produce a space.
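One possible way to tell a 'simple' repositioning from a real word gap is to compare the horizontal jump against the width of a space in the current font. The sketch below is an assumption for illustration only, not the logic used in PR #1084 or anywhere in PyPDF2: the function name, its parameters, and the `gap_ratio` and `line_tol` thresholds are all hypothetical, and it presumes the end position of the previous text run is known (which requires glyph widths).

```python
def separator_for_reposition(
    prev_end_x: float,
    prev_y: float,
    new_x: float,
    new_y: float,
    font_size: float,
    space_width: float,
    gap_ratio: float = 0.5,  # assumed threshold, as a fraction of a space
    line_tol: float = 0.2,   # assumed baseline tolerance, as a fraction of font size
) -> str:
    """Pick the separator to emit when a Tm operator moves the text position.

    Returns a newline if the baseline changed, a space if the horizontal
    gap is large enough to look like a word gap, and an empty string otherwise.
    """
    # A noticeable vertical move means a new line started.
    if abs(new_y - prev_y) > line_tol * font_size:
        return "\n"
    # Same line: only a gap comparable to a space counts as a word break.
    gap = new_x - prev_end_x
    if gap > gap_ratio * space_width:
        return " "
    # Small or negative gap (kerning-like move): no separator.
    return ""
```

With such a check, a tiny move in front of the "-" in "prime-time" would produce no separator, while a move of roughly one space width would still produce a space. Suitable thresholds would have to be tuned against real files.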

* ENH: extract width from CIDFontType0/2
* ENH: improve cr/lf and space extraction
* BUG: fix error in decoding #1075
* FIX: in ToUnicode ignore comments (starting with %)
* FIX: extend utf16 for min of 4 characters

Improves #234
Improves #957
Closes #1003
Closes #1019

Used https://tug.ctan.org/info/symbols/comprehensive/symbols-a4.pdf for testing

stefan6419846 · 2024-10-03T13:16:53Z

According to #2882 (comment), this has just been fixed.

mstamy2 added the "workflow-text-extraction" label (From a user's perspective, text extraction is the affected feature/workflow) on May 19, 2016

MartinThoma added the "Has MCVE" label (A minimal, complete and verifiable example helps a lot to debug / understand feature requests) on Jun 6, 2022

MartinThoma added the "is-bug" label (From a user's perspective, this is a bug - a violation of the expected behavior with a compliant PDF) on Jun 26, 2022

MartinThoma changed the title from "PyPDF2 at times skipping text" to "BUG: Added line-breaks at dashes" on Jun 26, 2022

MartinThoma mentioned this issue Jul 11, 2022

ENH: Extract Text Enhancement (whitespaces) #1084

Merged

MartinThoma added the "whitespace" label (While doing extract_text, getting the right number of whitespaces (spaces and newlines) is hard.) on Feb 28, 2023

ssjkamei mentioned this issue Oct 3, 2024

BUG: Issue in text extraction (spaces) (#1153) #2882

Merged

stefan6419846 closed this as completed Oct 3, 2024
