This gem is not able to extract the line near pdf page break #260

shibli786 · 2017-11-11T19:02:46Z

Issue 1
This gem sometimes fails to extract the line just before a PDF page break.
I have attached a PDF file that shows the problem.
In the extracted text, the last line on page 2 of the attached PDF (just before the page break)

  **"TA PIMCO Total Return- Service Class 3 5/1/2002 -0.36 3.72 0.92 1.68 0.62 3.13 3.22"**

is not getting extracted
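
A minimal sketch of how the report could be checked, assuming the attachment is saved locally as `abc.pdf`. The helper `line_extracted?` is hypothetical and not part of the gem; it only compares strings with whitespace collapsed, because layout-aware extraction may pad columns with extra spaces. `PDF::Reader.new` and `page(2).text` are the gem's public calls for opening a file and extracting one page.

```ruby
# Hypothetical helper (not part of pdf-reader): true when `expected` occurs in
# `extracted`, treating any run of whitespace (spaces, newlines) as one space.
def line_extracted?(extracted, expected)
  normalize = ->(s) { s.split.join(" ") }
  normalize.call(extracted).include?(normalize.call(expected))
end

# The line quoted in the report.
expected_line = "TA PIMCO Total Return- Service Class 3 5/1/2002 " \
                "-0.36 3.72 0.92 1.68 0.62 3.13 3.22"

# Only touch the gem and the file when run directly with the PDF present.
if __FILE__ == $PROGRAM_NAME && File.exist?("abc.pdf")
  require "pdf-reader"
  text = PDF::Reader.new("abc.pdf").page(2).text
  puts line_extracted?(text, expected_line) ? "line found" : "line missing"
end
```

Because whitespace is collapsed across newlines, this check also passes if the line was extracted but split over two output lines, so a "line missing" result points at text that is genuinely absent.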

Issue 2
If the text contains a subscript, the subscript is sometimes appended directly to the preceding word and sometimes placed on a new line (\n). Please extract the content of the attached file and check the resulting text.
abc.pdf


yob · 2019-10-26T12:09:56Z

Issue one seems to have been resolved - I can't reproduce it on the latest release (v2.2.1).

Issue two will be harder to address in a consistent way.

In this particular PDF, the superscript numbers are regular digits printed in a smaller font (not Unicode superscript codepoints). That makes it hard to reliably identify them as superscripts.

With a bit of tweaking to the page layout algorithm it'd probably be possible to have them rendered on the same line as the text they're associated with, but they'd appear as full-height normal numbers.
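
To illustrate the kind of tweak described above, here is a rough sketch of grouping text runs into lines with a baseline tolerance. It is not the gem's actual layout code: the run hashes (`:text`, `:x`, `:y`, `:font_size`), the method `group_into_lines`, and the `tolerance` value are all assumptions made up for this example.

```ruby
# Hypothetical sketch, not pdf-reader's implementation.
# Each run is a hash: { text:, x:, y:, font_size: } in PDF coordinates.
#
# A run joins an existing line when its baseline is within
# `tolerance * line font size` of that line's baseline, so a small, slightly
# raised number lands on the line of the word it follows.
def group_into_lines(runs, tolerance: 0.5)
  lines = []
  # Place larger fonts first so body text establishes each line's baseline.
  runs.sort_by { |r| -r[:font_size] }.each do |run|
    line = lines.find { |l| (l[:y] - run[:y]).abs <= tolerance * l[:font_size] }
    if line
      line[:runs] << run
    else
      lines << { y: run[:y], font_size: run[:font_size], runs: [run] }
    end
  end
  # Top of page first (PDF y grows upward), then left to right within a line.
  lines.sort_by { |l| -l[:y] }.map do |l|
    l[:runs].sort_by { |r| r[:x] }.map { |r| r[:text] }.join(" ")
  end
end

runs = [
  { text: "Service Class", x: 50,  y: 700, font_size: 10 },
  { text: "3",             x: 112, y: 704, font_size: 6 },  # raised, smaller
  { text: "5/1/2002",      x: 130, y: 700, font_size: 10 },
  { text: "Next row",      x: 50,  y: 686, font_size: 10 }
]
puts group_into_lines(runs)
# Service Class 3 5/1/2002
# Next row
```

With a tight tolerance such as `0.1` the raised "3" ends up on its own line, which resembles the behaviour reported in Issue 2. With the looser default it joins its neighbours, but only as a plain full-height digit, which matches the limitation noted above.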
