This gem is not able to extract the line near pdf page break #260

Open
shibli786 opened this issue Nov 11, 2017 · 1 comment


shibli786 commented Nov 11, 2017

Issue 1
The gem sometimes fails to extract the line just before a PDF page break.
I have attached a PDF file that shows the problem. Extract the text and check page 2: the last line on that page (just before the page break) is missing from the output:

  **"TA PIMCO Total Return- Service Class 3 5/1/2002 -0.36 3.72 0.92 1.68 0.62 3.13 3.22"**
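
A minimal sketch of how the report above could be checked, assuming the attachment is saved locally as `abc.pdf`. The helper names (`normalise`, `line_extracted?`, `page_text`) are hypothetical; only `PDF::Reader.new`, `#page` and `#text` come from the gem. Whitespace is collapsed before comparing because extracted text pads columns with spaces.

```ruby
# Hypothetical helpers for reproducing the report; not part of pdf-reader.

# Collapse runs of whitespace so column padding does not affect comparison.
def normalise(line)
  line.strip.gsub(/\s+/, " ")
end

# Return true when `expected` appears within some line of `page_text`.
def line_extracted?(page_text, expected)
  wanted = normalise(expected)
  page_text.each_line.any? { |line| normalise(line).include?(wanted) }
end

# Extract the text of one page (1-based page number) with pdf-reader.
def page_text(path, page_number)
  require "pdf-reader" # loaded lazily so the helpers above work without the gem
  PDF::Reader.new(path).page(page_number).text
end

# The line quoted in the report.
EXPECTED = "TA PIMCO Total Return- Service Class 3 5/1/2002 " \
           "-0.36 3.72 0.92 1.68 0.62 3.13 3.22"

# Only runs when executed directly and the attachment is present.
if __FILE__ == $PROGRAM_NAME && File.exist?("abc.pdf")
  puts line_extracted?(page_text("abc.pdf", 2), EXPECTED) ? "found" : "missing"
end
```

Running this against different gem versions would show whether the last line of page 2 is present in the extracted text.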

Issue 2
If the text has a subscript, the subscript is sometimes appended to the preceding word and sometimes placed on a new line (\n). Please extract the content and check the resulting text file.
abc.pdf

Owner

yob commented Oct 26, 2019

Issue one seems to have been resolved - I can't reproduce it on the latest release (v2.2.1).

Issue two will be harder to address in a consistent way.

In this particular PDF, the superscript numbers are regular digits printed in a smaller font (not Unicode superscript codepoints). That makes it hard to reliably identify them as superscripts.

With a bit of tweaking to the page layout algorithm it'd probably be possible to have them rendered on the same line as the text they're associated with, but they'd appear as full-height normal numbers.
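
To illustrate the idea, here is a rough sketch of grouping text runs into lines with a baseline tolerance scaled by font size. It is not the gem's layout algorithm: the `Run` struct, `group_into_lines` and the `tolerance` value are all assumptions made for this example. A small raised digit lands on the same line as the larger text next to it and comes out as an ordinary number.

```ruby
# Hypothetical text run: baseline origin (x, y), font size, and text.
# This is an illustrative structure, not pdf-reader's internal type.
Run = Struct.new(:x, :y, :font_size, :text)

# Group runs into lines. A run joins an existing line when its baseline is
# within `tolerance` times that line's font size of the line's baseline, so a
# small raised number stays with the word it annotates.
def group_into_lines(runs, tolerance: 0.6)
  lines = []
  # Process larger fonts first so body text defines each line's baseline.
  runs.sort_by { |r| -r.font_size }.each do |run|
    line = lines.find do |l|
      (l[:baseline] - run.y).abs <= tolerance * l[:font_size]
    end
    if line
      line[:runs] << run
    else
      lines << { baseline: run.y, font_size: run.font_size, runs: [run] }
    end
  end
  # Top of page first (PDF y grows upward), then left to right within a line.
  lines.sort_by { |l| -l[:baseline] }
       .map { |l| l[:runs].sort_by(&:x).map(&:text).join(" ") }
end

runs = [
  Run.new(72,  700, 12, "Service Class"),
  Run.new(150, 704, 7,  "3"),          # smaller, raised digit
  Run.new(160, 700, 12, "5/1/2002"),
  Run.new(72,  688, 12, "Next line")
]
puts group_into_lines(runs)
```

The tolerance is a trade-off: too small and raised digits fall onto their own line, too large and tightly spaced lines merge. Nothing in the output marks the digit as a superscript, which matches the limitation described above.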
