extracted text does not match text of pdf #118

Open
pblesi opened this issue Jan 7, 2014 · 2 comments

pblesi commented Jan 7, 2014

reader.pages.at(3).text produces this output:

• FAX/Scanner/Copiers
• 2 Digital Cameras
• 1 Cisco Router
• Hub

However, the text shown when the PDF is rendered is:

4 FAX/Scanner/Copiers
2 Digital Cameras
1 Cisco Router
1 Hub

As you can see, the numbers for two of the list items ("4 FAX/Scanner/Copiers" and "1 Hub") are missing from the extracted text.

It appears I cannot attach the PDF file, but the raw content for this page is:

/C2_0 1 Tf
0 Tc 0 Tw 12 0 0 12 97.2 186.9 Tm
<0078>Tj
/TT2 1 Tf
-0.0004 Tc 0.0026 Tw 0.46 0 Td
[( )-760(2 Poly Com systems )]TJ
ET
EMC
/P <</MCID 29 >>BDC
BT
/C2_0 1 Tf
0 Tc 0 Tw 12 0 0 12 97.2 172.26 Tm
<0078>Tj
/TT2 1 Tf
-0.0002 Tc 0.7624 Tw 0.46 0 Td
[( 4 )760(FAX/Scanner/Copiers )]TJ
ET
EMC
/P <</MCID 30 >>BDC
BT
/C2_0 1 Tf
0 Tc 0 Tw 12 0 0 12 97.2 157.68 Tm
<0078>Tj
/TT2 1 Tf
-0.0002 Tc 0.0024 Tw 0.46 0 Td
[( )-760(2 Digita)-4(l)2( Cameras )]TJ
ET
EMC
/P <</MCID 31 >>BDC
BT
/C2_0 1 Tf
0 Tc 0 Tw 12 0 0 12 97.2 143.04 Tm
<0078>Tj
/TT2 1 Tf
-0.0002 Tc 0.0024 Tw 0.46 0 Td
[( )-760(1 Cisco Router )]TJ
ET
EMC
/P <</MCID 32 >>BDC
BT
/C2_0 1 Tf
0 Tc 0 Tw 12 0 0 12 97.2 128.46 Tm
<0078>Tj
/TT2 1 Tf
-0.0014 Tc 0.7636 Tw 0.46 0 Td
[( 1 )760(Hub )]TJ
ET
EMC
/P <</MCID 33 >>BDC
BT
/C2_0 1 Tf
0 Tc 0 Tw 12 0 0 12 97.2 113.82 Tm
<0078>Tj
/TT2 1 Tf
-0.0004 Tc 0.0026 Tw 0.46 0 Td
[( )-760(6 NEC projectors mounted on portable carts )]TJ
ET
EMC
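
To compare the TJ arrays above side by side, here is a minimal Ruby sketch. It is not pdf-reader's parser: `parse_tj_array`, `tj_text` and `largest_backward_move` are hypothetical helpers written only for the arrays quoted in this issue. Per the PDF specification, a number inside a TJ array is in thousandths of a text-space unit and is subtracted from the current position, so a positive value moves backwards. In this data the two lines that lose their number are the only ones with a large positive adjustment (760), which may be a useful lead but is not a confirmed diagnosis.

```ruby
# Simplified tokenizer for the TJ arrays quoted above. It does not handle
# escaped or nested parentheses, or hex strings such as <0078>.
def parse_tj_array(source)
  body = source[/\[(.*)\]\s*TJ/m, 1]
  raise ArgumentError, "no TJ array found" if body.nil?

  body.scan(/\(([^)]*)\)|(-?\d+(?:\.\d+)?)/).map do |string, number|
    string.nil? ? Float(number) : string
  end
end

# Concatenate only the string operands, ignoring position adjustments.
def tj_text(tokens)
  tokens.grep(String).join
end

# TJ numbers are thousandths of a text-space unit and are subtracted from
# the current position, so a positive number moves the pen backwards.
# Returns the largest backward move in text-space units (0.0 if none).
def largest_backward_move(tokens)
  tokens.grep(Float).select(&:positive?).push(0.0).max / 1000.0
end

# The six TJ arrays copied from the raw page content in this issue.
TJ_LINES = [
  "[( )-760(2 Poly Com systems )]TJ",
  "[( 4 )760(FAX/Scanner/Copiers )]TJ",
  "[( )-760(2 Digita)-4(l)2( Cameras )]TJ",
  "[( )-760(1 Cisco Router )]TJ",
  "[( 1 )760(Hub )]TJ",
  "[( )-760(6 NEC projectors mounted on portable carts )]TJ",
].freeze

TJ_LINES.each do |line|
  tokens = parse_tj_array(line)
  puts format("%-45s backward move: %.3f",
              tj_text(tokens).strip, largest_backward_move(tokens))
end
```

The string operands alone contain every number, including the "4" and the "1", so the digits are present in the content stream and appear to be lost at a later stage.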

aarmora commented Jul 21, 2015

Did you find a solution for this? I believe I'm facing a similar issue.

Owner

yob commented Feb 14, 2017

I suspect this is an issue with our text layout algorithms in the PageLayout class.

Unfortunately I'm short on time at the moment, but I'll happily accept patches if you want to investigate further.
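
For anyone who wants to investigate, one possible starting point is to log what the content stream hands to the text receiver before any layout happens. The sketch below assumes pdf-reader's `Page#walk` callback interface, where the `Tw`, `Tj` and `TJ` operators are delivered as `set_word_spacing`, `show_text` and `show_text_with_positioning`. `TextOperatorLogger` is a hypothetical helper, not part of the gem, and the PDF path is whatever file you pass on the command line.

```ruby
# Hypothetical diagnostic receiver: records text-showing operators along
# with the word spacing (Tw) in effect when each one was seen.
class TextOperatorLogger
  attr_reader :events

  def initialize
    @events = []
    @word_spacing = 0.0
  end

  # Tw operator
  def set_word_spacing(value)
    @word_spacing = value.to_f
  end

  # Tj operator: a single string
  def show_text(string)
    @events << { op: :Tj, word_spacing: @word_spacing, operands: [string] }
  end

  # TJ operator: an array mixing strings and numeric adjustments
  def show_text_with_positioning(params)
    @events << { op: :TJ, word_spacing: @word_spacing, operands: params.dup }
  end
end

# Only runs when a PDF path is given, e.g. `ruby log_text_ops.rb file.pdf`.
# Assumes the pdf-reader gem is installed.
if __FILE__ == $PROGRAM_NAME && ARGV[0]
  require "pdf-reader"

  reader = PDF::Reader.new(ARGV[0])
  page = reader.pages.at(3) # same page index as in the original report
  logger = TextOperatorLogger.new
  page.walk(logger)
  logger.events.each { |event| p event }
end
```

If the logged `TJ` operands still contain " 4 " and " 1 ", the digits are reaching the receiver and are being dropped later, which would be consistent with the suspicion about the layout code.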
