multiple xrefs in pdf cause page extraction return only the first page #214

Closed
meldonization opened this issue Dec 12, 2018 · 5 comments · Fixed by #513 or #535
@meldonization

PDF file: bad_page_number.pdf

  • pdfminer extracts only the first page of the file
  • parser.bufpos stops at 10325
  • the file appears to contain two %%EOF lines
import os

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator

# open() does not expand "~", so resolve the home directory explicitly.
fpath = os.path.expanduser('~/bad_page_number.pdf')
with open(fpath, 'rb') as fp:
    parser = PDFParser(fp)
    laparams = LAParams()
    document = PDFDocument(parser)
    rsrcmgr = PDFResourceManager()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Counts the pages pdfminer finds; for this file only the first page is returned.
    print(len(list(PDFPage.create_pages(document))))
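To check the observation about two %%EOF lines without going through pdfminer, the raw bytes can be scanned directly. This is only a rough diagnostic sketch: count_eof_markers is a hypothetical helper, and the sample bytes are a made-up miniature rather than content from bad_page_number.pdf. For a real file, pass the result of reading it in binary mode.

```python
def count_eof_markers(data: bytes) -> int:
    """Count lines that consist only of the %%EOF marker."""
    return sum(1 for line in data.splitlines() if line.strip() == b"%%EOF")


# Made-up miniature of a file with two sections, each ending in its own
# startxref / %%EOF pair. The offsets are arbitrary placeholders.
sample = (
    b"%PDF-1.4\n"
    b"startxref\n0\n%%EOF\n"
    b"startxref\n200\n%%EOF\n"
)

eof_count = count_eof_markers(sample)
print(eof_count)
```

A count above one usually means the file was saved incrementally, so it carries more than one xref section.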

Could you please look into this problem?

@bmteller

bmteller commented Sep 9, 2020

This can be fixed by changing line.startswith(b'trailer') to line.strip().startswith(b'trailer'): https://github.com/pdfminer/pdfminer.six/blob/develop/pdfminer/pdfdocument.py#L102

The document has an xref table in which the trailer keyword is indented:

0003575084 00000 n
 trailer
<</Size 973
>>
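The effect of the proposed change can be seen in isolation on the lines quoted above. This is a minimal sketch, not the pdfminer.six source: is_trailer_line is a hypothetical helper that only mirrors the suggested line.strip().startswith(b'trailer') check.

```python
def is_trailer_line(line: bytes) -> bool:
    """Return True if the line starts the trailer, ignoring surrounding whitespace."""
    return line.strip().startswith(b"trailer")


# The xref tail quoted in the comment above, with the indented keyword.
lines = [b"0003575084 00000 n\n", b" trailer\n", b"<</Size 973\n", b">>\n"]

# Strict check: the leading space hides the keyword, so nothing matches.
strict = [line.startswith(b"trailer") for line in lines]
# Lenient check: stripping first lets the indented keyword be recognised.
lenient = [is_trailer_line(line) for line in lines]

print(strict)
print(lenient)
```

With the strict check no line is recognised as the trailer, so a loop that breaks on the trailer keyword would not stop at this line.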

@pietermarsman
Member

Looks like a valid solution. Feel free to implement if you have the time.

@jstockwin
Member

I can take a look at this

@jstockwin jstockwin self-assigned this Sep 29, 2020
jstockwin added a commit to jstockwin/pdfminer.six that referenced this issue Sep 29, 2020
pietermarsman added a commit that referenced this issue Oct 24, 2020
* Fix for when 'trailer' is indented

Closes #214

* Address CR comments - strip line after parsing

* Update CHANGELOG.md

Co-authored-by: Pieter Marsman <pietermarsman@gmail.com>
@pietermarsman
Member

@jstockwin I accidentally merged the PR while the build was broken. Could you see what the issue was and recreate the PR?

@pietermarsman pietermarsman reopened this Oct 25, 2020
@jstockwin
Member

Huh, oops. Sure, I will take another look...
