inconsistent coordinate systems when cropping #1181
Thank you for the detailed issue, @wodny. I'm not sure the response below resolves the entirety of what you're seeing, but it seems like a decent place to start. As I understand it, a core problem you're seeing is this:
import pdfplumber
pdf = pdfplumber.open("pages-cut.pdf")
page = pdf.pages[1]
print(page.crop(page.bbox).extract_text())
... returns a blank string. Indeed, with a normal PDF, that'd be unexpected. But it seems the reason this is happening is that the coordinates of the page's characters are all outside the page's bbox (
{'matrix': (8.000022, 0.0, 0.0, 8.000022, -318.393281, 342.49985499999997),
...
'x0': -318.393281,
...
'x1': -313.51326758,
...
'width': 4.880013420000012,
'height': 8.000022000000001,
'size': 8.000022000000001,
...
'text': 'F',
...
'top': 246.4637276420001,
'bottom': 254.4637496420001,
...}
Given those coordinates, I would not expect that character to be retained after
Of course, if we look at the PDF itself in a PDF viewer, the characters appear normally. This suggests to me two possibilities, although perhaps I'm overlooking others:
What do you make of this assessment? Does it change your belief that there's a bug in
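To make the point about out-of-bounds coordinates concrete, here is a minimal sketch that tests whether a character's box overlaps a page's bbox. The character values are the ones quoted above; the page bbox is a hypothetical placeholder (the real one is not shown in this thread), and `bbox_overlaps` is an illustrative helper, not part of pdfplumber.

```python
def bbox_overlaps(a, b):
    """Return True if two (x0, top, x1, bottom) boxes share any area."""
    ax0, atop, ax1, abottom = a
    bx0, btop, bx1, bbottom = b
    return ax0 < bx1 and bx0 < ax1 and atop < bbottom and btop < abottom


# Character coordinates as quoted in the comment above.
char = {
    "x0": -318.393281,
    "x1": -313.51326758,
    "top": 246.4637276420001,
    "bottom": 254.4637496420001,
}
char_bbox = (char["x0"], char["top"], char["x1"], char["bottom"])

# Hypothetical page bbox (assumption): a right-hand half of a wider page.
page_bbox = (297.0, 0.0, 595.0, 842.0)

# The character's x-range is entirely negative, so it cannot overlap the page.
print(bbox_overlaps(char_bbox, page_bbox))  # False
```

With any page bbox whose x-range is non-negative, a character with a negative `x1` falls outside it, which is consistent with the crop returning no text.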
I have done some more debugging (added some to Some additional notes about
I have created a piece of code that creates pages the way
#!/usr/bin/env python3
from pprint import pprint
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.converter import PDFPageAggregator
from pdfminer.pdfinterp import PDFPageInterpreter

def get_layout(rsrcmgr, page, i):
    device = PDFPageAggregator(
        rsrcmgr,
        pageno=i
    )
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    interpreter.process_page(page)
    return device.get_result()

with open("pages-cut-x.pdf", "rb") as stream:
    doc = PDFDocument(PDFParser(stream))
    rsrcmgr = PDFResourceManager()
    pages = list(PDFPage.create_pages(doc))
    layouts = [ get_layout(rsrcmgr, page, i) for i, page in enumerate(pages, 1) ]

for i, layout in enumerate(layouts):
    print(i, pages[i], layout)
    for obj in layout._objs:
        if obj.__class__.__name__ == "LTRect":
            print(obj)
    print()
This gives me the following output:
def begin_page(self, page: PDFPage, ctm: Matrix) -> None:
    (x0, y0, x1, y1) = page.mediabox
    (x0, y0) = apply_matrix_pt(ctm, (x0, y0))
    (x1, y1) = apply_matrix_pt(ctm, (x1, y1))
    mediabox = (0, 0, abs(x0 - x1), abs(y0 - y1))
    self.cur_item = LTPage(self.pageno, mediabox)
while
This means that in terms of geometry correct results are generated if
if strict:
    test_proposed_bbox(crop_bbox, parent_page.bbox)
should probably become this:
if strict:
    test_proposed_bbox(crop_bbox, parent_page.layout.bbox)
Additionally
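The mismatch described above can be sketched without pdfminer installed. The snippet below re-implements the arithmetic of `begin_page` in plain Python and compares a crop box against both candidate reference boxes. The MediaBox values, the translation-only CTM, and the helpers `layout_bbox` and `is_within` are assumptions for illustration; the y-axis flip that pdfplumber applies is ignored here.

```python
def apply_matrix_pt(m, pt):
    """Apply a 6-element affine matrix (a, b, c, d, e, f) to a point."""
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + c * y + e, b * x + d * y + f)


def layout_bbox(mediabox, ctm):
    """Mirror the begin_page arithmetic: the result always starts at (0, 0)."""
    x0, y0, x1, y1 = mediabox
    x0, y0 = apply_matrix_pt(ctm, (x0, y0))
    x1, y1 = apply_matrix_pt(ctm, (x1, y1))
    return (0, 0, abs(x0 - x1), abs(y0 - y1))


def is_within(inner, outer):
    """Return True if box `inner` lies entirely inside box `outer`."""
    return (
        inner[0] >= outer[0]
        and inner[1] >= outer[1]
        and inner[2] <= outer[2]
        and inner[3] <= outer[3]
    )


# Hypothetical MediaBox of a page cut from the right half of a wider page.
mediabox = (300, 0, 600, 800)
# Assumed CTM: a pure translation that moves the MediaBox origin to (0, 0).
ctm = (1, 0, 0, 1, -mediabox[0], -mediabox[1])

layout = layout_bbox(mediabox, ctm)
print(layout)  # (0, 0, 300, 800)

# A crop box expressed in layout coordinates (where the objects live)
# is rejected when checked against the MediaBox-based page bbox ...
crop = (0, 0, 300, 800)
print(is_within(crop, mediabox))  # False
# ... but accepted when checked against the layout bbox.
print(is_within(crop, layout))  # True
```

Under these assumptions the objects and the strict check use different origins, which is why checking against the layout's bbox makes the geometry consistent.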
Another big thanks for the detailed and thoughtful response, @wodny. This is a helpful clue you shared:
... in conjunction with this from the first block of output in your response:
As I understand it, And yet, as you point out, Reverting With the changes in 9025c3f, this code:
import pdfplumber
pdf = pdfplumber.open("pages-cut-x.pdf")
for i, p in enumerate(pdf.pages):
    print(f"--- Page {i + 1} ---")
    print(p.crop(p.bbox).extract_table())
    print("")
... produces this, which seems like the expected output:
--- Page 1 ---
[['FooCol1', 'FooCol2', 'FooCol3'], ['Foo4', 'Foo5', 'Foo6'], ['Foo7', 'Foo8', 'Foo9'], ['Foo10', 'Foo11', 'Foo12'], ['', '', '']]
--- Page 2 ---
[['BarCol1', 'BarCol2', 'BarCol3'], ['Bar4', 'Bar5', 'Bar6'], ['Bar7', 'Bar8', 'Bar9'], ['Bar10', 'Bar11', 'Bar12'], ['', '', '']]
I've also tested it with alterations to the MediaBox's y coordinates, and the fix seems robust to those as well. The proposed changes are available on the issue-1181 branch. Could you try it out and let me know if it resolves the issue for you?
Indeed that makes the most sense.
And just before that it creates a transformation matrix based on the mediabox.
It seems it started 15 years ago but is not described as a paradigm change but rather a means to an end (as I interpret it):
Oh, this is nice (unifying the objects' coordinates)! pdfminer does no scaling in the CTM/LTPage adjustment, so offsetting should be enough. It also works after adding
It does resolve the issue. Thank you.
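As a rough illustration of why an offset alone can reconcile the two coordinate systems when no scaling is involved, here is a small sketch. It is not the change made in 9025c3f; `offset_bbox` and all coordinate values are hypothetical.

```python
def offset_bbox(bbox, dx, dy):
    """Shift an (x0, y0, x1, y1) box by (dx, dy) without scaling it."""
    x0, y0, x1, y1 = bbox
    return (x0 + dx, y0 + dy, x1 + dx, y1 + dy)


# Hypothetical MediaBox that does not start at (0, 0).
mediabox = (300.0, 0.0, 600.0, 800.0)

# Hypothetical object box in layout coordinates (origin at 0, 0).
obj_layout_bbox = (10.0, 20.0, 50.0, 40.0)

# Moving into MediaBox coordinates only requires adding the MediaBox origin.
obj_page_bbox = offset_bbox(obj_layout_bbox, mediabox[0], mediabox[1])
print(obj_page_bbox)  # (310.0, 20.0, 350.0, 40.0)

# The shift is reversible, and the object's width and height are unchanged.
restored = offset_bbox(obj_page_bbox, -mediabox[0], -mediabox[1])
print(restored == obj_layout_bbox)  # True
```

If the adjustment ever included scaling or rotation, a plain offset would no longer be sufficient and the full matrix would have to be applied.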
Great, thanks! And thanks for the additional links and checks. The fix is now broadly available in the new
It's not really clear to me that this is a bug in
You could say that Both are "correct" in that any interpretation of device space is correct, as long as it's understood what it means.
Describe the bug
It seems that Page.crop():
strict=True),
For pages which have a bbox not starting at (0, 0) this causes page.crop(page.bbox) to return an empty set of objects. Adding relative=True does not help because it makes it two times worse. This is related to #245.
Note that externally provided PDFs may already be cropped. This is what mutool (MuPDF) does when using the poster function. It copies the page into multiple pages and then adjusts their MediaBox.
Have you tried repairing the PDF?
Repairing the PDF fixes the problem but:
crop().
BTW, README.md doesn't mention repair.
Code to reproduce the problem
In pdfplumber's utils/geometry.py:
Effect:
PDF file
mutool poster -x 2,
Additionally available at my page.
Expected behavior
page.crop(page.bbox) should be more-or-less an identity transformation.
Actual behavior
page.crop(page.bbox) returns an empty page and complains with strict=True when bbox does not start at (0, 0).
Screenshots
Original before cutting (two tables on one page later cut in half):
Environment
Additional context
If it gets accepted as a bug I can propose a patch.
It would probably look something like this: