Problem with coordinates in some PDF #43

Closed
kermitt2 opened this issue Feb 28, 2019 · 7 comments

@kermitt2
Owner

I am trying to trace the cause of incorrect coordinates for string elements in some PDFs. One example is the attached PubMed Central PDF. Using it with GROBID and the PDF.js document display + annotations, we see that the bounding boxes for the annotations are not correct (while usually they are!).

The problem apparently comes from pdfalto, but I am not sure whether it is caused by an incorrect page dimension or by an incorrect origin point on the page for the string coordinates.

So in the attached PDF, all page dimensions are x:662, y:860. On the first page, the first token "Association" is positioned at x:85, y:126, w:115, h:17.8. The x/y position is visually incorrect: it should be x:57, y:90 (from PDF.js).

On the second page, the first token "Xia" is positioned at x:71, y:64, w:10, h:7.3; once again the x/y position is visually clearly not correct. It should be x:42, y:21 (from PDF.js).

Looking at XmlAltoOutputDev.cc and TextPage::startPage, the page coordinates come from GfxState and the page box, but I saw nothing there that looks really related to this :/
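To help tell an origin shift apart from a scaling problem, here is a minimal sketch that compares the two sets of coordinates quoted above. The numbers are the rounded values from this issue; the `compare` helper and the dictionary layout are only illustrative and are not part of pdfalto or PDF.js.

```python
# Rounded top-left positions quoted in this issue for the first token of each page.
reported = {
    "page 1, 'Association'": {"pdfalto": (85, 126), "pdfjs": (57, 90)},
    "page 2, 'Xia'": {"pdfalto": (71, 64), "pdfjs": (42, 21)},
}


def compare(pdfalto_xy, pdfjs_xy):
    """Return the (dx, dy) offset and the (rx, ry) ratio between two positions."""
    (ax, ay), (jx, jy) = pdfalto_xy, pdfjs_xy
    offset = (ax - jx, ay - jy)
    ratio = (round(ax / jx, 2), round(ay / jy, 2))
    return offset, ratio


for label, pos in reported.items():
    offset, ratio = compare(pos["pdfalto"], pos["pdfjs"])
    print(f"{label}: offset={offset} ratio={ratio}")
```

With these rounded values the x offsets are 28 and 29, while the ratios range from 1.4 to 3.05. That looks more like a shifted origin than a rescaled page, although the y offsets (36 and 43) differ between the two pages, so this is only a hint.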

PMC5348138.pdf

@Aazhar
Collaborator

Aazhar commented Mar 2, 2019

Indeed the annotations displayed by PDF.js are incorrect, but it's not related to pdfalto: using the macOS system viewer for selection shows that the coordinates are correct, for example for the token "Association":

screenshot 2019-03-02 at 15 07 26

I'm having a look at how PDF.js processes this.

@kermitt2
Owner Author

kermitt2 commented Mar 2, 2019

The coordinates that are displayed by Preview with the selection are not correct, look:

screen shot 2019-03-02 at 15 56 55

The origin is already shifted... I think the coordinates I put above from PDF.js are the correct/expected ones for origin at (0,0) (x:57, y:90 for top left corner of "Association" and not x:85, y:126 from pdfalto or x:83.92, y:124.62 from Preview). Or somehow the size of the page is not correct and should be shifted/rescaled accordingly.

@Aazhar
Collaborator

Aazhar commented Mar 2, 2019

OK, I think I know the problem: there are different kinds of page boxes (media/crop/bleed), each used by particular printing equipment, and when these boxes are not the same size it leads to such issues. I'll see how to fix this.
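As a rough illustration of why differing boxes shift the coordinates, the sketch below re-expresses a point measured from the top-left corner of the media box relative to the top-left corner of the crop box. The `Box` type, the `media_to_crop` helper and the box values are all hypothetical; they are not taken from the attached PDF or from pdfalto's code.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """A PDF page box in default user space (origin at the bottom-left)."""
    llx: float
    lly: float
    urx: float
    ury: float


def media_to_crop(x, y, media, crop):
    """Convert a top-left-origin point from media-box to crop-box coordinates."""
    dx = crop.llx - media.llx   # left margin trimmed by the crop box
    dy = media.ury - crop.ury   # top margin trimmed by the crop box
    return (x - dx, y - dy)


# Hypothetical boxes: the crop box trims 20 units on the left and 30 at the top.
media = Box(0, 0, 600, 800)
crop = Box(20, 30, 580, 770)

print(media_to_crop(100, 100, media, crop))
```

When the two boxes are identical both margins are zero and the point is unchanged, which would be consistent with the coordinates usually being correct.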

@Aazhar
Collaborator

Aazhar commented Mar 2, 2019

So I've made a change to use the crop box by default instead of the media box: b14cd4e

@Aazhar
Collaborator

Aazhar commented Mar 2, 2019

This should be a dynamic option on the pdfalto command line, what do you think?

@Aazhar
Collaborator

Aazhar commented Mar 2, 2019

Just a reminder, this was a legacy from pdf2xml...

@kermitt2
Owner Author

kermitt2 commented Mar 4, 2019

Yes, the issue was from pdf2xml!

Your fix entirely solves the issue for all my example cases, and everything is fine with usual documents, so it's super, many thanks!
