Impossible (?) to programmatically read comments/annotations/highlights #17509

Closed
techvx opened this issue Jan 13, 2024 · 2 comments

techvx commented Jan 13, 2024

PDF file:

Given HowtoReadPaper.pdf, the goal is to read the first and only highlighted text, "How to Read a Paper".

Configuration:

  • Web browser and its version: n/a.
  • Operating system and its version: Windows 10, build 19045.3930
  • PDF.js version: 4.0.39
  • Is a browser extension: n/a

Steps to reproduce the problem:

  1. Create a new project with npm init or similar.
  2. Add pdfjs to the list of packages.
  3. Load the PDF into memory via getDocument.
  4. Manually loop through each of numPages with getPage and getAnnotations.
  5. Receive a completely opaque any[] at the end of the last call.
  6. With JSON.stringify, receive the "sample" from below.
  7. Find issue #5283, "Improving annotations API to get access to all annotations stored in the PDF", which offers no further pointers.
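
Steps 3 to 5 can be sketched roughly as follows. This is a minimal sketch, not the code from the report: `collectAnnotations` is a hypothetical helper name, and it relies only on `numPages`, `getPage` and `getAnnotations` as named in the steps. The `pdfjs-dist` package name and the import path in the usage comment are assumptions about the setup.

```javascript
// Hypothetical helper: gather the annotations of every page.
// `pdfDocument` is whatever `getDocument(...).promise` resolves to; only
// `numPages`, `getPage()` and `getAnnotations()` are used here.
async function collectAnnotations(pdfDocument) {
  const result = [];
  // Page numbers passed to getPage() start at 1, not 0.
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    const page = await pdfDocument.getPage(pageNumber);
    const annotations = await page.getAnnotations();
    for (const annotation of annotations) {
      result.push({ pageNumber, annotation });
    }
  }
  return result;
}

// Assumed usage (Node.js, with the `pdfjs-dist` package installed):
//   import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs";
//   const pdfDocument = await getDocument("HowtoReadPaper.pdf").promise;
//   const all = await collectAnnotations(pdfDocument);
//   console.log(JSON.stringify(all[0].annotation, null, 2));
```

Passing the document in as a parameter keeps the loop independent of how the file was loaded.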
Sample:
{
  "annotationFlags": 4,
  "borderStyle": {
    "width": 1,
    "style": 1,
    "dashArray": [3],
    "horizontalCornerRadius": 0,
    "verticalCornerRadius": 0
  },
  "color": { "0": 255, "1": 237, "2": 0 },
  "backgroundColor": null,
  "borderColor": null,
  "rotation": 0,
  "contentsObj": { "str": "", "dir": "ltr" },
  "hasAppearance": true,
  "id": "51R",
  "modificationDate": "",
  "rect": [216.073, 701.57, 393.644, 724.093],
  "subtype": "Highlight",
  "hasOwnCanvas": false,
  "noRotate": false,
  "noHTML": false,
  "titleObj": { "str": "", "dir": "ltr" },
  "creationDate": "",
  "popupRef": "53R",
  "annotationType": 9,
  "quadPoints": [
    [
      { "x": 216.073, "y": 724.093 },
      { "x": 393.644, "y": 724.093 },
      { "x": 216.073, "y": 701.57 },
      { "x": 393.644, "y": 701.57 }
    ]
  ]
}
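
The object above is less opaque than it first looks. A minimal sketch of reducing such results to the highlight-related fields follows; the field names come from the sample, while `summarizeHighlights` is a hypothetical name. The `color` value prints as `{ "0": 255, ... }` presumably because `JSON.stringify` serializes a typed array as an object keyed by index, so `Object.values` is used to handle both forms.

```javascript
// Hypothetical helper: keep only highlight annotations and reduce each one
// to a few fields from the sample above.
function summarizeHighlights(annotations) {
  return annotations
    .filter((annotation) => annotation.subtype === "Highlight")
    .map((annotation) => ({
      id: annotation.id,
      rect: Array.from(annotation.rect),
      // Works for a typed array and for the index-keyed object that
      // JSON.stringify produces from one.
      color: annotation.color ? Object.values(annotation.color) : null,
      // Text of the attached comment, if any. This is NOT the highlighted text.
      comment: annotation.contentsObj ? annotation.contentsObj.str : "",
      quadPoints: annotation.quadPoints,
    }));
}
```

Note that `contentsObj` holds the note attached to the annotation, which is empty in the sample; nothing in the object carries the highlighted words themselves.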

What is the expected behavior? (add screenshot)

To be able to read the highlighted text via API itself, instead of relying on third-party tools.

What went wrong? (add screenshot)

Not sure. So far, one or more of the following seems to be true:

  1. the API functionality is there and I simply can't find it
  2. it was never designed for such a use case in the first place
  3. comments/highlights/annotations are meant to be rendered, not accessed

If PDF.js doesn't support such a feature, and likely never will, any pointers toward a PDF library that does would be highly appreciated. If it does, please forgive my oversight. The API page of the project isn't the most helpful resource in its current state, and blind lookups of "annotation" and "highlight" in the api.js file unfortunately didn't add much clarity either.

calixteman (Contributor) commented

Unfortunately, the PDF specification doesn't make the highlighted text part of the annotation data.
For example, for the highlight annotation on page 1:
[image]
So as far as I can tell, the only thing you can do is get the quadPoints from the annotation, get the text layer, which contains the coordinates of the text, and then find the text that corresponds to the quadPoints.
FYI, there is almost no chance that we add this feature to pdf.js, unless you can demonstrate that it would be useful in the Firefox context.
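
The matching described in this comment could look roughly like the sketch below. It is an illustration under stated assumptions, not a pdf.js feature: all function names are hypothetical, `quadPoints` is assumed to have the shape shown in the sample earlier in this thread (other pdf.js versions may use a different layout), and text items are assumed to be those returned by `page.getTextContent()`, each with `str`, `transform`, `width` and `height` in the same PDF coordinate space as the annotation. Text is assumed to be horizontal and unrotated.

```javascript
// Bounding box of one quadrilateral given as four {x, y} corner points.
function quadToBox(quad) {
  const xs = quad.map((point) => point.x);
  const ys = quad.map((point) => point.y);
  return {
    left: Math.min(...xs),
    right: Math.max(...xs),
    bottom: Math.min(...ys),
    top: Math.max(...ys),
  };
}

// Approximate box of a text item: transform[4] and transform[5] are taken
// as the origin of the text run (assumes horizontal, unrotated text).
function itemToBox(item) {
  const left = item.transform[4];
  const bottom = item.transform[5];
  return { left, right: left + item.width, bottom, top: bottom + item.height };
}

// An item counts as highlighted when its vertical midpoint lies inside the
// quad's box and the two boxes overlap horizontally.
function isCovered(quadBox, itemBox) {
  const midY = (itemBox.bottom + itemBox.top) / 2;
  const sameLine = midY >= quadBox.bottom && midY <= quadBox.top;
  const overlapsX = itemBox.left < quadBox.right && quadBox.left < itemBox.right;
  return sameLine && overlapsX;
}

// Hypothetical helper: join the strings of all text items under a highlight.
function textUnderHighlight(annotation, textItems) {
  const quadBoxes = annotation.quadPoints.map(quadToBox);
  return textItems
    .filter((item) => typeof item.str === "string" && Array.isArray(item.transform))
    .filter((item) => quadBoxes.some((box) => isCovered(box, itemToBox(item))))
    .map((item) => item.str)
    .join(" ");
}

// Assumed usage with a page obtained from pdfDocument.getPage(1):
//   const annotations = await page.getAnnotations();
//   const { items } = await page.getTextContent();
//   for (const a of annotations.filter((x) => x.subtype === "Highlight")) {
//     console.log(textUnderHighlight(a, items));
//   }
```

The match works at the granularity of whole text items, so a highlight that covers only part of an item still returns that item's full string; trimming to the exact characters would need per-glyph widths, which this sketch does not attempt.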

Snuffleupagus (Collaborator) commented

As explained in #17509 (comment), the PDF file format wasn't really created with such a use case in mind, since the text content of the document is completely separate from the annotations.

Snuffleupagus closed this as not planned on Jan 15, 2024