Adding annotations to the PDF to link back its content to its source. #2192

yanntrividic · 2024-06-24T15:14:20Z

Hello!

Before anything, a bit of context: this PR is a work in progress and is not ready to be merged as is. It will need more work before it can be added to the main branch, as discussed beforehand with @liZe and @grewn0uille. The idea behind this first draft is to let WeasyPrint embed metadata in the PDF for each HTMLElement with an id attribute that it converts, by adding new /Annot PDF objects that can then be accessed from PDF readers.
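To make "each HTMLElement with an id attribute" concrete, here is a minimal sketch that lists those elements using only Python's standard library. It is not how this PR or WeasyPrint walks the document (WeasyPrint works on its own box tree); the `IdCollector` class and `elements_with_id` helper are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect (tag, id) pairs for every element carrying an id attribute."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        element_id = dict(attrs).get("id")
        if element_id:
            self.found.append((tag, element_id))


def elements_with_id(html):
    """Return the (tag, id) pairs found in an HTML string, in document order."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    return collector.found


sample = '<h1 id="title">Hello</h1><p>No id here</p><p id="intro">Hi</p>'
print(elements_with_id(sample))
```

Each pair returned here corresponds to one element that would receive an annotation under the idea described above.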

What it allowed me to do for now is this:

On the left, you have a webpage; and on the right, you have the PDF produced by this fork of WeasyPrint, previewed with PDF.js. A few event listeners were added to bidirectionally "synchronize" the two visualisations. This is just a proof-of-concept, but from there we basically have what we need to build powerful interfaces that take into account the content of the PDF as semantic data that can be linked back to its source.

We talked about adding a PDF variant for debugging that could be accessible through an option like --pdf-variant debug. For now, nothing has been done in this direction: the code I propose here is simply hardcoded into the default behaviour of WeasyPrint. I guess it will also need some cleanup, as I'm not sure I understood the spec correctly.

Anyway, I'd be really interested in working with you on this and going in a direction that suits the philosophy of the project. If you feel like I could be of help, please share your thoughts here so that we can discuss what would be the best way to proceed, and how I could contribute further!

I can also share on demand the code of the interface I'm building, even though it is not ready to be made totally public for now, so don't hesitate to ask :)

Thanks for the great job!

liZe · 2024-08-03T12:01:03Z

Hi @yanntrividic!

I’ve just pushed a debug branch that provides a --pdf-variant=debug option. The result is a bit different because I’ve changed the way it works, but I think that the result includes enough data to work. Tell me if anything’s wrong or missing!
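For anyone wanting to try it, the option above can be passed on the command line. The sketch below only builds that command from Python and runs it when a `weasyprint` executable is available. It assumes a WeasyPrint build that includes the debug branch; the file names and the `debug_pdf_command` and `render_debug_pdf` helpers are placeholders introduced for illustration.

```python
import shutil
import subprocess


def debug_pdf_command(source, target):
    """Build the command line using the option named in this thread."""
    return ["weasyprint", "--pdf-variant=debug", source, target]


def render_debug_pdf(source, target):
    """Run the command if weasyprint is on the PATH; return True on success."""
    if shutil.which("weasyprint") is None:
        return False
    completed = subprocess.run(debug_pdf_command(source, target), check=False)
    return completed.returncode == 0


# Placeholder file names; nothing is executed here, the command is only shown.
print(" ".join(debug_pdf_command("input.html", "output.pdf")))
```

Calling `render_debug_pdf("input.html", "output.pdf")` would then produce the annotated PDF, provided the installed version knows the debug variant.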

yanntrividic · 2024-08-30T13:59:34Z

Hello @liZe!

Summer has done its work, and I am finally in a position to have a look at this again. Thanks for taking the time to turn this seed of an idea into actual code :) it's really nice to study how you integrated it properly with the rest of the code base.

I was able to integrate your branch into my app. I had to change a few lines in my app's code to make it comply with your new logic, but it worked out easily. I only had to add one line to your proposal. You're right that the result includes enough data to work with, but that would be sufficient only if we were building the PDF renderer ourselves. In my case, I use PDF.js, and PDF.js needs a Dest key to actually render the data in the HTML content. Without it, the metadata associated with the T key is simply not present in the HTML code.

To my understanding, the easiest way to pass the id attribute to PDF.js is to turn the annotation into an anchor; otherwise it is just ignored. Maybe there is another way that I don't know about? We face similar problems with other renderers such as PDFium.
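As a rough illustration of the difference, the sketch below serializes a link annotation dictionary with and without a Dest entry. It is not the code from the debug branch: the `pdf_string` and `link_annotation` helpers, the rectangle values, and the choice to reuse the id as a named destination are assumptions made for the example.

```python
def pdf_string(text):
    """Escape text as a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def link_annotation(element_id, rect, with_dest=True):
    """Serialize a minimal link annotation dictionary for one element.

    rect is (x1, y1, x2, y2) in PDF user space units.
    """
    x1, y1, x2, y2 = rect
    entries = [
        "/Type /Annot",
        "/Subtype /Link",
        f"/Rect [{x1} {y1} {x2} {y2}]",
        f"/T {pdf_string(element_id)}",  # the id stored as plain metadata
    ]
    if with_dest:
        # Assumption: the id doubles as a named destination, which makes
        # the annotation an anchor that viewers expose.
        entries.append(f"/Dest {pdf_string(element_id)}")
    return "<< " + " ".join(entries) + " >>"


print(link_annotation("intro", (72, 700, 300, 720), with_dest=False))
print(link_annotation("intro", (72, 700, 300, 720)))
```

A named destination written this way only resolves if the document also defines it, so a real implementation would need a matching destination entry as well.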

When I read through the standard, I don't see many solutions. It's possible to embed an action into an annotation, which might lead to a solution, but we would face the same interoperability issue in the end.

Any thoughts on this? :)

yanntrividic · 2024-08-30T14:05:06Z

On another note, I have a problem with the annotations your code generates for a col element that spans several pages. The rectangles of all the col elements take the shape and position of the col element on the last page, even though the elements have different shapes and positions on each page.

Here is an example on one page (on the last page, the shape frames the element perfectly):

Yann Trividic and others added 4 commits September 28, 2024 17:23

WeasyPrint now produces a LinkAnnotation for each HTMLElement with an id attribute.

3ac5eb8

ruff checks passed!

b4bb02a

@liZe's code has been integrated into the PR

b33b917

Dest key added to the debug annotations

7276b5f

liZe force-pushed the main branch from 5c7a6dd to 7276b5f September 28, 2024 15:25

liZe added this to the 63.0 milestone Sep 28, 2024

liZe added the bug Existing features not working as expected label Sep 28, 2024

Fix tests

8da63da

liZe merged commit d0ac723 into Kozea:main Sep 28, 2024
6 checks passed

jambudipa mentioned this pull request Oct 21, 2024

Per-element metadata from HTML -> PDF -> HTML (via pdf.js) #2279

Closed

grewn0uille added feature New feature that should be supported and removed bug Existing features not working as expected labels Oct 29, 2024
