Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better parsing of annotation data in worker thread. #10716

Closed

Conversation

vlastimilmaca
Copy link
Contributor

@vlastimilmaca vlastimilmaca commented Apr 7, 2019

Added parsing of IRT, RT, StateModel, State, CreationDate, ModifiedDate, RichText, IT, ... .
_preparePopup replaced with MarkupAnnotation base class.
Added classes for missing annotations, so all anotations have correct subtype in data property.

Date parsing is done like this: pdf date string -> ISO 8601 string ( parse pdf date with regex and construct ISO string from that) -> new Date constructor. There are also unit tests, which should cover all cases (I hope).

There are also some unit tests to test some of the newly parsed properties.

This should help others (and pdf.js too) to build better UI layer for displaying annotations, because worker thread will provide a lot more info about annotations.

As of issue #10677 : there is missing implementation of caret annotation element in viewer part, but that should be easy, as there is nothing special ... it should be same as strikeout annotation.

To make work strikeout + caret, there will be needed some changes in viewer part to correctly display master annotation(caret) popup on child annotation (strikeout) or some more changes in worker part.

Child annotation has it's own popup, but during parsing, popup annotation is grabing raw data from child annotation (not-yet-parsed), so the master content is not there yet :/. I'm affraid, it is not currently possible to access parsed data of other annotations during parsing process ... so it should be probably handled in viewer part or there should be added some after-process into worker part, which will handle this ... Or maybe, popup annot could look at parent, check if it is RT Group and if it is, then go for it's parent ?

What do you think guys?

Vlastimil Máca added 2 commits April 7, 2019 22:27
…IRT, RT, StateModel, State, CreationDate, ModifiedDate, RichText, IT, ... . _preparePopup replaced with MarkupAnnotation base class. Added classes for missing annotations, so all anotations have correct subtype in data property.
@timvandermeij
Copy link
Contributor

Added classes for missing annotations, so all anotations have correct subtype in data property.

I'm not sure if it's a good idea to add support for this only for the API. Setting only the data property may give a false impression that we actually support those types whereas we only expose the most basic properties and not all specific ones for those annotation types. Moreover, we don't actually render them with this patch, so even though they don't show up in the viewer they also don't cause a warning in the console anymore and may give more problems if https://github.com/mozilla/pdf.js/blob/master/src/display/annotation_layer.js#L266 is being called for unsupported annotation types in the display layer.

I think it would be better to support annotation types in both the core and the display layer, which is how we have been implementing the existing types so far. It will take more time, but is also more complete and easier to verify/review since it's a small, self-contained part. Finally, it also allows us to add more tests, i.e., also reference tests on top of the unit tests.

Added parsing of IRT, RT, StateModel, State, CreationDate, ModifiedDate, RichText, IT, ... .

This PR is, as-is, too large. If we were to choose that only supporting the types in the API is OK, then it will need to be split up for review.

@vlastimilmaca
Copy link
Contributor Author

I'm not sure if it's a good idea to add support for this only for the API. Setting only the data property may give a false impression that we actually support those types whereas we only expose the most basic properties and not all specific ones for those annotation types.

Yeah, that is true and I undestand your concern, but at least you will know they exist in pdf and you can do something with that in display layer (you can say ... ok this annotation is here, it is of this type and if it is markup annotation - it has this title). Even now, there are implemented annotations, which doesn't have 100% support of all features (which is why I did this PR - I was missing markup properties), so there is already precedence.

Let's say, that every annotation must implement parsing of it's required properties (according to pdf spec) before it can be added into API. Then at least caret annotation is ok to add (it is also a markup annotation, so it's not just parsed subtype).

Or we can add warning message into cases, where there is only the basic implementation for those annotations (movie, sound etc.)

Moreover, we don't actually render them with this patch, so even though they don't show up in the viewer they also don't cause a warning in the console anymore and may give more problems if https://github.com/mozilla/pdf.js/blob/master/src/display/annotation_layer.js#L266 is being called for unsupported annotation types in the display layer.

I think this should not be a problem, since in code you mentioned, in switch above are explicitly named supported annotations and as default is this base class, which throws warning. It doesn't matter if subtype is undefined or not named in cases. It is well designed for this situation :).

I think it would be better to support annotation types in both the core and the display layer, which is how we have been implementing the existing types so far. It will take more time, but is also more complete and easier to verify/review since it's a small, self-contained part. Finally, it also allows us to add more tests, i.e., also reference tests on top of the unit tests.

There is opened issue #5283 which is asking for improved annotation API already, so I think people want to have api first and display later. But I understand, there might occur issues later when implementing display layer (they shouldn't, but we all know they might), which might need change of API and that may bring some breaking changes in the future, so I understand this concern as well ... but that happens ...

This PR is, as-is, too large. If we were to choose that only supporting the types in the API is OK, then it will need to be split up for review.

Yeah it is maybe too much all at once. So how should I split it?
Create separate annotation-markup-parsing which should contain following ?

Added parsing of IRT, RT, StateModel, State, CreationDate, ModifiedDate, RichText, IT, ... .

And then create PR for carret annotation in api + in display to satisfy #10677 ?

@timvandermeij
Copy link
Contributor

Thank you for replying. You actually made me remember that I already prepared most work for caret annotations (the test file is even already in the repository), but I simply forgot to finalize it. I just did and made #10723 for that so we have support for caret annotations in both the core and the display layer, so that's then already split off. I'm hoping to do free text annotations next, for which we already have an open pull request that needs to be finalized as well. After that I think we should check on a case-by-case basis how we can integrate the other parts of this patch.

@vlastimilmaca
Copy link
Contributor Author

Nice, you're welcome :). What about splitting it even more … for the first step I would try to replace _preparePopup method in API for MarkupAnnotation base class . If it won't break tests, can we move on to adding other features then?
So the next patch would be date parsing + tests for dates, then next patch for Reply types and groups and then patch for stateModel feature ? Ideally accompanied with tests and test files?

@timvandermeij
Copy link
Contributor

The implementation of caret annotations in the core and display layer is now merged.

[...] for the first step I would try to replace _preparePopup method in API for MarkupAnnotation base class [...]

I looked into that and found that the specification defines that abstraction too, so it sounds like a good idea to do that to match the specification more closely. The advantage is that that's only refactoring, so it can be its own PR and not breaking existing tests is enough for that to be accepted. After that we can see what's best to do next.

@vlastimilmaca
Copy link
Contributor Author

Ok, so I created pull request for that (#10728).

@vlastimilmaca
Copy link
Contributor Author

What is next ? :)

@vlastimilmaca
Copy link
Contributor Author

About creation date and modified date PR (Date PR) … there is some polishing work to be done and mainly, we need to decide what do we want to get..

  1. Do we want to have Date object in data properties, or do we want ISO8601 string, or do we want the default pdf string (which is hard to work with in JS)? Maybe creating Date object is little too much as it can be created on demand, when needed (to call new Date(data.creationDate).toLocaleString() instead of data.creationDate.toLocaleString() is not too hard). Also string is probably more safe to transfer … so I vote for ISO8601 string for this question.

  2. When annotation has a ReplyType = Group, it should take dates from it's IRT annotation, so from that point of view, Date PR should come after RT&IRT PR as it will need to change when that comes in.

  3. Do we want to utilize these two new properties in display layer? Should display layer show date and time, when annotation was created or modified now? Acrobat doesn't do that on level pdf.js has now. So this might be delayed for some sidebar for annotation threads if that's gonna be a thing? Only alternative I see is to display replies directly in popups … that's a thing to decide too, but that can wait, if we allow API to evolve without imminent display layer changes.

  4. About polishing … setCreationDate and setModifiedDate should have same signature (they are inconsistent in this draft). I think, input should be dictionary and value should be parsed and set into data property, as I don't see any benefit of creating property on annotation object itself (this.creationDate).

  5. About tests, there are some unit tests for pdfString -> ISO8601 string conversion, but there may also exist some unit tests for parsing … but again, parsing of dates from annotation data gets trickier with RT&IRT in play. So the parsing tests are probably more for RT&IRT PR than Date PR …

Please correct me if I'm wrong at something. Thank you for your guidance :).

@timvandermeij
Copy link
Contributor

The implementation of free text annotations in the core and display layer is now merged too.

I have some ideas for how to approach the date parsing since I already wrote some code for that. I'll try to check if we can extract that so we can simply add the creation/modification date attributes. Most likely we can also show them in the display layer quite easily so we have that covered as well.

While I look into that, could you share some more information about how Adobe Reader handles discussion threads for annotations (since that's a good reference implementation for us)? For example, are they shown in a sidebar or inline in the popup? If possible, could you make some screenshots (I work on Linux and always had issues getting Adobe Reader to run)? This will help to determine how we can display this information to the user.

I found the specification for RT and IRT at https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf#page=400&zoom=auto,-215,523 and I don't really understand yet what the difference between type R and type Group is, since it looks like both types should be grouped into one annotation for display. Could you explain this a bit more?

Implementing core layer support for this functionality seems OK to me as long as there is sufficient unit test coverage; display layer support is more difficult and can be done in a follow-up.

@vlastimilmaca
Copy link
Contributor Author

vlastimilmaca commented Apr 13, 2019

Sure.
First things first :).

Every MarkupAnnotation can have replies to it.

Reply is always text annotation, which has RT=Reply and IRT is reference to annotation, to which this annotation replies (same as popup has parent reference). Reply text is stored in Contents and author in Title.
Every reply annotation can have it's own popup annotation (it usually does … same as any other regular text annotation)
When you have multiple replies (R1-R3) to one annotation (A1) in pdf it goes like this:
A1, R1(IRT=A1), R2(IRT=R1), R3(IRT=R2). So this is one discussion thread. You can never reply to a reply in the middle and create another thread inside one annotation (at least in Acrobat). You can always reply only to a last reply, so there is max 1 level of replies per annotation, and they are basically stored as linked list.

Now it gets little complicated.
Every annotation (even the reply ones) can have states.
Every user can place state on every annotation or it's reply.
As storage for this are used text annotations which has IRT, T=AuthorName, StateModel and State filled. (RT is not specified, but I thing it is usually Reply, but I'm not sure ... important is, that StateModel is filled, that's how you can tell, it is state modifier to IRT annotation). State has 2 models - Review or Marked model.
Review states can have 5 possible values (see commit or spec). Marked state is basically true or false.

So in previous example, let's say, we will place 'accepted' review state annotation (SA) to a Reply2 and we also mark it, we will get this:
A1, R1(IRT=A1), R2(IRT=R1), R3(IRT=R2), SA1(T=vmaca, IRT=R2, SM=Review,S=Accepted), SA2(T=vmaca, IRT=R2, SM=Marked,S=Marked).

Let's say, I changed my mind and switched state to rejected, that will add a next state annotation:
A1, R1(IRT=A1), R2(IRT=R1), R3(IRT=R2), SA1(T=vmaca, IRT=R2, SM=Review,S=Accepted), SA2(T=vmaca, IRT=R2, SM=Marked,S=Marked), SA3(T=vmaca, IRT=R2, SM=Review,S=Rejected)

So there is always stored history of states to each annotation (even reply ones). Adobe enables to display whole history through properties menu (right click on annotation or reply). But in UI, it only shows last state per each user.

Marked states behave even more interestingly (at least in Acrobat) and I don't know how to do that in pdf.js … Marked states are just check marks which are displayed on annotations or it's replies. Trick is, that Acrobat stores user name (it takes it from windows account and it cannot be changed by user) and it displays only marked states, which were made by current user. (It compares string name from it's settings with string name in title property of state annotation). User, who is running Acrobat and is seeing pdf file, can only see marked states that he made, but not marked states of other users which are stored in pdf file.

Then there are Group annotations.
Group annotations are here to enable to bundle more information(or annotations) into one annotation. Typical usage is this: when you have caret annotation and you want to display, which text needs to be deleted prior caret location. Then the overline annotation is added as a group reply to the caret annotation. So, when dealing with group annotations, there always is the master annotation (it shouldn't have IRT) and bunch of annotations which has IRT and RT = group, but they all are treated as one annotation (especially, when it comes to draging, resizing etc.) . Any replies (RT=reply) which are made to this group of annotations, are stored as a replies to master annotation … same goes for states. So if you click on overline or you click on caret, you always see same set of replies (discussion thread).

Main difference between Group and Reply is, that Group can be of any other markup annotation type and has appearance, but reply is always text annotation and is visible only in discussion area of pdf viewer … I call reply annotations "modifiers" as they serve as extra meta-data providers for it's parent annotation … but groups are multiple annotations which behaves as one.

I think, you got it right. It should be bundled together, but it is not that simple because of how replies are stored (linked list .. and I am not sure, that order of annotations is guaranteed … that parent is before child ). And with states coming into mix … there is more bundling then it seems …

Generally speaking, the master markup annotation for display should have these subsets:

  • replies (not easy to get into one parent, because of how they are stored)
  • groups (this subset is present only when annotation has no IRT)
  • states (this subset is on replies too)

I hope I got it right, but this is how I understand this topic … :)

Images will follow in next comment, now i need to reinstall Acrobat and restart PC ;) (I wanted to change language to english … but that's not that easy … it requires to install completely different version of acrobat :/ ).

@vlastimilmaca
Copy link
Contributor Author

vlastimilmaca commented Apr 13, 2019

How Acrobat displays annotations:

This is what Acrobat displays on hover - it is same with what pdf.js has. When annotation has no replies, it doesn't show bracket area "(3 replies)" at all.
image

This is what it displays, when top annotation has states (it doesn't count marked states or states of replies at all)

image

This is how it looks, when you click on annotation and "Comment" side panel is closed (hover is on state, so you can see what is in tooltip)
image

This is how it looks, when you click on annotation and "Comment" side panel is opened
image

This is how a review history looks:
image

This is how it looks, when two users sets a multiple states (different states at top, same at bottom)
image

Another cool feature is, that popup on hover follows mouse as it moves over annotation … I think that should be doable here too.

Testing pdf here: annotations_acrobat_welcome.pdf

@timvandermeij
Copy link
Contributor

Closing since most parts are now merged and the final part is now in a new pull request, so let's track it there from now on. Thank you for your efforts so far!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants