Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add support for caret #96

Closed
wants to merge 2 commits into from
Closed

Conversation

suyashmahar
Copy link

@suyashmahar suyashmahar commented Jun 26, 2024

Attemps to add support for replace annotations:

image

The replace text annotation is made of a StrikeOut and a Caret annotation. The StrikeOut part is the line, and the little arrow is the Caret annotation. Since they appear as separate annotations, this hacky implementation modifies the content of the last StrikeOut annotation when it sees a Caret annotation.

However, I noticed that every annotation here appears twice:

# Construct Annotation objects, and append them to the page.
for pa in pdftypes.resolve1(pdfpage.annots) if pdfpage.annots else []:
if isinstance(pa, pdftypes.PDFObjRef):
annot_dict = pdftypes.dict_value(pa)
if annot_dict: # Would be empty if pa is a broken ref
annot = _mkannotation(annot_dict, page)
if annot is not None:
page.annots.append(annot)
else:
logger.warning("Unknown annotation: %s", pa)

Attempts to resolve #61

@0xabu Do you have any thoughts on this?

Copy link
Owner

@0xabu 0xabu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I took a quick look but I'm pretty sure this isn't the right way to go about it. I'd be more likely to handle Caret and StrikeOut as normal annotations, i.e. collect them all on one page, and maybe only then merge them in a post-processing pass if they are really adjacent. I guess this would also account for the weirdness with the context_subscribers?

@@ -261,7 +261,9 @@ def capture_char(self, text: str) -> None:
# Locate and remove the annotation's existing context subscription.
assert last_charseq != 0
i = bisect.bisect_left(self.context_subscribers, (last_charseq,))
assert 0 <= i < len(self.context_subscribers)
if not (0 <= i < len(self.context_subscribers)):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why/when does this happen?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was specific to one of the pdf files that I have. I haven't had a chance to debug it yet.


if contents:
page.annots[-1].contents = contents
page.annots[-1].subtype = annot_type
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sure this approach isn't always correct. In Adobe reader, I can insert a caret annotation all on its own without a StrikeOut, by choosing an insertion point and then typing into the document with the caret tool active. Modifying the prior annotation would be appropriate only if the caret is adjacent to the strikeout.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right! I'm guessing a caret associated with a strikeout should be reported as text replacement, but a standalone caret should be reported as text addition. What do you think, @0xabu ?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me -- report the pair as a replacement only if the caret appears right at the end of a strikeout. I would still implement it as a separate pass, perhaps even in the printer class.

@suyashmahar
Copy link
Author

@0xabu OK. Since the last PR, I found out that when a strike-out is associated with a caret, it has the IRT (in-reply-to) property set. IRT seems to work in association with NM (name).

This works for a simple case:
image

Generates this:

## Detailed comments
 * Page #1: "test" -- Normal highlight comment


## Nits
 * Page #1 suggested replacement:
   > If you can read this, you have ~~Adobe Acrobat Reader~~ installed on your computer.

   Google Chrome

 * Page #1 suggested deletion:
   > If you can ~~read~~ this, you have Adobe Acrobat Reader installed on your computer.

 * Page #1 suggested deletion:
   > ...Reader installed on your ~~computer.~~ ...

   asdf

 * Page #1 suggested insertion: New content

There are a couple of things that need more work, though:

  • NM is an optional field and unique only for the page
  • FDF and PDF have different specs for the IRT field, in PDF it is a dictionary, in FDF it is the name of the associated comment.
  • Capture context doesn't work for a standalone caret annotation. I'm not sure why.

@0xabu
Copy link
Owner

0xabu commented Jun 27, 2024

Aha, that makes sense that they are linked by metadata. Then we should be able to merge them in a non-lossy way.

Thanks for investigating this. Feel free to go further, or if not I will take a look when I get round to it.

Capture context doesn't work for a standalone caret annotation.

I don't remember exactly how the context mechanism works, but possibly because the caret alone doesn't cover any text?

@suyashmahar
Copy link
Author

Sounds good. I'll clean up the code and leave this PR up. Thanks!

I don't remember exactly how the context mechanism works, but possibly because the caret alone doesn't cover any text?
As I understand, it doesn't. Might be because of that.

0xabu added a commit that referenced this pull request Dec 29, 2024
 * extract Caret annotations in PDF
 * handle IRT (in reply to) property, and expose as inter-Annotation lins
 * capture (but don't yet use) the optional N name property
 * when rendering the specific case of a Caret annotation with a single
   StrikeOut annotation as a "reply" (which is how Acrobat seems to render
   replace+insert edits), render this as a "suggested replacement"

Based on the work of Suyash Mahar in #96
0xabu added a commit that referenced this pull request Dec 29, 2024
 * extract Caret annotations in PDF
 * handle IRT (in reply to) property, and expose as inter-Annotation lins
 * capture the optional NM name property, and export it in JSON (this is really unrelated)
 * when rendering the specific case of a Caret annotation with a single
   StrikeOut annotation as a "reply" (which is how Acrobat seems to render
   replace+insert edits), render this as a "suggested replacement"

Based on the work of Suyash Mahar in #96
@0xabu
Copy link
Owner

0xabu commented Dec 29, 2024

I've merged equivalent support in #102 -- feedback welcome! Thanks again for the PR and sorry for the long delay while I got around to it.

@0xabu 0xabu closed this Dec 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support "Caret" annotation
2 participants