[WIP] Add support for caret #96

suyashmahar · 2024-06-26T02:06:59Z

Attempts to add support for replace annotations:

The replace text annotation is made of a StrikeOut and a Caret annotation. The StrikeOut part is the line, and the little arrow is the Caret annotation. Since they appear as separate annotations, this hacky implementation modifies the content of the last StrikeOut annotation when it sees a Caret annotation.

However, I noticed that every annotation here appears twice:

pdfannots/pdfannots/__init__.py

Lines 396 to 405 in f3d80db

    
    # Construct Annotation objects, and append them to the page.
    for pa in pdftypes.resolve1(pdfpage.annots) if pdfpage.annots else []:
        if isinstance(pa, pdftypes.PDFObjRef):
            annot_dict = pdftypes.dict_value(pa)
            if annot_dict:  # Would be empty if pa is a broken ref
                annot = _mkannotation(annot_dict, page)
                if annot is not None:
                    page.annots.append(annot)
        else:
            logger.warning("Unknown annotation: %s", pa)

Attempts to resolve #61

Caret annotations were introduced in PDF 1.5; see page 616 of the PDF Reference 1.7.
Example PDF

@0xabu Do you have any thoughts on this?

0xabu

Thanks for the PR! I took a quick look but I'm pretty sure this isn't the right way to go about it. I'd be more likely to handle Caret and StrikeOut as normal annotations, i.e. collect them all on one page, and maybe only then merge them in a post-processing pass if they are really adjacent. I guess this would also account for the weirdness with the context_subscribers?

0xabu · 2024-06-26T07:55:28Z

pdfannots/__init__.py

@@ -261,7 +261,9 @@ def capture_char(self, text: str) -> None:
                    # Locate and remove the annotation's existing context subscription.
                    assert last_charseq != 0
                    i = bisect.bisect_left(self.context_subscribers, (last_charseq,))
-                    assert 0 <= i < len(self.context_subscribers)
+                    if not (0 <= i < len(self.context_subscribers)):


Why/when does this happen?

This was specific to one of the pdf files that I have. I haven't had a chance to debug it yet.
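One possible cause, not confirmed in this thread: `bisect.bisect_left` returns `len(a)` when the key sorts after every element, so the index is out of range whenever `last_charseq` is larger than every subscribed sequence number. A minimal sketch with a made-up subscriber list (the tuples are illustrative, not pdfannots' actual data):

```python
import bisect

# Hypothetical stand-in for context_subscribers: a sorted list of
# (charseq, annotation_name) tuples.
subscribers = [(3, "a"), (7, "b"), (12, "c")]

# A key whose charseq matches an existing entry lands on that entry,
# because the 1-tuple (7,) sorts just before (7, "b").
hit = bisect.bisect_left(subscribers, (7,))

# A key larger than every entry yields len(subscribers), which is
# not a valid index, so the lookup has to be range-checked.
miss = bisect.bisect_left(subscribers, (20,))

print(hit, miss, len(subscribers))
```

If this is what happens with the problem file, the range check in the diff avoids the crash but leaves open why the subscription was missing in the first place.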

0xabu · 2024-06-26T07:59:10Z

pdfannots/__init__.py

+
+                        if contents:
+                            page.annots[-1].contents = contents
+                            page.annots[-1].subtype = annot_type


I'm sure this approach isn't always correct. In Adobe Reader, I can insert a caret annotation all on its own without a StrikeOut, by choosing an insertion point and then typing into the document with the caret tool active. Modifying the prior annotation would be appropriate only if the caret is adjacent to the strikeout.

Yes, you are right! I'm guessing a caret associated with a strikeout should be reported as a text replacement, but a standalone caret should be reported as a text addition. What do you think, @0xabu?

That makes sense to me -- report the pair as a replacement only if the caret appears right at the end of a strikeout. I would still implement it as a separate pass, perhaps even in the printer class.
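A rough sketch of what such a separate pass could look like, assuming each annotation exposes a single reference point (the end of the struck text for a StrikeOut, the insertion point for a Caret). The `Annot` class, the `point` field, and the `tol` tolerance are all hypothetical and are not part of pdfannots:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Annot:
    """Hypothetical stand-in, not the pdfannots Annotation class."""
    subtype: str                  # "StrikeOut" or "Caret"
    point: Tuple[float, float]    # StrikeOut: end of struck text; Caret: insertion point
    contents: Optional[str] = None


def classify_edits(annots: List[Annot], tol: float = 2.0):
    """Pair each Caret with a StrikeOut that ends within `tol` units of it."""
    strikeouts = [a for a in annots if a.subtype == "StrikeOut"]
    paired = set()  # ids of StrikeOuts already consumed by a Caret
    edits = []
    for caret in (a for a in annots if a.subtype == "Caret"):
        match = next(
            (s for s in strikeouts
             if id(s) not in paired
             and abs(s.point[0] - caret.point[0]) <= tol
             and abs(s.point[1] - caret.point[1]) <= tol),
            None,
        )
        if match is not None:
            paired.add(id(match))
            edits.append(("replacement", match, caret))
        else:
            edits.append(("insertion", None, caret))
    # Any StrikeOut left unpaired is a plain deletion.
    edits.extend(("deletion", s, None) for s in strikeouts if id(s) not in paired)
    return edits


page = [
    Annot("StrikeOut", (100.0, 50.0)),
    Annot("Caret", (100.5, 50.0), "Google Chrome"),
    Annot("StrikeOut", (10.0, 10.0)),
    Annot("Caret", (300.0, 200.0), "New content"),
]
kinds = [kind for kind, _, _ in classify_edits(page)]
print(kinds)
```

Because the pass runs after all annotations on the page are collected, it does not depend on the order in which the Caret and StrikeOut appear in the file.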

suyashmahar · 2024-06-26T18:33:32Z

@0xabu OK. Since the last PR, I found out that when a strike-out is associated with a caret, it has the IRT (in-reply-to) property set. IRT seems to work in association with NM (name).

This works for a simple case:

Generates this:

## Detailed comments
 * Page #1: "test" -- Normal highlight comment


## Nits
 * Page #1 suggested replacement:
   > If you can read this, you have ~~Adobe Acrobat Reader~~ installed on your computer.

   Google Chrome

 * Page #1 suggested deletion:
   > If you can ~~read~~ this, you have Adobe Acrobat Reader installed on your computer.

 * Page #1 suggested deletion:
   > ...Reader installed on your ~~computer.~~ ...

   asdf

 * Page #1 suggested insertion: New content

There are a couple of things that need more work, though:

* NM is an optional field and is unique only within a page.
* FDF and PDF have different specs for the IRT field: in PDF it is a dictionary, in FDF it is the name of the associated comment.
* Capture context doesn't work for a standalone caret annotation. I'm not sure why.
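To illustrate the first two points, here is a minimal sketch of resolving IRT in both forms. The record layout is hypothetical (plain dicts, with an integer `objid` standing in for a PDF object reference), and it is not how pdfannots represents annotations:

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: each annotation is a plain dict with
# "objid", "page", "subtype", and optional "NM" / "IRT" entries.
Annot = Dict[str, object]


def link_replies(annots: List[Annot]) -> List[Tuple[Annot, Annot]]:
    """Return (reply, parent) pairs for every resolvable IRT entry."""
    by_objid = {a["objid"]: a for a in annots}
    # NM is optional and only unique within a page, so key on (page, NM).
    by_name = {(a["page"], a["NM"]): a for a in annots if "NM" in a}
    links = []
    for a in annots:
        irt = a.get("IRT")
        if irt is None:
            continue
        if isinstance(irt, str):   # FDF style: name of the parent
            parent = by_name.get((a["page"], irt))
        else:                      # PDF style: stand-in for an object reference
            parent = by_objid.get(irt)
        if parent is not None:
            links.append((a, parent))
    return links


annots = [
    {"objid": 1, "page": 1, "subtype": "Caret", "NM": "x1"},
    {"objid": 2, "page": 2, "subtype": "Caret", "NM": "x1"},
    {"objid": 3, "page": 2, "subtype": "StrikeOut", "IRT": "x1"},
    {"objid": 4, "page": 1, "subtype": "StrikeOut", "IRT": 1},
]
links = link_replies(annots)
for reply, parent in links:
    print(reply["objid"], "->", parent["objid"])
```

Keying the name index on the page as well as NM keeps the two carets named `x1` apart, and an unresolvable IRT is simply skipped rather than treated as an error.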

0xabu · 2024-06-27T06:27:46Z

Aha, that makes sense that they are linked by metadata. Then we should be able to merge them in a non-lossy way.

Thanks for investigating this. Feel free to go further, or if not I will take a look when I get round to it.

> Capture context doesn't work for a standalone caret annotation.

I don't remember exactly how the context mechanism works, but possibly because the caret alone doesn't cover any text?

suyashmahar · 2024-06-27T17:23:10Z

Sounds good. I'll clean up the code and leave this PR up. Thanks!

> I don't remember exactly how the context mechanism works, but possibly because the caret alone doesn't cover any text?

As I understand, it doesn't. Might be because of that.

* extract Caret annotations in PDF
* handle IRT (in reply to) property, and expose as inter-Annotation links
* capture (but don't yet use) the optional NM name property
* when rendering the specific case of a Caret annotation with a single StrikeOut annotation as a "reply" (which is how Acrobat seems to render replace+insert edits), render this as a "suggested replacement"

Based on the work of Suyash Mahar in #96

* extract Caret annotations in PDF
* handle IRT (in reply to) property, and expose as inter-Annotation links
* capture the optional NM name property, and export it in JSON (this is really unrelated)
* when rendering the specific case of a Caret annotation with a single StrikeOut annotation as a "reply" (which is how Acrobat seems to render replace+insert edits), render this as a "suggested replacement"

Based on the work of Suyash Mahar in #96
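The rendering rule described in the commit message above could look roughly like this. It is only a sketch: the `Annot` class, its `replies` and `text` fields, and the output strings are hypothetical and do not reflect the code merged in #102:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annot:
    """Hypothetical stand-in, not the pdfannots Annotation class."""
    subtype: str
    contents: Optional[str] = None
    text: Optional[str] = None          # text covered by the annotation, if any
    replies: List["Annot"] = field(default_factory=list)


def describe(annot: Annot) -> str:
    """Render one annotation as a short, markdown-flavoured description."""
    if annot.subtype == "Caret":
        replies = annot.replies
        # A Caret with exactly one StrikeOut reply is a replace edit.
        if len(replies) == 1 and replies[0].subtype == "StrikeOut":
            return f"suggested replacement: ~~{replies[0].text}~~ -> {annot.contents}"
        return f"suggested insertion: {annot.contents}"
    if annot.subtype == "StrikeOut":
        return f"suggested deletion: ~~{annot.text}~~"
    return f"{annot.subtype}: {annot.contents}"


struck = Annot("StrikeOut", text="Adobe Acrobat Reader")
replace = Annot("Caret", contents="Google Chrome", replies=[struck])
insert = Annot("Caret", contents="New content")

print(describe(replace))
print(describe(insert))
```

A StrikeOut that has been folded into a replacement would also need to be skipped when it is reached on its own, so that it is not reported a second time as a deletion.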

0xabu · 2024-12-29T20:46:55Z

I've merged equivalent support in #102 -- feedback welcome! Thanks again for the PR and sorry for the long delay while I got around to it.

Add support for caret

1f7472f

0xabu reviewed Jun 26, 2024

Add support for replace annotation using PDF's in-reply-to

0f1f266

0xabu mentioned this pull request Dec 29, 2024

Caret annotations: initial support #102

Merged

0xabu closed this Dec 29, 2024
