[WIP] Add support for caret #96
Thanks for the PR! I took a quick look but I'm pretty sure this isn't the right way to go about it. I'd be more likely to handle Caret and StrikeOut as normal annotations, i.e. collect them all on one page, and maybe only then merge them in a post-processing pass if they are really adjacent. I guess this would also account for the weirdness with the context_subscribers?
@@ -261,7 +261,9 @@ def capture_char(self, text: str) -> None:
# Locate and remove the annotation's existing context subscription.
assert last_charseq != 0
i = bisect.bisect_left(self.context_subscribers, (last_charseq,))
assert 0 <= i < len(self.context_subscribers)
if not (0 <= i < len(self.context_subscribers)):
Why/when does this happen?
This was specific to one of the pdf files that I have. I haven't had a chance to debug it yet.
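One possible explanation for the failing assert, offered as a sketch rather than a diagnosis: `bisect.bisect_left` returns an insertion index, which equals `len(list)` when the key sorts after every entry. The example below assumes `context_subscribers` is a sorted list of tuples whose first element is a character sequence number; that layout and the helper name `find_subscription` are assumptions for illustration, not taken from the pdfannots source.

```python
import bisect

def find_subscription(subscribers, last_charseq):
    """Return the index of the entry keyed by last_charseq, or None if absent.

    `subscribers` is assumed to be a list of tuples sorted by their first
    element (hypothetical layout, for illustration only).
    """
    # A 1-tuple sorts before any longer tuple with the same first element,
    # so bisect_left lands on the first entry with that key, if one exists.
    i = bisect.bisect_left(subscribers, (last_charseq,))
    # i == len(subscribers) means the key sorts after every entry.
    if i < len(subscribers) and subscribers[i][0] == last_charseq:
        return i
    return None

subs = [(3, "a"), (7, "b"), (12, "c")]
```

Returning `None` instead of asserting lets the caller decide how to treat an annotation that never registered a subscription.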
pdfannots/__init__.py
if contents:
page.annots[-1].contents = contents
page.annots[-1].subtype = annot_type
I'm sure this approach isn't always correct. In Adobe Reader, I can insert a caret annotation all on its own without a StrikeOut, by choosing an insertion point and then typing into the document with the caret tool active. Modifying the prior annotation would be appropriate only if the caret is adjacent to the strikeout.
Yes, you are right! I'm guessing a caret associated with a strikeout should be reported as a text replacement, but a standalone caret should be reported as a text addition. What do you think, @0xabu?
That makes sense to me -- report the pair as a replacement only if the caret appears right at the end of a strikeout. I would still implement it as a separate pass, perhaps even in the printer class.
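A minimal sketch of such a separate pass, assuming every annotation carries start and end character offsets into the page text. The `Annot` class, its fields, and `pair_replacements` are hypothetical names for illustration, not the pdfannots data model.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Annot:
    subtype: str                         # e.g. "StrikeOut" or "Caret"
    start: int                           # offset of first covered character (assumed field)
    end: int                             # offset just past the last covered character
    contents: Optional[str] = None       # text typed by the reviewer
    replaces: Optional["Annot"] = None   # set when a caret is paired with a strikeout

def pair_replacements(annots: List[Annot]) -> List[Annot]:
    """Attach each caret to a strikeout that ends exactly where the caret sits."""
    strikeouts_by_end = {a.end: a for a in annots if a.subtype == "StrikeOut"}
    consumed = set()  # ids of strikeouts that were folded into a caret
    for a in annots:
        if a.subtype != "Caret":
            continue
        strike = strikeouts_by_end.get(a.start)
        if strike is not None and id(strike) not in consumed:
            a.replaces = strike
            consumed.add(id(strike))
    # Paired strikeouts are reported through their caret; everything else is kept.
    return [a for a in annots if id(a) not in consumed]

strike = Annot("StrikeOut", 10, 15)
caret = Annot("Caret", 15, 15, contents="new text")
lone = Annot("Caret", 40, 40, contents="added text")
result = pair_replacements([strike, caret, lone])
```

Here `caret` ends up with `replaces` pointing at `strike` (a replacement), while `lone` stays a plain insertion. Because the pass runs after collection, the capture logic does not need to know about pairing at all.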
@0xabu OK. Since the last PR, I found out that when a strike-out is associated with a caret, it has the Generates this:
There are a couple of things that need more work, though:
Aha, that makes sense that they are linked by metadata. Then we should be able to merge them in a non-lossy way. Thanks for investigating this. Feel free to go further, or if not I will take a look when I get round to it.
I don't remember exactly how the context mechanism works, but possibly because the caret alone doesn't cover any text?
Sounds good. I'll clean up the code and leave this PR up. Thanks!
* extract Caret annotations in PDF
* handle IRT (in reply to) property, and expose as inter-Annotation links
* capture (but don't yet use) the optional N name property
* when rendering the specific case of a Caret annotation with a single StrikeOut annotation as a "reply" (which is how Acrobat seems to render replace+insert edits), render this as a "suggested replacement"

Based on the work of Suyash Mahar in #96

* extract Caret annotations in PDF
* handle IRT (in reply to) property, and expose as inter-Annotation links
* capture the optional NM name property, and export it in JSON (this is really unrelated)
* when rendering the specific case of a Caret annotation with a single StrikeOut annotation as a "reply" (which is how Acrobat seems to render replace+insert edits), render this as a "suggested replacement"

Based on the work of Suyash Mahar in #96
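The IRT-based rendering described in these commit messages could look roughly like the following. This is a sketch under stated assumptions, not the code merged in #102: the `Annot` class, the string `name` used as a stand-in for the PDF object reference that IRT actually holds, and the output wording are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Annot:
    name: str                          # stand-in for the annotation's object identity
    subtype: str                       # e.g. "Caret", "StrikeOut", "Highlight"
    contents: Optional[str] = None     # text typed by the reviewer
    text: Optional[str] = None         # document text covered by the annotation
    in_reply_to: Optional[str] = None  # name of the annotation this one replies to
    replies: List["Annot"] = field(default_factory=list)

def link_replies(annots: List[Annot]) -> None:
    """Resolve in_reply_to names into parent.replies lists."""
    by_name = {a.name: a for a in annots}
    for a in annots:
        parent = by_name.get(a.in_reply_to) if a.in_reply_to else None
        if parent is not None:
            parent.replies.append(a)

def render(a: Annot) -> str:
    """Render a caret with exactly one StrikeOut reply as a suggested replacement."""
    if a.subtype == "Caret":
        if len(a.replies) == 1 and a.replies[0].subtype == "StrikeOut":
            return f'Suggested replacement: "{a.replies[0].text}" -> "{a.contents}"'
        return f'Suggested insertion: "{a.contents}"'
    return f"{a.subtype}: {a.text}"

caret = Annot("c1", "Caret", contents="new")
strike = Annot("s1", "StrikeOut", text="old", in_reply_to="c1")
lone = Annot("c2", "Caret", contents="added")
link_replies([caret, strike, lone])
```

Linking through metadata rather than position keeps the merge non-lossy: both annotations remain available, and only the renderer decides to present them as one edit.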
I've merged equivalent support in #102 -- feedback welcome! Thanks again for the PR and sorry for the long delay while I got around to it.
Attempts to add support for replace annotations:
The replace text annotation is made of a `StrikeOut` and a `Caret` annotation. The `StrikeOut` part is the line, and the little arrow is the `Caret` annotation. Since they appear as separate annotations, this hacky implementation modifies the content of the last `StrikeOut` annotation when it sees a `Caret` annotation.
However, I noticed that every annotation here appears twice:
pdfannots/pdfannots/__init__.py
Lines 396 to 405 in f3d80db
Attempts to resolve #61
@0xabu Do you have any thoughts on this?