Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable automatic URL linking #19110

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ryzokuken
Copy link
Collaborator

@ryzokuken ryzokuken commented Nov 26, 2024

Automatically detect links in the text content of a file and automatically generate link annotations at the appropriate locations to achieve automatic link detection and hyperlinking.

References:

Please note that this is a WIP PR for soliciting your feedback while I work on polishing things and hopefully optimizing further.

Automatically detect links in the text content of a file and automatically
generate link annotations at the appropriate locations to achieve
automatic link detection and hyperlinking.
#processLinks() {
return this.pdfPage.getTextContent().then(content => {
const [text, diffs] = normalizedTextContent(content);
const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;

Check notice

Code scanning / CodeQL

Syntax error Note

Error: Invalid regular expression flag
Copy link
Collaborator

@Snuffleupagus Snuffleupagus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this perform, especially in documents that contain a lot of text?

Also, we probably want a new option/preference to be able to disable this functionality.

}

#processLinks() {
return this.pdfPage.getTextContent().then(content => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We absolutely cannot fetch the textContent twice for each rendered page, since that'll be really inefficient in general.
Besides, it isn't necessary since the textContent is already available once the textLayer has rendered; see

  • this.highlighter?.setTextMapping(textDivs, textContentItemsStr);
  • /**
    * Store two arrays that will map DOM nodes to text they should contain.
    * The arrays should be of equal length and the array element at each index
    * should correspond to the other. e.g.
    * `items[0] = "<span>Item 0</span>" and texts[0] = "Item 0";
    *
    * @param {Array<Node>} divs
    * @param {Array<string>} texts
    */
    setTextMapping(divs, texts) {
    this.textDivs = divs;
    this.textContentItemsStr = texts;
    }

Comment on lines +1272 to +1273
const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;
const matches = text.matchAll(urlRegex);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A regular expression isn't sufficient to guarantee that valid URLs are found, so please use createValidAbsoluteUrl to ensure that every "candidate" is actually correct/safe.

#processLinks() {
return this.pdfPage.getTextContent().then(content => {
const [text, diffs] = normalizedTextContent(content);
const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the regular expression should probably be created just once (and then cached) to avoid re-creating it for every page.

Comment on lines +1247 to +1248
annotationType: 2,
annotationFlags: 4,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should use actual constants, rather than hard-coded numbers.

isEditable: false,
hasApperance: false,
modificationDate: null,
structParent: 2,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use the default-value instead?

Suggested change
structParent: 2,
structParent: -1,

Comment on lines +1256 to +1263
// NOTE everything from here on is arbitrary
borderStyle: {
width: 2,
rawWidth: 2,
style: 1,
dashArray: [3],
horizontalCornerRadius: 0,
verticalCornerRadius: 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these don't matter, please at least make sure that you pick values that correspond to the default ones in AnnotationBorderStyle.

Comment on lines +122 to +123
annotations.push(...linkAnnotations);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you actually tested this with PDFs that already contain Annotations?

Because it appears to me that this could easily end up creating "duplicate" and overlapping LinkAnnotations in many cases.
For an initial implementation we might want to either:

  • Don't use these if the page already has any Annotations.
  • Don't use these if the page already has LinkAnnotations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants