-
Notifications
You must be signed in to change notification settings - Fork 10k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable automatic URL linking #19110
base: master
Are you sure you want to change the base?
Enable automatic URL linking #19110
Conversation
Automatically detect links in the text content of a file and automatically generate link annotations at the appropriate locations to achieve automatic link detection and hyperlinking.
#processLinks() { | ||
return this.pdfPage.getTextContent().then(content => { | ||
const [text, diffs] = normalizedTextContent(content); | ||
const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv; |
Check notice
Code scanning / CodeQL
Syntax error Note
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this perform, especially in documents that contain a lot of text?
Also, we probably want a new option/preference to be able to disable this functionality.
} | ||
|
||
#processLinks() { | ||
return this.pdfPage.getTextContent().then(content => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We absolutely cannot fetch the textContent twice for each rendered page, since that'll be really inefficient in general.
Besides, it isn't necessary since the textContent is already available once the textLayer has rendered; see
pdf.js/web/text_layer_builder.js
Line 99 in 079eb24
this.highlighter?.setTextMapping(textDivs, textContentItemsStr); pdf.js/web/text_highlighter.js
Lines 47 to 59 in 079eb24
/** * Store two arrays that will map DOM nodes to text they should contain. * The arrays should be of equal length and the array element at each index * should correspond to the other. e.g. * `items[0] = "<span>Item 0</span>" and texts[0] = "Item 0"; * * @param {Array<Node>} divs * @param {Array<string>} texts */ setTextMapping(divs, texts) { this.textDivs = divs; this.textContentItemsStr = texts; }
const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv; | ||
const matches = text.matchAll(urlRegex); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A regular expression isn't sufficient to guarantee that valid URLs are found, so please use createValidAbsoluteUrl
to ensure that every "candidate" is actually correct/safe.
#processLinks() { | ||
return this.pdfPage.getTextContent().then(content => { | ||
const [text, diffs] = normalizedTextContent(content); | ||
const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the regular expression should probably be created just once (and then cached) to avoid re-creating it for every page.
annotationType: 2, | ||
annotationFlags: 4, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should use actual constants, rather than hard-coded numbers.
isEditable: false, | ||
hasApperance: false, | ||
modificationDate: null, | ||
structParent: 2, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably use the default-value instead?
structParent: 2, | |
structParent: -1, |
// NOTE everything from here on is arbitrary | ||
borderStyle: { | ||
width: 2, | ||
rawWidth: 2, | ||
style: 1, | ||
dashArray: [3], | ||
horizontalCornerRadius: 0, | ||
verticalCornerRadius: 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these don't matter, please at least make sure that you pick values that correspond to the default ones in AnnotationBorderStyle
.
annotations.push(...linkAnnotations); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you actually tested this with PDFs that already contain Annotations?
Because it appears to me that this could easily end up creating "duplicate" and overlapping LinkAnnotations in many cases.
For an initial implementation we might want to either:
- Don't use these if the page already has any Annotations.
- Don't use these if the page already has LinkAnnotations.
Automatically detect links in the text content of a file and automatically generate link annotations at the appropriate locations to achieve automatic link detection and hyperlinking.
References:
Please note that this is a WIP PR for soliciting your feedback while I work on polishing things and hopefully optimizing further.