Enable automatic URL linking #19110

ryzokuken · 2024-11-26T17:03:41Z

Detect URLs in the text content of a file and generate link annotations at the corresponding locations, so that plain-text URLs become hyperlinks.

References:

Please note that this is a WIP PR for soliciting your feedback while I work on polishing things and hopefully optimizing further.

web/pdf_page_view.js

+  #processLinks() {
+    return this.pdfPage.getTextContent().then(content => {
+      const [text, diffs] = normalizedTextContent(content);
+      const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;


Snuffleupagus

How does this perform, especially in documents that contain a lot of text?

Also, we probably want a new option/preference to be able to disable this functionality.

Snuffleupagus · 2024-11-26T17:30:53Z

web/pdf_page_view.js

+  }
+
+  #processLinks() {
+    return this.pdfPage.getTextContent().then(content => {


We absolutely cannot fetch the textContent twice for each rendered page, since that'll be really inefficient in general.
Besides, it isn't necessary since the textContent is already available once the textLayer has rendered; see

pdf.js/web/text_layer_builder.js

Line 99 in 079eb24

this.highlighter?.setTextMapping(textDivs, textContentItemsStr);

pdf.js/web/text_highlighter.js

Lines 47 to 59 in 079eb24

/**
 * Store two arrays that will map DOM nodes to text they should contain.
 * The arrays should be of equal length and the array element at each index
 * should correspond to the other. e.g.
 * `items[0] = "<span>Item 0</span>" and texts[0] = "Item 0";
 *
 * @param {Array<Node>} divs
 * @param {Array<string>} texts
 */
setTextMapping(divs, texts) {
  this.textDivs = divs;
  this.textContentItemsStr = texts;
}
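
To illustrate the point about reusing text that the text layer already holds, here is a minimal sketch that works purely on an array of item strings such as `textContentItemsStr`. The helper names `joinTextItems` and `itemsForRange` are hypothetical and not part of pdf.js; the sketch only shows how a match found in the joined page text could be mapped back to the items (and therefore the divs) it spans, without a second `getTextContent()` call.

```javascript
// Hypothetical helper (not a pdf.js API): join the per-item strings and
// remember the [begin, end) range each item occupies in the joined text.
function joinTextItems(textContentItemsStr) {
  const ranges = [];
  let text = "";
  for (const str of textContentItemsStr) {
    ranges.push({ begin: text.length, end: text.length + str.length });
    text += str;
  }
  return { text, ranges };
}

// Hypothetical helper: return the indices of all items that overlap the
// [begin, end) range of a match in the joined text.
function itemsForRange(ranges, begin, end) {
  const indices = [];
  ranges.forEach((range, i) => {
    if (range.begin < end && range.end > begin) {
      indices.push(i);
    }
  });
  return indices;
}
```

Because the divs and strings arrays are index-aligned, the returned indices could be used directly to look up the matching entries in `textDivs`.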

Snuffleupagus · 2024-11-26T17:34:22Z

web/pdf_page_view.js

+      const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;
+      const matches = text.matchAll(urlRegex);


A regular expression isn't sufficient to guarantee that valid URLs are found, so please use createValidAbsoluteUrl to ensure that every "candidate" is actually correct/safe.
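
As a rough illustration of why a second validation step matters, the sketch below filters regex candidates through the standard `URL` constructor plus a protocol allow-list. `validateCandidate` and `ALLOWED_PROTOCOLS` are hypothetical stand-ins written for this example; in the PR itself the check should go through `createValidAbsoluteUrl` as requested, whose exact behaviour is not reproduced here. Prepending `http://` to `www.` candidates is an assumption of this sketch.

```javascript
// Assumption: only these schemes are considered safe for inferred links.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

// Hypothetical stand-in for a real validator: returns a normalized URL
// string, or null if the candidate is malformed or uses another scheme.
function validateCandidate(candidate) {
  // Assumption: bare "www." matches are treated as http URLs.
  const withScheme = candidate.startsWith("www.")
    ? `http://${candidate}`
    : candidate;
  try {
    const url = new URL(withScheme);
    return ALLOWED_PROTOCOLS.has(url.protocol) ? url.href : null;
  } catch {
    // The URL constructor throws a TypeError for unparsable input.
    return null;
  }
}
```

Candidates for which the function returns `null` would simply be dropped before any annotation is created.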

Snuffleupagus · 2024-11-26T17:38:32Z

web/pdf_page_view.js

+  #processLinks() {
+    return this.pdfPage.getTextContent().then(content => {
+      const [text, diffs] = normalizedTextContent(content);
+      const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;


Also, the regular expression should probably be created just once (and then cached) to avoid re-creating it for every page.
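
One way to do this, sketched under assumptions: hoist the pattern to a module-level constant so it is compiled once and shared by every page. The pattern below is a deliberately simplified stand-in, not the regex from this PR, and `findCandidates` is a hypothetical helper name.

```javascript
// Compiled once when the module is evaluated, then reused for every page.
// NOTE: simplified stand-in pattern, not the PR's regular expression.
const URL_REGEX = /\b(?:https?:\/\/|mailto:|www\.)\S+/g;

// Hypothetical helper: collect every candidate together with its offset.
function findCandidates(text) {
  // matchAll iterates over an internal copy of the regex, so the shared
  // URL_REGEX.lastIndex is left untouched between calls.
  return Array.from(text.matchAll(URL_REGEX), match => ({
    url: match[0],
    index: match.index,
  }));
}
```

Sharing a global-flagged regex is safe here because `String.prototype.matchAll` does not mutate the original; code that calls `exec` or `test` on the shared instance would need to reset `lastIndex` itself.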

Snuffleupagus · 2024-11-26T17:40:29Z

web/pdf_page_view.js

+        annotationType: 2,
+        annotationFlags: 4,


These should use actual constants, rather than hard-coded numbers.
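
A minimal sketch of what that could look like. To keep the example self-contained, the two objects below are local mirrors holding only the values used in the diff (2 and 4); in pdf.js the real `AnnotationType` and `AnnotationFlag` enums would be imported rather than redefined, and `createInferredLinkData` is a hypothetical name.

```javascript
// Local mirrors for illustration only; the real code should import the
// existing enums instead of redefining them.
const AnnotationType = Object.freeze({ LINK: 2 });
const AnnotationFlag = Object.freeze({ PRINT: 0x04 });

// Hypothetical helper: build the type/flag part of an inferred link.
function createInferredLinkData(url) {
  return {
    annotationType: AnnotationType.LINK, // instead of a bare `2`
    annotationFlags: AnnotationFlag.PRINT, // instead of a bare `4`
    url,
  };
}
```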

Snuffleupagus · 2024-11-26T17:42:29Z

web/pdf_page_view.js

+        isEditable: false,
+        hasApperance: false,
+        modificationDate: null,
+        structParent: 2,


This should probably use the default-value instead?

Suggested change

-        structParent: 2,
+        structParent: -1,

Snuffleupagus · 2024-11-26T17:44:17Z

web/pdf_page_view.js

+        // NOTE everything from here on is arbitrary
+        borderStyle: {
+          width: 2,
+          rawWidth: 2,
+          style: 1,
+          dashArray: [3],
+          horizontalCornerRadius: 0,
+          verticalCornerRadius: 0


If these don't matter, please at least make sure that you pick values that correspond to the default ones in AnnotationBorderStyle.

Snuffleupagus · 2024-11-26T17:52:58Z

web/annotation_layer_builder.js

+    annotations.push(...linkAnnotations);
+


Have you actually tested this with PDFs that already contain Annotations?

Because it appears to me that this could easily end up creating "duplicate" and overlapping LinkAnnotations in many cases.
For an initial implementation we might want to either:

- Don't use these if the page already has any Annotations.

- Don't use these if the page already has LinkAnnotations.
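
The two options above can be sketched as a single guard placed before the inferred links are appended. This is only an illustration: `mergeInferredLinks` and the policy strings are hypothetical, and `LINK_TYPE` is a local stand-in for the link annotation type value (2) that appears in the diff.

```javascript
// Local stand-in for the link annotation type used in the diff.
const LINK_TYPE = 2;

// Hypothetical guard: only append inferred links when the page does not
// already carry annotations ("skip-if-any") or link annotations
// ("skip-if-links"), depending on the chosen policy.
function mergeInferredLinks(annotations, inferredLinks, policy = "skip-if-links") {
  const blocked =
    policy === "skip-if-any"
      ? annotations.length > 0
      : annotations.some(a => a.annotationType === LINK_TYPE);
  return blocked ? annotations : [...annotations, ...inferredLinks];
}
```

Both policies are coarse, page-level checks; a later refinement could compare rectangles to drop only the inferred links that overlap an existing one.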

Enable automatic URL linking

7374820

github-advanced-security bot found potential problems Nov 26, 2024

Snuffleupagus requested changes Nov 26, 2024
