Cleanup linkify regex patterns (efficiency) #90

Lee-Carre · 2022-02-24T14:05:14Z

Inspired by #65:

Avoid groups being capture-groups (memory-usage)
more compact file-extension matching (especially similar extensions)
replace .+ with more specific URL-only set
match more possible variations (robustness & future-proofing)

* Avoid groups being capture-groups (memory-usage) * more compact file-extension matching (especially similar extensions) * replace `.+` with more specific URL-only set * match more possible variations (robustness & future-proofing)

Lee-Carre · 2022-02-24T14:47:31Z

Having read & thought a little more, I now suspect that my making all groups non-capturing will sabotage the next block of code for applying special treatment to some URLs.

I'll undo this and see if I can figure out (on a phone) how to add those commits to this PR (else I'll close this and open anew).

Edit: done.

Avoid over-broad (`.+`) matching.

Lee-Carre · 2022-02-24T17:32:58Z

@ENT8R

I'm working on refining the patterns further.

To make the patterns more specific (matching only valid URLs (avoiding false-positives), and doing so efficiently (since each pattern is run against each queried (OSM-)note)), it would be helpful to use backreferences.

Which regex parser / interpreter does your backend use? I would guess JavaScript, given that's the language of linkify.js.
If you already know; does it implement backreferences, and if so via what syntax? If you don't happen to recall, that's fine; I'll gladly read the documentation for the answer to the first question 🙂.

* closer matching to valid-only URLs * increased efficiency

[Trailing Dots in Domain Name](http://www.dns-sd.org/TrailingDotsInDomainNames.html)

Match well-formed URLs

Match more (valid) possibilities.

To match percent-encoding URL-escaping.

Lee-Carre · 2022-02-24T20:54:43Z

Apologies for the multiple related commits; I'm doing all this on a phone via GH's Web UI.

ENT8R · 2022-02-27T00:46:33Z

Thank you for your work so far, I did not look into the changes in detail yet, as I wanted to answer your questions before:

Having read & thought a little more, I now suspect that my making all groups non-capturing will sabotage the next block of code for applying special treatment to some URLs.

Correct, in some cases, the URLs are transformed to a different URL to provide a smaller image/thumbnail or to map a webpage to the correct image (needed for some Imgur links). As you already noticed, this is done via capture groups.

Which regex parser / interpreter does your backend use? I would guess JavaScript, given that's the language of linkify.js.

If you mean the language the user interface is implemented in and where the image linking takes place (linkify.js), JavaScript is the language being used.

If you already know; does it implement backreferences, and if so via what syntax?

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions and especially https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Ranges are useful if you are interested in further documentation. I also found another page which offers a quick overview of the possibilities: https://javascript.info/regexp-backreferences

Lee-Carre · 2022-02-27T21:59:09Z

Thank you for your work so far

Welcome; contributing what I'm able to 🙂. Opportunity to learn.

I did not look into the changes in detail yet

Don't worry about review, yet. I'll change the status of the PR (from draft to open; I presume that'll notify you(?), since you're likely ‘watching’ ‘all activity’) when I feel it's ready for scrutiny & consideration for merging.

Correct, in some cases, the URLs are transformed to a different URL to provide a smaller image/thumbnail or to map a webpage to the correct image (needed for some Imgur links). As you already noticed, this is done via capture groups.

👍 an elegant approach.

The necessary groups are now capturing, again. Embarrassment averted.

If you mean the language the user interface is implemented in and where the image linking takes place (linkify.js), JavaScript is the language being used.

Ah, it happens client-side; OK. Either way, thanks for the confirmation of which flavour of regex to use. I don't have the means to test changes (otherwise it would also mean that I could ensure that all the changes would be in a single commit), so I want to be extra-careful that what I submit is at least valid syntax.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions and especially https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions/Groups_and_Ranges are useful if you are interested in further documentation. I also found another page which offers a quick overview of the possibilities: https://javascript.info/regexp-backreferences

Thankyou muchly. Mozilla does a decent job of documentation 🙂.

Besides backreferences for deduplicating code (some parts of e.g. wiki URLs repeat), I want to use named backreferences (or, rather, backreferences to named groups) for easier readability (I'm thinking of when future changes are necessary). Almost like comments in code, or well-named variables & functions.

I'll try to do that with fewer commits, though 😋.

Cleanup linkify regex patterns (efficiency)

860faf3

* Avoid groups being capture-groups (memory-usage) * more compact file-extension matching (especially similar extensions) * replace `.+` with more specific URL-only set * match more possible variations (robustness & future-proofing)

Lee-Carre added 2 commits February 24, 2022 15:44

Re-enable needed linkify regex capture groups for special handling

967881b

Refine linkify regex patterns (efficiency)

9bb17f6

Avoid over-broad (`.+`) matching.

Lee-Carre marked this pull request as draft February 24, 2022 16:38

Lee-Carre added 6 commits February 24, 2022 18:00

Refine Wikipedia URL matching

2215a96

* closer matching to valid-only URLs * increased efficiency

Allow matching FQDNs in image regex patterns

fe28921

[Trailing Dots in Domain Name](http://www.dns-sd.org/TrailingDotsInDomainNames.html)

Refine pattern for generic catch-all image URL

6b27999

Match well-formed URLs

Refine (further) Wiki image URL patterns

4173cf6

Match more (valid) possibilities.

Refine (minorly) regex pattern character set members

f11bb74

Refine regex patterns character sets

9cc8716

To match percent-encoding URL-escaping.

ENT8R force-pushed the main branch 2 times, most recently from de790a5 to 2887ad8 Compare August 10, 2022 22:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup linkify regex patterns (efficiency) #90

Cleanup linkify regex patterns (efficiency) #90

Lee-Carre commented Feb 24, 2022

Lee-Carre commented Feb 24, 2022 •

edited

Loading

Lee-Carre commented Feb 24, 2022

Lee-Carre commented Feb 24, 2022

ENT8R commented Feb 27, 2022

Lee-Carre commented Feb 27, 2022

Cleanup linkify regex patterns (efficiency) #90

Are you sure you want to change the base?

Cleanup linkify regex patterns (efficiency) #90

Conversation

Lee-Carre commented Feb 24, 2022

Lee-Carre commented Feb 24, 2022 • edited Loading

Lee-Carre commented Feb 24, 2022

Lee-Carre commented Feb 24, 2022

ENT8R commented Feb 27, 2022

Lee-Carre commented Feb 27, 2022

Lee-Carre commented Feb 24, 2022 •

edited

Loading