BB-662: URLs are turned into malformed links #837

the-good-boy · 2022-04-15T13:50:48Z

Problem

This PR addresses ticket BB-662.

Solution

Initially, we were adding an https:// even if it was already present in the string; I have now added a check for that.
I have also improved the regex we were using, as I felt it was insufficient and dated. I took help from this, and I think the result works a lot better and is somewhat cleaner.

MonkeyDo

I don't think the right approach is to rewrite everything, and the new implementation seems to handle fewer URL formats.

I think we should instead look at the part of this function that adds the https://www. prefix, which seems to be misbehaving.
In fact, we can simplify the whole thing and just add // in front of the href content if the URL does not start with http(s)://.
This would instruct the browser to navigate away from the current website instead of treating the link as a relative path.

Perhaps something like this?

function stringToHTMLWithLinks(string) {
  // eslint-disable-next-line max-len, no-useless-escape
  const urlRegex =
    /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%~*@\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/g;
  const startsWithProtocol = Boolean(string.match(/^https?:\/\//gi)?.length);
  if (startsWithProtocol) {
    return string.replace(urlRegex, '<a href="$1" target="_blank">$1</a>');
  } else {
    return string.replace(urlRegex, '<a href="//$1" target="_blank">$1</a>');
  }
}

For the input www.google.com this would produce <a href="//www.google.com" target="_blank">www.google.com</a>
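
One thing to watch in the snippet above: startsWithProtocol looks at the start of the whole input string, so a URL with a protocol that appears mid-sentence would still get the // prefix. A minimal sketch of a per-match variant follows, assuming the same urlRegex; the function name stringToHTMLWithLinksPerMatch and the example.org input are hypothetical and only there for illustration.

```javascript
// Hypothetical variant of the suggestion above: the protocol check is made
// for each matched URL instead of once for the whole input string.
function stringToHTMLWithLinksPerMatch(string) {
  // Same pattern as in the suggestion above, written on one line.
  // eslint-disable-next-line max-len, no-useless-escape
  const urlRegex = /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%~*@\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/g;
  return string.replace(urlRegex, (url) => {
    // Keep the URL as-is when it already carries http(s)://,
    // otherwise make the href protocol-relative with a leading //.
    const href = /^https?:\/\//i.test(url) ? url : `//${url}`;
    return `<a href="${href}" target="_blank">${url}</a>`;
  });
}

console.log(stringToHTMLWithLinksPerMatch('see https://example.org and www.google.com'));
// see <a href="https://example.org" target="_blank">https://example.org</a> and <a href="//www.google.com" target="_blank">www.google.com</a>
```

Like the original suggestion, this still prefixes // to anything the pattern matches that does not start with http(s)://, so it does not make the pattern itself any stricter.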

What do you think?

the-good-boy · 2022-09-05T12:14:46Z

I like your suggestion for handling http(s).
But as far as I remember, the new implementation was in fact handling more URL formats than the original one.

MonkeyDo · 2022-09-09T16:13:56Z

Interesting, I got that wrong then :)

I think, however, that the current regexp is too permissive.
For example, if a user makes a small typo and forgets to put a space after a full stop, as in "end of sentence.Beginning of new one", it will be detected as a URL and output a link:
end of <a href="https://sentence.Beginning" target="_blank">https://sentence.Beginning</a> of new one

the-good-boy · 2023-02-11T18:11:50Z

So, I finally got time to look at this again. I started from a clean slate. I agree that we should not rewrite the original regex. So I investigated the original issue we were facing (BB-662) and I think I'm onto something.

The only problem with the original code was that it somehow matched every www. occurrence when trying to prepend https:// to it. It looks like just adding a set of round brackets in the addHttpRegex can fix that. In fact, I think the original author intended this but missed it by mistake.

What are your thoughts, @MonkeyDo ?

MonkeyDo · 2023-02-14T16:41:12Z

I think that looks like a better approach, and much simpler!
Nice job on pinpointing the issue!

I have run into some cases where your suggested change does not work (for example, when the substring is not at the beginning of a sentence), and I think the string replacement now strips the www., which is not what we want.

But after trying a few things I think the following regex should work well for our needs:

/(\b(?<!https?:\/\/)w{3}\.\S+\.)/gmi

my testing setup here

With that and another change to the substitution string (https://$1 instead of $1https://www.), I think we're covering most cases.

Things I modified or added:

changed line start to a word boundary to allow for links in the middle of a sentence
added a negative lookbehind (?<!https?:\/\/) which ensures we don't match strings that already have http:// or https://
changed www to w{3}, this one isn't really useful or anything, just feels more pleasant to me, feel free to ignore
capture the whole group (and use it in the string replacement https://$1)
added matching for a second dot so that www.foobar wouldn't match but www.foobar.com would

I'll note that with this setup we can't match stuff like google.com, because without a list of top-level domains we couldn't tell it apart from something like this.text (a simple typo, missing space after the dot). However, I think this should do nicely for fixing the current issue.
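
To make the behaviour described above concrete, here is a minimal sketch that applies the regex and the https://$1 substitution to a few inputs. The helper name addHttpsToBareWww is hypothetical, and the sketch assumes a JavaScript engine with lookbehind support (ES2018 or later); it is not the code as merged.

```javascript
// Pattern quoted from the comment above; the variable name follows the
// addHttpRegex mentioned earlier in this thread.
const addHttpRegex = /(\b(?<!https?:\/\/)w{3}\.\S+\.)/gmi;

// Hypothetical helper: prepends https:// to www. links that have no protocol.
function addHttpsToBareWww(text) {
  return text.replace(addHttpRegex, 'https://$1');
}

console.log(addHttpsToBareWww('see www.foobar.com for details'));
// see https://www.foobar.com for details

console.log(addHttpsToBareWww('https://www.foobar.com'));
// https://www.foobar.com (unchanged, the lookbehind rejects it)

console.log(addHttpsToBareWww('www.foobar'));
// www.foobar (unchanged, there is no second dot)

console.log(addHttpsToBareWww('google.com'));
// google.com (unchanged, only www. prefixes are matched)
```

Note that the match stops at the last dot it can reach (www.foobar. in the first input), which is enough here because the replacement only needs to insert the protocol in front of the match.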

the-good-boy · 2023-02-16T08:57:42Z

Thanks for the suggestions. I have made the required changes.

MonkeyDo

Thanks for the improvement @the-good-boy !

MonkeyDo reviewed Sep 5, 2022

the-good-boy added 4 commits February 11, 2023 23:07

remove redundant https (2317184)
add http only if not already present (ec5e0c9)
improved regex (dc13bb7)
improve the original addHttpRegex (11b936b)

the-good-boy force-pushed the malformed-links branch from 1b85071 to 11b936b Compare February 11, 2023 18:05

the-good-boy requested a review from MonkeyDo February 11, 2023 18:19

improve regex (191a684)

MonkeyDo approved these changes Feb 21, 2023

MonkeyDo merged commit 206266b into metabrainz:master Feb 21, 2023

MonkeyDo changed the title ~~[BB-662]: URLs are turned into malformed links~~ BB-662: URLs are turned into malformed links Feb 27, 2023
