Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BB-662: URLs are turned into malformed links #837

Merged
merged 5 commits into from
Feb 21, 2023

Conversation

the-good-boy
Copy link
Contributor

@the-good-boy the-good-boy commented Apr 15, 2022

Problem

This PR addresses ticket BB-662.

Solution

Initially, we were adding an https:// even if it was already present in the string, now I have added a check.
I have also improved the regex we were using, as I feel it was a bit insufficient and antiquated. I took help from this, and I think it works a lot better and is somewhat cleaner also.

Copy link
Member

@MonkeyDo MonkeyDo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the right approach is to rewrite everything, and the new implementation seems to handle fewer URL formats.

I think instead we have to, look at the part of this function that adds the https://www. which seems to be misbehaving.
In fact, we can simplify the whole ordeal and simply add // in front of the href content if the url does not start with http(s)://.
This would instruct the browser to navigate away from the current website

Perhaps something like this?

function stringToHTMLWithLinks(string) {
 // eslint-disable-next-line max-len, no-useless-escape
  const urlRegex =
    /((([A-Za-z]{3,9}:(?:\/\/)?)(?:[\-;:&=\+\$,\w]+@)?[A-Za-z0-9\.\-]+|(?:www\.|[\-;:&=\+\$,\w]+@)[A-Za-z0-9\.\-]+)((?:\/[\+~%\/\.\w\-_]*)?\??(?:[\-\+=&;%~*@\.\w_]*)#?(?:[\.\!\/\\\w]*))?)/g;
  const startsWithProtocol = Boolean(string.match(/^https?:\/\//gi)?.length);
  if (startsWithProtocol) {
    return string.replace(urlRegex, '<a href="$1" target="_blank">$1</a>');
  } else {
    return string.replace(urlRegex, '<a href="//$1" target="_blank">$1</a>');
  }
}

For the input www.google.com this would produce 'www.google.com'

What do you think?

@the-good-boy
Copy link
Contributor Author

I like your suggestion for handling http(s).
But as far as I remember, the new implementation was infact handling more URL formats than the original one.

@MonkeyDo
Copy link
Member

MonkeyDo commented Sep 9, 2022

Interesting, I got that wrong then :)

I think however the current regexp is too permissive.
For example if a user make a very small typo and forgets the put a space after a full stop, for example end of sentence.Beginning of new one it will be detected as a URL and output a link:
end of <a href="https://sentence.Beginning" target="_blank">https://sentence.Beginning</a> of new one

@the-good-boy
Copy link
Contributor Author

the-good-boy commented Feb 11, 2023

Interesting, I got that wrong then :)

I think however the current regexp is too permissive. For example if a user make a very small typo and forgets the put a space after a full stop, for example end of sentence.Beginning of new one it will be detected as a URL and output a link: end of <a href="https://sentence.Beginning" target="_blank">https://sentence.Beginning</a> of new one

So, I finally got time to look at this again. I started from a clean slate. I agree that we should not rewrite the original regex. So I investigated the original issue we were facing (BB-662) and I think I'm onto something.

The only problem with the original code was that it was somehow detecting all the www. when trying to append an https:// in front of that. Looks like just adding a set of round brackets in the addHttpRegex can fix that. Infact I think that the original writer actually intended this, but might have missed it by mistake.

What are your thoughts, @MonkeyDo ?

@MonkeyDo
Copy link
Member

I think that looks like a better approach, and much simpler!
Nice job on pinpointing the issue!

I have run into some cases where your suggested change does not work (such as for example when the substring is not at the beginning of a sentence), and I think the string replacement now strips the www. which is not what we wanted.

But after trying a few things I think the following regex should work well for our needs:

/(\b(?<!https?:\/\/)w{3}\.\S+\.)/gmi

my testing setup here

With that and another change to the substitution string (https://$1 instead of $1https://www.) I think we're covering most cases

Things I modified or added:

  • changed line start to a word boundary to allow for links in the middle of a sentence
  • added a negative lookbehind (?<!https?:\/\/) which ensures we don't match string that already have http:// or https://
  • changed www to w{3}, this one isn't really useful or anything, just feels more pleasant to me, feel free to ignore
  • capture the whole group (and use it in the string replacement https://$1)
  • added matching for a second dot so that www.foobar wouldn't match but www.foobar.com would

I'll note that with this setup we can't match stuff like google.com because without a list of top-level domains we couldn't differentiate this.text (simple typo, missing space after the dot). However I think this should do nicely for fixing the current issue.

@the-good-boy
Copy link
Contributor Author

Thanks for the suggestions. I have made the required changes.

Copy link
Member

@MonkeyDo MonkeyDo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvement @the-good-boy !

@MonkeyDo MonkeyDo merged commit 206266b into metabrainz:master Feb 21, 2023
@MonkeyDo MonkeyDo changed the title [BB-662]: URLs are turned into malformed links BB-662: URLs are turned into malformed links Feb 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants