Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Match protocol relative URLs (://) #17

Closed
wants to merge 4 commits into from
Closed

Conversation

mre
Copy link
Contributor

@mre mre commented Mar 1, 2021

Protocol-relative links (PRL), also known as protocol-relative URLs (PRURL), are URLs that have no protocol specified. For example, //example.com will use the protocol of the current page, typically HTTP or HTTPS.

These URLs commonly occur in the wild, specifically on pages that support both HTTP and HTTPS protocols.

Since the URLs without a protocol belong to the class of relative URLs, it is up to the user to decide what protocol to use when parsing these URLs.
Example configuration for the url crate: https://docs.rs/url/2.2.1/url/#base-url

It would be great to support those in linkify, as I discovered quite a few cases while working on recursion in lychee.
(lycheeverse/lychee#165)

@robinst
Copy link
Owner

robinst commented Mar 2, 2021

Hmm, so this PR adds support for ://example.org, but if I understand you correctly, you actually want //example.org, no?

If we want to support /, we need to trigger on / as well, not only on :.

@mre
Copy link
Contributor Author

mre commented Mar 2, 2021

Oh, you're right. That's wrong. It was a bit late.

Protocol-relative links (PRL), also known as protocol-relative URLs (PRURL), are URLs that have no protocol specified. For example, `//example.com` will use the protocol of the current page, typically HTTP or HTTPS.

These URLs commonly occur in the wild, specifically on pages that support both http and https protocols.

Since the URLs without a protocol belong to the class of relative URLs,
it is up to the user to decide what protocol to use when parsing these
URLs.
Example configuration for the url crate:
https://docs.rs/url/2.2.1/url/#base-url
src/url.rs Outdated
fn scan(&self, s: &str, trigger_index: usize) -> Option<Range<usize>> {
let (proto, offset) = match self.trigger {
'/' => ("//", 2),
_ => ("://", 3),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you like, we can make the trigger an enum type. Then we can use an exhaustive match here:

let (proto, offset) = match self.trigger {
  Trigger::Slash => ("//", 2),
  Trigger::Colon => ("://", 3),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I just went ahead and added it. Think that's cleaner now. Let me know if I should revert the change. ✌️

@mre
Copy link
Contributor Author

mre commented Mar 2, 2021

Should be fixed now.

/// Users should not exhaustively match this enum, because more triggers
/// may be added in the future.
#[derive(Debug, PartialEq)]
pub enum Trigger {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you make this pub(crate)? Want to make it clear at this point that it's not part of the API.

@@ -29,6 +36,12 @@ impl Scanner for UrlScanner {
impl UrlScanner {
// See "scheme" in RFC 3986
fn find_start(&self, s: &str) -> Option<usize> {
// Match protocol relative URLs (`//example.org`)
// See https://stackoverflow.com/a/15146073/270334
if s.is_empty() && self.trigger == Trigger::Slash {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm does this mean in abc //example.org we wouldn't match //example.org? Can you add a test case for this?

Also needs some more negative test cases, e.g. something like foo//bar should not match, but e.g. "//foo.bar should.

@@ -188,13 +201,12 @@ impl Default for LinkFinder {

impl<'t> Links<'t> {
fn new(text: &'t str, url: bool, email: bool, email_domain_must_have_dot: bool) -> Links<'t> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should have an option (disabled by default) for the new functionality, something like protocol_relative_urls(true). Normally I wouldn't want to match those kinds of links in plain text.

You can see e.g. GitHub not matching it either: //github.com

@robinst
Copy link
Owner

robinst commented Apr 14, 2022

Just checking back on this one. Still needs some more changes (and resolving conflicts).

@mre
Copy link
Contributor Author

mre commented Apr 14, 2022

Oh yeah. Didn't find the time to look into it, heh. Will set myself a reminder for next week.

@mre
Copy link
Contributor Author

mre commented Jul 11, 2022

Should we close this and try again at a later point in time? linkify 0.9.0 made some changes to the codebase, so it's probably easier to start from a clean slate.

@robinst
Copy link
Owner

robinst commented Jul 11, 2022

Yep! It should be much simpler to do that now as the necessary bits of logic are now separate from UrlScanner!

@robinst robinst closed this Jul 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants