
It's not immediately clear that "URL syntax" and "URL parser" conflict #118

Closed
domenic opened this issue May 9, 2016 · 166 comments · Fixed by #228

Comments

@domenic
Member

domenic commented May 9, 2016

"URL syntax" is a model for valid URLs---basically authoring requirements. The "URL parser" section, by contrast, allows parsing URLs that do not follow URL syntax.

An easy example is https://////example.com, which is invalid because the portion after https: contradicts the rule:

A scheme-relative URL must be "//", followed by a host, optionally followed by ":" and a port, optionally followed by a path-absolute URL.

/cc @bagder via some Twitter confusion this morning.
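For a concrete illustration (assuming a WHATWG-conforming parser, such as the `URL` class shipped in Node.js and modern browsers), the extra slashes are a validation error but not a fatal one, so parsing still succeeds:

```javascript
// The WHATWG parser treats the surplus slashes after a special scheme as a
// (non-fatal) validation error and skips them, so an input the "URL syntax"
// forbids is nonetheless parsed and normalized.
const lenient = new URL("https://////example.com");
console.log(lenient.href); // "https://example.com/"
```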

@bagder

bagder commented May 9, 2016

Is there really any reason for accepting more than two slashes for non-file: URLs? I mean apart from this spec saying that the parser should accept them.

@domenic
Member Author

domenic commented May 9, 2016

The fact that all browsers do.

@bagder

bagder commented May 10, 2016

The fact that all browsers do.

I tested Safari on a recent OS X version and it doesn't even accept three slashes. Not in the address bar and not in Location: headers in a redirect. It handles one or two slashes, no more. So I refute your claim.

The fact that all browsers do

That's exactly the sort of mindset that will prevent the WHATWG URL spec from ever becoming the universal URL spec. URLs need to be defined to work in more contexts than browsers.

@annevk
Member

annevk commented May 10, 2016

URLs need to be defined to work in more contexts than browsers.

They are, no? Handling multiple slashes or not seems independent of that. If Safari does not do it there might be wiggle room, or Safari might hit compatibility issues similar to curl's.

@bagder

bagder commented May 10, 2016

That's a large question and too big of a subject for me to address fully here.

URLs in the loose sense of the term are used all over the place.

URLs by the WHATWG definition are probably not used by much besides a handful of browsers, no. In my view (wearing my curl goggles), there are several reasons why we can't expect that to change much short-term going forward either, as this slash issue shows.

I would love a truly universal and agreed URL syntax, but in my view we've never been further away from that than today.

@domenic
Member Author

domenic commented May 10, 2016

I'm sorry for the imprecision. We often use "all browsers" to mean "the consensus browser behavior, modulo minor deviations and bugs."

The URL Standard defines URLs for software that wants to be compatible with browsers, and participate in the ecosystem of content which produces and consumes URLs meant for browsers. If cURL does not want to be part of that ecosystem, then yes, the URL Standard is probably not a good fit for cURL. But we've found over time that most software (e.g. servers that wish to interact with browsers, or scraping tools that wish to be able to scrape the same sites browsers visit) wants to converge on those rules.

@bagder

bagder commented May 10, 2016

We often use "all browsers" to mean "the consensus browser behavior, modulo minor deviations and bugs."

This made me also go and check IE11 on win7, and you know what? It doesn't support three slashes either.

To me, this is important. It shows you've added a requirement to the spec that a notable share of browsers don't support. When I ask why (because it really makes no sense to me), you give a circular answer and say you did this because "all browsers" act like this. Which we now know isn't true. It's just backwards on so many levels.

If cURL does not want to be part of that ecosystem

Being part of that ecosystem does not mean that I blindly just suck up what the WHATWG says a URL is without me questioning and asking for clarification and reasoning. Being here, asking questions, responding, complaining, is part of being in the ecosystem.

curl already is, and has long been, part of the ecosystem. Deeply, firmly and actively - we have supported and worked with URLs since back when they were still truly standard "URLs" (RFC 1738). I'm here, writing this, because I want an interoperable world where we pass URLs back and forth and agree on what they mean.

When you actively decide to break RFC 3986, and by extension RFC 7231 for the Location: header, I would prefer that you explain why - if you want to be a part of the ecosystem.

the URL Standard is probably not a good fit for cURL

I wish we worked on a URL standard, then I'd participate and voice my opinions like I do with some other standards work. A URL standard is very much a good idea for curl and for the entire world.

A URL works in browsers and outside of browsers. URLs can be printed on posters, parsed and highlighted by terminal emulators or IRC clients, parsed by scripts, and read out loud over the phone by kids to their grandparents. URLs are, or at least could be, truly universal. Limiting the scope to "all browsers" limits their usability. It fragments what a URL is and how it works (or not) in different places and for different uses.

If you want a URL standard, you must look beyond "all browsers".

@domenic
Member Author

domenic commented May 10, 2016

This made me also go and check IE11 on win7, and you know what? It doesn't support three slashes either.

Edge does, however: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4182

In general, Edge has made changes like this to be compatible with the wider ecosystem of web content. I can't speak for their engineers, but this shows clear convergence.


It's good to hear you're interested in participating. That wasn't my impression from your earlier comments, and I welcome the correction.

@JohnMH

JohnMH commented May 10, 2016

Why should malformed URLs be parsed? Surely the solution is to simply tell people who are using malformed URLs to... stop using malformed URLs?

@jyasskin
Member

In the interest of looking for ways forward, instead of just saying "no", per https://twitter.com/yoavweiss/status/730173495464894465, it might make sense to collect usage data and see if browsers can simplify the URL grammar.

@JohnMH

JohnMH commented May 10, 2016

It may be best to ignore that browsers even use URLs, because there are definitely other pieces of software that use URLs. Consider the following URL: irc://network:port/#channel


@domenic
Member Author

domenic commented May 10, 2016

I'd suggest the following plan to any browsers interested in tightening the URL syntax they accept:

  • look through https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json and find all non-error results that you wish became error results
  • (optionally, wait until I finish my long-delayed project to expand that file to give 100% coverage of the spec. Currently it gives around 60%.)
  • Instrument all URL parsing to figure out how often these undesirable patterns occur
  • When the numbers come back, decide which percentage of users or pages you are willing to break, and pick the subset of algorithm changes you can make to do so up to that percentage
  • optionally, weigh the percent of users broken vs. the corresponding spec or implementation complexity reduction
  • start coordinating with other vendors to see which of your changes they're interested in. (Some cases might already behave differently in those other vendors' browsers, which could help!)
  • Ship, preferably in a coordinated fashion with appropriate devrel support.

@JohnMH

JohnMH commented May 11, 2016

Browsers are not the only applications that use URLs.


@domenic
Member Author

domenic commented May 11, 2016

@JohnMHarrisJr your comment seems irrelevant to my plan for "any browsers interested in tightening the URL syntax they accept".

@ghost

ghost commented May 11, 2016

The syntax for URIs is such that the authority component (user:password@host:port) is always separated from the scheme by two slashes, except for some schemes that do not require them. The path may only begin with // if the authority component is there, and in that case it must begin with a slash. So there is no possible case where there would be more than three slashes after the colon following the URI scheme.(1) HTTP in particular requires the two slashes between the URI scheme and the authority component, so there should always be exactly two slashes between the http URI scheme and the URI component.(2)

In other words, http://user:password@host:port/path is valid, http://user:password@host:port//path might be valid, but anything else is definitely not valid.

URLs being a subset of URIs, it would only make sense to follow the standards that have already been established for a long time—especially since they make sense.
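To make the divergence concrete (sketched with Node.js's WHATWG `URL` class, used here as an assumed example): a strict RFC 3986 reading of `https:////host/path` would be scheme `https`, an empty authority between the first two slashes, and the path `//host/path`, whereas a WHATWG parser skips the surplus slashes for special schemes:

```javascript
// RFC 3986 reading of "https:////host/path": authority "" and path "//host/path".
// WHATWG reading: the extra slashes are skipped and "host" becomes the host.
const u2 = new URL("https:////host/path");
console.log(u2.host);     // "host"
console.log(u2.pathname); // "/path"
```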

@jyasskin
Member

The main difference between the WHATWG and some other standards organizations is that the WHATWG attempts to describe the world as it is, rather than as folks would like it to be. That means that if a major implementation of URLs implements them in one way, the WHATWG specification needs to allow that implementation. So, it doesn't help to "prove" that browsers are wrong by citing earlier specifications.

That said, it seems like there's a good argument to have the URL spec allow implementations to reject too many slashes, since at least one recent browser and several other tools do reject them.

@domenic
Member Author

domenic commented May 11, 2016

Well, we want interoperable behavior, so it's either accept or reject. There's room for "accept with a console log message telling you you're being bad" (the spec already has this concept), but it would take some cross-browser agreement to move toward "reject".

@ghost

ghost commented May 11, 2016

@domenic
That would mean web browsers are not allowed to reject something that is pointless and has been considered invalid since URIs were a thing; they may only accept it and emit a console log message. Which again considers only web browsers. Other applications that use URLs probably don't have a console for logging such messages.

@jyasskin
If the goal is interoperability, standardizing the behavior of the major (read: popular) implementations—which for WHATWG usually means "let's standardize whatever Google Chrome does, other browsers don't matter as much and anything that isn't a web browser or isn't the HTTP protocol doesn't matter at all"—isn't the best option, since these implementations are usually the most actively developed and the ones that care most about those standards. If the standards decide differently from what these implementations do, it's more likely that they would change their behavior than it is that the other implementations would. Of course, this sounds like a bad argument, but that's only because it relies on a wrong premise, which is that defining things based on the major implementations is a good idea.

This isn't surprising given that many people interested in WHATWG use Chrome, have Gmail email addresses, or are Google employees. The others are with Mozilla, probably use Firefox, and probably use Gmail email addresses.

This approach of standards as a popularity contest is harming the web, it tries to make tools like curl that already do URL parsing correctly and very well behave like the popular web browsers for "interoperability". And the popular web browsers behave like they do only in order to support every unreasonable thing that can be found on web pages, because their market share depends on supporting as many web pages as possible so that users don't switch to another browser. And then other browsers, and tools like curl, are expected to do the same because a spec says to!

@domenic
Member Author

domenic commented May 11, 2016

Your claims about the WHATWG having a Chrome bias are false. Please be respectful and make on-topic, evidence-based comments, and not ad hominem attacks, or moderation will be required.

@ghost

ghost commented May 11, 2016

@domenic I admit my comment is the result of frustration, and neither on-topic, nor evidence-based, nor respectful.

@domenic
Member Author

domenic commented May 11, 2016

Thanks for that. We can hopefully keep things more productive moving forward.

At this point I think the thread's original action item (from my OP) still stands, to clarify the authoring conformance requirements for "valid" URLs, versus the user agent requirements for parsing URLs.

Besides that, there seems to be some interest from some people in getting browsers (and other software in the ecosystem that wishes to be compatible with browsers) to move toward stricter URL parsing. I think my plan at #118 (comment) still represents the best path there.

As for the particular issue of more than two slashes, I have very little hope that this can be changed to restrict to two slashes, since software like cURL is already encountering compatibility issues, and we can at least guess that the change from IE11 to Edge to support 3+ slashes might also be compatibility motivated. (Does anyone want to try http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4182 in Safari tech preview to see if they've changed as well?)

But, of course, more data could be brought to bear, as I outlined in my plan. I personally don't think three slashes is the worst kind of URL possible out of all the weird URLs in https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json (my vote goes to maybe h\tt\nt\rp://h\to\ns\rt:9\t0\n0\r0/p\ta\nt\rh?q\tu\ne\rry#f\tr\na\rg) but it seems like a lot of people care about that, so maybe browsers will want to invest effort in measuring that particular use case.

@justjanne

@domenic Considering how many people write software working with existing URL libraries, wouldn’t it be more useful to define URL as "whatever the majority of tools support" (like the URL libraries in about every language, framework, command line tool, server framework, etc)?

Sure, what users input should be accepted, but the fact that browsers' address bars will also happily accept any text (and, if search is off, prepend an http://www. and append a .com/) is already a sign that maybe the definition here is wrong.

Maybe we need a defined spec for a single correct storage format for an identifier, and additionally an algorithm on how to get to this identifier based on user input.

"Google.com" is not a URL, although users think it is one – seperating the actual identifier and the user-visible representations might be helpful here (especially for people writing tools, as they can then declare "we accept only the globally defined format", and let you use other libraries for transforming user-input into that format).

@annevk
Member

annevk commented May 11, 2016

@justjanne the URL Standard does not concern itself with the address bar. That is a misconception. It concerns itself with URLs found in <a>, Location headers, URL API, <form>, XMLHttpRequest, etc.

@annevk
Member

annevk commented May 11, 2016

To be crystal clear, the browser address bar is UX and accepts all kinds of input that is not standardized. And it is totally up to each browser how they want to design that. They could even make it not accept textual input if they wanted to. That code has no influence on how URLs are parsed.

@justjanne

justjanne commented May 11, 2016

@annevk It also concerns itself with URLs used for cross-app communication in Android, for IPC in several situations, etc. (Android uses a URL format for intents and for cross-app communication, and doesn't accept more than one or two slashes either.)

It also concerns itself with address bars.

What I was suggesting is that maybe we should split it into one specific, defined representation (which libraries, tools, Android, cURL, etc. could accept), and one additional definition for "how to parse input/Location headers/etc. into that format".

Because obviously browsers have to accept a lot more malformed input than other tools, but also it’s obvious that not every tool should include a way to try and fix malformed input itself.

@annevk
Member

annevk commented May 11, 2016

That is basically how it is structured today. There's a parser that parses input into a URL record. And that URL record can then be serialized. The URL record and its serialization are much more constrained than the input.

I think that cURL basically wants the same parsing library in the end. It has already adopted many parts of it. I'm sure as it encounters more content it will gradually start to do the same thing. There's some disagreement on judgment calls such as whether or not to allow multiple slashes, but given that browsers are converging there (see Edge) and there is content that relies on it (see cURL's experience), I'm not sure why we'd change that.
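That pipeline can be sketched with Node.js's WHATWG `URL` implementation: lenient input goes in, and the serialized URL record that comes out is far more constrained:

```javascript
// Parse lenient input into a URL record, then serialize it: the scheme and
// host are lowercased, the default port is dropped, and dot segments are
// collapsed.
const record = new URL("HTTPS://EXAMPLE.com:443/a/./b/../c");
console.log(record.href); // "https://example.com/a/c"
```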

@justjanne

justjanne commented May 11, 2016

@annevk That’d be cURL – but do you suggest every tool that handles URLs is rewritten based on this single implementation of a parser? Do you actually want Android to accept single or triple slashes in cross-app communication? You’d add a lot more ambiguity, complexity, and performance issues to any software working with it.

There are use cases for when you want to accept malformed input (for example, when it comes from a user), and there are use cases where you don’t. The definition of URL should be what you call "serialization of a URL record".

(And, IMO, for cURL it would be better to split the parsing into a URL record into a separate tool, and do url-parse "http:/google.com/" | curl. The UNIX principle applies in many places, including here.)

@justjanne

justjanne commented Oct 19, 2017

@nox Yet, you only contacted browser vendors. https://///url.spec.whatwg.org/ states that the goal of the WHATWG is to

Align RFC 3986 and RFC 3987 with contemporary implementations and obsolete them in the process.

Replacing the URL spec (RFCs 3986 and 3987) also means replacing everything that uses it: from offline usage to libraries in languages, from the autolinking regex used on Android to inter-process communication schemes, from file formats that use URLs (every XML implementation) to low-level filesystem APIs in every operating system.

If you state that your official goal is to obsolete and replace the standard all these use, you'll need to open a discussion with a great deal more stakeholders. Or you need to go back, and rethink if replacing the entire URL spec, and every usage of it, is such a good idea.

@annevk
Member

annevk commented Oct 20, 2017

It's easy enough to find emails seeking feedback on this document on various mailing lists, including those controlled by W3C and IETF. Where else would you like us to solicit feedback from?

@justjanne

@annevk let's ask the other way around.

WHATWG officially declared the URL RFCs obsolete.

So, what format should I use to specify URLs in XML files, in my HTML, on Android for Intents, in my file explorer, and elsewhere?

You said your standard hasn't failed, so is there a single place outside of web browsers that has adopted your spec? You say all other specs are obsolete, so can I use your format everywhere now?

Not even your issue tracker has adopted your spec: https:////url.spec.whatwg.org/ — how is this not a failure?

@annevk
Member

annevk commented Oct 20, 2017

@justjanne I don't understand why the issue tracker would have to accept non-conforming URLs. There are all kinds of restrictions and heuristics around plain-text URLs that don't normally apply. (This also goes for the address bar, as @magcius pointed out, which is UX and has vastly different considerations from URLs transmitted via other means.)

As an example of adoption outside browsers: Node.js ships an implementation.

@justjanne

@annevk Why is it non-conforming? According to the spec on url.spec.whatwg.org, it's a valid URL.

The only spec according to which this isn't a valid URL would be the URL RFCs, but those are obsolete, according to WHATWG.

@annevk
Member

annevk commented Oct 20, 2017

It's not valid, please see https://url.spec.whatwg.org/#urls.

@justjanne

Mhm, I just found the definition of the scheme delimiter in the host string.

So, as your goal was to define URL parsing for any situation, and obsolete all previous ones, but your own issue tracker doesn't use your base URL parser...

What should I use to parse URLs out of plain text that could potentially be of any language (including ones that wouldn't leave word boundaries between text, and a URL)?

I find very little in your spec about these situations, but you said very handwavy that there should be additional restrictions.

@justjanne

Also, the format should ideally be universal and accepted by any system a user might ever enter a URL in. As your standard is not that young anymore, and you said it's successful, I trust it has been adopted by any system a user might enter URLs in, so users will be immediately familiar with my interface.

@domenic
Member Author

domenic commented Oct 20, 2017

To hopefully clear up any confusion, neither the URL Standard nor the RFCs it obsoletes provides an algorithm for interpreting an arbitrary string of Markdown text and finding URLs within it. That seems to be what you're wondering about, @justjanne, with your discussion of the issue tracker. I believe that might be specified by CommonMark, but I am not sure.

Both the URL Standard and the RFCs it obsoletes only operate on specific string inputs which are identified as URLs, for example in a Location: header or <a href=""> element. In other surfaces, such as Markdown text, plaintext emails, or location-bar entry, different heuristics apply.

For example, as you noted, https://///url.spec.whatwg.org/ is parsed by <a href=""> and Location: parsers as https://url.spec.whatwg.org/. But it isn't parsed that way by GitHub's Markdown parser.

On the other hand, www.example.com is parsed by <a href=""> and Location: parsers as a relative URL, so e.g. https://github.com/whatwg/url/issues/www.example.com. But if you enter that into Markdown, it is instead parsed as http://www.example.com/, with an implicit http: and added slash. (Example: www.example.com). Similar considerations apply to the location bar, although that uses different heuristics, e.g. for my browser it parses the input url as https://www.google.com/search?q=url&ie=utf-8&oe=utf-8.

To reiterate: the URL Standard, like the RFCs before it, considers Markdown and the location bar out of scope. The URL Standard's "valid" definition can be helpful in building heuristics for situations like that, but won't suffice by itself. (E.g. without a base URL, www.example.com is not a valid URL, so it's clear GitHub/CommonMark are not simply using an algorithm "only link valid URLs".)
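The relative-reference behavior described above is reproducible with any WHATWG parser; a sketch using Node.js's `URL` class:

```javascript
// In <a href=""> / Location: contexts, "www.example.com" is a relative
// reference resolved against the base URL, not a host name.
const resolved = new URL("www.example.com", "https://github.com/whatwg/url/issues/118");
console.log(resolved.href); // "https://github.com/whatwg/url/issues/www.example.com"
```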

@justjanne

@domenic So, you deprecated the URL RFCs, and replaced them with a version that's even more limited in scope? (Because the RFCs at least try to define URLs for all situations — be it XML namespaces, IPC systems, or filesystems).

That's what I'd consider a failure at the goal of universally replacing the RFCs.

And it doesn't help me with parsing URLs (according to whatever definition) out of IRC messages.

@domenic
Member Author

domenic commented Oct 20, 2017

I'm not sure where you got that conclusion from my comment.

@justjanne

@domenic

Premise 1: url.spec.whatwg.org defines that its goal is to obsolete and replace all other URL definitions.

Premise 2: the URL RFCs define URL syntax for all use cases, including parsing out of plain text, or in XML namespaces.

Premise 3: you state that the scope of the WHATWG spec is limited to basically only web browsing.

Conclusion: WHATWG is trying to obsolete and replace a spec with one that's far more limited in scope.

@annevk
Member

annevk commented Oct 20, 2017

I don't see how you came to premises 2 and 3, especially after what @domenic just wrote down. They're both incorrect.

@justjanne

@annevk:

Premise 3 comes from

the URL Standard […] consider Markdown and the location bar out of scope.

Premise 2, I'll just quote RFC 3986

URIs are often transmitted through formats that do not provide a clear context for their interpretation. For example, there are many occasions when a URI is included in plain text; examples include text sent in email, USENET news, and on printed paper. In such cases, it is important to be able to delimit the URI from the rest of the text, and in particular from punctuation marks that might be mistaken for part of the URI.

RFC 3986 — which you claim to obsolete and replace — has an entire section on parsing URLs out of plain text.

@justjanne

See: RFC 3986, Appendix C.

@nox
Member

nox commented Oct 20, 2017

See: RFC 3986, Appendix C.

Nothing in this appendix is normative.

@justjanne

So I still don't see how your standard can be a replacement if you consider an entirely different scope than the RFCs.

And it worries me greatly that you apparently haven't even read the RFCs while trying to replace them.

@justjanne

@nox And yet, it's part of the scope of the RFC, and the normative part mentions such situations several times, too.

The RFCs have a far greater scope than this "replacement" claims to have.

@justjanne

This specification does not place any limits on the nature of a resource, the reasons why an application might seek to refer to a resource, or the kinds of systems that might use URIs for the sake of identifying resources. This specification does not require that a URI persists in identifying the same resource over time, though that is a common goal of all URI schemes.

@justjanne

I could quote half of the RFCs' introductions; they make it especially clear that their scope is NOT limited to a subset of resources that a web browser might refer to: they are specs for any universal usage of a URI, in any situation.

@annevk
Member

annevk commented Oct 20, 2017

There are no processing requirements for plain text in that RFC. E.g., for http://example.com. there's nothing there that states whether the trailing dot is part of the URL or not. I guess you could assume that the RFC states it should be included, but almost no tool would want that kind of behavior, so they'd all violate the RFC.
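A hypothetical plain-text heuristic along these lines (the trimming rule and the `extractUrl` helper are illustrations, not anything any spec defines): strip trailing punctuation that more likely belongs to the sentence, then hand the remainder to a real parser.

```javascript
// Hypothetical heuristic: trim sentence punctuation from the end of a
// token, then validate with a WHATWG parser (Node.js's URL class here).
function extractUrl(token) {
  const trimmed = token.replace(/[.,;:!?)]+$/, "");
  try {
    return new URL(trimmed).href;
  } catch {
    return null; // not parseable as an absolute URL
  }
}
console.log(extractUrl("http://example.com.")); // "http://example.com/"
console.log(extractUrl("not a url"));           // null
```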

@annevk
Member

annevk commented Oct 20, 2017

As for your premises, I don't see how considering the address bar out of scope (which, I can tell you, it also is for the RFC, as no standard gets to dictate UI) means that the URL Standard is only for web browsing. As I told you earlier, it's implemented by Node.js.

As for XML namespaces, they're treated as strings (and required as such by the XML specification) so it really doesn't matter whether they're considered in scope or not. I suppose they're in scope for the URL Standard as far as validity is concerned.

@justjanne

The RFC gives recommendations for parsing plain text, but not strict requirements. Your standard — which claims to obsolete the RFCs — hasn't even tried addressing every stakeholder that currently relies on the RFC before declaring it obsolete. Nor does it provide an adequate replacement with identical scope.

@annevk
Member

annevk commented Oct 20, 2017

Sure it does, there's a whole section on writing URLs, which is perfectly adequate for plain text.

@gibson042

To be honest, I gave up on this specification after #87 (comment) , which explicitly rejects RFC 3986 Normalization and Comparison and ostensibly allows servers to treat /- vs. /%2D vs. /%2d as distinct. Such a position is impractical (since "comparison methods are designed to minimize false negatives while strictly avoiding false positives"), but also basically impossible to implement in a world where middleboxes do use syntax-based and scheme-based normalization for equivalence comparison.

The problem could be fixed by defining normalization, which should include specifying a model for addressing invalid input like /%%2d%2d%3f (e.g., /%25--%3F or %25%252d-%3F or %25%252d%252d%253f or …), but given an express desire to avoid that I think it's dead in the water.
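A minimal sketch of what a recommended syntax-based normalization could look like for percent-escapes, using only RFC 3986's unreserved set (`normalizeEscapes` is a hypothetical helper, not part of any spec): uppercase the hex digits of escapes and decode escapes of unreserved characters.

```javascript
// Hypothetical normalizer: decode %XX escapes of unreserved characters
// (ALPHA / DIGIT / "-" / "." / "_" / "~") and uppercase the hex digits of
// all other escapes. A stray "%" not followed by two hex digits passes
// through unchanged; handling such invalid input is the open question above.
function normalizeEscapes(path) {
  return path.replace(/%([0-9A-Fa-f]{2})/g, (_, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : "%" + hex.toUpperCase();
  });
}
console.log(normalizeEscapes("/%2D")); // "/-"
console.log(normalizeEscapes("/%2d")); // "/-"
console.log(normalizeEscapes("/%3f")); // "/%3F"
```

With this, /-, /%2D, and /%2d all compare equal after normalization, which is the equivalence the RFC's comparison ladder expects.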

@annevk
Member

annevk commented Oct 21, 2017

@gibson042 servers are able to treat those as distinct, they don't have to. I wouldn't mind encouraging them not to, if we can find a suitable algorithm to recommend.

@gibson042

@annevk Servers have the ability to treat them distinctly, but in so doing would be non-conforming with RFC 3986 and not web-compatible. As for a remedy, I gave five options in the original post, and directly linked to them above. If you have changed position and are now amenable to including one here, then let me know and I'll submit a PR.

@annevk
Member

annevk commented Oct 22, 2017

Yeah, I think it's reasonable to recommend normalization for servers (and maybe even expose it in the JavaScript API at some point). We've had some requests for it.
