It's not immediately clear that "URL syntax" and "URL parser" conflict #118
URL syntax is a model for valid URLs: basically "authoring requirements". The URL parser section allows parsing URLs which do not follow URL syntax. An easy example is `https://////example.com`, which is disallowed because the portion after `https:` contradicts the URL syntax, yet the URL parser accepts it. /cc @bagder via some Twitter confusion this morning.
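For concreteness, here is a quick sketch of the conflict using the WHATWG `URL` class (implemented in browsers and in Node.js); this snippet is an illustration of observed behavior:

```ts
// The extra slashes after "https:" are a validation error per the URL
// syntax, but the parser only flags them and then ignores them, so the
// invalid-by-syntax input still parses, and serializes with two slashes.
const url = new URL("https://////example.com");
console.log(url.href); // "https://example.com/"
```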
Is there really any reason for accepting more than two slashes for non-file: URLs? I mean, apart from this spec saying that the parser should accept them.
The fact that all browsers do.
I tested Safari on a recent OS X version and it doesn't even accept three slashes. Not in the address bar and not in Location: headers in a redirect. It handles one or two slashes, no more. So I refute your claim.
That's exactly the sort of mindset that will prevent the WHATWG URL spec from ever becoming the universal URL spec. URLs need to be defined to work in more contexts than browsers.
They are, no? Handling multiple slashes or not seems orthogonal to that. If Safari does not do it there might be wiggle room, or Safari might hit compatibility issues similar to curl's.
That's a large question and too big of a subject for me to address fully here. URLs in the loose sense of the term are used all over the place. URLs by the WHATWG definition are probably not used by much else than a handful of browsers, no. In my view (wearing my curl goggles), there are several reasons why we can't expect that to change much short-term going forward either. Like this slash issue shows. I would love a truly universal and agreed URL syntax, but in my view we've never been further away from that than today.
I'm sorry for the imprecision. We often use "all browsers" to mean "the consensus browser behavior, modulo minor deviations and bugs." The URL Standard defines URLs for software that wants to be compatible with browsers, and participate in the ecosystem of content which produces and consumes URLs meant for browsers. If cURL does not want to be part of that ecosystem, then yes, the URL Standard is probably not a good fit for cURL. But we've found over time that most software (e.g. servers which wish to interact with browsers, or scraping tools which wish to be able to scrape the same sites as browsers visit) wants to converge on those rules.
This made me also go and check IE11 on Win7, and you know what? It doesn't support three slashes either. To me, this is important. It shows you've added a requirement to the spec that a notable share of browsers don't support. When I ask why (because it really makes no sense to me), you give a circular answer and say you did this because "all browsers" act like this. Which we now know isn't true. It's just backwards on so many levels.
Being part of that ecosystem does not mean that I blindly just suck up what the WHATWG says a URL is without questioning it and asking for clarification and reasoning. Being here, asking questions, responding, complaining, is part of being in the ecosystem. curl already is and has been part of the ecosystem for a very long time. Deeply, firmly and actively: we have supported and worked with URLs since back when they were still truly standard "URLs" (RFC 1738). I'm here, writing this, because I want an interoperable world where we pass URLs back and forth and agree on what they mean. When you actively decide to break RFC 3986, and by extension RFC 7231 for the Location: header, I would prefer you could explain why. If you want to be a part of the ecosystem.
I wish we worked on a URL standard; then I'd participate and voice my opinions like I do with some other standards work. A URL standard is very much a good idea for curl and for the entire world. A URL works in browsers and outside of browsers. URLs can be printed on posters, parsed and highlighted by terminal emulators or IRC clients, parsed by scripts, and read out loud over the phone by kids to their grandparents. URLs are, or at least could be, truly universal. Limiting the scope to "all browsers" limits their usability. It fragments what a URL is and how it works (or not) in different places and for different uses. If you want a URL standard, you must look beyond "all browsers".
Edge does, however: http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4182 In general, Edge has made changes like this to be compatible with the wider ecosystem of web content. I can't speak for their engineers, but this shows clear convergence. It's good to hear you're interested in participating. That wasn't my impression from your earlier comments, and I welcome the correction.
Why should malformed URLs be parsed? Surely the solution is to simply tell people who are using malformed URLs to... stop using malformed URLs?
In the interest of looking for ways forward, instead of just saying "no", per https://twitter.com/yoavweiss/status/730173495464894465, it might make sense to collect usage data and see if browsers can simplify the URL grammar.
It may be best to ignore that browsers even use URLs, because there are definitely other pieces of software that use URLs. Consider the following URL: `irc://network:port/#channel`
I'd suggest the following plan to any browsers interested in tightening the URL syntax they accept:
Browsers are not the only applications that use URLs.
@JohnMHarrisJr your comment seems irrelevant to my plan for "any browsers interested in tightening the URL syntax they accept".
The syntax for URIs is such that the authority component (`user:password@host:port`) is always separated from the scheme by two slashes, except for some schemes that do not require them. The path may only begin with `//` if the authority component is present, and in that case it must begin with a slash. So there is no possible case where there would be more than three slashes after the colon following the URI scheme. HTTP in particular requires the two slashes between the URI scheme and the authority component, so there should always be exactly two slashes between the scheme and the host. In other words, URLs being a subset of URIs, it would only make sense to follow the standards that have already been established for a long time, especially since they make sense.
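As a minimal sketch of that strict reading for http(s) URLs (a hypothetical regex, not taken from either spec, and assuming HTTP's requirement of a non-empty host):

```ts
// scheme, ":", exactly "//", non-empty authority, then an optional path
// starting with a single "/": under this grammar a third slash could only
// come from an empty authority, which HTTP forbids.
const strictHttp = /^https?:\/\/[^/?#]+(\/[^?#]*)?(\?[^#]*)?(#.*)?$/;

console.log(strictHttp.test("http://example.com/a"));  // true
console.log(strictHttp.test("http:///example.com"));   // false (empty host)
console.log(strictHttp.test("http:////example.com"));  // false
```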
The main difference between the WHATWG and some other standards organizations is that the WHATWG attempts to describe the world as it is, rather than as folks would like it to be. That means that if a major implementation of URLs implements them in one way, the WHATWG specification needs to allow that implementation. So, it doesn't help to "prove" that browsers are wrong by citing earlier specifications. That said, it seems like there's a good argument to have the URL spec allow implementations to reject too many slashes, since at least one recent browser and several other tools do reject them.
Well, we want interoperable behavior, so it's either accept or reject. There's room for "accept with a console log message telling you you're being bad" (the spec already has this concept), but it would take some cross-browser agreement to move toward "reject".
@domenic @jyasskin This isn't surprising, given that many people interested in the WHATWG use Chrome, have Gmail email addresses, or are Google employees. The others are with Mozilla, probably use Firefox, and probably use Gmail email addresses. This approach of standards as a popularity contest is harming the web: it tries to make tools like curl, which already do URL parsing correctly and very well, behave like the popular web browsers for "interoperability". And the popular web browsers behave like they do only in order to support every unreasonable thing that can be found on web pages, because their market share depends on supporting as many web pages as possible so that users don't switch to another browser. And then other browsers, and tools like curl, are expected to do the same because a spec says to!
Your claims about the WHATWG having a Chrome bias are false. Please be respectful and make on-topic, evidence-based comments, and not ad hominem attacks, or moderation will be required.
@domenic I admit my comment is the result of frustration, and neither on-topic, nor evidence-based, nor respectful.
Thanks for that. We can hopefully keep things more productive moving forward.

At this point I think the thread's original action item (from my OP) still stands: to clarify the authoring conformance requirements for "valid" URLs, versus the user agent requirements for parsing URLs. Besides that, there seems to be some interest from some people in getting browsers (and other software in the ecosystem that wishes to be compatible with browsers) to move toward stricter URL parsing. I think my plan at #118 (comment) still represents the best path there.

As for the particular issue of more than two slashes, I have very little hope that this can be changed to restrict to two slashes, since software like cURL is already encountering compatibility issues, and we can at least guess that the change from IE11 to Edge to support 3+ slashes might also be compatibility-motivated. (Does anyone want to try http://software.hixie.ch/utilities/js/live-dom-viewer/?saved=4182 in Safari Technology Preview to see if they've changed as well?) But, of course, more data could be brought to bear, as I outlined in my plan. I personally don't think three slashes is the worst kind of URL possible out of all the weird URLs in https://github.com/w3c/web-platform-tests/blob/master/url/urltestdata.json (my vote goes to maybe …).
@domenic Considering how many people write software working with existing URL libraries, wouldn't it be more useful to define URL as "whatever the majority of tools support" (like the URL libraries in about every language, framework, command line tool, server framework, etc.)? Sure, what users input should be accepted, but the fact that the input bars of browsers will happily accept any text (and, if search is off, prepend an http://www. and append a .com/) is already a sign that maybe the definition here is wrong. Maybe we need a defined spec for a single correct storage format for an identifier, plus an algorithm for how to get to this identifier from user input. "Google.com" is not a URL, although users think it is one; separating the actual identifier from the user-visible representations might be helpful here (especially for people writing tools, as they can then declare "we accept only the globally defined format" and let other libraries transform user input into that format).
@justjanne the URL Standard does not concern itself with the address bar. That is a misconception. It concerns itself with URLs found in documents and protocols, e.g. in markup attributes and Location: headers.
To be crystal clear, the browser address bar is UX and accepts all kinds of input that is not standardized. And it is totally up to each browser how they want to design that. They could even make it not accept textual input if they wanted to. That code has no influence on how URLs are parsed. |
@annevk It also concerns itself with URLs used for cross-app communication in Android, for IPC in several situations, etc. (Android uses a URL format for intents and for cross-app communication, and doesn't accept more than one or two slashes either.) It also concerns itself with address bars. What I was suggesting is that maybe we should split it into one specific, defined representation (which libraries, tools, Android, cURL, etc. could accept), and one additional definition for how to parse input/Location headers/etc. into that format. Because obviously browsers have to accept a lot more malformed input than other tools, but it's also obvious that not every tool should include a way to try and fix malformed input itself.
That is basically how it is structured today. There's a parser that parses input into a URL record. And that URL record can then be serialized. The URL record and its serialization are much more constrained than the input. I think that cURL basically wants the same parsing library in the end. It has already adopted many parts of it. I'm sure as it encounters more content it will gradually start to do the same thing. There's some disagreement on judgment calls such as whether or not to allow multiple slashes, but given that browsers are converging there (see Edge) and there is content that relies on it (see cURL's experience), I'm not sure why we'd change that.
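That structure is visible in the `URL` class, where the getters expose components of the parsed URL record and `href` is the record's serialization (a sketch of observed behavior):

```ts
// The accepted input is lenient, but the serialization is constrained:
// the extra slashes do not survive the round trip through a URL record.
const record = new URL("https:////example.com/path");
console.log(record.protocol, record.host, record.pathname);
// "https:" "example.com" "/path"
console.log(record.href); // "https://example.com/path"
```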
@annevk That'd be cURL, but do you suggest every tool that handles URLs be rewritten based on this single implementation of a parser? Do you actually want Android to accept single or triple slashes in cross-app communication? You'd add a lot more ambiguity, complexity, and performance overhead to any software working with it. There are use cases where you want to accept malformed input (for example, when it comes from a user), and there are use cases where you don't. The definition of URL should be what you call the "serialization of a URL record". (And, IMO, for cURL it would be better to split the parsing into a URL record out into a separate tool, and do …)
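A minimal sketch of that proposed two-layer split, with hypothetical helper names, using the WHATWG parser for both layers; the strict layer accepts only strings that already equal their own serialization:

```ts
// Strict layer: accept only the canonical serialization of a URL record.
function parseStrict(input: string): URL | null {
  try {
    const url = new URL(input);
    return url.href === input ? url : null; // must round-trip unchanged
  } catch {
    return null;
  }
}

// Lenient layer: repair user-ish input into the canonical form.
function fixUpUserInput(input: string): string {
  return new URL(input).href;
}

console.log(parseStrict("https:////example.com/"));   // null (not canonical)
console.log(fixUpUserInput("https:////example.com")); // "https://example.com/"
```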
@nox Yet, you only contacted browser vendors. https://///url.spec.whatwg.org/ writes that the goal of the WHATWG is to
Replacing the URL spec (RFCs 3986 and 3987) also affects everything that uses it: from offline usage to libraries in languages, from the autolinking regex used on Android to inter-process communication schemes, from file formats that use URLs (every XML implementation) to low-level filesystem APIs in every operating system. If you state that your official goal is to obsolete and replace the standard all of these use, you'll need to open a discussion with a great many more stakeholders. Or you need to go back and rethink whether replacing the entire URL spec, and every usage of it, is such a good idea.
It's easy enough to find emails seeking feedback on this document on various mailing lists, including those controlled by W3C and IETF. Where else would you like us to solicit feedback from?
@annevk let's ask the other way around. WHATWG officially declared the URL RFCs obsolete. So, what format should I use to specify URLs in XML files, in my HTML, on Android for Intents, in my file explorer, and elsewhere? You said your standard hasn't failed, so is there a single place outside of web browsers that has adopted your spec? You say all other specs are obsolete, so can I use your format everywhere now? Not even your issue tracker has adopted your spec: https:////url.spec.whatwg.org/ — how is this not a failure?
@justjanne I don't understand why the issue tracker would have to accept non-conforming URLs. There are all kinds of restrictions and heuristics around plain-text URLs that don't normally apply. (This also goes for the address bar, as @magcius pointed out, which is UX and has vastly different considerations from URLs transmitted via other means.) As an example of adoption outside browsers: Node.js ships an implementation.
@annevk Why is it non-conforming? According to the spec on url.spec.whatwg.org, it's a valid URL. The only spec according to which this isn't a valid URL would be the URL RFCs, but those are obsolete, according to the WHATWG.
It's not valid; please see https://url.spec.whatwg.org/#urls.
Mhm, I just found the definition of the scheme delimiter in the host string. So, as your goal was to define URL parsing for any situation and obsolete all previous specs, but your own issue tracker doesn't use your base URL parser... What should I use to parse URLs out of plain text that could potentially be in any language (including ones that don't leave word boundaries between text and a URL)? I find very little in your spec about these situations, but you said, rather hand-wavily, that there should be additional restrictions.
Also, the format should ideally be universal and accepted by any system a user might ever enter a URL in. As your standard is not that young anymore, and you said it's successful, I trust that it has been adopted by any system a user might enter URLs in, so they will be immediately familiar with my interface.
To hopefully clear up any confusion, neither the URL Standard nor the RFCs it obsoletes provides an algorithm for interpreting an arbitrary string of Markdown text and finding URLs within it. That seems to be what you're wondering about, @justjanne, with your discussion of the issue tracker. I believe that might be specified by CommonMark, but I am not sure. Both the URL Standard and the RFCs it obsoletes only operate on specific string inputs which are identified as URLs, for example in a … For example, as you noted, … On the other hand, … To reiterate: the URL Standard, like the RFCs before it, considers Markdown and the location bar out of scope. The URL Standard's "valid" definition can be helpful in building heuristics for situations like that, but won't suffice by itself. (E.g. without a base URL, …)
@domenic So, you deprecated the URL RFCs, and replaced them with a version that's even more limited in scope? (Because the RFCs at least try to define URLs for all situations — be it XML namespaces, IPC systems, or filesystems.) That's what I'd consider a failure at the goal of universally replacing the RFCs. And it doesn't help me with parsing URLs (according to whatever definition) out of IRC messages.
I'm not sure where you got that conclusion from my comment.
Premise 1: url.spec.whatwg.org defines that its goal is to obsolete and replace all other URL definitions.
Premise 2: the URL RFCs define URL syntax for all use cases, including parsing out of plain text, or in XML namespaces.
Premise 3: you state that the scope of the WHATWG spec is limited to basically only web browsing.
Conclusion: the WHATWG is trying to obsolete and replace a spec with one that's far more limited in scope.
I don't see how you came to premises 2 and 3, especially after what @domenic just wrote down. They're both incorrect.
Premise 3 comes from
For premise 2, I'll just quote RFC 3986:
RFC 3986 — which you claim to obsolete and replace — has an entire section on parsing URLs out of plain text.
See: RFC 3986, Appendix C.
Nothing in this appendix is normative.
So I still don't see how your standard can be a replacement if you consider an entirely different scope than the RFCs. And it worries me greatly that you apparently haven't even read the RFCs while trying to replace them.
@nox And yet, it's part of the scope of the RFC, and the normative parts mention such situations several times, too. The RFCs have a far greater scope than this "replacement" claims to have.
I could quote half of the RFCs' introductions; they make it especially clear that their scope is NOT limited to a subset of resources that a web browser might refer to, but that they are specs for any universal usage of a URI, in any situation.
There are no processing requirements for plain text in that RFC. E.g., for `http://example.com.` there's nothing that states whether the trailing dot is part of the URL or not. I guess you could assume the RFC states it should be included, but almost no tool would want that kind of behavior, so they'd all violate the RFC.
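To make the gap concrete, here is a hypothetical linkifier heuristic (my own illustration; neither the RFC nor the URL Standard defines this):

```ts
// Find a URL-ish run, then strip trailing punctuation that more likely
// belongs to the sentence than to the URL. Whether the trailing dot of
// "http://example.com." is part of the URL is exactly the call the RFC
// leaves open; this heuristic decides it is not.
function extractUrl(text: string): string | null {
  const match = text.match(/https?:\/\/[^\s<>"]+/);
  if (match === null) return null;
  return match[0].replace(/[.,;:!?)\]]+$/, "");
}

console.log(extractUrl("See http://example.com. for details."));
// "http://example.com"
```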
As for your premises, I don't see how considering the address bar out of scope (which I can tell you is also the case for the RFC, as no standard gets to dictate UI) means that the URL Standard is only for web browsing. As I told you earlier, it's implemented by Node.js. As for XML namespaces, they're treated as strings (and required to be by the XML specification), so it really doesn't matter whether they're considered in scope or not. I suppose they're in scope for the URL Standard as far as validity is concerned.
The RFC gives recommendations for parsing plain text, but not strict requirements. Your standard — which claims to obsolete the RFCs — hasn't even tried addressing every stakeholder that currently relies on the RFC before declaring it obsolete. Nor does it provide an adequate replacement with identical scope.
Sure it does; there's a whole section on writing URLs, which is perfectly adequate for plain text.
To be honest, I gave up on this specification after #87 (comment), which explicitly rejects RFC 3986 Normalization and Comparison and ostensibly allows servers to treat … as distinct. The problem could be fixed by defining normalization, which should include specifying a model for addressing invalid input like …
@gibson042 servers are able to treat those as distinct; they don't have to. I wouldn't mind encouraging them not to, if we can find a suitable algorithm to recommend.
@annevk Servers have the ability to treat them distinctly, but in so doing would be non-conforming with RFC 3986 and not web-compatible. As for a remedy, I gave five options in the original post, and directly linked to them above. If you have changed position and are now amenable to including one here, then let me know and I'll submit a PR.
Yeah, I think it's reasonable to recommend normalization for servers (and maybe even expose it in the JavaScript API at some point). We've had some requests for it.
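One cheap approximation of such a recommendation (a sketch, not a defined API; the helper name is hypothetical) is a parse-serialize round trip through the WHATWG parser, which already performs much of RFC 3986's syntax-based normalization:

```ts
// Lowercases the scheme and host, drops the default port, and resolves
// dot segments; it does not decode or re-case existing percent-encodings.
function normalize(input: string): string {
  return new URL(input).href;
}

console.log(normalize("HTTP://EXAMPLE.com:80/a/../b"));
// "http://example.com/b"
```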