
Provide a grammar for the URL parser #479

Open
alercah opened this issue Apr 24, 2020 · 82 comments
Labels
editorial Changes that do not affect how the standard is understood

Comments

@alercah

alercah commented Apr 24, 2020

As an occasional standards-user, the lack of a succinct expression of the grammar for valid URL strings is rather frustrating. It makes it rather difficult to follow what's going on and, in particular, to work out whether a given thing is a valid URL. A grammar in EBNF or a similar form would be greatly appreciated and make this spec significantly easier to understand.

@domenic
Member

domenic commented Apr 24, 2020

Just to be sure, you're saying you prefer something like

url-query-string = url-unit{0,}

(or whatever grammar format) to the spec's current

A URL-query string must be zero or more URL units.

?
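
(For illustration only — the rule names below mirror the spec's prose terms, and the {0,} repetition notation is just a placeholder rather than anything the spec defines — a few such rules might read:)

url-query-string    = url-unit{0,}
url-fragment-string = url-unit{0,}
url-port-string     = ascii-digit{0,}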

@masinter

I think this is #24 #416

@annevk annevk added the topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing) label May 5, 2020
@alercah
Author

alercah commented May 12, 2020

Apologies for the slow reply.

I think that @masinter is right, and that my concern matches the ones discussed there. I skimmed those threads and a few statements concerned me, such as the assertion that a full Turing machine is required to parse URLs: this would very much surprise me; my instinct on reading the grammar is that, once you separate out the different paths for file, special schemes, and non-special schemes in relative URLs, the result is almost certainly context-free. It might even be regular. The fact that the given algorithm mostly does a single pass is a strong indication that complex parsing is not required.

I discussed with a colleague of mine, @tabatkins, and they said that the CSS syntax parser was much improved when it was rewritten from a state-machine based parser into a recursive descent parser. Doing this would effectively require writing the URL grammar out as a context-free grammar, which would make providing a BNF-like specification, even if it's only informative, very easy.

Separately, but not entirely, splitting out the parsing from the semantic functions (checking some validity rules, creating the result URL when parsing a relative URL string) would likely improve the readability of the spec and the simplicity of implementing it. I think this might be better suited for a separate thread, though, as there are some other thoughts I have in this vein as well.

@alwinb
Contributor

alwinb commented Oct 14, 2020

This might be a more complicated problem than you think (@alercah). I have tried several times, but the scheme-dependent behaviour causes a lot of duplicate rules, so you end up with a grammar that is neither concise nor easy to read. And there is a tricky problem with repeated slashes before the host, the handling of which is base-URL dependent.

I have some notes on it here. (I eventually went with a hybrid approach of a couple of very simple grammars and some logic rules in between). This ties into a model of URLs that I describe here.

What's the status of this? It really does work. I developed the theory when I tried to write a library that supports relative URLs. I am quite confident that it matches the standard (though not everything is described in the notes), as the library now passes all of the parsing tests.

@alercah
Author

alercah commented Oct 16, 2020 via email

@masinter

It would be useful to start with the BNF of RFC 3986 and make changes as necessary, at least to explain the differences and exceptions.

@alwinb
Contributor

alwinb commented Oct 19, 2020

Interesting, I did not know that there were parser generators that support parameterised rules.

I did consider a more formal presentation with subscripted rules, but then I backed off because I thought it would be less accessible. It makes me think of higher order grammars, and I think that's too heavy. I guess in this case it could result in something quite readable too though.

As for the comparison with RFC 3986, it would be great if this could help to point out the differences. I have not looked into that much, but the good news is that it might not be that different, after all.
I couldn't start with the RFC, though, because I was specifically aiming for the WHATWG standard. That was motivated by an assumption that this is the common URL standard, in part because it mentions obsoleting RFC 3986 and RFC 3987 as a goal.

@alwinb
Contributor

alwinb commented Oct 19, 2020

Back to the issue, the question is how this could flow back to the WHATWG standard. And I am not really sure how that would work yet. The parser algorithm seems to be the heart of the standard, and I think there is a lot of work behind that. There is of course the section on URL Writing which does look like a grammar in prose style.

To be clear, what I tried to do, and what I suspect people in this thread (and similar ones) are after, is not to give a grammar for valid URL strings (like in the URL writing section), but to give one that describes the language of URLs that is implicitly defined by the parser algorithm – and in such a way that it also describes their internal structure. Then the grammar contains all the information that you need for building a parser. This is indeed possible, but it is a large change from the standard as it is now.

@sideshowbarker
Contributor

sideshowbarker commented Oct 20, 2020

There is of course the section on URL Writing which does look like a grammar in prose style.

To be clear, what I tried to do, and what I suspect people in this thread (and alike) are after is not to give a grammar for valid URL strings (like in the URL writing section)

I think in fact that there are people who’ve been involved with discussion of this who have actually been hoping for a formal grammar for valid URL strings — in some kind of familiar formalism rather than in the prose style the spec uses. (And to be clear, I’m not personally one of the people who wants that — but from following past discussions around this, I can say I’m certain that’s what at least some people have been asking for.)

but to give one that describes the language of URLs that is implicitly defined by the parser algorithm

I know that’s what some people want but I think as pointed out in #479 (comment) (and #24 and other places) there are some serious challenges in attempting to write such a grammar.

And as I think has also been pointed out in #24 and elsewhere: for anybody who wants that, there's nothing that prevents them from attempting to write up such a grammar themselves, based on the spec algorithms — but short of that happening, nobody else involved with the development of the spec is volunteering to try to write it up.

@alwinb
Contributor

alwinb commented Oct 21, 2020

The technical issues are mostly solved. I'm willing to help, and I'm looking for some feedback about how to get started.

I cannot just write a separate section because, in one way or another, it'll compete with the algorithm for normativity (among other things). That's a big problem and I think it is the main reason to resist this.

It also requires specifying the parse trees and some operations on them. That could use the (internal) URL records, but it will require some changes to them.

My main concern is that these things together will trigger too much resistance, too many changes, and that then the effort will fail.

What I can do is try to sketch out some approaches that could help to prevent that. I'll need some time to figure that out.
I'm not sure what else I can do to get this going at the moment. Feedback would be appreciated.

@sideshowbarker
Contributor

I cannot just write a separate section because, in one way or another, it'll compete with the algorithm for normativity (among other things). That's a big problem and I think it is the main reason to resist this.

Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs.

For this case, a grammar could be maintained in a separate (non-WHATWG) repo, and published separately — and then the spec could possibly (non-normatively) link to it (not strictly necessary, but just to help provide awareness it exists).

@domenic
Member

domenic commented Oct 21, 2020

Agreed with @sideshowbarker, generally. If people want to work on personal projects that provide alternative URL parser formalisms, that's great, and I'm glad we've worked on a test suite to help. As seen from this thread, some folks might appreciate some alternatives more than they appreciate the spec, and so it could be helpful to such individuals. But the spec is good as-is.

@alwinb
Contributor

alwinb commented Oct 21, 2020

There are issues with it that cause further fragmentation right now. I have to say I'm disappointed with this response. I'm trying to help out and solve issues — not just this one but also #531 and #354, amongst others, which cannot be done without a compositional approach. If you do not address that, people come up with ad hoc solutions, creating new corner cases, leading to renewed fragmentation. You can already see this happening in some of the issues.

It is also not true that it cannot be done, because I already did it: once for my library, and a couple of weeks ago in a weekend fork of jsdom/whatwg-url that uses a modular parser/resolver based on my notes, has everything in place to start supporting relative URLs as well, and passes all the tests. I didn't post about it, because the changes are too large; clearly it would not work out. I'm trying to take these concerns into account and work with them. Disregarding that with 'things are fine' is, I think, a shame.

@alercah
Author

alercah commented Oct 22, 2020

While I unfortunately do not have the time to contribute to any work on this at the moment, I have a few thoughts.

  1. First, I agree that care should be taken to avoid confusion about normativity. There definitely should be only one normative spec. If a grammar were to go into the spec itself alongside the algorithm, with the algorithm remaining normative, great care would need to be taken to keep the two in agreement, as disagreement between them breeds problems.
  2. Second, I believe that you already basically have not one, but two alternate semi-normative specifications anyway: the section on writing URLs, which specifies a sort of a grammar on how to write them out, and the test suite. I don't believe that anyone can state with certainty that the section on writing URLs actually matches the parser, and I think this comment by one of the major contributors to the spec goes to show how the test suite is treated basically as normatively as the spec, if not more.
  3. Third, I am convinced that trying to define a grammar, normative or non-normative, for the spec as it is, is fundamentally a fool's errand.
  4. But I am not of the opinion that this means that it shouldn't be done. I believe that the current parser should be ripped out entirely, or at least moved to an auxiliary specification on how browsers should implement an actual specification.

To elaborate a bit, I very much disagree with the claim that "the spec is good as is". The spec is unambiguous and provides enough information to determine whether or not an implementation conforms. This is enough to meet the bare minimum requirements and be an adequate technical standard. But it has a number of flaws that make it difficult to use in practice:
- It conflates domains. This URL specification is primarily geared towards the web and web standards, as is indicated by a lot of the implicit assumptions it makes (see also #535). But the use of URLs, and RFC 3986, extends far beyond the web and the spec does not make any meaningful attempt to address uses outside the web. Recommendations on displaying URLs to users are explicitly applicable only to browsers. It defines an API applicable only to the web, with no discussion of API design for other environments. It canonically defines file as the default scheme when no scheme is specified, when most clients would likely prefer to make that decision themselves.
- The fact that the spec is a living standard makes it unsuitable for use in many application domains. It may be acceptable for the web, perhaps, but there are other interchange systems that need a more reliable mechanism.
- It contains almost no background or discussion. It contains only a section listing the goals of the document and three sparse paragraphs on security considerations. It does not explain the purpose of a URL or the human meaning of its various components. It explains almost none of its decisions, such as why special schemes are special, why particular API setters behave the way they do, or why special schemes get a special, elevated place in the spec with their scheme-specific parsing requirements incorporated into it.
- It is poorly organized. For instance, it discusses security considerations in sections 4.8 and 1.3 and does not mention this in section 2.
- Most relevantly to the original topic here, it is nearly impossible for a human to reason about whether or not a URL is valid without manually executing the algorithm. It is incredibly opaque. There is no benefit to this. I defer to @sjamaan's excellent comment. I find the suggestion that section 4.3 provides a useful "overview" of the grammar to be ridiculous. It doesn't. It's just as opaque as the rest of the document.
- As an additional point, the opacity of the spec makes it nearly impossible to reason about whether a given behaviour is intentional or a bug. The spec is defined by the implementation in pseudocode. Even understanding the spec's behaviour given an input, much less deciding whether or not it is correct, effectively requires debugging the specification.
- There is no abstraction of related concepts, and there is bad mixing of technical layers between semantics and syntax. Semantic errors are returned during parsing, rather than during a separate step on the parsed values.

It is worth noting that this specification explicitly intends to obsolete RFC 3986. RFC 3986 is a confusing mix of normative and informative text, and a difficult specification to apply and use. Yet this specification is distant from being able to obsolete it, because it is targeted entirely at one application domain.

In conclusion, this spec is a PHP Hammer. It is not "good". It is barely adequate in the one domain it chooses to support, and abysmal in any other domain.

If the direction of this standard can't reasonably be changed (assuming there are people willing to put in the effort), and in particular if the WHATWG is not interested in addressing other domains in this specification, I would be fully supportive of an effort, likely through the IETF's RFC process, to design a specification which actually does replace RFC 3986, and to have the WHATWG spec recognized only as the web standard on the implementation of that domain-agnostic URL specification. I will probably direct any energy I do find myself with to address this spec to that project rather than this one.

@sideshowbarker
Contributor

sideshowbarker commented Oct 22, 2020

  2. Second, I believe that you already basically have not one, but two alternate semi-normative specifications anyway: the section on writing URLs, which specifies a sort of a grammar on how to write them out

To be clear, there’s nothing semi-normative about that https://url.spec.whatwg.org/#url-writing section. It’s normative.

and the test suite.

And to be clear about that: The test suite is not normative.

I don't believe that anyone can state with certainty that the section on writing URLs actually matches the parser

The section on writing URLs doesn’t claim it matches the parser. Specifically: there are known URL cases that the writing-URLs section defines as non-conforming — as far as documents/authors being prohibited from using them — but which have normative requirements that parsers must follow if documents/authors use them anyway.

and I think this comment by one of the major contributors to the spec goes to show how the test suite is treated basically as normatively as the spec, if not more.

While some people may treat the test suite as authoritative for some purposes, it’s not normative. In the URL spec and other WHATWG specs, normative is a term of art used consistently with an exact unambiguous meaning: it applies only to the spec, and specifically only to the spec language that states actual requirements (e.g., using RFC 2119 must, must not, etc., wording).

The test suite doesn’t state requirements; instead it tests the normative requirements in the spec. And if the test suite were to test something which the spec doesn’t explicitly require, then the test suite would be out of conformance with the spec.

  • Most relevantly to the original topic here, it is nearly impossible for a human to reason about whether or not a URL is valid without manually executing the algorithm.

The algorithm doesn’t define whether a URL is valid or not; instead the algorithm defines how a URL must be processed, whether or not the https://url.spec.whatwg.org/#url-writing section defines that URL as valid/conforming.

@sideshowbarker
Contributor

sideshowbarker commented Oct 22, 2020

Note also that the URL spec has multiple conformance classes for which it states normative requirements; its algorithms state one set of requirements for parsers as a conformance class, and separately, the https://url.spec.whatwg.org/#url-writing section states a different set of requirements for documents/authors as a conformance class.

@alercah
Author

alercah commented Oct 22, 2020

I'm well aware that the test suite is not normative, and that the writing spec is normative, and of the use of "normative" as a term of art. But you said:

Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs.

You claimed that people treating the extra formalism as normative is an argument against the inclusion, not that it would create two potentially-contradictory normative texts.

By the same argument, you should remove the URL writing spec, because it risks being treated as normative, and consider retiring the test suite as well, since people treat it as normative because the spec itself is incomprehensible.

I don't think that you should remove either of them. I think you should make the spec comprehensible so that people stop being tempted to treat something else as normative.

The section on writing URLs doesn’t claim it matches the parser.

I agree that it does not claim to produce invalid URLs. It does, however, make a claim that the operation of serialization is reversible by the parser:

The URL serializer takes a URL and returns an ASCII string. (If that string is then parsed, the result will equal the URL that was serialized.)

Admittedly, this claim is rather suspect because it then provides many examples of where that is not true. I suspect it is missing some qualifiers, such as that the serialization must succeed and the parsing must be done with no base URL and no encoding override.

Even with those qualifiers added, I challenge you to produce a formal proof that serialization and parsing produces an equal URL.
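
(For concreteness, here is a minimal sketch of the property being challenged, in Python. The parse and serialize arguments are hypothetical stand-ins for the spec's basic URL parser and URL serializer, not real library calls.)

from typing import Any, Callable, Optional

def round_trips(url_record: Any,
                serialize: Callable[[Any], str],
                parse: Callable[[str], Optional[Any]]) -> bool:
    # The claim under discussion: serializing a URL record and then re-parsing
    # the result (with no base URL and no encoding override) yields an equal
    # URL record.
    s = serialize(url_record)   # should be an ASCII string
    reparsed = parse(s)         # no base URL, no encoding override
    return reparsed is not None and reparsed == url_record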

@alwinb
Contributor

alwinb commented Oct 22, 2020

Thank you @alercah. I feel validated by the statement that I have been running a fool's errand. It is nice that someone understands the issues and the amount of work involved.

The only reason I pushed through was because I had made a commitment to myself that I would finish this project.

@johnwcowan

RFC 3986 is probably what you want.

@alwinb
Contributor

alwinb commented Oct 23, 2020

No, I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified, because the one that describes the behaviour of web browsers does so with a long stretch of convoluted pseudocode describing a monolithic function that mixes parsing with normalisation, resolution, percent-encoding and updates to URL components. Indeed, an update to RFC 3986 to include browser behaviour would be really, really great. Unfortunately that requires reverse engineering this standard.

@masinter

I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified

I tried over many years to resolve this issue; @sideshowbarker @rubys @royfielding @duerst can attest.
See https://tools.ietf.org/html/draft-ruby-url-problem-01 (from 2015).

@alwinb
Contributor

alwinb commented Oct 25, 2020

Work has started. This is going to happen. Stay tuned.

@alwinb
Contributor

alwinb commented Nov 15, 2020

There is a GitHub project page for a rephrased specification here.
It can be viewed online here.

Whilst still incomplete, it is coming along quite nicely.
The key section on Reference Resolution is complete.
The formal grammars are nearly complete.
There is also a reference implementation of the specification here.

It will not be hard to add a normative section on browser behaviour to e.g. RFC 3986/RFC 3987 once this is finished. The differences are primarily around the character sets and the multiple slashes before the authority. The latter is taken care of by the forced resolution described in the Reference Resolution section.

This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.

@alwinb
Contributor

alwinb commented Jun 5, 2021

Following up on this:

This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences with the RFCs.

I have done more research, esp. around the character sets, making some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution.

The differences will be very small, after all is said and done. Which is great!

Character Sets

IRI vs WHATWG URL

The codepoints allowed in the components of valid WHATWG URLs are almost the same as in RFC3987 IRIs. There is only one difference:

  • WHATWG URLs allow more non-ASCII unicode code points in components.

Specifically, the WHATWG Standard allows the additional codepoints:

  • The Private Use Areas: { u+E000-u+F8FF, u+F0000-u+FFFFD, u+100000-u+10FFFD }.
  • Specials, minus the non-characters: { u+FFF0-u+FFFD }.
  • Tags and variation selectors, specifically, { u+E0000-u+E0FFF }.

Specials are allowed in the query part of an IRI, though not in the other components.
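
(A small Python sketch of the ranges listed above, purely to make the comparison concrete — it ignores the query-part exception for Specials just noted, and the authoritative definition is of course the spec's list of URL code points.)

EXTRA_RANGES = [
    (0xE000, 0xF8FF),       # Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B
    (0xFFF0, 0xFFFD),       # Specials, minus the non-characters
    (0xE0000, 0xE0FFF),     # Tags and variation selectors
]

def allowed_by_whatwg_but_not_iri(cp: int) -> bool:
    # True if the code point falls in the extra set described in this comment.
    return any(lo <= cp <= hi for lo, hi in EXTRA_RANGES)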

IRI vs loose-WHATWG URL

Let me call any input that the 'basic url parser' accepts as a single argument a 'loose-WHATWG URL'.

Note: The IRI grammar does not split the userinfo into a username and password, but RFC3986 (URI) suggests in 3.2.1. that the first : separates the username from the password. So I assume this in what follows. Note though that valid WHATWG URLs do not allow username and password components at all.

To go from IRIs to loose WHATWG URLs, allow any non-ASCII unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:

iinvalid := { u+0-u+1F, u+20 (space), ", <, >, [, ], ^, `, {, |, }, u+7F }

Then, for the components:

  • username: add iinvalid and @ (but remove :).
  • password: add iinvalid and @.
  • opaque-host: add a subset of iinvalid: { u+1-u+8, u+B-u+C, u+E-u+1F, ", `, {, }, u+7F }
  • path component: Add iinvalid.
  • query: add iinvalid.
  • fragment: add iinvalid and #.
  • For non-special loose WHATWG URLs also add \ to all the above except for opaque-host.

The grammar would have to be modified to allow invalid percent escape sequences: a single % followed by zero or one hex digits (but not two).

Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (aka c0-space), but it's not a good idea to try and express that in the grammar.
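
(Again just a Python sketch of the two additions described above — the iinvalid set, assuming the blank element in the set above is the space character, and one way to write the tolerated invalid percent escape shapes.)

import re

# iinvalid as listed above: the C0 controls, space, and a handful of ASCII symbols.
IINVALID = {chr(c) for c in range(0x00, 0x20)} | set(' "<>[]^`{|}\x7f')

# "A single % followed by zero or one hex digits, but not two."
INVALID_PERCENT = re.compile(r'%(?![0-9A-Fa-f]{2})[0-9A-Fa-f]?')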

@masinter

masinter commented Jun 5, 2021

I've suggested a BOF session at IETF 111, which will be held online, to consider what changes to IETF specs, in conjunction with WHATWG specs, would resolve this issue.
A BOF is not a working group, but rather a precursor, to evaluate whether there is enough energy to start one.
IETF attendance fees can be waived.
https://mailarchive.ietf.org/arch/msg/dispatch/i3_t-KjapMhFPCIoQe1N47buZ5M/

@alwinb
Contributor

alwinb commented Jun 8, 2021

In case a new IETF effort does get started,

I just want to state that I hope, and actually believe, that a new or updated IETF specification and the WHATWG URL standard could complement each other quite well. It will require work and there will be problems, but it is possible and worthwhile.

@masinter

masinter commented Jun 8, 2021

@alwinb Nothing in IETF will happen unless you show up with friends willing to actually do the work.

@alwinb
Contributor

alwinb commented Jun 9, 2021

So here's some work to be done.

  • Decide if an addendum is enough, or if RFC 3986/3987 should be merged (the latter has my preference)
  • Decide if the full WHATWG parsing/resolution behaviour should be included, or if it is enough to provide the elementary operations that can then be recombined in the WHATWG standard to exactly reproduce their current behaviour (latter one has my preference, then the standards can really be complementary!)
  • Decide how to include the loose grammar in such a document (my preference: parameterise the character sets)
  • Rewrite my 'force' operation into the RFC style and maybe refactor the merge operations from RFC3986 a little, or switch to my model of sequences more whole heartedly.
  • Amend or parameterise the 'path merge' to support the WHATWG percent-encoded dotted segments.
  • A remaining technical issue: solve Base URL Windows drive sometimes favoured over input string drive #574, and figure out how to incorporate that into the RFC grammar
  • Decide what to do with the numbers in the ip-addresses of the loose grammar, esp. how to express their allowed range (i.e. on the grammatical level as in RFC3986 or on a semantic level)
  • Preferably, find implementations of the existing RFCs, work with them to implement the additions and have them test against the wpt test suite, to corroborate that the additions can be combined to express the WHATWG behaviour
  • Expand the wpt test suite to include validity tests (!!)
  • Write about the encoding-normal form, parameterise it by component-dependent character sets, so that the percentEncodeSets of the WHATWG standard can be plugged into the comparison ladder nicely.
  • For the WHATWG standard: decide if a precomposed version of the 'basic-url-parser' should be kept or if it should be split up. It may be possible to automatically generate a precomposed version from an implementation of the elementary operations, and to also automatically generate the pseudocode from that.

Let's get started!

@alwinb
Contributor

alwinb commented Aug 31, 2022

the result is not as clean or clear as people may be hoping for, and it needs to be supplemented with various hacks to coerce it into the correct result. The web grew organically, not systematically, and URLs certainly do show that.

This is not true: there is a logic to things that grow organically. There is some noise and randomness, but not nearly as much as you think. The challenge is to uncover it, rather than to create it.

The WHATWG URL standard can be very neatly characterised, it really isn't bad.

@annevk
Member

annevk commented Aug 31, 2022

@bagder what do you mean with "does not cover"? It covers all of them.

@aucampia

Not in the IETF's own sense of obsoletion, but it can indeed obsolete the IETF RFCs in the plain English sense of the word:

no longer produced or used; out of date.
"the disposal of old and obsolete machinery"

It technically can but I don't see how it has done this.

Your claim may have some basis if we only consider browsers, but I don't just consider browsers, and from what I can see the only real effect this effort has had is that it bifurcated the concept of a URL while introducing even further confusion by setting unattainable goals like "Standardize on the term URL. URI and IRI are just confusing." — which is not a goal that has been attained, nor one that seems likely to be attained.

@masinter

My note cites Sam @rubys tweet. Has there been much progress getting browsers to implement this spec (7 years?)?

@becarpenter

Hi Larry,
I have no comments whatever here on "drop#" but I want to reply to your question as if it was about RFC6874 (except that has been pending for 9 years). No, it has not been implemented and for good reason: it specified something essentially unimplementable. We did a bad job. Hopefully, 6874bis is completely implementable. As was said upthread, it will only become running code if the browser community wants it, and that won't happen because the IETF says so, or even because WHATWG or @alwinb says so, but entirely because a substantial set of browser users say "we need this". All we're trying to do with 6874bis is place a clearly specified marker.
And I think that goes for every single feature of URI/IRI syntax. With all respect for @alwinb's effort, neither perfect ABNF nor his spec ultimately counts, compared to user requirements and running code.

@masinter

@becarpenter My comment about implementation was with respect to

In case it is of interest, I also automated tests of the spec against each of the 4 browser bases at the time (IE was still relevant) as well as other languages and libraries. I lost interest shortly after when... I was able to identify 4 tests, one for each browser, where the other three browsers agreed on the result and the odd browser was different. I posted the results and solicited feedback from each browser and got... crickets. No response whatsoever. To say that the spec aspired to be implemented is too generous in my opinion. The browsers at the time didn't see URL/URI/whatever compatibility as a priority.

The problem is that traditionally, the URL has been a user interface element AND a protocol element (a dessert topping AND a floor wax). But URLs are hidden more and more.

If the implementation adoption of this living standard hasn't moved much in 7 years, it is hard to believe the aspirational goals should be retained. If this were a spec for things that look like URLs that you can type in the address line when you don't mean to search, maybe. But the folks demanding more compatibility are used to URLs breaking anyway.
People building APIs based on URLs and curl and wget etc need a clear simple spec. WHATWG should treat URLs more like HTTP.

@domenic
Member

domenic commented Aug 31, 2022

There has indeed been significant convergence over the last 7 years. Most notably, WebKit/Safari is shipping a fully-compliant implementation. Server-side runtimes are probably the next most prominent; all notable JavaScript-based ones are compliant, and I believe I've heard of progress with others (e.g. Go, Java, PHP) but I am not as in-tune with those communities. Finally, Gecko/Firefox and Chromium/Chrome have made a good deal of progress through the years; although they're definitely not fully compliant yet, the trend is positive and both are on board with the project, and just haven't bumped it to the top of anyone's priority list. (E.g., Chromium saw a big jump in compliance when an external contributor took the time to submit improvements.)

@becarpenter

@masinter said

the URL has been a user interface element AND a protocol element

Indeed. We make exactly this point at https://www.ietf.org/archive/id/draft-ietf-6man-rfc6874bis-02.html#section-4-1 . IMNSHO the mania for mixing search and links in the same dialogue box has been disastrous and made the browser UI incomprehensible to normal brains. But that is a lost battle.

@bagder

bagder commented Sep 1, 2022

@bagder what do you mean with "does not cover"? It covers all of them.

Sorry. I was wrong.

I thought I read that (and maybe I did, but it has been edited out since). Or maybe I was just drunk. Still, the document mentions many schemes, but only ones that are or were ever supported by browsers.

I don't think the document is very clear about this fact (possibly caused by assumptions).

@alwinb
Contributor

alwinb commented Sep 1, 2022

What happens if you parse any of the schemes listed at https://en.m.wikipedia.org/wiki/List_of_URI_schemes with the WHATWG parser?

Is the way in which non-special URLs are parsed so safe and generic that it won’t cause problems with any of them?

@alwinb
Contributor

alwinb commented Sep 1, 2022

I’m sorry, I actually wanted to say something in support of the WHATWG effort. Because it does provide for the needs of browsers, and the IETF never did that. There is progress on getting browsers aligned. So I don’t agree with the suggestion about authority and power; it doesn’t match the situation here.

@annevk
Member

annevk commented Sep 1, 2022

Is the way in which non-special URLs are parsed so safe and generic that it won’t cause problems with any of them?

Yup, apart from certain edge cases it should mostly match the RFC. It used to be worse and mostly divide the string up into a scheme and path, but that was a long time ago. (And even then it wouldn't really be incompatible, as most schemes require post-processing of some kind anyway. You just end up with a different division of labor.)

@johnwcowan

the WHATWG effort [...] does provide for the needs of browsers, and the IETF never did that. There is progress on getting browsers aligned.

Quite so, and so monopoly (oligopoly) power is advanced. "People of the same trade seldom meet together, even for merriment and diversion, but the conversation ends in a conspiracy against the public, or in some contrivance to raise prices," as Adam Smith said. The IETF is at most a conspiracy of individuals, and is not the agent of any such public harm.

@karwa
Contributor

karwa commented Sep 1, 2022

Let's try to avoid being overly dramatic, and assume that everybody participating in this process does so in good faith.

Nobody is going to get rich and powerful from a URL standard. Efforts like this arise from thoroughly boring technical considerations - data which is not processed consistently by applications, and a desire from engineers to ensure they are brought in to alignment.

To quote a (now 6 years old) blog post from @bagder (emphasis added)

A “URL” given in one place is certainly not certain to be accepted or understood as a “URL” in another place.

👉 Not even curl follows any published spec very closely these days, as we’re slowly digressing for the sake of “web compatibility”. 👈

In other words: web compatibility is more important than conformance to any particular RFC, otherwise they wouldn't deviate for its sake. That's as clear a statement as you're going to find about why the IETF RFCs have failed. It is simply impractical to take a puritan stance; pragmatism and compromise are necessary.

two things come to mind that we do in curl that aren’t RFC3986 (apart from adding support for one to three slashes): we escape-encode incoming spaces and we percent-encode > 7bit inputs, 👉 to act more like browsers 👈. (Primarily important for redirects.)

The same is true for Python's urllib, which as mentioned departs from the IETF RFCs to strip tabs and newlines as the WHATWG standard does - essentially conforming to neither standard and creating their own URL dialect. The maintainers are aware of this (it was mentioned when they introduced that change), and it seems that they want to fully align with the WHATWG standard eventually (python/cpython#88049).
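
(For example — on Python releases that include the change referenced above; the exact versions vary — a quick illustration of that urllib behaviour:)

from urllib.parse import urlsplit

# Tab, CR and LF are removed from the input before splitting, much like the
# WHATWG preprocessing step, so the embedded tab and newline below vanish.
print(urlsplit("https://exam\tple.org/a\nb"))
# SplitResult(scheme='https', netloc='example.org', path='/ab', query='', fragment='')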

User reports from my own library (WebURL for Swift) suggest the same thing: that web compatibility is what people care about, more than conformance to RFC-this-or-that. Mobile apps and servers often consume poorly-formatted URLs prepared by content editors or embedded in social media posts, and there is an expectation that if it works in a browser, it should work with all other software, as well. Browser behaviour is the de facto standard.

This effort is an attempt to sort that mess out - to define what "web compatibility" even means, and to produce a specification which truly can claim to describe URLs on the web. If we accept that web compatibility trumps formal compliance to any particular document (and I think that much is clear), then I simply do not understand the logic in all of this hostility and political bickering.

What's the worst that could happen from a bit of constructive engagement instead?

@aucampia

aucampia commented Sep 1, 2022

User reports from my own library (WebURL for Swift) suggest the same thing: that web compatibility is what people care about, more than conformance to RFC-this-or-that.

I'm sure people mostly care about compatibility, but the robustness principle has well-known hazards [draft-iab-protocol-maintenance-05]. Maybe these hazards don't apply to some, but that does not mean they don't exist.

I want to understand and communicate what constitutes a valid URI, and I don't want to do this exclusively in code translated by hand from a natural-language specification, but also in discourse. What this spec offers is not meeting my need, and I feel a grammar would help — which brings us back to the topic at hand. This issue is not about whether the goals this standard has not attained are attainable, nor about what users in the abstract really want; it is about a succinct grammar for valid URL strings, which is what I want and what others in this thread want.

Given the WHATWG has so far made it clear they are not interested in defining a grammar for WHATWG URLs I don't understand why this issue has not been closed.

@alwinb
Contributor

alwinb commented Sep 1, 2022

Yup, apart from certain edge case it should mostly match the RFC

What if you start parsing them against a base URL? I mean, at a glance at least some are incompatible.
And their standard texts? I assume most of them reference specific parts of 3986. I'd be very surprised if they can use the standard as a drop-in replacement for the RFCs.

I can check for myself, of course.

@alwinb
Contributor

alwinb commented Sep 1, 2022

Regarding this issue, there may be other options.

First, I suspect that the people who are asking for a formal grammar would also be somewhat satisfied with a short, concise, and easy-to-follow parsing algorithm; one that is separate from resolution and normalisation.

It is also possible to specify the parser with a little bit of code that drives a DFA. Then you have a relatively small table and a bit of code that together define the parser. It is not BNF, but it may be quite clear. I can try and show that; if it is not adopted, then at least it may be fun.
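
To illustrate the shape of that (a toy sketch in Python, not the spec's state machine — it only splits an absolute URL into five parts and ignores all the scheme-special and file-URL cases):

from typing import Dict, Tuple

SCHEME, AUTH, PATH, QUERY, FRAG = "scheme", "authority", "path", "query", "fragment"

# Transition table: (state, character) -> (next state, keep the character?).
TABLE: Dict[Tuple[str, str], Tuple[str, bool]] = {
    (SCHEME, ":"): (PATH, False),   # end of scheme; "//" is handled in the driver
    (AUTH, "/"): (PATH, True),      # first "/" after the host starts the path
    (AUTH, "?"): (QUERY, False),
    (AUTH, "#"): (FRAG, False),
    (PATH, "?"): (QUERY, False),
    (PATH, "#"): (FRAG, False),
    (QUERY, "#"): (FRAG, False),
}

def split_url(s: str) -> Dict[str, str]:
    parts = {SCHEME: "", AUTH: "", PATH: "", QUERY: "", FRAG: ""}
    state, i = SCHEME, 0
    while i < len(s):
        c = s[i]
        if (state, c) in TABLE:
            state, keep = TABLE[(state, c)]
            if state == PATH and not keep and s[i + 1:i + 3] == "//":
                state, i = AUTH, i + 2   # skip "//" and start collecting the authority
            elif keep:
                parts[state] += c
        else:
            parts[state] += c
        i += 1
    return parts

# split_url("https://example.org/p?q#f")
# -> {'scheme': 'https', 'authority': 'example.org', 'path': '/p', 'query': 'q', 'fragment': 'f'}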

@alercah
Author

alercah commented Sep 1, 2022

As the one who started this thread, I'd like to hook on this bit:

a short, concise, and easy to follow parsing algorithm; one that is separate from resolution and normalisation.

Indeed this would go a long way. One useful property of a BNF-style grammar is that it allows people to develop their own operations over URLs without needing a full implementation. For instance, answering the question "is this a valid URL?" should not require normalisation, resolution, editing, or any knowledge of the various pieces of a URL beyond the basic syntax. It should be extricable. And it should especially be extricable from permissive parsing fallback that attempts to correct invalid URLs and convert them to valid ones.
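
As a point of comparison, RFC 3986 Appendix B already shows this kind of extricability for component splitting (not validation): a single regular expression pulls out the components without any normalisation or resolution. In Python:

import re

# The component-splitting regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

m = URI_RE.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")
# Per Appendix B, groups 2, 4, 5, 7 and 9 are scheme, authority, path, query, fragment.
print(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
# -> http www.ics.uci.edu /pub/ietf/uri/ None Related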

My core view is that the specification, whichever one or ones end up being the future, should provide tools to help us understand URLs and the sort of operations one might do on them, including how to extend them. The WHATWG specification feels to me less like a specification defining what a URL is and how to use it than an implementation guide on how to write a particular kind of programming interface for working with URLs.

@masinter

masinter commented Sep 1, 2022

It would be great if, along the way, the browser and URL library/utility folks could settle on a way to represent IPv6 addresses in URLs, no matter what organizations publish what kind of specs.

Opinions on https://www.ietf.org/archive/id/draft-ietf-6man-rfc6874bis-02.html soon would be helpful.
This would be new behavior, but it seems more practical to not wait until users demand it.

@becarpenter

@masinter: exactly, thank you. The requirement for rfc6874bis has come up as an issue against Firefox at https://bugzilla.mozilla.org/show_bug.cgi?id=700999 (and other browsers) from people who actually need it, and the reason for codifying it in an RFC is purely pragmatic. Obviously what really counts is getting it into code because it's useful, not because it's been written down in ABNF. If it gets added to the WHATWG grammar or @alwinb's grammar, that's great too.
And if anyone wants to comment on rfc6874bis, it will be up for IETF Last Call fairly soon, but comments are always welcome at ipv6@ietf.org.

@domenic
Member

domenic commented Sep 2, 2022

It's unclear if people here are aware, but the spec already gives a succinct grammar-like definition of "valid URL": see https://url.spec.whatwg.org/#url-writing . That is what the first reply comment in this thread was asking about.

Note that, as explained in https://url.spec.whatwg.org/#writing , this is unrelated to what URLs are accepted by software, because software needs to parse URLs, not just perform the string -> boolean validation algorithm which grammars enable. But it might be useful or interesting for some of the people who have been commenting recently; I'm not sure.

(However, there's a long-standing issue that the spec actually provides two ways of determining whether an input string is valid or not, and it seems unlikely to me that they are the same... I've just opened a tracking issue for that at #704.)

@alwinb
Contributor

alwinb commented Sep 2, 2022

this is unrelated to what URLs are accepted by software, because software needs to parse URLs, not just perform the string -> boolean validation algorithm which grammars enable.

Just to be sure: grammars do enable parsing. There’s an introduction to that in the section on Derivations and syntax trees on Wikipedia. I have always assumed that everybody here knows this.

@alercah
Author

alercah commented Sep 2, 2022

@domenic you're quite right, please forgive me for getting lost in the thread. But yes, while I did say that the current text does seem to have something resembling a grammar, because it does, when I was first looking into it I found it quite difficult to grok in the current presentation when I wanted to validate whether an implementation appeared to be following the spec. And that was certainly exacerbated by the issues you referred to in #704.

@becarpenter

If an implementation is syntax-driven, like a recursive descent compiler, you can be certain that it follows the grammar, which is really what @alwinb just implied is the benefit of a syntax-driven parser. It can still have broken semantics, though.
Do any of the major browsers work that way?

@alwinb
Contributor

alwinb commented Sep 7, 2022

@becarpenter This standard does not contain anything grammar-like for “parsable URLs”.

So even if someone wanted to do a grammar driven parser, there would not be a grammar to base it on (unless they are willing to use mine).

For context: The parser accepts URL-strings that are not valid according to the standard, leading to the split concepts of valid and parsing-valid.

I have seen the assumption that writing a grammar for parsing-valid URLs could not be done because error correcting could not be expressed within a grammar.

That is not the case. There are only minor differences between valid URLs and parsing-valid URLs.
The latter supports a larger base alphabet within the components. Furthermore, with drive letters, e.g. file://c:/ is parsed as file:///c:/, and invalid percent escape sequences and the use of backslashes are tolerated in parsing-valid URLs. The role of backslashes is scheme-dependent, though. Credentials are considered invalid, but you could enforce that at the semantic level.

⚠️ Special URLs that do not have an authority are also considered invalid URLs, but it is essential to enforce that during resolution, rather than at the grammatical level – this is so because they can be used as a relative reference.

(Also relevant to #704)

@annevk annevk changed the title Provide a succinct grammar for valid URL strings Provide a grammar for the URL parser Mar 6, 2023
@annevk annevk added editorial Changes that do not affect how the standard is understood and removed topic: validation Pertaining to the rules for URL writing and validity (as opposed to parsing) labels Mar 6, 2023