Provide a grammar for the URL parser #479
Just to be sure, you're saying you prefer something like […] (or whatever grammar format) to the spec's current […]? |
Apologies for the slow reply. I think that @masinter is right, and that my concern matches the ones discussed there. I skimmed those threads, and a few statements concerned me, such as the assertion that a full Turing machine is required to parse URLs: this would very much surprise me; my instinct on reading the grammar is that, once you separate out the different paths for […]

I discussed with a colleague of mine, @tabatkins, and they said that the CSS syntax parser was much improved when it was rewritten from a state-machine-based parser into a recursive-descent parser. Doing this would effectively require writing the URL grammar out as a context-free grammar, which would make providing a BNF-like specification, even if it's only informative, very easy.

Separately, but not entirely, splitting the parsing out from the semantic functions (checking some validity rules, creating the result URL when parsing a relative URL string) would likely improve the readability of the spec and the simplicity of implementing it. I think this might be better suited for a separate thread, though, as there are some other thoughts I have in this vein as well. |
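To illustrate the recursive-descent idea mentioned above, here is a minimal sketch (not the spec's algorithm; the cursor and function names are invented for this example, and only RFC 3986's scheme rule is shown):

```python
# A minimal recursive-descent sketch (illustrative only, not the spec's
# algorithm): one function per grammar rule, consuming input via a cursor.
import string

SCHEME_FIRST = set(string.ascii_letters)
SCHEME_REST = SCHEME_FIRST | set(string.digits) | set("+-.")

class Cursor:
    def __init__(self, s: str):
        self.s, self.i = s, 0

    def peek(self):
        return self.s[self.i] if self.i < len(self.s) else None

def parse_scheme(c: Cursor) -> str:
    # scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   -- RFC 3986
    start = c.i
    if c.peek() not in SCHEME_FIRST:
        raise ValueError("expected a scheme")
    while c.peek() in SCHEME_REST:
        c.i += 1
    return c.s[start:c.i]

def parse_url(c: Cursor) -> dict:
    scheme = parse_scheme(c)
    if c.peek() != ":":
        raise ValueError("expected ':' after scheme")
    c.i += 1
    # parse_authority, parse_path, ... would follow the same pattern.
    return {"scheme": scheme, "rest": c.s[c.i:]}

print(parse_url(Cursor("https://example.com/")))
# -> {'scheme': 'https', 'rest': '//example.com/'}
```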
This might be a more complicated problem than you think (@alercah). I have tried several times, but the scheme-dependent behaviour causes a lot of duplicate rules, so you end up with a grammar that is neither very concise nor easy to read. And there is a tricky problem with repeated slashes before the host, the handling of which is base-URL dependent. I have some notes on it here. (I eventually went with a hybrid approach of a couple of very simple grammars and some logic rules in between.) This ties into a model of URLs that I describe here.

What's the status of this? It really does work. I developed the theory when I tried to write a library that supports relative URLs. I am quite confident that it matches the standard (though not everything is described in the notes), as the library now passes all of the parsing tests. |
It's not supported by vanilla BNF, but I would personally be quite satisfied with a grammar taking parameterized rules like you have there. Many modern parser generators can handle them; for those that cannot, it is relatively easy to split out the (small number of) parameters here. |
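As a sketch of what "splitting out the parameters" could look like, here is a toy expander that instantiates parameterized rules into plain BNF; the rule format and names are invented for illustration:

```python
# A toy expander that instantiates parameterized grammar rules into plain
# BNF, once per combination of parameter values. The rule format is
# invented for this illustration.
from itertools import product

def expand(name: str, params: dict, body: str) -> list:
    rules = []
    for values in product(*params.values()):
        binding = dict(zip(params.keys(), values))
        lhs = name + "".join(f"-{v}" for v in values)
        rhs = body
        for p, v in binding.items():
            rhs = rhs.replace(f"<{p}>", f"-{v}")
        rules.append(f"{lhs} ::= {rhs}")
    return rules

# E.g. a host rule whose shape depends on whether the scheme is "special":
for rule in expand("host", {"s": ["special", "non-special"]},
                   "opaque-host<s> | ip-address"):
    print(rule)
# host-special ::= opaque-host-special | ip-address
# host-non-special ::= opaque-host-non-special | ip-address
```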
It would be useful to start with the BNF of RFC 3986 and make changes as necessary, at least to explain the differences and exceptions. |
Interesting, I did not know that there were parser generators that support parameterised rules. I did consider a more formal presentation with subscripted rules, but I backed off because I thought it would be less accessible. It makes me think of higher-order grammars, and I think that's too heavy. I guess in this case it could result in something quite readable, though.

As for the comparison with RFC 3986, it would be great if this can help to point out the differences. I have not looked into that much, but the good news is that it might not be that different, after all. |
Back to the issue: the question is how this could flow back to the WHATWG standard, and I am not really sure how that would work yet. The parser algorithm seems to be the heart of the standard, and I think there is a lot of work behind it. There is of course the section on URL writing, which does look like a grammar in prose style.

To be clear, what I tried to do, and what I suspect people in this thread (and similar ones) are after, is not to give a grammar for valid URL strings (like in the URL-writing section), but to give one that describes the language of URLs that is implicitly defined by the parser algorithm – and in such a way that it also describes their internal structure. Then the grammar contains all the information that you need for building a parser. This is indeed possible, but it is a large change from the standard as it is now. |
I think, in fact, that there are people who have been involved with discussions of this who have actually been hoping for a formal grammar for valid URL strings — in some kind of familiar formalism rather than in the prose style the spec uses. (And to be clear, I'm not personally one of the people who wants that — but from following past discussions around this, I can say I'm certain that's what at least some people have been asking for.)
I know that's what some people want, but I think, as pointed out in #479 (comment) (and #24 and other places), there are some serious challenges in attempting to write such a grammar. And as has also been pointed out in #24 and elsewhere, nothing prevents anybody who wants such a grammar from attempting to write it up themselves, based on the spec algorithms — but short of that happening, nobody else involved with the development of the spec is volunteering to try to write it up. |
The technical issues are mostly solved. I'm willing to help, and I'm looking for some feedback about how to get started. I cannot just write a separate section, because in one way or another it'll compete with the algorithm for normativity (among other things). That's a big problem, and I think it is the main reason to resist this. It also requires specifying the parse tree and some operations on it. That could use the (internal) URL records, but it will require some changes to them. My main concern is that these things together will trigger too much resistance, too many changes, and that the effort will then fail. What I can do is try to sketch out some approaches that could help to prevent that. I'll need some time to figure that out. |
Yes — and in other cases in WHATWG specs where there’s been discussion about formalisms for similar cases (e.g., some things in the HTML spec), that rationale (people will end up treating the extra formalism as normative) has been a strong argument against including such formalisms in the specs. For this case, a grammar could be maintained in a separate (non-WHATWG) repo, and published separately — and then the spec could possibly (non-normatively) link to it (not strictly necessary, but just to help provide awareness it exists). |
Agreed with @sideshowbarker, generally. If people want to work on personal projects that provide alternative URL parser formalisms, that's great, and I'm glad we've worked on a test suite to help. As seen from this thread, some folks might appreciate some alternatives more than they appreciate the spec, and so it could be helpful to such individuals. But the spec is good as-is. |
There are issues with it that cause further fragmentation right now. I have to say I'm disappointed with this response. I'm trying to help out and solve issues, not just this one but also #531 and #354, amongst others, which cannot be solved without a compositional approach. If you do not address that, people come up with ad hoc solutions, creating new corner cases, leading to renewed fragmentation. You can already see this happening in some of the issues.

It is also not true that it cannot be done, because I already did it: once for my library, and a couple of weeks ago I made a fork of jsdom/whatwg-url over a weekend that uses a modular parser/resolver based on my notes, has everything in place to start supporting relative URLs as well, and passes all the tests. I didn't post about it because the changes are too large; clearly it would not work out. I'm trying to take these concerns into account and work with them. Disregarding that with "things are fine" is, I think, a shame. |
While I unfortunately do not have the time to contribute to any work on this at the moment, I have a few thoughts.
To elaborate a bit, I very much disagree with the claim that "the spec is good as is". The spec definitely provides an unambiguous specification with enough information to determine whether or not an implementation meets it. This is enough to meet the bare minimum requirements and be an adequate technical standard. But it has a number of flaws that make it difficult to use in practice: […]

It is worth noting that this specification explicitly intends to obsolete RFC 3986. RFC 3986 is a confusing mix of normative and informative text, and a difficult specification to apply and use. Yet this specification is far from being able to obsolete it, because it is targeted entirely at one application domain.

In conclusion, this spec is a PHP Hammer. It is not "good". It is barely adequate in the one domain it chooses to support, and abysmal in any other domain. If the direction of this standard can't reasonably be changed (assuming there are people willing to put in the effort), and in particular if the WHATWG is not interested in addressing other domains in this specification, then I would be fully supportive of an effort, likely through the IETF's RFC process, to design a specification which actually does replace RFC 3986, and to have the WHATWG spec recognized only as the web standard for the implementation of that domain-agnostic URL specification. I will probably direct any energy I do find myself with to address this spec to that project rather than this one. |
To be clear, there’s nothing semi-normative about that https://url.spec.whatwg.org/#url-writing section. It’s normative.
And to be clear about that: The test suite is not normative.
The section on writing URLs doesn't claim it matches the parser. Specifically: there are known URL cases that the writing-URLs section defines as non-conforming — as far as documents/authors being prohibited from using them — but which have normative requirements that parsers must follow if documents/authors use them anyway.
While some people may treat the test suite as authoritative for some purposes, it’s not normative. In the URL spec and other WHATWG specs, normative is a term of art used consistently with an exact unambiguous meaning: it applies only to the spec, and specifically only to the spec language that states actual requirements (e.g., using RFC 2119 must, must not, etc., wording). The test suite doesn’t state requirements; instead it tests the normative requirements in the spec. And if the test suite were to test something which the spec doesn’t explicitly require, then the test suite would be out of conformance with the spec.
The algorithm doesn’t define whether a URL is valid or not; instead the algorithm defines how a URL must be processed, whether or not the https://url.spec.whatwg.org/#url-writing section defines that URL as valid/conforming. |
Note also that the URL spec has multiple conformance classes for which it states normative requirements; its algorithms state one set of requirements for parsers as a conformance class, and separately, the https://url.spec.whatwg.org/#url-writing section states a different set of requirements for documents/authors as a conformance class. |
I'm well aware that the test suite is not normative, and that the writing spec is normative, and of the use of "normative" as a term of art. But you said:
You claimed that people treating the extra formalism as normative is an argument against its inclusion, not that it would create two potentially-contradictory normative texts. By the same argument, you should remove the URL-writing section, because it risks being treated as normative, and consider retiring the test suite as well, because people treat it as normative when the spec itself is incomprehensible. I don't think you should remove either of them. I think you should make the spec comprehensible, so that people stop being tempted to treat something else as normative.
I agree that it does not claim to produce invalid URLs. It does, however, make a claim that the operation of serialization is reversible by the parser: […]

Admittedly, this claim is rather suspect, because it then provides many examples of where that is not true. I suspect it is missing some qualifiers, such as that the serialization must succeed and the parsing must be done with no base URL and no encoding override. Even with those qualifiers added, I challenge you to produce a formal proof that serialization followed by parsing produces an equal URL. |
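For illustration, a minimal sketch of checking (rather than proving) that round-trip property; `parse` and `serialize` here are hypothetical stand-ins for an implementation under test:

```python
# A sketch of checking (not proving) the round-trip claim discussed above.
# `parse` and `serialize` are hypothetical stand-ins for an implementation
# under test; the missing qualifiers (no base URL, no encoding override)
# are assumed to be baked into `parse`.

def serialization_round_trips(url_string: str, parse, serialize) -> bool:
    record = parse(url_string)               # may raise on unparsable input
    once = serialize(record)
    return serialize(parse(once)) == once    # re-parsing must be stable
```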
Thank you @alercah. I feel validated by the statement that I have been running a fool's errand. It is nice that someone understands the issues and the amount of work involved. The only reason I pushed through was that I had made a commitment to myself to finish this project. |
RFC 3986 is probably what you want. |
No, I want an end to the situation where an essential building block of the internet has two incompatible specifications that cannot be unified, because the one that describes the behaviour of web browsers does so with a long stretch of convoluted pseudocode describing a monolithic function that mixes parsing with normalisation, resolution, percent-encoding, and updates to URL components. Indeed, an update to RFC 3986 to include browser behaviour would be really, really great. Unfortunately, that requires reverse-engineering this standard. |
I tried over many years to resolve this issue, as @sideshowbarker @rubys @royfielding @duerst can attest. |
Work has started. This is going to happen. Stay tuned. |
There is a GitHub project page for a rephrased specification here. Whilst still incomplete, it is coming along quite nicely. It will not be hard to add a normative section on browser behaviour to e.g. RFC 3986/RFC 3987 once this is finished. The differences are primarily around the character sets and the multiples of slashes before the authority; the latter is taken care of by the forced resolution described in the section on Reference Resolution. This also means that it will be possible to add a section to the WHATWG Standard that accurately describes the differences from the RFCs. |
Following up on this:
I have done more research, especially around the character sets, making some tools to compute the differences. These are my findings. I will follow up with a post about other, minor grammar changes and reference resolution. The differences will be very small, after all is said and done. Which is great!

Character Sets

IRI vs WHATWG URL

The codepoints allowed in the components of valid WHATWG URLs are almost the same as in RFC 3987 IRIs. There is only one difference.

Specifically, the WHATWG Standard allows the additional codepoints: […]

Specials are allowed in the query part of an IRI, though not in the other components.

IRI vs loose-WHATWG URL

Let me call any input that the 'basic url parser' accepts as a single argument a 'loose-WHATWG URL'. Note: the IRI grammar does not split the userinfo into a username and password, but RFC 3986 (URI) suggests in 3.2.1 that the first ':' separates the two.

To go from IRIs to loose-WHATWG URLs, allow any non-ASCII unicode code point in components, and a number of additional ASCII characters as well. Let's define iinvalid:

iinvalid := { u+0-u+1F, […] }

Then, for the components: […]

The grammar would have to be modified to allow invalid percent-escape sequences: a single '%'. Note that the WHATWG parser removes tabs and newlines { u+9, u+A, u+D } in a preprocessing pass, so you may choose to exclude those from the iinvalid set. Preprocessing also removes leading and trailing sequences of { u+0-u+20 } (aka c0-space), but it's not a good idea to try and express that in the grammar. |
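A sketch of that preprocessing pass as described above (simplified; the spec's own text is normative):

```python
# The preprocessing pass described above, as a sketch (the spec's text is
# normative): trim leading/trailing C0 controls and space, then remove all
# ASCII tab or newline.
C0_AND_SPACE = "".join(chr(c) for c in range(0x21))   # U+0000 .. U+0020

def preprocess(input_string: str) -> str:
    trimmed = input_string.strip(C0_AND_SPACE)
    return "".join(ch for ch in trimmed if ch not in "\t\n\r")

assert preprocess("  ht\ntp://example.com \t") == "http://example.com"
```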
I've suggested a BOF session at IETF 111, which will be held online, to consider what changes to IETF specs would, in conjunction with WHATWG specs, resolve this issue. |
In case a new IETF effort does get started, I just want to state that I hope, and actually believe, that a new or updated IETF specification and the WHATWG URL standard could complement each other quite well. It will require work and there will be problems, but it is possible and worthwhile. |
@alwinb Nothing in IETF will happen unless you show up with friends willing to actually do the work. |
So here's some work to be done.
Let's get started! |
This is not true: there is a logic to things that grow organically. There is some noise and randomness, but not nearly as much as you think. The challenge is to uncover it, rather than to create it. The WHATWG URL standard can be very neatly characterised, it really isn't bad. |
@bagder what do you mean by "does not cover"? It covers all of them. |
It technically can, but I don't see how it has done this.
Your claim may have some basis if we only consider browsers, but I don't just consider browsers, and from what I can see the only real effect this effort has had is that it bifurcated the concept of a URL while introducing even further confusion by setting unattainable goals like "Standardize on the term URL. URI and IRI are just confusing.", a goal that has not been attained and does not seem likely to be. |
Hi Larry, |
@becarpenter My comment about implementation was with respect to […]
The problem is that traditionally, the URL has been a user interface element AND a protocol element (a dessert topping AND a floor wax). But URLs are hidden more and more. If the implementation adoption of this living standard hasn't moved much in 7 years, it is hard to believe the aspirational goals should be retained. If this were a spec for things that look like URLs that you can type in the address line when you don't mean to search, maybe. But the folks demanding more compatibility are used to URLs breaking anyway. |
There has indeed been significant convergence over the last 7 years. Most notably, WebKit/Safari is shipping a fully-compliant implementation. Server-side runtimes are probably the next most prominent; all notable JavaScript-based ones are compliant, and I believe I've heard of progress with others (e.g. Go, Java, PHP) but I am not as in-tune with those communities. Finally, Gecko/Firefox and Chromium/Chrome have made a good deal of progress through the years; although they're definitely not fully compliant yet, the trend is positive and both are on board with the project, and just haven't bumped it to the top of anyone's priority list. (E.g., Chromium saw a big jump in compliance when an external contributor took the time to submit improvements.) |
@masinter said
Indeed. We make exactly this point at https://www.ietf.org/archive/id/draft-ietf-6man-rfc6874bis-02.html#section-4-1 . IMNSHO the mania for mixing search and links in the same dialogue box has been disastrous and made the browser UI incomprehensible to normal brains. But that is a lost battle. |
Sorry, I was wrong. I thought I read that (and maybe I did, but it has been edited out since). Or maybe I was just drunk. Still, the document mentions many schemes, but only ones that are or were supported by browsers. I don't think the document is very clear about this fact (possibly caused by assumptions). |
What happens if you parse any of the schemes listed at https://en.m.wikipedia.org/wiki/List_of_URI_schemes with the WHATWG parser? Is the way in which non-special URLs are parsed so safe and generic that it won’t cause problems with any of them? |
I'm sorry; I actually wanted to say something in support of the WHATWG effort. Because it does provide for the needs of browsers, and the IETF never did that. There is progress on getting browsers aligned. So I don't agree with the suggestion about authority and power; it doesn't match the situation here. |
Yup, apart from certain edge cases it should mostly match the RFC. It used to be worse and mostly divided the string up into a scheme and a path, but that was a long time ago. (And even then it wouldn't really be incompatible, as most schemes require post-processing of some kind anyway. You just end up with a different division of labor.) |
Quite so, and so monopoly (oligopoly) power is advanced. "People of the same trade seldom meet together, even for merriment and diversion, but the conversation ends in a conspiracy against the public, or in some contrivance to raise prices," as Adam Smith said. The IETF is at most a conspiracy of individuals, and is not the agent of any such public harm. |
Let's try to avoid being overly dramatic, and assume that everybody participating in this process does so in good faith. Nobody is going to get rich and powerful from a URL standard. Efforts like this arise from thoroughly boring technical considerations: data which is not processed consistently by applications, and a desire from engineers to bring them into alignment. To quote a (now 6-year-old) blog post from @bagder (emphasis added):
In other words: web compatibility is more important than conformance to any particular RFC, otherwise they wouldn't deviate for its sake. That's as clear a statement as you're going to find about why the IETF RFCs have failed. It is simply impractical to take a puritan stance; pragmatism and compromise are necessary.
The same is true for Python's urllib, which as mentioned departs from the IETF RFCs to strip tabs and newlines as the WHATWG standard does - essentially conforming to neither standard and creating their own URL dialect. The maintainers are aware of this (it was mentioned when they introduced that change), and it seems that they want to fully align with the WHATWG standard eventually (python/cpython#88049). User reports from my own library (WebURL for Swift) suggest the same thing: that web compatibility is what people care about, more than conformance to RFC-this-or-that. Mobile apps and servers often consume poorly-formatted URLs prepared by content editors or embedded in social media posts, and there is an expectation that if it works in a browser, it should work with all other software, as well. Browser behaviour is the de facto standard. This effort is an attempt to sort that mess out - to define what "web compatibility" even means, and to produce a specification which truly can claim to describe URLs on the web. If we accept that web compatibility trumps formal compliance to any particular document (and I think that much is clear), then I simply do not understand the logic in all of this hostility and political bickering. What's the worst that could happen from a bit of constructive engagement instead? |
I'm sure people mostly care about compatibility, but the robustness principle has well-known hazards [draft-iab-protocol-maintenance-05]. Maybe these hazards don't apply to some, but that does not mean they don't exist. I want to understand and communicate what constitutes a valid URI, and I don't want to do this exclusively in code that is translated by hand from a natural-language specification, but also in discourse. What this spec offers does not meet that need, and I feel a grammar would help, which brings us back to the topic at hand. This issue is not about whether goals this standard has not attained are attainable, nor about what users in the abstract really want; it is about a succinct grammar for valid URL strings, which is what I want, and what others in this thread want. Given that the WHATWG has so far made it clear they are not interested in defining a grammar for WHATWG URLs, I don't understand why this issue has not been closed. |
What if you start parsing them against a base URL? I mean, at a glance at least some are incompatible. I can check for myself, of course. |
Regarding this issue, there may be other options. First, I suspect that the people who are asking for a formal grammar would also be somewhat satisfied with a short, concise, and easy-to-follow parsing algorithm, one that is separate from resolution and normalisation. It is also possible to specify the parser with a little bit of code that drives a DFA. Then you have a relatively small table and a bit of code that together define the parser. It is not BNF, but it may be quite clear. I can try and show that; if it is not adopted, then at least it may be fun. |
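As a sketch of that table-plus-driver idea (state names invented, absolute URLs with a scheme assumed, no validation or error correction; this is not the WHATWG algorithm):

```python
from collections import defaultdict

# Delimiter transitions: (state, char) -> (next state, text flushed to path).
# States are named after the component being read; the names are invented.
EDGES = {
    ("scheme", ":"):     ("path-start", ""),
    ("path-start", "/"): ("slash", ""),      # maybe "//authority", maybe path
    ("path-start", "?"): ("query", ""),
    ("path-start", "#"): ("fragment", ""),
    ("slash", "/"):      ("authority", ""),  # "//" seen: an authority follows
    ("slash", "?"):      ("query", "/"),
    ("slash", "#"):      ("fragment", "/"),
    ("authority", "/"):  ("path", "/"),
    ("authority", "?"):  ("query", ""),
    ("authority", "#"):  ("fragment", ""),
    ("path", "?"):       ("query", ""),
    ("path", "#"):       ("fragment", ""),
    ("query", "#"):      ("fragment", ""),
}

# Which buffer the ordinary characters of a state accumulate into.
BUFFER = {"path-start": "path", "slash": "path"}

def split_url(url: str) -> dict:
    """Split an absolute URL into scheme/authority/path/query/fragment.
    No validation, percent-decoding, normalisation, or error correction."""
    parts, state = defaultdict(str), "scheme"
    for ch in url:
        if (state, ch) in EDGES:
            state, flushed = EDGES[(state, ch)]
            parts["path"] += flushed
        else:
            if state == "slash":             # a lone "/" began the path
                parts["path"] += "/"
                state = "path"
            parts[BUFFER.get(state, state)] += ch
    if state == "slash":
        parts["path"] += "/"
    return dict(parts)

print(split_url("http://example.com/a/b?q#frag"))
# {'scheme': 'http', 'path': '/a/b', 'authority': 'example.com',
#  'query': 'q', 'fragment': 'frag'}
```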
As the one who started this thread, I'd like to hook on this bit:
Indeed, this would go a long way. One useful property of a BNF-style grammar is that it allows people to develop their own operations over URLs without needing a full implementation. For instance, answering the question "is this a valid URL?" should not require normalisation, resolution, editing, or any knowledge of the various pieces of a URL beyond the basic syntax. It should be extricable. And it should especially be extricable from the permissive parsing fallback that attempts to correct invalid URLs and convert them to valid ones.

My core view is that the specification, whichever one or ones end up being the future, should provide tools to help us understand URLs and the sorts of operations one might do on them, including how to extend them. The WHATWG specification feels to me less like a specification defining what a URL is and how to use it than an implementation guide on how to write a particular kind of programming interface for working with URLs. |
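One concrete example of such an extricable tool is the non-validating reference-splitting regular expression from RFC 3986, Appendix B, which recovers the five components without any normalisation or resolution (it does not validate):

```python
import re

# The reference-splitting regular expression from RFC 3986, Appendix B.
# It extracts the five components without validating, normalising, or
# resolving anything; one example of the kind of extricable tool meant above.
URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)

m = URI_REFERENCE.match("http://example.com/a?b#c")
scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
print(scheme, authority, path, query, fragment)  # http example.com /a b c
```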
It would be great if, along the way, the browser and URL library/utility folks could settle on a way to represent IPv6 addresses in URLs, no matter what organizations publish what kind of specs. Opinions on https://www.ietf.org/archive/id/draft-ietf-6man-rfc6874bis-02.html soon would be helpful. |
@masinter: exactly, thank you. The requirement for rfc6874bis has come up as an issue against Firefox at https://bugzilla.mozilla.org/show_bug.cgi?id=700999 (and other browsers) from people who actually need it, and the reason for codifying it in an RFC is purely pragmatic. Obviously what really counts is getting it into code because it's useful, not because it's been written down in ABNF. If it gets added to the WHATWG grammar or @alwinb's grammar, that's great too. |
It's unclear if people here are aware, but the spec already gives a succinct grammar-like definition of "valid URL": see https://url.spec.whatwg.org/#url-writing . That is what the first reply comment in this thread was asking about. Note that, as explained in https://url.spec.whatwg.org/#writing , this is unrelated to which URLs are accepted by software, because software needs to parse URLs, not just perform the […]

(However, there's a long-standing issue that the spec actually provides two ways of determining whether an input string is valid or not, and it seems unlikely to me that they are the same... I've just opened a tracking issue for that at #704.) |
Just to be sure: grammars do enable parsing. There's an introduction to that in the section on Derivations and syntax trees on Wikipedia. I have always assumed that everybody here knows this. |
@domenic you're quite right; please forgive me for getting lost in the thread. But yes, while I did say that the current text does have something resembling a grammar (because it does), when I was first looking into it I found it quite difficult to grok in its current presentation, when I wanted to validate whether an implementation appeared to be following the spec. And that was certainly exacerbated by the issues you referred to in #704. |
If an implementation is syntax-driven, like a recursive-descent parser, you can be certain that it follows the grammar, which is really what @alwinb just implied is the benefit of a syntax-driven parser. It can still have broken semantics, though. |
@becarpenter This standard does not contain anything grammar-like for "parsable URLs". So even if someone wanted to build a grammar-driven parser, there would not be a grammar to base it on (unless they are willing to use mine). For context: the parser accepts URL strings that are not valid according to the standard, leading to the split concepts of valid and parsing-valid. I have seen the assumption that writing a grammar for parsing-valid URLs could not be done because error correction could not be expressed within a grammar. That is not the case. There are only minor differences between valid URLs and parsing-valid URLs. (Also relevant to #704) |
As an occasional standards-user, the lack of a succinct expression of the grammar for valid URL strings is rather frustrating. It makes it rather difficult to follow what's going on and, in particular, to work out whether a given thing is a valid URL. A grammar in EBNF or a similar form would be greatly appreciated and make this spec significantly easier to understand.