-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hyperlink should treat + in query as %20 #129
Comments
Well, we aren't first. Brief foray into Stack Overflow seems to yield a lack of consensus as to correctness. The Query String Article on Wikipedia (under URL Encoding) suggests this is a thing in HTML, which I read as distinct from a web browser… which makes the desired behavior possibly client-specific. I might be reading the above wrong, and I haven't dug through RFCs for a definitive answer, but it seems like most of the time, replacing |
Oh man, I thought I responded to this, but you basically read my mind @wsanchez. It's a web-ism. Might be worth making a default (https is the default scheme in hyperlink after all). But yeah, I grappled with this a bit in my initial work and left the behavior as it is today. Could make it a flag in |
Ah, okay, so I am not sure that it need even be a flag. Since the hex-encoded form is universal it is reasonable to canonicalize to that. Noting this behavior in the documentation seems enough of a fix to me. |
I seem to have confused myself here, so I'm going to retrace my steps. We're trying to make use of Hyperlink in both client and server-side libraries, with slightly different requirements.
In the server-side use case it would be fine if we could pre-normalize the URL or otherwise cause |
I agree with 1. When you say "querystrings generated by web browsers", what's an example of how you get a web browser to generate a query string? |
|
Right, to confirm this awesomeness, I wrote this script: from textwrap import dedent
from klein import run, route
@route("/")
def home(request):
request.setHeader("content-type", "text/html")
return dedent(
"""
<html>
<head><title>Form</title></head>
<body>
<form action="" method="GET">
<div>
<input type="text" name="name" id="name" required>
</div>
<div>
<input type="submit" value="Submit">
</div>
</form>
</body>
</html>
"""
)
run("localhost", 8080) And ran it, entered "hello, world" into the text field, and the URL I'm sent to in the (Safari) browser's location bar is So… back to @twm's retracing of steps, it does seem like dealing with web browsers requires that one convert those If Hyperlink doesn't do this, then it would seem like Klein should do on on request URIs, but to do so, it would have to tase out the query part and only do this there, which seems more like Hyperlink's domain… If Hyperlink does this… let's say by default for simplicity, then the downside is… that someone put a It remains unclear by way of standards what a client is supposed to do with an un-encoded plus, I guess "what web browsers do" is the next best thing. Though I'm nervous about what I see in the location bar as Trying to take HTML out of the picture, here am in the Safari JavaScript console to replicate the Web API @twm referenced: > url = new URL("http://localhost:8080/?name=hello%2C+world")
< URL {href: "http://localhost:8080/?name=hello%2C+world", origin: "http://localhost:8080", protocol: "http:", username: "", password: "", …}
> url.search
< "?name=hello%2C+world"
> url.searchParams.get("name")
< "hello, world" So the string representation of the URL leaves So I'm not sure whether the >>> url = DecodedURL.fromText("https://localhost/getme?name=hello,%20big+world&value=4321")
>>> url.get("name")
['hello, big+world'] …the API should probably have converted the |
@glyph should weigh in here because he wrote some tests in Klein which apparently agree with @twm about this. |
|
Figuring out how all of WhatWG's spec fits together is extremely tedious, but it does contain this gem:
Which suggests... something. |
(My understanding from @mahmoud's previous statements is that we are preferring WhatWG to the RFCs in hyperlink, as it's somewhat more up-to-date and pragmatic.) |
Oh dang, not sure where I gave that impression*. WhatWG is browser/web-focused, I've been trying to make hyperlink more vanilla, not adopting all the HTTP-adjacent assumptions. So, leaning toward the earlier RFCs, using other prominent URL use cases (e.g., git, db drivers) to temper web influences. I'm not sure what other protocols do with As an aside, WHATWG's "plain English" description of URL parsing is one of my go-to example of how not to document technical specs. Just give us a good reference implementation already. |
No, but, the underlying |
(If you rebuilt the URL via |
So in the event of a mixed URL, the underlying state would track the index of every space and whether it should be serialized as a |
We already do this in the underlying URL. Consider: >>> hyperlink.parse("https://example.com/multiple ways to encode%20spaces").path[0]
'multiple ways to encode spaces' |
@glyph, that's not an ideal example because the special treatment of "+" is, I think, limited to query arguments, not the path. So… to capture the more relevant (I think) current state of affairs: >>> import hyperlink
>>> u = hyperlink.URL.from_text('https://example.com/thing?value=multiple ways+to encode%20spaces')
>>> u
URL.from_text('https://example.com/thing?value=multiple ways%2Bto encode%20spaces')
>>> u.query
(('value', 'multiple ways+to encode%20spaces'),)
>>> du = hyperlink.DecodedURL.from_text('https://example.com/thing?value=multiple ways+to encode%20spaces')
>>> du
DecodedURL(url=URL.from_text('https://example.com/thing?value=multiple ways%2Bto encode%20spaces'))
>>> du.query
(('value', 'multiple ways+to encode spaces'),) @glyph I think you are asserting that Hyperlink current handling is correct, except that That would be achievable, I think, by calling I think @mahmoud's concern (and mine) is that this isn't always correct, even if it's usually correct for a lot of users of Hyperlink, the distinction being "I expect RFC 3986" versus "I think I'm handling an HTML form submission". Of course, this is all about HTML YOLO nonsense screwing up the world, since there's no guarantee (I think?) that code handling a URL would know that it's getting the URL via a form submission versus a non-HTML client. But forms are probably a common enough case that you have to go that route, and really clients should just replace So I'm not sure I'm on board with "DecodedURL should absolutely be converting |
Yep!
So mirror-universe mahmoud quite effectively convinced me that WhatWG should take precedence, since it describes how people are practically using URLs in the vast majority of applications. Who would want behavior derived from a 15-year-old RFC that doesn't accurately describe the behavior of current browsers? What user-base is that? Wikipedia pretty authoritatively states that plus gets encoded to space, citing… w3schools, which I guess is normative now https://en.wikipedia.org/wiki/Query_string#cite_note-w3schools-10 But, for the sake of argument, taking the "standards have a moral weight" position: rfc3986 is updated by rfc7320. rfc7320 section 2.4 concedes:
So it seems that the IETF is on board with query strings being described more or less by HTML, and other URL-modifying languages that support describing forms should just do something else entirely. But the fact that "HTML" describes ultimately leaves the server holding the bag. Which means that HTML implicitly defines the behavior of the HTTP[S] schemes with respect to decoding URLs, since the server might be handling an HTML form post under those schemes. So I feel strongly that the default should absolutely be to decode That said, I do at least have a counterexample of a scheme where this type of processing is not performed: on a mac, |
I'd say "acknowledges the existence of some stupidity in HTML" instead of "concedes" but OK. :-) I'm coming to the conclusion that a flag is not super helpful, and that if you want to ensure that your application's query parts are decoded correctly, never use a I think that means that the damage is limited to query strings, and no other part of URLs. Does it apply to both names and values? The examples I've seen only show I'm having a hard time finding useful things in the WHATWG living document about this, but are HTML form submissions limited to HTTP(S) schemes? The
No, what that suggests is that applications that are not creating URLs form HTML forms are different than applications that are, which is the problem. I don't think the scheme is the relevant difference, except that HTML seems to be limited to HTTP(S). The |
Yeah, same. Let's prefer to encode on the input side with
<!DOCTYPE html>
<html>
<body>
<form method="GET" action="">
<input name="here's a name" value="here's a value">
</form>
</body>
</html> results in
so I'd say, yes, in names as well.
In theory, no, you could have a In practice, I can only test this with a
when a browser GETs the
Yeah, you're right. And while it's practically limited to
It's not supposed to, per the spec, but... it does. |
At this point my opinion is that using an unencoded Further, the "go with HTML" solution is more wrong but also more practical. I can't find anything in the WHATWG spec spaghetti to file a bug against, so I'm giving up on any sort of correctness. That said, either Hyperlink needs to change, or Glyph's test in Klein needs to go away, and that needs to happen soon, because the next Twisted release is going to require Klein to upgrade Treq, which now uses Hyperlink. |
What do you mean by 'both "solutions"'? There are a couple of issues surrounding "+" right now. First is the issue of input preservation.
Next up, there's the question of what I think that it's pretty clear that the right answer by default needs to be |
I'm on board with No answer to the default behavior strikes me as correct, but I've decided not to care because it's broken, don't use un-encoded |
In the branch: >>> parse("/?x=a+b").get('x')
['a b'] Plus the escape hatch:
It looks like someone else already fixed the |
Via twisted/klein#339, Hyperlink seems to be percent-encoding the
+
when parsing the URL:This seems like a bug to me.
+
in the query part should be treated like%20
. For example, this is what the web API does:The text was updated successfully, but these errors were encountered: