Fix percent decoding and normalization #23
…y and uses_netloc. previously ipv6 hosts were being passed through without family and the colons were causing failures. this makes .replace()'s interface match URL.__init__ (and its own docstring!)
…y part truly tested in the roundtripping)
…encodable fields, disabling decoding of restricted delimiter characters as well as the all-important percent sign itself. also brings in the decode_to_bytes function, reducing reliance on the standard library.
Codecov Report
@@ Coverage Diff @@
## master #23 +/- ##
==========================================
+ Coverage 96.73% 96.78% +0.05%
==========================================
Files 6 6
Lines 979 1026 +47
Branches 117 123 +6
==========================================
+ Hits 947 993 +46
Misses 19 19
- Partials 13 14 +1
…called by the former
Everything looks good except for the uses_netloc thing, which seems to be noise (potentially included accidentally?). If you remove that I think it's good to land.
'https://example.com/?a=%26', # ampersand in query param value
'https://example.com/?a=%3D', # equals in query param value
# double-encoded percent sign in all percent-encodable positions:
"http://(%2525):(%2525)@example.com/(%2525)/?(%2525)=(%2525)#(%2525)",
👍
self.assertEqual(test, result)

def test_roundtrip_double_iri(self):
This looks like a place where it might be interesting to drop in hypothesis (since it was in fact hypothesis that caught this bug, in txacme's test suite)
yeah, i'm down for some hypothesis in the future. i played with it a bit on this PR, but haven't delved deep enough yet.
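As a rough illustration of the property such a test could check, here is a sketch that uses the standard library's random module as a plain stand-in for hypothesis (which would add shrinking and smarter input generation). The names restricted_decode, KEEP, and random_encoded_text are hypothetical and are not hyperlink's implementation; the inputs are limited to well-formed ASCII escapes.

```python
import random
import re

_ESCAPE_RE = re.compile(r'%([0-9A-Fa-f]{2})')
KEEP = frozenset(u'/?#%')  # hypothetical: characters whose escapes stay encoded


def restricted_decode(text):
    """Decode ASCII percent-escapes, except those for characters in KEEP."""
    def _replace(match):
        code = int(match.group(1), 16)
        if code > 0x7F or chr(code) in KEEP:
            return match.group(0)  # leave the escape exactly as written
        return chr(code)
    return _ESCAPE_RE.sub(_replace, text)


def random_encoded_text(rng, length=8):
    """Build well-formed input: every '%' is followed by two hex digits."""
    pieces = []
    for _ in range(length):
        if rng.random() < 0.5:
            pieces.append(u'%%%02X' % rng.randrange(0x80))
        else:
            pieces.append(rng.choice(u'abc25'))
    return u''.join(pieces)


def check_decode_is_idempotent(trials=1000, seed=0):
    """Property: decoding an already-decoded string changes nothing."""
    rng = random.Random(seed)
    for _ in range(trials):
        text = random_encoded_text(rng)
        once = restricted_decode(text)
        assert restricted_decode(once) == once, text
    return True


check_decode_is_idempotent()
```

A fixed seed keeps the run reproducible; a real hypothesis strategy would explore a wider input space, including malformed escapes that this sketch deliberately avoids.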
@@ -861,6 +917,8 @@ def replace(self, scheme=_UNSET, host=_UNSET, path=_UNSET, query=_UNSET,
port=_optional(port, self.port),
rooted=_optional(rooted, self.rooted),
userinfo=_optional(userinfo, self.userinfo),
family=_optional(family, self.family),
uses_netloc=_optional(uses_netloc, self.uses_netloc)
This seems incorrect. You shouldn't be able to specify uses_netloc to the constructor at all, it's entirely a function of the scheme; adding it to replace seems to be propagating the error.
Confirming this, deleting these added lines doesn't make any tests fail; if it should be here at all, it shouldn't be in this change.
Functionally, uses_netloc is there because for unrecognized schemes parsed using from_text, URL will persist whether or not the original URL used the netloc slashes.
Logistically, I can't remember if I intended to include it in this PR or if I just saw the big regression of not passing through family and uses_netloc -- both in the constructor and the docstring of replace, but missing in the function signature itself -- and fixed it post haste.
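To make the regression concrete, here is a minimal sketch of the pass-through pattern, assuming a sentinel-based _optional helper like the one visible in the diff. MiniURL is a hypothetical stand-in, not hyperlink's URL class, and the body of _optional is a guess at its behavior.

```python
import socket

_UNSET = object()  # sentinel meaning "argument not supplied", distinct from None


def _optional(argument, default):
    # Assumed behavior: fall back to the current value when nothing was passed.
    return default if argument is _UNSET else argument


class MiniURL(object):
    """Hypothetical immutable URL-like object, for illustration only."""

    def __init__(self, scheme, host, family=None, uses_netloc=None):
        self.scheme = scheme
        self.host = host
        self.family = family
        self.uses_netloc = uses_netloc

    def replace(self, scheme=_UNSET, host=_UNSET, family=_UNSET,
                uses_netloc=_UNSET):
        # Forward every constructor argument so unspecified fields survive.
        return self.__class__(
            scheme=_optional(scheme, self.scheme),
            host=_optional(host, self.host),
            family=_optional(family, self.family),
            uses_netloc=_optional(uses_netloc, self.uses_netloc),
        )


url = MiniURL(u'http', u'::1', family=socket.AF_INET6, uses_netloc=True)
updated = url.replace(scheme=u'https')
```

If replace omitted family, the copy would silently fall back to the constructor default, which matches the IPv6 failure described in the commit message.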
If it's worth working it's worth testing :-). Urgency notwithstanding it seems like an unrelated change.
Well, the branch started out as seeking to normalize decoding in general, but it became pretty percent-decoding centric, as idna improvements will come in another PR.
I've been brainstorming other ways to go with some of these latter-day arguments, but for now I think I prefer bringing things into sync, so I've added a test.
OK, I guess the documentation already indicates that this is possible. That said: has a release gone out with this parameter in it? I think it's an ugly wart on the interface and I'd prefer to find a different way to deal with this if we can before committing to public API for it.
if not allow_percent:
delims = set(delims) | set([u'%'])
for delim in delims:
_hexord = hex(ord(delim))[2:].zfill(2).encode('ascii').upper()
minor: I think you could simplify this by using the %x format tokens:
_hexord = format(ord(delim), '02X').encode('ascii')
(I don’t know if there’s a performance difference, or if this code is particularly perf-critical)
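A quick way to confirm the two spellings agree is to compare them over a handful of delimiters. This is only a sketch: the function names are hypothetical wrappers around the two expressions quoted above, and the printed value assumes Python 3.

```python
def hexord_original(delim):
    # Expression from the diff: strip the '0x' prefix, pad, encode, upper-case.
    return hex(ord(delim))[2:].zfill(2).encode('ascii').upper()


def hexord_format(delim):
    # Suggested replacement: '02X' pads to two digits and upper-cases in one step.
    return format(ord(delim), '02X').encode('ascii')


# Includes a tab to exercise the zero-padding path (code point 9).
for delim in u'/?#%:@&=\t':
    assert hexord_original(delim) == hexord_format(delim)

print(hexord_format(u'%'))  # b'25'
```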
This code is executed 4x at startup. Each time the loop happens <10 times. ;)
It's shorter and easier to read, so the performance issue may be secondary :)
In any case a follow-up PR to do this would be fine.
@glyph yes, it's had this parameter from the initial release. It's deep enough down the parameter list that practically I don't consider it a public API. That said, with an immutable object like this, there's no other way to control whether the slashes should be emitted. If you're wanting to slim down the parameter list, I have an idea on how to get rid of
Per #16, hyperlink's inherited behavior was applying percent-decoding too broadly. Each percent-encodable field in the URL (userinfo, path, query string, and fragment) has certain special characters that should never be decoded [1].
To this end, this PR adds and utilizes a _decode_XXX_part() function for each of the fields, symmetrical to the _encode_XXX_part() that was already there.
Several test cases were added, as well as a general test for the invariant that multiple calls of to_iri() will not generate different URL objects.
@glyph has shown the most interest in this feature, so I've added him as reviewer.
[1]: To take the path as an example, the special characters include delimiters, such as the slash (/), question mark (?), and hash mark (#), as well as the percent sign (%) itself.
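The idea in the footnote can be sketched as follows. This is not the PR's _decode_XXX_part() code: decode_part and PATH_KEEP_ENCODED are hypothetical names, the keep-set simply mirrors the path example above, and non-ASCII escapes are left untouched to keep the sketch short. The standard library's unquote is shown only as a contrast for fully unrestricted decoding.

```python
import re
from urllib.parse import unquote

_ESCAPE_RE = re.compile(r'%([0-9A-Fa-f]{2})')

# Hypothetical keep-set for the path field, taken from the footnote's example.
PATH_KEEP_ENCODED = frozenset(u'/?#%')


def decode_part(text, keep_encoded):
    """Percent-decode text, leaving escapes for keep_encoded characters as-is."""
    def _replace(match):
        code = int(match.group(1), 16)
        if code > 0x7F or chr(code) in keep_encoded:
            return match.group(0)  # restricted or non-ASCII: do not decode
        return chr(code)
    return _ESCAPE_RE.sub(_replace, text)


# Unrestricted decoding loses a level of encoding on every call.
naive_once = unquote(u'%2525')       # '%25'
naive_twice = unquote(naive_once)    # '%'

# Restricted decoding keeps the escaped percent sign and the escaped slash.
safe = decode_part(u'%2525', PATH_KEEP_ENCODED)          # '%2525'
segment = decode_part(u'a%2Fb%41', PATH_KEEP_ENCODED)    # 'a%2FbA'
```

Because the percent sign itself is never decoded, repeated decoding of well-formed input cannot peel off further layers, which is the invariant the to_iri() test is described as checking.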