Fix percent decoding and normalization #23

mahmoud · 2017-07-03T02:11:29Z

Per #16, hyperlink's inherited behavior was applying percent-decoding too broadly. Each percent-encodable field in the URL (userinfo, path, query string, and fragment), has certain special characters that should never be decoded [1].

To this end, this PR adds and utilizes a _decode_XXX_part() function for each of the fields, symmetrical to the _encode_XXX_part() that was already there.

Several test cases were added, as well as a general test for the invariant that multiple calls of to_iri() will not generate different URL objects.

@glyph has shown the most interest in this feature, so I've added him as reviewer.

[1]: To take the path as an example, the special characters include delimiters, such as the slash (/), question mark (?), and hash mark (#), as well as the percent sign (%) itself.

…y and uses_netloc. previously ipv6 hosts were being passed through without family and the colons were causing failures. this makes .replace()'s interface match URL.__init__ (and its own docstring\!)

…y part truly tested in the roundtripping)

…encodable fields, disabling decoding of restricted delimiter characters as well as the all-important percent sign itself. also brings in the decode_to_bytes function, reducing reliance on the standard library.

codecov-io · 2017-07-03T02:13:56Z

Codecov Report

Merging #23 into master will increase coverage by 0.05%.
The diff coverage is 98.27%.

@@            Coverage Diff             @@
##           master      #23      +/-   ##
==========================================
+ Coverage   96.73%   96.78%   +0.05%     
==========================================
  Files           6        6              
  Lines         979     1026      +47     
  Branches      117      123       +6     
==========================================
+ Hits          947      993      +46     
  Misses         19       19              
- Partials       13       14       +1

Impacted Files	Coverage Δ
hyperlink/test/test_url.py	`99.77% <100%> (ø)`	⬆️
hyperlink/_url.py	`93.69% <97.56%> (+0.21%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02de01b...c5fa760. Read the comment docs.

…called by the former

glyph

Everything looks good except for the uses_netloc thing which seems to be noise (potentially included accidentally?). If you remove that I think it's good to land.

glyph · 2017-07-03T05:24:19Z

hyperlink/test/test_url.py

+    'https://example.com/?a=%26',  # ampersand in query param value
+    'https://example.com/?a=%3D',  # equals in query param value
+    # double-encoded percent sign in all percent-encodable positions:
+    "http://(%2525):(%2525)@example.com/(%2525)/?(%2525)=(%2525)#(%2525)",


glyph · 2017-07-03T05:24:50Z

hyperlink/test/test_url.py

            self.assertEqual(test, result)

+    def test_roundtrip_double_iri(self):


This looks like a place where it might be interesting to drop in hypothesis (since it was in fact hypothesis that caught this bug, in txacme's test suite)

yeah, i'm down for some hypothesis in the future. i played with it a bit on this PR, but haven't delved deep enough yet.

glyph · 2017-07-03T05:30:25Z

hyperlink/_url.py

@@ -861,6 +917,8 @@ def replace(self, scheme=_UNSET, host=_UNSET, path=_UNSET, query=_UNSET,
            port=_optional(port, self.port),
            rooted=_optional(rooted, self.rooted),
            userinfo=_optional(userinfo, self.userinfo),
+            family=_optional(family, self.family),
+            uses_netloc=_optional(uses_netloc, self.uses_netloc)


This seems incorrect. You shouldn't be able to specify uses_netloc to the constructor at all, it's entirely a function of the scheme; adding it to replace seems to be propagating the error.

Confirming this, deleting these added lines doesn't make any tests fail; if it should be here at all, it shouldn't be in this change.

Functionally, uses_netloc is there because for unrecognized schemes parsed using from_text, URL will persist whether or not the original URL used the netloc slashes.

Logistically, I can't remember if I intended to include it in this PR or if I just saw the big regression of not passing through family and uses_netloc -- both in the constructor and the docstring of replace, but missing in the function signature itself -- and fixed it post haste.

If it's worth working it's worth testing :-). Urgency notwithstanding it seems like an unrelated change.

Well, the branch started out as seeking to normalize decoding in general, but it became pretty percent-decoding centric, as idna improvements will come in another PR.

I've been brainstorming other ways to go with some of these latter-day arguments, but for now I think I prefer bringing things into sync, so I've added a test.

glyph · 2017-07-03T06:12:15Z

OK, I guess the documentation already indicates that this is possible.

That said: has a release gone out with this parameter in it? I think it's an ugly wart on the interface and I'd prefer to find a different way to deal with this if we can before committing to public API for it.

alexwlchan · 2017-07-03T06:14:48Z

hyperlink/_url.py

+    if not allow_percent:
+        delims = set(delims) | set([u'%'])
+    for delim in delims:
+        _hexord = hex(ord(delim))[2:].zfill(2).encode('ascii').upper()


minor: I think you could simplify this by using the %x format tokens:

_hexord = format(ord(delim), '02X').encode('ascii')

(I don’t know if there’s a performance difference, or if this code is particularly perf-critical)

This code is executed 4x at startup. Each time the loop happens <10 times. ;)

It's shorter and easier to read, so the performance issue may be secondary :)

In any case a follow-up PR to do this would be fine.

mahmoud · 2017-07-03T06:21:37Z

@glyph yes, it's had this parameter from the initial release. It's deep enough down the parameter list that practically I don't consider it a public API. That said, with an immutable object like this, there's no other way to control whether the slashes should be emitted.

If you're wanting to slim down the parameter list, I have an idea on how to get rid of family... For another PR of course :)

mahmoud added 8 commits June 29, 2017 22:17

add one failing test for iri/decoding, over the roundtripping url list

28e43d7

fix an iri roundtripping test by having .replace() pass through famil…

66ef4be

…y and uses_netloc. previously ipv6 hosts were being passed through without family and the colons were causing failures. this makes .replace()'s interface match URL.__init__ (and its own docstring\!)

WIP all double-IRI roundtrip working (path seems to have been the onl…

ddec148

…y part truly tested in the roundtripping)

clean up and generalize path decoding machinery

9ef30bc

add many failing tests for double-iri roundtripping (per #16)

027c8c6

add appropriate percent decoding function () for each of the percent-…

d64b1e7

…encodable fields, disabling decoding of restricted delimiter characters as well as the all-important percent sign itself. also brings in the decode_to_bytes function, reducing reliance on the standard library.

a little bit of docs and cleanup around percent decoding

2075d28

add a potentially tricky decoding test

31af281

mahmoud requested a review from glyph July 3, 2017 02:11

merge _percent_decode and _unquote_to_bytes since the latter is only …

0e877ec

…called by the former

mahmoud mentioned this pull request Jul 3, 2017

what if my password has a reserved delimiter in it? #11

Closed

glyph requested changes Jul 3, 2017

View reviewed changes

add test confirming that replacing uses_netloc works

c5fa760

glyph approved these changes Jul 3, 2017

View reviewed changes

alexwlchan reviewed Jul 3, 2017

View reviewed changes

glyph merged commit 4d30724 into master Jul 3, 2017

glyph deleted the fix_decoding_normalization branch July 3, 2017 06:20

glyph mentioned this pull request Jul 3, 2017

reserved characters are treated inconsistently and not sensibly preserved #16

Closed

mahmoud mentioned this pull request Jul 3, 2017

remove "uses_netloc" parameter and attribute to .__init__ and .replace #24

Open

alexwlchan mentioned this pull request Jul 3, 2017

Simplify the hex formatting in _url.py #29

Merged

		self.assertEqual(test, result)

		def test_roundtrip_double_iri(self):

Fix percent decoding and normalization #23

Fix percent decoding and normalization #23

Uh oh!

Conversation

mahmoud commented Jul 3, 2017

Uh oh!

codecov-io commented Jul 3, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

glyph left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

glyph commented Jul 3, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mahmoud commented Jul 3, 2017

Uh oh!

Uh oh!

codecov-io commented Jul 3, 2017 •

edited

Loading