Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix percent decoding and normalization #23

Merged
merged 10 commits into from
Jul 3, 2017
Merged

Conversation

mahmoud
Copy link
Member

@mahmoud mahmoud commented Jul 3, 2017

Per #16, hyperlink's inherited behavior was applying percent-decoding too broadly. Each percent-encodable field in the URL (userinfo, path, query string, and fragment), has certain special characters that should never be decoded [1].

To this end, this PR adds and utilizes a _decode_XXX_part() function for each of the fields, symmetrical to the _encode_XXX_part() that was already there.

Several test cases were added, as well as a general test for the invariant that multiple calls of to_iri() will not generate different URL objects.

@glyph has shown the most interest in this feature, so I've added him as reviewer.

[1]: To take the path as an example, the special characters include delimiters, such as the slash (/), question mark (?), and hash mark (#), as well as the percent sign (%) itself.

mahmoud added 8 commits June 29, 2017 22:17
…y and uses_netloc. previously ipv6 hosts were being passed through without family and the colons were causing failures. this makes .replace()'s interface match URL.__init__ (and its own docstring\!)
…encodable fields, disabling decoding of restricted delimiter characters as well as the all-important percent sign itself. also brings in the decode_to_bytes function, reducing reliance on the standard library.
@mahmoud mahmoud requested a review from glyph July 3, 2017 02:11
@codecov-io
Copy link

codecov-io commented Jul 3, 2017

Codecov Report

Merging #23 into master will increase coverage by 0.05%.
The diff coverage is 98.27%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #23      +/-   ##
==========================================
+ Coverage   96.73%   96.78%   +0.05%     
==========================================
  Files           6        6              
  Lines         979     1026      +47     
  Branches      117      123       +6     
==========================================
+ Hits          947      993      +46     
  Misses         19       19              
- Partials       13       14       +1
Impacted Files Coverage Δ
hyperlink/test/test_url.py 99.77% <100%> (ø) ⬆️
hyperlink/_url.py 93.69% <97.56%> (+0.21%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 02de01b...c5fa760. Read the comment docs.

Copy link
Collaborator

@glyph glyph left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Everything looks good except for the uses_netloc thing which seems to be noise (potentially included accidentally?). If you remove that I think it's good to land.

'https://example.com/?a=%26', # ampersand in query param value
'https://example.com/?a=%3D', # equals in query param value
# double-encoded percent sign in all percent-encodable positions:
"http://(%2525):(%2525)@example.com/(%2525)/?(%2525)=(%2525)#(%2525)",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

self.assertEqual(test, result)

def test_roundtrip_double_iri(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a place where it might be interesting to drop in hypothesis (since it was in fact hypothesis that caught this bug, in txacme's test suite)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i'm down for some hypothesis in the future. i played with it a bit on this PR, but haven't delved deep enough yet.

@@ -861,6 +917,8 @@ def replace(self, scheme=_UNSET, host=_UNSET, path=_UNSET, query=_UNSET,
port=_optional(port, self.port),
rooted=_optional(rooted, self.rooted),
userinfo=_optional(userinfo, self.userinfo),
family=_optional(family, self.family),
uses_netloc=_optional(uses_netloc, self.uses_netloc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems incorrect. You shouldn't be able to specify uses_netloc to the constructor at all, it's entirely a function of the scheme; adding it to replace seems to be propagating the error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming this, deleting these added lines doesn't make any tests fail; if it should be here at all, it shouldn't be in this change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functionally, uses_netloc is there because for unrecognized schemes parsed using from_text, URL will persist whether or not the original URL used the netloc slashes.

Logistically, I can't remember if I intended to include it in this PR or if I just saw the big regression of not passing through family and uses_netloc -- both in the constructor and the docstring of replace, but missing in the function signature itself -- and fixed it post haste.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's worth working it's worth testing :-). Urgency notwithstanding it seems like an unrelated change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the branch started out as seeking to normalize decoding in general, but it became pretty percent-decoding centric, as idna improvements will come in another PR.

I've been brainstorming other ways to go with some of these latter-day arguments, but for now I think I prefer bringing things into sync, so I've added a test.

@glyph
Copy link
Collaborator

glyph commented Jul 3, 2017

OK, I guess the documentation already indicates that this is possible.

That said: has a release gone out with this parameter in it? I think it's an ugly wart on the interface and I'd prefer to find a different way to deal with this if we can before committing to public API for it.

if not allow_percent:
delims = set(delims) | set([u'%'])
for delim in delims:
_hexord = hex(ord(delim))[2:].zfill(2).encode('ascii').upper()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I think you could simplify this by using the %x format tokens:

_hexord = format(ord(delim), '02X').encode('ascii')

(I don’t know if there’s a performance difference, or if this code is particularly perf-critical)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is executed 4x at startup. Each time the loop happens <10 times. ;)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's shorter and easier to read, so the performance issue may be secondary :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case a follow-up PR to do this would be fine.

@glyph glyph merged commit 4d30724 into master Jul 3, 2017
@glyph glyph deleted the fix_decoding_normalization branch July 3, 2017 06:20
@mahmoud
Copy link
Member Author

mahmoud commented Jul 3, 2017

@glyph yes, it's had this parameter from the initial release. It's deep enough down the parameter list that practically I don't consider it a public API. That said, with an immutable object like this, there's no other way to control whether the slashes should be emitted.

If you're wanting to slim down the parameter list, I have an idea on how to get rid of family... For another PR of course :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants