Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not encode chars that are allowed in path segments #109

Merged

Conversation

bantic
Copy link
Contributor

@bantic bantic commented Aug 19, 2016

During route generation for dynamic segments, do not percent-encode
characters that, according to RFC 3986, are allowed in path segments.

RFC 3986 defines "sub-delims" (these chars: ! $ & ' ( ) * + , ; =) and
specifies that they along with : and @ do not need to be encoded in path segments
in URIs. See: https://tools.ietf.org/html/rfc3986#section-3.3

This commit changes RouteRecognizer'ss generation code for dynamic
segments to explicitly avoid encoding those characters.

Fixes emberjs/ember.js#14094

@rwjblue
Copy link
Collaborator

rwjblue commented Aug 19, 2016

@krisselden and/or @wycats will need to review

@bantic
Copy link
Contributor Author

bantic commented Aug 19, 2016

Sounds good. For reference, the ABNF for path segments in RFC 3896 looks like this:

segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Prior to #91, dynamic segments in routes (e.g. the ":post_id" in /posts/:post_id) simply interpolated the parameter value unchanged into the url, which created some problems (e.g., with post id values of "abc/def" or "100%", RR would generate the following problematic urls: "/posts/abc/def" and "posts/100%").
To fix this, #91 changed the generation code to call encodeURIComponent on the parameter before putting it into the url (e.g., id "abc/def" -> url "posts/abc%2Fdef").

But some of the characters that should be fine to appear in a path segment unencoded (like @) are encoded by encodeURIComponent (the spec for this method in ECMAScript 2016 is based on the older URI RFC 2396). So this PR removes the use of encodeURIComponent and adds a encodePathSegment method that first splits out any reserved characters that should not be encoded, then encodeURIComponent's the remaining characters in the string, then connects it again with the reserved unencoded chars.

Link to path section of RFC 3896.
Link to collected ABNF for URI.
Link to ECMAScript 2016 spec for encodeURIComponent.

@bantic
Copy link
Contributor Author

bantic commented Aug 19, 2016

I ran the benchmarks before and after this change and the generation code is slightly slower (2.9M -> 2.7M ops on my machine), but otherwise the benchmarks do not appear to be too heavily impacted.

@rwjblue
Copy link
Collaborator

rwjblue commented Aug 19, 2016

FWIW - This looks good to me.

if (separators.length) {
ret += separators.shift();
}
}
Copy link
Contributor

@nathanhammond nathanhammond Aug 19, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can do this without modification of the arrays and without an if which I believe will be faster:

var ret = '';
separators.push('');
for (var i = 0; i < pieces.length; i++) {
  ret = ret + pieces[i] + separators[i];
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 Thanks. I made the change.

@nathanhammond
Copy link
Contributor

This is an implementation of the path encoding semantics for paths from RFC 3896 for maximum correctness while minimizing the number of characters encoded.

I've left one comment in the implementation addressing perf, after which it looks good to me for landing and a bump to 0.2.3 (also to land upstream in Ember for the next beta release).

During route generation for dynamic segments, do not percent-encode
characters that, according to RFC 3986, are allowed in path segments.

RFC 3986 defines "sub-delims" (these chars: `! $ & ' ( ) * + , ; =`) and
specifies that they along with `:` and `@` do not need to be encoded in path segments
in URIs. See: https://tools.ietf.org/html/rfc3986#section-3.3

This commit changes RouteRecognizer'ss generation code for dynamic
segments to explicitly avoid encoding those characters.

Fixes emberjs/ember.js#14094
@bantic bantic force-pushed the 14094-fix-encoding-of-dynamic-segments branch from de95066 to 17f5dad Compare August 20, 2016 15:23
@bantic
Copy link
Contributor Author

bantic commented Aug 20, 2016

@nathanhammond Thanks, I addressed the feedback you mentioned. 👍

@nathanhammond
Copy link
Contributor

This halves generate speed on my box with either of your original approach or my proposed modification. This shouldn't be surprising as we're doing a bunch more work than we were previously.

Old:
  Generate ............. 2,822,603.62 op/s

New:
  Generate ............. 1,510,478.62 op/s

In this case I want to trade correctness for performance given that this is a rounding error in the total time cost to render a link inside of Ember. A faster approach to normalization may in the future involve an NFA or DFA...

@nathanhammond nathanhammond merged commit 1ab15b5 into tildeio:master Aug 20, 2016
@krisselden
Copy link
Collaborator

Lot of bloat a simpler solution is to encode once then replace with a regex matching the encoded chars with the unescape fn

@bantic bantic deleted the 14094-fix-encoding-of-dynamic-segments branch August 20, 2016 21:25
@bantic
Copy link
Contributor Author

bantic commented Aug 20, 2016

That's true. I also tried these two approaches but they were slightly worse for performance and much worse for readability (imo). But the snippets below both pass all tests and are much terser; I can pr one of them if that's preferred.

generate regex:

var reservedSegmentChars = [
  "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "=", // sub-delims
  ":", "@" // others explicitly allowed by RFC 3986
];
// the encoded versions of the chars, filtering out ones for which `encodeURIComponent` has no effect
var encodedSegmentChars = reservedSegmentChars.map(encodeURIComponent).filter(function(encoded) {
  return encoded.length === 3 && encoded[0] === '%';
});
var encodedSegmentCharRegex = new RegExp(encodedSegmentChars.join('|'), 'g');
function encodePathSegment(segment) {
  segment = '' + segment; // coerce to string
  return encodeURIComponent(segment).replace(encodedSegmentCharRegex, function(match) {
    return decodeURIComponent(match);
  });
}

hand-tuned regex:

// the escaped versions of the characters we care about are:
// %24, %26, %2B, %2C, %3A, %3B, %3D, %40
var encodedSegmentCharRegex = /%((2(4|6|B|C))|(3(A|B|D))|(40))/g;
function encodePathSegment(segment) {
  segment = '' + segment; // coerce to string
  return encodeURIComponent(segment).replace(encodedSegmentCharRegex, function(match) {
    return decodeURIComponent(match);
  });
}

@nathanhammond Running the benches again you're right, my results match yours for perf of generate — I must have mixed up my bench results yesterday.

@krisselden
Copy link
Collaborator

krisselden commented Aug 20, 2016

UNESCAPE = /%(?:24|26|3B|2C|3B|3D|3A|40)/g

function encodeSegment(str) {
  return encodeURIComponent(str).replace(UNESCAPE, unescape);
}

! ' ( ) * are already not encoded by the Javascript spec

uriMark ::: one of
- _ . ! ~ * ' ( )

@nathanhammond
Copy link
Contributor

nathanhammond commented Aug 20, 2016

RegExp-replace solution (any of 'em):
  Generate ............. 2,041,474.54 op/s

With the smaller file size and 33% perf win I think that we should opt for this:

// the escaped versions of the characters we care about are:
// %24, %26, %2B, %2C, %3A, %3B, %3D, %40
var encodedSegmentCharRegex = /%(?:24|26|3B|2C|3B|3D|3A|40)/g;
function encodePathSegment(segment) {
  segment = '' + segment; // coerce to string
  return encodeURIComponent(segment).replace(encodedSegmentCharRegex, decodeURIComponent);
}

@bantic Would you mind PRing since it's your code (plus @krisselden's regex for non-capturing)?

@bantic
Copy link
Contributor Author

bantic commented Aug 20, 2016

@nathanhammond Yes, happy to — Kris pointed out on slack that this regex is faster (it doesn't have capturing groups):
var UNESCAPE = /%(?:24|26|2B|2C|3B|3D|3A|40)/g;

I have to head out now but I will PR a change later tonight

nathanhammond added a commit to nathanhammond/ember-route-recognizer that referenced this pull request Aug 20, 2016
nathanhammond added a commit to nathanhammond/ember-route-recognizer that referenced this pull request Aug 20, 2016
nathanhammond added a commit to nathanhammond/ember-route-recognizer that referenced this pull request Aug 20, 2016
bantic added a commit to bantic/route-recognizer that referenced this pull request Aug 20, 2016
This approach is much simpler than walking the string, and appears to be
at least as performant as the previous.

Refs tildeio#109
@krisselden
Copy link
Collaborator

@nathanhammond remember when the capture isn't being used to use (?:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants