Do not encode chars that are allowed in path segments #109

bantic · 2016-08-19T19:15:41Z

During route generation for dynamic segments, do not percent-encode
characters that, according to RFC 3986, are allowed in path segments.

RFC 3986 defines "sub-delims" (these chars: ! $ & ' ( ) * + , ; =) and
specifies that they along with : and @ do not need to be encoded in path segments
in URIs. See: https://tools.ietf.org/html/rfc3986#section-3.3

This commit changes RouteRecognizer'ss generation code for dynamic
segments to explicitly avoid encoding those characters.

Fixes emberjs/ember.js#14094

rwjblue · 2016-08-19T19:35:48Z

@krisselden and/or @wycats will need to review

bantic · 2016-08-19T19:56:57Z

Sounds good. For reference, the ABNF for path segments in RFC 3896 looks like this:

segment       = *pchar
pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
pct-encoded   = "%" HEXDIG HEXDIG
sub-delims    = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="

Prior to #91, dynamic segments in routes (e.g. the ":post_id" in /posts/:post_id) simply interpolated the parameter value unchanged into the url, which created some problems (e.g., with post id values of "abc/def" or "100%", RR would generate the following problematic urls: "/posts/abc/def" and "posts/100%").
To fix this, #91 changed the generation code to call encodeURIComponent on the parameter before putting it into the url (e.g., id "abc/def" -> url "posts/abc%2Fdef").

But some of the characters that should be fine to appear in a path segment unencoded (like @) are encoded by encodeURIComponent (the spec for this method in ECMAScript 2016 is based on the older URI RFC 2396). So this PR removes the use of encodeURIComponent and adds a encodePathSegment method that first splits out any reserved characters that should not be encoded, then encodeURIComponent's the remaining characters in the string, then connects it again with the reserved unencoded chars.

Link to path section of RFC 3896.
Link to collected ABNF for URI.
Link to ECMAScript 2016 spec for encodeURIComponent.

bantic · 2016-08-19T19:58:01Z

I ran the benchmarks before and after this change and the generation code is slightly slower (2.9M -> 2.7M ops on my machine), but otherwise the benchmarks do not appear to be too heavily impacted.

rwjblue · 2016-08-19T20:59:17Z

FWIW - This looks good to me.

nathanhammond · 2016-08-19T23:48:13Z

lib/route-recognizer/normalizer.js

+    if (separators.length) {
+      ret += separators.shift();
+    }
+  }


You can do this without modification of the arrays and without an if which I believe will be faster:

var ret = ''; separators.push(''); for (var i = 0; i < pieces.length; i++) { ret = ret + pieces[i] + separators[i]; }

+1 Thanks. I made the change.

nathanhammond · 2016-08-19T23:52:53Z

This is an implementation of the path encoding semantics for paths from RFC 3896 for maximum correctness while minimizing the number of characters encoded.

I've left one comment in the implementation addressing perf, after which it looks good to me for landing and a bump to 0.2.3 (also to land upstream in Ember for the next beta release).

During route generation for dynamic segments, do not percent-encode characters that, according to RFC 3986, are allowed in path segments. RFC 3986 defines "sub-delims" (these chars: `! $ & ' ( ) * + , ; =`) and specifies that they along with `:` and `@` do not need to be encoded in path segments in URIs. See: https://tools.ietf.org/html/rfc3986#section-3.3 This commit changes RouteRecognizer'ss generation code for dynamic segments to explicitly avoid encoding those characters. Fixes emberjs/ember.js#14094

bantic · 2016-08-20T15:25:21Z

@nathanhammond Thanks, I addressed the feedback you mentioned. 👍

nathanhammond · 2016-08-20T20:12:18Z

This halves generate speed on my box with either of your original approach or my proposed modification. This shouldn't be surprising as we're doing a bunch more work than we were previously.

Old:
  Generate ............. 2,822,603.62 op/s

New:
  Generate ............. 1,510,478.62 op/s

In this case I want to trade correctness for performance given that this is a rounding error in the total time cost to render a link inside of Ember. A faster approach to normalization may in the future involve an NFA or DFA...

krisselden · 2016-08-20T20:50:54Z

Lot of bloat a simpler solution is to encode once then replace with a regex matching the encoded chars with the unescape fn

bantic · 2016-08-20T21:31:02Z

That's true. I also tried these two approaches but they were slightly worse for performance and much worse for readability (imo). But the snippets below both pass all tests and are much terser; I can pr one of them if that's preferred.

generate regex:

var reservedSegmentChars = [
  "!", "$", "&", "'", "(", ")", "*", "+", ",", ";", "=", // sub-delims
  ":", "@" // others explicitly allowed by RFC 3986
];
// the encoded versions of the chars, filtering out ones for which `encodeURIComponent` has no effect
var encodedSegmentChars = reservedSegmentChars.map(encodeURIComponent).filter(function(encoded) {
  return encoded.length === 3 && encoded[0] === '%';
});
var encodedSegmentCharRegex = new RegExp(encodedSegmentChars.join('|'), 'g');
function encodePathSegment(segment) {
  segment = '' + segment; // coerce to string
  return encodeURIComponent(segment).replace(encodedSegmentCharRegex, function(match) {
    return decodeURIComponent(match);
  });
}

hand-tuned regex:

// the escaped versions of the characters we care about are:
// %24, %26, %2B, %2C, %3A, %3B, %3D, %40
var encodedSegmentCharRegex = /%((2(4|6|B|C))|(3(A|B|D))|(40))/g;
function encodePathSegment(segment) {
  segment = '' + segment; // coerce to string
  return encodeURIComponent(segment).replace(encodedSegmentCharRegex, function(match) {
    return decodeURIComponent(match);
  });
}

@nathanhammond Running the benches again you're right, my results match yours for perf of generate — I must have mixed up my bench results yesterday.

krisselden · 2016-08-20T21:33:01Z

UNESCAPE = /%(?:24|26|3B|2C|3B|3D|3A|40)/g

function encodeSegment(str) {
  return encodeURIComponent(str).replace(UNESCAPE, unescape);
}

! ' ( ) * are already not encoded by the Javascript spec

uriMark ::: one of
- _ . ! ~ * ' ( )

nathanhammond · 2016-08-20T21:42:35Z

RegExp-replace solution (any of 'em):
  Generate ............. 2,041,474.54 op/s

With the smaller file size and 33% perf win I think that we should opt for this:

// the escaped versions of the characters we care about are:
// %24, %26, %2B, %2C, %3A, %3B, %3D, %40
var encodedSegmentCharRegex = /%(?:24|26|3B|2C|3B|3D|3A|40)/g;
function encodePathSegment(segment) {
  segment = '' + segment; // coerce to string
  return encodeURIComponent(segment).replace(encodedSegmentCharRegex, decodeURIComponent);
}

@bantic Would you mind PRing since it's your code (plus @krisselden's regex for non-capturing)?

bantic · 2016-08-20T21:47:39Z

@nathanhammond Yes, happy to — Kris pointed out on slack that this regex is faster (it doesn't have capturing groups):
var UNESCAPE = /%(?:24|26|2B|2C|3B|3D|3A|40)/g;

I have to head out now but I will PR a change later tonight

Port changes from tildeio/route-recognizer#109

This approach is much simpler than walking the string, and appears to be at least as performant as the previous. Refs tildeio#109

krisselden · 2016-08-22T01:18:53Z

@nathanhammond remember when the capture isn't being used to use (?:

bantic mentioned this pull request Aug 19, 2016

link-to component no longer accepts special chars emberjs/ember.js#14094

Closed

nathanhammond reviewed Aug 19, 2016
View reviewed changes

bantic force-pushed the 14094-fix-encoding-of-dynamic-segments branch from de95066 to 17f5dad Compare August 20, 2016 15:23

nathanhammond merged commit 1ab15b5 into tildeio:master Aug 20, 2016

bantic deleted the 14094-fix-encoding-of-dynamic-segments branch August 20, 2016 21:25

nathanhammond added a commit to nathanhammond/ember-route-recognizer that referenced this pull request Aug 20, 2016

Port changes from tildeio/route-recognizer#109

f1f3971

nathanhammond added a commit to nathanhammond/ember-route-recognizer that referenced this pull request Aug 20, 2016

Port changes from tildeio/route-recognizer#109

c5db11e

nathanhammond added a commit to nathanhammond/ember-route-recognizer that referenced this pull request Aug 20, 2016

Merge pull request #1 from nathanhammond/fixup-109

00faf7c

Port changes from tildeio/route-recognizer#109

bantic added a commit to bantic/route-recognizer that referenced this pull request Aug 20, 2016

Use a regex in encodePathSegment

e97e5f3

This approach is much simpler than walking the string, and appears to be at least as performant as the previous. Refs tildeio#109

bantic mentioned this pull request Aug 20, 2016

Use a regex in encodePathSegment #110

Merged

Monkeychip mentioned this pull request Dec 17, 2020

Bug: Fix issue with double encoding on space in secret history route hashicorp/vault#10596

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not encode chars that are allowed in path segments #109

Do not encode chars that are allowed in path segments #109

bantic commented Aug 19, 2016

rwjblue commented Aug 19, 2016

bantic commented Aug 19, 2016

bantic commented Aug 19, 2016

rwjblue commented Aug 19, 2016

nathanhammond Aug 19, 2016 •

edited

Loading

bantic Aug 20, 2016

nathanhammond commented Aug 19, 2016

bantic commented Aug 20, 2016

nathanhammond commented Aug 20, 2016

krisselden commented Aug 20, 2016

bantic commented Aug 20, 2016 •

edited

Loading

krisselden commented Aug 20, 2016 •

edited

Loading

nathanhammond commented Aug 20, 2016 •

edited

Loading

bantic commented Aug 20, 2016

krisselden commented Aug 22, 2016

Do not encode chars that are allowed in path segments #109

Do not encode chars that are allowed in path segments #109

Conversation

bantic commented Aug 19, 2016

rwjblue commented Aug 19, 2016

bantic commented Aug 19, 2016

bantic commented Aug 19, 2016

rwjblue commented Aug 19, 2016

nathanhammond Aug 19, 2016 • edited Loading

Choose a reason for hiding this comment

bantic Aug 20, 2016

Choose a reason for hiding this comment

nathanhammond commented Aug 19, 2016

bantic commented Aug 20, 2016

nathanhammond commented Aug 20, 2016

krisselden commented Aug 20, 2016

bantic commented Aug 20, 2016 • edited Loading

krisselden commented Aug 20, 2016 • edited Loading

nathanhammond commented Aug 20, 2016 • edited Loading

bantic commented Aug 20, 2016

krisselden commented Aug 22, 2016

nathanhammond Aug 19, 2016 •

edited

Loading

bantic commented Aug 20, 2016 •

edited

Loading

krisselden commented Aug 20, 2016 •

edited

Loading

nathanhammond commented Aug 20, 2016 •

edited

Loading