Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

normalize route and path, encoded and decode dynamic path segments #91

Merged

Conversation

bantic
Copy link
Contributor

@bantic bantic commented May 4, 2016

This PR supersedes #55. That PR originally attempted to encode/decode dynamic segments in matched paths, but it grew much larger when I attempted to make the solution robust in the face of dynamic segments with percent-encoded percent characters (encodeURI("%") === "%25").

I'm approaching this PR in the spirit of an RFC, with a very detailed rationale for the changes below. If there is an alternative preferred approach for this large-ish change I'm happy to restructure or break apart this PR as necessary.

cc @krisselden

Changes to route-recognizer

Note about terminology:

  • route refers to a string that is added to the route recognizer (e.g.: the route in router.add([{ path: '/foo/bar', ... }]) is /foo/bar)
  • path refers to a string that will be given to the route recognizer to recognize (e.g. the path in router.recognize('/boo/baz') is /boo/baz)
  • match/matches refers to when a path is recognized by the route recognizer (e.g. a route recognizer with route /foo/:bar would "match" the path /foo/something)
  • the route recognizer is also sometimes referred to as, simply, the router

What is covered

This PR addresses the following:

  • Adds many tests for non-standard routes and paths (with percent-encodings and/or unicode characters)
  • Adds normalization for added routes and recognized paths to improve the matching of non-standard routes and paths
  • Fixes a bug where dynamic segments are not percent-encoded during generation
  • Fixes a bug where dynamic segments are not percent-decoded during recognition
  • Adds tests (and some code) to ensure that star segments in routes do not get decoded during recognition
  • Adds a flag that can be turned off by consumers to opt-out of the changes introduced here

Who would be affected

It may only affect users who have routes with:

  • percent-encoded or unicode characters
  • dynamic segments that may match paths with percent-encoded characters
  • dynamic segments that are generated using values that have characters that should be encoded (e.g. forward slash: /)

It may cause breaking changes for users who have workarounds in place to encode/decode before generation/recognition. The breakage can be fixed by removing the workaround.

It will likely not cause breaking changes for users who have static routes with percent-encoded or unicode characters. All static route/path combinations that currently match will also match after this PR. Static routes that used to be considered distinct (e.g. /f%20o and /f o) are now normalized to the same thing. Users with affected static route/path combos likely need to do nothing; they may simply end up with superfluous routes that can be removed.

Route and Path Normalization

The route-recognizer docs do not have a lot to say about routes with non-ascii and percent-encoded characters, so some of the work of this change is to add tests for many possible routes and expected matches that may exist in the wild.
The changes in this PR are intended to make the route-recognizer more likely to match routes that are added using either percent-encoded or non-encoded characters, and to recognize paths correctly whether they include percent-encoded characters or not.

The normalization added is intended to make the router recognize routes that:

  • are added with unicode characters (e.g. /café)
  • are added as percent-encoded strings (e.g. /caf%C3%A9)
  • are added with non-encoded "special" url characters (e.g. /foo bar, /foo:bar)
  • are added with percent-encoded "special" url characters (e.g. /foo%20bar, /foo%3Abar)
  • are recognized with paths containing percent-encoded unicode (e.g. router.recognize('caf%C3%A9'))
  • are recognized with paths containing non-percent-encoded unicode (e.g. router.recognize('café'))
  • are recognized with paths containing certain non-encoded "special" url characters (e.g., router.recognize('foo bar'), router.recognize('foo:bar'))
  • are recognized with paths containing certain percent-encoded "special" url characters (e.g., router.recognize('foo%20bar'), router.recognize('foo%3Abar'))

This adds a normalization step for routes when they are added (before they are parsed into segments) and paths when they are recognized.
Similar to the way one would lower-case two strings to check for case-insensitive equality, this normalization is intended to improve the recognition of
routes by ensuring there is a higher degree of similarity between routes and paths. All percent-encoded unicode characters and most percent-encoded special url characters
are normalized into equal strings, whether they are added as routes or recognized as paths.

Normalization Steps:

The normalization for routes and paths is the same:

  • decodeURI the string ignoring percent-encoded percent (%) characters (i.e., only ignore the literal value: %25). This prevents an issue with encoded percent characters in dynamic segments, where they may be inadvertently doubly-decoded resulting in a URIError.
  • The decode call decodes any encoded unicode characters, so a route that is added as /café or /caf%C3%A9 will be decoded to /café and will recognize the path /café or /caf%C3%A9
  • upper-case all remaining percent-encoded sequences (e.g. %3a -> %3A). That way these literal encoded sequences will still be matched. A route added as "/fo%3a" will be matched by incoming paths "/fo%3a" and "/fo%3A"

Normalization of URI-reserved characters

Characters in the uriReserved set (; / ? : @ & = + $ ,) are not encoded by encodeURI, and their percent-encodings (which should not be used in URLs, unless they are part of a dynamic segment) are likewise not decoded by decodeURI (e.g., encodeURI(':') === ':' and decodeURI('%3A') === '%3A').

Since some of these characters have special meaning to the router dsl (: at the start of a segment) and to URLs (? indicating query params), they need to be encoded when adding them as routes (e.g., to add a route with the literal value ":" at the start of a segment it must be encoded: router.add([{ path: "/foo/%3Abar" ...).

Example affected static routes

Examples of static routes that would be affected by this change:

static route path Old code matches? New code matches ? Note
/fo%20 /fo N Y old code does not normalize route /fo%20
/fo%20 /fo%20 N Y old code decodeURIs path (to /fo) before recognizing against route /fo%20
/fo%3a /fo%3A N Y old code is case-sensitive for percent encodings
/fo%25 /fo%25 N Y percent-encoded % incorrectly decoded by old code
/caf%C3%A9 /café N Y old code does not normalize route
/caf%C3%A9 /caf%C3%A9 N Y old code does not normalize route but it does decodeURI path to /café, which doesn't match

Example unaffected static routes

Examples of static routes that would not be affected by this change:

static route path Old code matches? New code matches ? Note
/café /caf%C3A9 Y Y old code decodeURIs the path (to /café) so it will be recognized
/fo: /fo: Y Y exact match
/fo /fo%20 Y Y old code decodeURIs the path (to /fo) so it is recognized
/fo /fo Y Y exact match
/fo%3A /fo%3A Y Y exact match (decodeURI('%3A') === '%3A')
/fo%3A /fo: N N old and new code both don't change the %3A and : in route and path, respectively
/café /café Y Y exact match

Dynamic Route Generation (bugfix)

Changes generation to call encodeURIComponent on dynamic route segments.
Previously, the parameter would be interpolated into the generated route unchanged.

This should be considered a bugfix because the previous behavior made it
possible for the router to generate a route that it later wouldn't be able to recognize.

It is a breaking bugfix for users that have workarounds in place to encodeURIComponent
their parameters before generating links from them. Those workarounds would lead to
doubly-encoded parameters in URLs.

Example route generation

Given a route with a dynamic segment: "/post/:id", the following values of id will
generate the routes shown.

id value Old route New route Note
abc/def /post/abc/def /post/abc%2Fdef The old value is no longer recognizable by this route
100% /post/100% /post/100%25 The old value is an invalid URI

Dynamic Route Parameter Decoding (bugfix)

Changes route recognition to call decodeURIComponent on
recognized parameters from dynamic route segments.

Previously, the captured parameters would be returned unmodified. However, before
dynamic parameters are parsed, the path passed to router.recognize would be
"normalized" with decodeURI (to handle non-ascii unicode characters), so the
captured parameter was not always the same as the value in the path. See below.

The route-recognizer's docs do not discuss the expected behavior in these scenarios.
This change brings the behavior into closer alignment with behavior of other well-known
routers like the Rails router. As such, it could be considered either an enhancement
or a bugfix.

This is a breaking change for users that are working around this issue by calling decodeURIComponent on the params returned by the route. With the new code, this could result in unexpected values caused by double-decoding the value.

Example dynamic route parameter decoding

Given a route with a dynamic segment: "/post/:id", the following paths will yield
the shown values for the id param.

path Old id param New id param Note
/post/abc%def" abc%2Fdef abc/def A user with a workaround would be unaffected.
post/café café café
post/caf%C3%A9 café café the old code's call to decodeURI(path) decodes the percent-encoded %C3A9 to é. A user with a workaround would be unaffected.
post/%3A1 %3A1 :1 the old code's call to decodeURI does not decode %3A to :. A user with a workaround would be unaffected.
post/100%25 100% 100% the old code's call to decodeURI(path) decodes %25 to %. A user with a workaround in place would get an error when trying to decodeURIComponent("100%")

Special case: doubly-encoded segments

The incorrect decoding of encoded percent characters also affects dynamic segments that have been mulitply encoded.
For example, if the post id is a url that itself has a encodeURIComponent-encoded parameter, such as:

let url = "http://example.com/post/" + encodeURIComponent("http://other-url.com")
let encodedUrl = encodeURIComponent(encodedUrl);

The reserved characters from "http://other-url.com" will be doubly-encoded in encodedUrl. E.g., the // in http://other-url.com is first encoded as %2F%2F,
and then the % characters are encoded again to %25: encodeURIComponent(encodeURIComponent("//")) === "%252F%252F".

The old code would call decodeURI once on encodedUrl, resulting in the following (incorrect) value for id: id === encodeURIComponent("http://example.com/post/http://other.url").
The new code avoids decoding the percent characters in the initial path, and it results in the following (correct) value for id: id === "http://example.com/post/" + encodeURIComponent("http://other.url").

Glob Routes (Star Segments)

This PR adds some tests and code to ensure that glob routes (e.g. /*catchall) do not decode their matching segments.
E.g., given the catchall route and path /abc/foo%2Fbar, the matched param catchall is abc/foo%2Fbar, not abc/foo/bar.

This allows handlers like "/foo/:bar" to match a path with a url encoded
segment such as "/foo/abc%Fdef" -> { params: { bar: "abc/def" } }

See related issue emberjs/ember.js#11497
@bantic bantic force-pushed the normalize-and-encoded-decode-dynamic-segments branch from f594a76 to 391ec7b Compare May 4, 2016 23:06
@bantic
Copy link
Contributor Author

bantic commented May 5, 2016

Updated the description to clarify that although certain users with percent-encoded static routes may be affected, it is not a breaking change (all static routes that match in the current code will still match after this PR). Clarified that users who may find this a breaking change for dynamic route segments likely only need to remove their workaround.

Also updated the terminology section for clarity and renamed the "normalization caveats" to "normalization or uri-reserved characters".

@mmun
Copy link
Collaborator

mmun commented May 6, 2016

Can we disallow special symbols in a route path? e.g. /foo:bar and /foo*bar

@krisselden
Copy link
Collaborator

@mmun They have to be encoded to work, is that not enough?

@mmun
Copy link
Collaborator

mmun commented May 6, 2016

@krisselden The PR doesn't enforce that at the moment though:

 +  // unencoded ":" in non-significant place
 +  route: "/foo/b:ar",
 +  matches: ["/foo/b:ar"],
 +  nonmatches: ["/foo/b%3Aar", "/foo/b%3aar"]
 +  /* FIXME should this work?
 +}, {
 +  // encoded non-uri-reserved char "*" in significant place
 +  route: "/foo/%2Abar",
 +  matches: ["/foo/*bar", "/foo/%2Abar", "/foo/%2baar"]
 +  */
 +}, {
 +  // unencoded "*" in non-significant place
 +  route: "/foo/b*ar",
 +  matches: ["/foo/b*ar", "/foo/b%2Aar", "/foo/b%2aar"]
 +}, {

@krisselden
Copy link
Collaborator

@bantic lets require all reserved characters in the configured route to be encoded so we can use any of them for future parsing needs.

@krisselden
Copy link
Collaborator

Otherwise this looks good to me.

@bantic
Copy link
Contributor Author

bantic commented May 6, 2016

Sounds good, but I think we need to delineate which characters are in the reserved set. The uriReserved set are these: ; / ? : @ & = + $ ,. We can't require all of these (especially /) to show up in encoded form in a route definition.

How about:

  • ; : @ & = + $ , * — must always be encoded (the "reserved set", for RR's purposes)
  • % — must be encoded, but handled specially for path-matching
  • ? # — never allowed, encoded or not
  • / — Required as-is to mark segment boundaries, must be encoded if the literal encoding is to match

There is also mention of reserving ( and ) — that could be done as part of these changes or folded in later.

And then, from the perspective of path-matching, paths with either unencoded or encoded form of characters in the reserved set will match, e.g.:

  • route /foo%3A matches paths /foo:, /foo%3A and /foo%3a

The % and / are handled specially:

  • percent: route /foo%25 matches path /foo%25 but not /foo% (<-- not valid URI)
  • slash: route /foo%2Fbar matches path /foo%2Fbar and /foo%2fbar but not /foo/bar

@krisselden @mmun Let me know if that is in line with what you had in mind.

@bantic
Copy link
Contributor Author

bantic commented May 6, 2016

This would also mean it's not possible to match a literal : but not its encoding (or vice-versa).

E.g., there would be no route that could be added such that the path /foo:bar matches and /foo%3Abar does not. That seems ok to me.

@mmun
Copy link
Collaborator

mmun commented May 7, 2016

@bantic that sounds like a feature to me!

@bantic
Copy link
Contributor Author

bantic commented May 9, 2016

Pushed some WIP to address the comment above about reserved characters and to address discussion on #dev-router in slack about normalizing route/path segments rather than full strings.
This PR now normalizes routes like /foo/%3Abar such that paths with the unencoded value (of %3A -> :) matches: /foo/:bar.

This breaks the test that ensures that a glob route doesn't modify its matched path, though, so I need to fix that. Also need to make a decision which characters are "reserved" and how to communicate when user attempts to add a route with unencoded reserved chars:

I made a visualizer that can be used to see what paths will match a given route (and what their param values will be): http://rr-visualizer.surge.sh/

@bantic bantic force-pushed the normalize-and-encoded-decode-dynamic-segments branch from f67c9b6 to 3383514 Compare May 17, 2016 21:54
Change State's `names` array to hold {name, decode} objects instead of
only the names. Recognition uses the `decode` boolean to know whether to
decode the capture.
@bantic bantic force-pushed the normalize-and-encoded-decode-dynamic-segments branch from 3383514 to 8f7845b Compare May 17, 2016 21:55
@bantic
Copy link
Contributor Author

bantic commented May 17, 2016

Performance on master (2a21d5e):

  Add ................... 338.41 op/s
  End-to-end ............ 130.64 op/s
  Generate ........ 3,808,691.86 op/s
  Handlers For ... 30,679,818.13 op/s
  Recognize ......... 286,352.95 op/s

Performance on this branch (5bf1311):

  Add ................... 132.74 op/s
  End-to-end ............. 76.13 op/s
  Generate ........ 2,786,454.71 op/s
  Handlers For ... 29,659,182.82 op/s
  Recognize .......... 89,432.47 op/s

@bantic
Copy link
Contributor Author

bantic commented May 17, 2016

This is ready for review. Here's a short summary of the changes:

Normalization when adding routes and when recognizing paths

Added routes are normalized and paths are normalized before recognition so that percent-encoded characters in routes are now mostly normalized and users do not need to add the same route in multiple forms to deal with all possible URLs that it might encounter.

Before: "/foo%3Abar", "/foo%3abar" and "/foo:bar" were all considered different routes.
Now: they are normalized to the same thing.

Bugfix: url generation with dynamic segment that has special characters (like "/")

Old behavior was to put the parameter into the url as-is. New behavior is to encodeURIComponent the value before putting it in the url.

Before: route "/post/:id" with id "abc/def" would generate the url "/post/abc/def", which no longer would be matched by the route that generated it.
Now: the url that is generated would be "/post/abc%2Fdef" (%2F is the percent-encoded value of /).

Bugfix: inconsistent decoding of dynamic segments during recognition

Old behavior was to sort-of-normalize an incoming path using decodeURI, which decodes encoded utf-8 chars and some other chars (like space and utf8 chars), but leaves other encoded sequences (like : ("%3B") and + ("%2B")) still encoded.

Before: a route "posts/:id" would match the url "posts/abc%3Bdef" with the non-decoded id value "abc%3Bdef", but match the url "posts/abc%20def" with the decoded id value "abc def".

Now: the matched parameter value is always decoded (for dynamic segments), and the following symmetric relationship is always true:

(router.add([{ handler, ... }], {as: routeName}) &&
  router.recognize( router.generate(routeName, anyParams) )) ===
    <result with handler for routeName>

In english: after adding a route to RR, RR will always recognize the output of an RR.generate call for that route and any params (even if those params have "special" characters like /).

Problematic edge case in the old behavior: If a url had a percent-encoded percent character (the literal string "%25"), the matching param would have the decoded value ("%"), and consumers that use decodeURIComponent to work around this behavior get a malformed URI error when they try to decodeURIComponent(stringContainingBarePercentCharacter) — they would need to add extra code to their workaround to properly handle parameter values that have bare percent characters. After this PR is merged those workaround(s) are no longer necessary.

Bugfix / Behavior clarification: catchall (aka glob or star segment) routes do not decode their matching params

The docs did not specify how/if star segments should decode parameters, and so they had the same inconsistent behavior for dynamic segments mentioned above in this comment. This PR adds some tests (and code) to clarify that star segment matches are not decoded.

Old behavior: route with "/*catchall" would match url "/abc%20def" with {catchall: "abc def"}
New: same route and url matches with verbatim string match {catchall: "abc%20def"}

Feature Flag: ENCODE_AND_DECODE_PATH_SEGMENTS

Added RouteRecognizer.ENCODE_AND_DECODE_PATH_SEGMENTS flag — when it is true the new behavior described in this comment is turned on. Users can set it to false in their applications to opt-out of the new behavior.
Some of the tests set this flag off to assert expected old (possibly buggy) behavior.

Not covered: Reserved characters in routes

The comments above mention enforcing that certain characters (like : and *) must always be percent-encoded when adding a route if they are meant literally (and not to indicate a dynamic or star segment).
I suggest handling that enforcement in a separate PR after this is merged.
If that should be part of this PR the following need to be ascertained:

  • What are all the reserved characters?
  • How should RR enforce their encoding? (Throw an error when they are used unencoded when adding a route?)

@bantic
Copy link
Contributor Author

bantic commented May 17, 2016

The RR visualizer at http://rr-visualizer.surge.sh/ is updated with these changes.

@bantic
Copy link
Contributor Author

bantic commented May 18, 2016

Rewrote the normalization algorithm and added some benchmarks for it.
Perf is now on par with master again.
Also added some tests for normalization because its algorithmic complexity grew a bit.

Performance on master (2a21d5e):

  Add ................... 383.06 op/s
  End-to-end ............ 141.42 op/s
  Generate ........ 3,906,401.17 op/s
  Handlers For ... 31,834,925.29 op/s
  Recognize ......... 299,361.68 op/s

Performance on this PR (0861ce2):

  Add ........................ 314.32 op/s
  End-to-end ................. 131.77 op/s
  Generate ............. 3,098,667.92 op/s
  Handlers For ........ 31,272,320.36 op/s
  Recognize .............. 214,845.82 op/s

New benchmarks for normalization:

  Normalize Complex ...... 121,907.96 op/s
  Normalize Simple ..... 1,343,586.11 op/s
  Normalize Medium ....... 395,562.42 op/s

@mmun
Copy link
Collaborator

mmun commented May 23, 2016

@bantic can you have the route visualizer show the segments + canonical values?

@mmun mmun merged commit 4da3c3e into tildeio:master May 23, 2016
@mixonic
Copy link
Collaborator

mixonic commented May 24, 2016

(fwiw @bantic is mostly offline for this week)

@nathanhammond
Copy link
Contributor

I'm now refactoring my code in #90 to pass these new tests.

@bantic
Copy link
Contributor Author

bantic commented Jun 3, 2016

👍 @mmun sorry for the delay. I'll update the visualizer today or tomorrow

@bantic bantic deleted the normalize-and-encoded-decode-dynamic-segments branch June 3, 2016 21:30
@bantic
Copy link
Contributor Author

bantic commented Jun 3, 2016

@mmun Updated the visualizer with normalized route segments and normalized path: http://rr-visualizer.surge.sh/

mfazekas added a commit to mfazekas/ember-cli-mirage that referenced this pull request Nov 19, 2016
samselikoff pushed a commit to miragejs/ember-cli-mirage that referenced this pull request Dec 9, 2016
samselikoff pushed a commit to miragejs/ember-cli-mirage that referenced this pull request Dec 10, 2016
featheredtoast added a commit to discourse/discourse that referenced this pull request Dec 23, 2019
* FIX: category routes model params should decode their URL parts

Ember's route star globbing does not uri decode by default. This is
problematic for subcategory globs with encoded URL site settings enabled.

Subcategories with encoded URLs will 404 without this decode.

I found this tildeio/route-recognizer#91
which explicitly explains that globbing does not decode automatically.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants