Issue #18 URI path processing

Issue #18 URI path processing. Moved the wiki text to a PR for better comment and suggestions.
jakartaee · Oct 7, 2021 · bd12ce5
1 parent 4367a70
commit bd12ce5
Showing 1 changed file with 76 additions and 0 deletions.
diff --git a/spec/src/main/asciidoc/servlet-spec-body.adoc b/spec/src/main/asciidoc/servlet-spec-body.adoc
@@ -1300,6 +1300,82 @@ the header value to an `int`, a `NumberFormatException` is thrown. If
 the `getDateHeader` method cannot translate the header to a `Date`
 object, an `IllegalArgumentException` is thrown.
 
+=== Request URI Processing
+The process described here adapts and extends the URI canonicalization process described in [RFC 3986](https://datatracker.ietf.org/doc/html/rfc3986) to create a standard Servlet URI path canonicalization process that ensures that URIs can be mapped to Servlets, Filters and security constraints in an unambiguous manner. It is also intended to provide information to reverse proxy implementations so they are aware of how requests they pass to servlet containers will be processed.
+
+Servlet containers may implement the standard Servlet URI path canonicalization in any manner they see fit as long as the end result is identical to the end result of the process described here. Servlet containers may provide container specific configuration options to vary the standard canonicalization process. Any such variations may have security implications and both Servlet container implementors and users are advised to be sure that they understand the implications of any such container specific canonicalization options.
+
+==== URI Transport
+===== HTTP/1.1
+The URI is extracted from the `request-target` as defined by [RFC 7230](https://datatracker.ietf.org/doc/html/rfc7230#section-3.1.1). URIs in `origin-form` or `asterisk-form` are passed unchanged to stage 2. URIs in `absolute-form` have the protocol and authority removed to convert them to `origin-form` and are then passed to stage 2. URIs in `authority-form` are outside of the scope of this specification.
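The HTTP/1.1 rule above can be illustrated with a minimal Java sketch. The class and method names (`RequestTargets`, `toOriginForm`) are hypothetical, and treating an empty path in `absolute-form` as `/` is an assumption taken from RFC 7230 rather than from the text above. `authority-form` is not handled.

```java
final class RequestTargets {
    /**
     * Returns the request-target unchanged for origin-form and asterisk-form,
     * and strips scheme and authority from absolute-form.
     * Hypothetical helper, not part of any Servlet API.
     */
    static String toOriginForm(String requestTarget) {
        if (requestTarget.equals("*") || requestTarget.startsWith("/")) {
            return requestTarget; // asterisk-form or origin-form: passed on unchanged
        }
        // absolute-form: scheme "://" authority [path] ["?" query]
        int schemeEnd = requestTarget.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("not origin-, asterisk- or absolute-form");
        }
        int authorityStart = schemeEnd + 3;
        // The authority ends at the first '/', '?' or '#', or at the end of the string.
        int end = requestTarget.length();
        for (int i = authorityStart; i < end; i++) {
            char c = requestTarget.charAt(i);
            if (c == '/' || c == '?' || c == '#') {
                end = i;
                break;
            }
        }
        String rest = requestTarget.substring(end);
        // Assumption: an empty path is treated as "/" (as RFC 7230 requires of clients).
        return rest.startsWith("/") ? rest : "/" + rest;
    }
}
```

With this sketch, `http://example.com/a/b?x=1` becomes `/a/b?x=1`, while `/a/b?x=1` and `*` are returned as they are.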
+
+===== HTTP/2
+The URI is the `:path` pseudo header as defined by [RFC 7540](https://datatracker.ietf.org/doc/html/rfc7540#section-8.1.2.3) and is passed unchanged to stage 2.
+
+===== Other protocols
+Containers may support other protocols. Containers should extract an appropriate URI for the request from the protocol and pass it to stage 2.
+
+==== Separation of path and query
+The URI is split at the first occurrence of a `?` character into path and query. The query is preserved for later handling and the following steps are applied to the path.
+
+==== Discard fragment
+A fragment in the path is indicated by the first occurrence of a `\#` character. That `\#` character and the fragment following it are removed from the path and discarded.
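A minimal Java sketch of the two steps above, read literally: split at the first `?`, then drop everything from the first `#` in the path. The `PathAndQuery` type is hypothetical. Note that under this literal reading a `#` that appears after the `?` stays in the query, since only the path is examined for a fragment.

```java
final class PathAndQuery {
    final String path;
    final String query; // null when the URI contains no '?'

    private PathAndQuery(String path, String query) {
        this.path = path;
        this.query = query;
    }

    /** Splits at the first '?', then discards any fragment found in the path. */
    static PathAndQuery split(String uri) {
        int q = uri.indexOf('?');
        String path = q < 0 ? uri : uri.substring(0, q);
        String query = q < 0 ? null : uri.substring(q + 1);

        // Fragment removal applies to the path only, as in the text above.
        int hash = path.indexOf('#');
        if (hash >= 0) {
            path = path.substring(0, hash);
        }
        return new PathAndQuery(path, query);
    }
}
```

For `/a/b?x=1&y=2?z` the path is `/a/b` and the query is `x=1&y=2?z`, because only the first `?` acts as the separator.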
+
+==== Decoding of non-special characters
+Characters other than `/`, `;` and `%` that are encoded in `%nn` form are decoded and the resulting octet sequence is treated as UTF-8 and converted to a character sequence.
+
+> Note that the special characters cannot appear inside a multi-octet UTF-8 sequence, as every octet of such a sequence has its high bit set (and is therefore negative when held in a signed byte).
+
+> Note that these special characters are not the reserved characters defined by RFC 3986, as that set does not include `%` and includes many characters that are not relevant here. Avoiding a second decoding is worthwhile.
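The partial decoding step could look roughly like the following Java sketch. `PartialDecoder` is a hypothetical name. Two behaviours are assumptions not settled by the text above: malformed `%` escapes throw an exception, and invalid UTF-8 is replaced with U+FFFD (the default behaviour of `new String(byte[], Charset)`).

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class PartialDecoder {
    /** Decodes %nn escapes except those for '/', ';' and '%', which stay encoded. */
    static String decodeNonSpecial(String path) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            if (c == '%') {
                if (i + 2 >= path.length()) {
                    throw new IllegalArgumentException("truncated escape"); // assumption
                }
                int hi = hex(path.charAt(i + 1));
                int lo = hex(path.charAt(i + 2));
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("invalid escape"); // assumption
                }
                int octet = (hi << 4) | lo;
                if (octet == '/' || octet == ';' || octet == '%') {
                    // Special character: keep the escape exactly as written.
                    out.write('%');
                    out.write(path.charAt(i + 1));
                    out.write(path.charAt(i + 2));
                } else {
                    out.write(octet);
                }
                i += 3;
            } else {
                // Literal character: copy it through as UTF-8 octets.
                int cp = path.codePointAt(i);
                byte[] b = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
                i += Character.charCount(cp);
            }
        }
        // The collected octets are interpreted as UTF-8 in a single pass.
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }
}
```

Collecting octets first and converting once at the end lets a multi-octet escape such as `%C3%A9` decode to a single character, while `%2F`, `%3B` and `%25` pass through untouched for the later steps.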
+
+==== Collapse sequences of multiple `"/"` characters
+> **WARNING** Swapping the order of stage 3 and stage 4 may be significant. Consider `"/aaa/bbb//../"`.
+
+> **TODO** Are we sure we don't want to do this in the other order?
+
+Any sequence of more than one `"/"` character in the URI must be replaced with a single `"/"`.
+
+==== Remove dot-segments
+* A path not starting with "/" must be rejected with a 400 response.
+* Sequences of the form `"/./"` must be replaced with `"/"`.
+* Sequences of the form `"/" + segment + "/../"` must be replaced with `"/"`.
+* If there is no preceding segment for a `".."` segment then return a 400 response.
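The slash-collapsing and dot-segment rules above, and the ordering concern raised in the WARNING, can be explored with a small Java sketch. All names here (`PathNormalizer`, `BadRequestException`) are hypothetical, and the exception merely stands in for a 400 response. Handling of a trailing `.` or `..` without a final `/` follows RFC 3986 and is an assumption, since the bullets only describe sequences ending in `/`.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for "reject with a 400 response". */
final class BadRequestException extends RuntimeException {
    BadRequestException(String message) {
        super(message);
    }
}

final class PathNormalizer {
    /** Replaces every run of two or more '/' with a single '/'. */
    static String collapseSlashes(String path) {
        return path.replaceAll("/{2,}", "/");
    }

    /** Removes "." and ".." segments; an empty segment counts as a segment. */
    static String removeDotSegments(String path) {
        if (!path.startsWith("/")) {
            throw new BadRequestException("path must start with '/'");
        }
        String[] segments = path.substring(1).split("/", -1);
        Deque<String> kept = new ArrayDeque<>();
        for (int i = 0; i < segments.length; i++) {
            String segment = segments[i];
            boolean last = (i == segments.length - 1);
            if (segment.equals(".")) {
                if (last) kept.addLast(""); // assumption: "/a/." becomes "/a/"
            } else if (segment.equals("..")) {
                if (kept.isEmpty()) {
                    throw new BadRequestException("no preceding segment for '..'");
                }
                kept.removeLast();
                if (last) kept.addLast(""); // assumption: "/a/b/.." becomes "/a/"
            } else {
                kept.addLast(segment);
            }
        }
        return "/" + String.join("/", kept);
    }
}
```

Applied to the example from the WARNING, collapsing first turns `/aaa/bbb//../` into `/aaa/`, whereas removing dot-segments first and collapsing afterwards yields `/aaa/bbb/`, because the `..` then consumes the empty segment instead of `bbb`.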
+
+==== Removal of path parameters
+A path segment containing the `";"` character is split at the first occurrence of `";"`. The segment is replaced by the character sequence preceding the `";"`. The characters following the `";"` are considered path parameters and may be preserved by the container for later processing (e.g. `jsessionid`).
+
+> TODO How do we handle URIs like `/foo/;/bar`?  I think as currently written we end up with `/foo//bar` ?
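A minimal Java sketch of path parameter removal as currently written, useful for checking the case raised in the TODO. `PathParameters.strip` is a hypothetical helper, and for brevity it discards the parameters instead of preserving them for later processing.

```java
final class PathParameters {
    /** Drops everything from the first ';' in each segment up to the next '/'. */
    static String strip(String path) {
        StringBuilder out = new StringBuilder(path.length());
        boolean inParameters = false;
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c == '/') {
                inParameters = false; // a new segment starts
                out.append(c);
            } else if (c == ';') {
                inParameters = true;  // the rest of this segment is parameters
            } else if (!inParameters) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this literal reading, `/app/page;jsessionid=ABC123/view;v=2` becomes `/app/page/view`, and `/foo/;/bar` does become `/foo//bar`, which reintroduces an empty segment after the slash-collapsing step has already run.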
+
+==== Decoding of remaining `%nn` sequences
+Any remaining `%nn` sequences in the path should be decoded. Some containers may be configured to leave specific characters encoded (e.g. the characters `/` and `%` may be left encoded by some container configurations).
+
+==== Mapping URI to context and resource
+The decoded path is used to map the request to a context and resource within the context. This form of the URI path is used for all subsequent mapping (web applications, servlets, filters and security constraints).
+
+==== Rejecting Suspicious Sequences
+If suspicious sequences are discovered during the prior steps, the request must be rejected with a 400 (Bad Request) response using the error handling of the matched context. By default the set of suspicious sequences includes:
+
+ * The encoded `"/"` character
+ * Any `"."` or `".."` segment that had a path parameter
+ * Any `"."` or `".."` segment with any encoded characters
+ * The `"\"` character encoded or not.
+ * Any control characters either encoded or not.
+
+A container or context may be configured to have a different set of rejected sequences.
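The default set of suspicious sequences could be checked along the lines of the Java sketch below. `SuspiciousPaths` is a hypothetical name and the check runs on the path as received, before any decoding. Treating octets below 0x20 and 0x7F (DEL) as control characters is an assumption, since the definition is still an open TODO in this section.

```java
import java.util.Locale;

final class SuspiciousPaths {
    /** Returns true if the raw (still encoded) path contains a default suspicious sequence. */
    static boolean isSuspicious(String rawPath) {
        String lower = rawPath.toLowerCase(Locale.ROOT);
        // Encoded '/', and '\' whether encoded or not.
        if (lower.contains("%2f") || lower.contains("%5c") || rawPath.indexOf('\\') >= 0) {
            return true;
        }
        // Control characters, literal or encoded.
        for (int i = 0; i < rawPath.length(); i++) {
            char c = rawPath.charAt(i);
            if (isControl(c)) {
                return true;
            }
            if (c == '%' && i + 2 < rawPath.length()) {
                int hi = Character.digit(rawPath.charAt(i + 1), 16);
                int lo = Character.digit(rawPath.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0 && isControl((hi << 4) | lo)) {
                    return true;
                }
            }
        }
        // "." or ".." segments that carry a path parameter or encoded characters.
        for (String segment : rawPath.split("/", -1)) {
            int semi = segment.indexOf(';');
            String name = semi < 0 ? segment : segment.substring(0, semi);
            String decodedName = name.replaceAll("(?i)%2e", ".");
            boolean dotSegment = decodedName.equals(".") || decodedName.equals("..");
            if (dotSegment && (semi >= 0 || name.indexOf('%') >= 0)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: control characters are octets below 0x20 plus 0x7F (DEL).
    private static boolean isControl(int c) {
        return c < 0x20 || c == 0x7F;
    }
}
```

A plain `/a/../b` is not flagged by this sketch, since unencoded dot-segments without parameters are handled by the dot-segment removal step, while `/a/..;x/b` and `/a/%2e%2e/b` are flagged.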
+
+> TODO how do we define control characters? < 0x20? Is 0x7F (DEL) OK?
+
+> TODO should we also by default reject '\' and '%5c' ?
+
+> TODO should we also by default reject non visible and/or control characters ?
+
+> TODO if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL, should stage 3 and 4 be re-run if this is allowed?
+
+
+
+
+
+
 === Request Path Elements
 
 The request path that leads to a servlet