Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify behaviour of getRequestURI(), getContextPath(), getServletPath() and getPathInfo() #18

Open
glassfishrobot opened this issue Oct 7, 2011 · 49 comments · Fixed by #429 or #424
Assignees
Labels
Candidate4NextRelease This issues should be consider for inclusion in the next release project. Enhancement New feature or request

Comments

@glassfishrobot
Copy link

The specification is unclear for some or all of the above methods how the following should be handled.

  • path parameters - should they be included in the returned values or not?
  • normalisation - should the returned values be pre- or -post any normalisation?
  • url decoding - should the returned values be decoded or include the escapes?
  • character encoding - if a value is decoded, what character encoding should be used?
  • welcome files - if a request maps to a welcome file, should it be included in the return values?
@glassfishrobot
Copy link
Author

@glassfishrobot Commented
Reported by markt_asf

@glassfishrobot
Copy link
Author

@glassfishrobot Commented
markt_asf said:
Add getRequestURL() to the methods to be reviewed

@glassfishrobot
Copy link
Author

@glassfishrobot Commented
This issue was imported from java.net JIRA SERVLET_SPEC-18

@glassfishrobot
Copy link
Author

@markt-asf
Copy link
Contributor

An additional point requiring clarification is that RFC 3986 (section 2.2) states that %nn encoded reserved characters are not equivalent to their decoded forms. i.e. "/foo/bar/" and "/foo%2Fbar" are not equivalent. This raises an additional question. When a %nn reserved character appears in any part of the URI should it be decoded or not in the return value of each of the above methods? I'm leaning towards a solution that leaves them in their %nn form and adds a utility method to the API that performs %nn decoding.

@gregw gregw added Enhancement New feature or request and removed Component: Misc labels Jan 18, 2020
@gregw gregw added the Candidate4NextRelease This issues should be consider for inclusion in the next release project. label Sep 8, 2020
@gregw
Copy link
Contributor

gregw commented Feb 17, 2021

the servlet spec in section 3.1 says:

Path parameters that are part of a GET request (as defined by HTTP/1.1) are not exposed by these APIs. They must be parsed from the String values returned by the getRequestURI method or the getPathInfo method.

I think when those words were written URIs were defined by RFC2396 which well defined path parameters in section 3.3:

      path          = [ abs_path | opaque_part ]
      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar
      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

However, that RFC has since been obsoleted by RFC3986 which now no longer has a normative definition for path parameters and instead just says:

Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax. URI producing applications
often use the reserved characters allowed in a segment to delimit
scheme-specific or dereference-handler-specific subcomponents. For
example, the semicolon (";") and equals ("=") reserved characters are
often used to delimit parameters and parameter values applicable to
that segment. The comma (",") reserved character is often used for
similar purposes. For example, one URI producer might use a segment
such as "name;v=1.1" to indicate a reference to version 1.1 of
"name", whereas another might use a segment such as "name,1.1" to
indicate the same. Parameter types may be defined by scheme-specific
semantics, but in most cases the syntax of a parameter is specific to
the implementation of the URI's dereferencing algorithm.

So this gives us two problems. Firstly I now cannot find any current specification of HTTP URL path parameters and how they should be parsed. So I think the servlet spec could be relying on the grandfathered definition from RFC2396. IS this true? Does anybody know of a current specification for them?

Secondly, there are URIs that if RFC3986 is applied to (ie normalisation, then decoding and param handling) then the result is very ambiguous when received as a string. Examples include:

URI RFC3986 normalised common error
/foo/%2e%2e/bar /foo/../bar /bar
/foo/aaa%2fbbb/../bar /foo/bar /foo/aaa/bar
/foo/..;/bar /foo/../bar /bar

@markt-asf
Copy link
Contributor

I'd love to see us make some progress on this in the next iteration of the Servlet spec.

There are a lot of edge cases and interdependencies between edge cases as well as places where the meaning RFCs may not be immediately clear and/or has been misinterpreted in the past. We have a wiki so my suggestion is that we draft some text so we have a common, agreed understanding in the wiki, with examples, and then discuss the changes we wish to make to the specification document and/or the Javadoc so that the specification is aligned with that common understanding. I'll get started on that Wiki page and I'll post again here when I have something ready to start discussing.

@markt-asf
Copy link
Contributor

Wiki page has been created. I think the first thing to tackle is path parameters.

https://github.com/eclipse-ee4j/servlet-api/wiki/HTTP-URIs-and-Servlet-API-methods

@gregw
Copy link
Contributor

gregw commented Jul 14, 2021

I think we also need to consider security issues that are introduced by ill defined path parameters. Jetty recently "improved" our URI handling to be more compliant with RFC3986 and that just resulted in a bunch of CVEs as it created many issues.

For example in the URI /foo/..;/bar the middle segment is not .. so by strict RFC3986 it should not be normalized. But once we strip out the path parameter the resulting path has a pure .. segment, so it is normalized by the file system. So for example a request to /foo/..;/WEB-INF/web.xml was not blocked as it's path started with /foo not /WEB-INF, but when the path was given to the file system, the web.xml was served. There are similar problems with encoded dots and / if you strictly respect RFC3986 (eg /foo/%2e%2e/WEB-INF/web.xml & /foo%2f../WEB-INF/web.xml) so we have reverted to not following RFC3986, specifically in that we decode and then normalize. This is better for security, but it is not good that we do not follow the RFC for URI handling.

Fundamentally getServletPath() and getPathInfo() are ambiguous when they return strings that may include decoded reserved characters or removed parameters.

@markt-asf
Copy link
Contributor

Agreed wholeheartedly re security issues. It is also important we clarify these ambiguities and get consistent behaviour between containers because inconsistent behaviour between containers can also be a source of vulnerabilities (if an app coded for one container works securely on that container it may not be secure on a different container).
With respecting to %nn encoding, I was planning on looking at that after path parameters. While the two aren't independent, I think we can look at path parameters in isolation first and then, once we have a plan for path parameters, see what impact %nn encoding has on that plan.
For path parameters, I've done some testing along these lines:

URI uriC = new URI("file:///home/aaa/..;/mark");
uriB = uriC.normalize();
File fileC = new File(uriC);
System.out.println(uriC.toString());
System.out.println(fileC.isDirectory());

The result is:

file:///home/aaa/..;/mark
false

so my conclusion is that if we retain path parameters, we have to retain them all the way through the processing chain and if we do that the request will fail because the context path doesn't match, the servlet path doesn't match or the file doesn't exist. That is the second option in the wiki page option. The other option is to remove them early on in the processing. I've added a comparison of the options to the wiki.

Despite arguing consistently for option 1 on the Tomcat lists for a number of years (and implementing it) I'm beginning to think that option 2 is the better option.

@gregw
Copy link
Contributor

gregw commented Jul 15, 2021

@markt-asf I'm not entirely sure that we can separate discussion of ; from issues associated with other difficult characters like %, #, ?, ., / and \. But I'm happy to try primarily focusing on the issues of ; and using other characters as sanity checks.

I think an exercise that would be good to help clarify how these characters should be handled in received URIs is to better define what path is in the ServletContext.getResource(String path) API. It is entirely unclear to me if it is an encoded URI path, a decoded URI path or in the file system space. Defining exactly what space this string is in will help us define getServletPath() and getPathInfo() because I think they need to return a string in the same space.

If the path parameter is a decoded URI path then:

  • getResource("/foo;oo/bar.txt") will return a URL of something like file://path/to/webroot/foo%3boo/bar.txt, ie a file called "bar.txt" in a directory called "foo;oo"
  • If there is a file called "foo/bar.txt" where the / is a character in the name and not a separator (possible on some, but not all file systems) then that resource cannot be accessed by getResource as getResource("/foo/bar.txt") will look for "bar.txt" in "/foo" and getResource("/foo%2fbar.txt") will look for a file with a % in its name (equiv to file://path/to/webroot/foo%252fbar.txt with the % itself % encoded).
  • On windows implementations will need to be careful that getResource("foo\bar.txt") does not find "bar.txt" in directory "foo", but instead looks for a file with the \ in it's name.

Fundamentally, the problem with the decoded URI path string space is that once decoded, you can't tell the difference between a URIs /foo/bar and /foo%2fbar. Thus if we leave in parameters, we have the same problem that it is impossible to distinguish between URIs /foo;oo/bar and /foo%3boo/bar. Thus the strings returned from the path methods and thosed passed to the getResource method are not sufficient to identify all possible resources as they have discarded some information.

If we want to be able to access all resources, the getResource(String) needs to take an encoded URI path. But then it is different to the path methods, which are documented as returning decoded strings, so we would make applications vulnerable to double decode attacks like /foo/%25./WEB-INF/secret.jsp. It only makes sense to do this if the path methods also return strings in the same space so that ServletContext.getResource(request.getPathInfo()) is a safe operation. But that is a big change that would break many/most applications that do not expect they have to do their own % encoding.

One way forward I see is:

  • embrace the partial space created by decoded URI paths and well define which resources are not accessible (e.g. files with ;, %, / and \ in their names cannot be accessed by normal requests)
  • Any request that contains a potentially ambiguous URI (eg encoded ; / or \) is by default rejected with a 400.
  • An application can declare that it understands partially decoded URI paths. For such applications, the path methods will return mostly decoded strings, but specific reserved characters will remain encoded if needed to disambiguate the path. The getResource method will likewise accept just those reserved characters as encoded if they are to be part of the file name or decoded if they are to be interpreted as part of the URI

@markt-asf
Copy link
Contributor

I agree that path in ServletContext.getResource(String path) should be consistent with the values returned for getServletPath() and getPathInfo().

As I have got my head around the evolution of path parameters in the URI RFCs from the very specific in RFC 1808 to the very generic in RFC 3986, I am now in agreement with you that we need to approach path parameters and %nn decoding in combination.

I have a way forward in mind that I think is very similar to the way forward you expressed.

I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:

  • The original path component is stored for use as the return value for getRequestURI(). (This value is neither normalized nor decoded.)

  • Any jsessionid path paraemeter present is parsed, stored and removed from the path. This makes the use of URL re-writing for session tracking mostly transparent to the application.

  • Unreserved characters are %nn decoded.

  • Exact "/../" and "/./" sequences are normalized. As per the RFC 3986 examples, "/aaa/bbb;c=d/../eee" will be normalized to "/aaa/eee". "/..;/" would be left as-is.

  • The normalized path (still containing path parameters and %nn encoded reserved characters) is mapped to provide contextPath, servletPath and pathInfo. These are normalized but encoded.

  • Three new methods getContextPathEncoded(), getServletPathEncoded() and getPathInfoEncoded() provide access to these normalized, encoded values.

  • Another three new methods getContextPathDecoded(), getServletPathDecoded() and getPathInfoDecoded() take the respective normalized and encoded values. If an unencoded reserved character is present (i.e. path parameters are present) or if an encoded '/' is present, null is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.

  • The default servlet is expected to use the getXXXDecoded() methods to map the URI to the file system

  • path in ServletContext.getResource(String path) is equivalent to getXXXDecoded()

  • The current getContextPath(), getServletPath() and getPathInfo() methods are deprecated with a note that their behaviour was ambiguous regarding normalization and decoding and that users should refer to the documentation for their container to determine the actual behaviour of these methods for that container.

This means files that use any reserved character other than '/' (which isn't a valid character for a file name on either Linux or Windows) in their names will be accessible via a normal request.

Applications and frameworks that want to use path parameters can do so via the getXXXEncoded() methods without risk of double decoding (because when we %nn decode we only decode unreserved characters).

What I'd really like to do is to avoid deprecating getContextPath(), getServletPath() and getPathInfo() as they are so fundamental to the API. However, if we use them instead of getXXXDecoded(), the nulls will likely break stuff. It we use them instead of getXXXEncoded() the unexpected encoding will likely break stuff.

One possibility would be for the application to declare which reserved characters it wished to use as delimiters. '/' would always be a delimiter. The steps above then change to:

  • ...

  • Unreserved characters and reserved characters not declared to be used by the application characters are %nn decoded.

  • Exact "/../" and "/./" sequences are normalized. As per the RFC 2986 examples, "/aaa/bbb;c=d/../eee" will be normalized to "/aaa/eee".

  • The normalized path (still containing path parameters and %nn encoded reserved characters declared to be used by the application) is mapped to provide contextPath, servletPath and pathInfo. These are normalized but encoded.

  • Three new methods getContextPathEncoded(), getServletPathEncoded() and getPathInfoEncoded() provide access to these normalized, encoded values.

  • getContextPath(), getServletPath() and getPathInfo() take the respective normalized and encoded values. If an unencoded reserved character declared to be used by the application is present (i.e. path parameters are present) or if an encoded '/' is present, null is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.

  • The default servlet is expected to use getServletPath() and getPathInfo() to map the URI to the file system

  • path in ServletContext.getResource(String path) is equivalent to getPathInfo()

%25 is an edge case in the above that would need some more thinking about in this appraoch to avoid double decoding issues but I think it is solveable.

@gregw
Copy link
Contributor

gregw commented Jul 16, 2021

I agree with a lot you say... but a few points inline...

I am assuming that we are starting with the path component of the URI and, apart from the process of extraction, it is unchanged. The steps are as follows:

  • The original path component is stored for use as the return value for getRequestURI(). (This value is neither normalized nor decoded.)
  • Any jsessionid path paraemeter present is parsed, stored and removed from the path. This makes the use of URL re-writing for session tracking mostly transparent to the application.
  • Unreserved characters are %nn decoded.

We should also specifically prohibit %uXXXX decoding. This was a rejected proposal, but there are still some usages of it, so we still offer it as a legacy mode. Would be good to kill that completely.

  • Exact "/../" and "/./" sequences are normalized. As per the RFC 3986 examples, "/aaa/bbb;c=d/../eee" will be normalized to "/aaa/eee". "/..;/" would be left as-is.
  • The normalized path (still containing path parameters and %nn encoded reserved characters) is mapped to provide contextPath, servletPath and pathInfo. These are normalized but encoded.
  • Three new methods getContextPathEncoded(), getServletPathEncoded() and getPathInfoEncoded() provide access to these normalized, encoded values.

The existing getContextPath() actually returns the encoded context path! (see javadoc that says: "The container does not decode this string"). Which is good for the number of new methods needed, but is going to make naming of new methods a nightmare!

The other thing to consider is do we really need the decoded path broken down into servletPath and pathInfo? I see so much code that is adding those back together. How about we just provide a new getEncodedPathInContext() method that returns servletPath+pathInfo ? Actually since this is a new method and we already have a mix of encoded and decoded return types, we could just call it getPathInContext() for simplicity and document that it returns the path with parameters any reserved characters encoded.

  • Another three new methods getContextPathDecoded(), getServletPathDecoded() and getPathInfoDecoded() take the respective normalized and encoded values. If an unencoded reserved character is present (i.e. path parameters are present) or if an encoded '/' is present, null is returned. Otherwise, %nn decoding of all %nn sequences is performed and the resulting value returned from the method.

I don't like even more new methods, as the API is already very fat and many of these methods might need to be replicated in request attributes, which are already too many! So instead I think we just accept that the existing getServletPath() and getPathInfo() are decoded. Instead I think we should offer some modal behaviour, where the application can declare how it wants them to work. I see the following modes (names are WIP):

  • LAX they just return the decoded path as they do today with params stripped - thus this mode doesn't break anybody (unless they are already broken when hit with ambiguous URIs)
  • STRICT they return a decoded path IFF there are no encoded reserved characters nor parameters (other than JSESSIONID which we already remove), otherwise null is returned (or perhaps an exception thrown?)
  • SAFE the container rejects with a 400 any request with a URI that encodes reserved characters or parameters

The default mode could be LAX if we don't want to break anybody, but I'm happy to allow the container to pick the default mode if a particular webapp does not indicate a preference.

  • The default servlet is expected to use the getXXXDecoded() methods to map the URI to the file system

Firstly, the default servlet cannot map URIs directly to the file system and must go via a getResource like API so it can access META-INF/resources in fragment jars etc. This also gives it the same behaviour as JspServlets etc.
Secondly using the decoded path is not sufficient to resolve the ambiguous URIs like /foo%2fbar. Instead I think we need to provide a new method URI ServletContext.getUri(String encodedPath) and the default servlet needs to serve resources equivalent to URI ServletContext.getUri(request.getPathInContext()). Note this switches to URI rather than the horrid URL class, but I guess it could also be done with Path, or maybe we provide both getPath and getUri methods?

  • path in ServletContext.getResource(String path) is equivalent to getXXXDecoded()
    See above... is equivalent to the string returned by get[Servlet]Path[Info]()
  • The current getContextPath(), getServletPath() and getPathInfo() methods are deprecated with a note that their behaviour was ambiguous regarding normalization and decoding and that users should refer to the documentation for their container to determine the actual behaviour of these methods for that container.

See above. I'd prefer not to deprecate these methods as the vast majority of applications use them and have no need to ever see an path that contains a decoded reserved character. Thus the vast majority of applications will continue to work perfectly well with no code changes even if their container switches to STRICT or SAFE mode. The few applications that do need handle encoded reserved characters are probably working with getRequestURI anyway, but they can be modified to use the new methods if they like.

This means files that use any reserved character other than '/' (which isn't a valid character for a file name on either Linux or Windows) in their names will be accessible via a normal request.

Remember that resources can be served from META-INF/resources in a fragment jar from WEB-INF/lib. These jars can be produced on one OS and run on another and thus could contain all sorts of nasty characters that are not legal in the local filesystem. Also OS's can mount foreign filesystems, so I just don't think we can assume that a resource name will never contain /. I admit that such resources should be vanishingly rare and I can't really see a use for them (other than as some kind of exploit bomb droped into somebody elses WEB-INF/lib. Thus I'm good with an approach that basically just rejects such URIs unless the application declares that it really wants to use them.

Applications and frameworks that want to use path parameters can do so via the getXXXEncoded() methods without risk of double decoding (because when we %nn decode we only decode unreserved characters).

But we do need to provide something like my proposed URI ServletContext.getUri(String encodedPath) and/or a ServletContext.getResourceEncoded(String) method, that handle exactly only those encoded reserved characters and throws IAE if it starts seeing other % encoded characters. i.e. it accepts exactly the strings returned from the new getXxxEncoded() method(s).

What I'd really like to do is to avoid deprecating getContextPath(), getServletPath() and getPathInfo() as they are so fundamental to the API. However, if we use them instead of getXXXDecoded(), the nulls will likely break stuff. It we use them instead of getXXXEncoded() the unexpected encoding will likely break stuff.

One possibility would be for the application to declare which reserved characters it wished to use as delimiters. '/' would always be a delimiter. The steps above then change to:

URI handling should be portable and we really don't need any doubt about what are reserved characters or not, specially as the getResource[En|De]coded style APIs are expected to be fed from the results of the getPath[En|De]coded style APIs.

So in summary, I like the decoding approach you've outlined, but think we can achieve the outcome we want with:

  • only 2(or 3) new methods: String HttpServletRequest.getPathInContext() and URI ServletContext.getUri(String encodedPath) (and/or Path ServletContext.getPath(String encodedPath)).
  • Introduce modal behavior for existing get[Servlet]Path[Info]() methods so that they either: act as they do today; return null for ambiguous URIs; are not even possible to call for ambiguous URIs. The other good thing about this approach is that the modes, once defined can be applied by a container to existing applications written against the current servlet-api versions and 99.9% of webapps will work perfectly fine in STRICT or SAFE mode with no code changes.

@markt-asf
Copy link
Contributor

Yes, I agree with pretty much all of that. It is great to feel that some progress is being made in this area.

I wasn't aware %uXXXX was even a possibility. I'd be happy to exclude that and any other weirdness. Something like:

The only permitted encoding scheme for URIs is the pct-encoded scheme defined in RFC 3986.

I'd forgotten that HttpServletRequest.getContextPath() returned the encoded path. I'm surprised given the complete pain that was to implement. Maybe I blocked it out ;). I agree that complicates naming.

Regarding splitting servletPath and pathInfo, the feedback I have from the Spring framework is that they do need them to be separate. I'm not against new methods like getPathInContext() but neither am I convinced it is necessary at this point.

I hear your objection regarding new methods and implications for request attributes. That is a good reason to try and minimise the number of new methods.

I agree the default servlet needs to use something like getResource(). My point was more about what gets passed to that method. I'm aiming for a solution that supports the widest possible set of filenames with the minimum number of special cases that require additional handling. My hope is that we can define behaviour in such a way that as many of the ambiguities as possible are designed out. For example so that using a path parameter (other than jsessionid) with a static resource would result in a 404.

My thinking around allowing applications to effectively remove some characters from the reserved set was to assist with migration with what was going to be a stricter regime. If the consensus is that that would create more ambiguity/problems than it solves, I'm happy to leave that as a potential container specific feature.

I like the idea of allowing the application to operate in different modes. That provides a lot more flexibility without having to add a new set of methods for each mode. Regarding each mode:

  • LAX: Maybe call this LEGACY? I agree this should be the "same as Servlet 5" mode. My only comment is that getContextPath() won't be decoded in this mode. The behaviour of this mode may be slightly different for different containers depending on how they implemented the methods in prior versions.

  • DECODED: new mode. Should be very similar to LAX (depending on container behaviour in prior versions) but like getServletPath() and getPathInfo(), getContextPath() also has parameters stripped, is decoded and normalized.

  • STRICT: I'd prefer returning null rather than throwing an exception if %nn and/or parameters are present as I think this comes under "Use exceptions only for exceptional conditions"

  • SAFE: I think this is very similar to STRICT (since I'm assuming the application will treat a 'null' being returned in STRICT as an error condition) but moves the rejection of the request earlier in the processing. I'd suggest a 404 rather than a 400 since that suggests the request is not a valid request rather than something the server is chossing not to respond to. I've been trying to think of a use case where STRICT would be chosen over SAFE and I can't. Did you have a use case in mind?

  • ENCODED: new mode. Non-reserved characters %nn decoded. Exact "/../" and "/./" sequences are normalized. Encoded reserved characters and path parameters remain.

After some further experiments with new File(URI), I agree there is a need for something like URI ServletContext.getUri(String encodedPath) to bridge the gap between the possibly encoded and/or containing parameters path returned by getServletPath() and getPathInfo() and methods like ServletContext.getResource(String) and friends. However, I think it needs to return String so that can then be passed into the getResource methods. I'm not sure about the name but something like String ServletContext.getResourcePath(path) where:

  • if path contains any reserved characters apart from / the method returns null (this effectivly blocks parameters)
  • if path contains %2f the method returns null. We could add a boolean parameter to allow this but I can't think of a use case for this but I can think of plenty of security exploits if it were allowed.
  • decode all remaining %nn sequences
  • return the remaining string

If the returned string was then used in ServletContext.getResource(String) and friends, it should mean that files that happen to use reserved characters in their paths would be accessible.

If we included the behaviour of ServletContext.getResource(String) and friends in the definition of the operating modes discussed above, we could potentially remove the need for the new conversion method by having the container do this conversion inside ServletContext.getResource(String) and friends. It would only be needed for the new ENCODED mode.

@gregw
Copy link
Contributor

gregw commented Jul 22, 2021

I think the use-case for a mode like STRICT vs SAFE is a webapp that has multiple servlets, some of which that are prepared to deal with encoded paths, but others that have not been updated to do so. In such a mixed mode deployment, the updated servlets could use new methods to access the encoded path, but if any requests with encoded paths arrived at non-updated servlets, then they would either throw or get a null if they used the existing path methods.

Ultimately, this scenario shows the limits of the modal approach, in that the mode really doesn't apply to the whole webapp, but should apply servlet by servlet and filter by filter. However I don't think that fine grained approach would be workable, as what do you do if a request is mapped to a servlet that is ENCODED mode, but also matches a filter that declares it is in DECODED mode. Thus I think the modes need to be container wide. But I also think that having a new API to get the resesrved-only-encoded form would allow filters/servlets to be written that will work correctly regardless of mode.

So I like your modes (and their names), but they only apply to the existing path methods and to the interpretation of the existing getResource(String) methods.
We then need a new methods like String HttpServletRequest.getUriPathInContext() and URI ServletContext.getURI(String uriPathInContext) that work on unambiguous reserved-only-encoded forms that will work in exactly the same way regardless of the mode. We then can have:

  • SAFE is used for 99.99% of applications that don't care about strangely encoded paths
  • STRICT can be used by applications that have some filters/servlets updated to use new APIs but some filters/servlets that are not updated but need to be protected from strange encodings.
  • ENCODED can be used by applications that have updated all filters/servlets to be able to handle reserved-only-encoded forms.
  • LEGACY is for the 0.001% of applications that really need the current bad behaviour
  • DECODED is the one I can't really see a use-case for, but it's existance is kind of symmetric with ENCODED

So we could get away with just SAFE, STRICT and LEGACY if we wanted to minimize the number of modes.

@gregw
Copy link
Contributor

gregw commented Jul 26, 2021

I've been thinking about the modes and I'd really like to keep them to a minimum.

I think we can get by with just a boolean legacy mode in which the getServletPath and getPathInfo methods return fully decoded paths and the getResource and getRealPath methods expect fully decoded paths as well.

When not in legacy mode then getContextPath, getServletPath and getPathInfo return strings in which only URI reserved characters and ; are % encoded. The getResource and getRealPath methods expect similarly reserved-only encoded strings (any encoded non-reserved character is an IAE).

The behaviours of other modes described above can be obtained by filters or perhaps new security constraint types. To assist with that, it would be good to have a new method boolean HttpServletRequest.isFullyDecoded() that would return true if there are no encoded reserved characters in the URI (name is WIP).

The I still think we need the new String HttpServletRequest.getPathInContext() and URI ServletContext.getURI(String pathInContext) that always work in encoded-non-reserved mode regardless of the legacy mode.

We could even avoid having an explicit legacy mode switch and instead just use the version of the web.xml file (with no web.xml defaulting to 6.0 semantics).

In summary, in a 6.0 webapp, all the following methods would work in only-reserved-characters-encoded mode: getContextPath(), getServletPath(), getPathInfo(), getPathInContext(), getResource(String), getRealPath(String), getURI(String). In legacy mode, the pre-existing methods work as they always did, but the new methods work in only-reserved-characters-encoded mode.

@markt-asf
Copy link
Contributor

I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points.

I agree reducing modes is good. I've no objection to dropping STRICT and SAFE. The more I thought about it, the less comfortable I was with the idea of rejecting valid URIs. LEGACY and DECODED should be very similar. My thinking in adding DECODED was to cover the case where a container's current behaviour was closer to ENCODED. However, I haven't seen any evidence that any container behaves like that so DECODED could be dropped on the basis the use case for it is purely theoretical. That leaves LEGACY and ENCODED as you suggested. I'm not sure about using a boolean to control the mode. I'd like to leave open the possibility of containers adding their own custom mode(s) and/or us adding an additional mode as this discussion evolves. An Enum was my first thought but that isn't extensible. Maybe a string where values that start X- are reserved for container specific extensions and all other names are reserved for the specification.

For LEGACY mode, I think we are in agreement if by "fully decoded paths" you mean:

  • extract jsessionid if any, removing it from the URI
  • remove all remaining path parameters
  • decode everything apart from %2f
  • normalize
  • map to Servlet
  • determine paths based on the Servlet the request was mapped to

For ENCODED mode, the above becomes:

  • extract jsessionid if any, removing it from the URI
  • decode all %nn sequences excluding RFC 3986 reserved characters and '%25for%`
  • normalize
  • map to Servlet
  • determine paths based on the Servlet the request was mapped to

Which brings us nicely to the topic of mapping. In LEGACY mode mapping is easy. In ENCODED mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:

  1. ignore path parameters when mapping
  2. introduce some form of URI pattern matching that accounts for path parameters in <url-pattern> elements
  3. only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere
  4. soemthing else

Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like /..;/ could be used for a path traversal attack.

We also need to decide if the vales in <url-pattern> elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here.

I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.

An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages.

I think the final part of this puzzle is HttpServletMapping. If we choose option 1 for mapping requests then that would mean the current behaviour of HttpServletMapping is unchanged.

What was your thinking in requring an IAE if getResource() and friends were called with a path that included a %nn encoded non-reserved character?

I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.

@gregw
Copy link
Contributor

gregw commented Jul 27, 2021

I'll digest your main responses tomorrow.... but I'll answer the easy one below.

What was your thinking in requring an IAE if getResource() and friends were called with a path that included a %nn encoded non-reserved character?

The idea is that those methods should take precisely the strings returned from the path methods. Currently they accept decoded paths, so if there is a % character in the arg, it is considered part of the file name. If we start allowing arbitrary characters to be encoded, then we are more likely to break the few apps that do actually pass % characters. If we only accept % encoding for precisely the reserved characters that the path methods leave encoded then that a) forces us to very well define those characters; b) makes minimal changes to how those methods are used today (for the 99.999% case); c) any modes introduced are exactly the sane for what the path methods return and what the resource methods take.

@gregw
Copy link
Contributor

gregw commented Jul 28, 2021

I've been thinking about the impact this discussion has on mapping requests. But first I'd like to respond to your points.

That leaves LEGACY and ENCODED as you suggested. I'm not sure about using a boolean to control the mode. I'd like to leave open the possibility of containers adding their own custom mode(s) and/or us adding an additional mode as this discussion evolves. An Enum was my first thought but that isn't extensible. Maybe a string where values that start X- are reserved for container specific extensions and all other names are reserved for the specification.

How this is handled all depends on where we see the future. Do we want an arbitrary Servlet-9.0 application to be able to continue to choose between the modes for the path methods - ie LEGACY is an entirely valid mode into the future; or is that the future of the path methods is to return the reserve-encoded form that we define in 6.0 and we only offer a legacy mode to ease the transition. I favour the later as carrying legacy modes over multiple major releases will result in the need to version the legacy modes.

So if we say the path methods from 6.0 onwards are intended to return the reserve-encoded form, then I think it is entirely valid to use the existence of a < 6.0 web.xml to trigger the legacy mode. We can then add some verbage to say that a container is free to all the mode to be specifically configured and to override the web.xml. If we go this way, then we don't even need to define named modes a LEGACY and ENCODED. Instead we just define the bahaviour of these methods in 6.0 and note that they behaive differently if there is a <6.0 web.xml present.

For LEGACY mode, I think we are in agreement if by "fully decoded paths" you mean:

  • extract jsessionid if any, removing it from the URI
  • remove all remaining path parameters
  • decode everything apart from %2f

Nope - I think everything is decoded, including %2f. Do you currently leave %2f encoded?

  • normalize
  • map to Servlet
  • determine paths based on the Servlet the request was mapped to

For ENCODED mode, the above becomes:

  • extract jsessionid if any, removing it from the URI
  • decode all %nn sequences excluding RFC 3986 reserved characters and '%25for%`
  • normalize
  • map to Servlet
  • determine paths based on the Servlet the request was mapped to

Yep, modulo treatment of ; below.

Which brings us nicely to the topic of mapping. In LEGACY mode mapping is easy. In ENCODED mode, any of the path components (context path, servlet path and path info) may have path parameters. We need to figure out how to handle this. Possible options are:

  1. ignore path parameters when mapping
  2. introduce some form of URI pattern matching that accounts for path parameters in <url-pattern> elements
  3. only allow path parameters in path info, rejecting requests with a 404 if they appear elsewhere
  4. soemthing else

Spring supports path parameters in the servlet path and path info. If we want to support that (and I think we should) that rules out option 3. Option 2 would mean defining how path parameters work. My concern is that there will be multiple different approaches to this currently in use. Therefore, I like option 1 which treats the path parameters as opqaue data that the application processes as it sees fit. One note on security. The URI must not be normalised after removal of path parameters else segments like /..;/ could be used for a path traversal attack.

What about option 5. Treat ; characters other than as part of a JSESSIONID as just a normal character in a segment?
According to the RFCs, path parameters are no longer a thing generally. So we should take out the one for session ID and leave all others in the path. For mapping, we would then allow a filter/servlet to be mapped to something like /foo;bar where ; is treated no differently to any other character. If we really REALLY wanted to do more, then perhaps we could introduce a new mapping like /foo;* which would match any URI like /foo, /foo/, /foo;anything, but not /foo/bar or /foo;anything/bar. We don't support any mid patterns like /foo/*/bar or /foo*/bar already, so I don't see why we should start for parameters.

We also need to decide if the vales in <url-pattern> elements should have reserved characters encoded or not. I think either will work (I haven't thought of a case that does yet) but for consistency my current preference is for encoding reserved characters here.
+1
And I'd prefer it to be an error if any other characters are encoded.

I had been thinking that the mode would be per web application. Your comments made think about the practicalities of per servlet or per container. I think a per container approach will create issues as different applications will inevitably update to the newer approach at different rates. With a mapping approach that ignores path parameters, I think we can safely consider mode at the web application level. I think we could implement mode at the Servlet level. I agree with your assessment that this would complicate things for Filters that are used with Servlets in different modes. I think those compications are surrmountable - essentially the servlet needs to expose the mode it is configured to use so the Filter can adjust accordingly. There are also a bunch of other ways applications could address (or just avoid) this issue. From an implementation perspective, I'm not sure per Servlet is that more (any?) more complicated than per web application.

Hmmmmm maybe. But I think this just complicates things like asynchronous threads that call the path methods. Already it is a problem that an async thread looking at a request that is passed to a request dispatcher will see different values in a race with the handling of the request in and out of the dispatcher. If we make this per servlet, then an async thread will see different encoding values in a race. The real fix for this is to get rid of object-identity-wrapping so that async threads can be passed a container-wrapped request and the values they see from the path methods would never change.... but that is another discussion.

Ultimately, it comes down to if this mode is something valid going forward or just some legacy help to get applications ported to the new 6.0 behaviour. I think the later, so I think having a context wide mode is fine. If there are really some filters/servlets that can run in that mode then we can provide utility wrappers that will take a 6.0 request running in the new mode and decode any reserved characters to provide the <6.0 behaviour (in fact I was already thinking of providing this mode with a container provided wrapper prior to the first dispatch - hence not making core code horridly modal).

An advantage of setting the mode per Servlet is that I think that removes the need for the additional methods you proposed. I am rather wary of adding extra methods and a solution that addresses the current ambiguities without the need for extra methods is very tempting. I also like flexibility this provides for larger applications that may need to migrate to the new way of doing things in stages.

I think the new methods I'm proposing can be justified even without this problem. Too frequently are servletPath and pathInfo recombined to a pathInContext. The resource methods really should be usable with URI and/or Path rather than the antique URL. They a good improvements anyway, and I just really a subject in this discussion because they can be specified with the new better behaviour and never need to be modal.

I think the final part of this puzzle is HttpServletMapping. If we choose option 1 for mapping requests then that would mean the current behaviour of HttpServletMapping is unchanged.
Or option 5.

I think we have at least mentioned all the areas affected by these changes. Can you think of any we've missed? Once we have the majority of this agreed I plan to start implementing this in Tomcat to confirm that what we come up with is practical to implement. I suspect that will result in a little fine-tuning.

How about prior to implementing in Tomcat we create a branch of the api/spec and a draft PR. We can word craft the spec/javadoc/api changes we are describing above in that PR and that will ensure we are really on the same page (I think we are currently in the same chapter of the same book, so that's good). In parallel we can try out that branch in Tomcat and Jetty.

@gregw
Copy link
Contributor

gregw commented Oct 7, 2021

@stuartwdouglas @markt-asf I've had a go at update the wiki.
I've adding in that we have to convert octet sequences to characters with UTF-8.

I wish there was a way to comment & suggest on the wiki like there is with PRs. Maybe we should move this to another PR so we can do so?

@stuartwdouglas
Copy link
Contributor

I think a PR would be easier. I mostly agree with everything there (including the TODO's that need more work), and I have added an additional TODO: if %2F is allowed, we may now have double '/', '/./' and '/../' segments in the URL, should stage 3 and 4 be re-run if this is allowed?

@gregw
Copy link
Contributor

gregw commented Oct 7, 2021

I'll move to a PR.... stand by.....

gregw added a commit that referenced this issue Oct 7, 2021
Issue #18 URI path processing. Moved the wiki text to a PR for better comment and suggestions.
@gregw
Copy link
Contributor

gregw commented Oct 7, 2021

I have created #424.
Sorry @markt-asf, I've kind of hacked the wiki contents around a fair bit... and not yet necessarily for the better. Hopefully the PR tools will allow us to iterate and refine.

gregw added a commit that referenced this issue Oct 7, 2021
moved TODOs to PR comments
gregw added a commit that referenced this issue Oct 7, 2021
Define URI transport as description list with HTTP/1.0 and HTTP/3 included.
gregw added a commit that referenced this issue Oct 7, 2021
Move all rejections to the last step.
gregw added a commit that referenced this issue Oct 7, 2021
gregw added a commit that referenced this issue Oct 7, 2021
Added example table. too long and needs review
gregw added a commit that referenced this issue Oct 19, 2021
Issue #18 URI path processing

Co-authored-by: Mark Thomas <markt@apache.org>
@gregw gregw linked a pull request Oct 20, 2021 that will close this issue
@gregw gregw linked a pull request Nov 30, 2021 that will close this issue
@markt-asf markt-asf reopened this May 23, 2024
@markt-asf
Copy link
Contributor

The final PR included an update to the Javadoc for HttpSservletRequest.getContextPath() that stated the return value would be canonicalized. However, this method has always returned exactly what the client sent. No canonicalization. No decoding.
This looks like an error on my part when I added the final updates to the Javadoc. Do we want to remove the canonicalization text?

@gregw
Copy link
Contributor

gregw commented May 23, 2024

We've always interpreted this as returning the encoded context path, but what is configured on the server rather than what the client sent. I think returning whatever the client sent is a bit non deterministic.

@markt-asf
Copy link
Contributor

We have a narrow window of opportunity to fix this for 6.1 since I need to re-roll the release to fix the Javadoc copyright dates. I intend to simply remove the "The path will be canonicalized as per..." sentence that was added unless we managed to agree an alternative text very quickly (today).

My understanding of the expected behaviour is that for a context path of /aaa this method behaves as follows:

Request URI Context Path
/aaa/index.html /aaa
/a%61a/index.html /a%61a
/bbb/../aaa/index.html /bbb/../aaa

The no decoding (2nd line) is explicit in the original Javadoc.

No canonicalization (3rd line) was implied. From memory that requirement originated from a need to be able to strip the context path from the return value of getRequestURI() .

My proposal for a change is to align it with getRequestURI() . ie:

  • remove the "The path will be canonicalized as per..." sentence
  • change the first sentence to start "Returns the portion of the request URI from the first line of the HTTP request that ..."

@markt-asf
Copy link
Contributor

I applied the simple fix that removed the "The path will be canonicalized as per..." sentence and re-tagged. We can agree the new wording for 6.2 in slightly slower time.

@gregw
Copy link
Contributor

gregw commented May 25, 2024

I'm good with the simple fix for 6.1. I'm still cautious about the idea of returning what the client sent for the context path, least not that it is actually difficult to work out. But that's an ambiguity that had existed before asked nobody had complained that jetty returns /aaa for all your examples. We only return an encoded character if it needs to be encoded in they context path. But agree we can sort that out in 6.2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Candidate4NextRelease This issues should be consider for inclusion in the next release project. Enhancement New feature or request
Projects
None yet
5 participants