Add URL Parser for RFC 3986 #33639

Closed
rstoyanchev opened this issue Oct 3, 2024 · 0 comments
rstoyanchev commented Oct 3, 2024

Before 6.2, UriComponentsBuilder used regular expressions for parsing. Generally, they split on the main component delimiters, ":", "/", "?", and "#", but did not enforce the allowed character set of each component. The resulting UriComponents can then encode any non-conforming characters.
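To illustrate delimiter-only splitting, here is a minimal sketch based on the regular expression published in RFC 3986, Appendix B. It is not the pattern UriComponentsBuilder used; the class name `Rfc3986RegexSplit` and the `split` method are hypothetical and exist only for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Rfc3986RegexSplit {

    // Regular expression from RFC 3986, Appendix B. It only locates the
    // delimiters ":", "/", "?", and "#"; it does not validate the characters
    // that appear inside each component.
    private static final Pattern URI_PATTERN = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    /** Returns {scheme, authority, path, query, fragment}; absent parts are null. */
    public static String[] split(String uri) {
        Matcher m = URI_PATTERN.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a URI reference: " + uri);
        }
        return new String[] {m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)};
    }

    public static void main(String[] args) {
        // Spaces and curly braces pass through untouched.
        String[] parts = split("https://example.com/hotels/{id} list?q=a b#top");
        System.out.println(String.join(" | ", parts));
        // prints: https | example.com | /hotels/{id} list | q=a b | top
    }
}
```

The space and the `{id}` placeholder survive the split unchanged, which is the leniency described above: nothing is rejected, and encoding is left to a later step.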

Regular expressions are convenient, but provide limited control and visibility. This is why in #32513 we added an implementation of the URL parsing algorithm from the WHATWG URL Living Standard, which browsers use to align on how to handle a wide range of cases leniently. While this provides more robust parsing than before, arguably on a server we can expect URLs that don't deviate from the RFC quite as far as what browsers need to be able to handle.

We can add a new parser that follows RFC syntax along the lines of the java.net.URI or Jetty's HttpUri parsers. The new parser should respect the main component delimiters, but otherwise leave some room for leniency within each component to allow some characters like spaces or curly braces (URI variables), similar to what the regular expressions did. UriComponents can then encode any non-conforming characters that remain after URI variables are expanded.
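For contrast, the following sketch shows why some leniency is needed on top of a strict RFC-style parser: java.net.URI rejects both spaces and curly braces, so a URI template cannot be parsed with it as-is. The class `StrictUriCheck` and its helper method are hypothetical names used only for illustration, not part of the proposed parser.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictUriCheck {

    /** Returns true if java.net.URI accepts the input without modification. */
    public static boolean acceptedByJavaNetUri(String input) {
        try {
            new URI(input);
            return true;
        }
        catch (URISyntaxException ex) {
            // Thrown for characters outside the sets that java.net.URI allows.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(acceptedByJavaNetUri("https://example.com/hotels/42"));   // true
        System.out.println(acceptedByJavaNetUri("https://example.com/hotels/{id}")); // false
        System.out.println(acceptedByJavaNetUri("https://example.com/hotel list"));  // false
    }
}
```

A parser that tolerates these characters within a component, while still honoring the component delimiters, lets URI variables be expanded first and any remaining non-conforming characters be encoded afterwards.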

It should be possible to choose which parser to use, RFC or WHATWG, for cases when more leniency or alignment with browsers is needed.

The topic of RFC vs WHATWG parsing was first brought up by @joakime in #33542. For broader context, and a possible future effort to standardize lenient parsing of user-provided URLs, see https://lists.w3.org/Archives/Public/ietf-http-wg/2024JulSep/0281.html.

@rstoyanchev rstoyanchev added in: web Issues in web modules (web, webmvc, webflux, websocket) type: enhancement A general enhancement labels Oct 3, 2024
@rstoyanchev rstoyanchev added this to the 6.2.0-RC2 milestone Oct 3, 2024
@rstoyanchev rstoyanchev self-assigned this Oct 3, 2024
rstoyanchev added a commit that referenced this issue Oct 7, 2024
An example of this can be found in RFC 2732, but it is obsoleted by
RFC 3986 whose syntax for IPv6address does not allow dots.

Also, Appendix D of RFC 3986:

As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
address, which, unfortunately, lacks an ABNF description of
IPv6address, we created a new ABNF rule for IPv6address that matches
the text representations defined by Section 2.2 of [RFC3513].

See gh-33639
rstoyanchev added a commit that referenced this issue Oct 7, 2024
isUnreserved and isSubDelimiter are usually checked together. It helps
to have a shortcut with an efficient lookup.

See gh-33639
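One way such a combined shortcut could look is a precomputed table indexed by character code. This is a sketch under assumptions, not the code from the commit: the class `UriCharLookup` and the method `isUnreservedOrSubDelimiter` are hypothetical names, and the character sets are taken from RFC 3986, Sections 2.2 and 2.3.

```java
public class UriCharLookup {

    // One flag per ASCII code point; true if the character is "unreserved"
    // (RFC 3986, Section 2.3) or a "sub-delims" character (Section 2.2).
    private static final boolean[] UNRESERVED_OR_SUB_DELIM = new boolean[128];

    static {
        for (char c = 'a'; c <= 'z'; c++) {
            UNRESERVED_OR_SUB_DELIM[c] = true;
        }
        for (char c = 'A'; c <= 'Z'; c++) {
            UNRESERVED_OR_SUB_DELIM[c] = true;
        }
        for (char c = '0'; c <= '9'; c++) {
            UNRESERVED_OR_SUB_DELIM[c] = true;
        }
        // Remaining unreserved characters.
        for (char c : "-._~".toCharArray()) {
            UNRESERVED_OR_SUB_DELIM[c] = true;
        }
        // sub-delims characters.
        for (char c : "!$&'()*+,;=".toCharArray()) {
            UNRESERVED_OR_SUB_DELIM[c] = true;
        }
    }

    /** Single array lookup instead of two separate range and set checks. */
    public static boolean isUnreservedOrSubDelimiter(int c) {
        return c >= 0 && c < UNRESERVED_OR_SUB_DELIM.length && UNRESERVED_OR_SUB_DELIM[c];
    }

    public static void main(String[] args) {
        System.out.println(isUnreservedOrSubDelimiter('~')); // true
        System.out.println(isUnreservedOrSubDelimiter('=')); // true
        System.out.println(isUnreservedOrSubDelimiter('/')); // false
    }
}
```

The table costs 128 booleans and turns the check into a bounds test plus one array read, which matters when it runs once per character of every parsed URL.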
rstoyanchev added a commit that referenced this issue Oct 7, 2024
rstoyanchev added a commit that referenced this issue Oct 7, 2024
spencergibb added a commit to spring-cloud/spring-cloud-gateway that referenced this issue Oct 11, 2024
ryanjbaxter added a commit to spring-cloud/spring-cloud-config that referenced this issue Oct 15, 2024