Description
Hey!
The url_encode and url_decode functions are already implemented, but we have a few questions about the expected encoding behavior in some cases, which is why they're both snapshot functions for now.
Under the hood, url_encode uses java.net.URLEncoder.encode(...), which is meant to encode URL form data (i.e., the application/x-www-form-urlencoded MIME type). Its behavior deviates slightly from the percent-encoding rules in RFC 3986, section 2, but only for a few characters. For example, * stays the same instead of being encoded to %2A, and a space gets encoded to + instead of %20.
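To make the deviation concrete, here is a small sketch that calls java.net.URLEncoder directly. The class and method names (FormEncodingDemo, formEncode) are made up for illustration and are not part of the url_encode implementation; the Charset overload of encode requires Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FormEncodingDemo {
    // Hypothetical helper: plain form encoding (application/x-www-form-urlencoded).
    static String formEncode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // '*' is left untouched, the space becomes '+', and '~' is percent-encoded
        // even though RFC 3986 lists it as unreserved.
        System.out.println(formEncode("a b*c~d")); // prints a+b*c%7Ed
    }
}
```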
Should url_encode strictly follow the RFC? Another thing to keep in mind is the URL decode processor, which is already in place and internally calls the same code as the url_decode function: URLDecoder.decode(...). Just mentioning that in case we want to keep all encoders/decoders in sync in terms of behavior.
In addition, https://www.example.com becomes https%3A%2F%2Fwww.example.com, and that's RFC-compliant. But do we really want that?
Useful Links
- https://en.wikipedia.org/wiki/Percent-encoding
- https://datatracker.ietf.org/doc/html/rfc3986
- Java's URLEncoder and URLDecoder
- Example of an encoder that's explicit about RFC3986 compliance: https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/util/UriUtils.html
Thanks for your input!! 🤠
Post-discussion update
We will do the following:
- Add URL_ENCODE_COMPONENT scalar function, which encodes spaces as %20, and change the existing URL_ENCODE to encode spaces as +.
- For both functions, all characters in the input are encoded, except the RFC3986-safe set which consists of alphanumerics, ., -, _, and ~. These don't change.
- URL_DECODE isn't problematic, and does the exact same op as URL decode processor.
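The agreed behavior could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual implementation: the class and method names (UrlEncodeSketch, urlEncode, urlEncodeComponent) are hypothetical, the input is assumed to be encoded as UTF-8, and "alphanumerics" is taken to mean ASCII letters and digits.

```java
import java.nio.charset.StandardCharsets;

public class UrlEncodeSketch {
    // RFC 3986 unreserved set: ASCII letters, digits, '.', '-', '_' and '~'.
    private static boolean isUnreserved(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
            || b == '.' || b == '-' || b == '_' || b == '~';
    }

    private static String encode(String input, boolean spaceAsPlus) {
        StringBuilder sb = new StringBuilder();
        // Assumption: the input is percent-encoded byte by byte over its UTF-8 form.
        for (byte raw : input.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isUnreserved(b)) {
                sb.append((char) b);          // safe characters pass through unchanged
            } else if (b == ' ' && spaceAsPlus) {
                sb.append('+');               // form-style space
            } else {
                sb.append('%')
                  .append(Character.toUpperCase(Character.forDigit(b >> 4, 16)))
                  .append(Character.toUpperCase(Character.forDigit(b & 0xF, 16)));
            }
        }
        return sb.toString();
    }

    // Hypothetical counterpart of URL_ENCODE: spaces become '+'.
    static String urlEncode(String s) {
        return encode(s, true);
    }

    // Hypothetical counterpart of URL_ENCODE_COMPONENT: spaces become "%20".
    static String urlEncodeComponent(String s) {
        return encode(s, false);
    }

    public static void main(String[] args) {
        System.out.println(urlEncode("a b*c~d"));          // prints a+b%2Ac~d
        System.out.println(urlEncodeComponent("a b*c~d")); // prints a%20b%2Ac~d
    }
}
```

The two variants differ only in how they treat the space character; everything outside the safe set, including * and the : and / of a full URL, is percent-encoded in both.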