[ESQL] Encoding behavior for the url_encode function

Hey!

The `url_encode` and `url_decode` functions are already implemented, but we have a few questions about the expected encoding behavior in some cases, which is why both are snapshot functions for now.

Under the hood, `url_encode` uses `java.net.URLEncoder.encode(...)`, which is meant to encode URL form data (i.e., the `application/x-www-form-urlencoded` MIME type). Its behavior deviates slightly from the percent-encoding rules outlined in [RFC 3986, section 2](https://datatracker.ietf.org/doc/html/rfc3986#section-2), but only for a few characters. For example, `*` stays the same instead of being encoded to `%2A`, and ` ` (space) gets encoded to `+` instead of `%20`.
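
To make the two deviations concrete, here is a minimal Java sketch that calls `URLEncoder.encode` directly. The class and method names (`UrlEncoderDeviations`, `formEncode`) are made up for illustration and are not part of the ES|QL code base.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, only meant to illustrate URLEncoder's form-style output.
final class UrlEncoderDeviations {

    // Thin wrapper around the JDK call mentioned above.
    static String formEncode(String input) {
        return URLEncoder.encode(input, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Space becomes "+" rather than "%20".
        System.out.println(formEncode("a b")); // a+b
        // "*" is left untouched rather than becoming "%2A".
        System.out.println(formEncode("a*b")); // a*b
    }
}
```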

Should `url_encode` strictly follow the RFC? Another thing to keep in mind is the [URL decode processor](https://www.elastic.co/docs/reference/enrich-processor/urldecode-processor), which is already in place and internally calls the same code as the `url_decode` function: `URLDecoder.decode(...)`. Just mentioning that in case we want to keep all encoders/decoders in sync in terms of behavior.

In addition, `https://www.example.com` becomes `https%3A%2F%2Fwww.example.com`, and that's RFC-compliant. But do we really want that?
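
For context, the sketch below contrasts encoding a whole URL with encoding only a query value. It uses the JDK's `URLEncoder` as a stand-in; the class name, helper names, and the `/search?q=` URL are hypothetical and only there to show the difference.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; helper names and the sample URL are assumptions.
final class WholeUrlVsComponent {

    // Encodes every reserved character, including the scheme separator and slashes.
    static String encodeWhole(String url) {
        return URLEncoder.encode(url, StandardCharsets.UTF_8);
    }

    // Leaves the base URL alone and encodes only the query value.
    static String withEncodedQueryValue(String base, String value) {
        return base + "?q=" + URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encodeWhole("https://www.example.com"));
        // https%3A%2F%2Fwww.example.com
        System.out.println(withEncodedQueryValue("https://www.example.com/search", "a&b"));
        // https://www.example.com/search?q=a%26b
    }
}
```

Encoding the whole string is the right call when the URL itself is passed as a parameter value; it is the wrong call when the result is meant to be used as a URL.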


**Useful Links**

- https://en.wikipedia.org/wiki/Percent-encoding
- https://datatracker.ietf.org/doc/html/rfc3986
- Java's [URLEncoder](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URLEncoder.html) and [URLDecoder](https://docs.oracle.com/en/java/javase/21/docs/api/java.base/java/net/URLDecoder.html)
- Example of an encoder that's explicit about RFC3986 compliance: https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/web/util/UriUtils.html

Thanks for your input!! 🤠

-----

**Post-discussion update**
We will do the following:
- Add a `URL_ENCODE_COMPONENT` scalar function, which encodes spaces as `%20`, and change the existing `URL_ENCODE` to encode spaces as `+`.
- For both functions, all characters in the input are encoded except the RFC 3986 safe set, which consists of alphanumerics, `.`, `-`, `_`, and `~`. These are left unchanged.
- `URL_DECODE` isn't problematic and performs exactly the same operation as the [URL decode processor](https://www.elastic.co/docs/reference/enrich-processor/urldecode-processor).
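
The rules above can be sketched in plain Java as follows. This is an illustration of the described behavior under the assumption that the input is encoded as UTF-8 bytes, not the actual ES|QL implementation; the class and method names (`UrlEncodeSketch`, `urlEncode`, `urlEncodeComponent`) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the agreed behavior; not the ES|QL implementation.
final class UrlEncodeSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Safe set from the update: alphanumerics, '.', '-', '_' and '~'.
    static boolean isSafe(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
            || b == '.' || b == '-' || b == '_' || b == '~';
    }

    // Percent-encodes every UTF-8 byte outside the safe set.
    // The only difference between the two variants is how a space is written.
    static String encode(String input, boolean spaceAsPlus) {
        StringBuilder out = new StringBuilder();
        for (byte raw : input.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isSafe(b)) {
                out.append((char) b);
            } else if (b == ' ' && spaceAsPlus) {
                out.append('+');
            } else {
                out.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return out.toString();
    }

    // Mirrors the described URL_ENCODE: spaces become "+".
    static String urlEncode(String s) { return encode(s, true); }

    // Mirrors the described URL_ENCODE_COMPONENT: spaces become "%20".
    static String urlEncodeComponent(String s) { return encode(s, false); }

    public static void main(String[] args) {
        System.out.println(urlEncode("a b*c~"));          // a+b%2Ac~
        System.out.println(urlEncodeComponent("a b*c~")); // a%20b%2Ac~
    }
}
```

Unlike `java.net.URLEncoder`, this sketch encodes `*` and leaves `~` untouched, which matches the safe set listed above.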


[ESQL] Encoding behavior for the url_encode function #134087
