[ESQL] Encoding behavior for the url_encode function #134087

@mouhc1ne

Hey!

The url_encode and url_decode functions are already implemented, but we have a few questions about the expected encoding behavior in some cases, which is why they're both snapshot functions for now.

Under the hood, url_encode uses java.net.URLEncoder.encode(...), which is meant to encode URL form data (i.e. the application/x-www-form-urlencoded MIME type). Its behavior deviates slightly from the percent-encoding rules outlined in RFC 3986, section 2, but only for a few characters. For example, * stays the same instead of being encoded to %2A, and a space gets encoded to + instead of %20.
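To make the deviation concrete, here is a minimal sketch that calls java.net.URLEncoder directly. The class name UrlEncoderBehavior and the helper formEncode are made up for illustration and are not part of the ES|QL code; only URLEncoder.encode(String, Charset) is the real JDK call (Java 10+).

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not the ES|QL implementation.
public class UrlEncoderBehavior {

    // Thin wrapper around the JDK form encoder (application/x-www-form-urlencoded).
    static String formEncode(String input) {
        return URLEncoder.encode(input, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(formEncode("a b"));                     // a+b  (space becomes +, not %20)
        System.out.println(formEncode("a*b"));                     // a*b  (* is left as-is, not %2A)
        System.out.println(formEncode("https://www.example.com")); // https%3A%2F%2Fwww.example.com
    }
}
```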

Should url_encode strictly follow the RFC? Another thing to keep in mind is the URL decode processor, which is already in place and internally calls the same code as the url_decode function: URLDecoder.decode(...). Just mentioning that in case we want to keep all encoders/decoders in sync in terms of behavior.
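For reference, a small sketch of what java.net.URLDecoder does on the decode side: it accepts both + and %20 as a space, so it can read either encoding style. The class UrlDecoderBehavior and the helper formDecode are hypothetical names used only for this example.

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not the ES|QL or ingest processor code.
public class UrlDecoderBehavior {

    // Thin wrapper around the JDK form decoder.
    static String formDecode(String input) {
        return URLDecoder.decode(input, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(formDecode("a+b"));     // a b
        System.out.println(formDecode("a%20b"));   // a b
        System.out.println(formDecode("a%2Ab"));   // a*b
    }
}
```

One consequence worth noting: a literal + in the decoder's input is always read as a space, so a plus sign must arrive as %2B to survive a round trip.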

In addition, https://www.example.com becomes https%3A%2F%2Fwww.example.com, and that's RFC-compliant. But do we really want that?

Useful Links

Thanks for your input!! 🤠


Post-discussion update

We will do the following:

  • Add a URL_ENCODE_COMPONENT scalar function, which encodes spaces as %20, and change the existing URL_ENCODE so that it encodes spaces as +.
  • For both functions, all characters in the input are encoded except the RFC 3986-safe set, which consists of alphanumerics, ., -, _, and ~. These are left unchanged.
  • URL_DECODE isn't problematic, and performs exactly the same operation as the URL decode processor.
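The rules agreed above can be sketched roughly as follows. This is an illustration of the described behavior, not the actual ES|QL implementation: the class Rfc3986EncodeSketch and the methods urlEncode, urlEncodeComponent, and isSafe are hypothetical names, and the sketch assumes the input is encoded as UTF-8 bytes before percent-encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the post-discussion behavior, not the real ES|QL code.
public class Rfc3986EncodeSketch {
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // The safe set from the discussion: alphanumerics, '.', '-', '_', and '~'.
    static boolean isSafe(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z') || (b >= '0' && b <= '9')
            || b == '.' || b == '-' || b == '_' || b == '~';
    }

    // Percent-encodes every UTF-8 byte outside the safe set.
    // Spaces become '+' when spaceAsPlus is true, otherwise "%20".
    static String encode(String input, boolean spaceAsPlus) {
        StringBuilder out = new StringBuilder();
        for (byte raw : input.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            if (isSafe(b)) {
                out.append((char) b);
            } else if (b == ' ' && spaceAsPlus) {
                out.append('+');
            } else {
                out.append('%').append(HEX[b >> 4]).append(HEX[b & 0x0F]);
            }
        }
        return out.toString();
    }

    // Stand-in for URL_ENCODE: spaces as '+'.
    static String urlEncode(String input) {
        return encode(input, true);
    }

    // Stand-in for URL_ENCODE_COMPONENT: spaces as "%20".
    static String urlEncodeComponent(String input) {
        return encode(input, false);
    }

    public static void main(String[] args) {
        System.out.println(urlEncode("a b*c~"));          // a+b%2Ac~
        System.out.println(urlEncodeComponent("a b*c~")); // a%20b%2Ac~
    }
}
```

Under these rules, * is now encoded to %2A and ~ is left alone in both variants; the two functions differ only in how they treat the space character.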

Labels

:Analytics/ES|QL (AKA ESQL), Team:Analytics (Meta label for analytical engine team (ESQL/Aggs/Geo))
