Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default nginx pipelines should decode url.original field #19088

Closed
SlavikCA opened this issue Jun 10, 2020 · 10 comments · Fixed by #24699
Closed

Default nginx pipelines should decode url.original field #19088

SlavikCA opened this issue Jun 10, 2020 · 10 comments · Fixed by #24699
Labels
Filebeat Filebeat Team:Services (Deprecated) Label for the former Integrations-Services team

Comments

@SlavikCA
Copy link
Contributor

SlavikCA commented Jun 10, 2020

Describe the enhancement:

Describe a specific use case for the enhancement or feature:

Here is how url.original field currently looks in Kibana:
/A%20Beka%20G1%20Howe/029_AND_30/15%20reading%20elephants.mp4

Here is how it should look:
/A Beka G1 Howe/009/17 Reading Elephants.mp4

Here is the original record in the Nginx access log:

lessons.example.com 192.168.0.1 - - [09/Jun/2020:12:10:39 -0700] "GET /A%20Beka%20G1%20Howe/029_AND_30/15%20reading%20elephants.mp4 HTTP/1.1" 206 7648063 "http://lessons.example.com/A%20Beka%20G1%20Howe/029_AND_30/15%20reading%20elephants.mp4" "Mozilla/5.0 (Linux; Android 5.1.1; KFFOWI) AppleWebKit/537.36 (KHTML, like Gecko) Silk/81.2.16 like Chrome/81.0.4044.138 Safari/537.36"

So, that example fixes the issue of HTML-encoded character, such as spaces.

Here is one more example, where decoding is needed to non-english characters:

Here is how url.original field currently looks in Kibana:
/%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B0%D1%8F%20%D1%88%D0%BA%D0%BE%D0%BB%D0%B0%20-%20InternetUrok%201%D0%BA%D0%BB%D0%B0%D1%81%D1%81/

Here is how it should look:
/Русская школа - InternetUrok 1класс/

Here is the original record in the Nginx access log:

lessons.example.com 192.168.0.1 - - [09/Jun/2020:21:31:51 -0700] "GET /%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B0%D1%8F%20%D1%88%D0%BA%D0%BE%D0%BB%D0%B0%20-%20InternetUrok%201%D0%BA%D0%BB%D0%B0%D1%81%D1%81/ HTTP/1.1" 200 894 "http://lessons.example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"

Issue is the same for Nginx access and error pipelines (message field)

Screen Shot 2020-06-09 at 10 08 24 PM

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 10, 2020
@ChrsMark ChrsMark added the Team:Services (Deprecated) Label for the former Integrations-Services team label Jun 10, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@legoguy1000
Copy link
Contributor

@andrewkroh I'm working on a PR for this but trying to decide if the url components should be decoded or not. I guess it depends on if we're trying to show what the user entered into the browser/the link or what the browser/web server converted it to in order to remove special characters. Thoughts?

@andrewkroh
Copy link
Member

Based on my interpretation of ECS, I think url.original should remain unchanged from the original source. @elastic/ecs, wdyt?

The uri_parts processor should automatically URL decode the parts of the URI as described in https://docs.oracle.com/javase/7/docs/api/java/net/URI.html.

The getUserInfo, getPath, getQuery, getFragment, getAuthority, and getSchemeSpecificPart methods decode any escaped octets in their corresponding components. The strings returned by these methods may contain both other characters and illegal characters, and will not contain any escaped octets.

You'll see in its source that it doesn't use the getRaw* accessors.

So I would decode the same components as the uri_parts processor. I'd even see if you could use that processor, but I think it may require a more complete URI to work. So you may have to use urldecode and split it the parts with an alternative means. Or maybe reconstruct the url (assuming the necessary info is present) for url.full and then decode it and then run uri_parts.

@legoguy1000
Copy link
Contributor

looks like the raw java code u posted does URL decode it and works with incomplete URLs. So I feel that the processor should work. Going to continue to play with it.
image

@legoguy1000
Copy link
Contributor

legoguy1000 commented Mar 25, 2021

the uri_parts processor is working however it populates the uri.original with the raw un-decoded value so I've added a separate urldecode processor for url.original and http.request.referrer. I've update Apache and Nginx and am looking at the other modules that may not be fully decoding/parsing urls.

@legoguy1000
Copy link
Contributor

PR has been updated, please take a look to see if we think it meets the intent of the issue & ECS standard.

@legoguy1000
Copy link
Contributor

legoguy1000 commented Mar 27, 2021

I updated 15 Modules that i could find that have url.* data and have sample data to validate the changes. Marked the PR ready for review

@legoguy1000
Copy link
Contributor

@SlavikCA Please see the conversation in #24699. It was decided amongst the group and the ECS authors that per the spec, the url.original field should remain exactly as it was recorded by the system generating the log/event. However by adding the parts processor, the urls will be broken down into components and those will be url decoded.

@SlavikCA
Copy link
Contributor Author

@legoguy1000 That's good solution. Do you think we can expect it in 7.13?

@legoguy1000
Copy link
Contributor

I don't know what the cutoff date for 7.13 is but the ci pipeline should be done in a couple hours and based off my conversations with the elastic devs, I think it should be good to merge once it passes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment