-
Notifications
You must be signed in to change notification settings - Fork 418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new fields file.origin_referrer_url & file.origin_url #2348
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks generally pretty good, just some small things to fix up
- name: origin_referrer_url | ||
level: extended | ||
type: keyword | ||
ignore_above: 8192 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fairly large ignore_above value, do you expect that there will be data this size? Have you looked into the performance impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you expect that there will be data this size?
Yes. When defining the URL size, I did some research and came across a discussion from a past ECS thread about URL size, which I used as a reference. There was a discussion that the URL field should be at least 4096 bytes, and ideally 8192 bytes based on the actual URL log data, so I followed that suggestion. In addition, after discussing it within the team, we concluded that since the typical maximum URL size accepted by servers is 8KB, 8KB seemed like the most reasonable choice.
Have you looked into the performance impact?
However, I haven't look at the performance impact of setting ignore_above
as 8192. If performance testing is required, I would be happy to carry it out. If there is any documentation on how to do it, could you please share it with me?
schemas/file.yml
Outdated
level: extended | ||
type: keyword | ||
ignore_above: 8192 | ||
description: The url of the webpage that linked to the file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize URL here and in the below description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion! I have changed url to URL!
schemas/file.yml
Outdated
level: extended | ||
type: keyword | ||
ignore_above: 8192 | ||
description: The url where the file is hosted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered redirects? For example, when using CDNs, the initial URL can be redirected to various download URLs. Is that meaningful for your use case? Should the description specify if this is the initial request or a redirected URL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have implemented this field (origin_url
) to retrieve the actual URL from which the file was downloaded using the Mark of the Web (MoTW) information in Windows. (MoTW is metadata added to files downloaded from the internet in Windows. )
For example, if you download ejabberd_20.12-0_amd64.deb
from https://www.process-one.net/downloads/downloads-action.php?file=/20.12/ejabberd_20.12-0_amd64.deb, the download is redirected, and the file is actually retrieved from https://static.process-one.net/ejabberd/downloads/20.12/ejabberd_20.12-0_amd64.deb.
In MoTW, as shown in the attached image, the redirect URL is saved as the file's host URL.
Co-authored-by: Michael Wolf <michael.wolf@elastic.co>
@mjwolf |
level: extended | ||
type: keyword | ||
ignore_above: 8192 | ||
description: The URL of the webpage that linked to the file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand what does it mean? Could you elaborate it a bit, give some examples maybe? Is it url, where file was located before downloading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @trisch-me,
Thank you for the comments/feedbacks!🙇🏻♀️
Following is the actual example of the file.origin_referrer_url
and file.origin_url
.
When you download an image file (for example, news30_img3.png
) from the following web page,
the file.origin_url
will be the url for the downloaded file, and the file.origin_referrer_url
will be the url which hosted the file (http://girls.seccon.jp/news30.html).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for explanation, can you add more information into description/examples of the actual fields?
Hey @AsuNa-jp I think it would be great to follow RFC process for this change, because it's a new addition and it should usually go through the process. Currently looking into fields I'm not sure we can add those to the file namespace because from the semantic logic they are of in Otel we have this differentiation between schema, i.e. https://github.com/open-telemetry/semantic-conventions/blob/main/model/registry/url.yaml and usage of schema (or how we call it - usecases), where you have fields from different namespaces together in 1 place, describing the usecase (https://github.com/open-telemetry/semantic-conventions/blob/main/model/trace/database.yaml#L263) Translating this into your PR and new fields, I would argue that they should be a part of the file namespace (though it's possible too) but rather be a part of that mix for specific usecase. Also another possibility to have them not as strings but as a reference to a In current Otel version one can't rename referenced attributes, it means url.full is always |
Hi @trisch-me! I’ve finally come to understand the differences between ECS and OpenTelemetry, so I’d like to answer your previous question. Whether or not it should be placed in the file namespaceECSBased on the definition of the file fields in ECS, I believe
This is because this For example, when you download an image file (
OpenTelemetryAs for OpenTelemetry ( If there is no need to unify the naming between ECS and OpenTelemetry, then I think it's fine to handle it as a mixed use case(currently, |
@magermark |
hey @AsuNa-jp great explanation, I agree with you that in that case we might add those fields into Do you foresee there any private information in the url path? I don't think we are in control of it, but we might have it, and we might have to or want to remove it. Should we add this to the description? We aim to have the same names for both otel and ecs fields, otherwise it will be a breaking change for us (small one but better to avoid). I'm not opposing this change in the ECS, but I would like to encourage you to create a similar PR in Otel (I can give you some hints and can show how to) to get initial response on the field names. |
@trisch-me Thank you for your reply.
Yes, if a password is included in the URL of a web page, it can be exposed as shown below.
If the field needs to be the same between ECS and Otel, how about first adding |
I think we can start with just adding fields to the otel as we want to have them in ECS, I don't see any problems in starting this discussion already. Can you do it? Should I create a PR? And we should add a note about possible sensitive data inside url |
Checklist
make test
?make
and committed those changes?PR summary
This PR adds
file.origin_referrer_url
andfile.origin_url
To reviewers
Size of the fields
The two fields added in this PR are intended to store URL. Therefore, the default field size of 1024 bytes is insufficient. As a result, we want to set ignore_above to 8k(8192) bytes, if possible.
Default Field
This field will not necessarily be included in all file events.
Therefore, I set it as
default_field: false
, but please let me know if this is incorrect.