Modify Fingerprint to Hashed Values #29617
Comments
I think the overall idea makes sense, but this detail is not as straightforward as it would seem:
There is a proposal here which would switch us over to using hashes to track files. The challenge, in short, is that we need to be able to match files even when content is appended to them. (e.g. a file has "ABC\n" the first time we see it, but "ABC\nDEF\n" the next time, and we need to recognize that "ABC\n" was already consumed.) I think if we can make that proposed change, we would naturally adopt this proposal as well. However, it remains to be seen whether that approach is at least similarly performant.
Agreed @djaglowski, I actually ran across the issue with matching files with data being appended to them when first exploring this request. I am working on a solution and should have a proposal in the next few days. Hopefully it will be easier to discuss implementation details on a draft PR. Thanks for your input!
Adding If the draft PR implementation can move forward after initial review we can most likely assign the issue.
Hey @djaglowski, I just pushed a draft PR to illustrate the idea we had for this problem. It's a draft because I wanted to get your thoughts on it first. Thanks!
@Danielzolty, do you plan to pick up the PR again?
Having worked through an implementation in #31317 which does not appear to improve performance, I believe it's time to close this issue. If someone else wants to demonstrate an improvement based on hashing fingerprints, please let me know and I will reopen the issue.
Component(s)
pkg/stanza, receiver/filelog
Is your feature request related to a problem? Please describe.
The File Log Receiver currently identifies files through fingerprints, which it stores via the Persister interface as markers for files it has already started processing. When the file storage extension is in use, these fingerprints are written to disk. Each fingerprint consists of the first n bytes of a file, encoded in Base64. The problem is that Base64 is trivially decoded, so any third party with access to the storage can read the file content captured in those fingerprints.
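To make the concern concrete, here is a minimal Go sketch of the round trip. It only assumes that the stored value is the standard Base64 encoding of the file's first bytes; `encodeFingerprint` and `decodeFingerprint` are hypothetical names, not functions from pkg/stanza, and the sample string is illustrative.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeFingerprint (hypothetical) mimics storing the first bytes of a
// file as a Base64 string.
func encodeFingerprint(firstBytes []byte) string {
	return base64.StdEncoding.EncodeToString(firstBytes)
}

// decodeFingerprint (hypothetical) shows what anyone with read access
// to the storage could do: decode the string back to the file content.
func decodeFingerprint(stored string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(stored)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	stored := encodeFingerprint([]byte("data in plain text\n"))
	recovered, err := decodeFingerprint(stored)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// The original bytes come back without any key or secret.
	fmt.Printf("%q\n", recovered)
}
```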
Describe the solution you'd like
I suggest storing a hash of each fingerprint instead. A hash cannot be reversed to recover the original bytes, yet it still serves as a unique marker. Comparing files wouldn't be a problem, because the same hash function would be used every time, so two fingerprints with the same hash value would indicate the same file.
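A rough sketch of the suggested approach, assuming SHA-256 with hex encoding as the stored form (the proposal does not name a specific hash, and `hashFingerprint` is a hypothetical helper):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashFingerprint (hypothetical) returns a hex-encoded SHA-256 digest
// of the first bytes of a file, to be stored instead of the raw bytes.
func hashFingerprint(firstBytes []byte) string {
	sum := sha256.Sum256(firstBytes)
	return hex.EncodeToString(sum[:])
}

func main() {
	a := hashFingerprint([]byte("data in plain text\n"))
	b := hashFingerprint([]byte("data in plain text\n"))
	c := hashFingerprint([]byte("other content\n"))

	fmt.Println(a == b) // identical leading bytes give identical digests
	fmt.Println(a == c) // different leading bytes give different digests
	fmt.Println(len(a)) // fixed-size stored value regardless of input size
}
```

One caveat with this design: a plain digest hides the content but does not stop someone from guessing short or predictable content and comparing hashes, so it reduces exposure rather than eliminating it.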
Describe alternatives you've considered
No response
Additional context
Configuration used:
First, we create a fake log file containing sensitive data:
echo "data in plain text" | sudo tee /test2.log
This is the content in the storage extension.
The content is written in Base64, which is essentially plain text: