Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify Fingerprint to Hashed Values #29617

Closed
Danielzolty opened this issue Dec 1, 2023 · 7 comments
Closed

Modify Fingerprint to Hashed Values #29617

Danielzolty opened this issue Dec 1, 2023 · 7 comments

Comments

@Danielzolty
Copy link
Contributor

Component(s)

pkg/stanza, receiver/filelog

Is your feature request related to a problem? Please describe.

The current mechanism in which the File Log Receiver uses the Persister interface to identify files is through the use of Fingerprints. These fingerprints are stored as markers for the files that have already started to be processed. Fingerprints are stored on disk if the file storage extension is being used. The way in which they are markers is by storing the first n bytes of a file and encoding them in base 64 format. The problem is that base 64 format can easily be decoded by a third party enabling them to read what those fingerprints contain.

Describe the solution you'd like

I suggest storing the fingerprints using a hashing mechanism. This would ensure that the fingerprints are not deserializable, yet maintain their unique markers. Comparing files wouldn’t be a problem because the same hashing mechanism would be used each time, so fingerprints with the same hash value would indicate they share a unique file.

Describe alternatives you've considered

No response

Additional context

Configuration used:

extensions:
  file_storage/checkpoint:
    directory: /tmp/storage
    timeout: 1s
    compaction:
      on_start: true
      directory: /tmp/storage_compaction
      max_transaction_size: 65_536

receivers:
  filelog:
    storage: "file_storage/checkpoint"
    include: [/*.log]
    encoding: "ascii"
    
exporters:
  logging:    
  
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [logging]
  extensions: [file_storage/checkpoint]

First we create a fake log file containing sensitive data

echo "data in plain text" | sudo tee /test2.log

This is the content in the storage extension

sh-4.2$ sudo cat /tmp/storage/receiver_filelog_
��
  ������������ݨ�9��
                   �����������Ҡ�Dzz�Hfile_input.knownFiles2
{"Fingerprint":{"FirstBytes":"MTY5MjcyMjc1Ny43Mjk4NDk2IDFzREx0bWhNVHRNcnk3WlhxMWZwYk55NHY5QlJGMHdqT1lsNmVzVDRVR0taM0tWTWZxMmxyanE2QmtzcTF0b3ZrOVJFN2dNTFU1ZUZzSVRXUnVINW92NjlDMUN1aVhiU2QxWWV0dk1BR3FDc2RhWU9CR2I1enRHNEhkV1FCWVppV3VjNmtHQjAxNmhJUWNlMXgzQ29DRUZNZnVQMzlwNE95Nng3TDQzblluVkV3ajZFQkVMZGpKWkswVWRiSE1kYTdvcTJJNGpLNjJiOGFoZVB3akxhMWI2TXpCc2RpcHE0QzM1cFRHazEwUFVpbXNBZ09YUWxWSFBZNjBaRUVqQXVjaVVrTWZhUFZ5emdIVWJ2ZHU5aUJqbWpYb0YwdnkxOWFhUVZnblE0bUVoVDdtN3hQazN5Y21yRko2dlZsTUIzdHlzaWs3MG83bmlmVVVvVkFEdmRBd3U2Z3JsVTNYeGh0NHdGcWJCYjZxekU0cENoSXVqbDQ2NlRLbUJlWEE3aEZUSExiak55c210clc5dWVKZ2kzaTNOd2lSSnI1S0RQQUc5b0dhd1M0bVhxRG40WFFMVjJsODNaQUM2V0lDZENBUndUazhZd2k5YUt3UmRSS0R3QWVZcGZ0VENSWlp1SmJ6REdjb3JMckM5Zks3Qkh5RU12T3BZNkR6N0pkZ1dOUXNUT0w2SWs1Q2RyRWpjWFFRYXRTdGM4TUlIT0lBSGJOWkFYZmM4cm9NOVJVQlJCbzJFcW5pYmh5c2VDSTVnREpPOUlYZVFmR0N1ck41bUpRS0lZcVZ0b09BNFJoN2tJelZnUjlyTkhINUJ1N0lmNFp3aEFTdW5LaVA1ZzFaekd3NzZ2b0htdEhtbjExYXA0ZWtuQ1cyaVVEdjRMQTlENGN3MEN4dTdxUzlZUWFZYWpSMXpLN2p1MWVrTWFHSWNqWExyWkYwRWpmT3BjYndzNnRVa201UjNFYmh4aXk2T0JRZXFFSkVFRVlLWVJRTGR4TG5NM3JsaFk0dmd5RlBLRWdUV2NzUWMzR0RuZUpwMzcweExHTjZ5ZkUza3BtM0t2OTZDTEZvNzdKUWZ6c21kaGdzUUVxUFFFY1JFMmpCYW1QcDBNT2F0bWw4Szh2SnkwM1M5aUlZak1KakNHNWN4MjkwUHZjY3pOU1RQemYxdnBkcVNDdW14N1VlRjJndzVVRWRhS25XT0tPUU1DV1NIWThVWExoMkZ2M2dNQXpCV3VlMXBoVUhoZ1h3N2pBSFpwV284Z2NkbkV5ZHgzMzJNblhwdFlNcGxoSWdIekU4Ums0NmJxNQ=="},"Offset":1563120000,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}
{"Fingerprint":{"FirstBytes":"ZGF0YSBpbiBwbGFpbiB0ZXh0="},"Offset":18,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}
Hfile_input.knownFiles2
{"Fingerprint":{"FirstBytes":"MTY5MjcyMjc1Ny43Mjk4NDk2IDFzREx0bWhNVHRNcnk3WlhxMWZwYk55NHY5QlJGMHdqT1lsNmVzVDRVR0taM0tWTWZxMmxyanE2QmtzcTF0b3ZrOVJFN2dNTFU1ZUZzSVRXUnVINW92NjlDMUN1aVhiU2QxWWV0dk1BR3FDc2RhWU9CR2I1enRHNEhkV1FCWVppV3VjNmtHQjAxNmhJUWNlMXgzQ29DRUZNZnVQMzlwNE95Nng3TDQzblluVkV3ajZFQkVMZGpKWkswVWRiSE1kYTdvcTJJNGpLNjJiOGFoZVB3akxhMWI2TXpCc2RpcHE0QzM1cFRHazEwUFVpbXNBZ09YUWxWSFBZNjBaRUVqQXVjaVVrTWZhUFZ5emdIVWJ2ZHU5aUJqbWpYb0YwdnkxOWFhUVZnblE0bUVoVDdtN3hQazN5Y21yRko2dlZsTUIzdHlzaWs3MG83bmlmVVVvVkFEdmRBd3U2Z3JsVTNYeGh0NHdGcWJCYjZxekU0cENoSXVqbDQ2NlRLbUJlWEE3aEZUSExiak55c210clc5dWVKZ2kzaTNOd2lSSnI1S0RQQUc5b0dhd1M0bVhxRG40WFFMVjJsODNaQUM2V0lDZENBUndUazhZd2k5YUt3UmRSS0R3QWVZcGZ0VENSWlp1SmJ6REdjb3JMckM5Zks3Qkh5RU12T3BZNkR6N0pkZ1dOUXNUT0w2SWs1Q2RyRWpjWFFRYXRTdGM4TUlIT0lBSGJOWkFYZmM4cm9NOVJVQlJCbzJFcW5pYmh5c2VDSTVnREpPOUlYZVFmR0N1ck41bUpRS0lZcVZ0b09BNFJoN2tJelZnUjlyTkhINUJ1N0lmNFp3aEFTdW5LaVA1ZzFaekd3NzZ2b0htdEhtbjExYXA0ZWtuQ1cyaVVEdjRMQTlENGN3MEN4dTdxUzlZUWFZYWpSMXpLN2p1MWVrTWFHSWNqWExyWkYwRWpmT3BjYndzNnRVa201UjNFYmh4aXk2T0JRZXFFSkVFRVlLWVJRTGR4TG5NM3JsaFk0dmd5RlBLRWdUV2NzUWMzR0RuZUpwMzcweExHTjZ5ZkUza3BtM0t2OTZDTEZvNzdKUWZ6c21kaGdzUUVxUFFFY1JFMmpCYW1QcDBNT2F0bWw4Szh2SnkwM1M5aUlZak1KakNHNWN4MjkwUHZjY3pOU1RQemYxdnBkcVNDdW14N1VlRjJndzVVRWRhS25XT0tPUU1DV1NIWThVWExoMkZ2M2dNQXpCV3VlMXBoVUhoZ1h3N2pBSFpwV284Z2NkbkV5ZHgzMzJNblhwdFlNcGxoSWdIekU4Ums0NmJxNQ=="},"Offset":1563120000,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}
{"Fingerprint":{"FirstBytes":"ZGF0YSBpbiBwbGFpbiB0ZXh0="},"Offset":18,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}

The content is written in base 64 format, which is essentially plain text:

echo "c2Vuc2l0aXZlIGRhdGEgaW4gcGxhaW4gdGV4dAo=" | base64 -d
data in plain text
@Danielzolty Danielzolty added enhancement New feature or request needs triage New item requiring triage labels Dec 1, 2023
Copy link
Contributor

github-actions bot commented Dec 1, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@djaglowski
Copy link
Member

I think the overall idea makes sense, but this detail is not as straightforward as it would seem:

Comparing files wouldn’t be a problem because the same hashing mechanism would be used each time, so fingerprints with the same hash value would indicate they share a unique file.

There is a proposal here which would switch us over to using hashes to track files. The challenge, in short, is that we need to be able to match files even when content is appended to them. (e.g. a file has "ABC\n" the first time we see it, but "ABC\nDEF\n" the next time, and we need to recognize that "ABC\n" was already consumed.)

I think if we can make that proposed change, we would naturally adopt this proposal as well. However, it remains to be seen whether that approach is at least similarly performant.

@Danielzolty
Copy link
Contributor Author

Agreed @djaglowski,

I actually ran across the issue with matching files with data being appended to them when first exploring this request. I am working on a solution and should have a proposal in the next few days. Hopefully it will be easier to discuss implementation details on a draft PR.

Thanks for your input!

@bryan-aguilar bryan-aguilar added priority:p2 Medium and removed needs triage New item requiring triage labels Dec 2, 2023
@bryan-aguilar
Copy link
Contributor

bryan-aguilar commented Dec 2, 2023

Adding p2 priority label here because I don't think it extends into p1 and IMO is higher than low priority. Between #29273 and this issue there seems to be a desire/need to use hashes instead of raw data for fingerprints. @Danielzolty is moving forward with a proposal.

If the draft PR implementation can move forward after initial review we can most likely assign the issue.

@Danielzolty
Copy link
Contributor Author

Hey @djaglowski,

Just pushed a draft PR to illustrate the idea we had for this problem. It's a draft because I just wanted to get your thoughts on the PR. Thanks!

@djaglowski
Copy link
Member

@Danielzolty, do you plan to pick up the PR again?

@djaglowski
Copy link
Member

Having worked through an implementation in #31317 which does not appear to improve performance, I believe it's time to close this issue. If someone else wants to demonstrate an improvement based on hashing fingerprints, please let me know and I will reopen the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants