Modify Fingerprint to Hashed Values #29617

Danielzolty · 2023-12-01T21:42:34Z

Component(s)

pkg/stanza, receiver/filelog

Is your feature request related to a problem? Please describe.

The current mechanism in which the File Log Receiver uses the Persister interface to identify files is through the use of Fingerprints. These fingerprints are stored as markers for the files that have already started to be processed. Fingerprints are stored on disk if the file storage extension is being used. The way in which they are markers is by storing the first n bytes of a file and encoding them in base 64 format. The problem is that base 64 format can easily be decoded by a third party enabling them to read what those fingerprints contain.

Describe the solution you'd like

I suggest storing the fingerprints using a hashing mechanism. This would ensure that the fingerprints are not deserializable, yet maintain their unique markers. Comparing files wouldn’t be a problem because the same hashing mechanism would be used each time, so fingerprints with the same hash value would indicate they share a unique file.

Describe alternatives you've considered

No response

Additional context

Configuration used:

extensions:
  file_storage/checkpoint:
    directory: /tmp/storage
    timeout: 1s
    compaction:
      on_start: true
      directory: /tmp/storage_compaction
      max_transaction_size: 65_536

receivers:
  filelog:
    storage: "file_storage/checkpoint"
    include: [/*.log]
    encoding: "ascii"
    
exporters:
  logging:    
  
service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [logging]
  extensions: [file_storage/checkpoint]

First we create a fake log file containing sensitive data

echo "data in plain text" | sudo tee /test2.log

This is the content in the storage extension

sh-4.2$ sudo cat /tmp/storage/receiver_filelog_
��
  ������������ݨ�9��
                   �����������Ҡ�ǲz�Hfile_input.knownFiles2
{"Fingerprint":{"FirstBytes":"MTY5MjcyMjc1Ny43Mjk4NDk2IDFzREx0bWhNVHRNcnk3WlhxMWZwYk55NHY5QlJGMHdqT1lsNmVzVDRVR0taM0tWTWZxMmxyanE2QmtzcTF0b3ZrOVJFN2dNTFU1ZUZzSVRXUnVINW92NjlDMUN1aVhiU2QxWWV0dk1BR3FDc2RhWU9CR2I1enRHNEhkV1FCWVppV3VjNmtHQjAxNmhJUWNlMXgzQ29DRUZNZnVQMzlwNE95Nng3TDQzblluVkV3ajZFQkVMZGpKWkswVWRiSE1kYTdvcTJJNGpLNjJiOGFoZVB3akxhMWI2TXpCc2RpcHE0QzM1cFRHazEwUFVpbXNBZ09YUWxWSFBZNjBaRUVqQXVjaVVrTWZhUFZ5emdIVWJ2ZHU5aUJqbWpYb0YwdnkxOWFhUVZnblE0bUVoVDdtN3hQazN5Y21yRko2dlZsTUIzdHlzaWs3MG83bmlmVVVvVkFEdmRBd3U2Z3JsVTNYeGh0NHdGcWJCYjZxekU0cENoSXVqbDQ2NlRLbUJlWEE3aEZUSExiak55c210clc5dWVKZ2kzaTNOd2lSSnI1S0RQQUc5b0dhd1M0bVhxRG40WFFMVjJsODNaQUM2V0lDZENBUndUazhZd2k5YUt3UmRSS0R3QWVZcGZ0VENSWlp1SmJ6REdjb3JMckM5Zks3Qkh5RU12T3BZNkR6N0pkZ1dOUXNUT0w2SWs1Q2RyRWpjWFFRYXRTdGM4TUlIT0lBSGJOWkFYZmM4cm9NOVJVQlJCbzJFcW5pYmh5c2VDSTVnREpPOUlYZVFmR0N1ck41bUpRS0lZcVZ0b09BNFJoN2tJelZnUjlyTkhINUJ1N0lmNFp3aEFTdW5LaVA1ZzFaekd3NzZ2b0htdEhtbjExYXA0ZWtuQ1cyaVVEdjRMQTlENGN3MEN4dTdxUzlZUWFZYWpSMXpLN2p1MWVrTWFHSWNqWExyWkYwRWpmT3BjYndzNnRVa201UjNFYmh4aXk2T0JRZXFFSkVFRVlLWVJRTGR4TG5NM3JsaFk0dmd5RlBLRWdUV2NzUWMzR0RuZUpwMzcweExHTjZ5ZkUza3BtM0t2OTZDTEZvNzdKUWZ6c21kaGdzUUVxUFFFY1JFMmpCYW1QcDBNT2F0bWw4Szh2SnkwM1M5aUlZak1KakNHNWN4MjkwUHZjY3pOU1RQemYxdnBkcVNDdW14N1VlRjJndzVVRWRhS25XT0tPUU1DV1NIWThVWExoMkZ2M2dNQXpCV3VlMXBoVUhoZ1h3N2pBSFpwV284Z2NkbkV5ZHgzMzJNblhwdFlNcGxoSWdIekU4Ums0NmJxNQ=="},"Offset":1563120000,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}
{"Fingerprint":{"FirstBytes":"ZGF0YSBpbiBwbGFpbiB0ZXh0="},"Offset":18,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}
Hfile_input.knownFiles2
{"Fingerprint":{"FirstBytes":"MTY5MjcyMjc1Ny43Mjk4NDk2IDFzREx0bWhNVHRNcnk3WlhxMWZwYk55NHY5QlJGMHdqT1lsNmVzVDRVR0taM0tWTWZxMmxyanE2QmtzcTF0b3ZrOVJFN2dNTFU1ZUZzSVRXUnVINW92NjlDMUN1aVhiU2QxWWV0dk1BR3FDc2RhWU9CR2I1enRHNEhkV1FCWVppV3VjNmtHQjAxNmhJUWNlMXgzQ29DRUZNZnVQMzlwNE95Nng3TDQzblluVkV3ajZFQkVMZGpKWkswVWRiSE1kYTdvcTJJNGpLNjJiOGFoZVB3akxhMWI2TXpCc2RpcHE0QzM1cFRHazEwUFVpbXNBZ09YUWxWSFBZNjBaRUVqQXVjaVVrTWZhUFZ5emdIVWJ2ZHU5aUJqbWpYb0YwdnkxOWFhUVZnblE0bUVoVDdtN3hQazN5Y21yRko2dlZsTUIzdHlzaWs3MG83bmlmVVVvVkFEdmRBd3U2Z3JsVTNYeGh0NHdGcWJCYjZxekU0cENoSXVqbDQ2NlRLbUJlWEE3aEZUSExiak55c210clc5dWVKZ2kzaTNOd2lSSnI1S0RQQUc5b0dhd1M0bVhxRG40WFFMVjJsODNaQUM2V0lDZENBUndUazhZd2k5YUt3UmRSS0R3QWVZcGZ0VENSWlp1SmJ6REdjb3JMckM5Zks3Qkh5RU12T3BZNkR6N0pkZ1dOUXNUT0w2SWs1Q2RyRWpjWFFRYXRTdGM4TUlIT0lBSGJOWkFYZmM4cm9NOVJVQlJCbzJFcW5pYmh5c2VDSTVnREpPOUlYZVFmR0N1ck41bUpRS0lZcVZ0b09BNFJoN2tJelZnUjlyTkhINUJ1N0lmNFp3aEFTdW5LaVA1ZzFaekd3NzZ2b0htdEhtbjExYXA0ZWtuQ1cyaVVEdjRMQTlENGN3MEN4dTdxUzlZUWFZYWpSMXpLN2p1MWVrTWFHSWNqWExyWkYwRWpmT3BjYndzNnRVa201UjNFYmh4aXk2T0JRZXFFSkVFRVlLWVJRTGR4TG5NM3JsaFk0dmd5RlBLRWdUV2NzUWMzR0RuZUpwMzcweExHTjZ5ZkUza3BtM0t2OTZDTEZvNzdKUWZ6c21kaGdzUUVxUFFFY1JFMmpCYW1QcDBNT2F0bWw4Szh2SnkwM1M5aUlZak1KakNHNWN4MjkwUHZjY3pOU1RQemYxdnBkcVNDdW14N1VlRjJndzVVRWRhS25XT0tPUU1DV1NIWThVWExoMkZ2M2dNQXpCV3VlMXBoVUhoZ1h3N2pBSFpwV284Z2NkbkV5ZHgzMzJNblhwdFlNcGxoSWdIekU4Ums0NmJxNQ=="},"Offset":1563120000,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}
{"Fingerprint":{"FirstBytes":"ZGF0YSBpbiBwbGFpbiB0ZXh0="},"Offset":18,"FileAttributes":{"HeaderAttributes":{}},"HeaderFinalized":false}

The content is written in base 64 format, which is essentially plain text:

echo "c2Vuc2l0aXZlIGRhdGEgaW4gcGxhaW4gdGV4dAo=" | base64 -d
data in plain text

The text was updated successfully, but these errors were encountered:

github-actions · 2023-12-01T21:42:52Z

Pinging code owners:

pkg/stanza: @djaglowski
receiver/filelog: @djaglowski

See Adding Labels via Comments if you do not have permissions to add labels yourself.

djaglowski · 2023-12-01T21:51:14Z

I think the overall idea makes sense, but this detail is not as straightforward as it would seem:

Comparing files wouldn’t be a problem because the same hashing mechanism would be used each time, so fingerprints with the same hash value would indicate they share a unique file.

There is a proposal here which would switch us over to using hashes to track files. The challenge, in short, is that we need to be able to match files even when content is appended to them. (e.g. a file has "ABC\n" the first time we see it, but "ABC\nDEF\n" the next time, and we need to recognize that "ABC\n" was already consumed.)

I think if we can make that proposed change, we would naturally adopt this proposal as well. However, it remains to be seen whether that approach is at least similarly performant.

Danielzolty · 2023-12-01T22:26:32Z

Agreed @djaglowski,

I actually ran across the issue with matching files with data being appended to them when first exploring this request. I am working on a solution and should have a proposal in the next few days. Hopefully it will be easier to discuss implementation details on a draft PR.

Thanks for your input!

bryan-aguilar · 2023-12-02T01:48:45Z

Adding p2 priority label here because I don't think it extends into p1 and IMO is higher than low priority. Between #29273 and this issue there seems to be a desire/need to use hashes instead of raw data for fingerprints. @Danielzolty is moving forward with a proposal.

If the draft PR implementation can move forward after initial review we can most likely assign the issue.

Danielzolty · 2023-12-07T17:50:03Z

Hey @djaglowski,

Just pushed a draft PR to illustrate the idea we had for this problem. It's a draft because I just wanted to get your thoughts on the PR. Thanks!

djaglowski · 2024-02-02T18:03:28Z

@Danielzolty, do you plan to pick up the PR again?

djaglowski · 2024-03-29T13:26:29Z

Having worked through an implementation in #31317 which does not appear to improve performance, I believe it's time to close this issue. If someone else wants to demonstrate an improvement based on hashing fingerprints, please let me know and I will reopen the issue.

Danielzolty added enhancement New feature or request needs triage New item requiring triage labels Dec 1, 2023

github-actions bot added pkg/stanza receiver/filelog labels Dec 1, 2023

bryan-aguilar added priority:p2 Medium and removed needs triage New item requiring triage labels Dec 2, 2023

github-actions bot mentioned this issue Dec 5, 2023

Weekly Report: 2023-11-28 - 2023-12-05 #29650

Closed

Danielzolty mentioned this issue Dec 7, 2023

[Draft] Fingerprint Hashing Solution #29691

Closed

djaglowski mentioned this issue Feb 19, 2024

[pkg/stanza] Hash fingerprints #31317

Closed

djaglowski closed this as completed Mar 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Modify Fingerprint to Hashed Values #29617

Modify Fingerprint to Hashed Values #29617

Danielzolty commented Dec 1, 2023

github-actions bot commented Dec 1, 2023

djaglowski commented Dec 1, 2023

Danielzolty commented Dec 1, 2023

bryan-aguilar commented Dec 2, 2023 •

edited

Loading

Danielzolty commented Dec 7, 2023

djaglowski commented Feb 2, 2024

djaglowski commented Mar 29, 2024

Modify Fingerprint to Hashed Values #29617

Modify Fingerprint to Hashed Values #29617

Comments

Danielzolty commented Dec 1, 2023

Component(s)

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context

github-actions bot commented Dec 1, 2023

djaglowski commented Dec 1, 2023

Danielzolty commented Dec 1, 2023

bryan-aguilar commented Dec 2, 2023 • edited Loading

Danielzolty commented Dec 7, 2023

djaglowski commented Feb 2, 2024

djaglowski commented Mar 29, 2024

bryan-aguilar commented Dec 2, 2023 •

edited

Loading