Fingerprint Processor Unexpected Results #98339

Open
neu5ron opened this issue Aug 10, 2023 · 7 comments
Labels: >bug, :Data Management/Ingest Node, Team:Data Management, team-discuss

neu5ron commented Aug 10, 2023

Elasticsearch Version

8.9.0, tested also on 8.5 and 8.6

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

The fingerprint processor returns a digest that does not match the plain hash of the input value for the selected method. For example, using the method MD5 and the value a:

Expected:
hex: 0cc175b9c0f1b6a831c399e269772661
base64: DMF1ucDxtqgxw5niaXcmYQ==
Fingerprint Processor:
hex: 7687355dbc955b0074758acb4d5f9a
base64: dg91NXbylVsAdHWKy01fpg==
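
For reference, the expected values above are what a plain MD5 over the single UTF-8 byte of a produces. The Python sketch below (standard library only; the variable names are illustrative) recomputes them and also decodes the Base64 string returned by the processor, so both digests can be compared in the same hex encoding. An MD5 digest is 16 bytes, so its hex form should be 32 characters long.

```python
import base64
import hashlib

# Plain MD5 of the UTF-8 bytes of "a", as most hashing tools compute it.
digest = hashlib.md5(b"a").digest()
expected_hex = digest.hex()
expected_b64 = base64.b64encode(digest).decode("ascii")

# Base64 value returned by the fingerprint processor in the simulate response.
processor_b64 = "dg91NXbylVsAdHWKy01fpg=="
processor_hex = base64.b64decode(processor_b64).hex()

print("expected hex:   ", expected_hex)
print("expected base64:", expected_b64)
print("processor hex:  ", processor_hex)
```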

Steps to Reproduce

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "fingerprint": {
          "fields": ["a"],
          "method": "MD5",
          "target_field": "test"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
          "a": "a"
      }
    }
  ]
}

Logs (if relevant)

{
  "docs": [
    {
      "doc": {
        "_index": "_index",
        "_id": "_id",
        "_version": "-3",
        "_source": {
          "test": "dg91NXbylVsAdHWKy01fpg==",
          "a": "a"
        },
        "_ingest": {
          "timestamp": "2023-08-10T06:19:10.716599155Z"
        }
      }
    }
  ]
}

neu5ron added the >bug and needs:triage labels on Aug 10, 2023
@dreamquster
Contributor

The response from ES is correct. It does not simply calculate the MD5 of 'a'; it concatenates the values of all 'fields' using a delimiter byte of 0. So the result is more like this function: Base64(MD5(join(0, values of fields)))
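
A rough Python sketch of the idea described above, to show why a delimiter byte changes the digest. It assumes that a single 0 byte is written before every value and that strings are hashed as UTF-8. Both are assumptions made for illustration and are not confirmed details of the processor, so the output is not expected to match the processor's exact result. `fingerprint_sketch` is a hypothetical helper name.

```python
import base64
import hashlib

DELIMITER = b"\x00"  # assumed delimiter byte; not checked against the ES source


def fingerprint_sketch(values):
    """Base64(MD5(...)) over the values, each preceded by the delimiter (assumed layout)."""
    hasher = hashlib.md5()
    for value in values:
        hasher.update(DELIMITER)
        hasher.update(str(value).encode("utf-8"))  # assumed UTF-8 encoding
    return base64.b64encode(hasher.digest()).decode("ascii")


# Plain MD5 of "a" versus the delimited variant over the same single value.
plain = base64.b64encode(hashlib.md5(b"a").digest()).decode("ascii")
delimited = fingerprint_sketch(["a"])

print(plain)
print(delimited)
print(plain != delimited)
```

Under these assumptions, even a single field yields a digest that differs from the plain MD5, because the hashed input is no longer just the bytes of the value.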

@neu5ron
Author

neu5ron commented Aug 11, 2023

OK, is there a possibility of adding an option to change this? I have years of data with fingerprints/hashes, and after moving everything to ingest pipelines the fingerprints do not match those produced by Logstash or the previous ETL provided by Elastic.

not-napoleon added the :Data Management/Ingest Node label and removed the needs:triage label on Aug 16, 2023
elasticsearchmachine added the Team:Data Management label on Aug 16, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@neu5ron
Author

neu5ron commented Sep 13, 2023

It would be great to have consistent hashes over the years. Thank you!

@neu5ron
Author

neu5ron commented Sep 20, 2023

Or at least make it not add a null byte when hashing a single field.

@g0tr3wt

g0tr3wt commented Oct 4, 2023

Bump 🥶

@neu5ron
Author

neu5ron commented Jun 5, 2024

Hi, I wanted to follow up on this issue. I know this may be the expected result, since it was built for the Elasticsearch fingerprint processor. However, this is not how it works in Logstash or Filebeat. It also makes things difficult in a field like cybersecurity, where hashes need to be shared across communities and environments running all sorts of technology. If those of us using Elastic share inconsistent hashes with the community, it puts us in a difficult position.
I continue to see the fingerprint processor used (as recently as 2 days ago) in Elastic ingest pipelines for ECS, and I know this issue will only continue to grow.

Personally, I have solved this: I found an undocumented hashing technique that uses Painless outside of a processor. However, I don't want the majority of the community using Elastic to remain cut off from everyone else by sharing incorrect intel.


6 participants