Beats fingerprint processor should generate same fingerprints as Logstash fingerprint filter #18542

Closed
ycombinator opened this issue May 14, 2020 · 8 comments
Labels
bug libbeat :Processors Stalled Team:Services (Deprecated) Label for the former Integrations-Services team

Comments

@ycombinator
Contributor

According to https://discuss.elastic.co/t/integrity-issue-between-winlogbeat-and-logstash-fingerprint/232654, it appears the Beats fingerprint processor does not generate the same fingerprints (for the same fields, with the same hashing algorithm) as the Logstash fingerprint filter.

Looking at the two implementations, specifically the concatenation code (Beats | Logstash), it looks like the intent was for the two fingerprints to be the same.

@ycombinator ycombinator added bug libbeat :Processors Team:Services (Deprecated) Label for the former Integrations-Services team labels May 14, 2020
@elasticmachine
Collaborator

Pinging @elastic/integrations-services (Team:Services)

@urso

urso commented May 25, 2020

I checked the implementations for potential bugs and did some testing. Each relies on runtime-specific string/value formatting when streaming the "message" to the hash function (e.g. differences in encoding special symbols might give us slightly different results).

A simple Ruby test script:

require "openssl"

$digest = OpenSSL::Digest::SHA256.new

def test_hash(key, value)
  input = "|#{key}|#{value}|"
  hash = $digest.hexdigest(input)
  puts "hash input: #{input}"
  puts "hash: #{hash}"
end

test_hash("message", 'hello world')
test_hash("message", 'test message "hello world"')

And the Go code (Go playground):

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

var digest = sha256.New()
var encoding = hex.EncodeToString

func main() {
	testHash("message", "hello world")
	testHash("message", `test message "hello world"`)
}

func testHash(key, value string) {
	digest.Reset()
	fmt.Fprintf(digest, "|%v|%v", key, value)
	io.WriteString(digest, "|")

	hash := digest.Sum(nil)
	fmt.Printf("hash input: |%v|%v|\n", key, value)
	fmt.Printf("hash: %s\n", encoding(hash))
}

The Ruby script produces this output:

hash input: |message|hello world|
hash: 50110bbfc1757f21caacc966b33f5ea2235c4176739447e0b3285dec4e1dd2a4
hash input: |message|test message "hello world"|
hash: 14a0364b79acbe4c78dd5e77db2c93ae8c750518b32581927d50b3eef407184e

and the Go program produces:

hash input: |message|hello world|
hash: 50110bbfc1757f21caacc966b33f5ea2235c4176739447e0b3285dec4e1dd2a4
hash input: |message|test message "hello world"|
hash: 14a0364b79acbe4c78dd5e77db2c93ae8c750518b32581927d50b3eef407184e

I also verified that the fingerprint processor returns the same result by adding a custom unit test.

All in all, it looks like the implementations give us the same results for these inputs. Potential directions to investigate why we see different hashes:

  • Different contents in the message field (e.g. timestamp?)
  • Different encoding of the message field? (e.g. Go requires everything to be UTF-8, while Ruby allows different encodings)
  • Different formatting. Both implementations rely on the runtime-specific string formatting capabilities. If special characters are encoded differently, we might end up with different hash values.
  • Fingerprint processor thread safety. The hashing implemented by fingerprint is not really thread-safe. If the beat.Client is shared by multiple goroutines we might run into issues (I don't think filebeat shares the beat.Client between multiple goroutines).
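To illustrate the encoding point from the list above: the same visible text hashes differently when its bytes differ. This is only a minimal Go sketch. The helper hashBytes and the sample word are hypothetical and not taken from either implementation; it only assumes the "|key|value|" input used in the test scripts.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashBytes returns the hex-encoded SHA-256 digest of "|key|value|",
// where the value is passed as raw bytes so its encoding is explicit.
// (Hypothetical helper, for illustration only.)
func hashBytes(key string, value []byte) string {
	h := sha256.New()
	h.Write([]byte("|" + key + "|"))
	h.Write(value)
	h.Write([]byte("|"))
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// The same hypothetical message text, encoded two ways.
	utf8Value := []byte("caf\u00e9")              // UTF-8: 63 61 66 c3 a9
	latin1Value := []byte{0x63, 0x61, 0x66, 0xE9} // ISO-8859-1: 63 61 66 e9

	a := hashBytes("message", utf8Value)
	b := hashBytes("message", latin1Value)

	fmt.Println("utf-8  :", a)
	fmt.Println("latin-1:", b)
	fmt.Println("equal  :", a == b)
}
```

If one side transcodes the message before hashing and the other does not, the fingerprints would diverge in exactly this way even though the text looks identical.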

Without details about the full setup I can't really tell what we are seeing here.

I added thread safety and unit tests from the sample script in #18738.
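For readers wondering what the thread-safety concern looks like in practice, here is a rough Go sketch of guarding a shared hash.Hash with a mutex. It is not the change from #18738. The type safeHasher and its method are hypothetical names used only for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"hash"
	"sync"
)

// safeHasher is a hypothetical wrapper that serializes access to one
// shared hash.Hash, so concurrent callers cannot interleave writes.
type safeHasher struct {
	mu sync.Mutex
	h  hash.Hash
}

func newSafeHasher() *safeHasher {
	return &safeHasher{h: sha256.New()}
}

// fingerprint hashes "|key|value|" while holding the lock.
func (s *safeHasher) fingerprint(key, value string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.h.Reset()
	fmt.Fprintf(s.h, "|%v|%v|", key, value)
	return hex.EncodeToString(s.h.Sum(nil))
}

func main() {
	hasher := newSafeHasher()
	want := hasher.fingerprint("message", "hello world")

	var wg sync.WaitGroup
	mismatches := make(chan string, 100)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if got := hasher.fingerprint("message", "hello world"); got != want {
				mismatches <- got
			}
		}()
	}
	wg.Wait()
	close(mismatches)
	fmt.Println("mismatches:", len(mismatches))
}
```

Without the lock, the Reset, write and Sum calls from different goroutines could interleave on the shared digest and produce hashes that match neither input.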

@JeanN-17

Thanks for your answer @urso.

I use Winlogbeat (on a Windows 7 device), which sends logs to Logstash (running on Debian 10), which in turn sends them to Elasticsearch (running on the same Debian host).

Here is my configuration from winlogbeat.yml:

winlogbeat.event_logs:
  - name: Application
    ignore_older: 72h
    include_xml: true

  - name: System
    include_xml: true

  - name: Security
    include_xml: true
    processors:
      - script:
          lang: javascript
          id: security
          file: ${path.home}/module/security/config/winlogbeat-security.js

  - name: Microsoft-Windows-Sysmon/Operational
    include_xml: true
    processors:
      - script:
          lang: javascript
          id: sysmon
          file: ${path.home}/module/sysmon/config/winlogbeat-sysmon.js


setup.template.pattern: "winlogbeat-%{[agent.version]}*"
setup.template.name: "winlogbeat-%{[agent.version]}"


output.logstash:
  hosts: ["192.168.1.1:5044"]


processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~

processors:
  -fingerprint:
     fields: ["message"]
     method: sha256

Here is my configuration from logstash-beat.yml:

input {
  beats {
    port => 5044
  }
}

filter {
  fingerprint {
    source => ["message"]
    method => "SHA256"
    target => "fingerprint-check"
  }
}

output {
 elasticsearch {
    hosts => ["https://localhost:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}"
    user => "user"
    password => "password"
    ssl => true
    ssl_certificate_verification => false
    cacert => '/etc/logstash/ssl/elasticsearch-ca.pem'
  }
}

Even with these fingerprint configurations, which seem to be equivalent, I get two different hashes from the two fingerprints.

Thanks.

@urso

urso commented May 27, 2020

Maybe we can trim down the test case a little to make it more reproducible (ideally we could construct unit tests with test input).

Your winlogbeat configuration has two processors sections at the top level. Unfortunately the YAML parser does not handle those correctly (and unfortunately we can't detect this). The correct configuration would be:

processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - fingerprint:
      fields: ["message"]
      method: sha256

A good test to see if we have a threading issue would be to disable all event_logs but one. For example:

winlogbeat.event_logs:
  - name: System
    include_xml: true

output.logstash:
  hosts: ["192.168.1.1:5044"]


processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  - add_docker_metadata: ~
  - fingerprint:
      fields: ["message"]
      method: sha256

I'm not even sure if the System logs will contain a message field. Next, some test/debug output would be helpful. If you start winlogbeat from the console/terminal with -d 'publish', you will see all events to be published printed to the console.

In Logstash we can print the event (including the message field) to the console by adding a ruby filter with this script (you need to add require 'json' to the init setting):

puts JSON.pretty_generate(event); puts '=' * 80

Having these messages allows us to see if the fingerprint processors really get the same message as input.
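When comparing the two debug outputs, it can help to look at the message in an escaped and hex form, since invisible differences would change the hash. The trailing "\r\n" below is only an assumed example, not a diagnosis of this setup. A minimal Go sketch, with describe and the sample strings being hypothetical:

```go
package main

import "fmt"

// describe renders a string both as an escaped Go literal and as
// space-separated hex bytes, which makes invisible characters visible.
// (Hypothetical helper, for illustration only.)
func describe(s string) string {
	return fmt.Sprintf("%q (% x)", s, s)
}

func main() {
	// Two hypothetical messages that look the same when printed normally.
	fromBeat := "service started\r\n"
	fromLogstash := "service started"

	fmt.Println("beat    :", describe(fromBeat))
	fmt.Println("logstash:", describe(fromLogstash))
	fmt.Println("equal   :", fromBeat == fromLogstash)
}
```

If the two sides show different byte sequences for the same event, the fingerprints cannot match regardless of how the hash input is assembled.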

@JeanN-17

JeanN-17 commented Jun 3, 2020

Thanks for your answer @urso

Even though I had two processors sections at the top level, the fingerprint seems to work, I guess. But thanks, I will keep it in mind to avoid invalid YAML in the future.

I've tried disabling all event_logs except one, and the fingerprints are still different (between Logstash and Winlogbeat).

In the Logstash configuration I added a ruby filter like this:

filter {
  ruby {
    init => "require 'json'"
    code => 'puts JSON.pretty_generate("[message]")'
  }
}

But the console prints:

[ERROR][logstash.filters.ruby][main] Ruby exception occured : only generation of JSON objects or arrays allowed

So I've put an "add_fields" module in the fingerprint module to see what the message field looks like when it gets fingerprinted, and it looks the same as the input.

So I looked at the fingerprint code of Logstash and Winlogbeat. I have a question about the Beats and Logstash code:
at line 137 it looks like "|" is added between key and value, while at line 139 it seems that "|" isn't added. Could this be the source of the issue?
Moreover, in the Beats implementation the "|" character is always added, regardless of the number of keys and values fingerprinted.

Thanks.

@urso

urso commented Jun 15, 2020

Comparing the Beats and Logstash implementations, I think you should be able to get the same fingerprint in Logstash if you set:

concatenate_sources => true

If concatenate_sources is false, the loop overwrites the fingerprint on each iteration, so only the fingerprint of the last source field is kept. In this case Logstash indeed does not add |.
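The difference between the two modes can be sketched in Go as follows. This is a simplified model, not the plugin's actual code: it assumes no key option is set, so that a plain SHA-256 of the value's string form is used, and the names field, concatenated and lastValueOnly are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// field is a hypothetical key/value pair standing in for an event field.
type field struct{ key, value string }

func sha256Hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// concatenated models concatenate_sources => true (and the Beats
// processor): each field contributes "|key|value", then a closing "|".
func concatenated(fields []field) string {
	input := ""
	for _, f := range fields {
		input += "|" + f.key + "|" + f.value
	}
	return sha256Hex(input + "|")
}

// lastValueOnly models concatenate_sources => false as described above:
// each source overwrites the previous fingerprint, so only the raw value
// of the last field ends up hashed, without key or "|" separators.
func lastValueOnly(fields []field) string {
	out := ""
	for _, f := range fields {
		out = sha256Hex(f.value)
	}
	return out
}

func main() {
	fields := []field{{"message", "hello world"}}
	fmt.Println("concatenated :", concatenated(fields))
	fmt.Println("lastValueOnly:", lastValueOnly(fields))
	fmt.Println("equal        :", concatenated(fields) == lastValueOnly(fields))
}
```

Under these assumptions the two results differ even for a single source field, which matches the behaviour reported in this thread.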

@botelastic

botelastic bot commented May 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@botelastic botelastic bot added the Stalled label May 16, 2021
@botelastic botelastic bot closed this as completed Jun 15, 2021
@zube zube bot added [zube]: Done and removed [zube]: Inbox labels Jun 15, 2021
@zube zube bot removed the [zube]: Done label Sep 14, 2021
@djmcgreal-cc

They don't produce the same hashes when the fields are nested and a different syntax is used, e.g. [log][file][path] in Logstash and log.file.path in Filebeat. I would recommend that the fingerprint plugin accept the Logstash format.
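Since the field name is part of the hashed input in the "|key|value|" scheme shown earlier in this thread, a different spelling of the same field changes the hash. A small Go sketch of that effect, where the fingerprint helper and the sample path are hypothetical:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint hashes "|key|value|", with the key exactly as it is
// written in the configuration. (Hypothetical helper, illustration only.)
func fingerprint(key, value string) string {
	sum := sha256.Sum256([]byte("|" + key + "|" + value + "|"))
	return hex.EncodeToString(sum[:])
}

func main() {
	path := "/var/log/syslog" // hypothetical field value

	beatsStyle := fingerprint("log.file.path", path)
	logstashStyle := fingerprint("[log][file][path]", path)

	fmt.Println("log.file.path    :", beatsStyle)
	fmt.Println("[log][file][path]:", logstashStyle)
	fmt.Println("equal            :", beatsStyle == logstashStyle)
}
```

The value is identical in both calls; only the key spelling differs, and that alone is enough to produce two different fingerprints.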


6 participants