Fingerprint processor #14205

Merged
ycombinator merged 28 commits into elastic:master from lb-processor-fingerprint on Nov 1, 2019

Conversation

Contributor

@ycombinator ycombinator commented Oct 22, 2019

Resolves #11173.

This PR implements a fingerprint processor, similar to Logstash's fingerprint filter plugin.

The processor will take the following configuration options:

| Name | Required? | Default | Description |
|------|-----------|---------|-------------|
| `fields` | Yes | | List of fields to use as the source of the fingerprint |
| `ignore_missing` | No | `false` | Whether to ignore missing fields |
| `target_field` | No | `fingerprint` | Field in which the computed fingerprint should be stored |
| `method` | No | `sha256` | Algorithm to use for computing the fingerprint. Must be one of: `md5`, `sha1`, `sha256` (default), `sha384`, `sha512` |
| `encoding` | No | `hex` | Encoding to use on the fingerprint value. Must be one of: `hex` (default), `base32`, `base64` |
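
To make the `method` and `encoding` options concrete, here is a minimal sketch of how they could map onto the Go standard library. It is not the code in this PR: the `fingerprint` helper, the flat `map[string]interface{}` event, and the `|key|value` layout written into the hash are all assumptions made for illustration.

```go
package main

import (
	"crypto/md5"
	"crypto/sha1"
	"crypto/sha256"
	"crypto/sha512"
	"encoding/base32"
	"encoding/base64"
	"encoding/hex"
	"fmt"
	"hash"
	"sort"
)

// hashes maps the documented method names to standard-library constructors.
var hashes = map[string]func() hash.Hash{
	"md5":    md5.New,
	"sha1":   sha1.New,
	"sha256": sha256.New,
	"sha384": sha512.New384,
	"sha512": sha512.New,
}

// encoders maps the documented encoding names to encoding functions.
var encoders = map[string]func([]byte) string{
	"hex":    hex.EncodeToString,
	"base32": base32.StdEncoding.EncodeToString,
	"base64": base64.StdEncoding.EncodeToString,
}

// fingerprint is a hypothetical helper (not the PR's implementation) that
// hashes the named top-level fields of an event and encodes the digest.
func fingerprint(event map[string]interface{}, fields []string, method, encoding string) (string, error) {
	newHash, ok := hashes[method]
	if !ok {
		return "", fmt.Errorf("unknown method [%v]", method)
	}
	encode, ok := encoders[encoding]
	if !ok {
		return "", fmt.Errorf("unknown encoding [%v]", encoding)
	}

	// Sort a copy of the field names so their order in the config does not matter.
	sorted := append([]string(nil), fields...)
	sort.Strings(sorted)

	h := newHash()
	for _, k := range sorted {
		v, found := event[k]
		if !found {
			return "", fmt.Errorf("field [%v] not found in event", k)
		}
		// hash.Hash is an io.Writer, so fields can be streamed into it.
		// The "|key|value" layout is an assumption made for this sketch.
		fmt.Fprintf(h, "|%v|%v", k, v)
	}
	return encode(h.Sum(nil)), nil
}

func main() {
	event := map[string]interface{}{"message": "hello", "host": "web-1"}
	fp, err := fingerprint(event, []string{"message", "host"}, "sha256", "hex")
	if err != nil {
		panic(err)
	}
	fmt.Println(fp)
}
```

Because the field names are sorted before hashing, listing `fields` in a different order in this sketch yields the same fingerprint.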

type Method uint8

const (
MethodSHA1 Method = iota

exported const MethodSHA1 should have comment (or a comment on this block) or be unexported


var errMethodUnknown = errors.New("unknown method")

type Method uint8

exported type Method should have comment or be unexported
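
For reference, a minimal sketch of what would satisfy both lint warnings: a doc comment on the exported type and one comment on the const block. `MethodSHA1` and `errMethodUnknown` come from the diff; `MethodSHA256`, `methodNames`, and `ParseMethod` are hypothetical names added only to make the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

var errMethodUnknown = errors.New("unknown method")

// Method identifies the hashing algorithm used to compute a fingerprint.
type Method uint8

// Supported fingerprinting methods. A single comment on the const block is
// enough for the linter; each constant does not need its own.
const (
	MethodSHA1 Method = iota
	MethodSHA256 // hypothetical second value, for illustration only
)

// methodNames is a hypothetical lookup table from config strings to methods.
var methodNames = map[string]Method{
	"sha1":   MethodSHA1,
	"sha256": MethodSHA256,
}

// ParseMethod is a hypothetical helper that converts a config string into a
// Method, returning errMethodUnknown for unsupported names.
func ParseMethod(name string) (Method, error) {
	m, ok := methodNames[name]
	if !ok {
		return 0, errMethodUnknown
	}
	return m, nil
}

func main() {
	m, err := ParseMethod("sha256")
	fmt.Println(m, err)
}
```

The alternative the linter offers is to unexport the type and constants, which avoids the comments entirely if nothing outside the package needs them.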

@andrewkroh
Member

@ycombinator
Contributor Author

ycombinator commented Oct 23, 2019

> This reminds me of https://github.com/andrewkroh/beats-processor-fingerprint. As well as #1872.

Thanks for flagging these, @andrewkroh; I wasn't aware.

@urso Looks like there's prior art (the processor in Andrew's repo) as well as discussions going on in the ES team (see elastic/elasticsearch#34085, which is eventually linked off #1872 and the related PR: elastic/elasticsearch#47047). Does it still make sense for me to continue working on this PR?

@urso

urso commented Oct 23, 2019

> Does it still make sense for me to continue working on this PR?

I think yes, it makes sense.

The one by Andrew is a private one. It's not to be found in Beats. Maybe we can take some of it.

There is also the Logstash fingerprint processor. I hope the ES one will be similar to the Logstash one, but I didn't check yet.

@ycombinator
Contributor Author

Per the discussion in elastic/elasticsearch#47047 (comment), we are going to resume working on a fingerprint processor in Beats.

@andrewkroh Since you've already built a fingerprint processor in your repo, did you want to put up a PR with that code to the beats repo? If not, I'll continue working on this PR while looking at your work.

@andrewkroh
Member

> did you want to put up a PR with that code to the beats repo? If not, I'll continue working on this PR while looking at your work.

No, but please copy anything you want from my repo. 👍

@ycombinator ycombinator changed the title WIP: fingerprint processor fingerprint processor Oct 30, 2019
@ycombinator ycombinator changed the title fingerprint processor Fingerprint processor Oct 30, 2019
@ycombinator ycombinator marked this pull request as ready for review October 30, 2019 02:55
@ycombinator
Contributor Author

jenkins, test this

}
if err != nil {
return "", errors.Wrapf(err, "failed when finding field [%v] in event", k)
}

Do we want to have an option to ignore missing fields in case we have at least one field present?

Contributor Author

Sorry, but is this the same as your suggestion in #14205 (comment) or something different?

Almost. My original suggestion was to add support for ignoring missing fields. But apparently we can get other error types as well. Would it make sense to treat those other types as 'missing' too?

Contributor Author

I see. So if we can't "get" a field for whatever reason, we treat it as missing and then, if the ignore_missing option is set, ignore it. Hmm, I think this makes sense, but let me just look into what other types of errors (besides common.ErrKeyNotFound) might be returned here.

Contributor Author

The only other error that can be returned here is if we try to get a value for a nested field (e.g. a.b.c), but the ancestor path to the field (a.b) does not resolve to a map. To me this also feels like a missing field as, even in this case, we still could not find a.b.c for the user. So I'm okay with collapsing the two error cases into one and adding ignore_missing handling to it.


for _, test := range tests {
name := fmt.Sprintf("testing %v encoding", test.encoding)
t.Run(name, func(t *testing.T) {

@urso urso Oct 30, 2019

Test names can be filtered using regexes. For this I like to give the names some structure instead of using full sentences (assuming the parent test's name is expressive enough), e.g. just pass encoding as the name to t.Run. (In case I have multiple parameters or want a name, I use fmt.Sprintf("field=%v", value).)

In this case the test name could just be TestEncoding/base64. I can run the test using go test -run Encoding/base64 (the / acts as a delimiter).

@urso urso left a comment

The tests are not well isolated. Many tests use the same test event as input. It is okay to have a similar event, but the processor modifies the original map. A deep copy of the fields is required to guarantee some isolation.
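
To illustrate the isolation problem, here is a minimal sketch using plain `map[string]interface{}` values. `deepCopy` and `addFingerprint` are hypothetical helpers written for this example; the Beats codebase has its own map type and helpers, which are not reproduced here.

```go
package main

import "fmt"

// deepCopy recursively copies nested maps and slices so that the copy shares
// no mutable state with the original. Other values are returned as-is.
func deepCopy(v interface{}) interface{} {
	switch x := v.(type) {
	case map[string]interface{}:
		out := make(map[string]interface{}, len(x))
		for k, val := range x {
			out[k] = deepCopy(val)
		}
		return out
	case []interface{}:
		out := make([]interface{}, len(x))
		for i, val := range x {
			out[i] = deepCopy(val)
		}
		return out
	default:
		return v
	}
}

// addFingerprint is a stand-in for a processor that mutates the event in place.
func addFingerprint(fields map[string]interface{}) {
	fields["fingerprint"] = "dummy"
}

func main() {
	shared := map[string]interface{}{
		"message": "hello",
		"nested":  map[string]interface{}{"n": 1},
	}

	// Each test works on its own copy, so the shared fixture stays untouched.
	input := deepCopy(shared).(map[string]interface{})
	addFingerprint(input)

	_, leaked := shared["fingerprint"]
	fmt.Println(leaked) // prints "false"
}
```

A shallow copy would protect the top-level keys but still share the nested map, which is why the copy has to be deep.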

Contributor

@dedemorton dedemorton left a comment

All the deets look good. I just have a few minor comments.


.Fingerprint options
[options="header"]
|======
Contributor

You forgot the header row here. :-) The rendered table looks like this:

[image: screen capture of the rendered table]

[options="header"]
|======
| `fields` | yes | | List of fields to use as the source for the fingerprint. |
| `target_field` | no | `fingerprint` | Field in which the generated fingerprint should be stored. |
Contributor

Tables with more than 3 columns don't look very good in our HTML output (see screen capture in previous comment). Docbook swallows all table attributes, so we can't do anything about that right now. You could remove the example row and provide the example as part of the description. Or wait for some shift in the universe that will make this right.

Contributor Author

I replaced the table with a definition list, like the one used in the Extract Array processor, for instance. Let me know what you think. Thanks!

@ycombinator
Contributor Author

@urso @dedemorton Thanks for your reviews. I believe I've addressed all your feedback now. This PR is ready for re-review, when you get a chance. Thanks again!

Contributor

@dedemorton dedemorton left a comment

LGTM! In this case, I think the list is actually easier to scan.

@ycombinator
Contributor Author

jenkins, test this

@ycombinator
Contributor Author

Travis CI is green. Jenkins CI is red because of x-pack/agent builds being yellow, which is unrelated to this PR. Merging.

@ycombinator ycombinator merged commit 7e06580 into elastic:master Nov 1, 2019
@ycombinator ycombinator deleted the lb-processor-fingerprint branch December 25, 2019 11:09
@andresrc andresrc added the Team:Integrations Label for the Integrations team label Mar 6, 2020
jorgemarey pushed a commit to jorgemarey/beats that referenced this pull request Jun 8, 2020
* WIP: fingerprint processor

* Implementing SHA256 fingerprinter

* Sort source fields

* Refactoring

* Add TODO

* Convert time fields to UTC

* Removing unnecessary function

* Adding SHA1

* WIP: add encoding

* Cleanup

* Running mage fmt

* More methods + consolidating tests

* Fleshing out tests

* Adding test for target field

* Adding documentation

* Adding CHANGELOG entry

* Fixing test

* Converting tests to map

* Isolating tests

* Use io.Writer to stream in fields

* Implement ignore_missing setting

* Replace table with definition list

* Adding `ignore_missing` to doc

* using io.Fprintf

* Use common.StringSet

* Adding typed errors

* Adding more typed errors

* Adding license header
@neu5ron

neu5ron commented Jun 5, 2024

I am not sure where to move this issue forward, but it should be noted that the fingerprint ingest processor creates values that are inconsistent with hashing the same input using md5, sha1, sha256, or sha512 in any other software, including Elastic software such as Logstash and Filebeat.
The issue states that even when hashing a single value, the fingerprint processor adds a byte to the value and then computes the hash.

Development

Successfully merging this pull request may close these issues.

Add a Fingerprint processor for Non-repudiation
7 participants