-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fingerprint ingest processor #68415
Fingerprint ingest processor #68415
Conversation
Pinging @elastic/es-core-features (Team:Core/Features) |
@graphaelli, could you or a member of your team look this over to verify that it provides the functionality that the APM use case requires? |
@elasticmachine run elasticsearch-ci/bwc |
Thanks @danhermann! @elastic/apm-server - can you provide feedback here? |
@elasticmachine update branch |
@danhermann thank you, this will do nicely for APM. We'll need MD5 (which I see is available) for now, but later I would like to migrate to something faster, such as Murmurhash3 (non-cryptographic would be fine). Any reason why we couldn't add that later? Doesn't look like it to me, just wanted to double check. |
Yes, adding support for the murmur3 hash should be doable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @danhermann, I left a few really minor comments, but I think we need to add regular docs also right?
x-pack/plugin/ingest/src/main/java/org/elasticsearch/xpack/ingest/FingerprintProcessor.java
Show resolved
Hide resolved
var setList = set.stream().sorted().collect(Collectors.toList()); | ||
for (int k = setList.size() - 1; k >= 0; k--) { | ||
values.push(setList.get(k)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Streams are super great, but they aren't quite as fast compared to foreach
clauses yet, and since this is performance sensitive code, what do you think about doing:
var setList = set.stream().sorted().collect(Collectors.toList()); | |
for (int k = setList.size() - 1; k >= 0; k--) { | |
values.push(setList.get(k)); | |
} | |
var setList = new ArrayList<?>(set); | |
setList.sort(Comparator.naturalOrder()); | |
for (Object thing : setList) { | |
values.push(thing); | |
} |
?
(I might be missing something about why you're using a for loop indexed rather than foreach
, as I think the foreach
compiles to the same bytecode, but I could totally be wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The for loop iterates backward over the sorted list to push the values onto the stack. That means that the values are processed in sorted order after being popped off the stack which is not strictly necessary since only a stable hash is required, but having the values in sorted order is certainly nicer when debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did replace the usage of streams both here and below.
x-pack/plugin/ingest/src/main/java/org/elasticsearch/xpack/ingest/FingerprintProcessor.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ingest/src/main/java/org/elasticsearch/xpack/ingest/FingerprintProcessor.java
Outdated
Show resolved
Hide resolved
x-pack/plugin/ingest/src/main/java/org/elasticsearch/xpack/ingest/FingerprintProcessor.java
Outdated
Show resolved
Hide resolved
Thanks for the review, @dakrone. I'll open a separate PR with the docs for this one shortly. |
@dakrone, I've updated this with all your suggestions except the indexed for loop. Let me know if you're ok with the reverse iteration so elements from the stack are processed in sorted order. |
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, I'm totally cool with the reverse ordering, thanks for iterating!
cc: @elastic/es-ui in case Kibana auto-complete needs to be updated with this new processor. |
Adds a fingerprint processor that computes hashes of document content for content fingerprinting use cases.
E.g.:
Which produces:
Supports any number of document fields, nested document content, any hash from [MD5, SHA-1, SHA-256, SHA-512], and a per-processor salt.
Closes #53578 though it addresses only content fingerprinting and not anonymization use cases.