-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[OTTL] Supports hashing multiple capture groups in a replacement string with prefix/suffix #30403
[OTTL] Supports hashing multiple capture groups in a replacement string with prefix/suffix #30403
Conversation
bcfead8
to
06f6ecc
Compare
06f6ecc
to
d7c6cba
Compare
d7c6cba
to
2f346ac
Compare
@evan-bradley it actually looks like we don't have masking examples in our contrib repository, but rather there are examples in distribution repos like this one - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/collecting-container-logs.md#modifying-log-records This is how some customers currently use masking at the moment. Also, our existing tests use the same pattern for masking |
2f346ac
to
57cd617
Compare
// Extract capture group from replacement value to apply the function on it | ||
r := regexp.MustCompile(`(\$[0-9])`) | ||
matches := r.FindAllStringSubmatch(replacementVal, -1) | ||
if len(matches) > 0 { | ||
for _, match := range matches { | ||
multipleCaptureGroups = append(multipleCaptureGroups, match[0]) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential performance improvement for later: check if the value inside StringGetter
is actually a literal and do this work during startup instead. I think we'd need some interface changes in OTTL.
if len(multipleCaptureGroups) > 0 { | ||
for _, captureGroup := range multipleCaptureGroups { | ||
replacementValStr, err = executeFunction(ctx, tCtx, compiledPattern, fn, originalValStr, captureGroup, submatch) | ||
if err != nil { | ||
return "", err | ||
} | ||
captureGroupMap[captureGroup] = replacementValStr | ||
} | ||
switch { | ||
case len(captureGroupMap) > 1: | ||
for key, value := range captureGroupMap { | ||
replacementVal = strings.ReplaceAll(replacementVal, key, value) | ||
} | ||
updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementVal) | ||
|
||
case len(captureGroupMap) == 1: | ||
for key, value := range captureGroupMap { | ||
replacementValStr = strings.ReplaceAll(replacementVal, key, value) | ||
} | ||
updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementValStr) | ||
|
||
default: | ||
updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementValStr) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a lot going on here, can you add describe why we need captureGroupMap
and multipleCaptureGroups
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, the steps are as below:
-
Add all the unresolved capture groups in the replacement string to a list [$1, $2...]
-
Then as you resolve all the capture groups in the original string, build a map of all the capture groups like so {$1: hash(world1), $2: hash(world2)}
-
If there are entries in the map, it means that there are references to capture groups in the replacement string that need to be resolved before updating the original string
-
The outer loop that loops through all the
submatch
indices from the original string ensures that all occurrences of capture groups are replaced in the original string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
multipleCaptureGroups
keeps track of all the unresolved capture groups in the replacement string. Then as we resolve the capture groups we build a captureGroupMap
of unresolved to resolved capture groups. This helps us replace all references to unresolved capture groups with the resolved values in the replacement string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TylerHelmuth does the above description make sense? Please let me know if you have any concerns with this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still find this pretty hard to follow, can you add comments in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some comments to this part of the code, please let me know if you have questions on it.
57cd617
to
f87155d
Compare
@rnishtala-sumo Thanks for linking to the masking docs. There may be a way around this even with only passing the capture groups, but I'm missing how we can address my concern here: #27820 (comment). A user may want to control the input to the hash function so it includes multiple parts from the input string or some static text, likely so the hash matches with hashes produced in other parts of their system. Using our test cases, with input string |
86eede7
to
994b9aa
Compare
7178973
to
fc72d6f
Compare
@evan-bradley how about giving the user an option to choose the hashing behavior, i've written a test for it. Since we're calling this a breaking change, the default behavior would be to hash only the capture groups. The user will be able to include the prefix/suffix in the hash by passing in a boolean (includeNonCapture) to mimic the current hashing behavior. I want to emphasize that if the user doesn't use a prefix/suffix with a capture group then there isn't any change in hashing behavior. All the existing test cases around this do pass. |
fc72d6f
to
452d7d6
Compare
@evan-bradley regarding this comment
I attempted to address this in the latest commit by giving the user a choice to control how they would hash the string
A simpler example for this would be if a user decides to hash a string
|
8a8a620
to
7280a1c
Compare
7280a1c
to
8600fc1
Compare
@TylerHelmuth @evan-bradley. Do we think the approach mentioned here is reasonable for this issue? The motivation for this approach was this comment. Or do you think we should look at the approach mentioned here to support curried functions in OTTL? |
@rnishtala-sumo thanks for being persistent on this issue and sorry it is taking so long. This is definitely the most complex function issue we've had in OTTL and its definitely pushing the language. I believe this PR is a solution for the problem, but I find it quite complex. Requiring an extra Optional param and the need to keep track of all the different situations within the business logic is not simple and the code reflects that. I am worried about the upkeep of this function and the trouble we'll have adding any future features or performing any future refactors. From an end user perspective the Optional parameters wont be used by most, but for those that do want to use the optional params I worry about understanding exactly how everything is applied. I think it would be good to do some comparisons with the other options. As I understand it, this PR implements Option 1 from #27820 (comment). I'd like to see Option 2 (format param) implemented as well. Since Option 3 (currying) takes a lot of language changes, I vote we start by looking at Option 2 and if we like it we could discuss taking it a step further with Option 3, which is a more powerful Option 2. How does that sound? Again, sorry for all the delays with this issue and thanks for sticking with it. This is an important use case for OTTL/transformprocessor and I want to make sure we get it right. |
@rnishtala-sumo Thanks for adding the extra parameter. I think the issue I still see here is that we aren't able to format the output of the optional function. At the end of the day I think we will need to allow the user to format the input to the optional function and how the output of the optional function is formatted before it is used as a replacement in the string. I also realize that the current implementation optimizes the UX for masking use cases, which may in fact be sufficient for a large majority of users. Without others chiming in to say they also expect the function to be used like this it's a little hard to independently determine whether this is the case. Could you detail more about how you expect end users to use this functionality? I think this could help inform what UX will be best, and when we can optimize for certain cases and suggest more complicated solutions for others. I agree with what @TylerHelmuth said about the implementation: I think it is great that you've made this work, but it is also complex. I think option 2 would be a good first exploration unless you think option 3 would be better. If we attempt option 3 I think we would want to know that it could be helpful in other places in OTTL just so we're not implementing something for a specific use case. I also want to echo what @TylerHelmuth said and thank you for your patience on this. I've spent the past few days seriously considering this and still haven't been able to come to a solid conclusion on what the best solution is here. It's certainly an important problem, but it's also a hard one. |
@TylerHelmuth @evan-bradley , no problem, thank you for the suggestions. Just to refresh our memories on this, Option 2 from this comment was what I started with. This was the PR that was closed, because of this comment which gained traction and led to option 1 🙂 The only reason I added the additional optional param in Option 1 was because we didn't want to lose an existing [use case].(#30403 (comment)). The original intent of option 1 was very much to change the existing hash behavior only when capture groups are used in the replacement string. Other use cases (not involving capture groups) wouldn't change. Having said that, I don't mind pivoting back to option 2 which is the simplest of all options for this issue and maybe open a separate issue for supporting option 3 where we could track all the use cases for it. |
I think the example below is a good starting point. Whether the user would like to mask or hash a card number in the log, they would have the same UX, if we stick to option 1. |
Another example
Hash the remote IP With option 2: To Mask the remote IP Option 1 is closer to what masking does today and requires fewer params/user inputs |
@rnishtala-sumo I totally forgot about #27686, thanks for bringing that up. Let's revive that PR with all the other changes/tests that have been made to the functions and see what it looks like. I'm really curious to see how that solution lets us format the output of the optional function. Also looking at this scenario, I am totally more in favor of Option 2 over Option 1. Option 2 is definitely easier for me to comprehend than Option 1 since I don't have to do any mental math on what is being hashed. @evan-bradley I can also see how currying functions together is maybe even clearer. |
@TylerHelmuth @evan-bradley Couldn't reopen the same PR, so created a new one for option 2 using the same feature branch. |
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
Closed as inactive. Feel free to reopen if this PR is still being worked on. |
Description:
Supports hashing multiple capture groups in a replacement string, with a prefix/suffix. Offers UX consistent with masking.
The following formats are now supported
The function is applied as follows
Link to tracking Issue:
#27820
Implements Option 1 from this comment
Testing: Unit tests
Documentation:
Potentially a breaking change since the prefix/suffix of a capture group is no longer a part of the hash. Marked this as an enhancement at present.
The following hashing example is now consistent with masking behavior, where the
k8s
prefix is preserved and not included in the hashreplace_all_patterns(attributes, "key", "^kube_([0-9A-Za-z]+_)", "k8s.$$1.")