[OTTL] Supports hashing multiple capture groups in a replacement string with prefix/suffix #30403

rnishtala-sumo · 2024-01-10T19:44:58Z

Description:
Supports hashing multiple capture groups in a replacement string, with a prefix/suffix. Offers UX consistent with masking.

The following replacement formats are now supported:

prefix-$1-suffix
prefix-$1-suffix-$2
prefix-$1-$2-suffix

The function is applied to each capture group as follows:

prefix-hash($1)-suffix
prefix-hash($1)-suffix-hash($2)
prefix-hash($1)-hash($2)-suffix
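To make the mapping above concrete, here is a minimal Go sketch of the intended behavior. It is not the code in this PR: `expandHashed` and `sha256Hex` are hypothetical names, and the sketch assumes single-digit references ($1 to $9), as the `(\$[0-9])` pattern in the diff does. The function is applied to each referenced capture group, while the literal prefix/suffix text of the replacement string is copied through unchanged.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
	"strings"
)

// captureRef matches single-digit capture group references such as $1 or $2.
var captureRef = regexp.MustCompile(`\$[0-9]`)

// sha256Hex stands in for the optional function argument (for example SHA256).
func sha256Hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// expandHashed (hypothetical helper) replaces every match of pattern in input
// with template, applying fn only to the referenced capture groups.
// It panics on an invalid pattern, which is acceptable for a sketch.
func expandHashed(pattern, input, template string, fn func(string) string) string {
	re := regexp.MustCompile(pattern)
	var out strings.Builder
	last := 0
	for _, loc := range re.FindAllStringSubmatchIndex(input, -1) {
		// Copy the text between the previous match and this one unchanged.
		out.WriteString(input[last:loc[0]])
		out.WriteString(captureRef.ReplaceAllStringFunc(template, func(ref string) string {
			n := int(ref[1] - '0')
			if 2*n+1 >= len(loc) || loc[2*n] < 0 {
				return "" // unknown or unmatched group expands to nothing
			}
			return fn(input[loc[2*n]:loc[2*n+1]])
		}))
		last = loc[1]
	}
	out.WriteString(input[last:])
	return out.String()
}

func main() {
	// The "blue-" prefix stays readable; only world1 and world2 are hashed.
	fmt.Println(expandHashed(`(world1) and (world2)`,
		"goodbye world1 and world2", "blue-$1 and blue-$2", sha256Hex))
}
```

Building the output from submatch indices, rather than re-matching each matched substring, keeps anchors and word boundaries in the pattern behaving as they did against the full input.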

Link to tracking Issue:
#27820
Implements Option 1 from this comment

Testing: Unit tests

Documentation:
Potentially a breaking change since the prefix/suffix of a capture group is no longer a part of the hash. Marked this as an enhancement at present.

The following hashing example is now consistent with masking behavior, where the k8s prefix is preserved and not included in the hash
replace_all_patterns(attributes, "key", "^kube_([0-9A-Za-z]+_)", "k8s.$$1.")

rnishtala-sumo · 2024-01-17T17:40:03Z

@evan-bradley it actually looks like we don't have masking examples in our contrib repository, but rather there are examples in distribution repos like this one - https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/collecting-container-logs.md#modifying-log-records

This is how some customers currently use masking.

Also, our existing tests use the same pattern for masking

.chloggen/hashed-repl-fmt.yaml

TylerHelmuth · 2024-01-17T21:56:19Z

pkg/ottl/ottlfuncs/func_replace_pattern.go

+	// Extract capture group from replacement value to apply the function on it
+	r := regexp.MustCompile(`(\$[0-9])`)
+	matches := r.FindAllStringSubmatch(replacementVal, -1)
+	if len(matches) > 0 {
+		for _, match := range matches {
+			multipleCaptureGroups = append(multipleCaptureGroups, match[0])
+		}
+	}


A potential performance improvement for later: check if the value inside StringGetter is actually a literal and do this work during startup instead. I think we'd need some interface changes in OTTL.
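A rough sketch of what that could look like, under stated assumptions: `newRefExtractor` is a hypothetical constructor, and the `literal` pointer stands in for whatever interface change would let the function factory learn that the replacement is a literal (OTTL does not expose this today, as noted above). The regular expression is compiled once at package level, and for literals the capture group references are extracted once at startup.

```go
package main

import (
	"fmt"
	"regexp"
)

// captureRef is compiled once, not on every invocation of the function.
var captureRef = regexp.MustCompile(`\$[0-9]`)

// newRefExtractor (hypothetical) returns a function that lists the capture
// group references in a replacement string. When literal is non-nil the
// replacement is known at startup, so the references are computed once and
// the cached slice is returned on every call.
func newRefExtractor(literal *string) func(replacement string) []string {
	if literal != nil {
		cached := captureRef.FindAllString(*literal, -1)
		return func(string) []string { return cached }
	}
	// Dynamic replacement values still have to be scanned per call.
	return func(replacement string) []string {
		return captureRef.FindAllString(replacement, -1)
	}
}

func main() {
	lit := "prefix-$1-suffix-$2"
	extract := newRefExtractor(&lit)
	fmt.Println(extract(lit))
}
```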


TylerHelmuth · 2024-01-17T21:59:57Z

pkg/ottl/ottlfuncs/func_replace_pattern.go

+		if len(multipleCaptureGroups) > 0 {
+			for _, captureGroup := range multipleCaptureGroups {
+				replacementValStr, err = executeFunction(ctx, tCtx, compiledPattern, fn, originalValStr, captureGroup, submatch)
+				if err != nil {
+					return "", err
+				}
+				captureGroupMap[captureGroup] = replacementValStr
+			}
+			switch {
+			case len(captureGroupMap) > 1:
+				for key, value := range captureGroupMap {
+					replacementVal = strings.ReplaceAll(replacementVal, key, value)
+				}
+				updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementVal)
+
+			case len(captureGroupMap) == 1:
+				for key, value := range captureGroupMap {
+					replacementValStr = strings.ReplaceAll(replacementVal, key, value)
+				}
+				updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementValStr)
+
+			default:
+				updatedString = strings.ReplaceAll(updatedString, fullMatch, replacementValStr)
+			}


There is a lot going on here, can you describe why we need captureGroupMap and multipleCaptureGroups?

Sure, the steps are as follows:

1. Add all the unresolved capture groups in the replacement string to a list [$1, $2...]

2. As you resolve the capture groups in the original string, build a map from each unresolved capture group to its resolved value, like so: {$1: hash(world1), $2: hash(world2)}

3. If there are entries in the map, there are references to capture groups in the replacement string that need to be resolved before updating the original string

4. The outer loop over all the submatch indices from the original string ensures that all occurrences of capture groups are replaced in the original string

multipleCaptureGroups keeps track of all the unresolved capture groups in the replacement string. Then as we resolve the capture groups we build a captureGroupMap of unresolved to resolved capture groups. This helps us replace all references to unresolved capture groups with the resolved values in the replacement string.
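For readers following along, here is a simplified Go sketch of the list-then-map approach described above, for a single match. It is an illustration, not the PR code: `resolveReplacement` is a hypothetical helper, and the local names `unresolved` and `resolved` play the roles of multipleCaptureGroups and captureGroupMap.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// captureRef matches single-digit capture group references such as $1.
var captureRef = regexp.MustCompile(`\$[0-9]`)

// resolveReplacement (hypothetical) resolves a replacement string for one
// matched substring, applying fn to each referenced capture group.
func resolveReplacement(pattern, matched, replacement string, fn func(string) string) string {
	re := regexp.MustCompile(pattern)

	// Collect the unresolved references, e.g. [$1 $2].
	unresolved := captureRef.FindAllString(replacement, -1)

	// Map each reference to fn(captured text), e.g. {$1: hash(world1)}.
	loc := re.FindStringSubmatchIndex(matched)
	resolved := make(map[string]string, len(unresolved))
	for _, ref := range unresolved {
		captured := re.ExpandString(nil, ref, matched, loc)
		resolved[ref] = fn(string(captured))
	}

	// Substitute the resolved values into the replacement string.
	// Map order does not matter as long as fn never returns text that
	// itself contains a $N reference (hex digests do not).
	for ref, val := range resolved {
		replacement = strings.ReplaceAll(replacement, ref, val)
	}
	return replacement
}

func main() {
	label := func(s string) string { return "hash(" + s + ")" }
	fmt.Println(resolveReplacement(`(world1) and (world2)`,
		"world1 and world2", "blue-$1 and blue-$2", label))
	// prints "blue-hash(world1) and blue-hash(world2)"
}
```

Because the map is keyed by reference, a group that appears several times in the replacement string is only passed through fn once.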

@TylerHelmuth does the above description make sense? Please let me know if you have any concerns with this approach.

I still find this pretty hard to follow, can you add comments in the code?

I added some comments to this part of the code, please let me know if you have questions on it.

evan-bradley · 2024-01-18T18:25:51Z

@rnishtala-sumo Thanks for linking to the masking docs. There may be a way around this even with only passing the capture groups, but I'm missing how we can address my concern here: #27820 (comment). A user may want to control the input to the hash function so it includes multiple parts from the input string or some static text, likely so the hash matches with hashes produced in other parts of their system.

Using our test cases, with input string goodbye world1 and world2, pattern (world1) and (world2), and some replacement string equivalent to blue-$1 and blue-$2, it's not clear to me how we could get goodbye hash(blue-world1 and blue-world2). This test case would work with the current implementation of the function.

rnishtala-sumo · 2024-01-18T22:52:35Z

@evan-bradley how about giving the user an option to choose the hashing behavior? I've written a test for it.

Since we're calling this a breaking change, the default behavior would be to hash only the capture groups. The user will be able to include the prefix/suffix in the hash by passing in a boolean (includeNonCapture) to mimic the current hashing behavior.

I want to emphasize that if the user doesn't use a prefix/suffix with a capture group then there isn't any change in hashing behavior. All the existing test cases around this do pass.

rnishtala-sumo · 2024-01-23T15:09:27Z

@evan-bradley regarding this comment

I don't have any real-world cases to point to here, but I'm a little worried we would lose too much control over how the hash is created if we do this.

I attempted to address this in the latest commit by giving the user a choice to control how they would hash the string goodbye world1 and world2 with the regex (world1) and (world2) and replacement - blue-$1 and blue-$2

Including non capture groups will result in goodbye hash(blue-world1 and blue-world2)
The default behavior would be goodbye blue-hash(world1) and blue-hash(world2)

A simpler example: if a user decides to hash the string password=1234, both of the following would be supported via the optional boolean param (IncludeNonCapture):

hash(password=1234)
[default] password=hash(1234)
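The two modes can be sketched in Go as follows. This is an illustration under assumptions rather than the PR implementation: `hashReplace` is a hypothetical helper, its `includeNonCapture` parameter mirrors the optional boolean described above, and a labelling function is used in place of a real hash so the output is easy to read.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// captureRef matches single-digit capture group references such as $1.
var captureRef = regexp.MustCompile(`\$[0-9]`)

// hashReplace (hypothetical) replaces every match of pattern in input.
// With includeNonCapture=false only the capture groups referenced in the
// replacement are passed to fn; with true, the fully expanded replacement
// (prefix and suffix included) is passed to fn.
func hashReplace(pattern, input, replacement string, includeNonCapture bool, fn func(string) string) string {
	re := regexp.MustCompile(pattern)
	var out strings.Builder
	last := 0
	for _, loc := range re.FindAllStringSubmatchIndex(input, -1) {
		out.WriteString(input[last:loc[0]])
		if includeNonCapture {
			expanded := re.ExpandString(nil, replacement, input, loc)
			out.WriteString(fn(string(expanded)))
		} else {
			out.WriteString(captureRef.ReplaceAllStringFunc(replacement, func(ref string) string {
				captured := re.ExpandString(nil, ref, input, loc)
				return fn(string(captured))
			}))
		}
		last = loc[1]
	}
	out.WriteString(input[last:])
	return out.String()
}

func main() {
	label := func(s string) string { return "hash(" + s + ")" }
	fmt.Println(hashReplace(`password=(\d+)`, "password=1234", "password=$1", false, label))
	// prints "password=hash(1234)"
	fmt.Println(hashReplace(`password=(\d+)`, "password=1234", "password=$1", true, label))
	// prints "hash(password=1234)"
}
```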


rnishtala-sumo · 2024-01-26T15:40:16Z

@TylerHelmuth @evan-bradley. Do we think the approach mentioned here is reasonable for this issue? The motivation for this approach was this comment. Or do you think we should look at the approach mentioned here to support curried functions in OTTL?

TylerHelmuth · 2024-01-26T17:51:49Z

@rnishtala-sumo thanks for being persistent on this issue and sorry it is taking so long. This is definitely the most complex function issue we've had in OTTL and it's definitely pushing the language.

I believe this PR is a solution for the problem, but I find it quite complex. Requiring an extra Optional param and the need to keep track of all the different situations within the business logic is not simple and the code reflects that. I am worried about the upkeep of this function and the trouble we'll have adding any future features or performing any future refactors. From an end user perspective the Optional parameters won't be used by most, but for those that do want to use them I worry about how hard it will be to understand exactly how everything is applied.

I think it would be good to do some comparisons with the other options. As I understand it, this PR implements Option 1 from #27820 (comment). I'd like to see Option 2 (format param) implemented as well. Since Option 3 (currying) takes a lot of language changes, I vote we start by looking at Option 2 and if we like it we could discuss taking it a step further with Option 3, which is a more powerful Option 2.

How does that sound?

Again, sorry for all the delays with this issue and thanks for sticking with it. This is an important use case for OTTL/transformprocessor and I want to make sure we get it right.

evan-bradley · 2024-01-26T18:04:46Z

@rnishtala-sumo Thanks for adding the extra parameter. I think the issue I still see here is that we aren't able to format the output of the optional function. At the end of the day I think we will need to allow the user to format the input to the optional function and how the output of the optional function is formatted before it is used as a replacement in the string.

I also realize that the current implementation optimizes the UX for masking use cases, which may in fact be sufficient for a large majority of users. Without others chiming in to say they also expect the function to be used like this it's a little hard to independently determine whether this is the case. Could you detail more about how you expect end users to use this functionality? I think this could help inform what UX will be best, and when we can optimize for certain cases and suggest more complicated solutions for others.

I agree with what @TylerHelmuth said about the implementation: I think it is great that you've made this work, but it is also complex. I think option 2 would be a good first exploration unless you think option 3 would be better. If we attempt option 3 I think we would want to know that it could be helpful in other places in OTTL just so we're not implementing something for a specific use case.

I also want to echo what @TylerHelmuth said and thank you for your patience on this. I've spent the past few days seriously considering this and still haven't been able to come to a solid conclusion on what the best solution is here. It's certainly an important problem, but it's also a hard one.

rnishtala-sumo · 2024-01-26T18:23:49Z

@TylerHelmuth @evan-bradley, no problem, thank you for the suggestions. Just to refresh our memories on this, Option 2 from this comment was what I started with. This was the PR that was closed because of this comment, which gained traction and led to option 1 🙂

The only reason I added the additional optional param in Option 1 was because we didn't want to lose an existing use case (#30403 (comment)). The original intent of option 1 was very much to change the existing hash behavior only when capture groups are used in the replacement string. Other use cases (not involving capture groups) wouldn't change.

Having said that, I don't mind pivoting back to option 2 which is the simplest of all options for this issue and maybe open a separate issue for supporting option 3 where we could track all the use cases for it.

rnishtala-sumo · 2024-01-26T18:28:52Z

Could you detail more about how you expect end users to use this functionality? I think this could help inform what UX will be best

I think the example below is a good starting point. Whether the user would like to mask or hash a card number in the log, they would have the same UX, if we stick to option 1.

https://github.com/SumoLogic/sumologic-kubernetes-collection/blob/main/docs/collecting-container-logs.md#modifying-log-records

rnishtala-sumo · 2024-01-26T19:22:46Z

Another example

2012-05-16 09:43:39,607 -0700 DEBUG [hostId=prod-cass-raw-8] [module=RAW] [logger=scala.raw.InboundRawProtocolHandler] [memberid=dan@demo.com] [remote_ip=98.248.40.103] [local_ip=192.168.0.1]

To hash the remote IP
With option 1:
replace_pattern(body, "remote_ip=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", "remote_ip=$$1", SHA256)

With option 2:
replace_pattern(body, "remote_ip=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", "$$1", SHA256, "remote_ip=%s")

To mask the remote IP
replace_pattern(body, "remote_ip=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})", "remote_ip=***")

Option 1 is closer to what masking does today and requires fewer params/user inputs
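For comparison, this is roughly how option 2 could be evaluated, sketched in Go. The helper name `replaceWithFormat` and its parameter order are assumptions for illustration and are not taken from either PR: the replacement string is expanded first, the function is applied to the expanded value, and the result is placed into the user-supplied format string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// replaceWithFormat (hypothetical) replaces every match of pattern in input
// with format applied to fn(expanded replacement). The format string is
// expected to contain a single %s verb.
func replaceWithFormat(pattern, input, replacement, format string, fn func(string) string) string {
	re := regexp.MustCompile(pattern)
	var out strings.Builder
	last := 0
	for _, loc := range re.FindAllStringSubmatchIndex(input, -1) {
		out.WriteString(input[last:loc[0]])
		// Expand $1-style references using this match's capture groups.
		expanded := string(re.ExpandString(nil, replacement, input, loc))
		out.WriteString(fmt.Sprintf(format, fn(expanded)))
		last = loc[1]
	}
	out.WriteString(input[last:])
	return out.String()
}

func main() {
	label := func(s string) string { return "hash(" + s + ")" }
	line := "[remote_ip=98.248.40.103] [local_ip=192.168.0.1]"
	fmt.Println(replaceWithFormat(
		`remote_ip=(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})`,
		line, "$1", "remote_ip=%s", label))
	// prints "[remote_ip=hash(98.248.40.103)] [local_ip=192.168.0.1]"
}
```

In this shape the input to the function and the formatting of its output are controlled by separate arguments, so the reader does not have to infer which parts of the replacement string were hashed.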

TylerHelmuth · 2024-01-26T21:20:41Z

@rnishtala-sumo I totally forgot about #27686, thanks for bringing that up. Let's revive that PR with all the other changes/tests that have been made to the functions and see what it looks like. I'm really curious to see how that solution lets us format the output of the optional function.

Also looking at this scenario, I am totally more in favor of Option 2 over Option 1. Option 2 is definitely easier for me to comprehend than Option 1 since I don't have to do any mental math on what is being hashed. @evan-bradley I can also see how currying functions together is maybe even clearer.

rnishtala-sumo · 2024-01-29T20:33:31Z

@TylerHelmuth @evan-bradley Couldn't reopen the same PR, so created a new one for option 2 using the same feature branch.

#30837

github-actions · 2024-02-13T05:19:23Z

This PR was marked stale due to lack of activity. It will be closed in 14 days.

github-actions · 2024-02-27T05:19:32Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

github-actions bot added the pkg/ottl label Jan 10, 2024

rnishtala-sumo force-pushed the hashed-repl-fmt branch 2 times, most recently from bcfead8 to 06f6ecc Compare January 16, 2024 23:07

rnishtala-sumo marked this pull request as ready for review January 16, 2024 23:07

rnishtala-sumo requested review from TylerHelmuth, bogdandrutu and evan-bradley as code owners January 16, 2024 23:07

rnishtala-sumo requested a review from a team January 16, 2024 23:07

github-actions bot assigned dmitryax Jan 16, 2024

rnishtala-sumo force-pushed the hashed-repl-fmt branch from 06f6ecc to d7c6cba Compare January 17, 2024 00:09

rnishtala-sumo changed the title ~~Only hash the capture group in the replacement string if specified~~ Supports hashing multiple capture groups in a replacement string Jan 17, 2024

rnishtala-sumo mentioned this pull request Jan 17, 2024

[pkg/ottl] Allow users to specify the format of the hashed replacement string in the replace_pattern* editors #27820

Closed

rnishtala-sumo changed the title ~~Supports hashing multiple capture groups in a replacement string~~ Supports hashing multiple capture groups in a replacement string with prefix/suffix Jan 17, 2024

rnishtala-sumo force-pushed the hashed-repl-fmt branch from d7c6cba to 2f346ac Compare January 17, 2024 13:54

github-actions bot requested a review from kentquirk January 17, 2024 13:54

djaglowski changed the title ~~Supports hashing multiple capture groups in a replacement string with prefix/suffix~~ [OTTL] Supports hashing multiple capture groups in a replacement string with prefix/suffix Jan 17, 2024

rnishtala-sumo force-pushed the hashed-repl-fmt branch from 2f346ac to 57cd617 Compare January 17, 2024 17:52

TylerHelmuth reviewed Jan 17, 2024


rnishtala-sumo force-pushed the hashed-repl-fmt branch from 57cd617 to f87155d Compare January 18, 2024 14:27

rnishtala-sumo force-pushed the hashed-repl-fmt branch from 86eede7 to 994b9aa Compare January 18, 2024 22:40

rnishtala-sumo added 2 commits January 18, 2024 17:41

Only hash the capture group in the replacement string

979fede

Support hashing multiple capture groups in the replacement string

60f2970

rnishtala-sumo force-pushed the hashed-repl-fmt branch 2 times, most recently from 7178973 to fc72d6f Compare January 18, 2024 22:50

rnishtala-sumo force-pushed the hashed-repl-fmt branch from fc72d6f to 452d7d6 Compare January 18, 2024 22:55

rnishtala-sumo requested a review from TylerHelmuth January 22, 2024 15:49

TylerHelmuth reviewed Jan 23, 2024


rnishtala-sumo force-pushed the hashed-repl-fmt branch 3 times, most recently from 8a8a620 to 7280a1c Compare January 23, 2024 22:57

Include an option argument to revert to old hashing behavior

8600fc1

rnishtala-sumo force-pushed the hashed-repl-fmt branch from 7280a1c to 8600fc1 Compare January 23, 2024 23:04

rnishtala-sumo requested a review from TylerHelmuth January 25, 2024 15:11

TylerHelmuth mentioned this pull request Jan 29, 2024

Adding an optional replacementFormat argument to the replace_pattern editors that specified the format of the replacement string #30837

Merged

github-actions bot added the Stale label Feb 13, 2024

github-actions bot closed this Feb 27, 2024
