[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

pmoravec · 2024-08-01T13:30:20Z

User story: cleaner runs for three days. The cause is it cleans postgres logs like:

2024-07-29 13:00:23 CEST LOG:  duration: 1195.895 ms  execute <unnamed>: SELECT "katello_rpms"."id", "katello_rpms"."pulp_id" FROM "katello_rpms" WHERE (pulp_id in ('/pulp/api/v3/content/rpm/packages/0190fe17-d1a6-7066-aa33-e7422232031f/','/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/','/pulp/api/v3/content/rpm/packages/0190fe17-d1b2-7377-bb34-47a9850ecb44/',..

with a list of >10k UUIDs - and MAC cleaner treats all the UUIDs as IPV6_REG_4HEX address (cf https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/mac_parser.py#L23-L26):

>>> IPV6_REG_4HEX = (
...     r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
...     r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
... )
>>> import re
>>> re.search(IPV6_REG_4HEX, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/')
<re.Match object; span=(37, 58), match='0fe17-d1ae-7091-a034-'>
>>>

There are three issues here:

we should not allow a character before the match. I.e. the negative lookbehind (?<!([0-9a-fA-F\'\"]:)|::) must be enhanced to e.g. (?<!([0-9a-fA-F\'\"]:)|::|[0-9a-fA-F]) (or to (?<!([0-9a-fA-F\'\"]:)|::|\w)?
the same applies to end of match: the string evidently continues after the match, so the match is just a substring of some other "word" that can't be an IPv6 address (due to other reasons as too many chars in one token 22ade2121dc0) - no big idea how to modify this (maybe remove the |\- if - is used as sub-address separator?)
sadly we must support - as sub-addresses separator (shouldnt we support dot also, per https://www.geeksforgeeks.org/mac-address-in-computer-network/#format-of-mac-address?) - but can't we 1) enforce just one type of separator is used, no mixed ones, 2) no separator just before or just after the match (cf previous point)?

I was also thinking "aren't we reinventing wheel?" but I haven't found an ultimate regular expression for MAC address :-o (do I use DuckDuckGo wrong?).

EDIT: why not "just" extend https://www.geeksforgeeks.org/how-to-validate-mac-address-using-regular-expression/ :

'^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})|([0-9a-fA-F]{4}\\.[0-9a-fA-F]{4}\\.[0-9a-fA-F]{4})$'

"just" with negative lookbehind and negative "look forward"? Why the three formats..?

The text was updated successfully, but these errors were encountered:

TurboTurtle · 2024-08-01T14:37:56Z

MAC addresses have instilled in me a new level of loathing for regex.

I was also thinking "aren't we reinventing wheel?"

A bit, yes. The issue is largely that the stock examples commonly found assume that the string to be matched against is a discreet unit having a straight forward comparison to make. What I mean by this practically can be seen by looking at your example source - the test strings are nice and simple cases, but what we found originally when implementing cleaner many years ago, is that these regexes largely fall on their face when given log content.

Logs will have the nice and simple use case of a MAC address or similar separated by spaces, but they also have:

MAC addresses logged as parameters to API endpoints
Addresses embedded in user-supplied data that may or may not fit the stock regexes
Addresses brackets by all manner of non-space characters

etc...etc... you get the idea.

Why the three formats..?

IPv6 mac addresses have 2 accepted formats, aa:bb:cc:dd:ee:ff::ab:cd and aabb:ccdd:eeff:abcd. IPv4 mac addresses (aa:bb:cc:dd:ee:ff) can be found as substrings of IPv6 addresses of the first format, but we don't want to match on those.

But can't we 1) enforce just one type of separator is used, no mixed ones

We could, and then we'd likely have things slipping through the cracks and end users (rightfully) complaining that we're not obfuscating everything.

If the extended look-behinds don't interfere with our ability to obfuscate addresses that aren't simply bracketed by spaces, for example my log message has value=aa:bb:cc:dd:ee:ff:ab:cd, then by all means let's improve this.

pierg75 · 2024-08-05T09:23:09Z

When we talk about the IPv6 mac addresses, do we mean the MAC address included in the IPv6 address?
If this is the case (which it shouldn't be applied anymore because of https://datatracker.ietf.org/doc/html/rfc7217), then shouldn't the IPv6 regex take care of that?

pmoravec · 2024-08-07T12:53:44Z

I tried to come up with some regexp that will cover all my points (dont match UUIDs, force same separator at all positions) but I lack specification of the IPv6 mac addresses - or specification what we should obfuscate and what not.

So I wrote a simple script with test cases with strings that should (or not) be obfuscated:

#!/usr/bin/python3
import re, sys

IPV6_REG_4HEX_OLD = (
    r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
    r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
)

IPV6_REG_4HEX_NEW = (
        r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
        r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)

IPV6_REG_4HEX = IPV6_REG_4HEX_OLD if (len(sys.argv)>1 and sys.argv[1]=="old") else IPV6_REG_4HEX_NEW

test_strings = (
        (False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
        (False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
        (True, 'aabb:ccee:ddee:ffaa'),
        (True, 'line=aabb:ccee:ddee:ffaa'),
        (True, 'line=aabb:ccee:ddee:ffaa, something else'),
        (True, 'address:aabb:ccee:ddee:ffaa'),
        (True, 'addr=[aabb:ccee:ddee:ffaa]'),
        (True, 'addr="aabb:ccee:ddee:ffaa"'),
        (True, "addr='aabb:ccee:ddee:ffaa'"),
        (True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
        (True, 'line=aabb:ccee:ddee:ffaa-should still match'),
        (False, 'not aabb:ccee:ddee:ffaaaaaa'),
#        (, ''),
#        (, ''),
#        (, ''),
)

intro=f"Matching against {IPV6_REG_4HEX}"
print(intro)
print("-"*len(intro))
print()

for match, line in test_strings:
    search = re.search(IPV6_REG_4HEX, line)
    print(f"{line}\t:\t\t{search}")
    if (None!=search) != match:
        print(f"\t\tERROR {line} should {'' if match else 'not '}match, but re.search={search}")

When running with argument old, current RE is used. When running without an argument, the new RE:

IPV6_REG_4HEX_NEW = (
        r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
        r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)

is used.

Please show me examples where the new RE does not work as you expect - in either way.

TurboTurtle · 2024-08-08T16:53:00Z

logged_mac:aabb:ccee:ddee:ffaa=some_other_data	:		None
		ERROR logged_mac:aabb:ccee:ddee:ffaa=some_other_data should match, but re.search=None

To be fair, it appears our current regex also fails here (which I thought we had previously fixed) - but this is the kind of logging that I'm referring to. Where there are no spaces breaking up the mac addr from the rest of the string.

Also,

logged_mac:aabb:ccee:ddee:ffaa some_other_data	:		None
		ERROR logged_mac:aabb:ccee:ddee:ffaa some_other_data should match, but re.search=None

This fails when there is a trailing space.

pmoravec · 2024-08-08T20:40:09Z

Current cleaner fails to identify either MAC address, now. SO at least there is no regression :) In both cases, the key problem is the prefix before the address.

OK, I will come up with something better. But why the current R.E. even has the "MAC address can't follow after :: or after [0-9a-fA-F\'\"]:" condition? I just tried to respect this, to prevent regression. What strings (starting by e.g. :: followed by "looks-like MAC address") do not contain a MAC address? Does e.g. ::aabb:ccee:ddee:ffaa contain a MAC address? Current regexp does not think so.

So what can not be before the match?

pmoravec · 2024-08-23T12:46:43Z

What about simple:

IPV6_REG_4HEX_NEW = (
        r'((?<!\w)(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?!\w))|'
        r'((?<!\w)(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?!\w))'
)

? I.e. disallow just word characters prior and after the MAC address.

All these tests passed (I added a few at the end):

test_strings = (
    (False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
    (False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
    (True, 'aabb:ccee:ddee:ffaa'),
    (True, 'line=aabb:ccee:ddee:ffaa'),
    (True, 'line=aabb:ccee:ddee:ffaa, something else'),
    (True, 'address:aabb:ccee:ddee:ffaa'),
    (True, 'addr=[aabb:ccee:ddee:ffaa]'),
    (True, 'addr="aabb:ccee:ddee:ffaa"'),
    (True, "addr='aabb:ccee:ddee:ffaa'"),
    (True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
    (True, 'line=aabb:ccee:ddee:ffaa-should still match'),
    (False, 'not aabb:ccee:ddee:ffaaaaaa'),
    (True, 'logged_mac:aabb:ccee:ddee:ffaa=some_other_data'),
    (True, 'logged_mac:aabb:ccee:ddee:ffaa some_other_data'),
    (True, '::aabb:ccee:ddee:ffaa'),
    (True, 'we should obfuscate addr: aabb:ccee:ddee:ffaa: YES'),
)

TurboTurtle · 2024-08-25T15:04:30Z

LGTM, let's get that PR open :)

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3767 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

pmoravec added a commit to pmoravec/sos that referenced this issue Aug 29, 2024

[cleaner] Improve IPV6 MAC address RE

b0f5530

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3767 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

pmoravec mentioned this issue Aug 29, 2024

[cleaner] Improve IPV6 MAC address RE #3766

Open

6 tasks

pmoravec added a commit to pmoravec/sos that referenced this issue Aug 29, 2024

[cleaner] Improve IPV6 MAC address RE

3d76300

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

pmoravec added a commit to pmoravec/sos that referenced this issue Aug 29, 2024

[cleaner] Improve IPV6 MAC address RE

3b6a1fe

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024

[cleaner] Improve IPV6 MAC address RE

a9b0161

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024

[cleaner] Improve IPV6 MAC address RE

cc91383

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024

[cleaner] Improve IPV6 MAC address RE

94c81c6

Current RE obfuscates also UUID strings, let be more strict. Resolves: sosreport#3766 Relevant: sosreport#3736 Signed-off-by: Pavel Moravec <pmoravec@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

pmoravec commented Aug 1, 2024 •

edited

Loading

TurboTurtle commented Aug 1, 2024

pierg75 commented Aug 5, 2024

pmoravec commented Aug 7, 2024

TurboTurtle commented Aug 8, 2024

pmoravec commented Aug 8, 2024

pmoravec commented Aug 23, 2024

TurboTurtle commented Aug 25, 2024

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

Comments

pmoravec commented Aug 1, 2024 • edited Loading

TurboTurtle commented Aug 1, 2024

pierg75 commented Aug 5, 2024

pmoravec commented Aug 7, 2024

TurboTurtle commented Aug 8, 2024

pmoravec commented Aug 8, 2024

pmoravec commented Aug 23, 2024

TurboTurtle commented Aug 25, 2024

pmoravec commented Aug 1, 2024 •

edited

Loading