Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

Open
pmoravec opened this issue Aug 1, 2024 · 7 comments
Open

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

pmoravec opened this issue Aug 1, 2024 · 7 comments

Comments

@pmoravec
Copy link
Contributor

pmoravec commented Aug 1, 2024

User story: cleaner runs for three days. The cause is it cleans postgres logs like:

2024-07-29 13:00:23 CEST LOG:  duration: 1195.895 ms  execute <unnamed>: SELECT "katello_rpms"."id", "katello_rpms"."pulp_id" FROM "katello_rpms" WHERE (pulp_id in ('/pulp/api/v3/content/rpm/packages/0190fe17-d1a6-7066-aa33-e7422232031f/','/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/','/pulp/api/v3/content/rpm/packages/0190fe17-d1b2-7377-bb34-47a9850ecb44/',..

with a list of >10k UUIDs - and MAC cleaner treats all the UUIDs as IPV6_REG_4HEX address (cf https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/mac_parser.py#L23-L26):

>>> IPV6_REG_4HEX = (
...     r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
...     r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
... )
>>> import re
>>> re.search(IPV6_REG_4HEX, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/')
<re.Match object; span=(37, 58), match='0fe17-d1ae-7091-a034-'>
>>> 

There are three issues here:

  • we should not allow a character before the match. I.e. the negative lookbehind (?<!([0-9a-fA-F\'\"]:)|::) must be enhanced to e.g. (?<!([0-9a-fA-F\'\"]:)|::|[0-9a-fA-F]) (or to (?<!([0-9a-fA-F\'\"]:)|::|\w)?
  • the same applies to end of match: the string evidently continues after the match, so the match is just a substring of some other "word" that can't be an IPv6 address (due to other reasons as too many chars in one token 22ade2121dc0) - no big idea how to modify this (maybe remove the |\- if - is used as sub-address separator?)
  • sadly we must support - as sub-addresses separator (shouldnt we support dot also, per https://www.geeksforgeeks.org/mac-address-in-computer-network/#format-of-mac-address?) - but can't we 1) enforce just one type of separator is used, no mixed ones, 2) no separator just before or just after the match (cf previous point)?

I was also thinking "aren't we reinventing wheel?" but I haven't found an ultimate regular expression for MAC address :-o (do I use DuckDuckGo wrong?).

EDIT: why not "just" extend https://www.geeksforgeeks.org/how-to-validate-mac-address-using-regular-expression/ :

'^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})|([0-9a-fA-F]{4}\\.[0-9a-fA-F]{4}\\.[0-9a-fA-F]{4})$'

"just" with negative lookbehind and negative "look forward"? Why the three formats..?

@TurboTurtle
Copy link
Member

MAC addresses have instilled in me a new level of loathing for regex.

I was also thinking "aren't we reinventing wheel?"

A bit, yes. The issue is largely that the stock examples commonly found assume that the string to be matched against is a discreet unit having a straight forward comparison to make. What I mean by this practically can be seen by looking at your example source - the test strings are nice and simple cases, but what we found originally when implementing cleaner many years ago, is that these regexes largely fall on their face when given log content.

Logs will have the nice and simple use case of a MAC address or similar separated by spaces, but they also have:

  • MAC addresses logged as parameters to API endpoints
  • Addresses embedded in user-supplied data that may or may not fit the stock regexes
  • Addresses brackets by all manner of non-space characters

etc...etc... you get the idea.

Why the three formats..?

IPv6 mac addresses have 2 accepted formats, aa:bb:cc:dd:ee:ff::ab:cd and aabb:ccdd:eeff:abcd. IPv4 mac addresses (aa:bb:cc:dd:ee:ff) can be found as substrings of IPv6 addresses of the first format, but we don't want to match on those.

But can't we 1) enforce just one type of separator is used, no mixed ones

We could, and then we'd likely have things slipping through the cracks and end users (rightfully) complaining that we're not obfuscating everything.

If the extended look-behinds don't interfere with our ability to obfuscate addresses that aren't simply bracketed by spaces, for example my log message has value=aa:bb:cc:dd:ee:ff:ab:cd, then by all means let's improve this.

@pierg75
Copy link
Contributor

pierg75 commented Aug 5, 2024

When we talk about the IPv6 mac addresses, do we mean the MAC address included in the IPv6 address?
If this is the case (which it shouldn't be applied anymore because of https://datatracker.ietf.org/doc/html/rfc7217), then shouldn't the IPv6 regex take care of that?

@pmoravec
Copy link
Contributor Author

pmoravec commented Aug 7, 2024

I tried to come up with some regexp that will cover all my points (dont match UUIDs, force same separator at all positions) but I lack specification of the IPv6 mac addresses - or specification what we should obfuscate and what not.

So I wrote a simple script with test cases with strings that should (or not) be obfuscated:

#!/usr/bin/python3
import re, sys

IPV6_REG_4HEX_OLD = (
    r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
    r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
)

IPV6_REG_4HEX_NEW = (
        r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
        r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)

IPV6_REG_4HEX = IPV6_REG_4HEX_OLD if (len(sys.argv)>1 and sys.argv[1]=="old") else IPV6_REG_4HEX_NEW

test_strings = (
        (False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
        (False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
        (True, 'aabb:ccee:ddee:ffaa'),
        (True, 'line=aabb:ccee:ddee:ffaa'),
        (True, 'line=aabb:ccee:ddee:ffaa, something else'),
        (True, 'address:aabb:ccee:ddee:ffaa'),
        (True, 'addr=[aabb:ccee:ddee:ffaa]'),
        (True, 'addr="aabb:ccee:ddee:ffaa"'),
        (True, "addr='aabb:ccee:ddee:ffaa'"),
        (True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
        (True, 'line=aabb:ccee:ddee:ffaa-should still match'),
        (False, 'not aabb:ccee:ddee:ffaaaaaa'),
#        (, ''),
#        (, ''),
#        (, ''),
)

intro=f"Matching against {IPV6_REG_4HEX}"
print(intro)
print("-"*len(intro))
print()

for match, line in test_strings:
    search = re.search(IPV6_REG_4HEX, line)
    print(f"{line}\t:\t\t{search}")
    if (None!=search) != match:
        print(f"\t\tERROR {line} should {'' if match else 'not '}match, but re.search={search}")

When running with argument old, current RE is used. When running without an argument, the new RE:

IPV6_REG_4HEX_NEW = (
        r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
        r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)

is used.

Please show me examples where the new RE does not work as you expect - in either way.

@TurboTurtle
Copy link
Member

logged_mac:aabb:ccee:ddee:ffaa=some_other_data	:		None
		ERROR logged_mac:aabb:ccee:ddee:ffaa=some_other_data should match, but re.search=None

To be fair, it appears our current regex also fails here (which I thought we had previously fixed) - but this is the kind of logging that I'm referring to. Where there are no spaces breaking up the mac addr from the rest of the string.

Also,

logged_mac:aabb:ccee:ddee:ffaa some_other_data	:		None
		ERROR logged_mac:aabb:ccee:ddee:ffaa some_other_data should match, but re.search=None

This fails when there is a trailing space.

@pmoravec
Copy link
Contributor Author

pmoravec commented Aug 8, 2024

Current cleaner fails to identify either MAC address, now. SO at least there is no regression :) In both cases, the key problem is the prefix before the address.

OK, I will come up with something better. But why the current R.E. even has the "MAC address can't follow after :: or after [0-9a-fA-F\'\"]:" condition? I just tried to respect this, to prevent regression. What strings (starting by e.g. :: followed by "looks-like MAC address") do not contain a MAC address? Does e.g. ::aabb:ccee:ddee:ffaa contain a MAC address? Current regexp does not think so.

So what can not be before the match?

@pmoravec
Copy link
Contributor Author

What about simple:

IPV6_REG_4HEX_NEW = (
        r'((?<!\w)(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?!\w))|'
        r'((?<!\w)(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?!\w))'
)

? I.e. disallow just word characters prior and after the MAC address.

All these tests passed (I added a few at the end):

test_strings = (
    (False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
    (False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
    (True, 'aabb:ccee:ddee:ffaa'),
    (True, 'line=aabb:ccee:ddee:ffaa'),
    (True, 'line=aabb:ccee:ddee:ffaa, something else'),
    (True, 'address:aabb:ccee:ddee:ffaa'),
    (True, 'addr=[aabb:ccee:ddee:ffaa]'),
    (True, 'addr="aabb:ccee:ddee:ffaa"'),
    (True, "addr='aabb:ccee:ddee:ffaa'"),
    (True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
    (True, 'line=aabb:ccee:ddee:ffaa-should still match'),
    (False, 'not aabb:ccee:ddee:ffaaaaaa'),
    (True, 'logged_mac:aabb:ccee:ddee:ffaa=some_other_data'),
    (True, 'logged_mac:aabb:ccee:ddee:ffaa some_other_data'),
    (True, '::aabb:ccee:ddee:ffaa'),
    (True, 'we should obfuscate addr: aabb:ccee:ddee:ffaa: YES'),
)

@TurboTurtle
Copy link
Member

LGTM, let's get that PR open :)

pmoravec added a commit to pmoravec/sos that referenced this issue Aug 29, 2024
Current RE obfuscates also UUID strings, let be more strict.

Resolves: sosreport#3767
Relevant: sosreport#3736

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
pmoravec added a commit to pmoravec/sos that referenced this issue Aug 29, 2024
Current RE obfuscates also UUID strings, let be more strict.

Resolves: sosreport#3766
Relevant: sosreport#3736

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
pmoravec added a commit to pmoravec/sos that referenced this issue Aug 29, 2024
Current RE obfuscates also UUID strings, let be more strict.

Resolves: sosreport#3766
Relevant: sosreport#3736

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024
Current RE obfuscates also UUID strings, let be more strict.

Resolves: sosreport#3766
Relevant: sosreport#3736

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024
Current RE obfuscates also UUID strings, let be more strict.

Resolves: sosreport#3766
Relevant: sosreport#3736

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
pmoravec added a commit to pmoravec/sos that referenced this issue Sep 24, 2024
Current RE obfuscates also UUID strings, let be more strict.

Resolves: sosreport#3766
Relevant: sosreport#3736

Signed-off-by: Pavel Moravec <pmoravec@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants