Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multiple regex matches #62

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jplitza
Copy link

@jplitza jplitza commented Jan 4, 2022

The current implementation uses re.match and hence only replaces groups in the first match of the regexp. This PR changes that behavior to instead use re.finditer to replace all groups in all matches.

This actually makes it possible/easier to match all possible IP addresses (without depending on their surrounding context) by a user-supplied regexp. See the added testcase for an example regex that matches much more than IP addresses, but in particular also IP addresses (which is unproblematic if replace == None since the real verification takes place using the ipaddress module).

Also, the access log line in my newly added test case, with the [ip:1.2.3.4] at the end, actually appeared in an access log, with different IP addresses in the beginning and at the end!

This PR also closes #61 by using match group's indices for the replacement instead of str.replace

N.B.: I haven't verified that all newly used methods and features (in particular re.Match.start(), re.Match.end() and enumerate(start=1)) are available in all supported Python versions.

N.B.: I haven't evaluated the performance impact of this change.

(Rant: The choice to replace only groups instead of using look-aheads and look-behinds in the pattern made this change quite a bit more cumbersome, but I'm 10 days too late to change that I guess)

@jplitza
Copy link
Author

jplitza commented Jan 10, 2022

After rolling this out to multiple servers, I learned two things:

  • The initial version wasn't compatible with Python 3.5 (installed on Ubuntu 16.04). Should be fixed now.

  • Using a regex that matches too broadly not only causes heaps of warnings ("this doesn't look like an IPv4 or IPv6 address"), but also degrades performance. So I'm now using this quite specific regex for our deployment:

      --regex '(?a)((?:(?:25[0-5]|(?:2[0-4]|1[0-9]|[1-9])?[0-9])\.){3}(?:25[0-5]|(?:2[0-4]|1[0-9]|[1-9])?[0-9])|(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}|:(?:(?::[0-9a-fA-F]{1,4}){1,6})|[0-9a-fA-F]{1,4}:(?:(?::[0-9a-fA-F]{1,4}){1,5})|(?:[0-9a-fA-F]{1,4}:){1,2}(?::[0-9a-fA-F]{1,4}){1,4}|(?:[0-9a-fA-F]{1,4}:){1,3}(?::[0-9a-fA-F]{1,4}){1,3}|(?:[0-9a-fA-F]{1,4}:){1,4}(?::[0-9a-fA-F]{1,4}){1,2}|(?:[0-9a-fA-F]{1,4}:){1,5}(?::[0-9a-fA-F]{1,4})|(?:[0-9a-fA-F]{1,4}:){1,7}:)(?![\.:]?[0-9a-fA-F])'
    

    This expression is based on this and this, with comments of the latter and some more modifications applied, and reduced to what is actually found in logs. This now processes approx. 400k lines per minute on my test machine, vs. 100k with the too broad match.

    Maybe this could even become the default if --regex is passed without a custom regex.

By using the match group's indices, this also avoids replacing unrelated
chunks of the line.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Too broad regex replacement
1 participant