Rule idea: Unnecessary use of `re` #12283

JelleZijlstra · 2024-07-10T21:22:04Z

I recently saw someone write re.sub('\\\$', '', s) when they could have just written s.replace('$', ''). Similarly, re.match("abc", s) could be replaced with s.startswith("abc"), re.search("abc", s) could be replaced with "abc" in s, re.fullmatch("abc", s) could be replaced with s == "abc", re.split("abc", s) could be replaced with s.split("abc").

Why is it better to use the str methods directly? It makes the code simpler and therefore easier to understand, may require less escaping, and it's probably usually faster.

Some thoughts on what the implementation could look like: We'd have to inspect the pattern string and ensure it doesn't contain any characters that are special for the regex engine, other than \ (for escaping) and in some cases ^ and $ for the beginning and end of the string. The implementation would also need to check whether any other flags are passed to the regex function (e.g., re.IGNORECASE would mean that a regex that contains alphabetical characters is no longer replaceable). There are other subtleties that I haven't fully thought through; for example, $ also matches right before a newline at the end of a string. An initial implementation could limit itself to simpler cases, and leave possible more complex cases for later.

The text was updated successfully, but these errors were encountered:

tdulcet · 2024-12-01T12:50:21Z

Here are some slightly more complex examples not supported by the initial implementation in #14659:

re.search(r"abc", s) is None       # ⇒ "abc" not in s
re.search(r"abc", s) is not None   # ⇒ "abc" in s
re.sub(r"abc", "", s, count=1)     # ⇒ s.replace("abc", "", count=1)
re.split(r"abc", s, maxsplit=1)    # ⇒ s.split("abc", maxsplit=1)
re.search(r"^abc", s)              # ⇒ s.startswith("abc")
re.search(r"abc$", s)              # ⇒ s.endswith("abc")
re.search(r"^abc$", s)             # ⇒ s == "abc"
re.match(r"abc|xyz|123", s)        # ⇒ s.startswith(("abc", "xyz", "123"))
re.search(r"^abc|^xyz|^123", s)
re.search(r"^(?:abc|xyz|123)", s)
re.search(r"abc$|xyz$|123$", s)    # ⇒ s.endswith(("abc", "xyz", "123"))
re.search(r"(?:abc|xyz|123)$", s)
re.fullmatch(r"abc|xyz|123", s)    # ⇒ s in {"abc", "xyz", "123"}
re.search(r"^(?:abc|xyz|123)$", s)
re.match(r"[abc]", s)              # ⇒ s.startswith(("a", "b", "c"))
re.search(r"[abc]$", s)            # ⇒ s.endswith(("a", "b", "c"))
re.fullmatch(r"[abc]", s)          # ⇒ s in {"a", "b", "c"}
# Handle the ignore case flag
re.match(r"AbC", s, re.I)          # ⇒ s.casefold().startswith("abc")
re.search(r"AbC", s, re.I)         # ⇒ "abc" in s.casefold()
re.fullmatch(r"AbC", s, re.I)      # ⇒ "abc" == s.casefold()
# Handle any escaped special characters
re.fullmatch(r"""!"\#\$%\&'\(\)\*\+,\-\./:;<=>\?@\[\\\]\^_`\{\|\}\~""", s) # ⇒ s == """!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~"""

Edit: The .endswith() examples would be unsafe, per #12283 (comment).

JelleZijlstra · 2024-12-01T15:42:50Z

Great to see that the rule was implemented!

re.search(r"abc$", s) # ⇒ s.endswith("abc")

This is not equivalent:

>>> s = "abc\n"
>>> re.search(r"abc$", s)
<re.Match object; span=(0, 3), match='abc'>
>>> s.endswith("abc")
False

I'm also not sure about the re.I ones; are those transformations correct in the presence of exotic casing conventions like in Turkish?

And I'm not sure the replacement is worth it in cases where the replacement is a little more complex (e.g., passing a tuple to startswith). The goal of the rule should be to make code simpler; with those transformations, that may not be the case any more.

tdulcet · 2024-12-01T16:48:15Z

This is not equivalent

Thanks. I edited my comment to note that they would be unsafe. If necessary, assuming Python 3.9+, one could always do something like s.removesuffix("\n").endswith("abc").

I'm also not sure about the re.I ones; are those transformations correct in the presence of exotic casing conventions like in Turkish?

I believe these would be safe as long as the user is not using the re.L/re.LOCALE flag, as otherwise both the re library and str.lower() should be using the same standard Unicode conventions. However, I changed my examples to use str.casefold() for good measure.

MichaReiser · 2024-12-02T17:29:48Z

I created a new issue to track the scope extension. I'll close this issue because the rule itself and what's been asked in the original issue has been implemented as RUF055

MichaReiser added the rule Implementing or modifying a lint rule label Jul 10, 2024

AlexWaygood added help wanted Contributions especially welcome accepted Ready for implementation labels Nov 24, 2024

ntBre mentioned this issue Nov 28, 2024

[ruff] Implement unnecessary-regular-expression (RUF055) #14659

Merged

MichaReiser mentioned this issue Dec 2, 2024

Extend RUF055 with more patterns #14738

Open

MichaReiser closed this as completed Dec 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rule idea: Unnecessary use of `re` #12283

Rule idea: Unnecessary use of `re` #12283

JelleZijlstra commented Jul 10, 2024

tdulcet commented Dec 1, 2024 •

edited

Loading

JelleZijlstra commented Dec 1, 2024

tdulcet commented Dec 1, 2024

MichaReiser commented Dec 2, 2024

Rule idea: Unnecessary use of re #12283

Rule idea: Unnecessary use of re #12283

Comments

JelleZijlstra commented Jul 10, 2024

tdulcet commented Dec 1, 2024 • edited Loading

JelleZijlstra commented Dec 1, 2024

tdulcet commented Dec 1, 2024

MichaReiser commented Dec 2, 2024

Rule idea: Unnecessary use of `re` #12283

Rule idea: Unnecessary use of `re` #12283

tdulcet commented Dec 1, 2024 •

edited

Loading