[`ruff`] Extend unnecessary-regular-expression to non-literal strings (`RUF055`) #14679

ntBre · 2024-11-29T14:52:50Z

Summary

This is a follow-up to #14659 to try to resolve variable bindings for the pattern argument in re methods like sub and match. The rule currently only matches string literals, but these changes enable detection of patterns like this:

import re

pat1 = "xyz"
pat2 = pat1
pat3 = pat2

re.sub(pat3, "", s)

For sub specifically, it also handles non-literal repl arguments, which also have to be strings for the suggested fix to be valid.

Test Plan

cargo test with a new snapshot test based on the example above.

everything shifted down after wrapping the docstring

This reverts commit efcc4cf.

github-actions · 2024-11-29T14:59:49Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs

MichaReiser

Neat

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs

AlexWaygood

Nice!

AlexWaygood · 2024-11-29T17:12:40Z

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs

+                // make sure repl can be resolved to a string literal
+                resolve_string_literal(repl, semantic)?;


If I understand correctly, I think the need here is slightly different to the need on lines 95-96. On lines 95-96, we do need to know the value of the string in order to be able to check it doesn't have any metacharacters in it (so only a string literal will do, or something that we can resolve to a string literal). But here, we just need to know it's a string; any string will do, as long as the user isn't passing in a function.

Is that the case? If so, it might be worth adding back the is_str function you added in efcc4cf and using that here, rather than using resolve_string_literal in both places. The advantage of the is_str technique is that it also understands basic type hints, e.g. it would understand that re.sub() is being passed a string for the repl argument in something like this:

import re def foo(input_str: str, repl: str): re.sub("foobar", repl, input_str)

Now that I look at it, maybe we do need to know (and analyze) the value of the repl string for a fully accurate analysis, though. For example, it seems like the initial version of the check that we merged yesterday emits a false-positive diagnostic (and incorrect autofix) for this:

import re re.sub(r"a", r"\g<0>\g<0>\g<0>", "a")

Now, this is a massive edge case -- I had to work quite hard to find it! I believe the only way you get a false positive with the rule's current logic is if there's a \g in the replacement string but no backslashes or metacharacters in the pattenr string, and it's almost impossible to think of a way you could plausibly have a re.sub() call with those characteristics. So maybe we shouldn't worry about this -- I'm interested in your thoughts and @MichaReiser's!

Oh wow, good catch! I was working on adding is_str back in, but maybe instead I need to check for metacharacters in repl too.

I thought we were safe from backreferences by avoiding ( in the pattern, but I overlooked \g<0>. That exact sequence seems like the only way to trigger this behavior?

I thought we were safe from backreferences by avoiding ( in the pattern, but I overlooked \g<0>. That exact sequence seems like the only way to trigger this behavior?

I think so, yes! Although we also emit a RUF055 diagnostic on invalid re.sub() calls like this, and maybe we should just ignore them? It feels like it might be outside of this rule's purview to autofix invalid re.sub() calls into valid str.replace() calls. We probably don't really know what the user intended exactly if the re.sub() call is invalid:

>>> import re >>> re.sub(r"a", r"\1", "a") Traceback (most recent call last): File "<python-input-12>", line 1, in <module> re.sub(r"a", r"\1", "a") ~~~~~~^^^^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 208, in sub return _compile(pattern, flags).sub(repl, string, count) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 377, in _compile_template return _sre.template(pattern, _parser.parse_template(repl, pattern)) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/_parser.py", line 1070, in parse_template addgroup(int(this[1:]), len(this) - 1) ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/_parser.py", line 1015, in addgroup raise s.error("invalid group reference %d" % index, pos) re.PatternError: invalid group reference 1 at position 1

I've added back is_str locally, along with your function argument test case. It's really nice to handle that case, but I'm a bit bothered by this edge case too, so I could go either way. I'm interested to hear which approach you and Micha think is best overall.

Why don't you push the version with is_str to this PR, and we can see if it results in any more ecosystem hits? That might give us some more data on how useful it is to be able to detect that the repl argument is a string from the function annotation

Hmm, it doesn't look like it adds any new ecosystem hits :/

I guess in that case, I'd vote for removing is_str again, and fixing the false positives on \g<0> and \1 in repl arguments.

Thanks for putting up with my pernickitiness here!

No problem, thanks for the thorough review! Should I reuse the other code to reject any metacharacters, or are references to named or numbered capture groups the only problems? I'm picturing checking for \ followed by g or 1 through 9. That seems a bit nicer than rejecting any metacharacter like I did for the patterns but possibly less safe.

I'm picturing checking for \ followed by g or 1 through 9. That seems a bit nicer than rejecting any metacharacter like I did for the patterns but possibly less safe.

I think actually we could check for \ followed by any ASCII character except one of abfnrtv. Other than 0-9 and g (which both have special behaviour in repl strings, as we've just been discussing!), I believe those are the only ASCII escapes that will be permitted in a repl string by re.sub(), Anything else causes re.PatternError to be raised -- meaning it's probably out of scope for us to emit this diagnostic on it:

>>> re.sub(r"a", r"\d", "a") Traceback (most recent call last): File "<python-input-13>", line 1, in <module> re.sub(r"a", r"\d", "a") ~~~~~~^^^^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 208, in sub return _compile(pattern, flags).sub(repl, string, count) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/__init__.py", line 377, in _compile_template return _sre.template(pattern, _parser.parse_template(repl, pattern)) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/Users/alexw/.pyenv/versions/3.13.0/lib/python3.13/re/_parser.py", line 1076, in parse_template raise s.error('bad escape %s' % this, len(this)) from None re.PatternError: bad escape \d at position 0

Let me know what you think about this version. If it looks good, it might be nice to reuse this escape check for pattern as well instead of rejecting \ entirely.

This reverts commit b48fb18.

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs

MichaReiser · 2024-12-02T09:11:00Z

This looks good to me and nice find @AlexWaygood

AlexWaygood

Thanks @ntBre!! I pushed some minor fixes to address some other edge cases identified by @dscorbett in #14757

* main: [`ruff`] Extend unnecessary-regular-expression to non-literal strings (`RUF055`) (#14679) Minor followups to RUF052 (#14755) [red-knot] Property tests (#14178) [red-knot] `is_subtype_of` fix for `KnownInstance` types (#14750) Improve docs for flake8-use-pathlib rules (#14741) [`ruff`] Implemented `used-dummy-variable` (`RUF052`) (#14611) [red-knot] Simplify tuples containing `Never` (#14744) Possible fix for flaky file watching test (#14543) [`flake8-import-conventions`] Improve syntax check for aliases supplied in configuration for `unconventional-import-alias (ICN001)` (#14745) [red-knot] Deeper understanding of `LiteralString` (#14649) red-knot: support narrowing for bool(E) (#14668) [`refurb`] Handle non-finite decimals in `verbose-decimal-constructor (FURB157)` (#14596) [red-knot] Re-enable linter corpus tests (#14736)

ntBre added 13 commits November 28, 2024 14:36

rename ruf055.py and snapshot

3dec810

add is_str

efcc4cf

resolve a sequence of assignments

84d090d

simplify lifetimes

6a69402

reuse find_binding_value

8f9a118

don't break literal strings

55cbafb

clippy

9154fe6

add test, include named expr assignment

8d2539a

move resolve_name back to free function and use to check sub repl

c17e9ab

test repl resolution

1817fb6

update snapshot

76b9bd8

everything shifted down after wrapping the docstring

Revert "add is_str" since it's unused

b48fb18

This reverts commit efcc4cf.

add back comment

f778523

MichaReiser reviewed Nov 29, 2024

View reviewed changes

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs Outdated Show resolved Hide resolved

simplify lifetimes

7668328

MichaReiser approved these changes Nov 29, 2024

View reviewed changes

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs Outdated Show resolved Hide resolved

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs Outdated Show resolved Hide resolved

MichaReiser added rule Implementing or modifying a lint rule preview Related to preview mode features labels Nov 29, 2024

ntBre added 2 commits November 29, 2024 10:22

rename to resolve_string_literal

16a5073

only resolve direct references, update test

a63fa11

AlexWaygood reviewed Nov 29, 2024

View reviewed changes

ntBre mentioned this pull request Nov 29, 2024

[ruff] Expand unnecessary-regular-expression with re.compile (RUF055) #14680

Open

ntBre added 5 commits November 29, 2024 12:19

Reapply "add is_str"

5ca3b1f

This reverts commit b48fb18.

use is_str for repl and test annotated function argument

e9a6384

revert is_str changes

41b52dd

check for escapes in repl

5afa545

update test case

adb14d6

MichaReiser reviewed Dec 2, 2024

View reviewed changes

crates/ruff_linter/src/rules/ruff/rules/unnecessary_regular_expression.rs Outdated Show resolved Hide resolved

dscorbett mentioned this pull request Dec 3, 2024

RUF055 ignores backslashes in the repl argument of re.sub #14757

Closed

Fix other edge cases

2097142

AlexWaygood approved these changes Dec 3, 2024

View reviewed changes

AlexWaygood merged commit 62e358e into astral-sh:main Dec 3, 2024
21 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[`ruff`] Extend unnecessary-regular-expression to non-literal strings (`RUF055`) #14679

[`ruff`] Extend unnecessary-regular-expression to non-literal strings (`RUF055`) #14679

ntBre commented Nov 29, 2024

github-actions bot commented Nov 29, 2024 •

edited

Loading

MichaReiser left a comment

AlexWaygood left a comment

AlexWaygood Nov 29, 2024

AlexWaygood Nov 29, 2024 •

edited

Loading

ntBre Nov 29, 2024

AlexWaygood Nov 29, 2024 •

edited

Loading

ntBre Nov 29, 2024

AlexWaygood Nov 29, 2024

AlexWaygood Nov 29, 2024

ntBre Nov 29, 2024

AlexWaygood Nov 29, 2024

ntBre Nov 29, 2024

MichaReiser commented Dec 2, 2024

AlexWaygood left a comment

		// make sure repl can be resolved to a string literal
		resolve_string_literal(repl, semantic)?;

[ruff] Extend unnecessary-regular-expression to non-literal strings (RUF055) #14679

[ruff] Extend unnecessary-regular-expression to non-literal strings (RUF055) #14679

Conversation

ntBre commented Nov 29, 2024

Summary

Test Plan

github-actions bot commented Nov 29, 2024 • edited Loading

ruff-ecosystem results

Linter (stable)

Linter (preview)

MichaReiser left a comment

Choose a reason for hiding this comment

AlexWaygood left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexWaygood Nov 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

AlexWaygood Nov 29, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

MichaReiser commented Dec 2, 2024

AlexWaygood left a comment

Choose a reason for hiding this comment

[`ruff`] Extend unnecessary-regular-expression to non-literal strings (`RUF055`) #14679

[`ruff`] Extend unnecessary-regular-expression to non-literal strings (`RUF055`) #14679

github-actions bot commented Nov 29, 2024 •

edited

Loading

`ruff-ecosystem` results

AlexWaygood Nov 29, 2024 •

edited

Loading

AlexWaygood Nov 29, 2024 •

edited

Loading