Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[refurb] Count codepoints not bytes for slice-to-remove-prefix-or-suffix (FURB188) #13631

Merged
merged 9 commits into from
Oct 7, 2024

Conversation

dylwil3
Copy link
Collaborator

@dylwil3 dylwil3 commented Oct 4, 2024

This PR fixes the calculation of string length for the purposes of verifying when to suggest removeprefix/removesuffix (FURB188). Before, we used text_len which was counting bytes rather than codepoints (chars) and therefore disagreed with Python's len for non-ASCII text.

Closes #13620

Copy link
Contributor

github-actions bot commented Oct 4, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

@dscorbett
Copy link

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, len("\ud800\udc00") == 2 whereas len("\U00010000") == 1. I don’t know how Ruff distinguishes those two example strings or if chars().count() matches what Python does.

@dylwil3
Copy link
Collaborator Author

dylwil3 commented Oct 4, 2024

Another subtlety worth testing is strings with surrogates. In Python, each surrogate counts as 1 and surrogate pairs are not special so they count as 2; for example, len("\ud800\udc00") == 2 whereas len("\U00010000") == 1. I don’t know how Ruff distinguishes those two example strings or if chars().count() matches what Python does.

TIL @dscorbett - neat! Added a test for this, and it appears to be handled correctly (I think this happens in the guts of the parser, so by the time I'm looking at string_val here, the subtleties have been smoothed out).

@dscorbett
Copy link

I think the reason it works is that Ruff’s representation of a Python string as a Rust string replaces surrogates with replacement characters. That is fine for counting the code points but could be a problem for other rules.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks. I only have two nit comments.

Comment on lines 338 to 339
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest converting to a u64 considering that you have to use usize::try_from anyways (for 32 bit platforms)

Suggested change
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
.and_then(ast::Int::as_u64)
.and_then(|x| usize::try_from(x).ok())

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or you could consider adding a as_usize method to ast::Int

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with the latter

// Only support prefix removal for size at most `u32::MAX`
.and_then(ast::Int::as_u32)
.and_then(|x| usize::try_from(x).ok())
.is_some_and(|x| x == string_val.to_str().chars().count()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.is_some_and(|x| x == string_val.to_str().chars().count()),
.is_some_and(|x| x == string_val.chars().count()),

@@ -370,7 +372,8 @@ fn affix_matches_slice_bound(data: &RemoveAffixData, semantic: &SemanticModel) -
value
.as_int()
.and_then(ast::Int::as_u32)
.is_some_and(|x| x == string_val.to_str().text_len().to_u32())
.and_then(|x| usize::try_from(x).ok())
.is_some_and(|x| x == string_val.to_str().chars().count())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.is_some_and(|x| x == string_val.to_str().chars().count())
.is_some_and(|x| x == string_val.chars().count())

@MichaReiser MichaReiser added fixes Related to suggested fixes for violations preview Related to preview mode features rule Implementing or modifying a lint rule and removed preview Related to preview mode features fixes Related to suggested fixes for violations labels Oct 7, 2024
@MichaReiser MichaReiser merged commit 14ee5db into astral-sh:main Oct 7, 2024
20 checks passed
@dylwil3 dylwil3 deleted the affix-unicode branch October 7, 2024 15:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
preview Related to preview mode features rule Implementing or modifying a lint rule
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FURB188 calculates lengths wrong for non-ASCII affixes
3 participants