Improve handling of builtin symbols in linter rules #10919
Ah, that's a regression: something can be imported explicitly from
Seems like this leads to a 1% perf regression overall, with regressions of 2-3% on some linter benchmarks: https://codspeed.io/astral-sh/ruff/branches/AlexWaygood:improve-builtins-handling I don't know whether that's worth it or not: it's less repetitive code for us and it's more principled in that we're handling the semantics of Python more accurately, but importing
Would you have expected ~no change here for benchmarks that don't import
Yeah. I would have hoped that the fast path from the Any idea what might be the cause of the slowdown?
No, I'm not sure. I looked through the changes again and my expectation would've been no change. You can try just pushing again and see if the benchmarks move in the other direction, since it might just be noise. I wouldn't read into the flamegraph diff though -- I think CodSpeed is just confused.
The other thing you might consider doing is benchmarking locally with Hyperfine. E.g., you could build a release build on main, stash it (
I'd suggest running it a few times and switching the order of
I'm getting similar results from running
If you run the micro-benchmarks locally, I suggest running them a few times consecutively for each PR. At least on my machine, the second run of the same binary is always significantly faster.
168f9b7
to
054f397
Compare
It seems like this PR is doing two separable things: improving detection of what is a builtin, and allowing auto-fixes that want to use a builtin to import it via Have you benchmarked those changes in isolation, to clarify which one the regression is coming from? IIUC we always eagerly generate fixes, so it seems like a change that allows more fixes in more cases could easily cause a regression?
Yeah, although we should only be generating more diagnostics (and therefore more fixes) for files that import builtins, and we don’t have any such files in our benchmarks (IIUC). It could be that computing the fixes becomes more expensive (the use of `get_or_import_symbol` will be more expensive) even in the case that the file doesn’t import builtins, though I don’t have intuition on whether that could account for the delta here. You could try removing the changes that involve modifying the fixes, to see if the benchmarks return to normal as Carl suggests.
It looks to me like all the rules that are changed to use I haven't checked whether the benchmarks in question do this, but this change seems entirely separable from the better detection of imported builtins, so if there are any mysteries it seems like a good first step is to separate the changes and narrow down the cause. (And if it's still a mystery, perhaps further separate by only changing one rule at a time.)
Oh, that's true! Good call. I would be surprised if we had any instances of those in our benchmark files (and especially not in all of them) but it's possible there are a few and it's definitely a good idea to isolate the changes.
Yet another possibility: the change to the fixes also involves changing a bunch of fixes to use
if name_expr.id != "dict" {
    return;
}
if !checker.semantic().is_builtin("dict") {
Is it intentional that `is_builtin` uses `lookup_symbol` but the new implementation now uses `resolve_qualified_name` (which also works for member expressions, which the old implementation didn't)?
I don't think there's a need for the new abstraction to work with member expressions. I just used `resolve_qualified_name` because it seemed like the simplest way to implement the new abstraction. Do you think using `lookup_symbol` would be more efficient here? (Goes to compare the two definitions...)
I'm not sure how we would be able to use `lookup_symbol` here, as `lookup_symbol` takes `&str` (representing the name of the symbol) rather than an AST node. And we don't know the name of the symbol -- even if it's a builtin, as you can do `from builtins import open as o` -- unless we resolve the qualified name.
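To illustrate the aliasing point above, here is a minimal, self-contained sketch. It is a toy model, not ruff's semantic model: `Scope`, `Binding`, `is_builtin_by_name`, and `refers_to_builtin` are hypothetical names invented for this example. It only shows why a check keyed on a known name cannot see through `from builtins import open as o`, while a check that resolves the name written in the source can.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a binding in a semantic model (not ruff's type).
enum Binding {
    /// `from <module> import <member> as <alias>`
    FromImport { module: &'static str, member: &'static str },
    /// Any user-defined function, class, or variable.
    UserDefined,
}

/// Toy scope: maps a local name to its binding. Names that are absent
/// fall through to the builtin scope.
struct Scope {
    bindings: HashMap<&'static str, Binding>,
}

impl Scope {
    /// Name-based check: "is `symbol` left unbound by this scope?"
    /// The caller must already know which name to ask about.
    fn is_builtin_by_name(&self, symbol: &str) -> bool {
        !self.bindings.contains_key(symbol)
    }

    /// Expression-based check: resolve the name as written in the source,
    /// then compare the resolved target against `builtins.<symbol>`.
    fn refers_to_builtin(&self, name_in_source: &str, symbol: &str) -> bool {
        match self.bindings.get(name_in_source) {
            // Unbound names fall through to the builtin scope.
            None => name_in_source == symbol,
            Some(Binding::FromImport { module, member }) => {
                *module == "builtins" && *member == symbol
            }
            Some(Binding::UserDefined) => false,
        }
    }
}

/// The scope after executing `from builtins import open as o`.
fn scope_with_alias() -> Scope {
    let mut bindings = HashMap::new();
    bindings.insert("o", Binding::FromImport { module: "builtins", member: "open" });
    Scope { bindings }
}
```

In this toy, `is_builtin_by_name("open")` can only answer questions about the literal name `open`; it has no way to learn that a call to `o(...)` is also a call to the builtin, which is the case the expression-based check handles.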
crates/ruff_linter/src/rules/ruff/rules/mutable_fromkeys_value.rs
crates/ruff_linter/src/rules/refurb/rules/verbose_decimal_constructor.rs
match qualified_name.segments() {
    ["" | "builtins", "min"] => Some(MinMax::Min),
    ["" | "builtins", "max"] => Some(MinMax::Max),
    _ => None,
}
Nit: Consider adding a method `is_builtin` to `QualifiedName` that tests the segments:
if qualified_name.is_builtin("min") {
    Some(MinMax::Min)
} else if qualified_name.is_builtin("max") {
    Some(MinMax::Max)
} else {
    None
}
`QualifiedName` already has an `is_builtin` method, which does something slightly different right now:
ruff/crates/ruff_python_ast/src/name.rs, lines 50 to 54 in cbd5001:
/// If the first segment is empty, the `CallPath` is that of a builtin.
/// Ex) `["", "bool"]` -> `"bool"`
pub fn is_builtin(&self) -> bool {
    matches!(self.segments(), ["", ..])
}
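For comparison, the name-specific check suggested in this thread could look roughly like the sketch below. This is a toy under stated assumptions: the `QualifiedName` here is a stand-in defined for the example, not ruff's type, and `is_builtin_name` is a hypothetical method name chosen so that it does not clash with the existing zero-argument `is_builtin`. It treats both the implicit form `["", name]` and the explicit form `["builtins", name]` as the builtin.

```rust
/// Toy stand-in for a resolved qualified name; not ruff's actual type.
struct QualifiedName<'a> {
    segments: Vec<&'a str>,
}

impl<'a> QualifiedName<'a> {
    fn segments(&self) -> &[&'a str] {
        &self.segments
    }

    /// Hypothetical helper: true for both the implicit-builtin form
    /// `["", name]` and the explicit `["builtins", name]` form.
    fn is_builtin_name(&self, name: &str) -> bool {
        matches!(self.segments(), ["" | "builtins", member] if *member == name)
    }
}

#[derive(Debug, PartialEq)]
enum MinMax {
    Min,
    Max,
}

/// Same decision as the `match` on segments in the diff, expressed
/// through the helper instead of repeating the `"" | "builtins"` pattern.
fn min_or_max(qualified_name: &QualifiedName) -> Option<MinMax> {
    if qualified_name.is_builtin_name("min") {
        Some(MinMax::Min)
    } else if qualified_name.is_builtin_name("max") {
        Some(MinMax::Max)
    } else {
        None
    }
}
```

The trade-off is the usual one: the helper removes the repeated pattern, while the original `match` keeps every accepted shape visible at the call site.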
/// Return `true` if `member` is a reference to `builtins.$target`,
/// i.e. either `object` (where `object` is not overridden in the global scope),
/// or `builtins.object` (where `builtins` is imported as a module at the top level)
pub fn match_builtin_expr(&self, expr: &Expr, symbol: &str) -> bool {
I wonder if we could instead change the method to return a `QualifiedName`, where constructing the `QualifiedName` short-circuits if `Modules::BUILTINS` isn't imported.
I'm bringing this up because `QualifiedName` already has methods for handling builtins (that might be incorrect?):
pub fn is_builtin(&self) -> bool {
    matches!(self.segments(), ["", ..])
}
So what I have in mind is:
model.resolve_builtin_name(expr).is_some_and(|builtin| builtin.is_builtin_name("open"))
We could still have a helper as you have today to avoid some of the repetition (`is_some_and` is somewhat annoying), but it would build on the same concepts.
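A rough sketch of the shape being proposed here, under heavy simplification. Everything in it is a toy: `Expr`, `Model`, `seen_builtins_import`, and `shadowed` are hypothetical stand-ins defined for this example, and the resolver returns a plain `&str` rather than a `QualifiedName`. The point it tries to show is the short-circuit: the `builtins.<name>` branch is only considered when the module has been seen at all, and the boolean helper is layered on top of the resolver.

```rust
use std::collections::HashSet;

/// Toy expression type with just the two shapes that matter here.
enum Expr {
    Name(String),
    Attribute { value: Box<Expr>, attr: String },
}

/// Toy semantic model (hypothetical fields, not ruff's).
struct Model {
    /// Whether an `import builtins` has been seen anywhere in the file.
    seen_builtins_import: bool,
    /// Names rebound by user code, which therefore hide the builtin.
    shadowed: HashSet<String>,
}

impl Model {
    /// Return the builtin's name if `expr` refers to one in this toy model.
    fn resolve_builtin_name<'a>(&self, expr: &'a Expr) -> Option<&'a str> {
        match expr {
            // A bare name that nothing rebinds falls through to the builtin scope.
            Expr::Name(id) if !self.shadowed.contains(id) => Some(id.as_str()),
            // `builtins.<attr>`: skipped entirely when the module was never imported.
            Expr::Attribute { value, attr } if self.seen_builtins_import => {
                match value.as_ref() {
                    Expr::Name(module) if module.as_str() == "builtins" => Some(attr.as_str()),
                    _ => None,
                }
            }
            _ => None,
        }
    }

    /// Convenience wrapper so call sites avoid repeating the comparison.
    fn match_builtin_expr(&self, expr: &Expr, symbol: &str) -> bool {
        self.resolve_builtin_name(expr) == Some(symbol)
    }
}
```

The toy ignores aliases and the possibility that `builtins` itself is rebound; a real implementation would have to resolve the binding for `builtins` rather than compare its spelling.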
I think 4f62717 goes some way to addressing this comment, though I haven't implemented your suggestion in exactly the way you suggested. What do you think of that solution?
I like it.
b8bfb9c
to
4e3c7c0
Compare
@MichaReiser's suggestions to move the symbol lookups lower down the function for each rule got rid of the performance regression 🥳🥳
513de18
to
08c5e30
Compare
That's great! In my head those lookups didn't matter since they were only occurring when we saw the "right" name (e.g., |
@@ -75,7 +69,7 @@ pub(crate) fn getattr_with_constant(
     if is_mangled_private(value.to_str()) {
         return;
     }
-    if !checker.semantic().is_builtin("getattr") {
+    if !checker.semantic().match_builtin_expr(func, "getattr") {
The counterpoint is that we're now doing other work that isn't strictly necessary (for example, if this is a name, and `builtins` isn't imported, and the name isn't `getattr`, then the call to `is_mangled_private` wasn't necessary -- we could've exited much earlier; similarly, we risk calling `is_identifier` on many more function calls that won't ever match).
This is clearly still the right tradeoff, but that's the tradeoff.
Correct
The only way around this would be to, like, have some `could_be_builtin("getattr")` method you call at the top that returns `false` if `builtins` isn't imported, it's not a name, or the name isn't `getattr`, but IDK, that seems not great / worthwhile.
These things (like `is_identifier`) do show up in traces sometimes because we tend to be calling these rules for every function call, which is why being able to gate on something as cheap as "the name of the function" is useful.
I don't think we need to change anything here but wanted to raise visibility on it.
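For anyone who wants to see the tradeoff from this thread in miniature, here is a toy sketch of the `could_be_builtin` gate. Nothing in it is ruff code: `Checker`, its fields, and all three methods are hypothetical, callees are modelled as plain strings, and the counter exists only to make the skipped work observable. The gate answers "cannot rule it out cheaply" whenever `builtins` is imported, and otherwise requires the callee to be spelled exactly like the symbol.

```rust
use std::cell::Cell;

/// Toy checker (hypothetical; not ruff's `Checker`).
struct Checker {
    /// Whether `import builtins` appears anywhere in the file.
    builtins_imported: bool,
    /// Counts how often the costly per-call work actually ran.
    expensive_checks: Cell<u32>,
}

impl Checker {
    /// Cheap gate: returns false only when the callee can be ruled out
    /// without any resolution.
    fn could_be_builtin(&self, callee: &str, symbol: &str) -> bool {
        self.builtins_imported || callee == symbol
    }

    /// Stand-in for work like `is_identifier` or `is_mangled_private`.
    fn expensive_argument_check(&self) -> bool {
        self.expensive_checks.set(self.expensive_checks.get() + 1);
        true
    }

    /// Stand-in for full resolution of the callee.
    fn resolves_to_builtin(&self, callee: &str, symbol: &str) -> bool {
        callee == symbol
            || (self.builtins_imported && callee.strip_prefix("builtins.") == Some(symbol))
    }

    /// Gate first, then the costly checks, then full resolution last.
    fn rule_fires(&self, callee: &str, symbol: &str) -> bool {
        self.could_be_builtin(callee, symbol)
            && self.expensive_argument_check()
            && self.resolves_to_builtin(callee, symbol)
    }
}
```

In a file that never imports `builtins`, every call whose callee is not literally `getattr` exits before the costly check runs. Once the module is imported, the gate stops filtering, which is the extra work described above.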
3f4729b
to
7cefed0
Compare
Benchmarks are now showing an... improvement? Not sure how that's possible, but I'll take it https://codspeed.io/astral-sh/ruff/branches/AlexWaygood:improve-builtins-handling
03dada4
to
4f62717
Compare
4f62717
to
e441e77
Compare
Summary
A pattern that shows up a lot in the `ruff_linter` crate currently is to do something like this to determine whether an `ast::Expr` node refers to a builtin symbol:
This has two problems:
`builtins` module and e.g. refer to `zip` by using `builtins.zip` rather than just `zip`
This PR adds a new method to the semantic model that means that this kind of logic becomes both more concise and more accurate. The above logic can now be replaced with this, which will also (unlike the old function) recognise `builtins.zip` as being a reference to the builtin `zip` function:
A new method is also added to `crates/ruff_linter/src/importer/mod.rs` to enable us to provide autofixes involving builtin functions/classes, even when they've been shadowed by a user-defined function or class in the current scope.
Test Plan
`cargo test`. Several fixtures have been extended with new examples to test the new functionality.
This PR should be easiest to review commit-by-commit: each commit passes the full test suite by itself.
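As a rough illustration of the autofix behaviour described in the summary (referring to a builtin even when it has been shadowed), here is a toy sketch. It is an assumption about the general idea, not the importer's actual API: `BuiltinReference` and `reference_builtin` are hypothetical names, and scopes are reduced to a set of shadowed names plus a flag.

```rust
use std::collections::HashSet;

/// What a fix needs in order to mention a builtin (hypothetical type).
#[derive(Debug, PartialEq)]
struct BuiltinReference {
    /// Import statement the fix must add, if any.
    import_to_add: Option<String>,
    /// Text the fix should use to refer to the builtin.
    expression: String,
}

/// Decide how a fix can refer to the builtin `symbol` in this toy model.
fn reference_builtin(
    symbol: &str,
    shadowed: &HashSet<&str>,
    builtins_module_imported: bool,
) -> BuiltinReference {
    // Not shadowed: the bare name already refers to the builtin.
    if !shadowed.contains(symbol) {
        return BuiltinReference {
            import_to_add: None,
            expression: symbol.to_string(),
        };
    }
    // Shadowed: go through the module, importing it only if it is missing.
    BuiltinReference {
        import_to_add: if builtins_module_imported {
            None
        } else {
            Some("import builtins".to_string())
        },
        expression: format!("builtins.{symbol}"),
    }
}
```

For a file that defines its own `zip`, the toy yields `builtins.zip` plus an `import builtins` edit; for a file that leaves `zip` alone, it yields the bare name and no import.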