[ty] add nested bindings to the semantic index #19820
Conversation
Diagnostic diff on typing conformance tests: No changes detected when running ty on typing conformance tests ✅

CodSpeed Performance Report: Merging #19820 will degrade performances by 7.39%.
```rust
// NOTE: Because these variables aren't bound (won't wind up in
// `nested_scopes_with_bindings`) and aren't `nonlocal` (can't trigger `nonlocal`
// syntax errors), collecting them currently has no effect. We could consider
// removing this bit and renaming `free_variables` to say `unresolved_nonlocals`?
```
If I delete this block, is there a good way to rerun the CodSpeed instrumentation and see if it helps?
Not sure I understand. You can delete it and push the changes?
I've opened a dummy PR for this: #19828
```diff
  // TODO increase this once we extend `UnionElement` throughout all union/intersection
  // representations, so that we can make large unions of literals fast in all operations.
- const MAX_UNION_LITERALS: usize = 200;
+ const MAX_UNION_LITERALS: usize = 100;
```
This tweak is necessary to avoid crashes in cases like this:

```python
def f():
    x = 1
    def g():
        nonlocal x
        x += 1
```

Previously we would exceed the cycle limit before we gave up on the int literal union and called `x` an `int`. This was @carljm's suggested fix.
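As a runnable illustration of why this case is hard (plain Python, independent of ty): the `nonlocal` write in `g` rebinds `x` in `f`'s scope on every call, so there is no finite set of int literals describing `x`, and a literal-union inference grows on each iteration of the fixpoint cycle.

```python
# Plain-Python illustration of the example above: each call to g()
# rebinds x in the enclosing scope via `nonlocal`, so the possible
# values of x are 1, 2, 3, ... -- a checker must eventually widen to `int`.
def f() -> int:
    x = 1

    def g() -> None:
        nonlocal x
        x += 1  # rebinds f's x

    g()
    g()
    return x

print(f())  # -> 3
```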
I think it's not ideal that we will do 100 fixpoint iterations in a case like this before falling back, but I also think improving that is out of scope for this PR.
I think the best way to improve this is actually not cycle-specific; I think we should place a much lower limit on the size of literal integer unions we are willing to explode and perform literal math on, and above that limit we just fallback to int -- and similar for operations on other literal types for which we make precise inferences. The rationale here is that we'd like to have a reasonably large limit on the size of literal unions that users can explicitly create, but it's reasonable to have a much lower limit on the size of literal unions that we implicitly create via precise type inference of operations.
(I filed astral-sh/ty#957 to track this separately.)
Would it make sense to make this change in a separate PR to understand its performance and ecosystem impact?
MichaReiser left a comment
ty.
I only reviewed the code changes. I'd like to have a better understanding of the memory regression before merging this. Is it because we now have a better understanding of the code, or is it because the semantic index grows that much? If it's the latter, then I think that's more than I would expect from this change.
Would you mind collecting and comparing the full memory report on a large repository and sharing it here?
```rust
        .or_default()
        .extend(variables);
}
let popped_place_table = &self.place_tables[popped_scope_id];
```
It's a bit more fragile, but I'd use `std::mem::take` here to work around the borrow checker issue rather than collecting the symbols:

```rust
let popped_place_table = std::mem::take(&mut self.place_tables[popped_scope_id]);
for symbol in popped_place_table.symbols() {
    // Collect new implicit (not `nonlocal`) free variables.
    //
    // NOTE: Because these variables aren't bound (won't wind up in
    // `nested_scopes_with_bindings`) and aren't `nonlocal` (can't trigger `nonlocal`
    // syntax errors), collecting them currently has no effect. We could consider
    // removing this bit and renaming `free_variables` to say `unresolved_nonlocals`?
    if symbol.is_used()
        && !symbol.is_bound()
        && !symbol.is_declared()
        && !symbol.is_global()
        // `nonlocal` variables are handled in `visit_stmt`, which lets us stash an AST
        // reference.
        && !symbol.is_nonlocal()
    {
        parent_free_variables
            .entry(symbol.name().clone())
            .or_default()
            .push(FreeVariable {
                scope_id: popped_scope_id,
                nonlocal_identifier: None,
            });
    }
    // Record bindings of global variables. Put these in a temporary Vec as another
    // borrowck workaround.
    if symbol.is_global() && symbol.is_bound() {
        // Add this symbol to the global scope, if it isn't there already.
        let global_symbol_id =
            self.add_symbol_to_scope(symbol.name().clone(), FileScopeId::global());
        // Update the global place table with this reference. Doing this here rather than
        // when we first encounter the `global` statement lets us see whether the symbol is
        // bound.
        self.place_tables[FileScopeId::global()]
            .add_nested_scope_with_binding(global_symbol_id, popped_scope_id);
    }
}
self.place_tables[popped_scope_id] = popped_place_table;
```

You'll need to apply the same trick to `parent_free_variables`, or iterate over symbols twice.
This is a temporary Vec that's empty (doesn't allocate) in scopes that don't bind a global variable. I think the performance and memory cost of it should be negligible. Are you sure it's worth these sorts of hijinks to avoid it? :)
I don't know. Codspeed might tell you 😆
```rust
// syntax errors), collecting them currently has no effect. We could consider
// removing this bit and renaming `free_variables` to say `unresolved_nonlocals`?
```
I don't fully understand the implications of this, but I have a very strong preference not to store anything in the semantic index that we currently have no use for.
The semantic index is already very complicated, and it's not very obvious which information is used for what. That makes it non-trivial to identify unused data that could be removed. That's why I'd rather not add data until it's absolutely necessary, because removing it later is hard (it's not like Rust tells us that it's unused).
These references don't get stored in the index itself, only in the builder. Does that make it better? :)
Not necessarily. Memory access isn't cheap compared to pure computations
I would be inclined to remove the unnecessary collection and do the rename, for now. It seems easy to reverse later if/when we have a use for the fuller data, and it just seems like a good principle (both for readability/maintainability and performance) to avoid doing unnecessary work, if it's easy not to do it.
```rust
// bound in nested scopes (by being marked `global` or `nonlocal` there). These (keys) are
// similar to what CPython calls "cell" variables, except that this scope may also be the
// global scope.
pub(super) nested_scopes_with_bindings: FxHashMap<ScopedSymbolId, Vec<FileScopeId>>,
```
How common are these? Could we use a `ThinVec` instead?
Given that we collect all entries in the builder, could we store them in two vecs, where:
- The first vec is a list of `ScopedSymbolId`s, each with a range into the second vec.
- The second vec stores the `FileScopeId`s. All elements for the same `ScopedSymbolId` are added next to each other.

This reduces the cost to exactly two (thin?) vecs rather than a HashMap (which is larger than a vec) with nested vecs.
An alternative is to use a `SmallVec` for the inner `Vec`, given that `FileScopeId` is a `u32` (meaning up to 4 elements fit into an inline vec).
It's hard for me to advise on the best representation without knowing more about how many elements we expect (per scope and per symbol).
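The two-vec layout might look roughly like this (Python lists standing in for the Rust vecs; all names are illustrative, not ty's code):

```python
# Flattened multimap: instead of dict[symbol_id, list[scope_id]], store
# (symbol_id, start, end) triples in one vec plus one shared vec of scope
# ids, where all scope ids for a given symbol are contiguous.
def build_flat(mapping: dict[int, list[int]]) -> tuple[list[tuple[int, int, int]], list[int]]:
    keys: list[tuple[int, int, int]] = []
    scopes: list[int] = []
    for symbol_id, scope_ids in mapping.items():
        start = len(scopes)
        scopes.extend(scope_ids)
        keys.append((symbol_id, start, len(scopes)))
    return keys, scopes

def lookup(keys, scopes, symbol_id):
    # Linear scan; cheap because the map is expected to hold only 1-2 entries.
    for sid, start, end in keys:
        if sid == symbol_id:
            return scopes[start:end]
    return []

keys, scopes = build_flat({7: [1, 2], 9: [3]})
print(lookup(keys, scopes, 7))  # -> [1, 2]
```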
The keys of this map are the IDs of symbols in this scope that have one or more bindings in nested scopes (via `nonlocal` or `global` references in those nested scopes). So for the (vast?) majority of scopes / symbol tables, this map will be empty. Most variables that do have nonlocal bindings probably only have one or two, so a `SmallVec` might be a good optimization for the inner `Vec`, but if the map is empty most of the time it also might not matter?
If we're worried about the footprint of the map in the `SymbolTable` struct (that is, even when it's empty), we could consider a couple other options:
- Wrap the map inside an `Option<Box<...>>`. That keeps the space overhead down to one pointer in `SymbolTable`, at the cost of an extra allocation when we do populate the table.
- Get rid of the per-symbol-table maps and put a file-wide map at the top level of the `SemanticIndex` with keys like `(FileScopeId, ScopedSymbolId)`. Then to cut down on unnecessary lookups in that table, maybe allocate a bit in `SymbolFlags` called something like `HAS_BINDINGS_IN_NESTED_SCOPES`?

But stepping back from all that, is it possible that the memory usage stats I just posted below tell us we don't need to worry about this? What else should I measure?
I think we expect this map to be empty for most scopes, and when it is not empty, to have only one or two keys in it, and each key's value to likely also have only one or two entries in it. So overall, we expect this to be small. I think this means both that we could probably find a more efficient representation here, and that it may not matter much to overall memory usage in practice (as suggested by the memory usage stats). I'll defer to @MichaReiser on whether it's worth pursuing an alternative representation here. My own feeling is that the memory increase here seems minimal, and I'm more concerned about reducing the performance regression (for which I think the focus should likely be on place_by_id.)
I don't feel strongly about it, but a `ThinVec` could be a great use case, because it reduces the overhead per scope from 48 bytes to 8 (if empty), and a linear scan should be cheap.
Constructing such a vec-based map from a real map should be trivial (it's fine to use a proper map during building). It's only the lookup that requires a scan. (You could even consider using a `ThinVec` during building, which might be cheaper, but I don't care much about that.)
Here's the summary of running the memory report, and again with this PR: it seems like the change is small? But I'm new to these stats, and I'm not sure I'm measuring the right thing.
That's good :). Can you run the full report? I'm curious which specific structs are increasing. Can you run it on prefect, which shows up in the mypy primer check result?
Here are full reports from checking https://github.com/PrefectHQ/prefect, on our `main` branch and with this PR:
Looking at the diff, it seems mainly to be due to changes in type inference, not to the size increase in the semantic index itself. Consider my concerns resolved. Thanks for digging into this!
carljm left a comment
Thank you! I love the simplifications in type inference, and the fact that we now match mypy and pyright's inference capability in cases with nonlocal writes. The collection logic in the semantic index seems clear to me, though I do think we should streamline it to avoid doing any extra work for things we don't use.
I think we do need to better understand/fix the anyio regression, and it looks like we have a 1% regression overall in the cold-check benchmark, which I believe checks a codebase that doesn't use nonlocal. I don't think we should pay any detectable performance penalty for this PR in code that doesn't use nonlocal, and I think the main focus to achieve that needs to be in better optimizing the no-nonlocal-writes path in place_by_id.
```diff
  def nested_scope():
      global __loader__
-     reveal_type(__loader__)  # revealed: LoaderProtocol | None
+     reveal_type(__loader__)  # revealed: Unknown | LoaderProtocol | None
```
This seems not-ideal, but it may not be easy to get rid of, and I guess given it results from a type error, maybe it's fine -- fix the type error and it will go away. (I assume this is coming from the fact that we infer Unknown from the invalid assignment below.)
```rust
// If there are any nested bindings (via `global` or `nonlocal` variables) for this symbol,
// infer them and union the results. Nested bindings aren't allowed to have declarations or
// qualifiers, and we can just union their inferred types.
let mut nested_bindings_union = UnionBuilder::new(db);
```
In other places (e.g. place_from_bindings_impl) we currently go to a bit of trouble to avoid allocating a UnionBuilder in the most common case (one visible binding) where it won't be needed. We have seen in past PRs that this has noticeable performance impact, because this path is so hot.
I suspect that the changes in this function, rather than anything in the semantic index building, are the most likely cause of the benchmark slowdowns in this PR. I think this is supported by the CodSpeed execution profile on anyio, where the regressing functions all seem to be in type inference, not semantic index building.
I think it would be worth some effort here to optimize the case where there are no nested bindings to consider. Ideally, in the no-nested-bindings scenario, this function would do exactly one hash-table lookup more than it did before this PR. (Or possibly even less than that, e.g. maybe a scope-level "this scope has nested writes" flag that allows us to bypass all additional work if it's not set?)
We should also be able to do zero additional work here in the case where the name has a declared type in the outer scope. In that scenario we should just use the declared type and not query the nonlocal bindings at all. Currently this PR doesn't use the types of the nonlocal bindings in that case, but it queries them anyway. That's wasteful, and I suspect cleaning it up will address much of the regression in anyio (which appears to have exactly one nonlocal declaration, but of a name which has a declared type in its enclosing scope, so shouldn't require us to do any additional work.)
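The short-circuit described here can be modeled in a few lines (a toy model of the control flow, not ty's actual `place_by_id`): when the symbol has a declared type in the outer scope, return it without ever invoking the nested-bindings query.

```python
# Toy model of the proposed fast path. `query_nested` is a callable so we
# can observe whether the expensive nested-bindings lookup ever runs.
def public_type(declared, inferred, query_nested):
    if declared is not None:
        return declared  # declared type wins; skip the nested-bindings query
    parts = [inferred] if inferred is not None else []
    parts.extend(query_nested())  # only pay this cost when we'll use the result
    return " | ".join(parts) if parts else "Unbound"

calls = []
def query():
    calls.append(1)
    return ["Literal[2]"]

assert public_type("int", "Literal[1]", query) == "int"
assert calls == []  # the query was never executed
assert public_type(None, "Literal[1]", query) == "Literal[1] | Literal[2]"
```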
```rust
// If there are nested bindings, union whatever we inferred from those into what we've
// inferred here.
if let Some(nested_bindings_type) = nested_bindings_union.try_build() {
    match &mut inferred {
        Place::Type(inferred_type, _) => {
            *inferred_type =
                UnionType::from_elements(db, [*inferred_type, nested_bindings_type]);
        }
        Place::Unbound => {
            inferred = Place::Type(nested_bindings_type, Boundness::PossiblyUnbound);
        }
    }
}
```
I wonder here why we aren't just adding the inferred type (if any) to nested_bindings_union, before building it?
But don't pay too much attention to this comment, since I suspect this will all change anyway if we try to optimize the happy path in this function better.
@carljm I'm able to reduce the 10% slowdown with https://github.com/agronholm/anyio locally (though I have to set
@oconnor663 Ah, hm, I thought the repo only had one
What about ensuring that we never query nonlocal bindings at all if the symbol has declarations in the outer scope and we aren't going to use the nonlocal bindings anyway? That seems like the highest-priority perf improvement, which we should do regardless. Have you implemented that yet? How much does that impact perf, if at all?
Interesting! That strongly suggests that we are hitting a cycle in the nonlocal bindings where we grow a literal union on every iteration of the cycle. I did some grep-based spot-checking of the

If this is really the only thing that fixes it, that suggests that we would need astral-sh/ty#957 to properly reclaim this perf. But I would love to identify the code that is actually causing this, to make sure we aren't missing something. One way to do this could be to benchmark each test file individually, with main vs this PR, to narrow down the file in which the problem occurs. (I think this can be done non-painfully by pre-building release versions of main and this PR, and then using
Yes, that's generally the right thing to do for reliable benchmarking (if your goal is to improve straight-line perf, not specifically to improve parallelism).
I see that in the CodSpeed report, but I wonder if it's within the error bars. When I try to reproduce it locally (
I thought I had tested this above and found it didn't help, but I tried it again just now and it does. Locally it looks like it recovers about half of the slowdown, but definitely not all of it. I'll push the change and see what happens in CodSpeed.
Yeah, I'd like to know too. I'll see if I can divide and conquer it somehow.
This is an instrumented benchmark on CodSpeed, not a walltime benchmark. That means it's counting CPU instructions (and I think assigning some fixed cost estimate to syscalls?). In some ways that makes it less "real-world performance", but it also generally makes it pretty consistent; I don't usually see a lot of fluctuation in those. But if it really doesn't show up locally at all, maybe that's a case where CodSpeed is misleading us a bit.
Ah very neat. In that case I'm not sure who's right 😅 The re-run speed report shows the anyio regression at 7% now instead of 10%, so laziness did help a bit. Still digging into the specifics.
One thing you could try is to record a profile locally using simply (with

I often find this more useful than CodSpeed's profile, because CodSpeed filters and truncates the stack frames, which sometimes is great but sometimes hides exactly what you're supposed to see.
We have some notes in an internal Notion document with some tips 'n' tricks for profiling and benchmarking ty locally: https://www.notion.so/astral-sh/Performance-measurements-15848797e1ca808f97c9d928e6923d03?source=copy_link
Ok, some progress: deleting lots of code and re-benching narrowed the entire (remaining) slowdown to this generator context manager: https://github.com/agronholm/anyio/blob/fef093c1e33563493182b95a0c7028b86c61dd90/src/anyio/pytest_plugin.py#L37-L64
Found it! |
Yes! That means the bulk of the anyio regression is astral-sh/ty#957. It might mean that if we land this for beta, we will also need to fix that for beta, depending on how this shows up in user codebases.

I do feel there's still room to improve the performance here outside of that particular anyio issue. Currently this is showing a very consistent 1-2% across-the-board regression in instrumented CodSpeed benchmarks. Apparently local walltime runs aren't sensitive enough to show this, but I feel like there are still some clear things we could do to improve here. We are no longer taking on the overhead of allocating a `UnionBuilder` or querying nonlocal types if we won't use the nonlocal types, which is good. We are still (within the lazy closure) eagerly allocating a `UnionBuilder` before we even know whether there are any nonlocal bindings to consider, in any case where we would consider nonlocal bindings. I think it should be possible for this PR to show effectively zero performance impact on a codebase that doesn't use any `nonlocal` writes.

I'm fine with setting this aside for the moment and moving on to NewType. I do consider this a worthwhile feature (it brings us to parity with mypy/pyright), but it's probably not critical for the beta. If we put it on ice for now, hopefully it doesn't accumulate too many conflicts?
This PR replaces #19703. It has two main components:

1. Collect free variables in `SemanticIndexBuilder`, and resolve free variables in `pop_scope`. This lets us determine which variables resolve where in the semantic index building pass, without requiring a second pass and without requiring us to defer function bodies. Previously we were doing some of this in `infer.rs` as a poor man's second pass, particularly `infer_nonlocal`, which used to flag invalid `nonlocal` statements. That's now a semantic syntax error in the builder, and `infer_nonlocal` is deleted. In my first pass at #19703 ("[ty] Track enclosing definitions and nested references in the semantic index") I was recording several new types of metadata in the index itself, including all nonlocal resolutions and nested references. But based on feedback from Micha and Carl, I've pared that down substantially to avoid wasting space. In this PR the only new metadata in the semantic index (specifically in the symbol tables) is `nested_scopes_with_bindings`, which should be empty for most scopes. (The new free variables collection lives in the builder and isn't persisted in the index.)
2. Where `nested_scopes_with_bindings` is available, take advantage of it to consider all nested bindings when inferring the "public" type of a variable (i.e. its type as viewed from other scopes or other modules). This is done in `place_by_id`. Previously `infer_place_load` (which calls `place` -> `place_by_id`) was computing a union of all the local inferred types it encountered when walking enclosing scopes, but that was missing bindings in sibling/cousin scopes. Now we see those bindings, and the union builder in `infer_place_load` is deleted.

Examples:
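For illustration (my own example, not necessarily one from the PR), the sibling/cousin-scope case looks like this:

```python
# A nested scope rebinds `x` via `nonlocal`, so the "public" type of `x`
# -- as seen from the sibling scope `h` -- must account for the nested
# binding's `str` type, not just the local `int` binding.
def f():
    x = 1

    def g():
        nonlocal x
        x = "hello"

    def h():
        return x  # with this PR, the nested `str` binding in `g` is visible here

    g()
    return h()

print(f())  # -> hello
```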
Now that we can see nested bindings of globals, I'm going to put up a follow-up PR on top of this one that deletes `infer_global` and allows `global` statements that don't match an explicit symbol in the module scope. I don't know if we'll want that change, so I'm keeping it out of this PR.

There's still a lot of scope walking in `infer.rs`. The type unioning is deleted from `infer_place_load`, but the scope walk is still there to find the defining scope and consider snapshots on the way. `add_binding` also still does a scope walk to find the relevant declaration. We could consider storing these resolutions in the semantic index, but even if we don't, it might be nice to define one function like `resolve_symbol` that all these callers (plus `ide_support.rs`) can use. This PR doesn't cover that.