-
Notifications
You must be signed in to change notification settings - Fork 1.6k
[ty] Track enclosing definitions and nested references in the semantic index #19703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…c index This commit adds duplicate errors for invalid `nonlocal` statements, which breaks some tests. The following commit removes `infer_nonlocal` from `infer.rs` and unbreaks those tests.
These checks are now handled in the `SemanticIndexBuilder`. Removing them unbreaks the tests broken in the previous commit.
Diagnostic diff on typing conformance testsNo changes detected when running ty on typing conformance tests ✅ |
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this
What would help me to review this PR is to extend the PR summary with more details on what the new data is that we collect and why it is essential that we do this during semantic indexing.
I'm asking because we build the semantic index for every file in completeness, even for third-party files (eagerly) and we keep it in memory forever. The semantic index is also by far the largest query result and we've made various efforts to reduce its size (and have more planned). This raises the questions if it would be better if this computation is a separate lazy query (should be sufficient for rename, or it doesn't even have to be a query), recomputing, in fact, is fine (the performance profiles on this PR suggest a small regression). The part that's the least clear to me is that we want to consider those bindings for type infernece. I don't understand the use case enough to judge if there isn't an alternative or if upfront collection is indeed the only way to go.
The good news is: This PR doesn't regress memory usage that much. I ran it on a large repository and it only increases the semantic index size by 216 MB or ~3% (but that's still more than what an optimization like #19572 gives us back). So maybe that's fine and maybe there's a way we can get this down if we can avoid some of the hash maps or inner vecs
I haven't been involved in this work before. So I'm sorry if this is something that you discussed at length with @carljm and @AlexWaygood. If that's the case, feel free to disregard this.
| // The resolutions of variables that are either used-but-not-defined or explicitly marked | ||
| // `global` or `nonlocal` in this scope. These (keys) are similar to what CPython calls "free" | ||
| // variables, except that we also include variables marked `global`. | ||
| pub(super) definitions_in_enclosing_scopes: | ||
| FxHashMap<ScopedSymbolId, (FileScopeId, ScopedSymbolId)>, | ||
|
|
||
| // The inverse of `definitions_in_enclosing_scopes` above: variables defined in this scope, | ||
| // and not marked `global` or `nonlocal` here, which are used or bound in nested scopes. | ||
| // These (keys) are similar to what CPython calls "cell" variables, except that this scope | ||
| // may also be the global scope. We also include nested scopes that only mark a variable | ||
| // `global` or `nonlocal` but don't touch it after that. Those aren't very useful, but IDE | ||
| // features like "rename all" will still want to know about them. | ||
| pub(super) references_in_nested_scopes: | ||
| FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>, | ||
|
|
||
| // A subset of `references_in_nested_scopes` above: variables defined in this scope, and not | ||
| // marked `nonlocal` or `global`, which are bound in nested scopes. | ||
| pub(super) bindings_in_nested_scopes: | ||
| FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to store this information directly on Symbol, given that these maps are indexed by ScopedSymbolId. Or are we using a hash map because only few symbols in a scope need this information?
| // A subset of `references_in_nested_scopes` above: variables defined in this scope, and not | ||
| // marked `nonlocal` or `global`, which are bound in nested scopes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this is a subset. Could we store an extra flag in references_in_nested_scopes that tells them apart. This reduces the number of Vecs and hash maps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thinking was that most variables aren't bound in nested scopes, but many are used, and I wanted it to be quick to ask "do we actually need to look at any of these references?" during typechecking. But now that you have me thinking more about storage overhead, maybe the best thing would be to just allocate another bit in SymbolFlags called something like IS_BOUND_IN_NESTED_SCOPE?
| // `global` or `nonlocal` but don't touch it after that. Those aren't very useful, but IDE | ||
| // features like "rename all" will still want to know about them. | ||
| pub(super) references_in_nested_scopes: | ||
| FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You Could use a SmallVec here. This would allow us to store 4 scoped symbol ids inline.
| // `global` or `nonlocal` in this scope. These (keys) are similar to what CPython calls "free" | ||
| // variables, except that we also include variables marked `global`. | ||
| pub(super) definitions_in_enclosing_scopes: | ||
| FxHashMap<ScopedSymbolId, (FileScopeId, ScopedSymbolId)>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be enough to just store FileScopeId as the value here, rather than (FileScopeId, ScopedSymbolId)?
Because, for instance, the recorded values are used here, but we can get the PlaceTable of the enclosing scope from enclosing_scope_id (using self.index.place_table()), and then get the ScopedSymbolId with PlaceTable::symbol_id.
| // `global` or `nonlocal` but don't touch it after that. Those aren't very useful, but IDE | ||
| // features like "rename all" will still want to know about them. | ||
| pub(super) references_in_nested_scopes: | ||
| FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the same reason as above, wouldn't it be enough to just store FileScopeId as the value for the following maps?
|
|
||
| struct FreeVariable<'ast> { | ||
| scope_id: FileScopeId, | ||
| symbol_id: ScopedSymbolId, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the same reason as described below (symbol.rs), it seems that this ScopedSymbolId is redundant (it can be obtained from the PlaceTable of the nested scope).
@carljm and I discussed it, but I can't say it was at length :) My impression was that he didn't want "find all references" / "rename all" to require walking all scopes. But now that you mention it, my understanding of the problem isn't very good: I don't know whether that's because we expect it to perform poorly, or whether we don't want the LSP to be doing complicated things, or what? Carl could you clarify? (I wonder if part of the problem with doing a big walk is just that it would get invalidated frequently?)
In my head it would be nice to handle cases like this: x = None
def f():
global x
x = 42
reveal_type(x); # revealed: `None` (should be `None | Literal[42]`?)I know I've seen cases like that come up in ecosystem analysis results for |
|
In our initial discussions around this PR, I was mostly thinking about the On the "benefit" side, it was primarily motivated by a) goto-definition including nested definitions (and sibling definitions, if you goto-definition on a nested reference!) without having to re-walk all nested scopes, which @UnboundVariable mentioned as a significant TODO in the goto-definition PR, and b) considering these nested definitions in type inference, which I had originally thought maybe we wouldn't do, but is definitely nicer / more accurate if we can do it, for scenarios like the one shown by @oconnor663 (and similar ones with On the "cost" side, my assumption had been that use of I'm less clear on the cost-benefit analysis for The latter can avoid the need to re-walk scope ancestors respecting Python scoping rules (which, as the PR description notes, we do, with some code redundancy, in a few different places). This could be a nice code simplification, but I'm not sure if it is a good tradeoff performance-wise or not. Walking just ancestor scopes is cheaper than walking all sibling/nested scopes. The former seems mostly useful for rename-all and find-all-references. But these are features where I think users will accept a bit slower response time, and features that are less frequently used (relative to e.g. core type inference), so it's less clear to me that it's worth paying a cost in every semantic indexing just in order to speed up these features. If it's not too much work (I haven't reviewed the PR in detail yet), I think it would be ideal to split these, and initially land a diff adding only |
|
Closing in favor of #19820. |
Add three new maps to the
SemanticIndex:definitions_in_enclosing_scopesreferences_in_nested_scopesbindings_in_nested_scopesThese will serve several purposes:
add_binding,infer_place_load, andide_support.rs. I don't know whether the performance cost of that matters (maybe not, since scopes aren't usually super deeply nested), but there are a lot of corner cases that it would be nice to unify, like skipping class bodies,nonlocalon top ofglobal, etc.To populate these maps,
SemanticIndexBuildertracks a set of free variables for each scope. (There's an interesting reason this has to be per-scope, see the new comment instruct ScopeInfo.) When popping scopes, it checks to see whether the popped scope resolves any free variables from nested scopes, and whether it creates any new ones. This makes us agnostic to whether the definition or the use comes first, since either way we'll have encountered both by the time we pop the defining scope.The first/main commit in this PR defines the new maps and populates them. There are a couple of small follow-on commits that make use of the new data:
infer_nonlocal(), which is now fully redundant with checks theSemanticIndexBuilderis already doingadd_bindingLarger changes are still TODO. I could add more to this PR, but I need some help with these bits to understand how things work today and how best to change things:
infer_place_load. This one is kinda doing two separate scope walks at the same time. One is callingplace()and unioningnonlocaltypes as it goes, which will benefit from using the new maps (particularly to includenonlocalbindings from other nested scoeps). The other is looking atenclosing_snapshots, which is a completely separate mechanism. (There are also a couple of bugs in how the snapshot mechanism handles scopes: Incorrect narrowing of class/global variables in nested scopes ty#916 and nonlocal snapshot sweeping considers unrelated scopes, sweeps too much ty#927.)ide_support.rs. Some of this might be straightforward for all I know, but I haven't touched this file at all yet.cc @mtshiba