[ty] Track enclosing definitions and nested references in the semantic index #19703

oconnor663 · 2025-08-02T01:15:41Z

Add three new maps to the SemanticIndex:

definitions_in_enclosing_scopes
references_in_nested_scopes
bindings_in_nested_scopes

These will serve several purposes:

LSP is going to need these for features like "find all references" and "rename".
We currently re-implement scope walking in several places, including add_binding, infer_place_load, and ide_support.rs. I don't know whether the performance cost of that matters (maybe not, since scopes aren't usually super deeply nested), but there are a lot of corner cases that it would be nice to unify, like skipping class bodies, nonlocal on top of global, etc.
We don't currently consider bindings from sibling/cousin scopes when inferring types, and it would be nice to consider them.
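
For illustration, here is a small Python example of the kind of code these maps are meant to describe. It is not taken from the PR's tests; the function names are made up, and the comments relating each line to a map name are my reading of the names rather than a statement of exactly what each map stores.

```python
def sibling_scopes():
    x = 1  # a definition in the enclosing scope

    def reader():
        return x  # a reference to `x` from a nested scope

    def writer():
        nonlocal x
        x = "two"  # a binding of `x` from a nested scope (a sibling of `reader`)

    writer()
    return reader()


def class_body_is_skipped():
    x = "function"

    class C:
        x = "class"  # class-body bindings are not visible from methods

        def method(self):
            return x  # resolves to the function-level `x`, skipping the class body

    return C().method()
```

At runtime, `reader()` observes the value that `writer()` assigned, which is why bindings in sibling scopes matter for inferring the type of `x` inside `reader`.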

To populate these maps, SemanticIndexBuilder tracks a set of free variables for each scope. (There's an interesting reason this has to be per-scope, see the new comment in struct ScopeInfo.) When popping scopes, it checks to see whether the popped scope resolves any free variables from nested scopes, and whether it creates any new ones. This makes us agnostic to whether the definition or the use comes first, since either way we'll have encountered both by the time we pop the defining scope.
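
A toy model of the pop-time resolution described above, written in Python for brevity. This is only a sketch: `Scope`, `ToyBuilder`, and all field names are hypothetical and do not mirror the real `SemanticIndexBuilder` or `ScopeInfo`, and it ignores class bodies, `nonlocal`/`global` declarations, and popping the module scope itself.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    name: str
    bindings: set = field(default_factory=set)
    uses: set = field(default_factory=set)
    # Free variables handed up by already-popped nested scopes:
    # variable name -> names of the nested scopes that reference it.
    free_from_nested: dict = field(default_factory=dict)


class ToyBuilder:
    def __init__(self):
        self.stack = [Scope("<module>")]
        self.resolved = []  # (variable, defining scope, referencing scope)

    def push(self, name):
        self.stack.append(Scope(name))

    def bind(self, name):
        self.stack[-1].bindings.add(name)

    def use(self, name):
        self.stack[-1].uses.add(name)

    def pop(self):
        scope = self.stack.pop()
        parent = self.stack[-1]
        # 1. Does the popped scope resolve free variables from nested scopes?
        for var, users in sorted(scope.free_from_nested.items()):
            if var in scope.bindings:
                self.resolved.extend((var, scope.name, user) for user in users)
            else:
                # Still unresolved: hand it up to the next enclosing scope.
                parent.free_from_nested.setdefault(var, []).extend(users)
        # 2. Does the popped scope create new free variables of its own?
        for var in sorted(scope.uses - scope.bindings):
            parent.free_from_nested.setdefault(var, []).append(scope.name)


# The use in `inner` is visited before the binding in `outer`.
builder = ToyBuilder()
builder.push("outer")
builder.push("inner")
builder.use("x")
builder.pop()      # `x` becomes a free variable recorded on `outer`
builder.bind("x")  # the binding appears after the nested scope
builder.pop()      # popping `outer` resolves `x` for `inner`
assert builder.resolved == [("x", "outer", "inner")]
```

Because resolution only happens when the defining scope is popped, the sketch gives the same result whether `bind("x")` is called before or after the nested scope is visited.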

The first/main commit in this PR defines the new maps and populates them. There are a couple of small follow-on commits that make use of the new data:

deleting infer_nonlocal(), which is now fully redundant with checks the SemanticIndexBuilder is already doing
removing the scope walk from add_binding

Larger changes are still TODO. I could add more to this PR, but I need some help with these bits to understand how things work today and how best to change things:

infer_place_load. This one is kinda doing two separate scope walks at the same time. One is calling place() and unioning nonlocal types as it goes, which will benefit from using the new maps (particularly to include nonlocal bindings from other nested scopes). The other is looking at enclosing_snapshots, which is a completely separate mechanism. (There are also a couple of bugs in how the snapshot mechanism handles scopes: "Incorrect narrowing of class/global variables in nested scopes" (ty#916) and "nonlocal snapshot sweeping considers unrelated scopes, sweeps too much" (ty#927).)
ide_support.rs. Some of this might be straightforward for all I know, but I haven't touched this file at all yet.

cc @mtshiba

…c index

This commit adds duplicate errors for invalid `nonlocal` statements, which breaks some tests. The following commit removes `infer_nonlocal` from `infer.rs` and unbreaks those tests.

These checks are now handled in the `SemanticIndexBuilder`. Removing them unbreaks the tests broken in the previous commit.

… `add_binding`

github-actions · 2025-08-02T01:17:36Z

Diagnostic diff on typing conformance tests

No changes detected when running ty on typing conformance tests ✅

github-actions · 2025-08-02T01:19:31Z

`mypy_primer` results

No ecosystem changes detected ✅

Memory usage changes were detected when running on open source projects

flake8 (https://github.com/pycqa/flake8)
-     memo fields = ~52MB
+     memo fields = ~54MB

github-actions · 2025-08-02T01:27:43Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

MichaReiser

Thank you for working on this.

What would help me to review this PR is to extend the PR summary with more details on what the new data is that we collect and why it is essential that we do this during semantic indexing.

I'm asking because we build the complete semantic index for every file, even for third-party files (eagerly), and we keep it in memory forever. The semantic index is also by far the largest query result, and we've made various efforts to reduce its size (and have more planned). This raises the question of whether this computation would be better as a separate lazy query (which should be sufficient for rename; it doesn't even have to be a query). Recomputing is, in fact, fine (the performance profiles on this PR suggest a small regression). The part that's least clear to me is why we want to consider those bindings for type inference. I don't understand the use case well enough to judge whether there is an alternative or whether upfront collection is indeed the only way to go.

The good news is that this PR doesn't regress memory usage that much. I ran it on a large repository and it only increases the semantic index size by 216 MB or ~3% (but that's still more than what an optimization like #19572 gives us back). So maybe that's fine, and maybe there's a way we can get this down if we can avoid some of the hash maps or inner vecs.

I haven't been involved in this work before. So I'm sorry if this is something that you discussed at length with @carljm and @AlexWaygood. If that's the case, feel free to disregard this.

MichaReiser · 2025-08-02T08:22:35Z