Skip to content

Conversation

@oconnor663
Copy link
Contributor

@oconnor663 oconnor663 commented Aug 2, 2025

Add three new maps to the SemanticIndex:

  • definitions_in_enclosing_scopes
  • references_in_nested_scopes
  • bindings_in_nested_scopes

These will serve several purposes:

  • LSP is going to need these for features like "find all references" and "rename".
  • We currently re-implement scope walking in several places, including add_binding, infer_place_load, and ide_support.rs. I don't know whether the performance cost of that matters (maybe not, since scopes aren't usually super deeply nested), but there are a lot of corner cases that it would be nice to unify, like skipping class bodies, nonlocal on top of global, etc.
  • We don't currently consider bindings from sibling/cousin scopes when inferring types, and it would be nice to consider them.

To populate these maps, SemanticIndexBuilder tracks a set of free variables for each scope. (There's an interesting reason this has to be per-scope, see the new comment in struct ScopeInfo.) When popping scopes, it checks to see whether the popped scope resolves any free variables from nested scopes, and whether it creates any new ones. This makes us agnostic to whether the definition or the use comes first, since either way we'll have encountered both by the time we pop the defining scope.

The first/main commit in this PR defines the new maps and populates them. There are a couple of small follow-on commits that make use of the new data:

  • deleting infer_nonlocal(), which is now fully redundant with checks the SemanticIndexBuilder is already doing
  • removing the scope walk from add_binding

Larger changes are still TODO. I could add more to this PR, but I need some help with these bits to understand how things work today and how best to change things:

cc @mtshiba

…c index

This commit adds duplicate errors for invalid `nonlocal` statements,
which breaks some tests. The following commit removes `infer_nonlocal`
from `infer.rs` and unbreaks those tests.
These checks are now handled in the `SemanticIndexBuilder`. Removing
them unbreaks the tests broken in the previous commit.
@oconnor663 oconnor663 added the ty Multi-file analysis & type inference label Aug 2, 2025
@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2025

Diagnostic diff on typing conformance tests

No changes detected when running ty on typing conformance tests ✅

@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2025

mypy_primer results

No ecosystem changes detected ✅

Memory usage changes were detected when running on open source projects
flake8 (https://github.com/pycqa/flake8)
-     memo fields = ~52MB
+     memo fields = ~54MB

@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2025

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

Formatter (stable)

✅ ecosystem check detected no format changes.

Formatter (preview)

✅ ecosystem check detected no format changes.

Copy link
Member

@MichaReiser MichaReiser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this

What would help me to review this PR is to extend the PR summary with more details on what the new data is that we collect and why it is essential that we do this during semantic indexing.

I'm asking because we build the semantic index for every file in completeness, even for third-party files (eagerly) and we keep it in memory forever. The semantic index is also by far the largest query result and we've made various efforts to reduce its size (and have more planned). This raises the questions if it would be better if this computation is a separate lazy query (should be sufficient for rename, or it doesn't even have to be a query), recomputing, in fact, is fine (the performance profiles on this PR suggest a small regression). The part that's the least clear to me is that we want to consider those bindings for type infernece. I don't understand the use case enough to judge if there isn't an alternative or if upfront collection is indeed the only way to go.

The good news is: This PR doesn't regress memory usage that much. I ran it on a large repository and it only increases the semantic index size by 216 MB or ~3% (but that's still more than what an optimization like #19572 gives us back). So maybe that's fine and maybe there's a way we can get this down if we can avoid some of the hash maps or inner vecs

I haven't been involved in this work before. So I'm sorry if this is something that you discussed at length with @carljm and @AlexWaygood. If that's the case, feel free to disregard this.

Comment on lines +128 to +146
// The resolutions of variables that are either used-but-not-defined or explicitly marked
// `global` or `nonlocal` in this scope. These (keys) are similar to what CPython calls "free"
// variables, except that we also include variables marked `global`.
pub(super) definitions_in_enclosing_scopes:
FxHashMap<ScopedSymbolId, (FileScopeId, ScopedSymbolId)>,

// The inverse of `definitions_in_enclosing_scopes` above: variables defined in this scope,
// and not marked `global` or `nonlocal` here, which are used or bound in nested scopes.
// These (keys) are similar to what CPython calls "cell" variables, except that this scope
// may also be the global scope. We also include nested scopes that only mark a variable
// `global` or `nonlocal` but don't touch it after that. Those aren't very useful, but IDE
// features like "rename all" will still want to know about them.
pub(super) references_in_nested_scopes:
FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>,

// A subset of `references_in_nested_scopes` above: variables defined in this scope, and not
// marked `nonlocal` or `global`, which are bound in nested scopes.
pub(super) bindings_in_nested_scopes:
FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>,
Copy link
Member

@MichaReiser MichaReiser Aug 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to store this information directly on Symbol, given that these maps are indexed by ScopedSymbolId. Or are we using a hash map because only few symbols in a scope need this information?

Comment on lines +143 to +144
// A subset of `references_in_nested_scopes` above: variables defined in this scope, and not
// marked `nonlocal` or `global`, which are bound in nested scopes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a subset. Could we store an extra flag in references_in_nested_scopes that tells them apart. This reduces the number of Vecs and hash maps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thinking was that most variables aren't bound in nested scopes, but many are used, and I wanted it to be quick to ask "do we actually need to look at any of these references?" during typechecking. But now that you have me thinking more about storage overhead, maybe the best thing would be to just allocate another bit in SymbolFlags called something like IS_BOUND_IN_NESTED_SCOPE?

// `global` or `nonlocal` but don't touch it after that. Those aren't very useful, but IDE
// features like "rename all" will still want to know about them.
pub(super) references_in_nested_scopes:
FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You Could use a SmallVec here. This would allow us to store 4 scoped symbol ids inline.

// `global` or `nonlocal` in this scope. These (keys) are similar to what CPython calls "free"
// variables, except that we also include variables marked `global`.
pub(super) definitions_in_enclosing_scopes:
FxHashMap<ScopedSymbolId, (FileScopeId, ScopedSymbolId)>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be enough to just store FileScopeId as the value here, rather than (FileScopeId, ScopedSymbolId)?
Because, for instance, the recorded values are used here, but we can get the PlaceTable of the enclosing scope from enclosing_scope_id (using self.index.place_table()), and then get the ScopedSymbolId with PlaceTable::symbol_id.

// `global` or `nonlocal` but don't touch it after that. Those aren't very useful, but IDE
// features like "rename all" will still want to know about them.
pub(super) references_in_nested_scopes:
FxHashMap<ScopedSymbolId, Vec<(FileScopeId, ScopedSymbolId)>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason as above, wouldn't it be enough to just store FileScopeId as the value for the following maps?


struct FreeVariable<'ast> {
scope_id: FileScopeId,
symbol_id: ScopedSymbolId,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the same reason as described below (symbol.rs), it seems that this ScopedSymbolId is redundant (it can be obtained from the PlaceTable of the nested scope).

@oconnor663
Copy link
Contributor Author

oconnor663 commented Aug 4, 2025

I haven't been involved in this work before. So I'm sorry if this is something that you discussed at length with @carljm and @AlexWaygood. If that's the case, feel free to disregard this.

@carljm and I discussed it, but I can't say it was at length :) My impression was that he didn't want "find all references" / "rename all" to require walking all scopes. But now that you mention it, my understanding of the problem isn't very good: I don't know whether that's because we expect it to perform poorly, or whether we don't want the LSP to be doing complicated things, or what? Carl could you clarify? (I wonder if part of the problem with doing a big walk is just that it would get invalidated frequently?)

The part that's the least clear to me is that we want to consider those bindings for type infernece.

In my head it would be nice to handle cases like this:

x = None
def f():
    global x
    x = 42
reveal_type(x);  # revealed: `None` (should be `None | Literal[42]`?)

I know I've seen cases like that come up in ecosystem analysis results for nonlocal changes, but I don't know how common any of it is. @AlexWaygood I think you've expressed some skepticism about whether we'd want do analyze these? (Related TODOs in mdtests: this one and these ones.)

@carljm
Copy link
Contributor

carljm commented Aug 5, 2025

In our initial discussions around this PR, I was mostly thinking about the bindings_in_nested_scopes piece of it, which seems the most beneficial to me.

On the "benefit" side, it was primarily motivated by a) goto-definition including nested definitions (and sibling definitions, if you goto-definition on a nested reference!) without having to re-walk all nested scopes, which @UnboundVariable mentioned as a significant TODO in the goto-definition PR, and b) considering these nested definitions in type inference, which I had originally thought maybe we wouldn't do, but is definitely nicer / more accurate if we can do it, for scenarios like the one shown by @oconnor663 (and similar ones with nonlocal), and if we track this information for goto-definition anyway, then it is not too hard to do.

On the "cost" side, my assumption had been that use of nonlocal and global is rare enough that this wouldn't be a significant extra memory cost, if we pay some attention to optimizing our storage. (And this also makes it a good tradeoff to do up-front when we are already walking the AST anyway, since it is cheap to store relatively rare information, but would be expensive to always re-walk all nested/sibling scopes later on, even though by doing so we are unlikely to discover any nested global or nonlocal statements referencing a given symbol.)

I'm less clear on the cost-benefit analysis for references_in_nested_scopes and definitions_in_enclosing_scopes. Both of these will be much more commonly populated (nested references are more common than nonlocal and global bindings), therefore have a higher storage cost.

The latter can avoid the need to re-walk scope ancestors respecting Python scoping rules (which, as the PR description notes, we do, with some code redundancy, in a few different places). This could be a nice code simplification, but I'm not sure if it is a good tradeoff performance-wise or not. Walking just ancestor scopes is cheaper than walking all sibling/nested scopes.

The former seems mostly useful for rename-all and find-all-references. But these are features where I think users will accept a bit slower response time, and features that are less frequently used (relative to e.g. core type inference), so it's less clear to me that it's worth paying a cost in every semantic indexing just in order to speed up these features.

If it's not too much work (I haven't reviewed the PR in detail yet), I think it would be ideal to split these, and initially land a diff adding only bindings_in_nested_scopes, evaluate its memory cost, and then also make use of it to fix the TODOs in goto-definition, and consider nested bindings in type inference. And then separately evaluate the tradeoffs of the other two maps.

@oconnor663
Copy link
Contributor Author

Closing in favor of #19820.

@oconnor663 oconnor663 closed this Aug 8, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ty Multi-file analysis & type inference

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants