Proposed refactoring of mypy/build.py (and cache metadata) #4365

Closed
gvanrossum opened this issue Dec 14, 2017 · 5 comments

Comments

@gvanrossum
Member

gvanrossum commented Dec 14, 2017

Over the years, build.py has become a dumping ground of all things having to do with module dependencies and caching, making it the 4th largest file in mypy. Prompted by #4353 I think it's time to refactor build.py.

One particular idea I'd like to focus on is the distinction between imports and dependencies. Imports are determined (almost) purely syntactically by pass one of the semantic analyzer, and have priorities. Dependencies include indirect dependencies, and each one is associated with the interface hash of the depended-upon module (once available). Dependencies are seeded from the imports, minus missing modules, in the load_graph() phase. After type checking of the SCC they are extended with indirect dependencies (computed, as always, by TypeIndirectionVisitor). Both tables (imports and full dependencies) are then written to the cache metadata, together with a bit representing the presence of errors in this particular module.
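To make the two tables concrete, here is a minimal sketch of what the per-module metadata could look like. The names `ModuleMeta`, `seed_dependencies`, `add_indirect` and `record_interface_hashes` are hypothetical and chosen for illustration; this is not mypy's actual cache format or API.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Mapping, Optional, Set


@dataclass
class ModuleMeta:
    """Hypothetical per-module cache metadata (not mypy's real format)."""
    module_id: str
    # Import table: imported module id -> priority.
    imports: Dict[str, int] = field(default_factory=dict)
    # Dependency table: module id -> interface hash (None until available).
    dependencies: Dict[str, Optional[str]] = field(default_factory=dict)
    # The "error bit": True if this module had errors.
    has_errors: bool = False


def seed_dependencies(meta: ModuleMeta, missing: Set[str]) -> None:
    """Seed dependencies from the imports, minus missing modules."""
    for mod in meta.imports:
        if mod not in missing:
            meta.dependencies.setdefault(mod, None)


def add_indirect(meta: ModuleMeta, indirect: Iterable[str]) -> None:
    """Extend dependencies with indirect ones found after type checking."""
    for mod in indirect:
        if mod != meta.module_id:
            meta.dependencies.setdefault(mod, None)


def record_interface_hashes(meta: ModuleMeta, hashes: Mapping[str, str]) -> None:
    """Attach interface hashes to dependencies once they are known."""
    for mod in meta.dependencies:
        if mod in hashes:
            meta.dependencies[mod] = hashes[mod]
```

In this sketch the import table keeps priorities while the dependency table keeps hashes, so the two can be written to the metadata side by side without one overwriting the other.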

A module for which a cache file exists is then considered fresh (no need to process) if all of the following hold:

  • The source hash computed from the source matches the source hash in the cache (or the mtime+size matches, which is an acceptable proxy)
  • The error bit is off
  • For every dependency, the computed interface hash matches the cached interface hash

For SCCs this needs to be tweaked somewhat -- dependencies within the SCC don't count, and the condition must hold for every module in the SCC. (There are other tweaks needed to account for changed options and changes in the "library path".)
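The freshness rule above, including the SCC tweak, could be sketched roughly as follows. `CachedMeta`, `module_is_fresh` and `scc_is_fresh` are hypothetical names, hashes are plain strings, and the checks for changed options and library path are left out.

```python
from dataclasses import dataclass, field
from typing import AbstractSet, Dict, Iterable, Mapping


@dataclass
class CachedMeta:
    """Hypothetical cached metadata for one module."""
    source_hash: str
    has_errors: bool
    # Dependency id -> interface hash recorded when the cache was written.
    dep_hashes: Dict[str, str] = field(default_factory=dict)


def module_is_fresh(
    meta: CachedMeta,
    current_source_hash: str,
    current_interface_hashes: Mapping[str, str],
    ignore: AbstractSet[str] = frozenset(),
) -> bool:
    """Apply the three conditions; deps listed in `ignore` don't count."""
    if meta.source_hash != current_source_hash:
        return False
    if meta.has_errors:
        return False
    for dep, cached_hash in meta.dep_hashes.items():
        if dep in ignore:
            continue
        # A dependency with no known current hash is treated as changed.
        if current_interface_hashes.get(dep) != cached_hash:
            return False
    return True


def scc_is_fresh(
    scc: Iterable[str],
    metas: Mapping[str, CachedMeta],
    source_hashes: Mapping[str, str],
    interface_hashes: Mapping[str, str],
) -> bool:
    """Every module in the SCC must be fresh, ignoring intra-SCC deps."""
    members = frozenset(scc)
    return all(
        m in metas
        and m in source_hashes
        and module_is_fresh(metas[m], source_hashes[m], interface_hashes, ignore=members)
        for m in members
    )
```

Note that nothing here compares mtimes of cache data files; only hashes and the error bit are consulted.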

One benefit of this algorithm is that we no longer have to depend on linear mtimes for cache data files to compute freshness. Another is that we may be able to skip processing modules even if there are errors upstream, as long as those errors don't affect the interface hash.

Other things to refactor include the "stat cache" that's used by find_module(), logging, and the fact that the constructor of the State class does way too much work.

This is a big refactoring and I expect it will take a few weeks at least. But I think it's time to start this operation. [UPDATE: I won't start until January 2018 at the earliest.]

@carljm
Member

carljm commented Dec 14, 2017

There is also #4277 in-flight for supporting PEP 420 namespace packages, which extensively reworks the import parts of build.py.

@gvanrossum
Member Author

gvanrossum commented Dec 14, 2017 via email

@gvanrossum
Member Author

I'm also going to try to wait for #4278 to land.

@emmatyping
Member

I think it would also be nice to keep in mind (but perhaps not change in the same diff) that abstracting the build from "it runs on files" to "it runs on one or more file-like objects" has many benefits.

Allowing StringIO as a way to tell mypy "type check this text, and only this text" is a nice way to a) speed up tests and b) make editor integrations easier.
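As a rough illustration of the idea (not an existing mypy entry point), a build step could accept either a path or a file-like object. The helper name `read_source` is hypothetical.

```python
import io
from typing import TextIO, Union


def read_source(source: Union[str, TextIO]) -> str:
    """Return program text from a path or from any file-like object.

    Hypothetical helper: a str is treated as a filesystem path,
    anything else is assumed to provide a .read() method.
    """
    if isinstance(source, str):
        with open(source, encoding="utf-8") as f:
            return f.read()
    return source.read()


# A test or an editor plugin can hand over in-memory text directly.
text = read_source(io.StringIO("x: int = 1\n"))
```

With this shape, tests never touch the filesystem and an editor can pass its unsaved buffer as is.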

@emmatyping
Member

I believe this was basically done in #5686.
