Faster, incremental type checking #932
I think this can definitely wait, though. You can at least hope that the core dependency cycles are relatively static compared to the stuff at the front. I don't think using JSON is that great an idea; just using a binary format is better. These things have no need to be human-readable. Automatic serialization could definitely be a possibility, and it's probably easier to have less boilerplate with a non-JSON format (especially in the context of Unicode).
We could easily use something like BSON (http://bsonspec.org/) or any number of binary formats. However, JSON has the benefit of not requiring any third-party packages, and it is probably efficient enough. If not, we can easily replace it with something else. Whatever format we end up using, we should have a tool for dumping the contents of a serialized file in a human-readable form, such as pretty-printed JSON, for easier debugging.
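A minimal sketch of what such a dump tool might look like, assuming the cache blobs are stored as plain JSON files (the script itself and its invocation are hypothetical, not part of mypy):

```python
#!/usr/bin/env python3
"""Hypothetical cache-dump tool: pretty-print serialized cache files.

Assumes the cache is plain JSON; for a binary format (e.g. BSON),
only the load step would need to change.
"""
import json
import sys

def dump(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        blob = json.load(f)
    # Pretty-print with sorted keys so dumps from different runs diff cleanly.
    json.dump(blob, sys.stdout, indent=2, sort_keys=True)
    sys.stdout.write("\n")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        dump(path)
```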
Why not just use pickle?
pickle is brittle and slow; see http://www.benfrederickson.com/dont-pickle-your-data/ for example.
Instead of generating modules corresponding to stubs, can we serialize the contents of the semantically analyzed tree? I think this is the same thing we would want to transfer between processes in the parallelization scenarios.
We'd serialize the symbol table and some AST nodes that are included in the symbol table. We'd do the serialization after type checking, so that inferred types are available and we don't need to run type checking on the deserialized symbol table. There would be one serialized file per module. The serialization format would include things like these (this might not be 100% right):
These things would not be included in the serialized data, as they can't affect the public interface of a file:
Deserialization would work like this:
Plausible implementation plan:
This explanation is still not very detailed. Feel free to ask questions about things that are unclear.
I've done a bunch of research and thinking on how this would fit into the existing code. Here are some raw notes (mostly just me thinking aloud):
Deciding whether to use serialized data or not, and how:
Thoughts about where to store the blobs:
[I better save this before I lose my browser window.]
For definiteness I'm going to try the following storage scheme.
The schema for a meta blob could be something like this:

```
{
    "id": "foo",
    "path": "/Users/guido/foo.py",
    "mtime": 1234567890.123,  # POSIX timestamp in seconds
    "size": 123,
    "data_mtime": 1234567890.456,
    "dependencies": ["requests", "requests.packages"],
    "flags": []  # E.g. ["--implicit-any", "--py2"]
}
```
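A minimal sketch of how such a meta blob could be used to decide whether the cached data is still valid; the function name and error handling here are hypothetical, not mypy's actual API:

```python
import json
import os
from typing import List

def is_meta_fresh(meta_path: str, current_flags: List[str]) -> bool:
    """Cached data for a module is usable only if the source file is
    unchanged and the same flags were in effect when it was written."""
    try:
        with open(meta_path, encoding="utf-8") as f:
            meta = json.load(f)
        st = os.stat(meta["path"])
    except (OSError, ValueError, KeyError):
        return False  # Missing/corrupt meta, or vanished source: recheck.
    return (st.st_mtime == meta["mtime"]
            and st.st_size == meta["size"]
            and meta["flags"] == current_flags)
```

A real check would also have to verify, transitively, that every module listed in "dependencies" has fresh cached data of its own.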
Sounds good! Some notes:
I've got an initial iteration of the load/save infrastructure working. This adds a few more 'State' subclasses that represent files for which cached data is present in various stages of loading. Adding additional passes, if needed, should now be straightforward. However, I haven't really gotten to the [de]serialization of MypyFile; I haven't yet managed to follow exactly what's stored in the symbol table for the various values of 'kind', or how much of that is used after a module has been fully type-checked. So that's the next project. It seems that symbol table nodes are mutable; is this used even after the module itself has been type-checked?
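A minimal sketch of the idea of per-module load states as subclasses; the class names below are hypothetical, not the actual State subclasses in build.py:

```python
class State:
    """Hypothetical per-module build state."""
    def __init__(self, module_id: str) -> None:
        self.id = module_id

    def next_state(self) -> "State":
        raise NotImplementedError

class MetaLoaded(State):
    """The meta blob was read and looks fresh; data not yet loaded."""
    def next_state(self) -> "State":
        return DataLoaded(self.id)

class DataLoaded(State):
    """The data blob was deserialized; cross refs not yet patched."""
    def next_state(self) -> "State":
        return Patched(self.id)

class Patched(State):
    """Cross-module references fixed up; module ready for use."""
    def next_state(self) -> "State":
        return self

# Adding another pass amounts to inserting another subclass in the chain.
```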
I've finally found a reasonable rhythm for doing the (de)serialization. Each class that needs to be serialized has a method named serialize() that returns a JSON object (a Python dict, not a string) and a matching class method named deserialize() that takes the corresponding JSON object. By convention the JSON object has a ".class" key whose value is the class name (e.g. "SymbolTableNode"). The other keys are typically serialized versions of the constructor arguments and/or some of the important attributes. Sometimes a shortcut or special case is used to make the output more compact or to avoid cycles (e.g. for SymbolTableNode with kind=MODULE_REF, just the full name of the referenced module is returned; for simple variable declarations I intend to return just the full name of the type). Having the serialize() and deserialize() methods next to each other helps to implicitly document the serialization format for each class and makes it easy to keep them in sync when the format changes (which it will, a lot, during initial development). For deserializing classes that have meaningful subclasses (e.g. Type, SymbolNode) there's a deserialize() class method on the base class that looks at the ".class" key and then dispatches to the appropriate subclass's deserialize() method.

There are many challenges ahead but this approach makes it easy to tackle them one class at a time.
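A minimal sketch of the convention just described, using hypothetical toy classes (mypy's real node classes are of course much richer):

```python
import json
from typing import Any, Dict

JsonDict = Dict[str, Any]

class SymbolNode:
    """Base class. deserialize() dispatches on the '.class' key."""

    def serialize(self) -> JsonDict:
        raise NotImplementedError

    @classmethod
    def deserialize(cls, data: JsonDict) -> "SymbolNode":
        # Look up the concrete class named by '.class' and delegate.
        return _registry[data[".class"]].deserialize(data)

class Var(SymbolNode):
    """A toy variable declaration, serialized as just its name and the
    full name of its type, per the shortcut described above."""

    def __init__(self, name: str, type_fullname: str) -> None:
        self.name = name
        self.type_fullname = type_fullname

    def serialize(self) -> JsonDict:
        # Kept adjacent to deserialize() so the format documents itself.
        return {".class": "Var", "name": self.name, "type": self.type_fullname}

    @classmethod
    def deserialize(cls, data: JsonDict) -> "Var":
        assert data[".class"] == "Var"
        return cls(data["name"], data["type"])

# Maps '.class' values to concrete classes for dispatch.
_registry: Dict[str, Any] = {"Var": Var}

if __name__ == "__main__":
    blob = json.dumps(Var("x", "builtins.int").serialize())
    node = SymbolNode.deserialize(json.loads(blob))
    assert isinstance(node, Var) and node.type_fullname == "builtins.int"
```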
Once a module has been type checked, the symbol table should remain immutable. If it gets mutated, that's a bug. The design of serialize() and deserialize() sounds reasonable. One of the next interesting issues is the detailed design of the serialization format. Classes and functions are probably the trickiest. For a function, we could potentially store a serialized callable type and use that to construct a function definition node.
For anyone reading this, I am in the middle of an implementation. I am reluctant to publish my working branch because it's full of debugging code. However, I've made a lot of progress in the areas of serializing and deserializing nodes and types. I've also added several State subclasses (in build.py) to track the state of modules for which cached data is available. These also deal with waiting for dependent modules in order to patch up cross-module references. I've also got the framework for actually reading and writing JSON working nicely (and I've got the feeling that we'll be getting a nice speed-up). The next big issue is fixing up various cross-references; for this I am still trying to understand exactly what kinds of cross-references exist and how many passes it may take to fix them all up.
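A minimal sketch of one way such a cross-reference fixup could work; the class and function names are hypothetical, not the actual build.py design:

```python
from typing import Dict, List

class CachedModule:
    """Hypothetical deserialized module: 'exports' are the nodes it
    defines, 'refs' map local aliases to full names defined elsewhere."""
    def __init__(self, fullname: str, exports: Dict[str, object],
                 refs: Dict[str, str]) -> None:
        self.fullname = fullname
        self.exports = exports
        self.refs = refs
        self.resolved: Dict[str, object] = {}

def fixup(modules: List[CachedModule]) -> None:
    # First build a global full-name -> node table from all modules.
    table: Dict[str, object] = {}
    for mod in modules:
        for name, node in mod.exports.items():
            table[mod.fullname + "." + name] = node
    # Then patch every cross reference. With all modules loaded up front,
    # one pass suffices; if modules were loaded lazily, this loop would
    # have to repeat as new modules (and their exports) become available.
    for mod in modules:
        for alias, target in mod.refs.items():
            if target not in table:
                raise RuntimeError("unresolved reference %s -> %s"
                                   % (alias, target))
            mod.resolved[alias] = table[target]

# Example: module 'b' refers to 'a.f'.
a = CachedModule("a", {"f": object()}, {})
b = CachedModule("b", {}, {"f": "a.f"})
fixup([a, b])
assert b.resolved["f"] is a.exports["f"]
```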
This will land in 0.3.2; see #1292.
So #1292 has landed. The situation is now as follows:
For some follow-up work, see #1349 (add tests), #1350 (check which files we rechecked). |
Mypy should cache type checking results from previous runs to speed up subsequent runs that contain only small code changes. Between two successive runs, the vast majority of the code in a large project typically stays untouched, so we could save a lot of work.
I've started doing some very early prototyping. This is probably quite a lot of work and won't happen very soon, but this is a critical feature for larger projects (with more than 100k LOC, say).
Some random ideas I currently have:
It's not obvious how much this could help, but some back-of-the-napkin math suggests we might get, say, 10-20x the type checking performance in cases where almost everything can be cached from previous runs.
Large module dependency cycles are one scenario where this might not be too helpful. Maybe we can figure out a way to break these dependency cycles, to avoid having to fully process every module in a cycle whenever any one of them changes.