
Faster, incremental type checking #932

Closed · JukkaL opened this issue Oct 15, 2015 · 15 comments

JukkaL (Collaborator) commented Oct 15, 2015

Mypy should cache type checking results from previous runs to speed up subsequent runs that only contain small code changes. Between two successive runs the vast majority of the code in a large project typically stays untouched, so we could save a lot of work.

I've started doing some very early prototyping. This is probably quite a lot of work and won't happen very soon, but it's a critical feature for larger projects (with more than 100k LOC, say).

Some random ideas I currently have:

  • Generate a JSON file for each type checked module that contains roughly the same information as a stub file for the module would have (i.e., function signatures, classes, module level variables -- but no function bodies) but in a format that maps very directly to mypy ASTs.
  • If a module and its dependencies haven't changed since the last run, use the JSON file to represent the module and don't parse or type check the module at all.

It's not obvious how much this could help, but some back-of-the-napkin math suggests that we might get, say, 10-20x the type checking performance in cases where almost everything can be cached from previous runs.

Large module dependency cycles are one scenario where this might not be too helpful. Maybe we can figure out a way to break these dependency cycles somehow and to avoid always having to fully process every module in the cycle if any of them changes.

jhance (Collaborator) commented Oct 15, 2015

> Large module dependency cycles are one scenario where this might not be too helpful. Maybe we can figure out a way to break these dependency cycles somehow and to avoid always having to fully process every module in the cycle if any of them changes.

I think this can definitely wait, though. You can at least hope that the core dependency cycles are relatively static compared to the stuff at the front.

I don't think using JSON is that great an idea; I think just using a binary format is better. These things have no need to be human-readable. Automatic serialization could definitely be a possibility, and it's probably easier to have less boilerplate with a non-JSON format (especially in the context of Unicode).

JukkaL (Collaborator, Author) commented Oct 16, 2015

We could easily use something like BSON (http://bsonspec.org/) or any number of binary formats. However, JSON has the benefit of not requiring any third-party packages, and it's probably efficient enough. If not, we can easily replace it with something else.

Whatever format we are going to use, we should have a tool for dumping the contents of a serialized file in a human-readable format such as pretty-printed JSON for easier debugging.
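
For instance, such a dump tool could start out as little more than pretty-printing the JSON blob. A minimal sketch (the cache file path passed in is just wherever the serialized blob ends up; nothing here is a settled layout):

import json
import sys

def dump_cache_file(path: str) -> None:
    # Load the serialized blob (JSON for now; this loader is the only part
    # that would change if we switched to a binary format) and re-emit it
    # in a human-readable form.
    with open(path) as f:
        blob = json.load(f)
    json.dump(blob, sys.stdout, indent=2, sort_keys=True)
    sys.stdout.write("\n")

if __name__ == "__main__":
    dump_cache_file(sys.argv[1])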

refi64 (Contributor) commented Oct 16, 2015

Why not just use pickle? It comes with the Python stdlib, and you don't have to worry about serializing to JSON object structures.

The-Compiler (Contributor) commented:

pickle is brittle and slow - see http://www.benfrederickson.com/dont-pickle-your-data/ for example

jhance (Collaborator) commented Oct 17, 2015

Instead of generating modules corresponding to stubs, can we serialize the contents of the semantically analyzed tree?

I think this is the same thing we would want to transfer between processes in the parallelization scenarios.

JukkaL (Collaborator, Author) commented Oct 17, 2015

We'd serialize the symbol table and some AST nodes that are included in the symbol table. We'd do the serialization after type checking so that we can have inferred types there and don't need to run type checking on the deserialized symbol table.

There would be one serialized file per .py or .pyi file, including stubs. That way stubs could also be processed more quickly. The payoff for stubs would be smaller than for regular files, though, since stubs have no function bodies to strip away.

The serialization format would include things like these (this might not be 100% right):

  • Enough information to reconstruct MypyFile for a file, but defs would be empty.
  • SymbolTable (the names of MypyFile) and enough information to reconstruct at least the SymbolTableNode, FuncDef, Var, TypeInfo and TypeVarExpr objects included in the table. Module references also need to go in. We'd include declared and inferred types.
  • References to other modules and names defined in other modules would be encoded as fully qualified names. These need to be resolved during deserialization.

These things would not be included in the serialized data, as they can't affect the public interface of a file:

  • Most AST nodes for statements, except for definitions like FuncDefs.
  • Most expressions, except maybe for definitions like type variables.
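
As a purely illustrative example of the kind of data described above (the key names here are invented for this sketch, not a proposed format), a module foo with a single function might serialize to something like:

{
    "module": "foo",
    "names": {
        "greet": {
            "kind": "FuncDef",
            "fullname": "foo.greet",
            # Declared/inferred type; cross-module references are stored
            # as fully qualified names and resolved during deserialization.
            "type": {
                "kind": "CallableType",
                "arg_types": ["builtins.str"],
                "ret_type": "builtins.str",
            },
        },
    },
}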

Deserialization would work like this:

  • Optionally, deserialize a set of cyclically dependent modules as a single unit. Alternatively, map the following steps to the build phases/states used in build.py.
  • Deserialize to intermediate Python objects such as nested dictionaries (from JSON).
  • Create MypyFile nodes and populate SymbolTables, but don't bind inter-module name references yet (definitions in the symbol table will have some attributes that are not populated yet). We need this so that the next step can bind all inter-module dependencies no matter in which order we process the files in a set of cyclic modules.
  • After every dependent file has MypyFile and at least a stub SymbolTable, bind all names in things like base class lists and inferred/declared types.
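
A toy version of those two binding passes (everything here is simplified and invented for illustration; the real code would operate on MypyFile and SymbolTable objects):

from typing import Dict

def deserialize_group(blobs: Dict[str, dict]) -> Dict[str, dict]:
    # Pass 1: build a (stub) symbol table for every module in the group,
    # leaving cross-module references as fully qualified name strings.
    modules = {mod_id: dict(blob["names"]) for mod_id, blob in blobs.items()}

    def lookup(fullname: str) -> dict:
        # Resolve a fully qualified name against the now-complete tables.
        mod_id, _, name = fullname.rpartition(".")
        return modules[mod_id][name]

    # Pass 2: every module now has at least a stub symbol table, so
    # references (base classes, declared/inferred types, ...) can be bound
    # no matter in which order the cyclically dependent modules were loaded.
    for names in modules.values():
        for node in names.values():
            if "base" in node:
                node["base"] = lookup(node["base"])
    return modules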

Plausible implementation plan:

  1. Write a serialization prototype that can serialize simple function definitions, type infos and instance types, but doesn't support all features (such as properties or decorators). Run some benchmarks to validate that this would actually speed things up sufficiently, and iterate on the prototype to improve performance, clarity and maintainability.
  2. Write a short spec for the serialization format once we've settled on an approach, and iterate until it looks good.
  3. Figure out a plan for testing this. There's a lot of state information that needs to be preserved and testing it all can be hard unless we are clever about it.
  4. Implement the whole thing and write tests. Run some benchmarks to validate that we've reached our goals.

This explanation is still not very detailed. Feel free to ask questions about things that are unclear.

@gvanrossum gvanrossum self-assigned this Feb 10, 2016
gvanrossum (Member) commented:

I've done a bunch of research and thinking on how this would fit into the existing code.

Here are some raw notes (mostly just me thinking aloud):

  • For serialization we can probably use a specialized NodeVisitor (similar to the one for computing the __str__).
  • For deserialization the visitor pattern can't work (because there's no node tree to walk yet) so it'd probably have to be a classic recursive deserializer.
  • I've not thought much more about JSON vs. something else; but JSON sounds easy and the json module should be relatively fast. The (de)serializers should not deal with the strings but with the object tree that's passed into json.dump() and returned by json.load().
  • It seems the key things to serialize are MypyFile instances and SymbolTableNode instances. The latter also contain Type instances, of which there are many varieties.
  • The decision whether to use previously serialized data (if found) or not should be made in build.py roughly where it currently creates UnprocessedFile instances. This is when taking the initial set of BuildSources and using them to construct the initial list of States for the BuildManager (at the end of build()) and also in import_module(). I'll elaborate on how to decide below.
  • Writing serialized data can happen whenever a new module transitions to TypecheckedFile.
  • There are some command line flags (e.g. --silent) and other environmental things (e.g. lib_path) that can affect the type checker and hence the serialized data. We should make sure not to reuse serialized data from a past run with different flags (however, if all that changed was lib_path and the source was found at the same absolute path that should be okay).

Deciding whether to use serialized data or not, and how:

  • Let's have two separate serialized blobs for each path. One ("meta") should be relatively small and contain everything needed to decide whether to use the other for this path or not. The other ("data") contains the serialized MypyFile object, symbol table etc.
  • The meta blob should contain a tuple of simple fields (id, path, mtime, size) and possibly some flags (TBD), and a list of direct dependencies (module ids, or paths, or both -- TBD).
  • When we first consider a module we take the id (the fully-qualified module name, e.g. 'foo.bar') and the path (the full pathname where we found it using lib_path). For command line arguments we typically start with the path and compute the id; for imports we start with the id and then find the path.
  • We then use this to find a meta blob. If we can't find one we give up.
  • When we have a meta blob, we double check that both id and path match, else we give up.
  • Now stat() the path and extract the mtime and size. If these don't match the meta blob, give up.
  • Also double check that the data blob exists (but don't load it yet).
  • Now get the dependencies from the meta blob.
  • For each (id, path) from the dependencies apply the same process recursively for the transitive dependency graph. If we find any dependency that is out of date or missing, give up. (Use a cache to avoid duplicate work or going around in circles.)
  • If all goes well (all meta blobs are found and still up to date), load all the data blobs, create TypecheckedFile state objects for them, and insert those into the BuildManager.
  • If there's a problem with a data blob at this point we're in trouble. Probably just give up then.
  • When we find a stale or missing meta blob after we've already traversed part of the dependency graph we may still load the data blobs for the subgraph that was loaded successfully, if that subgraph has no dependencies on stale/missing blobs. This avoids going through the same motions again later.
  • There's another possible optimization: load the data blobs lazily. This could avoid doing a bunch of redundant work if some file that needs to be re-analyzed contains an error.
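
A compressed sketch of that recursive freshness walk (the metas mapping here just stands in for however the meta blobs are located and parsed; checking that the matching data blob exists is omitted):

import os
from typing import Dict, Optional

def is_fresh(id: str, path: str,
             metas: Dict[str, Optional[dict]],
             cache: Dict[str, bool]) -> bool:
    # Memoize results so shared dependencies are only checked once and
    # dependency cycles don't recurse forever.
    if id in cache:
        return cache[id]
    cache[id] = True  # Tentative value while we walk a possible cycle.
    meta = metas.get(id)
    fresh = (
        meta is not None
        and meta["id"] == id
        and meta["path"] == path
        and os.path.exists(path)
        and os.path.getmtime(path) == meta["mtime"]
        and os.path.getsize(path) == meta["size"]
        # Every direct dependency must be fresh as well.
        and all(is_fresh(dep_id, dep_path, metas, cache)
                for dep_id, dep_path in meta["dependencies"])
    )
    cache[id] = fresh
    return fresh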

Thoughts about where to store the blobs:

  • We could have separate files containing the meta and data blobs in the same directory as the module (or stub) source code, e.g. using extensions .mypy.meta and .mypy.data. Each could contain JSON.
  • This makes directory listings kind of noisy though.
  • We could make the listing less noisy by using a subdirectory of the directory containing the .py/.pyi files named __mypy__ (echoing Python's __pycache__).
  • Or we could have a single mypy-cache directory in a central place (maybe ~/.mypy_cache).
  • Or we could store everything in sqlite3 -- this might make querying the meta blobs a little simpler (the various fields could be table columns).

[I better save this before I lose my browser window.]

gvanrossum (Member) commented:

For definiteness I'm going to try the following storage scheme.

  • Everything is in a cache directory named .mypy_cache in the current directory.
  • For a module foo.py or a stub foo.pyi, the data is in this cache directory as foo.meta.json and foo.data.json.
  • For a submodule foo.py[i] of a package bar, the data is in bar/foo.{meta,data}.json.
  • There's some ambiguity when foo could be either a package foo or a module foo.py; we resolve this in the same way it's resolved for regular imports (the package wins).
  • For the package bar, the data is in bar/__init__.{meta,data}.json.
  • We'll write the .data.json file first and store its mtime in the .meta.json file, so we can detect whether the meta and data blobs go together.
  • The dependencies in the .meta.json file will use module ids (e.g. 'foo' or 'bar.foo') to identify dependencies. These will be mapped to paths for the actual module source code using lib_path, and to paths for the meta/data blob files using the straightforward mapping above.
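
The id-to-cache-path mapping under that layout is mechanical; roughly (a sketch assuming the .mypy_cache layout just described):

import os
from typing import Tuple

def cache_paths(module_id: str, is_package: bool,
                cache_dir: str = ".mypy_cache") -> Tuple[str, str]:
    # 'bar.foo'     -> .mypy_cache/bar/foo.{meta,data}.json
    # package 'bar' -> .mypy_cache/bar/__init__.{meta,data}.json
    parts = module_id.split(".")
    if is_package:
        parts.append("__init__")
    prefix = os.path.join(cache_dir, *parts)
    return prefix + ".meta.json", prefix + ".data.json"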

The schema for a meta blob could be something like this:

{
  "id": "foo",
  "path": "/Users/guido/foo.py",
  "mtime": 1234567890.123,  # POSIX timestamp in seconds
  "size": 123,
  "data_mtime": 1234567890.456,
  "dependencies": ["requests", "requests.packages"],
  "flags": []  # E.g. ["--implicit-any", "--py2"]
}
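
The data_mtime field is what ties the two blobs together: write the data file first, record its mtime in the meta file, and discard the pair if they no longer agree. A sketch:

import json
import os

def write_cache_pair(meta_path: str, data_path: str,
                     meta: dict, data: dict) -> None:
    # Write the data blob first...
    with open(data_path, "w") as f:
        json.dump(data, f)
    # ...then record its mtime in the meta blob, so a mismatched pair
    # (e.g. from an interrupted run) can be detected later.
    meta["data_mtime"] = os.path.getmtime(data_path)
    with open(meta_path, "w") as f:
        json.dump(meta, f)

def pair_is_consistent(meta: dict, data_path: str) -> bool:
    return (os.path.exists(data_path)
            and os.path.getmtime(data_path) == meta["data_mtime"])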

JukkaL (Collaborator, Author) commented Feb 12, 2016

Sounds good!

Some notes:

  • We'll need to process cached JSON data in at least two passes when deserializing: first collect all definitions, then bind references. Otherwise we can't guarantee that a definition is available when we try to bind a reference to it. We need to run phase 1 on all dependent modules first before we can run phase 2 on a module, similar to how we have multiple semantic analysis passes right now. However, two passes should be enough here.
  • If the list of files/directories given on the command line changes when using --silent, we may have to reanalyze files as there may be more/less type information available for dependent modules. Also, if a new stub gets introduced, we may have to reanalyze files if using --silent. We could perhaps record a special metadata file for each module that has been silently ignored and treated as Any, and consider the module changed if an implementation becomes available in a later run. Similarly, if a module was previously type checked but becomes ignored via --silent, we should consider it modified (for example, a stub file was removed for some reason).
  • Lazy loading: We don't necessarily need to load some unmodified dependencies at all even if they are indirectly imported by a modified file, if none of the modified files import them directly. For example, assume a imports b and b imports c, and b just calls a function in c but doesn't depend on it in any other way. Now if only a has been modified and a doesn't access b.c (which it usually doesn't), we don't need to load data for c at all, giving a further speedup. However, if a node in b has a type defined in c, we have to process c since the type definition lives there.
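
A toy version of that last scenario, as three files:

# c.py
def helper() -> int:
    return 1

# b.py
import c
def f() -> int:        # b only calls into c; no types from c appear in
    return c.helper()  # b's public interface.

# a.py
import b
print(b.f())  # If only a.py changed, we load cached data for b but can
              # skip loading c entirely, since a never reaches into b.c.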

gvanrossum (Member) commented:

I've got an initial iteration of the load/save infrastructure working. This adds a few more 'State' subclasses that represent files for which cached data is present in various stages of loading. Adding additional passes if needed should now be straightforward.

However, I haven't really gotten to the [de]serialization of MypyFile; I haven't managed to follow exactly what's stored in the symbol table for the various values of 'kind', or how much of that is used after a module has been fully type-checked. So that's the next project.

It seems that symbol table nodes are mutable; is this used even after the module itself has been type-checked?

gvanrossum (Member) commented:

I've finally found a reasonable rhythm for doing the (de)serialization. Each class that needs to be serialized has a method named serialize() that returns a JSON object (a Python dict, not a string) and a matching class method named deserialize() that takes the corresponding JSON object.

By convention the JSON object has a ".class" key whose value is the class name (e.g. "SymbolTableNode"). The other keys are typically serialized versions of the constructor arguments and/or some of the important attributes. Sometimes a shortcut or special case is used to make the output more compact or to avoid cycles (e.g. for SymbolTableNode with kind=MODULE_REF, just the full name of the referenced module is returned; for simple variable declarations I intend to return just the full name of the type).

Having the serialize() and deserialize() methods next to each other helps to implicitly document the serialization format for each class and makes it easy to keep them in sync when the format changes (which it will a lot during initial development).

For deserializing classes that have meaningful subclasses (e.g. Type, SymbolNode) there's a deserialize() class method on the base class that looks at the ".class" key and then dispatches to the appropriate subclass's deserialize() method.
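
In code the convention looks roughly like this (a simplified sketch with made-up classes, not the actual mypy node hierarchy):

from typing import Any, Dict

class Node:
    @classmethod
    def deserialize(cls, data: Dict[str, Any]) -> "Node":
        # The base-class method only dispatches on the ".class" key;
        # each concrete subclass defines its own deserialize().
        return _classes[data[".class"]].deserialize(data)

class Var(Node):
    def __init__(self, name: str, type_fullname: str) -> None:
        self.name = name
        self.type_fullname = type_fullname

    def serialize(self) -> Dict[str, Any]:
        # Shortcut: a simple variable stores just the fully qualified
        # name of its type.
        return {".class": "Var", "name": self.name,
                "type": self.type_fullname}

    @classmethod
    def deserialize(cls, data: Dict[str, Any]) -> "Var":
        return cls(data["name"], data["type"])

_classes = {"Var": Var}

# Round trip: Node.deserialize() dispatches to Var.deserialize().
v = Node.deserialize(Var("x", "builtins.int").serialize())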

There are many challenges ahead but this approach makes it easy to tackle them one class at a time.

JukkaL (Collaborator, Author) commented Feb 15, 2016

Once a module has been type checked, the symbol table should remain immutable. If it gets mutated, that's a bug.

The design of serialize() and deserialize() sounds reasonable. One of the next interesting issues is the detailed design of the serialization format. Classes and functions are probably the trickiest. For a function, we could potentially store a serialized callable type and use that to construct a FuncDef instance. This way we don't have to include redundant information about argument names, etc.

gvanrossum (Member) commented:

For anyone reading this, I am in the middle of an implementation. I am reluctant to publish my working branch because it's full of debugging code. However, I've made a lot of progress in the areas of serializing and deserializing nodes and types. I've also added several State subclasses (in build.py) to track the state of modules for which cached data is available. These also deal with waiting for dependent modules in order to patch up cross-module references. I've also got the framework for actually reading and writing JSON working nicely (and I've got the feeling that we'll be getting a nice speed-up).

The next big issue is fixing up various cross references; for this I am still trying to understand exactly what types of cross references exist and how many passes it may take to fix them all up.

@ddfisher ddfisher added this to the 0.4.0 milestone Mar 2, 2016
@gvanrossum gvanrossum modified the milestones: 0.3.2, 0.4.0 Apr 7, 2016
gvanrossum (Member) commented:

This will land in 0.3.2; see #1292.

gvanrossum (Member) commented:

So #1292 has landed. The situation is now as follows:

  • By default the cache is not used
  • To use the cache, pass -i or --incremental
  • The cache lives in .mypy_cache; remove that directory to reset the cache
  • There are separate cache subdirectories per Python version (e.g. 2.7, 3.5)
  • The cache may get confused by --silent-import (-s); reset it when passing a different set of files

For some follow-up work, see #1349 (add tests) and #1350 (check which files we rechecked).
