Faster, parallel type checking #933

Open
JukkaL opened this issue Oct 15, 2015 · 9 comments

@JukkaL
Collaborator

JukkaL commented Oct 15, 2015

If we have a sufficiently large program to type check we should be able to speed up type checking significantly by using multiple type checker processes that work in parallel.

One potential way to implement this:

  • Have a coordinator process that maintains a module dependency graph and keeps track of a work list of modules that haven't been type checked. Initially this will contain all modules in the program.
  • Have multiple worker processes that repeatedly get modules (or sets of cyclically dependent modules) whose dependencies have already been processed from the coordinator, type check them, and report them back to the coordinator as processed.

For this to really work we are going to need a way of efficiently communicating type check results for a module between processes (to avoid type checking shared dependencies multiple times). Having a JSON serialization format (see #932) would be sufficient.

Additionally we need a quick way of figuring out the dependency graph of all modules (or at least an approximation of it). We'll probably have to cache that between runs and update it incrementally, similar to the approach outlined in #932.

So how much would this help? Obviously this depends on the shape of the dependency graph. Under reasonable assumptions we should be able to hit an effective level of parallelism of at least 4 for larger programs, but I wouldn't be surprised if we could do even better than that. Cyclic module dependencies limit how far we can parallelize. We can probably estimate the achievable level of parallelism for a particular program by analyzing the module dependency graph.
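
To make that last point concrete, here is a rough sketch (purely illustrative; the graph below is made up) of estimating an upper bound on parallelism from the import graph once cycles have been collapsed into single units: with unlimited workers, wall-clock time is bounded by the critical path, so the ratio of total units to critical path length bounds the effective parallelism.

```python
from graphlib import TopologicalSorter

# Made-up import graph, already condensed so each cycle is a single unit.
# Maps each unit to the units it depends on.
deps = {
    "app": {"models", "views", "api"},
    "views": {"models", "utils"},
    "models": {"utils"},
    "api": {"utils"},
    "utils": set(),
}

def critical_path_length(deps: dict) -> int:
    """Length of the longest chain of units that must be checked serially."""
    depth = {}
    for unit in TopologicalSorter(deps).static_order():
        depth[unit] = 1 + max((depth[d] for d in deps.get(unit, ())), default=0)
    return max(depth.values(), default=0)

units = len(deps)
path = critical_path_length(deps)
print(f"{units} units, critical path {path}, parallelism <= {units / path:.1f}")
```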

This is probably only worth implementing after we have incremental type checking (#932), and we should preserve incrementalism -- i.e., we'd only type check modules modified since the last run and modules that depend on them.

@jhance
Collaborator

jhance commented Oct 17, 2015

To give an idea of the scope of this, I think it would be useful to know precisely what data needs to be shared.

Is it enough to give the slave processes the semantically analyzed trees of the modules they depend on?

Idea: in that case we could change the current build process so that, instead of directly processing the next ready state, it finds states that are ready and adds them to a queue, from which they get assigned to slave processes. Slave processes would be sent the semantically analyzed trees of the modules they depend on before being asked to run a pass on a particular tree, and would then send back the resulting tree (or whatever data is needed).

@JukkaL
Collaborator Author

JukkaL commented Oct 17, 2015

I explained a possible serialization format in #932. We should be able to use the same format to share data about modules in parallel builds.

A slave process could work like this, I think:

  • Get a (cyclically dependent) set of files to type check from the coordinator process. For files with no cyclic dependencies the set will have just a single file. Exit if no more files are remaining in the work list at the coordinator. The get operation may block, in case the coordinator needs to wait for some other workers to finish their current jobs first.
  • If none of the files are more recent than cached serialized data about the files, just do nothing and skip to the last step. Alternatively, the coordinator can do this check and only send files that need to be processed.
  • Type check the files exactly like we'd do it in the single-process case. In particular, process dependencies normally. We are guaranteed that the dependencies all have up-to-date serialized data available, so they will be quick to process. Send any type errors to the coordinator so that it can serialize output.
  • Store serialized data about all the files that we just processed on the disk so that other workers can reuse our results.
  • Notify coordinator that we're finished with these files. Loop back to beginning. Maintain state so that we only have to process each dependency at most once per worker.
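
A minimal sketch of that loop, just to make the shape concrete (the coordinator protocol, the type_check callable and the cache layout are placeholders invented for illustration, not mypy internals):

```python
import json
import os

def worker_loop(coordinator, type_check, cache_dir):
    """Hypothetical slave process: repeatedly type check units of cyclically
    dependent files handed out by the coordinator, and publish serialized
    results on disk so other workers can reuse them."""
    while True:
        unit = coordinator.get_unit()        # may block; None means the work list is empty
        if unit is None:
            break
        if all(cache_is_fresh(path, cache_dir) for path in unit):
            coordinator.mark_done(unit)      # nothing changed, skip to the last step
            continue
        # Type check exactly as in the single-process case; dependencies are
        # guaranteed to have up-to-date serialized data, so they load quickly.
        errors, module_data = type_check(unit, cache_dir)
        coordinator.report_errors(unit, errors)
        for path, data in module_data.items():
            write_cache(path, data, cache_dir)   # let other workers reuse our results
        coordinator.mark_done(unit)

def cache_path(path, cache_dir):
    return os.path.join(cache_dir, path + ".data.json")

def cache_is_fresh(path, cache_dir):
    cached = cache_path(path, cache_dir)
    return os.path.exists(cached) and os.path.getmtime(cached) >= os.path.getmtime(path)

def write_cache(path, data, cache_dir):
    cached = cache_path(path, cache_dir)
    os.makedirs(os.path.dirname(cached), exist_ok=True)
    with open(cached, "w") as f:
        json.dump(data, f)
```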

I'm less clear about how the coordinator would work. Some ideas:

  • First incrementally update module dependency information for all modules, and cache that in separate JSON files, for example. For each file we'd maintain a list of directly imported modules. This data needs to be invalidated for all files whenever the module search path changes. Maybe we can use the standard library Python parser module to find imports in each module, as the mypy parser is pretty slow and we don't want this to become a bottleneck.
  • Load and analyze the module dependency graph to at least find sets of cyclically dependent files. Again, this needs to be quick. From this point on, every unit of type checking is a set of cyclically dependent files.
  • Divide units into four mutually exclusive states: 1) unprocessed units that have unprocessed dependencies, 2) unprocessed units that are ready to be type checked, 3) units being processed by a worker, 4) processed units.
  • Fire up N workers.
  • Send units in state 2 to workers and change them to state 3. If there are no units in state 2 but there are still units in state 1, we may have to wait until a unit moves to state 2. If there are no units in states 1 or 2, we can exit any workers that have nothing left to do.
  • Once a worker finishes with a unit, switch it to state 4. Potentially move some units from state 1 to 2 if all their dependencies are now processed.
  • We are finished when every unit is in state 4.
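
Roughly, the coordinator bookkeeping could look something like the sketch below (illustrative only; the ast-based import scan just stands in for "use a fast parser to find imports", and a real implementation would also handle the module search path, SCC computation, caching, and worker shutdown):

```python
import ast
from enum import Enum
from queue import Queue

class State(Enum):
    WAITING = 1      # unprocessed, still has unprocessed dependencies
    READY = 2        # unprocessed, all dependencies processed
    IN_PROGRESS = 3  # handed to a worker
    DONE = 4         # processed

def find_imports(path):
    """Cheaply find directly imported module names with the stdlib parser."""
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module)
    return imports

class Coordinator:
    """Tracks units (SCCs of modules) through the four states and hands
    READY units out to workers."""

    def __init__(self, deps):
        # deps maps each unit to the units it depends on; it would be built
        # by running find_imports over all files and condensing cycles.
        self.deps = deps
        self.state = {u: State.WAITING for u in deps}
        self.ready = Queue()
        self._promote()

    def _promote(self):
        # Move units whose dependencies are all DONE from WAITING to READY.
        for unit, unit_deps in self.deps.items():
            if self.state[unit] is State.WAITING and all(
                self.state[d] is State.DONE for d in unit_deps
            ):
                self.state[unit] = State.READY
                self.ready.put(unit)

    def get_unit(self):
        unit = self.ready.get()              # blocks until a unit is READY
        self.state[unit] = State.IN_PROGRESS
        return unit

    def mark_done(self, unit):
        self.state[unit] = State.DONE
        self._promote()                      # dependants may now be READY

    def all_done(self):
        return all(s is State.DONE for s in self.state.values())
```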

gvanrossum added this to the Future milestone Apr 8, 2016
gvanrossum removed this from the Future milestone Mar 29, 2017
@fzyzcjy

fzyzcjy commented Dec 15, 2022

Hi, are there any updates? Thanks.

@manmartgarc

This would be really cool. Is there any way I can help?

@JukkaL
Collaborator Author

JukkaL commented Apr 25, 2023

@manmartgarc Any help would be very welcome! I'm happy to chat about the details over Zoom if you are interested in working on this (you can find my email in the mypy commit history).

@jgarvin

jgarvin commented Jan 20, 2024

Currently I invoke mypy on all my files like mypy $(find . -name "*.py"). It's starting to get a bit slow so I found this ticket. Until real support is added, is it expected to work if I run two separate mypy instances myself at the same time on different files? Then I could still parallelize at the build system level with make or similar. I realize this may lead to some redundant work but it might still come out ahead. I'm mostly concerned that the instances might conflict both trying to write to mypy's cache.

@aiven-anton

@jgarvin In my experience, running mypy on only part of a codebase sometimes gives different results from a full mypy run. However, I believe the intention is for this to work, and any deviation is really a bug. I assume you would need to use separate cache directories so the processes don't corrupt each other's caches (not sure this is true, though).

Also, just noting: when doing a full mypy run there should usually be no need to filter files with find; you can just point mypy at your source directory and it will traverse it.
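
For what it's worth, coarse build-level parallelism along those lines can be sketched with mypy's public API and a separate --cache-dir per shard (the package names below are placeholders, and as noted above a sharded run may not exactly match a single full run):

```python
from concurrent.futures import ProcessPoolExecutor
from mypy import api

# Placeholder shards: each gets its own mypy invocation and its own cache
# directory so the parallel runs cannot clobber each other's caches.
shards = [
    ["pkg_a/", "--cache-dir", ".mypy_cache_a"],
    ["pkg_b/", "--cache-dir", ".mypy_cache_b"],
]

def run_shard(args):
    # mypy.api.run returns (stdout, stderr, exit_status).
    return api.run(args)

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        worst = 0
        for stdout, stderr, status in pool.map(run_shard, shards):
            print(stdout, end="")
            worst = max(worst, status)
        raise SystemExit(worst)
```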

@jgarvin

jgarvin commented Jan 22, 2024

Hmm, the mypy cache on a relatively small project (<20k lines) is ~40MB. So on a 128-core machine I'd be spending roughly 5GB of disk on mypy caches, which in the grand scheme of things is not huge, but I have to imagine touching that much disk must slow down the checking. I looked at the caches to see what was responsible for the size, and it seems it's a ton of JSON files, which is not a very compact format. Maybe pickle would work better, or SQLite.

@dosisod
Contributor

dosisod commented Jan 22, 2024

@jgarvin #15731 and #15981 include some methods for reducing the size of the cached JSON files. I mentioned it in the comments somewhere, but pickling might take up more space, since we only store certain fields in the JSON files, whereas pickling could include data we don't need/want (I have yet to look into this).

Mypy does have a SQLite cache option, though it basically just stores the JSON data and filename in a table; see #3456 (comment).
