Add support for parallelization #171

lencioni · 2016-12-19T17:34:09Z

In large projects the happo run may take too long. Although we have some performance improvements in the pipeline (#62), there is a limit to how fast we can make things. We need to provide a way for examples to be split up among multiple machines and have the results aggregated at the end.

There is a good amount of overlap here with #73, so this might make sense to do at the same time.

To enable this, I think we need to do two things:

1. Add options to happo run to only run on a subset of examples and return/output metadata about the run.
2. Add a mechanism to happo to aggregate partial results into a single result.
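
To make the aggregation step a bit more concrete, here is a rough sketch of what merging partial results could look like. Everything in it is an assumption for illustration: the PartialResult shape, the diffExamples/newExamples field names, and the aggregate function are hypothetical and are not part of happo today.

```typescript
// Hypothetical metadata that a single partial run might output.
interface PartialResult {
  shard: number;          // which split point produced this result (1-based)
  diffExamples: string[]; // names of examples whose snapshots changed
  newExamples: string[];  // names of examples seen for the first time
}

// Hypothetical combined result, same fields minus the shard number.
interface AggregateResult {
  diffExamples: string[];
  newExamples: string[];
}

// Merge the partial results from all machines into a single result.
// Throws if a shard is missing, duplicated, or out of range, so that an
// incomplete run is never reported as a clean one.
function aggregate(parts: PartialResult[], total: number): AggregateResult {
  const seen: number[] = [];
  const result: AggregateResult = { diffExamples: [], newExamples: [] };

  for (const part of parts) {
    const outOfRange = part.shard < 1 || part.shard > total;
    if (outOfRange || seen.indexOf(part.shard) !== -1) {
      throw new Error(`Unexpected or duplicate shard ${part.shard}`);
    }
    seen.push(part.shard);
    result.diffExamples = result.diffExamples.concat(part.diffExamples);
    result.newExamples = result.newExamples.concat(part.newExamples);
  }

  if (seen.length !== total) {
    throw new Error(`Expected ${total} partial results, got ${seen.length}`);
  }

  // Sort so the aggregated output does not depend on arrival order.
  result.diffExamples.sort();
  result.newExamples.sort();
  return result;
}
```

The main design point is that the aggregator needs to know how many partial results to expect, which is one reason for each run to output metadata about itself.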

In the interest of keeping the API simple, it seems like the arguments we want are the number of split points (i.e. the number of machines to parallelize across) and the split point to run on. For instance, if you have 4 machines, you would end up calling happo run 4 times, with arguments like happo run 1/4, happo run 2/4, happo run 3/4, and happo run 4/4. Of course, the arguments could use more explicit flags instead, something like: happo run --split=1 --of=2 (naming needs to be improved). This will work as long as the order of examples is always deterministic.
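
As a minimal sketch of the splitting idea, assuming each example has a unique name: sort the names so every machine sees the same order, then give each split point every Nth example. The parseShard and selectExamples helpers and the Shard type are hypothetical names used only for illustration. They are not existing happo APIs.

```typescript
// Hypothetical description of one split point, e.g. "1/4".
interface Shard {
  index: number; // 1-based split point to run on
  total: number; // number of split points (machines)
}

// Parse an argument such as "1/4" into a Shard.
function parseShard(arg: string): Shard {
  const match = /^(\d+)\/(\d+)$/.exec(arg);
  if (match === null) {
    throw new Error(`Expected an argument like "1/4", got "${arg}"`);
  }
  const index = Number(match[1]);
  const total = Number(match[2]);
  if (total < 1 || index < 1 || index > total) {
    throw new Error(`Split point ${index}/${total} is out of range`);
  }
  return { index, total };
}

// Pick the examples that belong to the given split point.
function selectExamples(names: string[], shard: Shard): string[] {
  // Sort a copy so the assignment is deterministic on every machine,
  // regardless of the order in which examples were discovered.
  const sorted = names.slice().sort();
  // Round-robin assignment: split point k gets positions k-1, k-1+total, ...
  return sorted.filter((_, i) => i % shard.total === shard.index - 1);
}
```

With this scheme every example lands in exactly one split point, provided all machines start from the same set of example names. Round-robin assignment keeps the groups within one example of each other in size, although it does not account for some examples being slower to render than others.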

cc @lelandrichardson

trotzig · 2016-12-20T09:03:11Z

I added some ideas to #73 that apply here as well.

lencioni added the enhancement label Dec 19, 2016
