
Parallelize cross validation as a provisional optimization #302

Closed
ghgr opened this issue Nov 4, 2016 · 5 comments
@ghgr
ghgr commented Nov 4, 2016

I propose setting the n_jobs parameter to num_cv_folds to get a sort of quick parallelism. When better solutions with dask are implemented, we could set it back to 1.

In base.py, in the _evaluate_individual method (line 575), change

```python
cv_scores = cross_val_score(self, sklearn_pipeline, features, classes, cv=self.num_cv_folds, scoring=self.scoring_function)
```

to

```python
cv_scores = cross_val_score(self, sklearn_pipeline, features, classes, cv=self.num_cv_folds, scoring=self.scoring_function, n_jobs=self.num_cv_folds)
```
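For readers unfamiliar with the proposal, here is a minimal standalone sketch (not TPOT code) of how `n_jobs` parallelizes cross-validation in plain scikit-learn; the pipeline and dataset are illustrative stand-ins:

```python
# Sketch of the n_jobs proposal using plain scikit-learn.
# The estimator and data are placeholders, not TPOT's pipeline.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

features, classes = make_classification(n_samples=200, n_features=10, random_state=0)
pipeline = LogisticRegression(max_iter=1000)

# cv sets the number of folds; n_jobs sets how many CPUs evaluate folds in
# parallel. Setting n_jobs equal to the fold count (3 here) mirrors the
# suggestion above; n_jobs=-1 would use all available cores instead.
cv_scores = cross_val_score(pipeline, features, classes, cv=3, n_jobs=3)
print(len(cv_scores))  # 3 scores, one per fold
```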

@weixuanfu
Contributor

Thank you for sharing this nice tip. According to the scikit-learn User Guide for cross_val_score, the n_jobs parameter determines the number of CPUs to use during cross-validation, while the cv parameter determines the number of folds. Maybe we should add n_jobs as another parameter in TPOT for parallelizing cross-validation, with a default of 1, since this approach may use much more system resources. @rhiever

@ghgr
Author

ghgr commented Nov 4, 2016

Indeed, that would be more precise. I proposed making n_jobs == num_cv_folds since the default number of CV folds in TPOT is 3, and most machines used for machine learning have more than 3 cores. Just to make @minimumnz feel better about not having idle cores [1] ;-)

[1] #177

@rhiever
Contributor

rhiever commented Nov 4, 2016

We've been talking about adding an n_jobs parameter to TPOT for quite some time, which would do basically this. Perhaps we should just do that.

@s-udhaya

s-udhaya commented Nov 17, 2016

Wouldn't it be better to use the multiprocessing capabilities of DEAP? Each combination (preprocessor, algorithm, postprocessor, etc.) is an individual in TPOT's population of combinations, so exploiting DEAP's multiprocessing feature could help TPOT parallelize by evaluating different individuals on different cores.

@rhiever
Contributor

rhiever commented Dec 19, 2016

We looked into using the multiprocessing capabilities of DEAP, but ran into issues with pickling lambda functions and a few other tricks we use in TPOT. Maybe @weixuanfu2016 can provide full details.
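The pickling issue mentioned above can be reproduced in isolation: Python's multiprocessing sends work to workers via pickle, and lambdas cannot be pickled by the standard pickle module, while module-level functions can. A small illustration (independent of TPOT or DEAP):

```python
# Why multiprocessing chokes on lambdas: pickle serializes functions by
# reference (module + name), and a lambda has no importable name.
import pickle

square = lambda x: x * x
try:
    pickle.dumps(square)
except Exception as e:
    print("lambda is not picklable:", e)


def square_fn(x):  # a module-level def pickles fine
    return x * x


data = pickle.dumps(square_fn)
assert pickle.loads(data)(4) == 16
```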

In the meantime, I've merged the PR that exposes n_jobs for the cross-validation procedure into the development branch.
