distributed environment #177

Closed
drorasaf opened this issue Jun 21, 2016 · 6 comments

@drorasaf

Are there any plans to allow TPOT to be used in distributed environments?

@rhiever
Contributor

rhiever commented Jun 21, 2016

Eventually, yes. There's been discussion of using Dask to parallelize TPOT (cc @tonyfast). We've also been thinking about PySpark for parallel cloud computing. However, we're still focused on getting the core algorithm and tool finished before we really work our way into scaling to distributed environments.

@drorasaf
Author

My common use case is parallel cloud computing, and I think TPOT has to scale for it to be useful on any interesting dataset.
I would consider leaving the choice between them up to the user, who knows the use case best.

@minimumnz

I'd love better parallel processing on a single machine. I feel sad when I see 3 cores at 0% and 1 at 100%.

@danthedaniel
Contributor

@minimumnz: it's fairly easy to make that change - https://github.com/teaearlgraycold/tpot/tree/parallelize

But TPOT itself likely won't have local parallelization until cluster support is also added, since it'd be much nicer to have both cases covered by one library.
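
To illustrate the "one library for both cases" idea, here is a minimal sketch using only the Python standard library. It is not TPOT code: `score_pipeline` and `evaluate_population` are hypothetical names, and the scoring function is a cheap stand-in for fitting a real pipeline. The point is only that the evaluation loop can take an executor as an argument instead of hard-coding how work is distributed.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import List, Sequence


def score_pipeline(depth: int) -> float:
    """Hypothetical stand-in for fitting and scoring one candidate pipeline."""
    return sum(i * i for i in range(depth)) / max(depth, 1)


def evaluate_population(population: Sequence[int], executor: Executor) -> List[float]:
    """Score every candidate through whatever executor the caller supplies."""
    # executor.map preserves input order, so scores line up with the population.
    return list(executor.map(score_pipeline, population))


# A thread pool keeps the demo simple; the backend is chosen by the caller.
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = evaluate_population([1, 2, 3, 4], pool)
```

For CPU-bound scoring on a single machine, passing a `ProcessPoolExecutor` (run under an `if __name__ == "__main__":` guard) would be the way to keep all cores busy, since threads in CPython do not run Python bytecode in parallel. A cluster backend could in principle be swapped in the same way, assuming it offers a compatible executor interface.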

@ghgr

ghgr commented Nov 4, 2016

May I ask what the current priority is for using distributed computing libraries (ideally Dask, which comes with caching) in TPOT? I think that's vital for such a project to be usable in the real world, and it should be orthogonal to the "core" branch development.

I think that representing the whole population of pipelines as one large Dask graph would be a good start. Caching of intermediate results, multicore, and multi-server execution would then go hand in hand (with the current development branch I'm spending most of the computing time recalculating the same xgboost!).
Any chance of reopening this issue?
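
The caching idea can be sketched in plain Python, without Dask. This is an illustration under stated assumptions, not how TPOT or Dask implement it: the step names, the `STEPS` table, and the `PrefixCache` class are all hypothetical, and the arithmetic steps stand in for expensive operations such as fitting a model. Each pipeline is a tuple of step names, and the result of every prefix is stored so that pipelines sharing their first steps do not recompute them.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for expensive pipeline steps (e.g. fitting a model).
STEPS: Dict[str, Callable[[float], float]] = {
    "scale": lambda x: x * 2.0,
    "shift": lambda x: x + 1.0,
    "square": lambda x: x * x,
}


class PrefixCache:
    """Memoizes the output of every pipeline prefix so shared steps run once."""

    def __init__(self, data: float) -> None:
        # The empty pipeline maps to the raw input data.
        self.results: Dict[Tuple[str, ...], float] = {(): data}
        self.step_calls = 0  # number of steps actually executed

    def run(self, pipeline: Tuple[str, ...]) -> float:
        if pipeline not in self.results:
            upstream = self.run(pipeline[:-1])  # reuse the cached prefix if present
            self.step_calls += 1
            self.results[pipeline] = STEPS[pipeline[-1]](upstream)
        return self.results[pipeline]


population = [
    ("scale", "shift"),
    ("scale", "shift", "square"),
    ("scale", "square"),
]
cache = PrefixCache(data=3.0)
scores = [cache.run(p) for p in population]
```

Here the three pipelines contain seven steps in total, but only four are executed because the shared `scale` and `scale, shift` prefixes are computed once. A task graph in which each node is keyed by its prefix would express the same sharing, with the scheduler deciding where each node runs.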

@rhiever
Contributor

rhiever commented Nov 4, 2016

I agree. Can you file a separate issue and list the possible options?
