
Algorithms for distributed training #4

Open · mrocklin opened this issue May 5, 2018 · 16 comments
mrocklin (Member) commented May 5, 2018

If others have the time, I'm inclined to experiment a bit with algorithms for distributed training. I think this would be an interesting stress test of the technology, and would also raise some interest from external groups.

I'm inclined to try something like a parameter-server-based SGD system. To do this, I think Dask needs to grow something like a low-latency inter-worker pub-sub system, which I'm happy to build beforehand.
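For concreteness, here is a minimal sketch of the shape such a system might take, written against the kind of Pub/Sub channel API proposed above (dask.distributed did later grow `distributed.Pub` / `distributed.Sub` along these lines). The topic names, loss, and update rule are all illustrative, not a settled design:

```python
# A minimal sketch of a parameter-server SGD loop over pub-sub channels.
# Assumes distributed.Pub / distributed.Sub; topics and loss are illustrative.
import numpy as np
from distributed import Pub, Sub

def sgd_worker(X, y, n_steps):
    """Runs as a task on each worker: pull parameters, push gradients."""
    grads = Pub("gradients")
    params = Sub("parameters")
    w = params.get(timeout=30)              # wait for initial parameters
    for _ in range(n_steps):
        i = np.random.randint(len(y))       # one-sample stochastic gradient
        grad = (X[i] @ w - y[i]) * X[i]     # squared-loss example
        grads.put(grad)
        w = params.get(timeout=30)          # pick up refreshed parameters

def parameter_server(n_features, n_updates, lr=0.01):
    """Runs as a task on one worker: apply gradients, broadcast parameters."""
    grads = Sub("gradients")
    params = Pub("parameters")
    w = np.zeros(n_features)
    params.put(w)                           # publish initial parameters
    for _ in range(n_updates):
        w = w - lr * grads.get(timeout=30)
        params.put(w)
    return w
```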

Is this of sufficient interest that others would likely engage? If so, I welcome recommendations on papers to read, architectures to be aware of, and obstacles to avoid.

mrocklin (Member, Author) commented May 5, 2018

cc @ogrisel @GaelVaroquaux @stsievert @amueller . Also, feel free to ping others who may be more involved in this topic or have more time to engage.

GaelVaroquaux commented May 5, 2018 via email

NelleV (Contributor) commented May 5, 2018

I'm working on the distributed hardware aspect. I think it might be worth me calling you sometime next week, with @yuvipanda in the room, to see if we can set something up. Do you think what Yuvi proposed in #3 would work? I can explore other options as well.

GaelVaroquaux commented May 5, 2018 via email

NelleV (Contributor) commented May 5, 2018

Would a Kubernetes cluster with SSH access work? If you tell me your ideal computing environment, I can try to set up something as close as possible to it.

GaelVaroquaux commented May 5, 2018 via email

NelleV (Contributor) commented May 5, 2018

OK. I'll try to make sure that we have something set up before you arrive, and I'll see whether Yuvi can join us on the first day of the sprint (there's a Jupyter dev meeting the same week).

mrocklin (Member, Author) commented May 5, 2018

> Well, as I mentioned IRL, my priority is to use scikit-learn in distributed settings on simple problems (for instance, distributed grid search of random forests) and to do benchmarks. I want to get a feel for the bottlenecks, and maybe address them.

Agreed. I have this same goal. I think that we'll have enough people for enough time that we can address a few issues at the same time.
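As a baseline for that simple case, something like the following should already be possible through scikit-learn's joblib integration. This is a sketch assuming a running dask.distributed cluster (recent versions register a "dask" joblib backend on import); the scheduler address, data, and parameter grid are placeholders:

```python
# Sketch: distributed grid search of random forests on a Dask cluster
# via joblib's "dask" backend.  Address, data, and grid are illustrative.
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client("scheduler-address:8786")   # placeholder scheduler address
X, y = make_classification(n_samples=10_000, n_features=20)

search = GridSearchCV(
    RandomForestClassifier(),
    {"n_estimators": [100, 300], "max_depth": [None, 10, 30]},
    cv=3,
    n_jobs=-1,
)
with joblib.parallel_backend("dask"):       # fan the CV fits out to workers
    search.fit(X, y)
print(search.best_params_)
```

This is also a convenient shape for benchmarking, since the same script runs unchanged against a local cluster or a remote one.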

> The blocker for this task is to have access to distributed hardware where I can do benchmarks.

I don't anticipate that this will be a problem. I think that the JupyterHub + Dask-Kubernetes setup we have now for http://pangeo.pydata.org/ will suffice for this group, and we can fairly easily set up something similar with a software environment that we like. You could use the current pangeo deployment today if you install sklearn in your local environment and also add it to the EXTRA_PIP_PACKAGES environment variable in your worker-template.yaml file.
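For reference, an abridged sketch of what that worker-template.yaml change might look like. The field layout follows the dask-kubernetes worker pod template convention and the image name is a placeholder; the actual pangeo template may differ:

```yaml
# Abridged dask-kubernetes worker pod template; image and fields are
# placeholders.  EXTRA_PIP_PACKAGES is installed at worker startup.
kind: Pod
spec:
  containers:
    - name: dask-worker
      image: daskdev/dask:latest
      env:
        - name: EXTRA_PIP_PACKAGES
          value: scikit-learn
```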

stsievert commented

> I'm inclined to try something like a parameter-server-based SGD system.

There's interest here. I'd engage, and I'm interested in extensions. Vanilla SGD would definitely work, and there are a lot of useful SGD variants built on it that rely on some very specific features (async reads/writes to the model vector, coding).

A good paper that walks through the design and implementation of a high-performance parameter server is "Scaling Distributed Machine Learning with the Parameter Server". Some of the features required for more particular algorithms are mentioned in sections 2 and 3 of "Communication Efficient Distributed Machine Learning with the Parameter Server", an extension of the previous work.

> distributed grid-search of random forests

Adaptive hyperparameter tuning is related: dask/dask-ml#161

amueller commented May 7, 2018

It's probably relatively straightforward to ask for cloud credits if that's your blocker, @GaelVaroquaux. Google is usually generous, and I have some connections at Microsoft.

I don't have time to spend on this before SciPy. Also, I will be offline for two weeks starting tomorrow because I'm getting my tonsils removed.

amueller commented May 7, 2018

also ping @jnothman I guess?

fabianp commented May 18, 2018

Cc me. Interested and happy to help with whatever I can during the sprint.

yuvipanda commented

It looks like @mrocklin thinks pangeo.pydata.org is good enough here. LMK if that changes :)

mrocklin (Member, Author) commented

> Cc me. Interested and happy to help with whatever I can during the sprint.

@fabianp are there topics in particular that you'd like to pursue together?

fabianp commented May 19, 2018

Parameter-server-based SGD sounds like a good starting point, but I'm open to other ideas that might come up. I've done some work on distributed/async methods, but I'm quite new to Dask.

mrocklin (Member, Author) commented

I've opened up a longer-term issue on the dask-ml tracker: dask/dask-ml#171
