Profile results #26
Open
mrocklin opened this issue Feb 3, 2017 · 8 comments

mrocklin (Member) commented Feb 3, 2017:

On eight m4.2xlarge instances I created the following dataset:

import numpy as np
import dask.array as da
from dask import persist
from dask_glm.utils import sigmoid

N = int(1e8)
beta = np.array([-1, 0, 1, 2])
M = 4
chunks = int(1e6)
seed = 20009

# Random features; labels drawn from a logistic model with the true beta
X = da.random.random((N, M), chunks=(chunks, M))
z0 = X.dot(beta)
y = da.random.random(z0.shape, chunks=z0.chunks) < sigmoid(z0)

X, y = persist(X, y)  # keep the data in distributed memory

I then ran the various methods within this project and recorded the profiles as Bokeh plots.

Additionally, I ran against a 10x larger dataset.

Most runtimes were around a minute. The BFGS solution gave wrong results.

Notes

On larger problems with smallish chunks (8 bytes * 4 columns * 1e6 rows == 32 MB) we seem to be bound by scheduling overhead. I've created an isolated benchmark that is representative of this overhead: https://gist.github.com/mrocklin/48b7c4b610db63b2ee816bd387b5a328
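The gist itself isn't reproduced here; as a rough, hypothetical sketch of what such an overhead benchmark can look like (assuming a dask.distributed scheduler already running at localhost:8786), one can time a reduction over many tiny chunks so that per-task scheduling cost dominates the actual compute:

import time
import dask.array as da
from distributed import Client

client = Client('localhost:8786')  # assumes scheduler and workers are already up

# 100 chunks of trivially cheap work; wall time is mostly per-task overhead
x = da.random.random(int(1e8), chunks=int(1e6))

start = time.time()
x.sum().compute()
print('elapsed:', time.time() - start)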


cicdw (Collaborator) commented Feb 10, 2017:

@mrocklin Has gradient_descent been optimized (using delayed, persist, etc.) in the same way that the other functions have? I might be refactoring soon and I wanted to make sure that piece was taken care of first.
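For context, here is a minimal sketch of the pattern the question refers to (this is not dask-glm's actual gradient_descent; the fixed step size and lack of a stopping rule are placeholders): persist the data once up front and materialize only small per-iteration quantities.

import numpy as np
import dask.array as da
from dask import persist

def gradient_descent_sketch(X, y, max_iter=100, step=1e-3):
    """Toy logistic-regression gradient descent; not dask-glm's implementation."""
    X, y = persist(X, y)                        # data stays in (distributed) memory
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + da.exp(-X.dot(beta)))  # predicted probabilities (lazy)
        grad = X.T.dot(p - y)                   # logistic-loss gradient (lazy)
        beta = beta - step * grad.compute()     # only a length-M vector is materialized
    return beta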


mrocklin (Member Author) commented:
Note that @eriknw is working on a dask optimization that may help to reduce overhead here: dask/dask#1979

mrocklin (Member Author) commented:
I sat down with @amueller and we compared against sklearn's SGD. We found that proximal_grad and sklearn's SGD were similar in terms of runtime on a single machine (using dask.distributed; we didn't try the threaded scheduler). Presumably SGD was being a bit smarter and dask-glm was using more hardware.

cicdw (Collaborator) commented Feb 15, 2017:

@mrocklin Did you look at ADMM? I'm starting to think that, going forward, we should only employ ADMM, Newton, and gradient_descent.

mrocklin (Member Author) commented:
Nope, we only spent a few minutes on it. We ran the following:

Prep

import dask.array as da
import numpy as np
from dask import persist, compute
from dask_glm.logistic import *
from dask_glm.utils import *

from distributed import Client
c = Client('localhost:8786')

N = int(1e7)
chunks = int(1e6)
seed = 20009

X = da.random.random((N, 2), chunks=chunks)
y = make_y(X, beta=np.array([-1.5, 3]), chunks=chunks)

X, y = persist(X, y)

Dask GLM

%time proximal_grad(X, y)

SKLearn

from sklearn.linear_model import SGDClassifier

nX, ny = compute(X, y)  # pull the dask arrays into local numpy arrays
# note: newer scikit-learn uses max_iter and loss='log_loss' instead of n_iter and loss='log'
%time sgd = SGDClassifier(loss='log', n_iter=10, verbose=10, fit_intercept=False).fit(nX, ny)
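A possible follow-up, not part of the original comment (and assuming proximal_grad returns the fitted coefficient vector), would be to compare both fits against the true beta passed to make_y:

beta_dask = proximal_grad(X, y)           # dask-glm estimate
print('dask-glm:   ', beta_dask)
print('sklearn SGD:', sgd.coef_.ravel())
print('true beta:  ', np.array([-1.5, 3]))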

cicdw (Collaborator) commented Mar 13, 2017:

I haven't looked into whether we could use this data for benchmarking, but the incredibly large dataset over at https://www.kaggle.com/c/outbrain-click-prediction/data seems like it could be a good candidate. We might have to process the data a little bit before fitting a model, but I wouldn't mind taking a stab at that piece.

cc: @hussainsultan @jcrist
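A very rough sketch of the kind of preprocessing that might be involved, just to make the idea concrete (the file name clicks_train.csv, its columns, and the choice of features are assumptions about the Kaggle data, not a worked-out pipeline):

import dask.dataframe as dd
from dask import persist

# Assumed layout of clicks_train.csv: display_id, ad_id, clicked (0/1)
df = dd.read_csv('clicks_train.csv')

# Placeholder features; a real model would join the other files and
# encode the categorical IDs properly rather than using raw IDs.
X = df[['display_id', 'ad_id']].to_dask_array(lengths=True).astype('float64')
y = df['clicked'].to_dask_array(lengths=True)

X, y = persist(X, y)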
