
API: Sketch out Estimator API #40

Merged 1 commit into dask:master on Apr 23, 2017
Conversation

@TomAugspurger (Member) commented Apr 12, 2017

Added a couple Estimators for an end-user API.

Basically

from dask_glm.estimators import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)

To kick the API discussion even further down the road, right now it accepts either the scikit-learn style Estimator(...).fit(X, y) or the statsmodels style Estimator(X, y, ...).fit(), but we should probably choose one or the other.
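
A minimal sketch of the two calling conventions, with X and y standing in for dask arrays prepared elsewhere; the statsmodels-style constructor signature here is an assumption based on the description above:

from dask_glm.estimators import LogisticRegression

# scikit-learn style: configure the estimator, then pass the data to fit()
lr = LogisticRegression()
lr.fit(X, y)

# statsmodels style: pass the data to the constructor, then call fit()
lr = LogisticRegression(X, y)
lr.fit()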

Here's an example notebook

Some miscellaneous changes to help:

  • Added a datasets.make_classification helper (a small usage sketch follows this list)
  • Added string names for solvers
  • Added an add_intercept helper (which I think might be broken right now...)
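
A small usage sketch of those helpers; the exact keyword names are assumptions rather than confirmed signatures:

from dask_glm.datasets import make_classification
from dask_glm.estimators import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=20, chunksize=1000)
lr = LogisticRegression(solver='admm')   # solvers can now be referred to by string name
lr.fit(X, y)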

@TomAugspurger mentioned this pull request Apr 12, 2017
rho = rho if rho is not None else 1
over_relax = over_relax if over_relax is not None else 1
abstol = abstol if abstol is not None else 1e-4
reltol = reltol if reltol is not None else 1e-2
Member

I think that we should avoid calling out algorithms explicitly by name here if possible. Is it possible to put these defaults in algorithms.py somehow?
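
One possible shape for that, sketched here as an assumption rather than the existing dask-glm layout: keep per-solver defaults next to the solvers in algorithms.py and look them up by name (the values below are the ADMM defaults from the diff above).

# algorithms.py (hypothetical layout, not current dask-glm code)
SOLVER_DEFAULTS = {
    'admm': {'rho': 1, 'over_relax': 1, 'abstol': 1e-4, 'reltol': 1e-2},
}

def default_solver_kwargs(solver):
    """Return a copy of the default keyword arguments for a named solver."""
    return dict(SOLVER_DEFAULTS.get(solver, {}))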


fit_kwargs = {'max_steps', 'tol', 'family'}
# TODO: use scikit-learn names throughout the lib
renamer = {'max_steps': 'max_iter', 'reg': 'regularizer'}
Member

I think it would also be reasonable to change the existing algorithms to use these names consistently. I wouldn't be shy about touching that code.

from .utils import sigmoid, dot, add_intercept


class _GLM(BaseEstimator):
Collaborator

+1 for _GLM being a base-class.

return mean_square_error(y, self.predict(X))


def accuracy_score(y_true, y_pred):
Collaborator

Should these functions live in utils, or within the corresponding Estimator class?

Member Author

Definitely not at the bottom of estimators.py; I forgot to move them before pushing :) I didn't want to go down the whole rabbit hole of writing our own metrics.py, but maybe we should add a couple of basic ones like this.

Collaborator

I'd be comfortable including these in utils.py for now.

@TomAugspurger (Member Author)

Here's an update to the notebook showing scikit-learn and dask on a slice of the taxi dataset

http://nbviewer.jupyter.org/gist/anonymous/c07a617684ec9629c204825c212383fe

It shows dask-glm interacting with scikit-learn's pipeline API, and a small bit at the end with dask-searchcv. I haven't looked closely at the output, but the fit doesn't seem great. I'm not sure if it's the defaults / starting parameters or something else.
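
For flavor, a minimal sketch of that kind of composition, assuming dask-searchcv exposes a drop-in GridSearchCV and that the estimator takes a lamduh regularization parameter (per the naming used later in this thread):

from sklearn.pipeline import Pipeline
from dask_searchcv import GridSearchCV            # dask-searchcv's drop-in grid search
from dask_glm.estimators import LogisticRegression

pipe = Pipeline([('lr', LogisticRegression())])
grid = GridSearchCV(pipe, {'lr__lamduh': [0.01, 0.1, 1.0]})
grid.fit(X, y)                                     # X, y: the taxi slice prepared in the notebook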


pointwise_loss = family.pointwise_loss
pointwise_gradient = family.pointwise_gradient
regularizer = _regularizers.get(regularizer, regularizer) # string
Collaborator

Ha, this line took me a second.


self._fit_kwargs = {k: getattr(self, k) for k in fit_kwargs}

def fit(self, X, y=None):
Collaborator

Whenever we include p-value output, we will need access to the data that was used to fit the model (in order to compute the Hessian).

We could put this post-fit processing inside the fit method with an inert call to self.post_process or something as a place-holder, or we could store that info in some other cleverly named attributes?
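
One way that place-holder could look; a minimal sketch, with post_process and the _fit_* attributes as illustrative names rather than dask-glm API:

from sklearn.base import BaseEstimator   # assuming BaseEstimator comes from scikit-learn here

class _GLM(BaseEstimator):
    def fit(self, X, y=None):
        # ... solve for self.coef_ here ...
        self._fit_X, self._fit_y = X, y   # keep the training data for post-fit inference
        self.post_process(X, y)           # inert hook for now
        return self

    def post_process(self, X, y):
        # later: compute the Hessian at self.coef_ and derive p-values / standard errors
        pass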

Collaborator

The only other call-out I'll make here is that this initialization setup will make it difficult for a user to create an unregularized model (they will need to specify a hand-picked solver). I'm not worried about it for this PR, but I think we should be aware of it and make it easier (without having the user choose the solver) in the future.

@mrocklin (Member)

Cool stuff. Comments on the notebook:

  1. da.asarray may not operate as expected on dask dataframes; you might want to try df.values instead.
  2. For overflow = overflow.compute() if hasattr(overflow, 'compute') else overflow, you might want to try dask.compute(overflow)[0] instead; this function was designed to pass non-dask things through harmlessly (see the small sketch after this list).
  3. "Converged". Looks like we need to remove the print statements from dask-glm.
  4. What do you think we should do with things like DummyEncoder?
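
A tiny illustration of that pass-through behavior; dask.compute returns a tuple, so indexing with [0] recovers the single result:

import dask
import dask.array as da

x = da.ones(5, chunks=5)
print(dask.compute(x.sum())[0])   # a dask collection gets computed -> 5.0
print(dask.compute(3.14)[0])      # a plain Python object passes through unchanged -> 3.14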

@cicdw (Collaborator) commented Apr 17, 2017

Comments on the notebook:

  • the model fit by scikit-learn isn't immediately comparable to the model fit by dask-glm because the regularization strengths differ. To make them more comparable (see the sketch after this list), do:
lam = 0.1 # or whatever number you want <-- reg strength for dask-glm
C = 1 / lam # <--- reg strength for scikit-learn
  • I'd recommend using admm as the default solver for regularized problems
  • +1 on removing print statements; I can do a separate PR that stores whether the chosen solver actually converged so the user has access to that.
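
A hedged sketch of that comparison; treating lamduh as dask-glm's regularization-strength keyword follows the naming used later in this thread, so the exact parameter names are assumptions:

from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from dask_glm.estimators import LogisticRegression as DaskLogisticRegression

lam = 0.1                                   # regularization strength for dask-glm
sk_lr = SkLogisticRegression(C=1 / lam)     # scikit-learn parameterizes the inverse strength
dk_lr = DaskLogisticRegression(lamduh=lam)  # now the two fits are comparable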

@TomAugspurger (Member Author)

What do you think we should do with things like DummyEncoder?

I took it from https://github.com/TomAugspurger/sktransformers/blob/master/sktransformers/preprocessing.py

I'm not sure of the best home for them. I don't think we can reasonably expect the scikit-learn devs to "do the right thing" for dask stuff, like they're mostly doing for pandas dataframes.

That said, it's not a ton of work to support regular and dask versions of these kinds of transformers. The DummyEncoder one is so long because there isn't anything exactly like it in scikit-learn. I could probably take on building out / maintaining this if needed.

@cicdw (Collaborator) commented Apr 17, 2017

Oh, an additional comment: it looks like scikit-learn also defaults to l2 regularization, whereas I think dask-glm defaults to l1. (writing this out also makes me wonder if we should lower-case these to better fit the mathematical notation).

We might want to set all of our defaults to scikit-learn defaults to avoid confusion whenever people run these sorts of comparison experiments in the future.

@cicdw (Collaborator) commented Apr 20, 2017

If you could move those stray functions around, change the default regularizer to 'l2' and the default lamduh to 1.0 (to match scikit-learn), then I think we're good to go.

@TomAugspurger (Member Author) commented Apr 20, 2017 via email

@TomAugspurger (Member Author)

OK, rebased and addressed those last couple of comments, @moody-marlin.

And for fun, another example on the SUSY dataset (5,000,000 x 18; all numeric): http://nbviewer.jupyter.org/gist/anonymous/15742155693794ddd31ea85b654cbc7e

The takeaway, assuming I've done everything properly:

Library        Training time   Score
dask-glm       9:25            .788
scikit-learn   25:01           .788

That's all on my local machine still.

@cicdw (Collaborator) commented Apr 20, 2017

Awesome, LGTM after flaking.

- Added a `datasets.make_classification` helper
- Added string names for solvers and regularizers
- Changed defaults to match scikit-learn where needed
- Changed parameter names to match scikit-learn

@mrocklin (Member)

It's very cool to see those performance numbers. cc @amueller @ogrisel in case they're interested. Notebook URL copied from above for convenience:

And for fun, another example on the SUSY dataset (5,000,000 x 18; all numeric): http://nbviewer.jupyter.org/gist/anonymous/15742155693794ddd31ea85b654cbc7e

@cicdw merged commit 670d554 into dask:master Apr 23, 2017
@amueller

Nice. How many cores was that?

@TomAugspurger (Member Author) commented Apr 25, 2017 via email

@mrocklin (Member)

8 logical or 8 physical cores? My original assumption was that you were doing this from a personal laptop, presumably running something like a Core i7, which has four physical cores with hyperthreading up to eight logical cores.

@TomAugspurger (Member Author) commented Apr 25, 2017 via email

@ogrisel commented Apr 26, 2017

Nice, though I get 0.78832579999999997 accuracy in ~1 min with a single core and the 'saga' solver in sklearn master (on standardized data) ;)
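
For reference, a minimal sketch of that kind of single-core baseline, assuming a scikit-learn version that ships the 'saga' solver, with X and y standing in for the SUSY data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(StandardScaler(), LogisticRegression(solver='saga'))
clf.fit(X, y)             # X, y: the SUSY features and labels loaded elsewhere
print(clf.score(X, y))    # training accuracy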

@mrocklin (Member)

Heh, I wonder how large of a cluster we would need to beat @ogrisel's single laptop core :)

@amueller

You need a dataset that doesn't fit into his RAM ;)

@mrocklin (Member)

@TomAugspurger @moody-marlin if you were interested, I think this would be an entertaining blog post to put out there:

"Dask-glm beats scikit-learn by using parallelism. Scikit-learn then beats dask-glm by using smarter algorithms" :)

It would both show off the sklearn API compatibility developed in this PR and would also have the larger message of "there are usually better options than parallel computing".

@ogrisel commented Apr 26, 2017

Also, I checked the resulting coef_ attribute and no coefficient is set to zero for this value of C. The l1 penalty probably does not make sense on such low-dimensional data. I suspect XGBoost would perform much better on this kind of data.

@ogrisel commented Apr 26, 2017

In dask-glm's defense, the fit takes only 6 min on 2 cores when you scale the data (same final training accuracy). saga is still 6x faster on a single core, though.

Scaling the data is always a good idea :) and is required for the saga solver, which otherwise might not converge.

@mrocklin (Member)

More broadly, are there parallelizable convex optimization algorithms that we should implement in dask-glm that would likely perform better on this sort of problem than what is being used now (ADMM, I think, for this problem)?

@ogrisel commented Apr 26, 2017

I am not very familiar with ADMM and its competitors. Stochastic sequential solvers are very efficient for gradient-based estimators when the number of samples is large.

Still, @fabianp presented a parallel version of SAGA at AISTATS last week:

https://twitter.com/fpedregosa/status/856565126744334336

@ogrisel commented Apr 26, 2017

For reference, here is my notebook (evaluated on my laptop with 2 physical cores):

http://nbviewer.jupyter.org/gist/ogrisel/5f2d31bc5e7df852b4ca63f5f6049f42

@mrocklin (Member)

@fabianp is this the right paper to read: https://arxiv.org/pdf/1606.04809.pdf ?

@ogrisel commented Apr 26, 2017

@TomAugspurger if you do a blog post, you should use the official train/test split (last 500,000 samples as the held-out test set) and report accuracy on both the train set (to assess the quality of the optimizers) and the test set (to assess the quality of the statistical model).

@TomAugspurger (Member Author) commented Apr 26, 2017 via email

@cicdw (Collaborator) commented Apr 26, 2017

I'd like to throw in a quick comment: we shouldn't be trying to beat any benchmarks for output quality / predictive power; dask-glm is just fitting a simple regularized logistic regression, and so is scikit-learn. We just want our output to be the true optimizer for the problem the user has provided. I believe reporting the score metrics was initially being used as a proxy to prove our two algorithms were converging on the same "true" MAP coefficient estimates (i.e., our output is just as reliable as scikit-learn).

I believe the sag / saga algorithm in scikit-learn only fits differentiable regularizers, i.e., it only works with l2 regularization, so comparing it against l1-regularized fits isn't a comparison of our packages' performance but merely a statement that for this dataset l2 regularization works better than l1 (and is faster, which is typically true because of differentiability).

@ogrisel commented Apr 26, 2017

SAGA works well with l1.

@cicdw (Collaborator) commented Apr 26, 2017

Ah, you're right; it works with the proximal operator instead of the gradient. 👍

@fabianp commented Apr 26, 2017

@ogrisel: thanks for the plug ;-)

@mrocklin: this is indeed the right paper (https://arxiv.org/pdf/1606.04809.pdf)

While this asynchronous algorithm works really well in practice, I'll highlight a couple of implementation caveats just to be 100% clear on it. I am not familiar with Dask, so feel free to comment; I'd love to have feedback on this:

  • The algorithm is fully asynchronous except for the per-coefficient update, which is assumed to be atomic (lines 11-13 in Algorithm 2), which is the common setting for these algorithms. That is, there are no locks on the full vector of coefficients as in other synchronous algorithms, but the individual scalar updates to the vector of coefficients do need to be atomic. This means that in order to get convergence to machine precision (one can argue whether this is necessary from a machine learning perspective, but it is important from an optimization perspective), one needs to have atomic double operations (we also refer to these as compare-and-swap operations in the paper). This turns out to be necessary not only in theory, but also in practice; see Fig. 6 and the paragraph "Necessity of compare-and-swap operations".
    I could do this in Java and C++, but I couldn't find a way to do this (efficiently) in Python.

  • The asynchronous algorithm in that paper is restricted to smooth objectives, but it can be extended to proximal terms. We have a paper under review on this that we will post on arXiv in the coming weeks.

(apologies for side-tracking the conversation and thanks for dask-glm!)

@mrocklin (Member)

Yeah, Dask certainly isn't going to operate at the low latencies required by algorithms such as the one you've implemented. Dask today imposes latencies on the order of low tens of milliseconds, which is almost certainly well above what your algorithm can tolerate.

However, my understanding is that these same sorts of algorithms sometimes develop blocked variants that are more tolerant of high latencies and so more amenable to distributed parallelism. Is this the case for asynchronous SAGA?

@ogrisel commented Apr 26, 2017

In practice, in that case it would probably make sense to run n instances of the penalized logistic regression fit with different values of C and on different CV splits to find the best model (in terms of validation accuracy). This is embarrassingly parallel, so it would be well suited to dask-glm (even if less interesting than distributed optimizers from an implementer's point of view).

This is basically what LogisticRegressionCV implements in scikit-learn (using single-machine threads via joblib).
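
A minimal sketch of that embarrassingly parallel search using dask.delayed; the helper names (fit_and_score, Cs, splits) are illustrative, not existing API:

import dask
from sklearn.linear_model import LogisticRegression

@dask.delayed
def fit_and_score(C, X_train, y_train, X_val, y_val):
    model = LogisticRegression(C=C).fit(X_train, y_train)
    return C, model.score(X_val, y_val)

# Cs: candidate regularization strengths; splits: (X_train, y_train, X_val, y_val) CV folds
tasks = [fit_and_score(C, *split) for C in Cs for split in splits]
results = dask.compute(*tasks)   # runs the fits in parallel on threads, processes, or a cluster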

@ogrisel commented Apr 26, 2017

Out-of-core SAGA would be nice too.

@ogrisel commented Apr 26, 2017

However, my understanding is that these same sorts of algorithms sometimes develop blocked variants that are more tolerant of high latencies and so more amenable to distributed parallelism. Is this the case for asynchronous SAGA?

Blocked Coordinate Descent is competitive for L1 (or L1+L2) penalized LogisticRegression when n_features >> n_samples. But naive implementations are bad compared to variants with screening rules and working sets. As soon as you implement those, parallelism might get trickier (or even impossible) but I am not expert enough to tell. @agramfort or @mblondel might know better.

@agramfort (Contributor) commented Apr 26, 2017 via email

@mrocklin (Member)

@MLnick did a quick version of this with decent results over here: #30 (comment)

@fabianp commented Apr 28, 2017 via email

@MLnick (Contributor) commented Apr 28, 2017

@agramfort interesting for L1 with L-BFGS-B. I thought the OWLQN variant was required for L1. Do you mind sharing what the "trick" is? :)

On the subject of distributed SGD: I'm quite keen to look into a "parameter server"-type version of parallel SGD on dask, either fully async or with bounded-staleness sync, with a distributed parameter vector. I think it may be quite nice to do it via the local Client interface (or perhaps Channel?), but I haven't had a chance to really dig into it just yet.

For large sparse problems I think it would work well (e.g. sparse LR and FMs cf. DiFacto).

@agramfort (Contributor) commented Apr 28, 2017 via email

@MLnick mentioned this pull request May 10, 2017