
API: Sketch out Estimator API #40

Merged 1 commit into dask:master on Apr 23, 2017
Conversation

@TomAugspurger (Member) commented Apr 12, 2017

Added a couple Estimators for an end-user API.

Basically

from dask_glm.estimators import LogisticRegression
lr = LogisticRegression()
lr.fit(X, y)

To kick the API discussion even further down the road, right now it accepts either the scikit-learn style Estimator(...).fit(X, y) or the statsmodels style Estimator(X, y, ...).fit(), but we should probably choose one or the other.
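
A minimal sketch of the two calling conventions, with X and y standing in for dask arrays prepared elsewhere; the statsmodels-style constructor signature here is an assumption based on the description above:

from dask_glm.estimators import LogisticRegression

# scikit-learn style: configure the estimator, then pass the data to fit()
lr = LogisticRegression()
lr.fit(X, y)

# statsmodels style: pass the data to the constructor, then call fit()
lr = LogisticRegression(X, y)
lr.fit()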

Here's an example notebook

Some miscellaneous changes to help:

  • Added a datasets.make_classification helper (a small usage sketch follows this list)
  • Added string names for solvers
  • Added an add_intercept helper (which I think might be broken right now...)
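
A small usage sketch of those helpers; the exact keyword names are assumptions rather than confirmed signatures:

from dask_glm.datasets import make_classification
from dask_glm.estimators import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=20, chunksize=1000)
lr = LogisticRegression(solver='admm')   # solvers can now be referred to by string name
lr.fit(X, y)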

@TomAugspurger mentioned this pull request Apr 12, 2017
rho = rho if rho is not None else 1
over_relax = over_relax if over_relax is not None else 1
abstol = abstol if abstol is not None else 1e-4
reltol = reltol if reltol is not None else 1e-2
Member

I think that we should avoid calling out algorithms explicitly by name here if possible. Is it possible to put these defaults in algorithms.py somehow?
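
One possible shape for that, sketched here as an assumption rather than the existing dask-glm layout: keep per-solver defaults next to the solvers in algorithms.py and look them up by name (the values below are the ADMM defaults from the diff above).

# algorithms.py (hypothetical layout, not current dask-glm code)
SOLVER_DEFAULTS = {
    'admm': {'rho': 1, 'over_relax': 1, 'abstol': 1e-4, 'reltol': 1e-2},
}

def default_solver_kwargs(solver):
    """Return a copy of the default keyword arguments for a named solver."""
    return dict(SOLVER_DEFAULTS.get(solver, {}))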


fit_kwargs = {'max_steps', 'tol', 'family'}
# TODO: use scikit-learn names throughout the lib
renamer = {'max_steps': 'max_iter', 'reg': 'regularizer'}
Member

I think it would also be reasonable to change the existing algorithms to use these names consistently. I wouldn't be shy about touching that code.

from .utils import sigmoid, dot, add_intercept


class _GLM(BaseEstimator):
Collaborator

+1 for _GLM being a base-class.

return mean_square_error(y, self.predict(X))


def accuracy_score(y_true, y_pred):
Collaborator

Should these functions live in utils, or within the corresponding Estimator class?

Member Author

Definitely not at the bottom of estimators.py; I forgot to move them before pushing :) I didn't want to go down the whole rabbit hole of writing our own metrics.py, but maybe we should add a couple of basic ones like this.

Collaborator

I'd be comfortable including these in utils.py for now.

@TomAugspurger (Member Author)

Here's an update to the notebook showing scikit-learn and dask on a slice of the taxi dataset

http://nbviewer.jupyter.org/gist/anonymous/c07a617684ec9629c204825c212383fe

It shows dask-glm interacting with scikit-learn's pipeline API, and a small bit at the end with dask-searchcv. I haven't looked closely at the output, but the fit doesn't seem great. I'm not sure if it's the defaults / starting parameters or something else.
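
For flavor, a minimal sketch of that kind of composition, assuming dask-searchcv exposes a drop-in GridSearchCV and that the estimator takes a lamduh regularization parameter (per the naming used later in this thread):

from sklearn.pipeline import Pipeline
from dask_searchcv import GridSearchCV            # dask-searchcv's drop-in grid search
from dask_glm.estimators import LogisticRegression

pipe = Pipeline([('lr', LogisticRegression())])
grid = GridSearchCV(pipe, {'lr__lamduh': [0.01, 0.1, 1.0]})
grid.fit(X, y)                                     # X, y: the taxi slice prepared in the notebook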


pointwise_loss = family.pointwise_loss
pointwise_gradient = family.pointwise_gradient
regularizer = _regularizers.get(regularizer, regularizer) # string
Collaborator

Ha, this line took me a second.


self._fit_kwargs = {k: getattr(self, k) for k in fit_kwargs}

def fit(self, X, y=None):
Collaborator

Whenever we include p-value output, we will need access to the data that was used to fit the model (in order to compute the Hessian).

We could put this post-fit processing inside the fit method with an inert call to self.post_process or something as a place-holder, or we could store that info in some other cleverly named attributes?
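
One way that place-holder could look; a minimal sketch, with post_process and the _fit_* attributes as illustrative names rather than dask-glm API:

from sklearn.base import BaseEstimator   # assuming BaseEstimator comes from scikit-learn here

class _GLM(BaseEstimator):
    def fit(self, X, y=None):
        # ... solve for self.coef_ here ...
        self._fit_X, self._fit_y = X, y   # keep the training data for post-fit inference
        self.post_process(X, y)           # inert hook for now
        return self

    def post_process(self, X, y):
        # later: compute the Hessian at self.coef_ and derive p-values / standard errors
        pass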

Collaborator

The only other call-out I'll make here is that this initialization setup will make it difficult for a user to create an unregularized model (they will need to specify a hand-picked solver). I'm not worried about it for this PR, but I think we should be aware of it and make it easier (without having the user choose the solver) in the future.

@mrocklin (Member)

Cool stuff. Comments on the notebook:

  1. da.asarray may not operate as expected on dask dataframes; you might want to try df.values instead.
  2. For overflow = overflow.compute() if hasattr(overflow, 'compute') else overflow, you might want to try dask.compute(overflow)[0] instead; this function was designed to pass non-dask things through harmlessly (see the small sketch after this list).
  3. "Converged". Looks like we need to remove the print statements from dask-glm.
  4. What do you think we should do with things like DummyEncoder?
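
A tiny illustration of that pass-through behavior; dask.compute returns a tuple, so indexing with [0] recovers the single result:

import dask
import dask.array as da

x = da.ones(5, chunks=5)
print(dask.compute(x.sum())[0])   # a dask collection gets computed -> 5.0
print(dask.compute(3.14)[0])      # a plain Python object passes through unchanged -> 3.14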

@cicdw (Collaborator) commented Apr 17, 2017

Comments on the notebook:

  • the model fit by scikit-learn isn't immediately comparable to the model fit by dask-glm because the regularization strengths differ. To make them more comparable (see the sketch after this list), do:
lam = 0.1 # or whatever number you want <-- reg strength for dask-glm
C = 1 / lam # <--- reg strength for scikit-learn
  • I'd recommend using admm as the default solver for regularized problems
  • +1 on removing print statements; I can do a separate PR that stores whether the chosen solver actually converged so the user has access to that.
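
A hedged sketch of that comparison; treating lamduh as dask-glm's regularization-strength keyword follows the naming used later in this thread, so the exact parameter names are assumptions:

from sklearn.linear_model import LogisticRegression as SkLogisticRegression
from dask_glm.estimators import LogisticRegression as DaskLogisticRegression

lam = 0.1                                   # regularization strength for dask-glm
sk_lr = SkLogisticRegression(C=1 / lam)     # scikit-learn parameterizes the inverse strength
dk_lr = DaskLogisticRegression(lamduh=lam)  # now the two fits are comparable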

@TomAugspurger (Member Author)

What do you think we should do with things like DummyEncoder?

I took it from https://github.com/TomAugspurger/sktransformers/blob/master/sktransformers/preprocessing.py

I'm not sure of the best home for them. I don't think we can reasonably expect the scikit-learn devs to "do the right thing" for dask stuff, like they're mostly doing for pandas dataframes.

That said, it's not a ton of work to support regular and dask versions of these kinds of transformers. The DummyEncoder one is so long because there isn't anything exactly like it in scikit-learn. I could probably take on building out / maintaining this if needed.

@cicdw (Collaborator) commented Apr 17, 2017

Oh, an additional comment: it looks like scikit-learn also defaults to l2 regularization, whereas I think dask-glm defaults to l1. (writing this out also makes me wonder if we should lower-case these to better fit the mathematical notation).

We might want to set all of our defaults to scikit-learn defaults to avoid confusion whenever people run these sorts of comparison experiments in the future.

@cicdw (Collaborator) commented Apr 20, 2017

If you could move those stray functions around, change the default regularizer to 'l2' and the default lamduh to 1.0 (to match scikit-learn), then I think we're good to go.

@TomAugspurger (Member Author) commented Apr 20, 2017 via email

@TomAugspurger (Member Author)

OK, rebased and addressed those last couple of comments, @moody-marlin.

And for fun, another example on the SUSY dataset (5,000,000 x 18; all numeric): http://nbviewer.jupyter.org/gist/anonymous/15742155693794ddd31ea85b654cbc7e

The takeaway, assuming I've done everything properly:

Library        Training time   Score
dask-glm       9:25            .788
scikit-learn   25:01           .788

That's all on my local machine still.

@cicdw (Collaborator) commented Apr 20, 2017

Awesome, LGTM after flaking.

- Added a `datasets.make_classification` helper
- Added string names for solvers and regularizers
- Changed defaults to match scikit-learn where needed
- Changed parameter names to match scikit-learn

@mrocklin (Member)

It's very cool to see those performance numbers. cc @amueller @ogrisel in case they're interested. Notebook URL copied from above for convenience:

And for fun, another example on the SUSY dataset (5,000,000 x 18; all numeric): http://nbviewer.jupyter.org/gist/anonymous/15742155693794ddd31ea85b654cbc7e

@cicdw merged commit 670d554 into dask:master Apr 23, 2017
@amueller

Nice. How many cores was that?

@TomAugspurger (Member Author) commented Apr 25, 2017 via email

@mrocklin (Member)

8 logical or 8 physical cores? My original assumption was that you were doing this from a personal laptop, presumably running something like a Core i7, which has four physical cores with hyperthreading up to eight logical cores.

@TomAugspurger (Member Author) commented Apr 25, 2017 via email

@ogrisel commented Apr 26, 2017

Nice, though I get 0.78832579999999997 accuracy in ~1 min with a single core and the 'saga' solver in sklearn master (on standardized data) ;)
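
For reference, a minimal sketch of that kind of single-core baseline, assuming a scikit-learn version that ships the 'saga' solver, with X and y standing in for the SUSY data:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

clf = make_pipeline(StandardScaler(), LogisticRegression(solver='saga'))
clf.fit(X, y)             # X, y: the SUSY features and labels loaded elsewhere
print(clf.score(X, y))    # training accuracy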

@mrocklin (Member)

Heh, I wonder how large of a cluster we would need to beat @ogrisel's single laptop core :)

@amueller

You need a dataset that doesn't fit into his RAM ;)

@mrocklin (Member)

@TomAugspurger @moody-marlin if you were interested, I think this would be an entertaining blog post to put out there:

"Dask-glm beats scikit-learn by using parallelism. Scikit-learn then beats dask-glm by using smarter algorithms" :)

It would both show off the sklearn API compatibility developed in this PR and would also have the larger message of "there are usually better options than parallel computing".

@ogrisel commented Apr 26, 2017

Also, I checked the resulting coef_ attribute and no coefficient is set to zero for this value of C. The l1 penalty probably does not make sense on such low-dimensional data. I suspect XGBoost would perform much better on this kind of data.

@ogrisel commented Apr 26, 2017

In dask-glm's defense, the fit takes only 6 min on 2 cores when you scale the data (same final training accuracy). saga is still 6x faster on a single core, though.

Scaling the data is always a good idea :) and is required for the saga solver, which otherwise might not converge.

@mrocklin (Member)

More broadly, are there parallelizable convex optimization algorithms that we should implement in dask-glm that would likely perform better on this sort of problem than what is being used now (ADMM, I think, for this problem)?

@ogrisel commented Apr 26, 2017

I am not very familiar with ADMM and its competitors. Stochastic sequential solvers are very efficient for gradient-based estimators when the number of samples is large.

Still, @fabianp presented a parallel version of SAGA at AISTATS last week:

https://twitter.com/fpedregosa/status/856565126744334336

@ogrisel commented Apr 26, 2017

For reference, here is my notebook (evaluated on my laptop with 2 physical cores):

http://nbviewer.jupyter.org/gist/ogrisel/5f2d31bc5e7df852b4ca63f5f6049f42

@mrocklin (Member)

@fabianp is this the right paper to read: https://arxiv.org/pdf/1606.04809.pdf ?

@ogrisel commented Apr 26, 2017

@TomAugspurger if you do a blog post, you should use the official train/test split (last 500,000 samples as the held-out test set) and report accuracy on both the train set (to assess the quality of the optimizers) and the test set (to assess the quality of the statistical model).

@TomAugspurger (Member Author) commented Apr 26, 2017 via email

@cicdw (Collaborator) commented Apr 26, 2017

I'd like to throw in a quick comment: we shouldn't be trying to beat any benchmarks for output quality / predictive power; dask-glm is just fitting a simple regularized logistic regression, and so is scikit-learn. We just want our output to be the true optimizer for the problem the user has provided. I believe reporting the score metrics was initially being used as a proxy to prove our two algorithms were converging on the same "true" MAP coefficient estimates (i.e., our output is just as reliable as scikit-learn).

I believe the sag / saga algorithm in scikit-learn only fits differentiable regularizers, i.e., it only works with l2 regularization, so comparing it against l1-regularized fits isn't a comparison of our packages' performance but merely a statement that for this dataset l2 regularization works better than l1 (and is faster, which is typically true because of differentiability).

@ogrisel commented Apr 26, 2017

SAGA works well with l1.

@cicdw (Collaborator) commented Apr 26, 2017

Ah, you're right; it works with the proximal operator instead of the gradient. 👍

@fabianp commented Apr 26, 2017

@ogrisel: thanks for the plug ;-)

@mrocklin: this is indeed the right paper (https://arxiv.org/pdf/1606.04809.pdf)

While this asynchronous algorithm works really well in practice, I'll highlight a couple of implementation caveats just to be 100% clear on it. I am not familiar with Dask, so feel free to comment; I'd love to have feedback on this:

  • The algorithm is fully asynchronous except for the per-coefficient update, which is assumed to be atomic (lines 11-13 in Algorithm 2), which is the common setting for these algorithms. That is, there are no locks on the full vector of coefficients as in other synchronous algorithms, but the individual scalar updates to the vector of coefficients do need to be atomic. This means that in order to get convergence to machine precision (one can argue whether this is necessary from a machine learning perspective, but it is important from an optimization perspective), one needs to have atomic double operations (we also refer to these as compare-and-swap operations in the paper). This turns out to be necessary not only in theory, but also in practice; see Fig. 6 and the paragraph "Necessity of compare-and-swap operations".
    I could do this in Java and C++, but I couldn't find a way to do this (efficiently) in Python.

  • The asynchronous algorithm in that paper is restricted to smooth objectives, but it can be extended to proximal terms. We have a paper under review on this that we will post on arXiv in the coming weeks.

(apologies for side-tracking the conversation and thanks for dask-glm!)

@mrocklin (Member)

Yeah, Dask certainly isn't going to operate at the low latencies required by algorithms such as the one you've implemented. Dask today imposes latencies on the order of low tens of milliseconds, which is almost certainly well above what your algorithm can tolerate.

However, my understanding is that these same sorts of algorithms sometimes develop blocked variants that are more tolerant of high latencies and so more amenable to distributed parallelism. Is this the case for asynchronous SAGA?

@ogrisel commented Apr 26, 2017

In practice, in that case it would probably make sense to run n instances of the penalized logistic regression fit with different values of C and on different CV splits to find the best model (in terms of validation accuracy). This is embarrassingly parallel, so it would be well suited to dask-glm (even if less interesting than distributed optimizers from an implementer's point of view).

This is basically what LogisticRegressionCV implements in scikit-learn (using single-machine threads via joblib).
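
A minimal sketch of that embarrassingly parallel search using dask.delayed; the helper names (fit_and_score, Cs, splits) are illustrative, not existing API:

import dask
from sklearn.linear_model import LogisticRegression

@dask.delayed
def fit_and_score(C, X_train, y_train, X_val, y_val):
    model = LogisticRegression(C=C).fit(X_train, y_train)
    return C, model.score(X_val, y_val)

# Cs: candidate regularization strengths; splits: (X_train, y_train, X_val, y_val) CV folds
tasks = [fit_and_score(C, *split) for C in Cs for split in splits]
results = dask.compute(*tasks)   # runs the fits in parallel on threads, processes, or a cluster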

@ogrisel commented Apr 26, 2017

Out-of-core SAGA would be nice too.

@ogrisel commented Apr 26, 2017

However, my understanding is that these same sorts of algorithms sometimes develop blocked variants that are more tolerant of high latencies and so more amenable to distributed parallelism. Is this the case for asynchronous SAGA?

Blocked Coordinate Descent is competitive for L1 (or L1+L2) penalized LogisticRegression when n_features >> n_samples. But naive implementations are bad compared to variants with screening rules and working sets. As soon as you implement those, parallelism might get trickier (or even impossible) but I am not expert enough to tell. @agramfort or @mblondel might know better.

@agramfort (Contributor) commented Apr 26, 2017 via email

@mrocklin (Member)

@MLnick did a quick version of this with decent results over here: #30 (comment)

@fabianp commented Apr 28, 2017 via email

@MLnick (Contributor) commented Apr 28, 2017

@agramfort interesting for L1 with L-BFGS-B. I thought the OWLQN variant was required for L1. Do you mind sharing what the "trick" is? :)

On the subject of distributed SGD: I'm quite keen to look into a "parameter server"-type version of parallel SGD on dask, either fully async or with bounded-staleness sync, with a distributed parameter vector. I think it may be quite nice to do it via the local Client interface (or perhaps Channel?), but I haven't had a chance to really dig into it just yet.

For large sparse problems I think it would work well (e.g. sparse LR and FMs cf. DiFacto).

@agramfort (Contributor) commented Apr 28, 2017 via email

@MLnick mentioned this pull request May 10, 2017