API: Sketch out Estimator API #40
Conversation
dask_glm/estimators.py (outdated diff):

```python
rho = rho if rho is not None else 1
over_relax = over_relax if over_relax is not None else 1
abstol = abstol if abstol is not None else 1e-4
reltol = reltol if reltol is not None else 1e-2
```
I think that we should avoid calling out algorithms explicitly by name here if possible. Is it possible to put these defaults in algorithms.py somehow?
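For illustration, a minimal sketch of how per-solver defaults could live in `algorithms.py` so the estimators never name algorithms directly. The `_solver_defaults` dict and `solver_defaults` helper are hypothetical, not dask-glm's actual API:

```python
# algorithms.py (hypothetical sketch): keep solver-specific defaults next to
# the solvers themselves, so estimators.py only forwards user overrides.
_solver_defaults = {
    'admm': {'rho': 1, 'over_relax': 1, 'abstol': 1e-4, 'reltol': 1e-2},
    'gradient_descent': {'tol': 1e-8},
}


def solver_defaults(solver, **overrides):
    """Return the default keyword arguments for ``solver``,
    updated with any user-supplied (non-None) overrides."""
    kwargs = dict(_solver_defaults.get(solver, {}))
    kwargs.update({k: v for k, v in overrides.items() if v is not None})
    return kwargs
```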
dask_glm/estimators.py (outdated diff):

```python
fit_kwargs = {'max_steps', 'tol', 'family'}
# TODO: use scikit-learn names throughout the lib
renamer = {'max_steps': 'max_iter', 'reg': 'regularizer'}
```
I think it would also be reasonable to change the existing algorithms to use these names consistently. I wouldn't be shy about touching that code.
dask_glm/estimators.py (outdated diff):

```python
from .utils import sigmoid, dot, add_intercept


class _GLM(BaseEstimator):
```
+1 for `_GLM` being a base class.
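As a rough sketch of the base-class idea (borrowing scikit-learn's `BaseEstimator`; the class and attribute names are illustrative, not necessarily what this PR ends up with):

```python
from sklearn.base import BaseEstimator


class _GLM(BaseEstimator):
    # Subclasses only declare which GLM family they model;
    # the shared fitting machinery lives here.
    family = None

    def __init__(self, solver='admm', max_iter=100, tol=1e-4):
        self.solver = solver
        self.max_iter = max_iter
        self.tol = tol


class LogisticRegression(_GLM):
    family = 'logistic'


class LinearRegression(_GLM):
    family = 'normal'
```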
dask_glm/estimators.py (outdated diff):

```python
return mean_square_error(y, self.predict(X))


def accuracy_score(y_true, y_pred):
```
Should these functions be in `utils`, or within the corresponding Estimator class?
Definitely not at the bottom of `estimators.py`; I forgot to move them before pushing :) I didn't want to go down the whole rabbit hole of our own `metrics.py`, but maybe we should do a couple of basic ones like this.
I'd be comfortable including these in `utils.py` for now.
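A minimal sketch of what those helpers might look like in `utils.py`; the function names come from the diff above, the bodies are my own illustration:

```python
import numpy as np


def accuracy_score(y_true, y_pred):
    # Fraction of exact matches; works for numpy and dask arrays alike,
    # since both support elementwise comparison and .mean().
    return (y_true == y_pred).mean()


def mean_square_error(y_true, y_pred):
    # Average squared residual.
    return ((y_true - y_pred) ** 2).mean()


y = np.array([0, 1, 1, 0])
print(accuracy_score(y, np.array([0, 1, 0, 0])))  # 0.75
```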
Here's an update to the notebook showing scikit-learn and dask on a slice of the taxi dataset: http://nbviewer.jupyter.org/gist/anonymous/c07a617684ec9629c204825c212383fe It shows dask-glm interacting with scikit-learn's pipeline API, and a small bit at the end with dask-searchcv. I haven't looked closely at the output, but the fit doesn't seem great. Not sure if it's the defaults / starting parameters or something else.
```python
pointwise_loss = family.pointwise_loss
pointwise_gradient = family.pointwise_gradient
regularizer = _regularizers.get(regularizer, regularizer)  # string
```
Ha, this line took me a second.
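For anyone else who pauses on it: `dict.get(key, default)` falls back to the key itself, so a string like `'l2'` is looked up in the registry while an already-constructed regularizer object passes straight through. A tiny standalone illustration (the registry and class here are made up):

```python
class L2Regularizer:
    """Stand-in for a real regularizer object."""


_regularizers = {'l2': L2Regularizer()}


def resolve(regularizer):
    # Strings hit the registry; regularizer instances are returned unchanged.
    return _regularizers.get(regularizer, regularizer)


print(resolve('l2'))              # the registered L2Regularizer instance
print(resolve(L2Regularizer()))   # a user-supplied instance, passed through
```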
```python
self._fit_kwargs = {k: getattr(self, k) for k in fit_kwargs}

def fit(self, X, y=None):
```
Whenever we include p-value output, we will need access to the data that was used to fit the model (in order to compute the Hessian). We could have this post-fit processing inside the `fit` method with an inert call to `self.post_process` or something as a placeholder, or store that info in some other cleverly-named attributes?
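One hedged way that placeholder could look; `_post_fit`, `_solve`, and the stored attribute names are hypothetical, just to make the idea concrete:

```python
class _GLM:
    def fit(self, X, y=None):
        self.coef_ = self._solve(X, y)
        # Keep references to the training data so later additions
        # (p-values via the Hessian, standard errors, ...) have what they need.
        self._fit_X = X
        self._fit_y = y
        self._post_fit(X, y)  # inert hook for now
        return self

    def _post_fit(self, X, y):
        # Placeholder: compute the Hessian / p-values here in a follow-up.
        pass

    def _solve(self, X, y):
        raise NotImplementedError
```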
The only other call-out I'll make here is that this initialization setup will make it difficult for a user to create an unregularized model (they will need to specify a hand-picked solver). I'm not worried about it for this PR, but I think we should be aware of it and make it easier (without having the user choose the solver) in the future.
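One possible future convenience, sketched with hypothetical names (not part of this PR): pick a compatible solver automatically when no regularizer is given, so users don't have to hand-pick one.

```python
def pick_solver(regularizer, solver=None):
    # Hypothetical helper: respect an explicit choice, otherwise fall back
    # to a solver that handles the unpenalized problem directly.
    if solver is not None:
        return solver
    if regularizer is None:
        return 'gradient_descent'  # no proximal step needed
    return 'admm'                  # reasonable default when a penalty is present
```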
Cool stuff. Comments on the notebook:
I took it from https://github.com/TomAugspurger/sktransformers/blob/master/sktransformers/preprocessing.py I'm not sure of the best home for them. I don't think we can reasonably expect the scikit-learn devs to "do the right thing" for dask stuff, like they're mostly doing for pandas dataframes. That said, it's not a ton of work to support regular and dask versions of these kinds of transformers.
Oh, an additional comment: it looks like our defaults don't match scikit-learn's. We might want to set all of our defaults to match scikit-learn where possible.
If you could move those stray functions around, change the default regularizer to 'l2' and the default lamduh to 1.0 (to match scikit-learn), then I think we're good to go.
Cool, I'll get to those this afternoon.
Ok, rebased and addressed those last couple comments @moody-marlin. And for fun, another example on the SUSY dataset (5,000,000 x 18; all numeric): http://nbviewer.jupyter.org/gist/anonymous/15742155693794ddd31ea85b654cbc7e

The takeaway, assuming I've done everything properly:

That's all on my local machine still.
Awesome, LGTM after flaking.
- Added a `datasets.make_classification` helper
- Added string names for solvers and regularizers
- Changed defaults to match scikit-learn where needed
- Changed parameter names to match scikit-learn
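For context, usage after these changes looks roughly like the snippet below. The keyword names (`regularizer`, `lamduh`, `solver`) come from this thread; I haven't re-verified every signature, so treat it as a sketch rather than canonical API documentation:

```python
from dask_glm.datasets import make_classification
from dask_glm.estimators import LogisticRegression

# Small synthetic problem from the new datasets helper.
X, y = make_classification(n_samples=10000, n_features=20)

# String names for the solver and regularizer; lamduh=1.0 is the new
# default discussed above, chosen to line up with scikit-learn.
lr = LogisticRegression(solver='admm', regularizer='l2', lamduh=1.0)
lr.fit(X, y)
print(lr.coef_)  # attribute name assumed to follow scikit-learn conventions
```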
It's very cool to see those performance numbers. cc @amueller @ogrisel in case they're interested. Notebook URL copied from above for convenience: http://nbviewer.jupyter.org/gist/anonymous/15742155693794ddd31ea85b654cbc7e
Nice. How many cores was that?
I believe it was 8 cores.
8 logical or 8 physical cores? My original assumption was that you were doing this from a personal laptop, presumably running something like a Core i7, which has four physical cores with hyperthreading up to eight logical cores.
Ah, 8 logical cores I suppose. I was just going off the number of workers that `Client` created (8). Looking in the settings there are indeed 4 cores.
Nice, though I get faster results on a single laptop core with scikit-learn's saga solver.
Heh, I wonder how large of a cluster we would need to beat @ogrisel's single laptop core :)
You need a dataset that doesn't fit into his RAM ;)
@TomAugspurger @moody-marlin if you're interested, I think that this would be an entertaining blog post to put out there: "Dask-glm beats scikit-learn by using parallelism. Scikit-learn then beats dask-glm by using smarter algorithms" :) It would both show off the sklearn API compatibility developed in this PR and would also carry the larger message that "there are usually better options than parallel computing".
Also I checked the resulting …
In dask-glm's defense, the fit takes only 6 min on 2 cores when you scale the data (same final training accuracy). saga is still 6x faster on a single core, though. Scaling the data is always a good idea :) and is required for the saga solver, otherwise it might not converge.
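For reference, scaling the features with dask stays lazy and is computed chunk-wise; a minimal sketch assuming `X` is a dask array of numeric columns:

```python
import dask.array as da

# Stand-in for the real feature matrix (e.g. the SUSY columns).
X = da.random.random((100000, 18), chunks=(25000, 18))

# Standardize each column to zero mean / unit variance before fitting.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
```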
More broadly, are there parallelizable convex optimization algorithms that we should implement in dask-glm that would likely perform better on this sort of problem than what is being used now (ADMM, I think, for this problem)?
I am not very familiar with ADMM and its competitors. Stochastic sequential solvers are very efficient for gradient-based estimators when the number of samples is large. Still, @fabianp presented a parallel version of SAGA at AISTATS last week:
For reference, here is my notebook (evaluated on my laptop with 2 physical cores): http://nbviewer.jupyter.org/gist/ogrisel/5f2d31bc5e7df852b4ca63f5f6049f42
@fabianp is this the right paper to read: https://arxiv.org/pdf/1606.04809.pdf ?
@TomAugspurger if you do a blog post, you should use the official train/test split (last 500,000 samples as the held-out test set) and report accuracy on both the train set (to assess the quality of the optimizers) and the test set (to assess the quality of the statistical model).
Thanks, will do.
I'd like to throw in a quick comment: we shouldn't be trying to beat any benchmarks for output quality / predictive power; I believe the …
SAGA works well with l1.
Ah, you're right, it works with the proximal operator instead of the gradient. 👍
@ogrisel: thanks for the plug ;-) @mrocklin: this is indeed the right paper (https://arxiv.org/pdf/1606.04809.pdf). While this asynchronous algorithm works really well in practice, I'll highlight a couple of implementation caveats just to be 100% clear on it. I am not familiar with Dask, so feel free to comment; I'd love to have feedback on this:
(apologies for side-tracking the conversation and thanks for dask-glm!)
Yeah, Dask certainly isn't going to operate at the low latencies required by algorithms such as the one you've implemented. Dask today imposes latencies on the order of low tens of milliseconds, which is almost certainly well above what your algorithm can tolerate. However, my understanding is that these same sorts of algorithms sometimes develop blocked variants that are more tolerant of high latencies and so more amenable to distributed parallelism. Is this the case for asynchronous SAGA?
In practice, in that case it would probably make sense to run … This is basically what is implemented by …
Out of core SAGA would be nice too.
Blocked Coordinate Descent is competitive for L1 (or L1+L2) penalized LogisticRegression when n_features >> n_samples. But naive implementations are bad compared to variants with screening rules and working sets. As soon as you implement those, parallelism might get trickier (or even impossible), but I am not expert enough to tell. @agramfort or @mblondel might know better.
Before doing something heavy, did you consider a distributed L-BFGS where you distribute the gradient computations? You can use L-BFGS-B for L1 penalty with a tiny trick.
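A rough sketch of that idea under the simplest possible setup: the chunked reductions (loss and gradient) go through dask, and `scipy.optimize.fmin_l_bfgs_b` drives the iterations from the client. The toy logistic-loss problem and every name below are illustrative, not dask-glm code:

```python
import numpy as np
import dask.array as da
from scipy.optimize import fmin_l_bfgs_b

# Toy logistic-regression data as dask arrays; in practice the chunks
# would live on a cluster rather than in local memory.
X = da.random.random((10000, 5), chunks=(2500, 5))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
p_true = 1 / (1 + da.exp(-X.dot(beta_true)))
y = (da.random.random(10000, chunks=2500) < p_true).astype('f8')


def loss_and_grad(beta):
    # Each reduction is computed chunk-wise by dask; only a scalar loss and
    # a length-p gradient travel back to the driver on each iteration.
    Xb = X.dot(beta)
    loss = (da.log1p(da.exp(Xb)) - y * Xb).sum()
    grad = X.T.dot(1 / (1 + da.exp(-Xb)) - y)
    return da.compute(loss, grad)


beta0 = np.zeros(5)
beta_hat, final_loss, info = fmin_l_bfgs_b(loss_and_grad, beta0)
print(beta_hat)  # should land in the neighbourhood of beta_true
```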
@MLnick did a quick version of this with decent results over here: #30 (comment)
I’m afraid that block (or batch) versions of asynchronous SGD or SAGA are less attractive than for coordinate descent. The reason is that being able to do sparse updates (in the sense of modifying a small subset of coefficients at each iteration) is a big part of the efficiency of asynchronous SGD/SAGA, and batching the updates would destroy the sparsity in the updates.
@agramfort interesting for L1 with L-BFGS-B. I thought the OWLQN variant was required for L1. Do you mind sharing what the "trick" is? :)

On the subject of distributed SGD: I'm quite keen to look into a "parameter server" type version of parallel SGD on dask, either fully async or bounded stale sync, with a distributed parameter vector. I think that it may be quite nice to do it via the local … For large sparse problems I think it would work well (e.g. sparse LR and FMs, cf. DiFacto).
See https://gist.github.com/vene/fab06038a00309569cdd: you basically split the coefs into positive and negative parts.
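To spell the trick out in a few lines: write `w = w_plus - w_minus` with both parts non-negative, so the L1 penalty becomes a smooth linear term and the non-negativity is handled by L-BFGS-B's box constraints. A toy least-squares example, purely illustrative:

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

rng = np.random.RandomState(0)
X = rng.randn(200, 10)
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]
y = X.dot(w_true) + 0.1 * rng.randn(200)

lam = 5.0
n_features = X.shape[1]


def f_and_grad(z):
    # z stacks the two non-negative halves: w = w_plus - w_minus.
    w_plus, w_minus = z[:n_features], z[n_features:]
    resid = X.dot(w_plus - w_minus) - y
    loss = 0.5 * resid.dot(resid) + lam * z.sum()   # the l1 term is now linear
    grad_w = X.T.dot(resid)
    grad = np.concatenate([grad_w + lam, -grad_w + lam])
    return loss, grad


z0 = np.zeros(2 * n_features)
bounds = [(0, None)] * (2 * n_features)             # w_plus, w_minus >= 0
z_hat, _, _ = fmin_l_bfgs_b(f_and_grad, z0, bounds=bounds)
w_hat = z_hat[:n_features] - z_hat[n_features:]
print(np.round(w_hat, 2))  # roughly recovers the first three coefficients, rest near zero
```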
Added a couple of Estimators for an end-user API. Basically …

To kick the API discussion even further down the road, right now it accepts either scikit-learn style `Estimator(...).fit(X, y)` or statsmodels-style `Estimator(X, y, ...).fit()`, but we should probably choose one or the other. Here's an example notebook.

Some miscellaneous changes to help:

- `datasets.make_classification` helper
- `add_intercept` helper (which I think might be broken right now...)