Update coefficient assignment #914
Conversation
👍 👍 I'm glad to see this fix. I'm really glad to see these tests! They're long overdue.
I have some questions about the tests (and some style nits):
tests/linear_model/test_glm.py (outdated)

```python
def test_poisson_regressor_against_sklearn(single_chunk_count_classification):
    X, y = single_chunk_count_classification
    skl_model = sklPoissonRegressor(alpha=0, fit_intercept=True)
```
Is `alpha=0` there to make sure that the regression is not regularized for both Dask-ML and Scikit-learn? (Not important: I think `C = 1 / alpha` or `C = 1 / (n * alpha)` with `len(y) == n`.)
Yes, this is exactly the reasoning in this test. The scikit-learn documentation indicates that `skl.linear_model.PoissonRegressor` with `alpha=0` is equivalent to the unpenalized GLM (link).
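A quick sanity sketch of that equivalence claim on synthetic data (the data generation here is mine, not from the test suite): with `alpha=0` the L2 penalty vanishes, so the unregularized maximum-likelihood coefficients should be recovered.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_beta = np.array([0.5, -0.25, 0.1])
# Poisson counts with log-link: rate = exp(X @ beta + intercept)
y = rng.poisson(np.exp(X @ true_beta + 0.3))

# alpha=0 disables the L2 penalty entirely, giving a plain Poisson GLM fit
model = PoissonRegressor(alpha=0, fit_intercept=True, max_iter=300).fit(X, y)
print(model.coef_, model.intercept_)
```

With 1000 samples the fitted coefficients land close to `true_beta` and the intercept close to 0.3, as expected for an unpenalized fit.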
```diff
@@ -173,6 +182,14 @@ def test_add_intercept_raises_chunks():
     assert m.match("Chunking is only allowed")


 def test_add_intercept_ordering():
```
👍 👍
This PR looks pretty good. A couple of nitpicky comments below.
I'm glad to see the tests in this PR. They ensure that `add_intercept` works the same for NumPy and Dask arrays, and also that various Dask-ML linear estimators produce the same results as the corresponding Scikit-learn estimators.
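For context, a minimal NumPy-only sketch of what an `add_intercept` helper does. This is purely illustrative: the function name and the choice to append the ones column last are my assumptions, and dask-ml's actual implementation in `dask_ml.linear_model.utils` may differ. The point of the PR is precisely that the intercept column must land in the same position for both backends.

```python
import numpy as np

def add_intercept_sketch(X):
    # Append a column of ones so the model can learn an intercept term;
    # the column position must be consistent across array backends.
    ones = np.ones((X.shape[0], 1), dtype=X.dtype)
    return np.concatenate([X, ones], axis=1)

X = np.arange(6.0).reshape(3, 2)
Xi = add_intercept_sketch(X)
print(Xi.shape)  # (3, 3)
```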
dask_ml/datasets.py (outdated)

```python
informative_idx = rng.choice(
    n_features, n_informative, chunks=n_informative, replace=False
)
beta = (rng.random(n_features, chunks=n_features) - 0.5) * 2 * scale
```
Why is this change required?
I had changed this due to some issues with convergence for the Poisson regressor. I am not an expert in Poisson regression, but this previously had `beta` values between [-1, 0) and the update brought that to [-1, 1]. Reverting it back makes no difference in the updated tests, so I will revert the change.
As I consolidated the tests to address your comments, I realized that
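The range difference follows directly from the two expressions: for uniform draws `u` in [0, 1), `(u - 1) * scale` is strictly non-positive, while `(u - 0.5) * 2 * scale` is symmetric around zero. A quick NumPy check (plain NumPy here rather than `dask.array`, for brevity):

```python
import numpy as np

u = np.random.default_rng(0).random(100_000)  # uniform draws in [0, 1)
scale = 1.0
old_beta = (u - 1) * scale        # original expression: values in [-scale, 0)
new_beta = (u - 0.5) * 2 * scale  # updated expression: values in [-scale, scale)
print(old_beta.min(), old_beta.max())
print(new_beta.min(), new_beta.max())
```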
This is looking pretty good! I've got a couple style nits/etc below.
dask_ml/datasets.py
Outdated
beta = (rng.random(n_features, chunks=n_features) - 1) * scale | ||
|
||
informative_idx, beta = dask.compute(informative_idx, beta) | ||
|
||
z0 = X[:, informative_idx].dot(beta[informative_idx]) | ||
z0 = X[:, informative_idx].dot(beta[informative_idx]) + 0.5 |
Why are changes to this file required?
This adds an explicit intercept term, as the `X` array does not have a constant term for the intercept. I am happy to remove it, however, as it does not change the results.
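Why the result is unchanged: adding a constant to the linear predictor only shifts the fitted intercept; the slope coefficients are unaffected. A sketch using ordinary least squares for simplicity (the same idea applies to the GLM's log-rate `z0`; the data here is synthetic and mine):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
beta = np.array([1.5, -2.0])
y = X @ beta + 0.5  # noiseless, with an explicit 0.5 constant term

# With fit_intercept=True, the constant is absorbed into intercept_
model = LinearRegression(fit_intercept=True).fit(X, y)
print(model.coef_)       # recovers beta
print(model.intercept_)  # recovers 0.5
```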
LGTM after the changes below are implemented!
dask_ml/datasets.py (outdated)

```diff
@@ -64,12 +64,14 @@ def make_counts(
     rng = dask_ml.utils.check_random_state(random_state)

     X = rng.normal(0, 1, size=(n_samples, n_features), chunks=(chunks, n_features))
-    informative_idx = rng.choice(n_features, n_informative, chunks=n_informative)
+    informative_idx = rng.choice(
+        n_features, n_informative, chunks=n_informative, replace=False
+    )
```
This is a good change.
I think this PR will get merged faster if all the changes in this file are left to another PR.
I think it'd be good to do some better error handling in this function (like making sure that `n_informative < n_features`). I think a future PR would be a great place for that, alongside workarounds for the items in #914 (comment) if they're still relevant.
Past the changes in this file, this PR LGTM!
Awesome, thank you! I have reverted the two changes from the `datasets.py` file. When I did so, I had to make one small change to the `atol` condition of the tests to make it pass for the `PoissonRegressor`. The original condition checked to 1e-4, and the maximum difference with the dataset update was ~0.00014, so I updated the condition to restrict only to 2e-4.

A future PR for the other changes to the `datasets.py` file sounds like a good idea to me. I would be happy to work on that at some point, but may not be able to do so in the short term.
All but two of the tests/checks pass. The doc error isn't relevant (an unexpected warning). Here are some details on the (not relevant) errors/warnings for the 3.9 tests:

None of those are relevant to this PR.
I've verified that the CI errors on this PR are not relevant (see #914 (comment)). I will squash and merge this PR next weekend unless I hear otherwise.
Let's merge this and see if CI passes on
What

This PR contains three primary additions:

- Changes to `dask_ml.linear_model.utils` to update the `add_intercept` functionality to be consistent between NumPy and Dask arrays. Previously, this was done differently, resulting in `.coef_` and `.intercept_` terms that were not equal for models fit on the same data with a NumPy array or Dask array.
- Tests against `scikit-learn` for this same data, ensuring that coefficients match for in-memory data.

Why

#860

This issue traces the bug, and (a) suggests a fix, (b) proposes tests to tie out Dask ML linear models between input data types, and (c) tests for equality of model coefficients between Dask ML and `scikit-learn`.