
Update coefficient assignment #914

Merged · 16 commits · May 27, 2022

Conversation

@kchare (Contributor) commented on Apr 1, 2022:

What

This PR contains three primary additions:

  1. A bug fix for coefficient assignment in dask_ml.linear_model.utils, making the add_intercept functionality consistent between NumPy and Dask arrays. Previously the two code paths differed, so .coef_ and .intercept_ were not equal for models fit on the same data supplied as a NumPy array versus a Dask array (see the sketch after this list).
  2. Tests to ensure that models fit on the same data supplied as a NumPy array or a Dask array yield the same result.
  3. Tests of model coefficients against scikit-learn on this same data, ensuring that coefficients match for in-memory data.
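
For illustration, here is a minimal sketch of the consistency check described in items 1 and 2. The estimator, data, and tolerance are illustrative, not the exact test added in this PR:

```python
import numpy as np
import dask.array as da
from dask_ml.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_np = rng.normal(size=(100, 5))
y_np = X_np @ rng.normal(size=5) + 0.5

# The same data, once as NumPy arrays and once as chunked Dask arrays.
X_da = da.from_array(X_np, chunks=(50, 5))
y_da = da.from_array(y_np, chunks=50)

model_np = LinearRegression(fit_intercept=True).fit(X_np, y_np)
model_da = LinearRegression(fit_intercept=True).fit(X_da, y_da)

# After the fix, the fitted coefficients and intercepts should agree
# regardless of the input array type.
np.testing.assert_allclose(model_np.coef_, model_da.coef_, atol=1e-3)
np.testing.assert_allclose(model_np.intercept_, model_da.intercept_, atol=1e-3)
```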

Why

#860 traces the bug and (a) suggests a fix, (b) proposes tests to tie out Dask-ML linear models across input data types, and (c) proposes tests for equality of model coefficients between Dask-ML and scikit-learn.

@kchare marked this pull request as ready for review on April 1, 2022 16:39
@stsievert (Member) left a comment:

👍 👍 I'm glad to see this fix. I'm really glad to see these tests! They're long overdue.

I have some questions about the tests (and some style nits):


def test_poisson_regressor_against_sklearn(single_chunk_count_classification):
    X, y = single_chunk_count_classification
    skl_model = sklPoissonRegressor(alpha=0, fit_intercept=True)
@stsievert (Member):

Is alpha = 0 to make sure that the regression is not regularized for both Dask-ML and Scikit-learn?

(not important: I think C = 1 / alpha or C = 1 / (n * alpha) with len(y) == n).

@kchare (Contributor, author) replied:

Yes, this is exactly the reasoning in this test. The scikit-learn documentation indicates that sklearn.linear_model.PoissonRegressor with alpha=0 is equivalent to the unpenalized GLM (link).
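
For reference, a rough sketch of the relationship behind both points, assuming scikit-learn's documented PoissonRegressor objective; the C-scaled form is a generic parameterization rather than the exact one dask-glm uses:

```latex
% scikit-learn's PoissonRegressor objective (d is the unit deviance):
\min_w \; \frac{1}{2n} \sum_{i=1}^{n} d(y_i, \hat{y}_i) + \frac{\alpha}{2} \lVert w \rVert_2^2
% For \alpha > 0, dividing through by \alpha yields an equivalent C-scaled problem,
% consistent with the reviewer's C = 1 / (n * alpha) guess:
\min_w \; \frac{C}{2} \sum_{i=1}^{n} d(y_i, \hat{y}_i) + \frac{1}{2} \lVert w \rVert_2^2,
\qquad C = \frac{1}{n \alpha}
% With \alpha = 0 the penalty term vanishes, i.e. the fit is the unpenalized GLM.
```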

@@ -173,6 +182,14 @@ def test_add_intercept_raises_chunks():

    assert m.match("Chunking is only allowed")

def test_add_intercept_ordering():
@stsievert (Member):

👍 👍

@stsievert (Member) left a comment:

This PR looks pretty good. A couple nitpicky comments below.

I'm glad to see the tests in this PR. They ensure that add_intercept works the same for NumPy and Dask arrays, and also that various Dask-ML linear estimators produce the same results as the corresponding Scikit-learn estimators.

informative_idx = rng.choice(
    n_features, n_informative, chunks=n_informative, replace=False
)
beta = (rng.random(n_features, chunks=n_features) - 0.5) * 2 * scale
@stsievert (Member):

Why is this change required?

@kchare (Contributor, author) replied:

I had changed this due to some issues with convergence for the Poisson Regressor. I am not an expert in Poisson regression, but this previously had beta values between [-1, 0) and the update brought that to [-1, 1]. Reverting it back makes no difference in the updated tests, so I will revert the change.
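
For context, a minimal NumPy illustration of the two ranges described above (the real code uses Dask arrays and a scale factor):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100_000)

old_beta = x - 1            # original formula with scale = 1: values in [-1, 0)
new_beta = (x - 0.5) * 2    # updated formula with scale = 1: values in [-1, 1)

print(old_beta.min(), old_beta.max())  # roughly -1.0 and 0.0; never positive
print(new_beta.min(), new_beta.max())  # roughly -1.0 and 1.0; both signs
```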

@kchare (Contributor, author) commented on Apr 20, 2022:

As I consolidated the tests to address your comments, I realized that scikit-learn actually assigns an intercept value of 0.0 when fit_intercept=False (see here). To ensure that the results are the same between scikit-learn and dask-ml, I have also addressed that difference in the latest commit.
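
A minimal illustration of the scikit-learn behavior being matched; LinearRegression is used here for brevity, and the GLM estimators document the same intercept_ = 0.0 behavior:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# With fit_intercept=False, scikit-learn still exposes intercept_, set to exactly 0.0.
model = LinearRegression(fit_intercept=False).fit(X, y)
assert model.intercept_ == 0.0
```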

@stsievert (Member) left a comment:

This is looking pretty good! I've got a couple style nits/etc below.

beta = (rng.random(n_features, chunks=n_features) - 1) * scale

informative_idx, beta = dask.compute(informative_idx, beta)

z0 = X[:, informative_idx].dot(beta[informative_idx])
z0 = X[:, informative_idx].dot(beta[informative_idx]) + 0.5
@stsievert (Member):

Why are changes to this file required?

@kchare (Contributor, author) replied:

This adds an explicit intercept term, as the X array does not have a constant term for the intercept. I am happy to remove it, however, as it does not change the results.

@stsievert (Member) left a comment:

LGTM after the changes below are implemented!

@@ -64,12 +64,14 @@ def make_counts(
     rng = dask_ml.utils.check_random_state(random_state)

     X = rng.normal(0, 1, size=(n_samples, n_features), chunks=(chunks, n_features))
-    informative_idx = rng.choice(n_features, n_informative, chunks=n_informative)
+    informative_idx = rng.choice(
+        n_features, n_informative, chunks=n_informative, replace=False
@stsievert (Member):

This is a good change.

I think this PR will get merged faster if all the changes in this file are left to another PR.

@stsievert (Member):

I think it'd be good to do some better error handling in this function (like making sure that n_informative < n_features). I think a future PR would be a great place for that alongside workarounds for the items in #914 (comment) if they're still relevant.

Past the changes in this file, this PR LGTM!
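
A rough sketch of the kind of validation suggested above; the helper name and message are hypothetical, and the actual check would land in a future PR:

```python
def _validate_make_counts_args(n_features: int, n_informative: int) -> None:
    # Hypothetical guard: reject impossible settings before generating data.
    if n_informative > n_features:
        raise ValueError(
            f"n_informative ({n_informative}) must not exceed "
            f"n_features ({n_features})"
        )
```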

@kchare (Contributor, author) replied:

Awesome, thank you! I have reverted the two changes from the datasets.py file. When I did so, I had to make one small change to the atol condition of the tests to make it pass for the PoissonRegressor: the original condition used atol=1e-4, but after reverting the dataset changes the maximum difference was ~0.00014, so I loosened the tolerance to atol=2e-4.
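
Illustratively, the comparison in question looks something like this; the coefficient values are made up, and the real test compares the fitted .coef_ attributes:

```python
import numpy as np

# Hypothetical stand-ins for the Dask-ML and scikit-learn coefficient vectors.
dask_ml_coef = np.array([0.50014, -0.29992])
sklearn_coef = np.array([0.50000, -0.30000])

# atol was loosened from 1e-4 to 2e-4 so a ~1.4e-4 difference still passes.
np.testing.assert_allclose(dask_ml_coef, sklearn_coef, atol=2e-4)
```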

A future PR for the other changes to the datasets.py file sounds like a good idea to me. I would be happy to work on that at some point, but may not be able to do so in the short term.

@stsievert (Member) commented on May 21, 2022:

All but two of the tests/checks pass:

Name                   Status
Documentation          ✗
Linting                ✔︎
Tests (3.7, ubuntu)    ✔︎
Tests (3.8, ubuntu)    ✔︎
Tests (3.9, ubuntu)    ✗
Upstream / check       ✔︎

The doc error isn't relevant (unexpected warning). Here are some details on the (not relevant) errors/warnings for the 3.9 tests:

  • TypeError: _fit() got an unexpected keyword argument 'return_counts'
  • AttributeError: 'OneHotEncoder' object has no attribute '_infrequent_indices'
  • FutureWarning: if_delegate_has_method was deprecated in version 1.1 and will be removed in version 1.3. Use if_available instead.
  • FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use loss='log_loss' which is equivalent.
  • ValueError: ndarray is not C-contiguous

None of those are relevant to this PR.

@stsievert (Member) commented:
I've verified that the CI errors on this PR are not relevant (see #914 (comment)). I will squash and merge this PR next weekend unless I hear otherwise.

@stsievert mentioned this pull request on May 22, 2022
@TomAugspurger (Member) commented on May 27, 2022:

Let's merge this and see if CI passes on main. Thanks for the review @stsievert.

@TomAugspurger merged commit fa40fa3 into dask:main on May 27, 2022