
Update coefficient assignment #914

Merged · 16 commits · May 27, 2022

Conversation

@kchare (Contributor) commented on Apr 1, 2022:

What

This PR contains three primary additions:

  1. A bug fix for coefficient assignment in dask_ml.linear_model.utils, making the add_intercept functionality consistent between NumPy and Dask arrays. Previously the two code paths differed, so .coef_ and .intercept_ were not equal for models fit on the same data supplied as a NumPy array versus a Dask array (see the sketch after this list).
  2. Tests to ensure that models fit on the same data supplied as a NumPy array or a Dask array yield the same result.
  3. Tests of model coefficients against scikit-learn on this same data, ensuring that coefficients match for in-memory data.
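
For illustration, here is a minimal sketch of the consistency check described in items 1 and 2. The estimator, data, and tolerance are illustrative, not the exact test added in this PR:

```python
import numpy as np
import dask.array as da
from dask_ml.linear_model import LinearRegression

rng = np.random.RandomState(0)
X_np = rng.normal(size=(100, 5))
y_np = X_np @ rng.normal(size=5) + 0.5

# The same data, once as NumPy arrays and once as chunked Dask arrays.
X_da = da.from_array(X_np, chunks=(50, 5))
y_da = da.from_array(y_np, chunks=50)

model_np = LinearRegression(fit_intercept=True).fit(X_np, y_np)
model_da = LinearRegression(fit_intercept=True).fit(X_da, y_da)

# After the fix, the fitted coefficients and intercepts should agree
# regardless of the input array type.
np.testing.assert_allclose(model_np.coef_, model_da.coef_, atol=1e-3)
np.testing.assert_allclose(model_np.intercept_, model_da.intercept_, atol=1e-3)
```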

Why

#860 traces the bug and (a) suggests a fix, (b) proposes tests to tie out Dask-ML linear models across input data types, and (c) proposes tests for equality of model coefficients between Dask-ML and scikit-learn.

@kchare marked this pull request as ready for review on April 1, 2022 16:39
@stsievert (Member) left a comment:

👍 👍 I'm glad to see this fix. I'm really glad to see these tests! They're long overdue.

I have some questions about the tests (and some style nits):


def test_poisson_regressor_against_sklearn(single_chunk_count_classification):
    X, y = single_chunk_count_classification
    skl_model = sklPoissonRegressor(alpha=0, fit_intercept=True)
@stsievert (Member):

Is alpha = 0 to make sure that the regression is not regularized for both Dask-ML and Scikit-learn?

(not important: I think C = 1 / alpha or C = 1 / (n * alpha) with len(y) == n).

@kchare (Contributor, author) replied:

Yes, this is exactly the reasoning in this test. The scikit-learn documentation indicates that sklearn.linear_model.PoissonRegressor with alpha=0 is equivalent to the unpenalized GLM (link).
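
For reference, a rough sketch of the relationship behind both points, assuming scikit-learn's documented PoissonRegressor objective; the C-scaled form is a generic parameterization rather than the exact one dask-glm uses:

```latex
% scikit-learn's PoissonRegressor objective (d is the unit deviance):
\min_w \; \frac{1}{2n} \sum_{i=1}^{n} d(y_i, \hat{y}_i) + \frac{\alpha}{2} \lVert w \rVert_2^2
% For \alpha > 0, dividing through by \alpha yields an equivalent C-scaled problem,
% consistent with the reviewer's C = 1 / (n * alpha) guess:
\min_w \; \frac{C}{2} \sum_{i=1}^{n} d(y_i, \hat{y}_i) + \frac{1}{2} \lVert w \rVert_2^2,
\qquad C = \frac{1}{n \alpha}
% With \alpha = 0 the penalty term vanishes, i.e. the fit is the unpenalized GLM.
```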

@@ -173,6 +182,14 @@ def test_add_intercept_raises_chunks():

    assert m.match("Chunking is only allowed")

def test_add_intercept_ordering():
@stsievert (Member):

👍 👍

@stsievert (Member) left a comment:

This PR looks pretty good. A couple nitpicky comments below.

I'm glad to see the tests in this PR. They ensure that add_intercept works the same for NumPy and Dask arrays, and also that various Dask-ML linear estimators produce the same results as the corresponding Scikit-learn estimators.

informative_idx = rng.choice(
    n_features, n_informative, chunks=n_informative, replace=False
)
beta = (rng.random(n_features, chunks=n_features) - 0.5) * 2 * scale
@stsievert (Member):

Why is this change required?

@kchare (Contributor, author) replied:

I had changed this due to some issues with convergence for the Poisson Regressor. I am not an expert in Poisson regression, but this previously had beta values between [-1, 0) and the update brought that to [-1, 1]. Reverting it back makes no difference in the updated tests, so I will revert the change.
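
For context, a minimal NumPy illustration of the two ranges described above (the real code uses Dask arrays and a scale factor):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(100_000)

old_beta = x - 1            # original formula with scale = 1: values in [-1, 0)
new_beta = (x - 0.5) * 2    # updated formula with scale = 1: values in [-1, 1)

print(old_beta.min(), old_beta.max())  # roughly -1.0 and 0.0; never positive
print(new_beta.min(), new_beta.max())  # roughly -1.0 and 1.0; both signs
```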

@kchare (Contributor, author) commented on Apr 20, 2022:

As I consolidated the tests to address your comments, I realized that scikit-learn actually assigns an intercept value of 0.0 when fit_intercept=False (see here). To ensure that the results are the same between scikit-learn and dask-ml, I have also addressed that difference in the latest commit.
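
A minimal illustration of the scikit-learn behavior being matched; LinearRegression is used here for brevity, and the GLM estimators document the same intercept_ = 0.0 behavior:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])

# With fit_intercept=False, scikit-learn still exposes intercept_, set to exactly 0.0.
model = LinearRegression(fit_intercept=False).fit(X, y)
assert model.intercept_ == 0.0
```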

@stsievert (Member) left a comment:

This is looking pretty good! I've got a couple style nits/etc below.

beta = (rng.random(n_features, chunks=n_features) - 1) * scale

informative_idx, beta = dask.compute(informative_idx, beta)

z0 = X[:, informative_idx].dot(beta[informative_idx])
z0 = X[:, informative_idx].dot(beta[informative_idx]) + 0.5
@stsievert (Member):

Why are changes to this file required?

@kchare (Contributor, author) replied:

This adds an explicit intercept term, as the X array does not have a constant term for the intercept. I am happy to remove it, however, as it does not change the results.

@stsievert (Member) left a comment:

LGTM after the changes below are implemented!

@@ -64,12 +64,14 @@ def make_counts(
     rng = dask_ml.utils.check_random_state(random_state)

     X = rng.normal(0, 1, size=(n_samples, n_features), chunks=(chunks, n_features))
-    informative_idx = rng.choice(n_features, n_informative, chunks=n_informative)
+    informative_idx = rng.choice(
+        n_features, n_informative, chunks=n_informative, replace=False
@stsievert (Member):

This is a good change.

I think this PR will get merged faster if all the changes in this file are left to another PR.

@stsievert (Member):

I think it'd be good to do some better error handling in this function (like making sure that n_informative < n_features). I think a future PR would be a great place for that alongside workarounds for the items in #914 (comment) if they're still relevant.

Past the changes in this file, this PR LGTM!
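
A rough sketch of the kind of validation suggested above; the helper name and message are hypothetical, and the actual check would land in a future PR:

```python
def _validate_make_counts_args(n_features: int, n_informative: int) -> None:
    # Hypothetical guard: reject impossible settings before generating data.
    if n_informative > n_features:
        raise ValueError(
            f"n_informative ({n_informative}) must not exceed "
            f"n_features ({n_features})"
        )
```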

@kchare (Contributor, author) replied:

Awesome, thank you! I have reverted the two changes from the datasets.py file. When I did so, I had to make one small change to the atol condition of the tests to make it pass for the PoissonRegressor: the original condition used atol=1e-4, but after reverting the dataset changes the maximum difference was ~0.00014, so I loosened the tolerance to atol=2e-4.
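
Illustratively, the comparison in question looks something like this; the coefficient values are made up, and the real test compares the fitted .coef_ attributes:

```python
import numpy as np

# Hypothetical stand-ins for the Dask-ML and scikit-learn coefficient vectors.
dask_ml_coef = np.array([0.50014, -0.29992])
sklearn_coef = np.array([0.50000, -0.30000])

# atol was loosened from 1e-4 to 2e-4 so a ~1.4e-4 difference still passes.
np.testing.assert_allclose(dask_ml_coef, sklearn_coef, atol=2e-4)
```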

A future PR for the other changes to the datasets.py file sounds like a good idea to me. I would be happy to work on that at some point, but may not be able to do so in the short term.

@stsievert (Member) commented on May 21, 2022:

All but two of the tests/checks pass:

Name                   Status
Documentation          ✗
Linting                ✔︎
Tests (3.7, ubuntu)    ✔︎
Tests (3.8, ubuntu)    ✔︎
Tests (3.9, ubuntu)    ✗
Upstream / check       ✔︎

The doc error isn't relevant (unexpected warning). Here are some details on the (not relevant) errors/warnings for the 3.9 tests:

  • TypeError: _fit() got an unexpected keyword argument 'return_counts'
  • AttributeError: 'OneHotEncoder' object has no attribute '_infrequent_indices'
  • FutureWarning: if_delegate_has_method was deprecated in version 1.1 and will be removed in version 1.3. Use if_available instead.
  • FutureWarning: The loss 'log' was deprecated in v1.1 and will be removed in version 1.3. Use loss='log_loss' which is equivalent.
  • ValueError: ndarray is not C-contiguous

None of those are relevant to this PR.

@stsievert (Member) commented:
I've verified that the CI errors on this PR are not relevant (see #914 (comment)). I will squash and merge this PR next weekend unless I hear otherwise.

@stsievert mentioned this pull request on May 22, 2022
@TomAugspurger (Member) commented on May 27, 2022:

Let's merge this and see if CI passes on main. Thanks for the review @stsievert.

@TomAugspurger merged commit fa40fa3 into dask:main on May 27, 2022