Taking weighting seriously #487

gragusa · 2022-07-15T16:07:11Z

This PR addresses several problems with the current GLM implementation.

Current status
In master, GLM/LM only accepts weights through the keyword wts. These weights are implicitly frequency weights.
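
As a note on what "frequency weights" means here, the following is a minimal sketch that uses plain linear algebra rather than GLM itself. The data and variable names are made up for illustration. It shows that weighted least squares with integer weights gives the same coefficients as ordinary least squares on the data with each row repeated that many times.

```julia
using LinearAlgebra

# Toy data (hypothetical): 4 distinct rows with integer frequency weights.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.9, 3.2, 3.9]
w = [1, 3, 2, 1]

X = [ones(length(x)) x]            # design matrix with an intercept column

# Weighted least squares: solve (X'WX) β = X'Wy.
W = Diagonal(float.(w))
beta_weighted = (X' * W * X) \ (X' * W * y)

# Ordinary least squares on the data with row i repeated w[i] times.
idx = reduce(vcat, [fill(i, w[i]) for i in eachindex(w)])
beta_expanded = X[idx, :] \ y[idx]

# Under the frequency interpretation, sum(w) is the number of observations.
n_obs = sum(w)
```

The two coefficient vectors agree up to floating-point error, which is the sense in which a bare wts vector acts as a frequency weight.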

With this PR
FrequencyWeights, AnalyticWeights, and ProbabilityWeights are supported. The API is as follows:

## Frequency Weights
lm(@formula(y~x), df; wts=fweights(df.wts))
## Analytic Weights
lm(@formula(y~x), df; wts=aweights(df.wts))
## ProbabilityWeights
lm(@formula(y~x), df; wts=pweights(df.wts))

The old behavior, passing a plain vector as wts=df.wts, is deprecated; for the moment, the array df.wts is coerced to FrequencyWeights.

To allow dispatching on the weights, CholPred takes a parameter T<:AbstractWeights. The unweighted LM/GLM has UnitWeights as the parameter for the type.
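
A minimal sketch of this kind of dispatch, assuming only StatsBase. ToyPred and describe_weights are hypothetical names used for illustration and are not the types or functions in this PR.

```julia
using StatsBase

# Hypothetical container that carries the weight type as a type parameter.
struct ToyPred{W<:AbstractWeights}
    wts::W
end

# Methods selected by the weight type parameter.
describe_weights(::ToyPred{<:UnitWeights}) = "unweighted"
describe_weights(::ToyPred{<:FrequencyWeights}) = "frequency"
describe_weights(::ToyPred{<:AnalyticWeights}) = "analytic"
describe_weights(::ToyPred{<:ProbabilityWeights}) = "probability"

# StatsBase constructors for each kind of weights.
labels = [describe_weights(ToyPred(uweights(3))),
          describe_weights(ToyPred(fweights([1, 2, 3]))),
          describe_weights(ToyPred(aweights([1.0, 2.0, 0.5]))),
          describe_weights(ToyPred(pweights([1.0, 2.0, 0.5])))]
```

Using UnitWeights for the unweighted case means there is no separate "no weights" code path: every model has a weights object, and the type says how to interpret it.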

This PR also implements residuals(r::RegressionModel; weighted::Bool=false) and modelmatrix(r::RegressionModel; weighted::Bool = false). The new signature for these two methods is pending in StatsAPI.

I had to make many changes to get everything working. Tests are passing, but some new features need new tests. Before implementing them, I wanted to make sure there is agreement on the approach taken.

I have also implemented momentmatrix, which returns the estimating function of the estimator. I came to the conclusion that it does not make sense to have a keyword argument weighted. Thus I will amend JuliaStats/StatsAPI.jl#16 to remove that keyword from the signature.

Update

I think I covered all the suggestions/comments, with this exception, which I still have to think about. Maybe it can be addressed later. The new standard errors (the ones for ProbabilityWeights) also work in the rank-deficient case (and so does cooksdistance).

Tests are passing, and I think they cover everything that I have implemented. I also added a section to the documentation about using weights and updated jldoc with the new signature of CholeskyPivoted.

To do:

Deal with weighted standard errors with rank deficient designs
Document the new API
Improve testing

Closes #186.

codecov-commenter · 2022-07-16T08:43:43Z

Codecov Report

Attention: Patch coverage is 79.81073% with 64 lines in your changes missing coverage. Please review.

Project coverage is 86.45%. Comparing base (89493a4) to head (574ec69).

Files with missing lines	Patch %	Lines
src/glmfit.jl	78.30%	23 Missing ⚠️
src/lm.jl	75.60%	20 Missing ⚠️
src/linpred.jl	84.74%	18 Missing ⚠️
src/glmtools.jl	62.50%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #487      +/-   ##
==========================================
- Coverage   90.33%   86.45%   -3.89%     
==========================================
  Files           8        8              
  Lines        1107     1277     +170     
==========================================
+ Hits         1000     1104     +104     
- Misses        107      173      +66

lrnv · 2022-07-20T07:45:33Z

Hey,

Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

I think the interface should allow for a DataFrame input of weights, which would take care of such things (as it does for the other variables).

gragusa · 2022-07-20T17:14:41Z

> Would that fix the issue I am having, which is that if rows of the data contain missing values, GLM discards those rows but does not discard the corresponding values of df.weights, and then complains that there are too many weights?

Not really, but it would be easy to make this a feature. Before digging further into this, though, I would like to know whether there is consensus on the approach of this PR.
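
Until something like that exists, one user-side workaround is to drop incomplete rows before extracting the weights, so that the weights always have the same length as the rows that get fitted. This is a minimal sketch assuming DataFrames; the column names and data are made up, and it is not part of this PR.

```julia
using DataFrames

# Toy data (hypothetical) with missing values in y and x.
df = DataFrame(y = [1.0, missing, 3.0, 4.0],
               x = [0.5, 1.5, missing, 3.5],
               weights = [1.0, 2.0, 1.0, 0.5])

# Drop rows with a missing value in any model column *before* taking the weights.
complete = dropmissing(df, [:y, :x, :weights])

# These weights are aligned with the rows that remain.
wts = complete.weights
```

The filtered table and wts can then be passed to the fitting call together.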

alecloudenback · 2022-08-14T19:14:57Z

FYI this appears to fix #420; a PR was started in #432, but the author closed it for lack of time to investigate CI failures.

Here's the test case pulled from #432, which passes with the changes in #487.

@testset "collinearity and weights" begin
    rng = StableRNG(1234321)
    # Draw from the seeded rng so that the test is reproducible
    x1 = randn(rng, 100)
    x1_2 = 3 * x1
    x2 = 10 * randn(rng, 100)
    x2_2 = -2.4 * x2
    y = 1 .+ randn(rng) * x1 + randn(rng) * x2 + 2 * randn(rng, 100)
    df = DataFrame(y = y, x1 = x1, x2 = x1_2, x3 = x2, x4 = x2_2, weights = repeat([1, 0.5],50))
    f = @formula(y ~ x1 + x2 + x3 + x4)
    lm_model = lm(f, df, wts = df.weights)#, dropcollinear = true)
    X = [ones(length(y)) x1_2 x2_2]
    W = Diagonal(df.weights)
    coef_naive = (X'W*X)\X'W*y
    @test lm_model.model.pp.chol isa CholeskyPivoted
    @test rank(lm_model.model.pp.chol) == 3
    @test isapprox(filter(!=(0.0), coef(lm_model)), coef_naive)
end

Can this test set be added?

Is there any other feedback for @gragusa ? It would be great to get this merged if good to go.

nalimilan · 2022-08-28T18:27:50Z

Sorry for the long delay, I hadn't realized you were waiting for feedback. Looks great overall, please feel free to finish it! I'll try to find the time to make more specific comments.

nalimilan

I've read the code. Lots of comments, but all of these are minor. The main one is mostly stylistic: in most cases it seems that using if wts isa UnitWeights inside a single method (like the current structure) gives simpler code than defining several methods. Otherwise the PR looks really clean!
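
To illustrate the stylistic point, here is a minimal sketch of the two styles on a toy function, assuming only StatsBase. toy_rss and toy_rss2 are hypothetical helpers, not code from this PR.

```julia
using StatsBase

# Style 1: a single method with a runtime check on the weight type.
function toy_rss(r::AbstractVector, wts::AbstractWeights)
    if wts isa UnitWeights
        return sum(abs2, r)
    else
        return sum(i -> wts[i] * abs2(r[i]), eachindex(r))
    end
end

# Style 2: one method per case, selected by dispatch.
toy_rss2(r::AbstractVector, ::UnitWeights) = sum(abs2, r)
toy_rss2(r::AbstractVector, wts::AbstractWeights) =
    sum(i -> wts[i] * abs2(r[i]), eachindex(r))

r = [1.0, 2.0]
unweighted = toy_rss(r, uweights(2))
weighted = toy_rss(r, aweights([2.0, 0.5]))
```

Both styles return the same values; the first keeps the weighted and unweighted logic side by side in one place, which is the simplification the comment refers to.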

What are your thoughts regarding testing? There are a lot of combinations to test and it's not easy to see how to integrate that into the current organization of tests. One way would be to add code for each kind of test to each @testset that checks a given model family (or a particular case, like collinear variables). There's also the issue of testing the QR factorization, which isn't used by default.

src/GLM.jl

src/glmfit.jl

src/lm.jl

test/runtests.jl

bkamins · 2022-08-31T08:49:28Z

A very nice PR. In the tests, can we have a test set that compares the results of aweights, fweights, and pweights for the same set of data (coeffs, predictions, covariance matrix of the estimates, p-values, etc.)?

andreasnoack · 2024-12-10T20:15:38Z

It looks like one of the last digits is flipping in one of the doctests. Would you be able to add a regex filter to that block?

nalimilan · 2024-12-11T13:04:50Z

src/linpred.jl

-"""
-    nobs(obj::LinearModel)
-    nobs(obj::GLM)
+residuals(obj::LinPredModel; weighted::Bool=false) = residuals(obj.rr; weighted=weighted)

-For linear and generalized linear models, returns the number of rows, or,
-when prior weights are specified, the sum of weights.
-"""


Looks like you have removed this docstring, which explains why CI fails when building docs.

nalimilan · 2024-12-11T13:11:03Z

Sorry to point this out, but a few comments @bkamins and I made are still unresolved AFAICT. Can you have a look? Codecov also indicates that some parts of the code that have been changed are not tested.

gragusa · 2024-12-13T21:41:29Z

@nalimilan I think I addressed all issues and comments.

nalimilan · 2024-12-18T11:25:23Z

Thanks and sorry for the delay. I think we're close, but I still see some comments from the 2022 reviews by @bkamins and me that seem to apply. For example https://github.com/JuliaStats/GLM.jl/pull/487/files#r1032949805, which is an important point to decide.

Also, Codecov indicates that only 80% of the diff is tested. Ideally it should be 100%, at least for code that was introduced by this PR. For example, right below the comment I mentioned there seem to be loglik_apweights_obs methods that are not tested at all. The same goes for some isweighted, loglikelihood and residuals methods.

docs/src/index.md

nalimilan · 2024-12-12T13:12:43Z

src/linpred.jl

+"""
+    nobs(obj::LinearModel)
+    nobs(obj::GLM)
+
+For linear and generalized linear models, returns the number of rows, or,
+when prior weights are specified, the sum of weights.
+"""


Returning the sum of weights is only correct when using FrequencyWeights, right? For other weights the number of rows is more appropriate.
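
A minimal sketch of the behavior suggested in this comment, assuming only StatsBase. toy_nobs is a hypothetical helper and not the nobs method in this PR.

```julia
using StatsBase

# Frequency weights count repeated observations, so their sum is the sample size.
toy_nobs(y::AbstractVector, wts::FrequencyWeights) = sum(wts)

# For any other kind of weights (including UnitWeights), use the number of rows.
toy_nobs(y::AbstractVector, wts::AbstractWeights) = length(y)

y = [1.0, 2.0, 3.0]
n_freq = toy_nobs(y, fweights([2, 1, 3]))
n_analytic = toy_nobs(y, aweights([2.0, 1.0, 3.0]))
n_unit = toy_nobs(y, uweights(3))
```

The same numeric weights give different observation counts depending on their type, which is why the docstring wording matters.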

nalimilan · 2024-12-12T13:18:04Z

src/lm.jl


 r2(obj::LinearModel) = 1 - deviance(obj)/nulldeviance(obj)
 adjr2(obj::LinearModel) = 1 - (1 - r²(obj))*(nobs(obj)-hasintercept(obj))/dof_residual(obj)

+working_residuals(x::LinearModel) = residuals(x)
+working_weights(x::LinearModel) = x.pp.wts


Define working_weights(x::LinPred) and call that from here for consistency.

nalimilan · 2024-12-12T13:22:40Z

src/lm.jl

-    u = residuals(obj)
-    mse = dispersion(obj,true)
+    u = residuals(obj; weighted=isweighted(obj))
+    mse = GLM.dispersion(obj,true)


Not really needed AFAICT?

Suggested change

-    mse = GLM.dispersion(obj,true)
+    mse = dispersion(obj,true)

nalimilan · 2024-12-18T11:00:39Z

src/linpred.jl

-end
+nobs(obj::LinPredModel) = nobs(obj.rr)
+
+weights(obj::RegressionModel) = weights(obj.model)


This is type piracy and no longer needed anyway in git master as we don't use TableRegressionModel anymore.

Suggested change

-weights(obj::RegressionModel) = weights(obj.model)

src/linpred.jl

nalimilan · 2024-12-18T11:06:23Z

src/lm.jl


+    f, (y, X) = modelframe(f, data, contrasts, LinearModel)
+    _wts = convert_weights(wts)
+    _wts = isempty(_wts) ? uweights(length(y)) : _wts


Also print a deprecation warning when weights have a different length from y. We don't want to continue accepting empty vectors in the future as people should use UnitWeights instead.
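
One way this could look, as a minimal sketch assuming only StatsBase. toy_convert_weights is a hypothetical helper with a made-up signature; the messages are illustrative, and the convert_weights function in this PR may differ.

```julia
using StatsBase

# Hypothetical conversion helper: warn on legacy inputs, pass real weights through.
function toy_convert_weights(wts::AbstractVector, n::Integer)
    # Proper weight objects are returned unchanged.
    wts isa AbstractWeights && return wts
    if isempty(wts)
        Base.depwarn("passing an empty vector as `wts` is deprecated; " *
                     "use `uweights(n)` instead", :toy_convert_weights)
        return uweights(n)
    end
    Base.depwarn("passing a plain vector as `wts` is deprecated; " *
                 "wrap it with `fweights`, `aweights` or `pweights`",
                 :toy_convert_weights)
    return fweights(wts)
end

from_empty = toy_convert_weights(Float64[], 3)
from_vector = toy_convert_weights([1, 2, 1], 3)
from_weights = toy_convert_weights(aweights([1.0, 2.0, 0.5]), 3)
```

Note that Base.depwarn is silent by default and only prints when Julia is started with --depwarn=yes (or raises with --depwarn=error).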

nalimilan · 2024-12-18T11:08:56Z

src/lm.jl

+        N = length(m.rr.y)
+        n = sum(log, wts)
+        0.5*(n - N * (log(2π * nulldeviance(m)/N) + 1))


Are we sure this definition is OK for both analytical weights and probability weights? I think we discussed this before, but loglikelihood throws an error for probability weights so I'm surprised that nullloglikelihood doesn't.
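
For reference, the expression in the snippet above can be written out as follows. The symbols are notation introduced here, not taken from the PR: N is length(m.rr.y), w_i are the weights, and D_0 is nulldeviance(m).

```latex
\ell_0 \;=\; \frac{1}{2}\left( \sum_{i} \log w_i \;-\; N\left[ \log\!\left( \frac{2\pi D_0}{N} \right) + 1 \right] \right)
```

With unit weights the sum of log-weights is zero, so the expression reduces to -(N/2)[log(2π D_0/N) + 1]. The question in this comment is whether the extra log-weight term is appropriate for both analytic and probability weights.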

nalimilan · 2024-12-18T11:12:34Z

src/lm.jl

@@ -316,8 +355,7 @@ function StatsModels.predict!(res::Union{AbstractVector,
        prediction, lower, upper = res
        length(prediction) == length(lower) == length(upper) == size(newx, 1) ||
            throw(DimensionMismatch("length of vectors in `res` must equal the number of rows in `newx`"))
-        length(mm.rr.wts) == 0 || error("prediction with confidence intervals not yet implemented for weighted regression")
-
+        mm.rr.wts isa UnitWeights || error("prediction with confidence intervals not yet implemented for weighted regression")


Suggested change

-        mm.rr.wts isa UnitWeights || error("prediction with confidence intervals not yet implemented for weighted regression")
+        isweighted(mm) && error("prediction with confidence intervals not yet implemented for weighted regression")
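
A minimal sketch of the guard in this suggestion, assuming only StatsBase. ToyModel, toy_isweighted and toy_predict_interval are hypothetical stand-ins; the isweighted function in this PR may be defined differently.

```julia
using StatsBase

# Hypothetical model type that only stores its weights.
struct ToyModel{W<:AbstractWeights}
    wts::W
end

# A model counts as weighted unless it carries UnitWeights.
toy_isweighted(m::ToyModel) = !(m.wts isa UnitWeights)

function toy_predict_interval(m::ToyModel)
    toy_isweighted(m) &&
        error("prediction with confidence intervals not yet implemented for weighted regression")
    return :ok
end

unweighted_result = toy_predict_interval(ToyModel(uweights(3)))

# The weighted case is expected to throw.
weighted_threw = try
    toy_predict_interval(ToyModel(aweights([1.0, 2.0, 0.5])))
    false
catch
    true
end
```

Going through a named predicate keeps the "what counts as weighted" decision in one place instead of repeating the isa check at each call site.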

gragusa added 20 commits June 10, 2022 20:53

WIP

1754cbd

WIP

1d778a5

WIP

12121a3

Taking weights seriously

4363ba4

WIP

ca702dc

Taking weights seriously

e2b2d12

Merge branch 'master' of https://github.com/JuliaStats/GLM.jl into Ju…

bc8709a

…liaStats-master

Add depwarn for passing wts with Vector

84cd990

Cosmettic changes

cbc329f

WIP

23d67f5

Fix loglik for weighted models

f4d90a9

Fix remaining issues

6b7d95c

Final commit

c236b82

Merge branch 'master'

d4bd0c2

Fix merge

8bdfb55

Fix nulldeviance

3eb2ca4

Bypass crossmodelmatrix drom StatsAPI

63c8358

Delete momentmatrix.jl

e93a919

Delete scratch.jl

7bb0959

Delete settings.json

ded17a8

ararslan requested review from andreasnoack and nalimilan August 15, 2022 19:54

nalimilan mentioned this pull request Aug 28, 2022

Fixed linear model with perfectly collinear rhs variables and weights #432

Closed

nalimilan reviewed Aug 31, 2022

gragusa added 2 commits November 26, 2024 00:13

Remove StatsPlots dependence.

5d948de

Fix weighting with :qr method.

4fb18df

gragusa added 2 commits December 11, 2024 13:12

Add filter to jldoctest string

56d81ae

Fix problem with docstrings

a2357cf

nalimilan reviewed Dec 11, 2024

gragusa and others added 16 commits December 12, 2024 10:38

Update docs/src/index.md

6068d2a

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Remove trailing white spaces

930a8cb

Add mention of UnitWeights in the weights discussion

107d17d

Remove trailing white spaces

a003b10

Change delbeta! signature

1c06c7e

Add tests for dropcollinear=false

b41cce7

Minor cosmethic changes

2730277

Add weighting information in COMMON_FIT_KWARGS_DOCS

cdeb1a3

Add test for leverage

95d506e

[wip] work on leverage

f124589

Use inverse

cbdadbc

Test leverage

2386ab9

Comment cookdistance

36326ff

Committed by mistake

f26bc0e

leverage returns a vec

2bc2138

Fix cookdistance return type

0569600

nalimilan reviewed Dec 18, 2024

gragusa and others added 4 commits December 18, 2024 12:51

Update docs/src/index.md

dd1b4a8

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Update docs/src/index.md

1c5953d

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/glmfit.jl

cd39578

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>

Update src/linpred.jl

574ec69

Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
