Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add residuals method for GLM #499

Open
wants to merge 34 commits into
base: master
Choose a base branch
from
Open

add residuals method for GLM #499

wants to merge 34 commits into from

Conversation

palday
Copy link
Member

@palday palday commented Sep 20, 2022

I haven't had the time to figure out how the denominator is computed for Pearson residuals and so haven't yet implemented them.

I set the default residual type to deviance to match R's behavior, though that might be surprising for GLMs with normal distribution and identity link.

Note: once again I've noticed that -0 in outputs creates a problem. Tests fail for me on Julia 1.8 with Apple Silicon because of sign-swapping on zero.

@palday palday requested a review from nalimilan September 20, 2022 22:06
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Could you add more tests to cover other families?

I'm not sure what's the best default, but doing the same as R is reasonable unless we have reasons to differ.

Regarding your XXX comment, if the relationship appears to holds for all families we could ask for confirmation to others.

src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmtools.jl Outdated Show resolved Hide resolved
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
Comment on lines +757 to +758
# XXX I think this might be the same as
# 2 * wrkresid, but I'm not 100% sure if that holds across families
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dmbates Does this relationship hold across families?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the devresid function here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it seems that that function is lower level and used in the computation of the associated field rather returning said field.

@codecov-commenter
Copy link

codecov-commenter commented Sep 21, 2022

Codecov Report

Attention: Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.49%. Comparing base (c13577e) to head (c3fbc4a).
Report is 8 commits behind head on master.

Files Patch % Lines
src/glmfit.jl 90.90% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##           master     #499   +/-   ##
=======================================
  Coverage   90.48%   90.49%           
=======================================
  Files           8        8           
  Lines        1125     1136   +11     
=======================================
+ Hits         1018     1028   +10     
- Misses        107      108    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@palday
Copy link
Member Author

palday commented Sep 21, 2022

@nalimilan the current doctests failures are unrelated to this PR, but do you have any objections to me updating them? For now, I would update update the expected output, but longterm we should improve pretty-printing of the types.

@nalimilan
Copy link
Member

Sure.

Would you feel like adding tests for all other model families? I'm always uncomfortable when what we test varies from one family to the next.

@palday
Copy link
Member Author

palday commented Sep 29, 2022

Would you feel like adding tests for all other model families? I'm always uncomfortable when what we test varies from one family to the next.

Of course -- I've already got tests in there for deviance, response and working residuals of the families:

  • Normal with identity link
  • Binomial + Bernoulli
  • InverseGaussian
  • Gamma

I've also added commented out tests for the same families with Pearson residuals.

@palday palday requested a review from nalimilan September 29, 2022 14:21
@ViralBShah
Copy link
Contributor

Bump. Perhaps @mousum-github can also help review this PR?

@mousum-github
Copy link
Collaborator

Thanks, @ViralBShah, let me go through the PR.

@mousum-github
Copy link
Collaborator

We can easily add Pearson Residuals using the following: -
sign.(response(model) .- fitted(model)) .* sqrt.(model.rr.wrkwt .* abs2.(model.rr.wrkresid))

(and the sum of square of Pearson residuals / residual dof, is the estimated dispersion parameter)

@nalimilan
Copy link
Member

@mousum-github Maybe you know the answer to https://github.com/JuliaStats/GLM.jl/pull/499/files#r976581242?

@palday Do you plan to add test for other method families?

src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
@mousum-github
Copy link
Collaborator

@mousum-github Maybe you know the answer to https://github.com/JuliaStats/GLM.jl/pull/499/files#r976581242?

@palday Do you plan to add test for other method families?

I am not sure about the relationship. I tried but was unable to establish such a relationship easily.

@ararslan
Copy link
Member

ararslan commented Dec 7, 2022

I haven't had the time to figure out how the denominator is computed for Pearson residuals and so haven't yet implemented them.

It's based on the variance function:

$$ e_i = \frac{y_i - \hat{\mu}_i}{\sqrt{V(\hat{\mu}_i)}} $$

so should be sqrt(glmvar(model.rr.d, _))

@nalimilan
Copy link
Member

Models with weights are not tested though, we need to make sure the result is correct or an error is thrown.

palday and others added 14 commits April 13, 2023 19:35
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
Co-authored-by: Alex Arslan <ararslan@comcast.net>
@palday
Copy link
Member Author

palday commented May 4, 2023

@mousum-github please don't start making changes on my PR without asking me first.

Co-authored-by: Alex Arslan <ararslan@comcast.net>
@mousum-github
Copy link
Collaborator

mousum-github commented May 4, 2023

I would like to add a few more test cases this week.

@mousum-github
Copy link
Collaborator

@mousum-github please don't start making changes on my PR without asking me first.

Sure, @palday. It is my bad.
Please let me know if I can do anything. I am preparing a few more test cases in R with weights and offsets.

@mousum-github
Copy link
Collaborator

mousum-github commented May 18, 2023

The following compares three types of residuals (response, deviance and Pearson) for different distributions and link functions in Julia and R. Overall the comparisons look good.

julia> using GLM, Random, LinearAlgebra, DataFrames, RCall

julia> Random.seed!(12345);

julia> df = DataFrame(x0 = ones(10), x1 = rand(10), x2 = rand(10), n = rand(50:70, 10), w=1 .+ 0.01*rand(10), off=1 .+ 0.01*rand(10));

julia> β = [1, -1, 1];

julia> X = convert(Matrix, df[:,1:3]);

julia> η = X*β;

julia> σ = 1;

julia>

julia> #  Poisson Distribution with Log link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogLink(), η);

julia> df["y"] = rand.(Poisson.(μ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, Poisson(););

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> @rput df;

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=poisson());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia> #  ..with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Poisson(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=poisson(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia>

julia> #  Binomial Distribution with Logit link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogitLink(), η);

julia> n = df["n"];

julia> df["y"] = rand.(Binomial.(n, μ)) ./ n;

julia> mdl = glm(@formula(y ~ x1 + x2), df, Binomial(););

julia> @rput df;

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;
┌ Warning: RCall.jl: Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
└ @ RCall C:\Users\user\.julia\packages\RCall\6kphM\src\io.jl:172

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1  jresid1 && rresid2  jresid2 && rresid3  jresid3
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Binomial(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;
┌ Warning: RCall.jl: Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
│ Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
└ @ RCall C:\Users\user\.julia\packages\RCall\6kphM\src\io.jl:172

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1  jresid1 && rresid2  jresid2 && rresid3  jresid3
true

julia>

julia> #  Bernoulli Distribution with Logit link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogitLink(), η);

julia> df["y"] = rand.(Bernoulli.(μ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, Bernoulli(););

julia> @rput df;

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1  jresid1 && rresid2  jresid2 && rresid3  jresid3
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Bernoulli(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=binomial(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;
┌ Warning: RCall.jl: Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
│ Warning in eval(family$initialize) :
│   non-integer #successes in a binomial glm!
└ @ RCall C:\Users\user\.julia\packages\RCall\6kphM\src\io.jl:172

julia> @rget rresid1 rresid2 rresid3;

julia> rresid1  jresid1 && rresid2  jresid2 && rresid3  jresid3
true

julia>

julia>

julia> #  Negative Binomial Distribution with Logit link #

julia> Random.seed!(123456);

julia> μ = GLM.linkinv.(LogLink(), η);

julia> π = μ ./.+ 10.0);

julia> df["y"] = rand.(NegativeBinomial.(10, π));

julia> mdl = negbin(@formula(y ~ x1 + x2), df, LogLink(); rtol=1.0E-12);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> @rput df;

julia> R"""
       library("MASS");
       mdl = glm.nb(y ~ x1+x2, data=df);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-4) && isapprox(rresid2, jresid2, atol=1.0E-4) && isapprox(rresid3, jresid3, atol=1.0E-4)
false

julia>

julia> #  Gamma Distribution with InverseLink link #

julia> Random.seed!(1234);

julia> μ = GLM.linkinv.(LogLink(), η);

julia> df["y"] = rand.(Gamma.(μ, σ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, Gamma(););

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> @rput df;

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=Gamma());
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, Gamma(),; wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=Gamma(), weights=w, offset=off);
       cf = coef(mdl);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia>

julia> #  InverseGaussian Distribution with InverseSquareLink link #

julia> Random.seed!(1234);

julia> μ = GLM.linkinv.(InverseSquareLink(), η);

julia> df["y"] = rand.(InverseGaussian.(μ, σ));

julia> mdl = glm(@formula(y ~ x1 + x2), df, InverseGaussian(););

julia> @rput df;

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=inverse.gaussian());
       cff = coef(mdl);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

julia> # .. with weights and offset #

julia> mdl = glm(@formula(y ~ x1 + x2), df, InverseGaussian(); wts=df["w"], offset=df["off"]);

julia> jresid1,jresid2,jresid3 = GLM.residuals(mdl, type=:response), GLM.residuals(mdl, type=:deviance),  GLM.residuals(mdl, type=:pearson);

julia> R"""
       mdl = glm(y ~ x1+x2, data=df, family=inverse.gaussian(), weights=w, offset=off);
       rresid1 = resid(mdl, type='response');
       rresid2 = resid(mdl, type='deviance');
       rresid3 = resid(mdl, type='pearson');
       """;

julia> @rget rresid1 rresid2 rresid3;

julia> isapprox(rresid1, jresid1, atol=1.0E-6) && isapprox(rresid2, jresid2, atol=1.0E-6) && isapprox(rresid3, jresid3, atol=1.0E-6)
true

@nalimilan
Copy link
Member

@palday Are you OK with @mousum-github adding more tests?

@mousum-github Thanks for checking. However, existing tests already fit models for all families, so better reuse these models and just add checks that residuals return the expected value rather than add new models.

@mousum-github
Copy link
Collaborator

@nalimilan - I also considered suggesting some tests within existing test cases. Since there are conflicts in the runtests.jl right now, just checked new functionalities only with R, especially with weights and offsets.

@mousum-github
Copy link
Collaborator

For residuals functionalities,

Existing Test cases:

  • Poisson GLM
  • (Binomial, Bernoulli) with LogitLink
  • Gamma
  • InverseGaussian
  • Geometric is a special case of NegativeBinomial
  • linear model with weights

So far I have added test cases (locally) to the following

  • Linear model
  • Linear model with weights
  • Linear model with dropcollinearity
  • Linear model with rankdeficieny
  • Linear model without intercept
  • Bernoulli ProbitLink
  • Bernoulli CauchitLink
  • Bernoulli CloglogLink
  • Normal with offset
  • Normal LogLink and offset
  • Poisson LogLink with offset and weights
  • Gamma with LogLink
  • Gamma with IdentityLink

And, I am planning to add some more test cases to the following by this week

  • GLM with no intercept
  • Geometric LogLink
  • NegativeBinomial LogLink
  • NegativeBinomial LogLink Fixed θ

Also, I have changed the existing residuals function for lm. Now it will support all types like GLM.

My plan is to raise a PR to pa/resid branch whenever I am done.

@nalimilan, please let us know if there is a better way to proceed.

@mousum-github
Copy link
Collaborator

mousum-github commented Jun 7, 2023

The changes in the residuals function with lm are like the following,

    resid = response(model) - fitted(model)
    if length(model.rr.wts) > 0 && (type === :deviance || type === :pearson)
        return resid .* sqrt.(model.rr.wts)
    else
        return resid
    end

@palday
Copy link
Member Author

palday commented Jun 7, 2023

Sorry, I've been a bit underwater with other priorities. I will start integrating these suggestions -- thanks @mousum-github !

@nalimilan
Copy link
Member

My plan is to raise a PR to pa/resid branch whenever I am done.

@nalimilan, please let us know if there is a better way to proceed.

Thanks. I guess that's the best approach so that @palday can have a look at the changes.

@mousum-github
Copy link
Collaborator

Thanks @nalimilan.
I have raised the PR #540.
@palday - hope it will help.

@ararslan ararslan mentioned this pull request Jun 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants