Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation of dropcollinear feature in GeneralizedLinearModel #488

Merged
merged 88 commits into from
Oct 21, 2022
Merged
Changes from 1 commit
Commits
Show all changes
88 commits
Select commit Hold shift + click to select a range
363b74c
Add support for rank deficient Generalized Linear Models
DilumAluthge Jun 22, 2019
c415886
remove unexectued code
jason-xuan Oct 6, 2019
1fd0790
solve issue about converges slower
jason-xuan Oct 18, 2019
2eb96e5
remove file
jason-xuan Sep 14, 2021
405fe93
PowerLink refers to a class of techniques that use a power function
Apr 12, 2022
742455f
added a few more testcases
Apr 13, 2022
f788854
Fixing doctests for PowerLink
ayushpatnaikgit Apr 15, 2022
f65b842
Making subheading smaller than heading.
ayushpatnaikgit Apr 15, 2022
6b80352
Merge pull request #1 from xKDR/powerlink
ayushpatnaikgit Apr 16, 2022
683d512
Rounding off a test case result to 2 digits
ayushpatnaikgit Apr 16, 2022
2e3b67a
adding Optim dependancy to docs
ayushpatnaikgit Apr 16, 2022
5b09ae1
No need to have optim as a dependancy for the entire package. Only ne…
ayushpatnaikgit Apr 16, 2022
fddeebe
Rounding off GLM.linkinv(PowerLink(-1), 10) and GLM.linkinv(PowerLink…
ayushpatnaikgit Apr 16, 2022
e1e7631
Rounding off GLM.mueta(PowerLink(-1), 10) to make the test work on Ju…
ayushpatnaikgit Apr 16, 2022
b9ffe36
Using inexact equality comparison instead of rounding.
ayushpatnaikgit Apr 17, 2022
ad4c986
Apply suggestions from code review
ayushpatnaikgit Apr 18, 2022
f9dcc4c
Apply suggestions from code review
ayushpatnaikgit Apr 18, 2022
9dae795
Apply suggestions from code review
ayushpatnaikgit Apr 18, 2022
83e7b12
Update docs/src/examples.md
ayushpatnaikgit Apr 18, 2022
9d703a5
Update docs/src/examples.md
ayushpatnaikgit Apr 18, 2022
a0571a6
Adding the missing comma.
ayushpatnaikgit Apr 18, 2022
99aa23f
Update src/glmtools.jl
ayushpatnaikgit Apr 18, 2022
60bcde3
Update src/glmtools.jl
ayushpatnaikgit Apr 18, 2022
f8c7809
Apply suggestions from code review
ayushpatnaikgit Apr 18, 2022
aef3048
Follwing alphabetical order in references
ayushpatnaikgit Apr 18, 2022
93a607a
Adding suggested changes by nalimilan
ayushpatnaikgit Apr 18, 2022
9620cf9
Update test/runtests.jl
ayushpatnaikgit Apr 18, 2022
0d94609
Update test/runtests.jl
ayushpatnaikgit Apr 18, 2022
7943818
replaced `isapprox` function by `≈` symbol whereever possible, added …
Apr 18, 2022
3cf14f4
Update src/glmfit.jl
mousum-github Apr 19, 2022
12bdab6
Update src/glmfit.jl
mousum-github Apr 19, 2022
bc80e7f
updated test cases, added a few more metrics to check, move link memb…
Apr 20, 2022
daf91a0
Rephrased the description of PowerLink
Apr 20, 2022
2138ece
Merge branch 'JuliaStats:master' into master
ayushpatnaikgit Apr 22, 2022
421eb3b
Update docs/src/examples.md
ayushpatnaikgit Apr 22, 2022
48983a6
Removed `GLM.Link(mm::AbstractGLM) = mm.l`, redefine defination of Po…
Apr 24, 2022
e32a2ef
Merge branch 'master' of https://github.com/xKDR/GLM.jl
Apr 24, 2022
634ece5
update test case
May 6, 2022
06d6a59
Update src/glmtools.jl
mousum-github May 20, 2022
6a3ceed
changed the `inverselink` function for PowerLink as suggested in PR
May 20, 2022
bda271a
Merge branch 'JuliaStats:master' into master
mousum-github Jun 28, 2022
7e559de
resolved conflicts
harsharora21 Jun 30, 2022
fbc7c8a
Merge branch 'jason-xuan-da/allow-rank-deficient' into dropcollinearity
harsharora21 Jun 30, 2022
4b2d329
resolved conflicts
harsharora21 Jun 30, 2022
6853761
resolved conflicts again
harsharora21 Jun 30, 2022
d9fd30c
changed allowrankdeficient to dropcollinear
harsharora21 Jun 30, 2022
d0888bf
fixed deviance test
harsharora21 Jun 30, 2022
017d0af
used StableRNGs instead of Random
harsharora21 Jun 30, 2022
c87d8c1
removed Random
harsharora21 Jun 30, 2022
ff72913
added test cases, defined dof function for GLM with pivoting
Jun 30, 2022
d6634dd
minor update
Jul 1, 2022
55122a8
added dropcollinear argument in API doc
Jul 15, 2022
4316b64
added more test cases
Jul 15, 2022
743c4bf
default value for dropcollinear is changed to false
Jul 15, 2022
845d449
Updated a test case
Aug 21, 2022
295f001
Update src/glmfit.jl
mousum-github Sep 9, 2022
9e12526
Update test/runtests.jl
mousum-github Sep 9, 2022
236f184
Update test/runtests.jl
mousum-github Sep 9, 2022
24d88c2
Update delbeta! function to correct the standard errors
Oct 1, 2022
38ee501
updated assigments in the test cases. `dfrm[!, :x1]` is changed to `d…
Oct 1, 2022
78e7a3f
added `confint` tests in test case
Oct 2, 2022
c3c3bb2
Tested `NaN` in a test case
Oct 2, 2022
c7cc974
Update test/runtests.jl
mousum-github Oct 4, 2022
d5120f1
Update test/runtests.jl
mousum-github Oct 4, 2022
a41810d
Update test/runtests.jl
mousum-github Oct 4, 2022
8db9021
Update test/runtests.jl
mousum-github Oct 4, 2022
223bcd5
Update test/runtests.jl
mousum-github Oct 4, 2022
120b804
Update test/runtests.jl
mousum-github Oct 4, 2022
02dd6e3
Update test/runtests.jl
mousum-github Oct 4, 2022
905c6da
Update test/runtests.jl
mousum-github Oct 4, 2022
998393f
Merge branch 'master' into dropcollinearity
nalimilan Oct 8, 2022
c5e980f
Fix typo
nalimilan Oct 8, 2022
f0e01a4
Added test cases and changed `model_rank` to `linpred_rank`
Oct 17, 2022
a2bd6ec
Removing `PosDefException` test for timing. It is throwing error in s…
Oct 17, 2022
dbba47d
Added test cases and changed `model_rank` to `linpred_rank`
Oct 17, 2022
69a1015
Changed glm model desciptions example.md file
Oct 17, 2022
353449f
Added `rankdeficient` test cases
Oct 17, 2022
3201d4c
Update src/glmfit.jl
mousum-github Oct 20, 2022
371d01f
Update src/glmfit.jl
mousum-github Oct 20, 2022
3c31d26
Update src/linpred.jl
mousum-github Oct 20, 2022
85cbb52
Update test/runtests.jl
mousum-github Oct 20, 2022
b989aae
Update test/runtests.jl
mousum-github Oct 20, 2022
9c86701
Update test/runtests.jl
mousum-github Oct 20, 2022
779ee29
Update test/runtests.jl
mousum-github Oct 20, 2022
601cfc6
Update src/linpred.jl
mousum-github Oct 20, 2022
4f8c5b4
Update test/runtests.jl
mousum-github Oct 20, 2022
a4a2466
Shifted the dropcollinear test case to the end. Removed commneted tes…
Oct 20, 2022
07e44de
Update src/glmfit.jl
mousum-github Oct 21, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
231 changes: 149 additions & 82 deletions test/runtests.jl
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,155 @@ function simplemm(x::AbstractVecOrMat)
mat
end

@testset "dropcollinear with GLMs" begin
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
data = DataFrame(x1=[4, 5, 9, 6, 5], x2=[5, 3, 6, 7, 1],
x3=[4.2, 4.6, 8.4, 6.2, 4.2], y=[14, 14, 24, 20, 11])

@testset "Check normal with identity link against equivalent linear model" begin
mdl1 = lm(@formula(y ~ x1 + x2 + x3), data; dropcollinear=true)
mdl2 = glm(@formula(y ~ x1 + x2 + x3), data, Normal(), IdentityLink();
dropcollinear=true)

@test coef(mdl1) ≈ coef(mdl2)
@test stderror(mdl1)[1:3] ≈ stderror(mdl2)[1:3]
@test isnan(stderror(mdl1)[4])
@test dof(mdl1) == dof(mdl2)
@test dof_residual(mdl1) == dof_residual(mdl2)
@test GLM.dispersion(mdl1.model, true) ≈ GLM.dispersion(mdl2.model,true)
@test deviance(mdl1) ≈ deviance(mdl2)
@test loglikelihood(mdl1) ≈ loglikelihood(mdl2)
@test aic(mdl1) ≈ aic(mdl2)
@test predict(mdl1) ≈ predict(mdl2)
end
@testset "Check against equivalent linear model when dropcollinear = false" begin
mdl1 = lm(@formula(y ~ x1 + x2), data; dropcollinear=false)
mdl2 = glm(@formula(y ~ x1 + x2), data, Normal(), IdentityLink();
dropcollinear=false)

@test coef(mdl1) ≈ coef(mdl2)
@test stderror(mdl1) ≈ stderror(mdl2)
@test dof(mdl1) == dof(mdl2)
@test dof_residual(mdl1) == dof_residual(mdl2)
@test GLM.dispersion(mdl1.model, true) ≈ GLM.dispersion(mdl2.model,true)
@test deviance(mdl1) ≈ deviance(mdl2)
@test loglikelihood(mdl1) ≈ loglikelihood(mdl2)
@test aic(mdl1) ≈ aic(mdl2)
@test predict(mdl1) ≈ predict(mdl2)
end

#@testset "Check `PosDefException` handling when dropcollinear = false" begin
bkamins marked this conversation as resolved.
Show resolved Hide resolved
# @test_throws PosDefException mdl1 = lm(@formula(y ~ x1 + x2 + x3), data; dropcollinear=false)
# @test_throws PosDefException mdl2 = glm(@formula(y ~ x1 + x2 + x3), data, Normal(), IdentityLink();
# dropcollinear=false)
#end

mousum-github marked this conversation as resolved.
Show resolved Hide resolved
@testset "Check normal with identity link against outputs from R" begin
mdl = glm(@formula(y ~ x1 + x2 + x3), data, Normal(), IdentityLink();
dropcollinear=true)
@test coef(mdl) ≈ [1.350439882697950, 1.740469208211143, 1.171554252199414, 0.0]
@test stderror(mdl)[1:3] ≈ [0.58371400875263, 0.10681694901238, 0.08531532203251]
@test dof(mdl) == 4
@test dof_residual(mdl) == 2
@test GLM.dispersion(mdl.model, true) ≈ 0.1341642228738996
@test deviance(mdl) ≈ 0.2683284457477991
@test loglikelihood(mdl) ≈ 0.2177608775670037
@test aic(mdl) ≈ 7.564478244866
@test predict(mdl) ≈ [14.17008797653959, 13.56744868035191, 24.04398826979472,
19.99413489736071, 11.22434017595308]
end

num_rows = 100
dfrm = DataFrame()
dfrm.x1 = randn(StableRNG(123), num_rows)
dfrm.x2 = randn(StableRNG(1234), num_rows)
dfrm.x3 = 2*dfrm.x1 + 3*dfrm.x2
dfrm.y = Int.(randn(StableRNG(12345), num_rows) .> 0)

@testset "Test Logistic Regression Outputs from R" begin
mdl = glm(@formula(y ~ x1 + x2 + x3), dfrm, Binomial(), LogitLink();
dropcollinear=true)
@test coef(mdl) ≈ [-0.1402582892604246, 0.1362176272953289, 0, -0.1134751362230204] atol = 1.0E-6
stderr = stderror(mdl)
@test isnan(stderr[3]) == true
@test vcat(stderr[1:2], stderr[4]) ≈ [0.20652049856206, 0.25292632684716, 0.07496476901643] atol = 1.0E-4
@test deviance(mdl) ≈ 135.68506068159
@test loglikelihood(mdl) ≈ -67.8425303407948
@test dof(mdl) == 3
@test dof_residual(mdl) == 98
@test aic(mdl) ≈ 141.68506068159
@test GLM.dispersion(mdl.model, true) ≈ 1
@test predict(mdl)[1:3] ≈ [0.4241893070433117, 0.3754516361306202, 0.6327877688720133] atol = 1.0E-6
@test confint(mdl)[1:2,1:2] ≈ [-0.5493329715011036 0.26350316142056085;
-0.3582545657827583 0.64313795309765587] atol = 1.0E-1
end

@testset "`rankdeficient` test case of lm in glm" begin
rng = StableRNG(1234321)
# an example of rank deficiency caused by a missing cell in a table
dfrm = DataFrame([categorical(repeat(string.('A':'D'), inner = 6)),
categorical(repeat(string.('a':'c'), inner = 2, outer = 4))],
[:G, :H])
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
f = @formula(0 ~ 1 + G*H)
X = ModelMatrix(ModelFrame(f, dfrm)).m
y = X * (1:size(X, 2)) + 0.1 * randn(rng, size(X, 1))
inds = deleteat!(collect(1:length(y)), 7:8)
m1 = fit(GeneralizedLinearModel, X, y, Normal())
@test isapprox(deviance(m1), 0.12160301538297297)
Xmissingcell = X[inds, :]
ymissingcell = y[inds]
@test_throws PosDefException m2 = fit(GeneralizedLinearModel, Xmissingcell, ymissingcell, Normal(); dropcollinear=false)
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
m2p = fit(GeneralizedLinearModel, Xmissingcell, ymissingcell, Normal(); dropcollinear=true)
@test isa(m2p.pp.chol, CholeskyPivoted)
@test rank(m2p.pp.chol) == 11
@test isapprox(deviance(m2p), 0.1215758392280204)
@test isapprox(coef(m2p), [0.9772643585228885, 8.903341608496437, 3.027347397503281,
3.9661379199401257, 5.079410103608552, 6.1944618141188625, 0.0, 7.930328728005131,
8.879994918604757, 2.986388408421915, 10.84972230524356, 11.844809275711485])
@test all(isnan, hcat(coeftable(m2p).cols[2:end]...)[7,:])

m2p_dep_pos = fit(GeneralizedLinearModel, Xmissingcell, ymissingcell, Normal();)
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
@test_logs (:warn, "Positional argument `allowrankdeficient` is deprecated, use keyword " *
"argument `dropcollinear` instead. Proceeding with positional argument value: true") fit(LinearModel, Xmissingcell, ymissingcell, true)
@test isa(m2p_dep_pos.pp.chol, CholeskyPivoted)
@test rank(m2p_dep_pos.pp.chol) == rank(m2p.pp.chol)
@test isapprox(deviance(m2p_dep_pos), deviance(m2p))
@test isapprox(coef(m2p_dep_pos), coef(m2p))
end

@testset "`rankdeficient` test in GLM with Poisson" begin
rng = StableRNG(1234321)
# an example of rank deficiency caused by a missing cell in a table
dfrm = DataFrame([categorical(repeat(string.('A':'D'), inner = 6)),
categorical(repeat(string.('a':'c'), inner = 2, outer = 4))],
[:G, :H])
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
f = @formula(0 ~ 1 + G*H)
X = ModelMatrix(ModelFrame(f, dfrm)).m
y = X * (1:size(X, 2)) + 0.1 * randn(rng, size(X, 1))
inds = deleteat!(collect(1:length(y)), 7:8)
m1 = fit(GeneralizedLinearModel, X, y, Gamma())
@test isapprox(deviance(m1), 0.0407069934950098)
Xmissingcell = X[inds, :]
ymissingcell = y[inds]
@test_throws PosDefException m2 = fit(GeneralizedLinearModel, Xmissingcell, ymissingcell, Gamma(); dropcollinear=false)
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
m2p = fit(GeneralizedLinearModel, Xmissingcell, ymissingcell, Gamma(); dropcollinear=true)
@test isa(m2p.pp.chol, CholeskyPivoted)
@test rank(m2p.pp.chol) == 11
@test isapprox(deviance(m2p), 0.04070377141288433)
@test isapprox(coef(m2p), [ 1.0232644374837732, -0.0982622592717195, -0.7735523403010212,
-0.820974608805111, -0.8581573302333557, -0.8838279927663583, 0.0, 0.667219148331652,
0.7087696966674913, 0.011287703617517712, 0.6816245514668273, 0.7250492032072612])
@test all(isnan, hcat(coeftable(m2p).cols[2:end]...)[7,:])

m2p_dep_pos = fit(GeneralizedLinearModel, Xmissingcell, ymissingcell, Gamma();)
mousum-github marked this conversation as resolved.
Show resolved Hide resolved
@test_logs (:warn, "Positional argument `allowrankdeficient` is deprecated, use keyword " *
"argument `dropcollinear` instead. Proceeding with positional argument value: true") fit(LinearModel, Xmissingcell, ymissingcell, true)
@test isa(m2p_dep_pos.pp.chol, CholeskyPivoted)
@test rank(m2p_dep_pos.pp.chol) == rank(m2p.pp.chol)
@test isapprox(deviance(m2p_dep_pos), deviance(m2p))
@test isapprox(coef(m2p_dep_pos), coef(m2p))
end
end

linreg(x::AbstractVecOrMat, y::AbstractVector) = qr!(simplemm(x)) \ y

@testset "lm" begin
Expand Down Expand Up @@ -1447,85 +1596,3 @@ end
end
end

@testset "dropcollinear with GLMs" begin
data = DataFrame(x1=[4, 5, 9, 6, 5], x2=[5, 3, 6, 7, 1],
x3=[4.2, 4.6, 8.4, 6.2, 4.2], y=[14, 14, 24, 20, 11])

@testset "Check normal with identity link against equivalent linear model" begin
mdl1 = lm(@formula(y ~ x1 + x2 + x3), data; dropcollinear=true)
mdl2 = glm(@formula(y ~ x1 + x2 + x3), data, Normal(), IdentityLink();
dropcollinear=true)

@test coef(mdl1) ≈ coef(mdl2)
@test stderror(mdl1)[1:3] ≈ stderror(mdl2)[1:3]
@test isnan(stderror(mdl1)[4])
@test dof(mdl1) == dof(mdl2)
@test dof_residual(mdl1) == dof_residual(mdl2)
@test GLM.dispersion(mdl1.model, true) ≈ GLM.dispersion(mdl2.model,true)
@test deviance(mdl1) ≈ deviance(mdl2)
@test loglikelihood(mdl1) ≈ loglikelihood(mdl2)
@test aic(mdl1) ≈ aic(mdl2)
@test predict(mdl1) ≈ predict(mdl2)
end
@testset "Check against equivalent linear model when dropcollinear = false" begin
mdl1 = lm(@formula(y ~ x1 + x2), data; dropcollinear=false)
mdl2 = glm(@formula(y ~ x1 + x2), data, Normal(), IdentityLink();
dropcollinear=false)

@test coef(mdl1) ≈ coef(mdl2)
@test stderror(mdl1) ≈ stderror(mdl2)
@test dof(mdl1) == dof(mdl2)
@test dof_residual(mdl1) == dof_residual(mdl2)
@test GLM.dispersion(mdl1.model, true) ≈ GLM.dispersion(mdl2.model,true)
@test deviance(mdl1) ≈ deviance(mdl2)
@test loglikelihood(mdl1) ≈ loglikelihood(mdl2)
@test aic(mdl1) ≈ aic(mdl2)
@test predict(mdl1) ≈ predict(mdl2)
end

#@testset "Check `PosDefException` handling when dropcollinear = false" begin
# @test_throws PosDefException mdl1 = lm(@formula(y ~ x1 + x2 + x3), data; dropcollinear=false)
# @test_throws PosDefException mdl2 = glm(@formula(y ~ x1 + x2 + x3), data, Normal(), IdentityLink();
# dropcollinear=false)
#end

@testset "Check normal with identity link against outputs from R" begin
mdl = glm(@formula(y ~ x1 + x2 + x3), data, Normal(), IdentityLink();
dropcollinear=true)
@test coef(mdl) ≈ [1.350439882697950, 1.740469208211143, 1.171554252199414, 0.0]
@test stderror(mdl)[1:3] ≈ [0.58371400875263, 0.10681694901238, 0.08531532203251]
@test dof(mdl) == 4
@test dof_residual(mdl) == 2
@test GLM.dispersion(mdl.model, true) ≈ 0.1341642228738996
@test deviance(mdl) ≈ 0.2683284457477991
@test loglikelihood(mdl) ≈ 0.2177608775670037
@test aic(mdl) ≈ 7.564478244866
@test predict(mdl) ≈ [14.17008797653959, 13.56744868035191, 24.04398826979472,
19.99413489736071, 11.22434017595308]
end

num_rows = 100
dfrm = DataFrame()
dfrm.x1 = randn(StableRNG(123), num_rows)
dfrm.x2 = randn(StableRNG(1234), num_rows)
dfrm.x3 = 2*dfrm.x1 + 3*dfrm.x2
dfrm.y = Int.(randn(StableRNG(12345), num_rows) .> 0)

@testset "Test Logistic Regression Outputs from R" begin
mdl = glm(@formula(y ~ x1 + x2 + x3), dfrm, Binomial(), LogitLink();
dropcollinear=true)
@test coef(mdl) ≈ [-0.1402582892604246, 0.1362176272953289, 0, -0.1134751362230204] atol = 1.0E-6
stderr = stderror(mdl)
@test isnan(stderr[3]) == true
@test vcat(stderr[1:2], stderr[4]) ≈ [0.20652049856206, 0.25292632684716, 0.07496476901643] atol = 1.0E-4
@test deviance(mdl) ≈ 135.68506068159
@test loglikelihood(mdl) ≈ -67.8425303407948
@test dof(mdl) == 3
@test dof_residual(mdl) == 98
@test aic(mdl) ≈ 141.68506068159
@test GLM.dispersion(mdl.model, true) ≈ 1
@test predict(mdl)[1:3] ≈ [0.4241893070433117, 0.3754516361306202, 0.6327877688720133] atol = 1.0E-6
@test confint(mdl)[1:2,1:2] ≈ [-0.5493329715011036 0.26350316142056085;
-0.3582545657827583 0.64313795309765587] atol = 1.0E-1
end
end