Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArgumentError: F test is only valid for nested models #489

Open
PharmCat opened this issue Jul 23, 2022 · 15 comments
Open

ArgumentError: F test is only valid for nested models #489

PharmCat opened this issue Jul 23, 2022 · 15 comments

Comments

@PharmCat
Copy link

I try to use GLM for standart bioequivalence testing for crossover design to get ANOVA table (protocol defined test) using ftest and have:

ArgumentError: F test is only valid for nested models

models:

olscmax  = lm(@formula(log(Cmax) ~ Sequence+Period+Formulation+Subject), ncadf; dropcollinear = true)
olscmax_null  = lm(@formula(log(Cmax) ~ Subject), ncadf; dropcollinear = true,)
olscmax_form  = lm(@formula(log(Cmax) ~ Subject + Formulation), ncadf; dropcollinear = true,)

ftest(olscmax_null.model, olscmax_form.model, olscmax.model)

Is any way to perfom ANOVA using GLM with crossover design?

Example data:

Subject	Sequence	Period	Formulation	Cmax
3	TR	1	T	225.95
1	RT	1	R	181.09
2	RT	1	R	114.48
4	RT	1	R	176.91
5	TR	1	T	147.01
6	TR	1	T	97.53
7	RT	1	R	146.60
8	TR	1	T	45.58
9	RT	1	R	109.20
10	RT	1	R	125.61
11	TR	1	T	92.26
12	RT	1	R	237.95
13	TR	1	T	145.46
14	TR	1	T	179.96
15	TR	1	T	173.86
16	RT	1	R	144.00
17	RT	1	R	185.10
18	TR	1	T	117.99
1	RT	2	T	210.14
2	RT	2	T	98.72
3	TR	2	R	241.09
4	RT	2	T	186.65
5	TR	2	R	139.56
6	TR	2	R	124.77
7	RT	2	T	137.62
8	TR	2	R	57.71
9	RT	2	T	139.36
10	RT	2	T	120.43
11	TR	2	R	116.10
12	RT	2	T	228.63
13	TR	2	R	165.09
14	TR	2	R	181.09
15	TR	2	R	206.66
16	RT	2	T	143.25
17	RT	2	T	192.22
18	TR	2	R	125.50
@PharmCat
Copy link
Author

For simmilar data I have:

should values for Subject: 15 should be dropped from results? (RegressionTables.jl, not working with this results because P is NaN)

julia> olscmax
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, CholeskyPivoted{Float64, Matrix{Float64}}}}, Matrix{Float64}}

:(log(Cmax)) ~ 1 + Sequence + Period + Formulation + Subject

Coefficients:
────────────────────────────────────────────────────────────────────────────────────
                     Coef.   Std. Error       t  Pr(>|t|)     Lower 95%    Upper 95%
────────────────────────────────────────────────────────────────────────────────────
(Intercept)      6.46193      0.077827    83.03    <1e-28    6.30052       6.62333
Sequence: tr    -0.0143149    0.105746    -0.14    0.8935   -0.233619      0.204989
Period: 2        0.039951     0.0305263    1.31    0.2041   -0.0233566     0.103259
Formulation: t   0.0358771    0.0305263    1.18    0.2524   -0.0274304     0.0991847
Subject: 2       0.518319     0.105746     4.90    <1e-04    0.299015      0.737623
Subject: 3       0.301846     0.105746     2.85    0.0092    0.082542      0.52115
Subject: 4       0.0409981    0.105746     0.39    0.7020   -0.178306      0.260302
Subject: 5      -0.596241     0.105746    -5.64    <1e-04   -0.815545     -0.376937
Subject: 6      -0.950327     0.105746    -8.99    <1e-08   -1.16963      -0.731023
Subject: 7       0.115056     0.105746     1.09    0.2884   -0.104247      0.33436
Subject: 8       0.61519      0.105746     5.82    <1e-05    0.395887      0.834494
Subject: 9       0.674252     0.105746     6.38    <1e-05    0.454948      0.893555
Subject: 10      0.843855     0.105746     7.98    <1e-07    0.624551      1.06316
Subject: 11      0.32713      0.105746     3.09    0.0053    0.107826      0.546434
Subject: 12      0.556382     0.105746     5.26    <1e-04    0.337078      0.775686
Subject: 13     -0.277867     0.105746    -2.63    0.0154   -0.49717      -0.0585627
Subject: 14     -0.690512     0.105746    -6.53    <1e-05   -0.909816     -0.471208
Subject: 15      0.0        NaN          NaN       NaN     NaN           NaN
Subject: 16     -0.0551107    0.105746    -0.52    0.6075   -0.274415      0.164193
Subject: 17     -0.324189     0.105746    -3.07    0.0057   -0.543493     -0.104885
Subject: 18     -0.237008     0.105746    -2.24    0.0354   -0.456312     -0.017704
Subject: 19     -0.461697     0.105746    -4.37    0.0002   -0.681001     -0.242393
Subject: 20     -0.234932     0.105746    -2.22    0.0369   -0.454236     -0.0156284
Subject: 21     -0.859009     0.105746    -8.12    <1e-07   -1.07831      -0.639705
Subject: 22      0.7923       0.105746     7.49    <1e-06    0.572996      1.0116
Subject: 23     -0.256151     0.105746    -2.42    0.0241   -0.475455     -0.0368475
Subject: 24      0.21707      0.105746     2.05    0.0522   -0.00223418    0.436374
────────────────────────────────────────────────────────────────────────────────────

@palday
Copy link
Member

palday commented Sep 29, 2022

This looks like two questions.

  1. Regarding your first question on nesting: the example you have works for me without error. Testing of the nesting is somewhat sensitive to floating point issues: this is what the kwarg atol is for, so you might need to set it to some nonzero value.
  2. Subject 15 has an estimate of 0 and test statistics of NaN because the model is rank deficient. dropcollinear=true pivots numerically collinear columns and sets them to zero; it does not exclude them from the coefficient table. This is in the docstring:

dropcollinear controls whether or not lm accepts a model matrix which is less-than-full rank. If true (the default), only the first of each set of linearly-dependent columns is used. The coefficient for redundant linearly dependent columns is 0.0 and all associated statistics are set to NaN.

If you want them removed from the table, then you should remove the columns from the data before fitting.

@palday palday closed this as completed Sep 29, 2022
@PharmCat
Copy link
Author

If you want them removed from the table, then you should remove the columns from the data before fitting.

Hi! I can't remove data from a table. Also if I remove Subject: 15 this problem appears for subject 14 or 12 ets...

Bioequivalence is a common task in clinical trials and now it can be done partially because we can't
get ANOVA table or F-test for this model. No problem do this in R, SAS, SPSS ets...

Main question: is it possible to avoid this for similar models:

ArgumentError: F test is only valid for nested models

If it is not possible and not in to-do, please, let's say - "yes, it is not possible", because some people have many unsuccessful attempts to do this.

@palday
Copy link
Member

palday commented Sep 29, 2022

I think we're talking past each other:

  • Your F-test MWE works for me and does not error.
  • Your example of dropcollinear does exactly what is documented, i.e. sets the coefficients corresponding to collinear terms to 0.0 and the associated standard errors to NaN.

In other words, there is no unexpected behavior. We are not going to change lm to omit the collinear terms from the output, though in a particular mathematical sense, they are omitted from the model. (Setting a coefficient to zero is equivalent to not having the term in the model.)

For the F-test, I will also note that the F-test (like the likelihood ratio test) is mathematically only well defined for nested models, so we're not going to add support for computing the F-test on non-nested models.

I think the bigger latent issue is numerical tolerance and floating point error. In the case of the F-test, the nesting detection may fail for truly nested models because of floating point error. Likewise for collinearity/rank deficiency, a model may be numerically rank deficient, even if it is not "platonically" / truly rank deficient.

For the nesting detection provided by the F-test, we provide the atol keyword argument to increase numerical tolerance.

For linear regression, there are two options. One, if you know that your model is not rank deficient, then you can set dropcollinear=false and prevent the pivoting strategy that removes numerically rank deficient columns. Alternatively, you can use the solver based on the QR factorization, instead of the solver based on the Cholesky factorization. The QR solver is slightly slower but is more numerically stable and so less sensitive to numerical error. Unfortunately, while most of the numerical infrastructure for using the QR method exists in GLM.jl, it's not exposed to the user in a convenient way. This has been discussed in a few other issues, e.g. #426.

@PharmCat
Copy link
Author

I think we're talking past each other:

  • Your F-test MWE works for me and does not error.

Sorry, I really don't understand:

For this models I have error:

olscmax  = lm(@formula(log(Cmax) ~ Sequence+Period+Formulation+Subject), ncadf)
olscmax2  = lm(@formula(log(Cmax) ~ Period+Formulation+Subject), ncadf)

ftest(olscmax.model, olscmax2.model)
ArgumentError: F test is only valid for nested models

But they are nested... or not?

using atol not effect at all...

@palday
Copy link
Member

palday commented Sep 29, 2022

Can you post your versioninfo() and using Pkg; Pkg.status() so that I can see all the relevant architecture and version details?

Note that nesting and rank deficiency is one of those problems that is well-defined in theory, but not always in practice due to the vagaries of floating point.

@PharmCat
Copy link
Author

versioninfo()

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: AMD Ryzen 9 5950X 16-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver3)
Environment:
  JULIA_EDITOR = "C:\Users\vsarn\AppData\Local\atom\app-1.60.0\atom.exe"  -a
  JULIA_NUM_THREADS = 32

julia> 
julia> using Pkg; Pkg.status()
      Status `C:\Users\vsarn\.julia\environments\v1.7\Project.toml`
  [0825541b] ANOVA v0.2.1 `https://github.com/marcpabst/ANOVA.jl.git#master`
  [1520ce14] AbstractTrees v0.3.4
  [4c88cf16] Aqua v0.5.5
  [c52e3926] Atom v0.12.38
  [6e4b80f9] BenchmarkTools v1.3.1
  [e28b5b4c] Bootstrap v2.3.3
  [2a0fbf3d] CPUSummary v0.1.25
  [336ed68f] CSV v0.9.11       
  [052768ef] CUDA v3.12.0
  [324d7699] CategoricalArrays v0.10.6
  [535c2557] ClinicalTrialUtilities v0.6.5 `https://github.com/PharmCat/ClinicalTrialUtilities.jl.git#master`
  [861a8166] Combinatorics v1.0.2
  [ae264745] Copulas v0.1.0
  [667455a9] Cubature v1.5.1        
  [717857b8] DSP v0.7.7
  [a93c6f00] DataFrames v1.3.4      
  [864edb3b] DataStructures v0.18.13
  [41ba435c] Deconvolution v1.1.1   
  [1130ab10] DiffEqParamEstim v1.26.0
  [163ba53b] DiffResults v1.0.3
  [0c46a032] DifferentialEquations v7.2.0
  [31c24e10] Distributions v0.25.68
  [e30172f5] Documenter v0.26.3
  [35a29f4d] DocumenterTools v0.1.15
  [8f5d6c58] EzXML v1.1.0
  [7a1cc6ca] FFTW v1.5.0
  [1a297f60] FillArrays v0.13.2
  [08572546] FlameGraphs v0.2.10
  [f6369f11] ForwardDiff v0.10.32
  [b18b359b] FourierTools v0.3.4
  [38e38edf] GLM v1.8.0
  [c43c736e] Genie v4.18.1
  [dcc97b0b] GeoStats v0.27.0
  [86223c79] Graphs v1.7.2
  [4c0ca9eb] Gtk v1.2.2
  [19dc6840] HCubature v1.5.0
  [cd3eb016] HTTP v0.9.17
  [09f84164] HypothesisTests v0.10.10
  [7073ff75] IJulia v1.23.3
  [86fae568] ImageView v0.11.1
  [916415d5] Images v0.25.2
  [c601a237] Interact v0.10.4
  [682c06a0] JSON v0.21.3
  [e5e0dc1b] Juno v0.8.4
  [23fbe1c1] Latexify v0.15.16
  [d3d80556] LineSearches v7.2.0
  [7ed4a6bd] LinearSolve v1.26.0
  [2fda8390] LsqFit v0.12.1
  [6ee0df7b] MLJLinearModels v0.7.0
  [ee78f7c6] Makie v0.16.6
  [a1dec852] Metida v0.14.1 `https://github.com/PharmCat/Metida.jl.git#master`
  [075456b7] MetidaBase v0.9.1 `https://github.com/PharmCat/MetidaBase.jl.git#main`
  [bd16ee1e] MetidaFreq v0.1.0 `https://github.com/PharmCat/MetidaFreq.jl.git#main`
  [097c2839] MetidaNCA v0.3.1 `https://github.com/PharmCat/MetidaNCA.jl.git#main`
  [ea555271] MetidaReports v0.1.0 `https://github.com/PharmCat/MetidaReports.jl.git#main`
  [75cdad26] MetidaStats v0.1.0 `https://github.com/PharmCat/MetidaStats.jl.git#main`
  [e1d29d7a] Missings v1.0.2
  [ff71e718] MixedModels v4.7.0
  [961ee093] ModelingToolkit v8.12.0
  [4886b29c] MonteCarloIntegration v0.0.3
  [ffc61752] Mustache v1.0.14
  [37188c8d] MvNormalCDF v0.2.5
  [76087f3c] NLopt v0.6.5
  [2456a17b] ODMXMLTools v0.2.1 `https://github.com/PharmCat/ODMXMLTools.jl.git#main`
  [429524aa] Optim v1.7.2
  [90014a1f] PDMats v0.11.16
  [e4faabce] PProf v2.2.0
  [f853b5e0] Pandoc v0.2.5
  [14b8a8f1] PkgTemplates v0.7.29
  [a03496cd] PlotlyBase v0.8.18
  [91a5bcdd] Plots v1.31.7
  [c3e4b0f8] Pluto v0.19.9
  [08abe8d2] PrettyTables v1.3.1
  [27ebfcd6] Primes v0.5.3
  [c46f51b8] ProfileView v1.5.1
  [92933f4c] ProgressMeter v1.7.2
  [d330b81b] PyPlot v2.11.0
  [1fd47b50] QuadGK v2.4.2
  [67601950] Quadrature v2.1.0
  [8a4e6c94] QuasiMonteCarlo v0.2.9
  [1a8c2f83] Query v1.0.0
  [6f49c342] RCall v0.13.13
  [ce6b1742] RDatasets v0.7.7
  [3cdcf5f2] RecipesBase v1.2.1
  [189a3867] Reexport v1.2.2
  [295af30f] Revise v3.4.0
  [f2b01f46] Roots v2.0.2
  [47aef6b3] SimpleWeightedGraphs v1.2.1
  [aa65fe97] SnoopCompile v2.9.2
  [276daf66] SpecialFunctions v1.8.7
  [860ef19b] StableRNGs v1.0.0
  [90137ffa] StaticArrays v1.5.6
  [2913bbd2] StatsBase v0.33.21
  [3eaba693] StatsModels v0.6.31
  [f3b207a7] StatsPlots v0.15.1
  [ec984513] StipplePlotly v0.12.4
  [bd369af6] Tables v1.7.0
  [a759f4b9] TimerOutputs v0.5.21
  [9d95f2ec] TypedTables v1.4.0
  [1986cc42] Unitful v1.11.0
  [239c3e63] Vega v2.3.1
  [44d3d7a6] Weave v0.10.10
  [0f1e0344] WebIO v0.8.18
  [fdbf4ff8] XLSX v0.8.2
  [ade2ca70] Dates
  [de0858da] Printf
  [9abbd945] Profile
  [9a3f8284] Random
  [10745b16] Statistics

julia> 

@PharmCat
Copy link
Author

Note that nesting and rank deficiency is one of those problems that is well-defined in theory, but not always in practice due to the vagaries of floating point.

Yes, I understand :) and want to explore this problem because I try to make a "recipe" for students how to switch from R to Julia in common tasks for bioequivalence and any other crossover design trial.

I think that this model matrix not make quite correctly because we have Subject: 15, but fine, other packages (in R or in SPSS) do the same. But when we remove Sequence factor after that model matrix became full-rank. But this, I think, should not be a problem for ftest.

@PharmCat
Copy link
Author

Note that nesting and rank deficiency is one of those problems that is well-defined in theory, but not always in practice due to the vagaries of floating point.

May be this can help: https://github.com/PharmCat/edu/blob/main/ipynb/bioequivalence.ipynb

@palday
Copy link
Member

palday commented Sep 30, 2022

The problem is definitely numerical instability. I have now tested this example on:

  • Julia 1.7 with OpenBLAS on Linux x64 (11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz)
  • Julia 1.8 with OpenBLAS on Linux x64 (ibid)
  • Julia 1.7 with MKL on Linux x64 (ibid)
  • Julia 1.8 with MKL on Linux x64 (ibid)
  • Julia 1.8 on Apple M1

and am unable to replicate. I strongly suspect that some of the SIMD instructions on the Ryzen are causing this problem -- we've seen similar issues with AVX instructions on some Intel chips.

If you set dropcollinear=false, then you avoid the pivoting issues causing problems with Subject 15. This should also fix the the nesting check for the F-test. Setting some of the coefficients equal to zero (as the pivoting for collinearity does) reduces the column span of the model, which causes it to fail the nesting check.

@PharmCat
Copy link
Author

dropcollinear=false

In this case results is definitly wrong:

olscmax_noseq  = lm(@formula(log(Cmax) ~ Period+Formulation+Subject), df; dropcollinear=false)

olscmax  = lm(@formula(log(Cmax) ~ Sequence+Period+Formulation+Subject), df; dropcollinear=false)

ftest(olscmax.model, olscmax_noseq.model)
F-test: 2 models fitted on 36 observations
──────────────────────────────────────────────────────────────
     DOF  ΔDOF     SSR    ΔSSR      R²     ΔR²      F*   p(>F)
──────────────────────────────────────────────────────────────
[1]   22        0.1023          0.9779                        
[2]   21    -1  0.1023  0.0000  0.9779  0.0000  0.0000  1.0000
──────────────────────────────────────────────────────────────

Also for this model I have (same on Intel processor: Intel(R) Core(TM) i7-7600U):

olscmax_seq   = lm(@formula(log(Cmax) ~ Subject + Sequence), df; dropcollinear=false)
PosDefException: matrix is not positive definite; Cholesky factorization failed.

Stacktrace:
  [1] checkpositivedefinite
    @ C:\Users\vsarn\AppData\Local\Programs\Julia-1.7.3\share\julia\stdlib\v1.7\LinearAlgebra\src\factorization.jl:18 [inlined]
  [2] cholesky!(A::LinearAlgebra.Hermitian{Float64, Matrix{Float64}}, ::Val{false}; check::Bool)
    @ LinearAlgebra C:\Users\vsarn\AppData\Local\Programs\Julia-1.7.3\share\julia\stdlib\v1.7\LinearAlgebra\src\cholesky.jl:266
  [3] cholesky! (repeats 2 times)
    @ C:\Users\vsarn\AppData\Local\Programs\Julia-1.7.3\share\julia\stdlib\v1.7\LinearAlgebra\src\cholesky.jl:265 [inlined]
  [4] GLM.DensePredChol(X::Matrix{Float64}, pivot::Bool)
    @ GLM C:\Users\vsarn\.julia\packages\GLM\P0Ris\src\linpred.jl:107
  [5] cholpred
    @ C:\Users\vsarn\.julia\packages\GLM\P0Ris\src\linpred.jl:117 [inlined]
  [6] #fit#4
    @ C:\Users\vsarn\.julia\packages\GLM\P0Ris\src\lm.jl:170 [inlined]
  [7] fit(::Type{LinearModel}, f::FormulaTerm{FunctionTerm{typeof(log), var"#105#106", (:Cmax,)}, Tuple{Term, Term}}, data::DataFrame, args::Nothing; contrasts::Dict{Symbol, Any}, kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:dropcollinear,), Tuple{Bool}}})
    @ StatsModels C:\Users\vsarn\.julia\packages\StatsModels\s9GMM\src\statsmodel.jl:88
  [8] #lm#5
    @ C:\Users\vsarn\.julia\packages\GLM\P0Ris\src\lm.jl:184 [inlined]
  [9] top-level scope
    @ In[26]:1
 [10] eval
    @ .\boot.jl:373 [inlined]
 [11] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base .\loading.jl:1196

for Intel make versioninfo() 10 min later...

@PharmCat
Copy link
Author

dropcollinear=false

using CategoricalArrays, DataFrames, CSV, GLM


df = download("https://raw.githubusercontent.com/PharmCat/edu/main/csv/be.csv") |> CSV.File |> DataFrame

transform!(df, :Subject => categorical, renamecols=false)
transform!(df, :Period => categorical, renamecols=false)
transform!(df, :Sequence => categorical, renamecols=false)
transform!(df, :Formulation => categorical, renamecols=false)

olscmax_null  = lm(@formula(log(Cmax) ~ Subject), df)
olscmax_form  = lm(@formula(log(Cmax) ~ Subject + Formulation), df)
olscmax_noseq  = lm(@formula(log(Cmax) ~ Period+Formulation+Subject), df)
olscmax  = lm(@formula(log(Cmax) ~ Sequence+Period+Formulation+Subject), df)

ftest(olscmax.model, olscmax_noseq.model)
ArgumentError: F test is only valid for nested models

Stacktrace:
 [1] ftest(::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, ::Vararg{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, N} where N; atol::Float64)
   @ GLM C:\Users\vsarn\.julia\packages\GLM\P0Ris\src\ftest.jl:141
 [2] ftest(::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, ::LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, ::Vararg{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}}}}, N} where N)
   @ GLM C:\Users\vsarn\.julia\packages\GLM\P0Ris\src\ftest.jl:133
 [3] top-level scope
   @ In[13]:1
 [4] eval
   @ .\boot.jl:360 [inlined]
 [5] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
   @ Base .\loading.jl:1116
julia> versioninfo()
Julia Version 1.6.6
Commit b8708f954a (2022-03-28 07:17 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-7600U CPU @ 2.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

@nalimilan
Copy link
Member

I think we should switch to using the QR decomposition by default in GLM.jl 2.0 as it seems to be more reliable.

@palday If a predictor was dropped due to collinearity, in theory it adds no information at all, so we could consider that models are nested if their predictors are equal, including the dropped one even if its coefficient is 0?

We could also add an argument to disable the nesting check, for people to use at their own risk.

@nalimilan nalimilan reopened this Oct 7, 2022
@palday
Copy link
Member

palday commented Oct 7, 2022

@nalimilan completely agree on the QR decomposition. I wanted to mock up how to do it for this example, but it was actually not nearly as polished as the Cholesky stuff and I had more pressing commitments.

Regarding the dropped predictor ... it depends! Concretely, the rank deficiency/collinearity detection drops the predictor before any fitting happens and so actually fits the reduced model and just re-inserts the pivoted coefficients as zeros. So in a certain sense, the reduced model is the one being compared for nesting purposes. I mostly agree with this -- the full, unpivoted model isn't well-defined because one of the assumptions of OLS is that the model is of full column rank, so instead pivoting fits the closest well-defined model and inserts zeroed coefficients to highlight how to obtain this model from the originally specified model. The difficulty is that the theoretical and numerical assessments of "full column rank" can differ.

There is another way to view this: dropping predictors effectively reduces the model degrees of freedom by the number of dropped predictors. Now when you consider the a likelihood-ratio or F-test, how does this impact the (denominator) degrees of freedom for the comparison?

We could also add an argument to disable the nesting check, for people to use at their own risk.

Yes, and we could do this in a minor release. But I don't know if that will really solve the problem here since there appears to be other numerical issues before that point.

@PharmCat
Copy link
Author

PharmCat commented Oct 7, 2022

Now when you consider the a likelihood-ratio or F-test, how does this impact the (denominator) degrees of freedom for the comparison?

I think for current case DOF will be correct.

PS Example above is a basic crossover case. As factor subject "inside" sequence StatsModels can't handle this case to make full-rank matrix. I talk about this here: https://discourse.julialang.org/t/should-model-matrix-for-nested-factors-be-full-rank/84733 , but all I understand, that now is impossible to make appropiate coding for this. So StatsModels make not full-rank matrix for this model, then GLM can't make appropriate F-test if doesn't drop redundant columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants