Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate missing values from data when returning vectors #546

Open
nalimilan opened this issue Aug 24, 2023 · 7 comments
Open

Propagate missing values from data when returning vectors #546

nalimilan opened this issue Aug 24, 2023 · 7 comments
Milestone

Comments

@nalimilan
Copy link
Member

When a model was fitted on data containing missing values which have been dropped, predict(model), residuals(model), fitted(model) and response(model) should return a vector of the same length as the data with missing for corresponding entries. It would make sense to change this before releasing GLM 2.0, even if a more general fix is implemented later in StatsModels (JuliaStats/StatsModels.jl#202).

See JuliaStats/MixedModels.jl#650 for a similar issue in MixedModels.

@nalimilan nalimilan added this to the 2.0 milestone Aug 24, 2023
@ararslan
Copy link
Member

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

@ararslan
Copy link
Member

This also means that response(model) et al. won't have length nobs(model) (for unweighted models).

@bkamins
Copy link
Contributor

bkamins commented Aug 27, 2023

A related issue (if a separate issue should be opened for this please let me know, but I think this should be revolved along with the original issue with a consistent approach):

julia> lm(@formula(x1~lag(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

(the issue is that, if I understand things correctly, lagging is applied after dropping of rows with missing is performed)

More generally e.g.:

julia> f(x) = x > 0.5 ? 1 : missing
f (generic function with 1 method)

julia> lm(@formula(x1~f(x2)), df)
ERROR: MethodError: Cannot `convert` an object of type Missing to an object of type Float64

@nalimilan
Copy link
Member Author

I think that's JuliaStats/StatsModels.jl#202. It's a bit harder to fix.

@nalimilan
Copy link
Member Author

Would we do the same for modelmatrix? If not, the dimensions of response and modelmatrix would disagree, which seems pretty awful. If so, crossmodelmatrix is modelmatrix(model)' * modelmatrix(model), which becomes pretty useless if modelmatrix includes missing values...

Good point. We could say that functions related to the fitting procedure, like modelmatrix and crossmodelmatrix never include missing values. But I admit that's a bit ad hoc, but that's what R does when using na.exclude (i.e. only pad residuals and predicted values with missing values).

@bkamins
Copy link
Contributor

bkamins commented Aug 27, 2023

It's a bit harder to fix.

I agree that it is probably harder, but my point was that we should keep also this case in mind when developing a solution (as it might be affected by that requirement).

@ararslan
Copy link
Member

functions related to the fitting procedure

That arguably includes response, residuals, and fitted... probably everything but predict. "Related to the fitting procedure" may also depend on the kind of model and wouldn't necessarily be the same between GLM and another statistical modeling package.

IMO the current behavior that these functions return the objects/quantities as actually used by the model is the right one, and the case of including any missing values that were present in the input is better handled separately. I actually think that the existing predict(model, data), where data is the original input that includes missing values, is a sensible solution to getting back fitted values that include missing: you're asking for the model's output when given that input, which is inherently different from what the model used for fitting (in the presence of missing values).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants