Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Handle missing values #153

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from
Draft

[WIP] Handle missing values #153

wants to merge 8 commits into from

Conversation

palday
Copy link
Member

@palday palday commented Sep 24, 2019

When done, this will close #145 and potentially address #17 .

  • Handle missing in continuous variables.
  • Handle missing in categorical variables.

@codecov-io
Copy link

codecov-io commented Sep 24, 2019

Codecov Report

Merging #153 into master will decrease coverage by 0.1%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master    #153      +/-   ##
=========================================
- Coverage    83.6%   83.5%   -0.11%     
=========================================
  Files           9       9              
  Lines         494     497       +3     
=========================================
+ Hits          413     415       +2     
- Misses         81      82       +1
Impacted Files Coverage Δ
src/schema.jl 81.08% <100%> (+0.17%) ⬆️
src/terms.jl 83.76% <50%> (-0.59%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d03b5c2...a799919. Read the comment docs.

@kleinschmidt
Copy link
Member

This generally looks good but needs tests. I think the problem I ran into was that copy(::Missing) isn't defined, but modelcols(::ContinuousTerm, ...) uses copy on teh elements of the underlying vector, instead of just writing the elements into a new vector. The idea was that maybe someone has a custom type that is mutable but they define it as continuous (for instance, Survival.jl uses a custom EventTime struct, which isn't mutable but still it's not a stretch), but maybe that's being too defensive.

@palday
Copy link
Member Author

palday commented Sep 25, 2019

Yeah, that's about as far as I had gotten as well when I ran out of steam late local time. deepcopy(::Missing) is defined, and given that missing is a singleton, having copy(m::Missing) = deepcopy(m) seems like a reasonable definition, but probably one that doesn't belong here but rather in Base.

# layer of indirection
function copy end
copy(x::Any) = Base.copy(x)
copy(m::Missing) = m
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this feels like a bad idea. Maybe better to just drop the use of copy (and broadcast identity or use copy on the vector).

@palday
Copy link
Member Author

palday commented Sep 25, 2019

Ignore the horrible hacking of getindex -- that will not stay.

@@ -176,6 +176,8 @@ concrete_term(t::Term, d) = concrete_term(t, d, nothing)
concrete_term(t, d, hint) = t

concrete_term(t::Term, xs::AbstractVector{<:Number}, ::Nothing) = concrete_term(t, xs, ContinuousTerm)
# and for missing values
concrete_term(t::Term, xs::AbstractVector{Union{Missing,T}} where T<:Number, ::Nothing) = concrete_term(t, xs, ContinuousTerm)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't those lines be

concrete_term(t::Term, xs::AbstractVector{<:Union{<:Number,<:Union{Missing,<:Number}}}) =
    concrete_term(t, xs, ContinuousTerm)

I don't think we need/should specialize.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't just having AbstractVector{<:Union{Missing, <:Number}} work?

julia> [1,2] isa AbstractVector{<:Union{Missing,<:Number}}
true

julia> [1,2, missing] isa AbstractVector{<:Union{Missing,<:Number}}
true

@kleinschmidt kleinschmidt added this to the 0.7 milestone Dec 18, 2020
@greimel
Copy link

greimel commented Jun 30, 2022

It would be nice if the following

julia> using DataFrames, StatsModels

julia> df = DataFrame(t=1:5, x=(1:5).*2.0, y=1:5.0, z=[12; missing; 16.:2:20])

julia> f = @formula(y ~ x + lag(x) + z + lag(z))

gave this outcome

julia> df_hand0 = transform(df, :z => lag => :z_lagged, :x => lag => :x_lagged)
5×6 DataFrame
 Row │ t      x        y        z          z_lagged   x_lagged  
     │ Int64  Float64  Float64  Float64?   Float64?   Float64?  
─────┼──────────────────────────────────────────────────────────
   11      2.0      1.0       12.0  missing    missing   
   22      4.0      2.0  missing         12.0        2.0
   33      6.0      3.0       16.0  missing          4.0
   44      8.0      4.0       18.0       16.0        6.0
   55     10.0      5.0       20.0       18.0        8.0

julia> df_hand = df_hand0[completecases(df_hand0),:] |> disallowmissing!

julia> f_hand = @formula(y ~ x + x_lagged + z + z_lagged)

julia> ff_hand = apply_schema(f_hand, schema(f_hand, df_hand))

julia> modelmatrix(ff_hand, df_hand)
2×4 Matrix{Float64}:
  8.0  6.0  18.0  16.0
 10.0  8.0  20.0  18.0

(If you happen to make this work in this PR, feel free to turn this code into a test.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect Default Coding of Union{Missing, Int or Float64}
5 participants