Missing values and weighting #88
Comments
Thanks for your hard work on this. I know what a tricky PR it is, I really think something like
which passes
that seems like a reasonable solution. Of course, that doesn't work for something like
but tbh I think that having weights as a separate vector from I'm glad you didn't propose dispatch based on As for |
Ah, yes, wrapping the function rather than the data is indeed an interesting solution to allow passing weights while skipping missing values. One advantage would be that The downside I see is that
Are you sure? I'm not saying we should keep dispatch based on |
Ah sorry, yes. But currently |
We've discussed
AFAICT the number of functions in that case in StatsBase is relatively limited:
So overall maybe Another corner case to consider is in-place functions like |
We could also do |
Let me add one more thing. When do we need something like To answer this note the following:
This means that we need
Now the challenge is that general Tables.jl tables will not be able to support As an additional note some functions will still need a special treatment of missings, e.g. Having said all that I have the following thought:
@nalimilan - I know I am going back to what we already discussed, but I really feel that most typical users will have a much easier time learning to use |
I think the main argument against a Conversely, one of the advantages of |
Let me ask in #data and #statistics. |
What statistics functions work on Tables.jl objects which don't already have an internal handling of |
My comment is that in general there is an expectation that the functions provided could accept Tables.jl objects (and the discussion of how to do that has not happened yet). |
It's great to see Julia's evolution in the space of missingness handling, and the thoughtful discussions that underlie it. I have a thought I'd like to get some of your expert feedback on, and I'm hoping this is the right place for it. I consider it one of Julia's biggest data-analytic advantages over its competitors that Julia functions respect missingness rather than hiding it. But as @nalimilan writes,
there is a whole class of important functions that ignore missing data. When I'm using these functions, I'm always nervous I'll make a mistake about which rows I'm using. In my applications, values are not "missing completely at random", and assuming they are will lead me to incorrect analytic conclusions. An analysis that drops rows missing e.g. income values may overlook a significant relationship if the missingness is correlated with other features of the data. Furthermore, even if the top-line results are fine, I like to attend to missingness even in intermediate steps like model selection. As Richard McElreath puts it in Statistical Rethinking,
Getting missing values in results or error messages will alert me that I need to be concerned about missingness in my dataset. There are many possibilities for how to be explicit about it when that is desired, e.g. I would love to hear how you think about this issue and how I might operate under the next statistics API. |
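To make the point about explicitness concrete, here is a minimal sketch using only Base and the Statistics standard library. The `income` vector is made-up data for illustration. It shows that a reduction propagates `missing` unless the caller opts out with `skipmissing`, so the decision to drop values stays visible in the code.

```julia
using Statistics

# Hypothetical data: one income value was not recorded.
income = [52_000, missing, 61_000, 47_000]

# Reductions propagate missingness instead of hiding it.
m_all = mean(income)

# Skipping has to be requested explicitly by the caller.
m_obs = mean(skipmissing(income))

# Counting the missing entries makes it clear how many rows are affected.
n_missing = count(ismissing, income)
```

Here `m_all` is `missing`, which is the alert the comment above asks for, while `m_obs` is computed from the three observed values only.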
Yes, we should probably not silently drop rows with missing values when fitting models, as that is inconsistent with the rest of Julia. (Note that none of this is defined in Statistics. That's really a separate issue IMO since it cannot be handled using |
Yes, the issue is that how and which missings should be dropped is information specific to the analytical procedure, and it has to be handled internally by that procedure. But this has a consequence for the general design: maybe in the end we need to accept that there will be a |
In an ideal world, this goes in the model. The ideal model is something like
But that's something for the very long term and dedicated funding. |
We currently have an efficient and consistent solution to skip missing values for unweighted single-argument functions via `f(skipmissing(x))`. For multiple-argument functions like `cor` we don't have a great solution yet (https://github.com/JuliaLang/Statistics.jl/pull/34). Another case where we don't have a good solution is weighted functions, which are not currently in Statistics but should be imported from StatsBase (https://github.com/JuliaLang/Statistics.jl/issues/87).

A reasonable solution would be to use `f(skipmissing(x), weights=w)`, with a typical definition being:

That is, we would assume that weights refer to the original vector so that we skip those corresponding to missing entries. This is admittedly a bit weird in terms of implementation as weights are not wrapped in `skipmissing`. A wrapper like `skipmissing(weighted(x, w))` (inspired by what was proposed at JuliaLang/julia#33310) would be cleaner in that regard. But that would still be quite ad-hoc, as `skipmissing` currently only accepts collections (and `weighted` cannot be one since it's not just about multiplying weights and values), and the resulting object would basically be only used for dispatch without implementing any common methods.

The generalization to multiple-argument functions poses the same challenges as `cor`. For these, the simplest solution would be to use a `skipmissing` keyword argument, a bit like `pairwise`. Again, the alternative would be to use wrappers like `skipmissing(weighted(w, x, y))`.

Overall, the problem is that we have conflicting goals:
- `f(skipmissing(x))`
- `f(skipmissing(x))` vs `f(skipmissing(x), weights=w)`, or `f(skipmissing(x))` vs `f(skipmissing(weighted(x, w)))`, or `f(x, skipmissing=true)` vs `f(x, skipmissing=true, weights=w)`
- `f(skipmissing(x))` vs `f(skipmissing(x, y))`, or `f(x, skipmissing=true)` vs `f(x, y, skipmissing=true)`
- `mean`) and complex functions operating on whole tables (like `fit(MODEL, ..., data=df, weights=w)` and which skip missing values by default)
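As a rough illustration of two of the call forms discussed in the issue text, the sketch below shows `f(skipmissing(x), weights=w)` for a weighted mean and a complete-case correlation for two vectors. The names `wmean` and `cor_complete_cases` are hypothetical and are not part of Statistics or StatsBase. The sketch assumes that weights are indexed like the original vector, as the issue suggests, and it relies on `keys` and indexing being supported for `skipmissing` iterators over arrays.

```julia
using Statistics

# Hypothetical helper: weighted mean over the non-missing entries.
# `weights` is assumed to have the same indices as the wrapped vector.
function wmean(s::Base.SkipMissing; weights::AbstractVector{<:Real})
    num = 0.0
    den = 0.0
    for i in keys(s)          # indices of the non-missing entries only
        num += weights[i] * s[i]
        den += weights[i]     # the weight is dropped together with its value
    end
    return num / den
end

# Hypothetical helper: correlation using only the pairs where both values are present.
function cor_complete_cases(x::AbstractVector, y::AbstractVector)
    length(x) == length(y) || throw(DimensionMismatch("x and y must have the same length"))
    keep = [!ismissing(a) && !ismissing(b) for (a, b) in zip(x, y)]
    return cor(Float64.(x[keep]), Float64.(y[keep]))
end

x = [1.0, missing, 3.0, 4.0]
w = [1, 2, 1, 2]
wm = wmean(skipmissing(x), weights=w)

u = [1.0, 2.0, missing, 4.0]
v = [2.0, missing, 6.0, 8.0]
r = cor_complete_cases(u, v)
```

In the first call the weight `2` attached to the missing entry is ignored, which is the "weights refer to the original vector" convention. The second helper drops a pair as soon as either value is missing, which a single-argument `skipmissing` wrapper cannot express on its own.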