Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Don't allow NA inside indices #39

Closed
johnmyleswhite opened this issue Dec 23, 2013 · 32 comments · Fixed by #104
Closed

Don't allow NA inside indices #39

johnmyleswhite opened this issue Dec 23, 2013 · 32 comments · Fixed by #104

Comments

@johnmyleswhite
Copy link
Member

As discussed in #38, any indexing operation that uses an NA index should fail. Right now, we simply drop NA's indices, but it's safer if you know that leaving NA's in your indices will always fail.

@nalimilan
Copy link
Member

Coming a little late in the conversation, but I don't think throwing an error when the index contains NAs is the best solution. While it may sound safer, it's neither logical nor practical.

Logical: returning NA when the index is NA makes much sense: if the index is NA, this means that you do not know whether the observation satisfies a condition or not, thus that the resulting value is also missing. This is just the logical application of the NA propagation principle. Moreover, if people have to call dropna() and NAs from the index are not present in the result, they cannot know in the resulting variable what is the proportion of missing values within the complete initial sample.

Concretely, this also means that either you get an error, or you remove NAs with dropna(), and you get a vector whose length is different from the original one. There's no way to get a vector of the original length, as is needed in the use case raised by @simonster at #38 (comment). At least, with R's behavior, you have the choice to get NAs in the result, or to skip them (by using which(), which has a more explicit equivalent in Julia with dropna()).

Finally, regarding practical matters, in the real world, data always contains NAs, which means that you'd have to clutter the code with dropna() every time you index a DataFrame column with a condition on another column. You get much cleaner code by returning NAs where the index is NA, and let the functions analyzing data return NA where NAs are present. If NAs where not supposed to be present, people will notice them at that point if they did not check the data before that. This is were NAs will require explicit treatment.

(BTW, one could imagine supporting different NA payloads in the future, like NumPy describes at https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#future-expansion-to-multi-na-payloads. Then, NAs introduced via indexing could be given a special payload, and e.g. a frequency table would be able to tell you directly how many observations are missing because the variables determining their inclusion in the subsample were missing, and how many have the variable of interest itself missing.)

@nalimilan
Copy link
Member

Also something to keep in mind: contrary to most other languages, Julia provides basic types for when you do not expect NAs to be present in the data. DataArrays should only be used by people in code where they know NAs are (or may be) present. So it is justified to adapt the behavior of DataArrays to this situation, and not throw an error.

@johnmyleswhite
Copy link
Member Author

Thanks for pushing back, @nalimilan. I think we need to have more debate, since your points make me hesitate to make this change, but they also don't fully convince me that our current behavior is right.

Regarding the logical consequence of the NA rule, consider the following example:

x <- c(1, 2, 3, 4)
inds <- c(3, NA)
x[inds]

To me, this case is not properly an application of the NA propagation rule that I would like us to support because it conflates two types of uncertainty: (1) uncertainty about the values in x and (2) uncertainty about the indices that you requested through y. I think NA is only really mean to apply to Case (1), which is an intrinsic property of the data that cannot be resolved by stating your intentions without ambiguity. In contrast, Case (2) is a property of the statements made by the programmer and not a property of the data. If you use ambiguous language, it's perfectly reasonable for a Julia library to ask you to clarify what you meant.

Indexing with Booleans seems quite different than indexing with numbers, because it's not a question of what you're asking for, but only whether you're asking for each da[i]. In that sense, it seems acceptable to resolve ambiguity in favor of assuming that NA really meant false.

So I'm leaning to the principle that it's ok for indexing operations to use array(inds, replacement), but not for them to use dropna(inds). There's no sane replacement for numeric indexing, so we end up hallucinating a value that's not even in the underlying data. That seems risky to me, because it means you can't trust the output of statements like isna(da[inds]) to actually reflect properties of da: they're as likely to reflect properties of inds.

@nalimilan
Copy link
Member

Yeah, there are two different types of uncertainty (that's why I raised the possibility of using different payloads), but nothing allows you to consider that NA should only represent one type and not the other. isna(da[inds]) then reflects whether there is uncertainty on both the variable of interest or the definition of the subset.

I don't think you can make a distinction between integer indexes and booleans, most importantly because it would be confusing, which is IMHO the main issue to avoid with NAs. And secondarily because it does not make sense: if NAs are present in the boolean, this means you do not/cannot know whether you are asking for the value of not. That's really the same problem. (In the real world we often build a subsample considering NAs as false, and only take care of missing values in the variable of interest, but that's a lack of rigor. ;-)

That said, I think we should consider practical issues rather than theoretical considerations. We should review the different cases where we expect this problem to arise, and see what are the code patterns that we would recommend given the behavior we retain.

One of the examples where throwing an error would be a problem, is when in R you are recoding data.frames using constructs such as df$A[df$B > 1] = "X". I think that this kind of syntax is not very nice and that it could mostly go away in Julia in favor of if... else in loops and some syntactic sugar (JuliaData/DataFrames.jl#369). But we have to check that in no occasion it would be needed, for example when somebody is working with individual DataArrays rather than a DataFrame; or maybe as a short way to recode data in some cases.

The major situation where throwing an error is a no-go is when extracting a subset of a DataFrame's variable, such as df[:(A .> 1), "B"]. How would you handle this case?

Maybe it would be useful to get the opinion of people who have considered this problem before. We could ask Patrick Burns, the author of "The R Inferno" [1], about the mistakes not to repeat. NumPy's authors would probably have good ideas too (AFAICT their design document does not detail the choice of what to do with NA indexes).

1: http://www.burns-stat.com/documents/books/the-r-inferno/

@johnmyleswhite
Copy link
Member Author

Agreed that it would be good to get input from more people than the two of us. As we discuss this, I'm increasingly shifting towards a stricter stance on using NA as an index. I've just been told that NumPY takes the stricter attitude I'm about to propose.

To resolve our debate, I'd like to propose that we impose a simple bright line here that will simplify lots of future design decisions regarding DataArrays: any operation on an AbstractDataArray, except for a small whitelist of arithmetic operations, should fail when encountering NA unless the user takes explicit action to deal with the NA values. We already agreed to follow this rule for things like mean, but I'm starting to think that we should use it for indexing as well. Dropping NA should require the user to explicitly assert that NA values are missing completely at random. Doing this automagically is convenient for the analysis of experimental data, but it seems very problematic when working with observational data.

As I think about it more, the two ways that we've proposed dealing with NA in indices strike me deeply problematic for any project that's not asking a subset of safe questions about experimental data. In the boolean indexing case, you discard data that's potentially relevant and can produce conclusions that are totally unjustified by your data. We should ask that users make an explicit choice to do this every time by writing out things like dv[array(dv .> 1, false)]. It's less convenient, but it's also much more correct. Every time that correctness and convenience conflict, I think we should prefer correctness.

In the numeric indexing case, I continue to disagree with the proposal that we should hallucinate illusory NA values that aren't present in the data. What value is there is misleading the user about the structure of their data?

Looking more into R's approach, I think the design decisions taken in S and R in this regard are very poorly thought out. Consider, for example, this strange example of hallucinating NA values in R:

x <- c(1, 2, 3, 4)
x[NA]
# [1] NA NA NA NA

@nalimilan
Copy link
Member

I'm not particularly attached to R's behavior. But I think that correctness is just a word in many cases, just like adding confirmation dialogs is just security theater when it comes to protecting computer users from themselves. Such correctness will only protect people who known they will not get NAs in their data. Others will skip NAs using whatever way they can find if they didn't care in the first place.

Let me explain my point of view; maybe I have a different approach to NAs than others because of the field I work in. In the different surveys I've used (with thousands of individuals), I don't think there's a single variable (except maybe the year of survey or things like that) where no NA is present. So I basically need to go over all the "correctness" barriers in the API -- in the end, it's just noise to me, it doesn't add anything except making the code ugly. When you work with large survey data, the important information is not "is there at least one NA as opposed to none", it's more "what's the share of NAs", and even more profoundly "what's the profile of individuals with NAs and does it seem random". This can only be done by hand, or provided by verbose functions designed for interactive use.

In this approach, the main issue is to make it easy/clean to skip NAs in all cases. Given Julia's distinction between Arrays and DataArrays, it would even make sense to skip NAs silently. If you want to be sure NAs are not skipped, use failna() and work with standard arrays after that, not with DataArrays. Of course I don't want to impose this behavior on other, but maybe there should be two subtypes of AbstractDataArray, one with IGNORE semantics, and one with NA semantics (using NumPy's terms from https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#id4) by default; or an option to choose which one should apply to a particular DataFrame or DataArray. As another NumPy page states:

Several people have suggested that there should be a single system that has multiple missing values that each have different semantics, e.g., a MISSING value that has NA semantics, and a separate IGNORED value that has ignored semantics.

This is how e.g. SAS, SPSS and Stata handle missing values by default.

@johnmyleswhite
Copy link
Member Author

Ok. I've never worked with SAS, SPSS or Stata. Your proposal seems totally reasonable, but also adds a lot of work.

Will write more later. My plane is landing at SFO now.

@nalimilan
Copy link
Member

Work is something Julia people seem to be good at. ;-)

Seriously, the hardest part is finding the correct design for the different use cases we can identify. With the current code in DataArrays most of the support seems to be present already.

@johnmyleswhite
Copy link
Member Author

@nalimilan, reading your comments again, I'm a little confused as to why we would need two separate data structures to handle the ignore / missing semantics. Functions like mean(da, skipna = true/false) seem to capture the important parts of those semantics. What's missing?

We're definitely not going to make skipna = true the default for mean(DataArray). Just because DataArray's can contain missing data doesn't mean that we can simply ignore missing values by default. When you ask for mean(Array) you know that there's no room for debate about what will be calculated. When you ask for mean(DataArray), there's a lot of room for debate. Because there's ambiguity about what's appropriate, we should default to giving the safest possible answer, which is either returning NA or raising an error. Because returning NA sometimes breaks type inference, it seems much better to always raise an error.

@nalimilan
Copy link
Member

I was suggesting that there could be one way to choose that by default a given DataArray should have skipna = true, in addition to the current skipna = false. The rationale is that, as I said, when working with survey data, you always pass skipna = true since there always are NAs, which means it no longer brings any safety benefit, just verbosity.

While in the case of mean() typing mean(da, true) is not a big deal, having two different default behaviors would allow solving the problem of NAs in indices: one could throw an error, and the other ignore them silently (unless skipna is explicit set). Type inference would then work. I'm saying "survey data", but from the NumPy account of the MISSING vs. IGNORE use cases, I understand that other fields would be happy to have such a behavior.

Instead of types, one could also imagine having a special field indicating what should be done with NAs.

@johnmyleswhite
Copy link
Member Author

Including a field that always skips NA seems ok to me. It adds some troubling non-locality, but we're also thinking of doing that with orderings for PooledDataArray's, so it's hard to fault.

@johnmyleswhite
Copy link
Member Author

Any thoughts on this issue from @HarlanH, @tshort, @simonster, @ViralBShah, @StefanKarpinski, @kmsquire?

@HarlanH
Copy link

HarlanH commented Jan 7, 2014

Tricky issue. I sorta like the approach of adding a metadata field to DataArrays that indicate how to handle NAs, defaulting to "throw an error." I also like the rule of thumb that says that indexing should not create NAs that weren't there. It's not clear that the preferences for one apply to the other, though, in general. Maybe the metadata field should only apply to reducing operations like mean, and not to indexing? And you should have to use array or whatever to do what use to be called replaceNA when indexing? I'm not sure.

@johnmyleswhite
Copy link
Member Author

For me, the main trouble with indexing is that it's not a function call without function syntax, so we can't just add a keyword argument.

@tshort
Copy link

tshort commented Jan 7, 2014

For now, I'm in favor of throwing an error.

That said, it'd be interesting to see someone write a LooseDataArray or modify DataArray with metadata like Harlan suggests to see how it works in practice.

@HarlanH
Copy link

HarlanH commented Jan 7, 2014

John, yeah. It's a shame that the iterator-based functional-programming syntax, which I still really like, wasn't faster. Maybe if dropna or replacena did a shallow copy of the DataArray, but with different metadata? That might be more efficient. Then you'd have mean(skipna(da)) and x[skipna(da)], instead of mean(da, skipna=true). Having a consistent approach seems worthwhile.

This is the sort of thing where mixins/multiple inheritance shines...

@nalimilan
Copy link
Member

@HarlanH I guess this will be easy to do with array views, and should be implemented. But for people who work all day with NAs everywhere, this still adds a lot of noise for no gain. So experimenting with a way to make skipping NA the default for some DAs would make Julia quite attractive.

@HarlanH
Copy link

HarlanH commented Jan 7, 2014

@nalimilan , I completely agree. Would the Julian syntax then be skipna!(da) to set the flag once?

@nalimilan
Copy link
Member

@HarlanH Yeah, something like that. Or maybe skipna!(da, true) or naaction!(da, :skipna) (the latter if we want to allow more actions, like returning NA rather than throwing, maybe that's not needed). An equivalent for setting this for some or all columns of a DataFrame would be useful too.

@StefanKarpinski
Copy link
Contributor

I have a really hard time seeing what indexing with an NA should do if not throw an error. In general, I've found it to be a very useful principle that when there isn't one obvious right thing to do, you should throw an error and force the user to be more explicit. In the case of NA the user either wants to ignore NA indices or maybe return a default value, either of which can be accomplished by explicitly handling NAs, no? I also think that explicitly dropping or replacing NA values before indexing with them is the clearest thing to understand if you're reading the code.

@nalimilan
Copy link
Member

@StefanKarpinski My suggestion is to allow making the behavior explicit only once for a given vector, most likely after loading the data (in the form of a DataFrame, typically), to make the code cleaner everywhere after that. For working with survey data, I think that's a big plus.

@StefanKarpinski
Copy link
Contributor

It seems like feature creep to me, but I'm very heavily anti-feature. I'd argue for leaving out the feature at first and making people explicitly filter out NAs or replace them and see how that pans out. That may turn out to be fine, in which case awesome! – no feature needed. It may also turn out to be intolerably annoying, in which case you'll be in a better position to decide how to proceed since you'll have ample use cases to consider.

@nalimilan
Copy link
Member

Except that we already have an example of how this works in R and NumPy: I have colleagues who find R terrible because it doesn't ignore NAs when indexing. I know how throwing an error will turn out for me: doable, but verbose and relatively painful. OTOH, it would be interesting to experiment with setting the default NA behavior for each DataVector, because we have no previous knowledge about how this works in practice. If we discover it doesn't work well before reaching a production stage, we can still change it or deprecate it.

I think you should seriously consider the use cases of scientific fields you may not be very familiar with: software used by social scientists, like SAS, SPSS and Stata ignore NAs by default; I guess epidemiologists must have the same requirements. When switching from them, people complain about this. A nice handling of this situation would help making Julia an even more compelling alternative to e.g. R or Python.

@nalimilan
Copy link
Member

I've discussed this issue with Terry Therneau, who participated in the development of S and then R. Interestingly, he remembers that in early days S modeling functions did not automatically skip NAs, they always required you to specify via an argument that you wanted this behavior (else they failed). Splus improved the situation by adding a global na.action option which you only had to set once. R further changed things by setting the default to skip NAs, which nobody seems to complain about.

Even if the issue is not exactly the same (NAs in models vs NAs in indexes and more generally when passed to functions), this historical point shows that for practical uses, being able to ignore NAs is really a requirement. Let's not repeat S's mistakes if we don't want Julia to know the same fate among researchers working primarily on data.

Terry gave me some further thoughts about this:

I you have an option that has to be added to a substantial fraction of the lines in a
program, so as to make that program useful, AND you would like people to use your program,
then something needs to be done about it. Don't make them plant flowers with a backhoe.

I work with medical research data. Missing values are everywhere. People don't answer a
question, a lab test wasn't order or failed (some do you know), a date is only partially
known (month/year), etc. If you want people to ever use Julia, then you need to make the
program useful and useable for people like me.

And regarding what to do with NAs by default in functions like mean and sum:

This is a harder question. When I am typing at the keyboard I clearly want a global
option. The reason is that having mean or quantile return NA is completely useless -- I
immediately have to retype the expression again. Thus its only function is to annoy.

When used inside a function, however, the author of the function should never depend on
something that might have been changed at the top level. That is, they should explictly
put the NA action in. Now the question is whether we force them to be good coders by
slapping their hands, allow their test cases to succeed via a local setting of the global
option (that will later fail someone else) and use social pressure to encourage good
coding practice, or try to find a way to make the option only work in some of the cases.
I'm most in sympathy with b, since I find it infuriating to have a system argue with me
in those cases where I DO know what I'm doing and it isn't smart enough to shut up and
leave me alone. But there is a lot of merit in the other two options.

It seems to me that deciding via a property of the DataFrame whether NAs should be skipped or raise an error solves the dilemma between global options (which suddenly stop working when code is copied to another context) and the painful requirement to repeat skipna everywhere.

@johnmyleswhite
Copy link
Member Author

Honestly, I don't see Julia ever winning over the R audience. So I think adapting things to appeal to R programmers is a losing strategy for Julia, which is quickly making gains among other groups of programmers, many of whom hold views that are diametrically opposed to Terry Therneau's views.

In general, I think there are two archetypal categories of users that you can optimize a programming language for:

  • Users that want to do interactive data analysis in an interpreter. They generally don't write programs; they just type commands into an interpreter to see results in real-time. Convenience is their main concern because programming is only a small part of their work -- and it's a part of their work that no one will encourage them to do more of. For this group, every data analysis is one-off. As such, they just need the specific analysis they are doing right now to produce the right answer for the data they're currently looking at. If you can get them to that right answer faster by setting global preferences, that's helpful.
  • Users that want to build systems that analyze data at scale. They generally don't do any interactive data analysis, except in an initial debugging phase. Correctness is their main concern because their code will be executed anew every day for months or years to come. Because they are generating tools that must work for all data sets that might ever be fed into the system, they want to identify any potential source of problems immediately. Rapid generation of fatal errors is one of the best ways to ensure that a system will work on unforeseen inputs. People who do this kind of engineering see global settings as poisonous, because they make it impossible to perform the kind of local reasoning that's necessary for developing a large codebase that will be shared between dozens, hundreds or thousands of programmers.

I am unequivocally on the side of the latter group. And I'm not very interested in writing systems to help the first group, because there's no reason why they can't keep using R. They don't write enough code to be interested in better programming languages. They just use other people's code. The R community can keep this group of people happy indefinitely, because they can just write good libraries in C and then ship those libraries to the non-experts.

In contrast, the second group of people is easily convinced of the merits of new languages because programming is their main profession. And they want things to raise errors. Strictness is something they value, because it makes their work easier, rather than harder. Raising errors during indexing helps you catch potential bugs quicker.

@ViralBShah
Copy link

I fully agree with @johnmyleswhite 's sentiment, and in general, that has largely been the case with Julia. We maintain superficial similarity to other languages where it makes sense to do so and is the right design decision. Where it is not, we should try and do the right thing.

@simonster
Copy link
Member

As far as avoiding the verbosity of repeating skipna over and over, I'm not convinced it's a good idea to make skipna a property of a DataFrame or DataArray. That approach has some of the same issues as a global flag, since code affected by the flag could be in an entirely different file or module from code that sets the flag. But it seems reasonable to me to have a @skipna macro with which you could wrap whatever code you want, but would apply only to that code, setting skipna=true for functions that accept a skipna argument.

Even then, I'm not sure @skipna should apply to indexing because of the issue I raised in #38: If you drop NA indices, the positions of the values in the output no longer match the position of the indices in the input. Usually skipping NAs will simply stop you from being able to tell that some of the data your analyzing was missing, but in that particular case you may actually get the wrong results unless you have carefully considered what you're doing.

@nalimilan
Copy link
Member

I am unequivocally on the side of the latter group. And I'm not very interested in writing systems to help the first group, because there's no reason why they can't keep using R. They don't write enough code to be interested in better programming languages. They just use other people's code. The R community can keep this group of people happy indefinitely, because they can just write good libraries in C and then ship those libraries to the non-experts.

I'm in both groups depending on whether I analyze data (group 1) or whether I write packages in order to allow people and myself to analyze data (group 2). And I'm not happy with R even for use case 1. I do think correctness is essential even when writing code you'll only run on your own machine on a single data set: R makes it very easy to screw your data and get incorrect results because of the lack of type checking (factors converted to integers, anyone?), its slowness sometimes forces to use convoluted vectorized operations when loops would be simpler, it doesn't work well with large data sets due to its functional semantics and keeping everything in memory...

Also, you almost never write code exclusively at the REPL: anybody who needs to recode data, or who needs to be able to replicate results (science, you know...) will write a script which can be run again later from a clean session, which makes use case 1 closer to use case 2. Very often you'll need to rework your code some time later to add more data, more complex treatments, etc., and it's just like another user was running your code as you don't remember every subtlety of it and the variables may have changed.

So I really think Julia can attract this kind of users, and I can perfectly imagine teaching students Julia instead of R for data analysis, in particular because it's less "dangerous" due to being stricter. And I guess @dmbates and many others would be porting regression models to Julia if they didn't believe users of group 1 may be interested in Julia. There's also the advantage that if people writing packages like Julia better than R+C, because writing reliable code is easier, Julia is going to get a great ecosystem and many expert supporters, and experts are often the ones who decide what other people are going to use (teaching, advice...).

Coming back to the precise issue of indexing, maybe that's not a big problem after all. Ideally, I think most operations could be done on DataFrames with specialized functions/macros, without relying on indexing DataArrays at all. Currently this does not work very well in R anyway (notably because NAs in indexes are propagated).

Regarding other use cases, like calling mean on a DataArray or a DataFrame column, maybe we should in the long-term have two different types with similar behaviors, except for the default of skipna. This would make everybody happy and still make it clear what happens when reading the code. The downside would be quite a lot of duplication. (A @skipna macro wouldn't really fix the problem since it wouldn't work well when running lines one at a time, which typically happens when analyzing data, at least before storing what works in a script.)

@kmsquire
Copy link
Contributor

kmsquire commented Jul 9, 2014

+1 for @nalimilan's comments. I think that it's really good to focus on correctness first, and online data analysis ala R could and should still be relatively easy to do in Julia.

@johnmyleswhite
Copy link
Member Author

I'm not opposed to interactive data analysis. I'm opposed to features that prioritize interactivity at the expense of maintainability. Global settings are one way in which you can over-prioritize interactivity. They make sense in programs like SPSS, because people don't build things in SPSS. But I'm hopeful that they'll build things in Julia.

@johnmyleswhite
Copy link
Member Author

Ugh. Just saw that you said more, @nalimilan. I hate the fact that Github only sends me half of the e-mails it's supposed to send me. Will respond more in a bit.

@matthieugomez
Copy link

To compare with Stata. In Stata, one can index with booleans but one can't index with integers. In this case keep if v1==1 would drop missing values by default. This behavior ends up looking like transforming NA into false. This works quite well in my experience.

I too don't really like the R behavior. I'd venture to say that half of the times a typical user indexes with NA in R, he/she could not predict / did not think of what should happen when the indexing vector has an NA.

I'd personally like of one of these two solutions :

  1. throw an error everytime one indexes with NA
  2. When indexing with integers and NA, throw an error. When indexing with bool and NA, consider NA as false.

The 2. behavior seems more convenient than 1. (i.e. it avoids the use of repeated && isna), but I understand @nalimilan argument that it may be confusing.

An issue I can see with the flag option is that skipna means very different things in different contexts. For instance, should NA be skipped when computing the mean?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants