Don't allow NA inside indices #39
Comments
Coming a little late in the conversation, but I don't think throwing an error when the index contains NAs is the best solution. While it may sound safer, it's neither logical nor practical.

Logical: returning NA when the index is NA makes much sense: if the index is NA, this means that you do not know whether the observation satisfies the condition, and thus that the resulting value is also missing. This is just the logical application of the NA propagation principle. Moreover, if people have to call a separate function to strip NAs before every indexing operation, this concretely means that either you get an error, or you remove NAs by hand first.

Finally, regarding practical matters: in the real world, data always contains NAs, which means that you'd have to clutter the code with explicit NA-stripping calls everywhere.

(BTW, one could imagine supporting different NA payloads in the future, like NumPy describes at https://github.com/numpy/numpy/blob/master/doc/neps/missing-data.rst#future-expansion-to-multi-na-payloads. Then, NAs introduced via indexing could be given a special payload, and e.g. a frequency table would be able to tell you directly how many observations are missing because the variables determining their inclusion in the subsample were missing, and how many have the variable of interest itself missing.)
Also something to keep in mind: contrary to most other languages, Julia provides basic, non-NA-aware array types for the cases where you do not expect NAs to be present in the data.
Thanks for pushing back, @nalimilan. I think we need to have more debate, since your points make me hesitate to make this change, but they also don't fully convince me that our current behavior is right. Regarding the logical consequences of the NA propagation principle, consider:

```r
x <- c(1, 2, 3, 4)
inds <- c(3, NA)
x[inds]
```

To me, this case is not properly an application of the NA propagation principle: the second index is not an unknown quantity whose value we failed to observe; it is a request we cannot carry out.

Indexing with booleans seems quite different from indexing with numbers, because it's not a question of what you're asking for, but only whether you're asking for each element.

So I'm leaning toward the principle that it's ok for boolean indexing to propagate NAs, but that indexing with numeric NAs should be an error.
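The disagreement is easier to see with the three candidate semantics side by side. Here is a minimal Python sketch, using `None` as a stand-in for NA; the function names are made up for illustration and are not part of any real API:

```python
def index_propagate(values, inds):
    """R-style: a missing index yields a missing result (NA propagation)."""
    return [None if i is None else values[i] for i in inds]

def index_skip(values, inds):
    """Skip semantics: missing indices are silently dropped."""
    return [values[i] for i in inds if i is not None]

def index_strict(values, inds):
    """Strict semantics: any missing index is an error."""
    if any(i is None for i in inds):
        raise ValueError("NA not allowed in indices")
    return [values[i] for i in inds]

x = [1, 2, 3, 4]
inds = [2, None]  # 0-based analogue of R's c(3, NA)

print(index_propagate(x, inds))  # [3, None]
print(index_skip(x, inds))       # [3]
try:
    index_strict(x, inds)
except ValueError as e:
    print(e)                     # NA not allowed in indices
```

R implements the first behavior for numeric indices; this PR proposes the third.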
Yeah, there are two different types of uncertainty (that's why I raised the possibility of using different payloads), but nothing allows you to treat one of them as an error and the other as legitimate missingness.

I don't think you can make a distinction between integer indexes and booleans, most importantly because it would be confusing, which is IMHO the main issue to avoid with NAs. And secondarily because it does not make sense: if NAs are present in the boolean, this means you do not/cannot know whether you are asking for the value or not. That's really the same problem. (In the real world we often build a subsample considering NAs as false.)

That said, I think we should consider practical issues rather than theoretical considerations. We should review the different cases where we expect this problem to arise, and see what are the code patterns that we would recommend given the behavior we retain.

One of the examples where throwing an error would be a problem is when, in R, you are recoding a variable based on the values of another one that contains NAs.

The major situation where throwing an error is a no-go is when extracting a subset of a data set based on a condition that can be NA.

Maybe it would be useful to get the opinion of people who have considered this problem before. We could ask Patrick Burns, the author of "The R Inferno" [1], about the mistakes not to repeat. NumPy's authors would probably have good ideas too (AFAICT their design document does not detail the choice of what to do with NA indexes).
Agreed that it would be good to get input from more people than the two of us. As we discuss this, I'm increasingly shifting towards a stricter stance on allowing NAs anywhere.

To resolve our debate, I'd like to propose that we impose a simple bright line here that will simplify lots of future design decisions regarding DataArrays: any operation on an AbstractDataArray, except for a small whitelist of arithmetic operations, should fail when encountering NAs.

As I think about it more, the two ways that we've proposed dealing with NAs in indices both seem problematic to me.

In the numeric indexing case, I continue to disagree with the proposal that we should hallucinate illusory values that were never actually requested.

Looking more into R's approach, I think the design decisions taken in S and R in this regard are very poorly thought out. Consider, for example, that `c(1, 2, 3)[c(1, NA)]` silently returns `1 NA` in R, hallucinating a missing value where the user most likely made a mistake.
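The "bright line" proposed above — fail on NA everywhere except for a whitelist of arithmetic operations — could look roughly like this. A Python sketch only: `StrictNAVector` is a hypothetical class invented here, and `None` stands in for NA:

```python
class StrictNAVector:
    """Sketch of the proposed bright line: arithmetic propagates NA,
    every other operation fails fast when it encounters NA."""

    def __init__(self, data):
        self.data = list(data)  # None stands in for NA

    def __add__(self, other):
        # Whitelisted arithmetic: NA propagates elementwise.
        return StrictNAVector(
            None if a is None or b is None else a + b
            for a, b in zip(self.data, other.data)
        )

    def __getitem__(self, inds):
        # Non-whitelisted operation: any NA index is an error.
        if any(i is None for i in inds):
            raise ValueError("cannot index with NA")
        return StrictNAVector(self.data[i] for i in inds)

v = StrictNAVector([1, None, 3])
w = StrictNAVector([10, 20, 30])
print((v + w).data)    # [11, None, 33]
print(v[[0, 2]].data)  # [1, 3]
```

Under this rule, `v[[0, None]]` raises instead of returning a hallucinated missing value.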
I'm not particularly attached to R's behavior. But I think that correctness is just a word in many cases, just like adding confirmation dialogs is just security theater when it comes to protecting computer users from themselves. Such correctness will only protect people who know they will not get NAs in their data. Others will skip NAs using whatever way they can find if they didn't care in the first place.

Let me explain my point of view; maybe I have a different approach to NAs than others because of the field I work in. In the different surveys I've used (with thousands of individuals), I don't think there's a single variable (except maybe the year of survey or things like that) where no NA is present. So I basically need to get over all the "correctness" barriers in the API -- in the end, it's just noise to me; it doesn't add anything except making the code ugly.

When you work with large survey data, the important information is not "is there at least one NA as opposed to none", it's more "what's the share of NAs", and even more profoundly "what's the profile of individuals with NAs and does it seem random". This can only be done by hand, or provided by verbose functions designed for interactive use.

In this approach, the main issue is to make it easy/clean to skip NAs in all cases. Given Julia's distinction between plain arrays and data arrays, one could imagine letting some data arrays skip NAs silently by default. This is how e.g. SAS, SPSS and Stata handle missing values by default.
Ok. I've never worked with SAS, SPSS or Stata. Your proposal seems totally reasonable, but also adds a lot of work. Will write more later. My plane is landing at SFO now.
Work is something Julia people seem to be good at. ;-) Seriously, the hardest part is finding the correct design for the different use cases we can identify.
@nalimilan, reading your comments again, I'm a little confused as to why we would need two separate data structures to handle the ignore / missing semantics. Functions that need to ignore NAs could simply take an option to skip them. We're definitely not going to make silently skipping NAs the default behavior, though.
I was suggesting that there could be one way to choose that, by default, a given data array skips NAs in operations. While in the case of indexing the right behavior is debatable, for reductions skipping is very often what you want. Instead of types, one could also imagine having a special field indicating what should be done with NAs.
Including a field that always skips NA seems ok to me. It adds some troubling non-locality, but we're also thinking of doing that with orderings for PooledDataArray's, so it's hard to fault.
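The "special field" idea amounts to the container carrying its own NA policy, which reductions then consult. A hypothetical Python sketch (the class and field names are invented for illustration; `None` stands in for NA):

```python
class FlaggedVector:
    """Container that records how its own NAs should be handled:
    na_policy='error' fails fast, na_policy='skip' ignores NAs."""

    def __init__(self, data, na_policy="error"):
        self.data = list(data)  # None stands in for NA
        self.na_policy = na_policy

    def sum(self):
        if self.na_policy == "skip":
            return sum(x for x in self.data if x is not None)
        if any(x is None for x in self.data):
            raise ValueError("NA encountered (na_policy='error')")
        return sum(self.data)

survey = FlaggedVector([1, None, 3], na_policy="skip")
print(survey.sum())  # 4
```

The non-locality worry is visible here: `survey.sum()` means something different depending on a flag set far away from the call site.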
Any thoughts on this issue from @HarlanH, @tshort, @simonster, @ViralBShah, @StefanKarpinski, @kmsquire?
Tricky issue. I sorta like the approach of adding a metadata field to DataArrays that indicates how to handle NAs, defaulting to "throw an error." I also like the rule of thumb that says that indexing should not create NAs that weren't there. It's not clear that the preferences for one apply to the other, though, in general. Maybe the metadata field should only apply to reducing operations like sums and means.
For me, the main trouble with indexing is that it's a function call without function-call syntax, so we can't just add a keyword argument.
For now, I'm in favor of throwing an error. That said, it'd be interesting to see someone write an NA-skipping iterator wrapper to see how far that pattern goes.
John, yeah. It's a shame that the iterator-based functional-programming syntax, which I still really like, wasn't faster. Maybe that will change as the compiler improves. This is the sort of thing where mixins/multiple inheritance shines...
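Whatever its performance at the time, the iterator-based approach is cheap to express. A Python sketch with a made-up `dropna` generator (`None` stands in for NA):

```python
def dropna(iterable):
    """Lazily yield only the non-missing values (None stands in for NA)."""
    return (x for x in iterable if x is not None)

data = [1.5, None, 2.5, None, 4.0]
print(sum(dropna(data)))  # 8.0
print(max(dropna(data)))  # 4.0
```

Because the filter is lazy, it composes with any reduction without materializing an intermediate NA-free copy.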
@HarlanH I guess this will be easy to do with array views, and should be implemented. But for people who work all day with NAs everywhere, this still adds a lot of noise for no gain. So experimenting with a way to make skipping NA the default for some DAs would make Julia quite attractive.
@nalimilan, I completely agree. Would the Julian syntax then be
@HarlanH Yeah, something like that. Or maybe
I have a really hard time seeing what indexing with an NA should mean, if not an error.
@StefanKarpinski My suggestion is to allow making the behavior explicit only once for a given vector, most likely after loading the data (in the form of a flag stored on the vector).
It seems like feature creep to me, but I'm very heavily anti-feature. I'd argue for leaving out the feature at first and making people explicitly filter out NAs or replace them and see how that pans out. That may turn out to be fine, in which case awesome! – no feature needed. It may also turn out to be intolerably annoying, in which case you'll be in a better position to decide how to proceed since you'll have ample use cases to consider.
Except that we already have an example of how this works in R and NumPy: I have colleagues who find R terrible because it doesn't ignore NAs when indexing. I know how throwing an error will turn out for me: doable, but verbose and relatively painful. OTOH, it would be interesting to experiment with setting the default NA behavior for each vector.

I think you should seriously consider the use cases of scientific fields you may not be very familiar with: software used by social scientists, like SAS, SPSS and Stata, ignores NAs by default; I guess epidemiologists must have the same requirements. When switching from them, people complain about this. A nice handling of this situation would help make Julia an even more compelling alternative to e.g. R or Python.
I've discussed this issue with Terry Therneau, who participated in the development of S and then R. Interestingly, he remembers that in the early days S modeling functions did not automatically skip NAs; they always required you to specify via an argument that you wanted this behavior (else they failed). Splus improved the situation by adding a global `na.action` option.

Even if the issue is not exactly the same (NAs in models vs NAs in indexes and, more generally, when passed to functions), this historical point shows that for practical uses, being able to ignore NAs is really a requirement. Let's not repeat S's mistakes if we don't want Julia to meet the same fate among researchers working primarily on data.

Terry gave me some further thoughts about this, and about what to do with NAs by default in functions like `mean`.
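The `na.action`-style mechanism described above boils down to reductions taking an explicit missing-value policy. A hedged Python sketch — the `na_mean` name and its `na_action` argument are invented for illustration, with `None` standing in for NA:

```python
def na_mean(values, na_action="fail"):
    """Mean with an explicit NA policy, loosely modelled on S's na.action:
    'fail' raises when any NA is present, 'omit' drops NAs first."""
    if na_action == "omit":
        values = [x for x in values if x is not None]
    elif any(x is None for x in values):
        raise ValueError("NA encountered; pass na_action='omit' to skip")
    if not values:
        raise ValueError("no non-missing values")
    return sum(values) / len(values)

print(na_mean([1.0, None, 3.0], na_action="omit"))  # 2.0
```

The historical debate is exactly about which value of `na_action` should be the default: S originally failed, Splus made omission configurable globally.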
It seems to me that deciding via a property of the vector whether NAs should be skipped is a reasonable compromise.
Honestly, I don't see Julia ever winning over the R audience. So I think adapting things to appeal to R programmers is a losing strategy for Julia, which is quickly making gains among other groups of programmers, many of whom hold views that are diametrically opposed to Terry Therneau's views. In general, I think there are two archetypal categories of users that you can optimize a programming language for:

1. People who mainly analyze data interactively and write relatively little code of their own.
2. People who program for a living and build reusable software.
I am unequivocally on the side of the latter group. And I'm not very interested in writing systems to help the first group, because there's no reason why they can't keep using R. They don't write enough code to be interested in better programming languages. They just use other people's code. The R community can keep this group of people happy indefinitely, because they can just write good libraries in C and then ship those libraries to the non-experts. In contrast, the second group of people is easily convinced of the merits of new languages because programming is their main profession. And they want things to raise errors. Strictness is something they value, because it makes their work easier, rather than harder. Raising errors during indexing helps you catch potential bugs quicker.
I fully agree with @johnmyleswhite 's sentiment, and in general, that has largely been the case with Julia. We maintain superficial similarity to other languages where it makes sense to do so and is the right design decision. Where it is not, we should try and do the right thing.
As far as avoiding the verbosity of repeating an NA-skipping option everywhere, a macro or wrapper could help. Even then, I'm not sure the convenience is worth the added complexity.
I'm in both groups depending on whether I analyze data (group 1) or whether I write packages in order to allow people and myself to analyze data (group 2). And I'm not happy with R even for use case 1. I do think correctness is essential even when writing code you'll only run on your own machine on a single data set: R makes it very easy to screw your data and get incorrect results because of the lack of type checking (factors converted to integers, anyone?), its slowness sometimes forces you to use convoluted vectorized operations when loops would be simpler, and it doesn't work well with large data sets due to its functional semantics and keeping everything in memory...

Also, you almost never write code exclusively at the REPL: anybody who needs to recode data, or who needs to be able to replicate results (science, you know...) will write a script which can be run again later from a clean session, which makes use case 1 closer to use case 2. Very often you'll need to rework your code some time later to add more data, more complex treatments, etc., and then it's as if another user were running your code, since you don't remember every subtlety of it and the variables may have changed.

So I really think Julia can attract these kinds of users, and I can perfectly imagine teaching students Julia instead of R for data analysis, in particular because it's less "dangerous" due to being stricter. And I guess @dmbates and many others wouldn't be porting regression models to Julia if they didn't believe users of group 1 may be interested in Julia. There's also the advantage that if people writing packages like Julia better than R+C, because writing reliable code is easier, Julia is going to get a great ecosystem and many expert supporters, and experts are often the ones who decide what other people are going to use (teaching, advice...).

Coming back to the precise issue of indexing, maybe that's not a big problem after all.
Ideally, I think most operations could be done on plain arrays once NAs have been dealt with explicitly, up front. Regarding other use cases, like calling reductions on data containing NAs, an explicit skip-NA option seems sufficient.
+1 for @nalimilan's comments. I think that it's really good to focus on correctness first, and interactive data analysis à la R could and should still be relatively easy to do in Julia.
I'm not opposed to interactive data analysis. I'm opposed to features that prioritize interactivity at the expense of maintainability. Global settings are one way in which you can over-prioritize interactivity. They make sense in programs like SPSS, because people don't build things in SPSS. But I'm hopeful that they'll build things in Julia.
Ugh. Just saw that you said more, @nalimilan. I hate the fact that Github only sends me half of the e-mails it's supposed to send me. Will respond more in a bit.
To compare with Stata: in Stata, one can index with booleans but one can't index with integers. In this case I too don't really like the R behavior. I'd venture to say that half of the time a typical user indexes with NA in R, he/she could not predict / did not think of what should happen when the indexing vector has an NA. I'd personally like one of these two solutions:

1. Throw an error whenever an indexing vector contains NA, forcing the user to handle missing values explicitly.
2. When indexing with booleans, treat NA as false, so that observations whose condition cannot be evaluated are simply excluded.
The 2. behavior seems more convenient than 1. (i.e. it avoids the use of repeated explicit NA filtering). An issue I can see with the flag option is that the same code would then behave differently depending on a property of the data, which hurts predictability.
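The contrast discussed here — erroring on an NA in a boolean mask versus treating it as false — can be sketched as follows. The `mask_index` helper is hypothetical, with `None` standing in for NA:

```python
def mask_index(values, mask, na_as_false=False):
    """Boolean indexing where a mask entry of None (standing in for NA)
    either raises or is treated as False, depending on na_as_false."""
    if not na_as_false and any(m is None for m in mask):
        raise ValueError("NA in boolean mask")
    # None is falsy, so under na_as_false the NA rows simply drop out.
    return [v for v, m in zip(values, mask) if m]

ages = [25, 40, 33]
keep = [True, None, True]  # the condition could not be evaluated for row 2
print(mask_index(ages, keep, na_as_false=True))  # [25, 33]
```

With the default strict behavior, the same call without `na_as_false=True` raises, matching the proposal in this PR.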
As discussed in #38, any indexing operation that uses an NA index should fail. Right now, we simply drop NA indices, but it's safer if you know that leaving NAs in your indices will always fail.