Rename removeNA and fix its implementation #38

johnmyleswhite · 2013-12-23T05:37:17Z

I've come to really dislike the use of caps in the name for removeNA, which I think is also too long. It's renamed dropna in this PR.

I also dislike the absence of a custom implementation of dropna for PDA's, so I added one.

I decided to make dropna only apply to DV's and PDV's because the operation doesn't make sense for higher-order tensors.

Looking through these changes, I realize that the iterators look really ugly now. Will fix them before this is merged.

simonster · 2013-12-23T17:58:47Z

src/dataarray.jl

-end
-
-removeNA(a::AbstractArray) = a
+dropna(dv::DataVector) = copy(dv.data[!dv.na])


I don't think copy should be necessary here.

johnmyleswhite · 2013-12-23

You're right. This was done for future compatibility with reference-based slices, but we should probably write this code differently when that happens.
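A minimal sketch of why the copy is redundant, assuming current Julia syntax rather than the 2013 syntax in the diff. The `data` and `na` vectors here are hypothetical stand-ins for the two fields of a DataVector, and `.!` is today's spelling of the elementwise negation written as `!dv.na` above.

```julia
# Hypothetical stand-ins for a DataVector's fields: values plus an NA mask.
data = [10, 20, 30, 40]
na   = [false, true, false, true]

# Indexing with a Boolean mask already allocates a fresh array.
kept = data[.!na]

# Mutating the result therefore leaves the original storage untouched,
# which is all the extra copy would have guaranteed.
kept[1] = -1
```

After this runs, `data` is unchanged and `kept` holds only the entries whose mask was false. This would stop being true if indexing returned a view, which is the future case the reply refers to.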

simonster · 2013-12-23T18:21:50Z

This behavior isn't something you changed in this PR, but I'm not entirely sure about the way we presently handle indexing with NAs. I think that indexing with a DataVector of indices should return a DataVector that has NAs in place of NA indices, e.g.:

A = @data ones(Int, 3)
A[@data [1, NA, 2]]

should give [1, NA, 1] and not [1, 1]. This is what R does, and it seems a little less surprising than silently dropping the indices. R also follows this behavior when indexing with Bools, e.g.:

A = @data ones(Int, 3)
A[@data [false, true, NA]]

would give [1, NA] and not [1]. I lean toward this as well, but I'm not as sure.
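The two behaviors for integer indices can be sketched as follows. This is an illustration only, not DataArrays code: `index_dropping` and `index_propagating` are hypothetical helpers, and `missing` from current Julia stands in for NA.

```julia
# Hypothetical helpers; `missing` stands in for NA.

# Behavior described as current: NA indices are silently skipped.
index_dropping(a, idx) = [a[i] for i in idx if !ismissing(i)]

# Proposed behavior: an NA index yields an NA in the same position.
index_propagating(a, idx) = [ismissing(i) ? missing : a[i] for i in idx]

a = ones(Int, 3)
idx = [1, missing, 2]

dropped = index_dropping(a, idx)        # shorter than the index
propagated = index_propagating(a, idx)  # same length as the index
```

The difference that matters for the discussion is the length: only the propagating version returns one element per requested index.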

johnmyleswhite · 2013-12-23T18:28:31Z

Both seem like reasonable changes to make, although I have to admit that both the current and proposed behaviors strike me as kind of odd. What would we do for DataFrames with a Boolean row index containing NA? Insert a row of all NA's?

simonster · 2013-12-23T18:41:16Z

That sounds right to me. Philosophically, I think it's important that, if data contains NAs, people need to take explicit action to deal with them. We should make those actions as simple as possible, but it should be hard for someone to manipulate data that contains NAs and get a result that looks right but is wrong because they didn't realize there were NAs. The only alternative strategy along these lines that I can think of is to throw if there are any NAs in the index array, which seems less convenient.

johnmyleswhite · 2013-12-23T18:45:26Z

I agree abstractly with your philosophy, but returning an NA for an NA index also seems bad to me, because you can't tell whether the index was NA or the indexed value was NA.

Throwing, even though it's kind of in your face, actually seems like a much safer strategy when indexing.

johnmyleswhite · 2013-12-23T18:50:07Z

Throwing an error is really growing on me while I think about it...

simonster · 2013-12-23T19:03:13Z

Throwing an error is fine by me. I wonder whether there are cases where you'd want the behavior I propose above, and if so, how we'd expose it, but we could throw an error for now and wait until there's a use case for something else.

johnmyleswhite · 2013-12-23T19:10:28Z

In the boolean case, I've always felt like treating NA as false was acceptable because the returned values should definitely satisfy the tested predicate. In the numeric indexing case, I'm not sure when you'd want to get NA's back except when you want to make sure that the length of your return value is known in advance.

Throwing just feels so right now that I've thought about it. That way you know that (a) you always get the same number of entities back as you asked for and (b) the entities you get back are really entries in the underlying data, not backfill.

If we come up with a good use case for your proposal, let's find a nice way to expose it.
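A rough sketch of the throwing strategy and the two guarantees listed above. `index_strict` is a hypothetical helper, the choice of `ArgumentError` is an assumption, and `missing` stands in for NA; nothing here is taken from the package.

```julia
# Hypothetical helper; `missing` stands in for NA.
function index_strict(a, idx)
    # Refuse to index at all if any index is NA.
    any(ismissing, idx) && throw(ArgumentError("NA is not allowed inside an index"))
    return [a[i] for i in idx]
end

a = [10, 20, 30]

# (a) One result per requested index, (b) every result is a real entry of `a`.
ok = index_strict(a, [3, 1])

# An index containing NA is rejected instead of being dropped or backfilled.
threw = try
    index_strict(a, [1, missing, 2])
    false
catch err
    err isa ArgumentError
end
```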

simonster · 2013-12-23T20:25:06Z

Yeah, I can't think of an example where the current boolean behavior is really problematic, just the numeric indexing behavior. In the numeric case, if the positions of the values don't correspond to the positions of the indices, I think something like a[find(c[b[a]])] with b numeric with NAs gives a result that is actually incorrect.
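The position mismatch can be illustrated with a smaller case than the expression in the comment. This is not that expression, only a sketch of the same hazard under stated assumptions: `index_dropping` is a hypothetical helper that mimics silently dropping NA indices, `missing` stands in for NA, and `findall` is the current name for what was `find` in 2013.

```julia
# Hypothetical helper mimicking silent dropping of NA indices.
index_dropping(a, idx) = [a[i] for i in idx if !ismissing(i)]

vals = [10, 20, 30]
b = [3, missing, 1]

# Two results come back for three indices.
picked = index_dropping(vals, b)

# Positions of small values, measured within `picked`.
hits = findall(x -> x < 20, picked)

# Treating those positions as positions in `b` lands on the NA slot,
# not on the index that actually produced the hit (the third one).
looked_up = b[hits]
```

Because the result is shorter than the index, any later step that assumes the two line up position by position reads the wrong element without raising an error.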

johnmyleswhite · 2014-01-04T23:43:41Z

This is ready to go now.

I fixed a bug with cor_spearman, made the names for setlevels and setlevels! more consistent with Julia standards, and turned off a test that doesn't make sense until DataFrames is loaded.

Rename removeNA and fix its implementation

simonster reviewed Dec 23, 2013

johnmyleswhite mentioned this pull request Dec 23, 2013

Don't allow NA inside indices #39

Closed

Rename removeNA to dropna and fix its implementation

34ee2a2

johnmyleswhite added a commit that referenced this pull request Jan 5, 2014

Merge pull request #38 from JuliaStats/dropna

b49113f

Rename removeNA and fix its implementation

johnmyleswhite merged commit b49113f into master Jan 5, 2014

johnmyleswhite deleted the dropna branch January 5, 2014 03:17

johnmyleswhite mentioned this pull request Jan 5, 2014

[WIP] Indexing changes #47

Closed
