Skip to content
This repository has been archived by the owner on May 4, 2019. It is now read-only.

Rename removeNA and fix its implementation #38

Merged
merged 1 commit into from
Jan 5, 2014
Merged

Conversation

johnmyleswhite
Copy link
Member

I've come to really dislike the use of caps in the name for removeNA, which I think is also too long. It's renamed dropna in this PR.

I also dislike the absence of a custom implementation of dropna for PDA's, so I added one.

I decided to make dropna only apply to DV's and PDV's because the operation doesn't make sense for higher-order tensors.

Looking through these changes, I realize that the iterators look really ugly now. Will fix them before this is merged.

end

removeNA(a::AbstractArray) = a
dropna(dv::DataVector) = copy(dv.data[!dv.na])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think copy should be necessary here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. This was done for future compatibility with reference-based slices, but we should probably write this code differently when that happens.

@simonster
Copy link
Member

This behavior isn't something you changed in this PR, but I'm not entirely sure about the way we presently handle indexing with NAs. I think that indexing with a DataVector of indices should return a DataVector that has NAs in place of NA indices, e.g.:

A = @data ones(Int, 3)
A[@data [1, NA, 2]]

should give [1, NA, 1] and not [1, 1]. This is what R does, and it seems a little less surprising than silently dropping the indices. R also follows this behavior when indexing with Bools, e.g.:

A = @data ones(Int, 3)
A[@data [false, true, NA]]

would give [1, NA] and not [1]. I lean toward this as well, but I'm not as sure.

@johnmyleswhite
Copy link
Member Author

Both seem like reasonable changes to make, although I have to admit that both the current and proposed behaviors strike me as kind of odd. What would we do for DataFrames with a Boolean row index containing NA? Insert a row of all NA's?

@simonster
Copy link
Member

That sounds right to me. Philosophically, I think it's important that, if data contains NAs, people need to take explicit action to deal with them. We should make those actions as simple as possible, but it should be hard for someone to manipulate data that contains NAs and get a result that looks right but is wrong because they didn't realize there were NAs. The only alternative strategy along these lines that I can think of is to throw if there are any NAs in the index array, which seems less convenient.

@johnmyleswhite
Copy link
Member Author

I agree abstractly with your philosophy, but returning an NA for an NA index seems also bad to me because you don't know whether the index was NA or whether the indexed value was NA.

Throwing, even though it's kind of in your face, actually seems like a much safer strategy when indexing.

@johnmyleswhite
Copy link
Member Author

Throwing an error is really growing on me while I think about it...

@simonster
Copy link
Member

Throwing an error is fine by me. I wonder whether there are cases where you'd want the behavior I propose above, and if so, how we'd expose it, but we could throw an error for now and wait until there's a use case for something else.

@johnmyleswhite
Copy link
Member Author

In the boolean case, I've always felt like treating NA as false was acceptable because the returned values should definitely satisfy the tested predicate. In the numeric indexing case, I'm not sure when you'd want to get NA's back except when you want to make sure that the length of your return value is known in advance.

Throwing just feels so right now that I consider it. That way you know that (a) you always get the same number of entities back as you asked for and (b) the entities you get back are really entries in the underlying data, not backfill.

If we come up with a good use case for your proposal, let's find a nice way to expose it.

@simonster
Copy link
Member

Yeah, I can't think of an example where the current boolean behavior is really problematic, just the numeric indexing behavior. In the numeric case, if the positions of the values don't correspond to the positions of the indices, I think something like a[find(c[b[a]])] with b numeric with NAs gives a result that is actually incorrect.

@johnmyleswhite
Copy link
Member Author

This is ready to go now.

I fixed a bug with cor_spearman, made the names for setlevels and setlevels! more consistent with Julia standards and turned off a test that doesn't make sense until DataFrames is loaded.

johnmyleswhite added a commit that referenced this pull request Jan 5, 2014
Rename removeNA and fix its implementation
@johnmyleswhite johnmyleswhite merged commit b49113f into master Jan 5, 2014
@johnmyleswhite johnmyleswhite deleted the dropna branch January 5, 2014 03:17
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants