-
Notifications
You must be signed in to change notification settings - Fork 50
Rename removeNA and fix its implementation #38
Conversation
end | ||
|
||
removeNA(a::AbstractArray) = a | ||
dropna(dv::DataVector) = copy(dv.data[!dv.na]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think copy
should be necessary here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. This was done for future compatibility with reference-based slices, but we should probably write this code differently when that happens.
This behavior isn't something you changed in this PR, but I'm not entirely sure about the way we presently handle indexing with NAs. I think that indexing with a DataVector of indices should return a DataVector that has NAs in place of NA indices, e.g.: A = @data ones(Int, 3)
A[@data [1, NA, 2]] should give A = @data ones(Int, 3)
A[@data [false, true, NA]] would give |
Both seem like reasonable changes to make, although I have to admit that both the current and proposed behaviors strike me as kind of odd. What would we do for DataFrames with a Boolean row index containing NA? Insert a row of all NA's? |
That sounds right to me. Philosophically, I think it's important that, if data contains NAs, people need to take explicit action to deal with them. We should make those actions as simple as possible, but it should be hard for someone to manipulate data that contains NAs and get a result that looks right but is wrong because they didn't realize there were NAs. The only alternative strategy along these lines that I can think of is to throw if there are any NAs in the index array, which seems less convenient. |
I agree abstractly with your philosophy, but returning an NA for an NA index seems also bad to me because you don't know whether the index was NA or whether the indexed value was NA. Throwing, even though it's kind of in your face, actually seems like a much safer strategy when indexing. |
Throwing an error is really growing on me while I think about it... |
Throwing an error is fine by me. I wonder whether there are cases where you'd want the behavior I propose above, and if so, how we'd expose it, but we could throw an error for now and wait until there's a use case for something else. |
In the boolean case, I've always felt like treating NA as false was acceptable because the returned values should definitely satisfy the tested predicate. In the numeric indexing case, I'm not sure when you'd want to get NA's back except when you want to make sure that the length of your return value is known in advance. Throwing just feels so right now that I consider it. That way you know that (a) you always get the same number of entities back as you asked for and (b) the entities you get back are really entries in the underlying data, not backfill. If we come up with a good use case for your proposal, let's find a nice way to expose it. |
Yeah, I can't think of an example where the current boolean behavior is really problematic, just the numeric indexing behavior. In the numeric case, if the positions of the values don't correspond to the positions of the indices, I think something like |
This is ready to go now. I fixed a bug with |
Rename removeNA and fix its implementation
I've come to really dislike the use of caps in the name for
removeNA
, which I think is also too long. It's renameddropna
in this PR.I also dislike the absence of a custom implementation of
dropna
for PDA's, so I added one.I decided to make
dropna
only apply to DV's and PDV's because the operation doesn't make sense for higher-order tensors.Looking through these changes, I realize that the iterators look really ugly now. Will fix them before this is merged.