Clean up basic functions like mean and std #325

johnmyleswhite · 2013-07-15T21:10:26Z

Our current implementation of mean, std, etc. is kind of a mess. The biggest issue is that the failNA, replaceNA, removeNA trichotomy we have only works well for DataVector. I'm up for keeping these functions around for use with DataVector objects, but they don't make any sense for higher-order tensors.

In addition, our functions don't work properly on higher-order tensors because you can't pass in a set of dimensions for splicing. This means that we have things like colsums(x), whereas the Julian thing to do would be sum(x, 2).

I'm going to go through and start removing some of this functionality through deprecations.

In addition, I'm going to start writing custom NA handling keyword arguments for each function. This seems like more work, but it's the only sane thing to do: a call to svd for a DataMatrix needs to offer an option like impute = false, because the solution to having any NA entries isn't described in terms of failNA, replaceNA or removeNA. Sadly, I think this is the rule, not the exception: NA handling has to be handled in a special way for each function.

The text was updated successfully, but these errors were encountered:

StefanKarpinski · 2013-07-15T22:31:23Z

Would it be possible to change the API to these things by making it mean(x, na=replace), etc. or something like that? Where replace could be a global constant that is an instance of NAHandler or something like that. I'm taking this design inspiration from the new sorting framework. Using a type to fill in the handling of NAs is useful too since it allows specialization an inlining.

johnmyleswhite · 2013-07-15T22:37:27Z

Yeah, I'd be happy with something like that. So I take it the compiler is able to specialize on keyword arguments with distinct types?

What really bothers me with these functions is that we have to duplicate so much functionality from Base. I've started trying to exploit more of the AbstractArray functionality. Unfortunately, it doesn't really work in many cases, because the basic functions don't have any mechanism for taking special actions on NA. For example, mean works except when you want to ignore NA in your calculation by skipping over it entirely.

StefanKarpinski · 2013-07-15T22:52:08Z

So I take it the compiler is able to specialize on keyword arguments with distinct types?

No, it's actually not. That's why the way the sorting API works is by translating keyword calls into a non-keyword call which can get specialized on the argument types. For example, sort!(v, by=length, alg=TimSort) becomes sort!(v, TimSort, Sort.By(length)). The sort!(::AbstractVector, ::Algorithm, ::Ordering) method serves as the "narrow waist" of the sorting "stack" – different implementations can be provided below it while different interfaces can be provided above it, but all calls to sort, regardless of interface or implementation go through a call to that three-argument sort.

StefanKarpinski · 2013-07-15T22:55:55Z

I should clarify that sort!(::AbstractVector, ::Algorithm, ::Ordering) is actually not necessarily a single method since, for e.g. floating-point vectors, it is specialized. Rather it's that particular "interface" or subset of the general sort! function that serves as the narrow waist. I considered making that and everything "below" it a separate internal, non-public generic function, but it seemed more obvious to implement new sorting algorithms by adding methods to sort! than some other function.

StefanKarpinski · 2013-07-15T22:57:36Z

Unfortunately, it doesn't really work in many cases, because the basic functions don't have any mechanism for taking special actions on NA. For example, mean works except when you want to ignore NA in your calculation by skipping over it entirely.

I'm torn between saying that such duplication is unavoidable, and considering providing some mechanism for plugging in custom behaviors for certain collection/element type combinations. Hard to design such a generic mechanism but probably not impossible.

johnmyleswhite · 2013-07-16T16:40:31Z

For now, I think we should go through and accept the duplication. As we figure things out, we may be able to design a more generic system.

nfoti · 2013-08-14T03:22:58Z

Is there any progress on this? I would like to use DataFrames for some upcoming work and I would definitely like to have the functionality to operate along dimensions of DataArrays. If there's anything I can do then I'm glad to help.

johnmyleswhite · 2013-08-14T03:26:51Z

I was holding off while @simonster was working on operators. Anything that gets this moving forward would be welcome in my book. I suspect just nailing the interface for mean and a few other functions will get us most of the way.

nfoti · 2013-08-14T16:51:02Z

A straight-forward way to implement some of this is to define operators that skip NA values and then use reducedims to compute the reductions. For example we could define a plusnona that adds two values treating NAs as 0. Similarly there could be a prodnona that multiplies two values treating NA as 1. The same could be constructed for other operators to define what to do when ignoring NAs. Obviously this approach doesn't work directly for functions like mean, but mean could be constructed from a sum that uses these operators.

Alternatively, it may be better to write cache-friendly versions of these functions for DataArray{T,2} (and maybe DataArray{T,3}) and use a reducedims approach for higher order tensors. This will entail more code duplication but it will also be faster.

simonster · 2013-08-14T20:26:11Z

Sorry for the delay on operators. I'll try to finish things up this weekend.

I don't think it's possible to get reducedims to give good performance, since reducedims doesn't specialize on function arguments for anything but a few functions explicitly defined in Base, and because elements of a DataArray have polymorphic types (either NA or the parametrized type) so specialization doesn't actually help that much (although JuliaLang/julia#3796 may improve things somewhat). I think the version of mapreduce from NumericExtensions might be able to do this efficiently if applied to the data and na fields directly. In any case it's probably worth looking into how NumericExtensions handles reductions, since I know @lindahua put a lot of work into performance.

nfoti · 2013-08-14T21:48:38Z

Yes, the reducedims approach won't be very efficient. The same approach can be used with Functors from NumericExtensions which may be faster because of @lindahua's design. That is we could define an AddNoNA functor that performs addition that ignores NAs. However, there may still be a problem with polymorphic types. Writing the functions as the explicit loops accessing the data and na fields is another option.

I don't understand how you were thinking of using mapreduce on the data and na fields directly. Sorry for my ignorance.

simonster · 2013-08-15T02:01:40Z

Re: mapreduce, my idea was something like (untested):

type NAOrZero <: BinaryFunctor; end
evaluate(::NAOrZero, x, y::Bool) = y ? x : zero(typeof(x))
sum(darray::DataArray, dims) = mapreduce(NAOrZero(), Add(), darray.data, darray.na, dims)

I don't think this quite works because the NumericExtensions version of mapreduce doesn't accept BitArrays (I think this is just a method signature issue) and I don't know if evaluate is presently inlined, but otherwise I think this should generate code that's close to optimal.

nfoti · 2013-08-15T16:39:24Z

That makes a lot of sense. When you're done updating operators.jl I can take a shot implementing these operations.

simonster · 2014-07-17T23:42:35Z

Fixed by JuliaStats/DataArrays.jl#101

nfoti mentioned this issue Aug 14, 2013

Mathematical operations for DataFrames #346

Closed

simonster mentioned this issue Aug 19, 2013

RFC: Clean up operators #351

Merged

nfoti mentioned this issue Aug 22, 2013

WIP: Operations for DataArray that skip NA values #354

Closed

simonster mentioned this issue Oct 29, 2013

Statistics functions that ignore NAs JuliaStats/DataArrays.jl#4

Closed

simonster closed this as completed Jul 17, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up basic functions like mean and std #325

Clean up basic functions like mean and std #325

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 16, 2013

nfoti commented Aug 14, 2013

johnmyleswhite commented Aug 14, 2013

nfoti commented Aug 14, 2013

simonster commented Aug 14, 2013

nfoti commented Aug 14, 2013

simonster commented Aug 15, 2013

nfoti commented Aug 15, 2013

simonster commented Jul 17, 2014

Clean up basic functions like mean and std #325

Clean up basic functions like mean and std #325

Comments

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

StefanKarpinski commented Jul 15, 2013

johnmyleswhite commented Jul 16, 2013

nfoti commented Aug 14, 2013

johnmyleswhite commented Aug 14, 2013

nfoti commented Aug 14, 2013

simonster commented Aug 14, 2013

nfoti commented Aug 14, 2013

simonster commented Aug 15, 2013

nfoti commented Aug 15, 2013

simonster commented Jul 17, 2014