Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up basic functions like mean and std #325

Closed
johnmyleswhite opened this issue Jul 15, 2013 · 14 comments
Closed

Clean up basic functions like mean and std #325

johnmyleswhite opened this issue Jul 15, 2013 · 14 comments
Labels

Comments

@johnmyleswhite
Copy link
Contributor

Our current implementation of mean, std, etc. is kind of a mess. The biggest issue is that the failNA, replaceNA, removeNA trichotomy we have only works well for DataVector. I'm up for keeping these functions around for use with DataVector objects, but they don't make any sense for higher-order tensors.

In addition, our functions don't work properly on higher-order tensors because you can't pass in a set of dimensions for splicing. This means that we have things like colsums(x), whereas the Julian thing to do would be sum(x, 2).

I'm going to go through and start removing some of this functionality through deprecations.

In addition, I'm going to start writing custom NA handling keyword arguments for each function. This seems like more work, but it's the only sane thing to do: a call to svd for a DataMatrix needs to offer an option like impute = false, because the solution to having any NA entries isn't described in terms of failNA, replaceNA or removeNA. Sadly, I think this is the rule, not the exception: NA handling has to be handled in a special way for each function.

@StefanKarpinski
Copy link
Member

Would it be possible to change the API to these things by making it mean(x, na=replace), etc. or something like that? Where replace could be a global constant that is an instance of NAHandler or something like that. I'm taking this design inspiration from the new sorting framework. Using a type to fill in the handling of NAs is useful too since it allows specialization an inlining.

@johnmyleswhite
Copy link
Contributor Author

Yeah, I'd be happy with something like that. So I take it the compiler is able to specialize on keyword arguments with distinct types?

What really bothers me with these functions is that we have to duplicate so much functionality from Base. I've started trying to exploit more of the AbstractArray functionality. Unfortunately, it doesn't really work in many cases, because the basic functions don't have any mechanism for taking special actions on NA. For example, mean works except when you want to ignore NA in your calculation by skipping over it entirely.

@StefanKarpinski
Copy link
Member

So I take it the compiler is able to specialize on keyword arguments with distinct types?

No, it's actually not. That's why the way the sorting API works is by translating keyword calls into a non-keyword call which can get specialized on the argument types. For example, sort!(v, by=length, alg=TimSort) becomes sort!(v, TimSort, Sort.By(length)). The sort!(::AbstractVector, ::Algorithm, ::Ordering) method serves as the "narrow waist" of the sorting "stack" – different implementations can be provided below it while different interfaces can be provided above it, but all calls to sort, regardless of interface or implementation go through a call to that three-argument sort.

@StefanKarpinski
Copy link
Member

I should clarify that sort!(::AbstractVector, ::Algorithm, ::Ordering) is actually not necessarily a single method since, for e.g. floating-point vectors, it is specialized. Rather it's that particular "interface" or subset of the general sort! function that serves as the narrow waist. I considered making that and everything "below" it a separate internal, non-public generic function, but it seemed more obvious to implement new sorting algorithms by adding methods to sort! than some other function.

@StefanKarpinski
Copy link
Member

Unfortunately, it doesn't really work in many cases, because the basic functions don't have any mechanism for taking special actions on NA. For example, mean works except when you want to ignore NA in your calculation by skipping over it entirely.

I'm torn between saying that such duplication is unavoidable, and considering providing some mechanism for plugging in custom behaviors for certain collection/element type combinations. Hard to design such a generic mechanism but probably not impossible.

@johnmyleswhite
Copy link
Contributor Author

For now, I think we should go through and accept the duplication. As we figure things out, we may be able to design a more generic system.

@nfoti
Copy link
Contributor

nfoti commented Aug 14, 2013

Is there any progress on this? I would like to use DataFrames for some upcoming work and I would definitely like to have the functionality to operate along dimensions of DataArrays. If there's anything I can do then I'm glad to help.

@johnmyleswhite
Copy link
Contributor Author

I was holding off while @simonster was working on operators. Anything that gets this moving forward would be welcome in my book. I suspect just nailing the interface for mean and a few other functions will get us most of the way.

@nfoti
Copy link
Contributor

nfoti commented Aug 14, 2013

A straight-forward way to implement some of this is to define operators that skip NA values and then use reducedims to compute the reductions. For example we could define a plusnona that adds two values treating NAs as 0. Similarly there could be a prodnona that multiplies two values treating NA as 1. The same could be constructed for other operators to define what to do when ignoring NAs. Obviously this approach doesn't work directly for functions like mean, but mean could be constructed from a sum that uses these operators.

Alternatively, it may be better to write cache-friendly versions of these functions for DataArray{T,2} (and maybe DataArray{T,3}) and use a reducedims approach for higher order tensors. This will entail more code duplication but it will also be faster.

@simonster
Copy link
Contributor

Sorry for the delay on operators. I'll try to finish things up this weekend.

I don't think it's possible to get reducedims to give good performance, since reducedims doesn't specialize on function arguments for anything but a few functions explicitly defined in Base, and because elements of a DataArray have polymorphic types (either NA or the parametrized type) so specialization doesn't actually help that much (although JuliaLang/julia#3796 may improve things somewhat). I think the version of mapreduce from NumericExtensions might be able to do this efficiently if applied to the data and na fields directly. In any case it's probably worth looking into how NumericExtensions handles reductions, since I know @lindahua put a lot of work into performance.

@nfoti
Copy link
Contributor

nfoti commented Aug 14, 2013

Yes, the reducedims approach won't be very efficient. The same approach can be used with Functors from NumericExtensions which may be faster because of @lindahua's design. That is we could define an AddNoNA functor that performs addition that ignores NAs. However, there may still be a problem with polymorphic types. Writing the functions as the explicit loops accessing the data and na fields is another option.

I don't understand how you were thinking of using mapreduce on the data and na fields directly. Sorry for my ignorance.

@simonster
Copy link
Contributor

Re: mapreduce, my idea was something like (untested):

type NAOrZero <: BinaryFunctor; end
evaluate(::NAOrZero, x, y::Bool) = y ? x : zero(typeof(x))
sum(darray::DataArray, dims) = mapreduce(NAOrZero(), Add(), darray.data, darray.na, dims)

I don't think this quite works because the NumericExtensions version of mapreduce doesn't accept BitArrays (I think this is just a method signature issue) and I don't know if evaluate is presently inlined, but otherwise I think this should generate code that's close to optimal.

@nfoti
Copy link
Contributor

nfoti commented Aug 15, 2013

That makes a lot of sense. When you're done updating operators.jl I can take a shot implementing these operations.

@simonster
Copy link
Contributor

Fixed by JuliaStats/DataArrays.jl#101

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants