Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mean function on dimension #2265

Closed
dfagnan opened this issue Feb 11, 2013 · 58 comments
Closed

mean function on dimension #2265

dfagnan opened this issue Feb 11, 2013 · 58 comments
Labels
needs decision A decision on this change is needed
Milestone

Comments

@dfagnan
Copy link

dfagnan commented Feb 11, 2013

I believe in the recent shuffle that mean on dimensions of a matrix does not work.

Eg. mean([2 3 4; 1 2 3],2)

@pao
Copy link
Member

pao commented Feb 11, 2013

#2266 didn't appear to address this.

@andreasnoack
Copy link
Member

#2200 indirectly discussed this. I asked for mean, median and var treating this the same way and I guess the result is that functions along dimensions is delgated to new amap style functions (which I think is fine), but I am not sure. @StefanKarpinski may have something to add here. One problem with the Matlab way is that var(Matrix,Number) is used when the mean is known.

@StefanKarpinski
Copy link
Member

This was intentionally removed. @JeffBezanson may have some suggestions of ways to accomplish this instead.

@JeffBezanson
Copy link
Member

mean(A, dim) can always be added, implemented in terms of whatever higher-order function we prefer. Conflicts with things like var(Matrix,Number) are the real problem. That just requires an API decision.

@blakejohnson
Copy link
Contributor

I think it is best to have mean, var, etc work not just along an axis but on regions (vectors of dimensions), because it is common when working on high-dimensional data sets to need to do reductions along multiple axes. I have an open pull request out on Stats.jl, but this really ought to be in base.

@StefanKarpinski
Copy link
Member

The reason it's not in base is that it was unclear how these calls should look. The std and var functions had known-mean variants that were completely at odds with the dimension argument for mean. Until that is all straightened out with a coherent story, it's not going into base.

@blakejohnson
Copy link
Contributor

Fair enough. I'll take a look at what other languages do for std and var with known-mean to see what the options are.

@StefanKarpinski
Copy link
Member

There's also a point where you want to stop adding dimension arguments to every function and start using higher order functions for that kind of thing. I feel like this may be that point.

@blakejohnson
Copy link
Contributor

Do you feel the same way about min, max, and sum?

@StefanKarpinski
Copy link
Member

Yeah, that's the problem. Those are heavily used in Matlab style code and they are pretty handy to have a dimension argument for. So I don't know.

@kmsquire
Copy link
Member

Keyword arguments (#485) would probably be useful here.

@timholy
Copy link
Member

timholy commented Feb 22, 2013

How about adding a Dim type? For example, B = mean(A, Dim(2)). It would make the meaning of the 2 unambiguous. For std one could support B = std(A, Dim(2)) and B = std(A, Dim(2), m) and B = std(A, m, Dim(2)) so the order of arguments wouldn't matter. This gives keyword-like behavior without requiring them in general.

In a simple implementation Dim would simply wrap a single integer. To include @blakejohnson's multidimensional request, the implementation of Dim might be

type Dim{A<:AbstractVector{Int}}
    index::A
end
Dim(i::Int) = Dim([i])

Note it stores an AbstractArray, but should almost surely not be a subtype of it.

I'm playing with this right now in the context of image processing, adding an img[Seq(14)] syntax to allow one to generically access the 14th image in a sequence no matter whether your images are two- or three-dimensional, grayscale/color, etc. It's too early to tell how much I like it, but if I do I will probably adopt this and finally push out an Image package.

@ViralBShah
Copy link
Member

Using Dim does make the code more readable, but also a bit more verbose to type. That may actually be ok, since we do not yet have keyword args.

@ViralBShah
Copy link
Member

In case of max etc., we could just support both max(A, d) and max(A, Dim(d)).

@timholy
Copy link
Member

timholy commented Feb 22, 2013

What if you want to support max([2,-3,5],0) -> [2,0,5]?

@ViralBShah
Copy link
Member

I would prefer to force the use of Dim() if we adopt it. The behaviour you propose is also perhaps desirable, but it will confuse a lot of matlab users, and more so, it has the potential to break existing programs silently. I don't really have a good solution here.

@timholy
Copy link
Member

timholy commented Feb 22, 2013

I was simply responding to your proposal to support both max(A, d) and max(A, Dim(d)), which from the fact that both used d I interpreted to mean as automatically converting d->Dim(d). I don't think we should do that, and the example I gave was intended to illustrate the pitfall of doing so. I support the idea of forcing Dim.

FYI Matlab behaves exactly as in the example I just gave you:

>> a = [2, -3, 5]

a =

     2    -3     5

>> max(a, 0)

ans =

     2     0     5

Matlab uses the ugly max(a, [], d) to handle the maximum over a dimension.

If we go with the Dim approach, we could also add max!(a, 0) as a sweet syntax for truncating arrays to nonnegative values.

@ViralBShah
Copy link
Member

Oh yes - it has been so long since I used matlab that I actually forgot how I hated that syntax.

@blakejohnson
Copy link
Contributor

The Dim syntax is pretty clear. I do have to ask: this whole problem seems to exist because of these variants with known mean. Do other languages provide similar methods? It seems like var(A - m) is just as good as var(A, m) if I know the mean of A is m.

@blakejohnson
Copy link
Contributor

I take that back, var(A - m) is only just as good if var always assumes that the mean is zero. Which is not a great assumption to bake in.

@andreasnoack
Copy link
Member

I like det Dim type and maybe the idea could be used generally instead of keywords. It is kind of the same thing I have done with the Hermitian and Triangular types in #2308.

@dfagnan
Copy link
Author

dfagnan commented Feb 22, 2013

I'm fairly confident that the known-mean variant is not widely used in
MATLAB and pales in comparison to the regular use of mean along a given
dimension, so it seems a shame to complicate the usage for this reason. To
be honest, I was quite shocked when this simple functionality was broken -
so much so that I stopped updating julia. The forced Dim argument seems
quite reasonable, although I personally feel even this is unnecessary.

On Fri, Feb 22, 2013 at 9:26 AM, Andreas Noack Jensen <
notifications@github.com> wrote:

I like det Dim type and maybe the idea could be used generally instead of
keywords. It is kind of the same thing I have done with the Hermitian and
Triangular types in #2308 #2308
.


Reply to this email directly or view it on GitHubhttps://github.com//issues/2265#issuecomment-13945372.

@timholy
Copy link
Member

timholy commented Feb 22, 2013

Agreed. It's actually a more serious issue for max(A, i::Int). Is that i a dimension index, or is it the comparison value?

blakejohnson added a commit to blakejohnson/julia that referenced this issue Feb 26, 2013
First pass at implemented the Dim type for JuliaLang#2265. For an array A, you can now
call, e.g.: max(A, Dim(1)). The Dim type can be constructed with integers,
arrays, ranges, or tuples, so all of the following are valid:
Dim(1)
Dim(1, 2)
Dim([1])
Dim([1,2])
Dim([1 2])
Dim(1:2)
@JeffBezanson
Copy link
Member

I strongly feel that the sanest thing is to use a higher-order function apply_to_dims(mean, A, 2), with whatever name is deemed good. The syntax mean(A, Dim(2)) is reasonable by itself, but it means you don't generally know what mean(A, x) does (x might be a Dim). I've always been ok with sum(A,2), but the situation it leaves you in for max and min is just awful. There are also tons of vector->scalar and vector->vector functions, and you can't expect all of them to be written to handle Dim arguments.

@stevengj
Copy link
Member

Note that fft(A, dim) and fft(A, (dim1, dim2, ...)) follow the old convention (which used to have a type union called Dimspec but is now duck-typed as anything that acts like an iterable sequence of integers). I have to admit I like the old syntax and find Dim(d) to be an unnecessary complication in most cases.

@JeffBezanson, the fft example also shows that this kind of thing should not necessarily be a high-order function, for performance reasons.

@blakejohnson
Copy link
Contributor

An apply_to_dims method would certainly be useful, but I also feel that, for convenience, certain basic methods ought to have a dimension argument. I find @timholy's last comment persuasive that max(A, b) should not mean the max of A along dimension b, but really something like max(max(A), b).

@JeffBezanson
Copy link
Member

The performance issue is a good point; the higher-order function can't do the fastest thing unless it has lots of special cases inside it anyway. I would also not be fully comfortable telling people to write something much longer than sum(A,2). So it seems there will have to be two approaches in play. In that case I'd vote for using func(A, dim) in as many cases as possible, for conciseness and familiarity. Perhaps the known-mean versions of var and std should just be renamed.
max is the most annoying case. I think max(A, (), d) has to go, and we can use the higher-order function for that.

@timholy
Copy link
Member

timholy commented Feb 27, 2013

If the performance problem could be solved, I'd agree with @JeffBezanson. But right now the performance gap would be too big. Simple example:

julia> A = randn(1000,1000,10);

julia> @time aA = abs(A);
elapsed time: 0.10193109512329102 seconds

julia> @time mA = map(abs, A);
elapsed time: 1.2647428512573242 seconds

I'm not really excited about that kind of price, so I think we need a short-term solution.

BTW, nice work @blakejohnson for putting this together so we can all see what it would look like in practice.

@JeffBezanson
Copy link
Member

I know this discussion has been going on for a while and consensus is emerging, but I really feel like max(a,b) giving the larger of a and b is an absolutely standard mathematical function, as standard as sin. It also appears in that form in every language from C++ to Excel. If you google "max function", every result agrees.

@StefanKarpinski
Copy link
Member

I think Jeff makes a very valid point here. Personally, I'm not sold on the maxof thing.

@blakejohnson
Copy link
Contributor

@JeffBezanson Not every language. Python/NumPy uses maximum for element-wise comparison, and max along a dimension.

@blakejohnson
Copy link
Contributor

So, what's the solution here? The options suggested so far:

  1. Drop support for dimension arguments entirely and rely on some kind of apply_to_dims method. As @stevengj points out, however, there are certain cases like fft where this would lead to a large performance penalty.
  2. Use a Dim type to distinguish the comparison element from the dimension arguments (see WIP: min, max, sum, prod, mean etc on dimensions #2409).
  3. Change the method name to distinguish the binary and unary forms of max (see WIP: max/min -> maxof/minof #2419).
  4. Use a keyword argument... when Julia supports keyword arguments.

@wlbksy
Copy link
Contributor

wlbksy commented Feb 28, 2013

I prefer No.4

@JeffBezanson
Copy link
Member

True, but non-numpy python also has max(a,b) for scalars with the usual meaning. So I'd only insist that max(1,2)==2.
From there I'd prefer but not insist that max vectorize normally and be elementwise if a and/or b are arrays.
@nolta 's branch shows that we could add keyword arguments fairly quickly, even if not with great performance at first.

@ViralBShah
Copy link
Member

Do we know how much performance we would lose with keyword arguments? If it is something like 10-25%, and we have a plan for regaining the lost performance, it would certainly be the right way to go. In general, it would be nice if people's codes do not slow down across releases.

One exercise could be to implement max with keywords on @nolta 's branch and measure the performance difference.

@ViralBShah
Copy link
Member

Of the four suggestions above outlined by @blakejohnson, my preferences are in the order: 4, 3, 2, 1.

@blakejohnson
Copy link
Contributor

I think everyone likes option 4, when it becomes available. Option 2 is the closest we can get right now, and it looks pretty clean in use. It's also no more keystrokes than a keyword argument...

@timholy
Copy link
Member

timholy commented Feb 28, 2013

It also leverages type dispatch better than a keyword argument.

@timholy
Copy link
Member

timholy commented Mar 1, 2013

I should explain the last comment more thoroughly. From a user's perspective, max(A,dim = 2) is at least as nice as max(A,Dim(2)). From a function author's perspective it may not be so nice. To me, the role of keyword arguments is to rationalize optional inputs. So when I consider writing this function with keyword arguments, depending on which implementation of keywords gets adopted I get something like this:

function max(kw::Keywords, A, B = nothing, dim = nothing)
  ...
end

Here, B is the comparision value, as in max(3,5)->5. Now, it seems extremely unlikely that we'd want to simultaneously support comparing A vs B and reducing over a dimension, because it makes the implementation of max quite complicated, with lots of branching for different cases. But you kind of have to, because you've declared that this is a 3-input function.

In contrast, the Dim solution is nice and clean:

function max(A, d::Dim)
  ...
end
function max(A, B)
  ...
end

The point is, the type system is actually quite good for eliminating ambiguity, which is really what we're talking about here: is that integer a comparison value or a dimension index? This is getting even better with the impending arrival of immutable types, for which the whole point (as far as I can tell) is to provide interpretive context for sequences of bits.

Keyword arguments seem to be best for "overriding" behavior that you'd rather not have to specify on a regular basis, but that's really quite different from the conceptual challenge here.

@blakejohnson
Copy link
Contributor

@timholy that's a very good point, and bumps option 2 (Dim type) to my preferred choice.

If we go with a Dim type, where do we use it? Everywhere we want a dimension argument? Or only where we need to distinguish different call signatures?

@JeffBezanson
Copy link
Member

With the implementations we're considering, you could define:

function max(A, dim = nothing)
  ...
end

function max(A)
  ...
end

function max(A,B)
  ...
end

So while not entirely dispatching on keyword args, still possible to dispatch on whether any are present at all.

@toivoh
Copy link
Contributor

toivoh commented Mar 1, 2013

Aha, the first case in favor of dispatch on keyword arguments! :)

@tshort
Copy link
Contributor

tshort commented Mar 1, 2013

I played around with some timings on Jeff's jb/kwargs branch. I couldn't measure any difference between max(x, dim) and max(x, dim = somedim). The problem is that my max along a dimension was slow enough to swamp any overhead from keyword args. With a simpler function involving scalars, the overhead can be sizeable.

@cmey
Copy link
Contributor

cmey commented Mar 5, 2013

So, please, what's the current alternative to doing:

mean([1; 2; 3], 1) == 2

since it disapeared in release-0.1 and broke my code ?

@blakejohnson
Copy link
Contributor

Install the Stats package and then that line will work.

On Tuesday, March 5, 2013 at 9:40 AM, Christophe MEYER wrote:

So, please, what's the current alternative to doing:
mean([1; 2; 3], 1)
since it disapeared in release-0.1 and broke my code ?


Reply to this email directly or view it on GitHub (#2265 (comment)).

@nolta
Copy link
Member

nolta commented Mar 13, 2013

I think we made the wrong call on std and var. Neither R nor Matlab accepts a known mean as an argument. The fact that our std is much more complicated than R's sd should be a huge red flag ;-)

I mean, look at this help entry:

Base.std(v[, corrected])

   Compute the sample standard deviation of a vector "v". If the
   optional argument "corrected" is either left unspecified or is
   explicitly set to the default value of "true", then the algorithm
   will return an estimator of the generative distribution's standard
   deviation under the assumption that each entry of "v" is an IID
   draw from that generative distribution. This computation is
   equivalent to calculating "sqrt(sum((v .- mean(v)).^2) /
   (length(v) - 1))" and involves an implicit correction term
   sometimes called the Bessel correction which insures that the
   estimator of the variance is unbiased. If, instead, the optional
   argument "corrected" is set to "false", then the algorithm will
   produce the equivalent of "sqrt(sum((v .- mean(v)).^2) /
   length(v))", which is the empirical standard deviation of the
   sample.

Base.std(v, m[, corrected])

   Compute the sample standard deviation of a vector "v" with known
   mean "m". If the optional argument "corrected" is either left
   unspecified or is explicitly set to the default value of "true",
   then the algorithm will return an estimator of the generative
   distribution's standard deviation under the assumption that each
   entry of "v" is an IID draw from that generative distribution.
   This computation is equivalent to calculating "sqrt(sum((v .-
   m).^2) / (length(v) - 1))" and involves an implicit correction
   term sometimes called the Bessel correction which insures that the
   estimator of the variance is unbiased. If, instead, the optional
   argument "corrected" is set to "false", then the algorithm will
   produce the equivalent of "sqrt(sum((v .- m).^2) / length(v))",
   which is the empirical standard deviation of the sample.

I vote for std(x[, dim]), var(x[, dim], and mean(x[, dim]).

@JeffBezanson
Copy link
Member

Not my specialty and not my call, but I agree with @nolta . The known-mean version is exotic enough to be renamed. Or we can add a mean=x keyword arg eventually.

@StefanKarpinski
Copy link
Member

Agreed. The proposed known-mean alternatives are stdm(m,v) and varm(m,x).

@johnmyleswhite
Copy link
Member

I still think the names stdm(m,v) and varm(m,x) are the way to go.

As for the help docs (which I wrote), I think their verbosity reflects the continued absence of a viable mechanism for splitting documentation into appropriate components. The current docs are an attempt to provide something like a condensed version of what R documentation would call the "Details" section. For things like variances which have different definitions for different communities, I think it's really important that the exact object being computed is described in detail somewhere in our documentation.

@blakejohnson
Copy link
Contributor

@nolta So, are you also suggesting to drop support for the 'corrected' flag to choose between N and N-1 normalization?

@nolta
Copy link
Member

nolta commented Mar 13, 2013

@blakejohnson Yes, for two reasons: 1) R's sd doesn't support it, and 2) i think something like central_moment(2, x) would be clearer.

@blakejohnson
Copy link
Contributor

I'm on board with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs decision A decision on this change is needed
Projects
None yet
Development

No branches or pull requests