Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

^ is slow #2741

Closed
jtravs opened this issue Apr 3, 2013 · 52 comments
Closed

^ is slow #2741

jtravs opened this issue Apr 3, 2013 · 52 comments
Labels
performance Must go faster

Comments

@jtravs
Copy link
Contributor

jtravs commented Apr 3, 2013

This is my second day using Julia and I'm getting increasingly excited!

However, I nearly didn't make it even this far, as my early test programs had awful performance, and I nearly gave up with the common statement "beautiful language but the performance sucks". This was after carefully reading the documentation, especially the performance tips section.

Luckily, I happened across this discussion:
https://groups.google.com/d/topic/julia-dev/_UZ2A_Jp8Jc/discussion
and after changing one line of code from using A.^3, where A is a large two dimensional array, to using a nested for loop, I got performance faster than my C++ code (with an almost fair comparison).

I am merely suggesting here that a sentence is added to the performance tips section to note that currently, explicit loops can be much faster than array operations. Such a note would have saved me time, and prevented me from almost abandoning julia.

Thanks for making a truly beautiful (and fast) language,
John

@ViralBShah
Copy link
Member

Good point. This should be noted in the performance tips. This is something that is clearly on the roadmap for improvement - and it is not that the vectorized code performance is bad, but we need better memory allocation and GC.

Also check out the Devectorize package for automatic devectorization until this is fixed.

@JeffBezanson
Copy link
Member

Hi @jtravs, if you could show a specific code example it will help me determine if there is a performance bug that could be fixed.

@jtravs
Copy link
Contributor Author

jtravs commented Apr 3, 2013

Hi Jeff (and others), thanks for the responses. Here is the code I first noticed this issue on:

function f(in::Array{Float64,2}, out::Array{Float64,2})
    out = in.^3
end

function g(in::Array{Float64,2}, out::Array{Float64,2})
    for i=1:size(in,2)
        for j=1:size(in,1)
           out[j,i] = in[j,i]*in[j,i]*in[j,i]
        end
    end   
end

A = rand(8192, 200);
P = similar(A)

@time f(A, P);
@time f(A, P);
@time g(A, P);
@time g(A, P);

Q = similar(A)
f(A,P);
g(A,Q);
println(all(Q == P));

I get:

elapsed time: 0.247349689 seconds
elapsed time: 0.218872701 seconds
elapsed time: 0.021574802 seconds
elapsed time: 0.00601552 seconds
true

@johnmyleswhite
Copy link
Member

One source of your performance trouble is that your f function is spending a bunch of time allocating space, which g doesn't have to do. This happens for two reasons:

  • You create a temporary array in.^3 that you try to assign to out.
  • The assignment to out doesn't actually work for reasons that are a little subtle. You just end up changing the local binding of the out variable rather than editing the contents of the array you pass in. This is one of the things about Julia's syntax that is definitively not intuitive.

@StefanKarpinski
Copy link
Member

The assignment to out doesn't actually work for reasons that are a little subtle. You just end up changing the local binding of the out variable rather than editing the contents of the array you pass in. This is one of the things about Julia's syntax that is definitively not intuitive.

I realize my notion of what is or isn't intuitive in programming languages may not be typical, but I don't see how this could possibly do anything else. But, yes, John is totally right about these points.

@johnmyleswhite
Copy link
Member

The problem is that intuitively accessible and logical consistent are largely distinct properties: Julia might not be able to function in any other way, but the fact that out[:] = foo works and that out = foo does not is quite subtle -- in many ways, this distinction requires understanding the nature of pointers.

@timholy
Copy link
Member

timholy commented Apr 3, 2013

The real problem is that taking a number to a general power is much slower than 3 multiplies.

@JeffBezanson
Copy link
Member

@timholy go to the head of the class! All of the time is in the pow function. If g is modified to compute in[j,i]^3, the times are the same.

It turns out openlibm pow has a special case for 2 , and given the large time difference it looks like it would be worth it to add one for 3 as well.

@JeffBezanson
Copy link
Member

I added cases for 3 and 4 in openlibm. Closing this as unrelated to array expressions.

@ViralBShah
Copy link
Member

@JeffBezanson
Copy link
Member

Maybe we should use system libm on mac? Can you try a quick performance comparison of pow in openlibm vs. mac?

@jtravs
Copy link
Contributor Author

jtravs commented Apr 4, 2013

Being able to choose the libm would be nice; then for example some users could use the optimized intel libM: http://software.intel.com/en-us/articles/implement-the-libm-math-library

@ViralBShah
Copy link
Member

We already have the ability to choose between openlibm and the system libm. Perhaps we need to make the libm choice flexible like we do with the BLAS.

Cc: @staticfloat

@ViralBShah
Copy link
Member

@JeffBezanson See experiments in https://gist.github.com/ViralBShah/5309148

Basically, except for your performance hack for small integer powers, Apple's pow is faster.

@ViralBShah
Copy link
Member

APSL qualifies as a free software license according to FSF:
http://www.gnu.org/philosophy/apsl.html

It may be worth pulling in a few of these functions into openlibm.

@pao
Copy link
Member

pao commented Apr 4, 2013

FSF-free doesn't necessarily mean it's license compatible (for instance, APSL is not GPL compatible). I'm not sure what the rules are for MIT.

@ViralBShah
Copy link
Member

Ah yes, I forgot there are all those fine distinctions. It is such a shame that after all these decades of numerical computation, we do not have one open source high quality high performance libm implementation.

@kmsquire
Copy link
Member

kmsquire commented Apr 4, 2013

The main reason it's not GPL compatible seems to be that you can't combine it with GPL code and simply release the whole thing as GPL, because the APSL requires that you still include license notices, and the GPL wants to be the whole shebang.

Compliance with the license mainly seems to entail

It explicitly allows linking with other libraries, even non-free ones, as long as the license terms are followed. So linking and releasing don't seem to be an issue, as long as you're willing to comply with the license.

Some obvious issues:

  • It's unclear how prominent the notice needs to be.
  • Neither the files pointed to above nor the directory the file is in contain a license, only a copyright notice, which technically gives no rights at all (but some of the other files have license information)
  • The terms of the license may be inconvenient enough that you might not want to include it in openlibm

The first two might be cleared up by contacting Apple and/or the original developers, if anyone cares to (and hey, maybe the developers would be interested in Julia).

@JeffBezanson
Copy link
Member

I'd say GPL-incompatible is not acceptable.

@ViralBShah
Copy link
Member

Yes, incompatibility with GPL is certainly a deal breaker.

@kmsquire
Copy link
Member

kmsquire commented Apr 4, 2013

No problem, of course, but I'm curious why being incompatible with the GPL
is bad for Julia? You don't have any GPL code (yet) in mainline julia, and
there seems to be explicit emphasis on excluding GPL code.

Don't get me wrong--I'm actually a fan of the GPL, though I don't think
it's appropriate for everything, and I have no issues with the license
choices made for Julia. I'm just not following your reasoning in this case.

@JeffBezanson
Copy link
Member

There are many GPL libraries everybody uses with julia, so the base system has to be at least GPL-compatible.

@kmsquire
Copy link
Member

kmsquire commented Apr 4, 2013

Okay, so that would, e.g., allow julia to be included/embedded in a GPL
program. Makes sense, thanks.

@ViralBShah
Copy link
Member

openlibm is also something that could potentially be used more widely, and it would be nice to be GPL compatible.

@jtravs
Copy link
Contributor Author

jtravs commented Apr 9, 2013

Is the discussion in this thread:
https://groups.google.com/d/topic/julia-users/c6TmwkOkEeg/discussion
a related issue? Replacing

factor = h/(norm(dr)^3)

(where dr is a 3 element array) with:

denom = sqrt(dr[1]_dr[1] + dr[2]_dr[2] + dr[3]_dr[3])
factor = h/(denom_denom*denom)

gives a very big speed-up.

Thanks,
J

@timholy timholy reopened this Apr 9, 2013
@timholy
Copy link
Member

timholy commented Apr 9, 2013

It's worth noting that part of the speedup is due to replacing BLAS.norm with the explicit formula, but the majority of the benefit comes from avoiding even the new-and-improved pow function. Changing

denom = sqrt(dr[1]*dr[1] + dr[2]*dr[2] + dr[3]*dr[3])

to

denom = sqrt(dr[1]^2 + dr[2]^2 + dr[3]^2)

leads to a substantial performance hit.

@ViralBShah
Copy link
Member

This issue has a misleading title. Perhaps we should have a new issue.

@simonster
Copy link
Member

This was optimized into three multiplies, but it isn't optimized after f2b3192 (necessary to fix #6506), so ultimately we're waiting on the LLVM bug that prevents us from using the powi intrinsic to get fixed here.

@simonster
Copy link
Member

I suppose another question is if this optimization is safe. I tested (x*x)*(x*x) against x^4 on 100,000 random numbers generated with rand() (code here). It seems that x^4 always gives the correct result to 0.5 ulp (it always matches float64(big(x)^4)) whereas (x*x)*(x*x) gives a less accurate result for half of the inputs. If I have computed it correctly, then the maximum error of (x*x)*(x*x) exceeds 1.8 ulp within this range. For higher powers the result is even less accurate.

@JeffBezanson
Copy link
Member

That is a very good point.

@StefanKarpinski
Copy link
Member

That's interesting because pairwise summation tends to be more accurate than iterative summation. This is a case where you've found pairwise multiplication of the same value to be less accurate than iterative multiplication. Clearly it's a rather different issue because since you're dealing with the same value over and over again, but I wonder if it's really generalizable or not. Is iterative always worse than pairwise? Or are there situations where the pairwise optimization gives better results than iterative?

@andrioni
Copy link
Member

Actually, he compared pairwise against libm pow, not against iterated multiplication. I've just updated the gist with it.

@StefanKarpinski
Copy link
Member

Ah, sorry. I misunderstood that.

@JeffBezanson
Copy link
Member

I added a special case to openlibm for y==4. Perhaps that was ill-advised. However we are using the powi_llvm intrinsic, which calls whatever pow function it can find (to work around an llvm bug), probably the system libm.

@simonster
Copy link
Member

Yes, it seems this must be calling something besides openlibm, which would also explain why ^ is always within 0.5 ulp for me but not quite for @andrioni. If I call openlibm directly with ccall((:pow, Base.Math.libm), Float64, (Float64, Float64), x, 4), then the results are unsurprisingly the same as pairwise multiplication given the special case. For x^5 I can directly compare with openlibm, and it looks like multiplication gives a max error of ~2.6 ulp while openlibm gives ~0.8 ulp and (on my system) ^ gives <0.5 ulp.

@andrioni
Copy link
Member

^ for me had 0.5008007537332742 as the maximum ulp. I'm on OS X 10.7.

@simonster
Copy link
Member

For me ^ is 0.49999994903638395 on Linux. Apparently glibc's pow guarantees that the result is correctly rounded to nearest for all inputs. Also apparently it is (or maybe only used to be?) so slow on pathological inputs that many bug reports were filed.

@timholy
Copy link
Member

timholy commented Jun 24, 2014

It's definitely not calling libm, because otherwise x^2 would be much slower than x*x---last I knew, we can't inline ccalls.

@JeffBezanson
Copy link
Member

No, but LLVM does. It replaces pow(x,2) with x*x. This is mildly sketchy, but in this particular case it's quite nice.

@timholy
Copy link
Member

timholy commented Jun 25, 2014

Ah, didn't realize that. That is nice.

@quinnj
Copy link
Member

quinnj commented Aug 29, 2014

Any other work to be done here? Seems like this has evolved over a few different issues.

@stevengj
Copy link
Member

Still waiting on a fix for the LLVM bug so that we can use powi, I think?

@quinnj
Copy link
Member

quinnj commented Aug 29, 2014

Was that what @Keno referred to as fixed?

@Keno
Copy link
Member

Keno commented Aug 29, 2014

Yes, that is fixed in LLVM 3.6

@stevengj
Copy link
Member

So, once we upgrade to LLVM 3.6, we can re-enable the powi intrinsic and close this issue.

@Keno
Copy link
Member

Keno commented Aug 29, 2014

We, can do it right now with ifdef LLVM36 (will do, hold on).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
None yet
Development

No branches or pull requests