
Hungry for more SIMD vectorization #7687

Closed
timholy opened this issue Jul 21, 2014 · 18 comments

@timholy
Member

timholy commented Jul 21, 2014

The first of these vectorizes (given Matrix{Float32} inputs), but the second doesn't.

function myblur_y!(blur_y, A)
    size(blur_y) == size(A) || error("mismatch")
    interior_x, interior_y = 2:size(A,1)-1, 2:size(A,2)-1
    @inbounds begin
        for y = interior_y, x = 1:size(A,1)
            blur_y[x,y] = A[x,y-1]+A[x,y]+A[x,y+1]
        end
    end
end

function myblur_x!(out, blur_y)
    size(out) == size(blur_y) || error("mismatch")
    interior_x, interior_y = 2:size(out,1)-1, 2:size(out,2)-1
    @inbounds begin
        for y = interior_y, x = interior_x
            out[x,y] = (1.0/9)*(blur_y[x-1,y] + blur_y[x,y] + blur_y[x+1,y])
        end
    end
end

However, the following vectorizes with a Vector{Float32} input:

function testvec2(lead)
    lag = similar(lead, length(lead)+1)
    @inbounds begin
        for i = 1:length(lead)
            lag[i+1] = 2*lead[i]
        end
    end
    lag
end

This makes me wonder whether there might be something relatively simple needed to turn on vectorization for myblur_x!?

Also, it would be great to be able to vectorize for Float64, although I recognize that the incremental benefit will be smaller.

CC @ArchRobison.

@timholy
Member Author

timholy commented Jul 21, 2014

Hah. It turns out the lack of vectorization in myblur_x! was entirely due to the 1.0/9 term. Changing it to one(eltype(out))/9 fixes it.

@ArchRobison, should I leave this open for the Float64 issue, or just close?
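To make the fix concrete, here is a sketch of the corrected kernel (the function name myblur_x_fixed! is mine, for illustration; the body is the original myblur_x! with only the coefficient changed). The point is that the Float64 literal 1.0/9 forces a Float32 → Float64 widening (fpext) inside the loop, while one(eltype(out))/9 keeps the arithmetic in the element type:

```julia
# Sketch: hoist the coefficient and compute it in the element type of `out`,
# so a Matrix{Float32} input stays in Float32 throughout the loop.
function myblur_x_fixed!(out, blur_y)
    size(out) == size(blur_y) || error("mismatch")
    interior_x, interior_y = 2:size(out,1)-1, 2:size(out,2)-1
    c = one(eltype(out))/9   # Float32 when out is a Matrix{Float32}
    @inbounds for y = interior_y, x = interior_x
        out[x,y] = c*(blur_y[x-1,y] + blur_y[x,y] + blur_y[x+1,y])
    end
    out
end
```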

@ArchRobison
Contributor

Let's leave it open. I tried a similar C++ example with the Intel compiler and it vectorized the code, so presumably in principle Julia could vectorize it too. I'll try it with LLVM trunk to see if it does any better.

@ArchRobison
Contributor

The LLVM trunk vectorizer rejects the code because Julia codegen is generating volatile loads/stores in the loop. Here is a trivial example:

julia> TimesOne( a ) =  a*1.0
TimesOne (generic function with 1 method)

julia> code_llvm(TimesOne,(Float32,))

; Function Attrs: sspreq
define double @julia_TimesOne18932(float) #0 {
top:
  store volatile float %0, float* bitcast (i128* @jl_float_temp to float*), align 8, !dbg !8
  %1 = load volatile float* bitcast (i128* @jl_float_temp to float*), align 8, !dbg !8
  %2 = fpext float %1 to double, !dbg !8
  ret double %2, !dbg !8
}

The problem is in src/intrinsics.cpp:

    HANDLE(fpext,2) {
        // when extending a float32 to a float64, we need to force
        // rounding to single precision first. the reason is that it's
        // fine to keep working in extended precision as long as it's
        // understood that everything is implicitly rounded to 23 bits,
        // but if we start looking at more bits we need to actually do the
        // rounding first instead of carrying around incorrect low bits.
        Value *x = auto_unbox(args[2],ctx);
        builder.CreateStore(FP(x), builder.CreateBitCast(prepare_global(jlfloattemp_var),FT(x->getType())->getPointerTo()), true);
        return builder.CreateFPExt(builder.CreateLoad(builder.CreateBitCast(prepare_global(jlfloattemp_var),FT(x->getType())->getPointerTo()), true),
                                   FTnbits(try_to_determine_bitstype_nbits(args[1],ctx)));
    }

This approach for rounding was introduced to fix issue #41. I suspect there is a cleaner way to enforce the rounding in LLVM. I'll go see what Clang does in similar circumstances. For example, C99 has a mode where extra precision is not allowed.
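As an aside, here is a small Julia illustration of the invariant the volatile store/load is meant to enforce (it does not reproduce the x87 register issue itself, which only shows up on 32-bit x86): widening must reproduce the Float32 value bit-for-bit, not "recover" extra precision.

```julia
# The Float32 nearest to 0.1 is not the Float64 nearest to 0.1, so a
# correct fpext yields exactly the value stored in the Float32.
x32 = 0.1f0
x64 = Float64(x32)
@assert x64 != 0.1           # widening does not give the Float64 nearest 0.1
@assert Float32(x64) == x32  # the round-trip preserves the Float32 exactly
```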

@lindahua
Contributor

What about emitting more efficient code when the source is tagged with @simd?

@ArchRobison
Contributor

I'd like to fix the problem in general since it impacts optimization of scalar code too. For example, the volatile sequence foils constant folding.

@JeffBezanson
Member

Issue #41 seemed to only happen on a 32-bit system; indeed I can't reproduce it now on a 64-bit system. Also @Keno pointed out that bitcast behaves as if the value were stored to memory, so we should try using bitcast instead of the volatile store/load:

http://llvm.org/docs/LangRef.html#bitcast-to-instruction

@ArchRobison
Contributor

64-bit systems tend to forsake the x87 unit except in rare circumstances. I'd like to be able to reproduce the original problem before trying the bitcast fix. Is there an easy way to configure Julia on a 64-bit system to build as a 32-bit executable?

@timholy
Member Author

timholy commented Jul 21, 2014

While it would be great to fix these mixed-type operations, I bet in practice many people will use computations that are all Float64 or a mix of Float64 and Int. AFAICT those are still not vectorizing: compare code_llvm on testvec2 with Vector{Float64} vs Vector{Float32}. Unless I'm really misunderstanding something, I'm guessing this must be a separate issue from the rounding behavior.

@ArchRobison
Contributor

The failure of LLVM to vectorize Float64 is a separate issue. I've been playing with LLVM trunk on a machine with AVX, and the problem seems to be in LLVM's cost model. It considers a scalar load of Float64 to have a cost of 1, but a 4-way SIMD load of Float64 to have a cost of 51. That seems way out of proportion.

@staticfloat
Member

@ArchRobison to build julia as a 32-bit executable, I think you'll just need to pass -m32 as a CFLAG to julia and all dependencies. If you'd like to install dependencies as 32-bit so that you don't have to bother with recompiling everything, you can usually request it. Here's how you'd do it on Ubuntu/Debian: sudo apt-get install llvm-3.3:i386.

@rickhg12hs
Contributor

As a related aside ...
@ArchRobison @staticfloat Building everything with -m32 on Travis might also catch some of the occasional 32-bit issues.

@ArchRobison
Contributor

It's passing -m32 to all the dependencies that's the problem. There are lots of dependencies. I tried setting CFLAGS = -m32 in Make.user, but the compilation of deps/libunwind-1.1 stopped from some kind of confusion between 32-bit and 64-bit. I'll poke around to see if I can find a 32-bit environment here.

@staticfloat
Member

You'll probably have to make distclean all your dependencies. You may also just need to export CFLAGS=-m32 in your shell, since variables are not automatically passed down from Make.user to dependencies, and we may have missed a few spots.

@JeffBezanson
Member

Starting a 32-bit VM is probably easier.

@staticfloat
Member

@ArchRobison I have some 32-bit VMs already up and running, if you need them. I can email you ssh usernames and passwords if you want.

@ArchRobison
Contributor

@staticfloat Thanks for the offer. I may take it up if the few 32-bit machines/VMs that I found here don't work out.

@staticfloat
Member

No problem. I've been setting up VMs specifically to build Julia all week, so the pathways have already been set up over here.

Incidentally, as I was trying to find your github account, I discovered that Google knows where your true loyalties lie, and even sorts them appropriately: :P
[screenshot: "Google Knows All"]

ArchRobison pushed a commit to ArchRobison/julia that referenced this issue Jul 22, 2014
@vtjnash
Member

vtjnash commented Jul 23, 2014

If you set ARCH=i686 (on a clean checkout), Julia will configure and build for 32-bit i686, assuming you have all the 32-bit dependencies. This is a very recent addition, and I think it is mentioned in the readme. You can optionally also set MARCH to further refine the target selection.
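Assuming the Make.user mechanism mentioned earlier in the thread, a minimal 32-bit build configuration might look like the following (the MARCH value is just an illustrative guess; check Make.inc for the supported settings):

```make
# Make.user — build a 32-bit Julia on a 64-bit host.
# Requires 32-bit versions of the dependencies to be installed.
ARCH = i686
# Optionally refine the target microarchitecture, e.g.:
# MARCH = pentium4
```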

ArchRobison pushed a commit to ArchRobison/julia that referenced this issue Jul 24, 2014
…g#41

to use volatile store/load on 32-bit x86 only.
JeffBezanson added a commit that referenced this issue Jul 24, 2014
Fix issue #7687, by revising original fix for issue #41